Hi Thomas,
Thanks for your reply.
Here’s the config for the web server:
```
architecture: x86_64
config:
  cloud-init.user-data: |
    #cloud-config
    package_reboot_if_required: true
    package_update: true
    package_upgrade: true
    packages:
      - iputils-ping
      - nano
      - ufw
      - zfsutils-linux
    users:
      - name: devops
        shell: /bin/bash
        groups: sudo, docker
        sudo: ALL=(ALL) PASSWD:ALL
        ssh_authorized_keys:
          - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDH5ITVOPpD/x6hebcJrw2hwE4SuafO1lQ3yYuOp/cFadFK8VGSGsowoU71+YihsjyzX94dp9CIvIya3ioTsJWxgA2aM7iCSNUDEZVrFKl1jsh6LRO5r6TsJXau7V+nD0cvs99YYPvbO1mIovr5h1hW5aZeV146ZWJDjH0Jx95ZiFAadohumG4E+H2JdDzVutyVrPxm7qSjo699bvsl1ZZGeQOWi4sZhRTb9UfFbbAPAbLe2JRBh9QyRgaMsocC3H5orfxd2Js7/R+VluKkQmSEg3v4UL1XUCZjXuB0yG/OQ/+95A4PYZu/n1lJL8WpKuZ6O0q0Yx38f8tgJXYLe5H9 devops
    write_files:
      - path: /etc/docker/daemon.json
        content: |
          {
            "storage-driver": "zfs"
          }
        defer: true
    chpasswd:
      users:
        - name: devops
          password: devopspassword
          type: text
    runcmd:
      - echo AllowUsers devops >> /etc/ssh/sshd_config
      - echo Protocol 2 >> /etc/ssh/sshd_config
      - echo PermitRootLogin no >> /etc/ssh/ssh_config
      - echo umask 066 >> /etc/profile
      - ufw allow OpenSSH
      - ufw allow in on lxdbr0
      - ufw route allow in on lxdbr0
      - ufw route allow out on lxdbr0
      - ufw logging on
      - ufw allow in from 10.186.40.127 to any port 8080
      - ufw default deny incoming
      - ufw disable
      - snap install docker
      - systemctl restart sshd
      - systemctl restart docker
      # - docker run -d --name weather-api -p 8080:8080 uamoti/weather-api
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20251001)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20251001"
  image.type: disk1.img
  image.version: "24.04"
  limits.cpu: "1"
  limits.memory: 1GiB
  volatile.base_image: 685c736c43c3855c96d3c00c8def22d6c848998c64d8202fcfebd8b7f2b4994b
  volatile.cloud-init.instance-id: e9f5f1f8-1969-4b10-bbd7-b4770c64fb16
  volatile.eth0.host_name: tape02e4f43
  volatile.eth0.hwaddr: 00:16:3e:06:14:2e
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: f7b18fa5-77b2-4c21-8352-f031b9a5e169
  volatile.uuid.generation: f7b18fa5-77b2-4c21-8352-f031b9a5e169
  volatile.vsock_id: "3708137788"
devices:
  cloud-init:
    source: cloud-init:config
    type: disk
  eth0:
    ipv4.address: 10.186.40.69
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- webbserver
- webbserver-0
stateful: false
description: ""
```
And the config for the load balancer:
```
architecture: x86_64
config:
  cloud-init.user-data: |
    #cloud-config
    package_reboot_if_required: true
    package_update: true
    package_upgrade: true
    packages:
      - iputils-ping
      - nano
      - ufw
      - fail2ban
      - zfsutils-linux
    users:
      - name: devops
        shell: /bin/bash
        groups: sudo
        sudo: ALL=(ALL) PASSWD:ALL
        ssh_authorized_keys:
          - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDH5ITVOPpD/x6hebcJrw2hwE4SuafO1lQ3yYuOp/cFadFK8VGSGsowoU71+YihsjyzX94dp9CIvIya3ioTsJWxgA2aM7iCSNUDEZVrFKl1jsh6LRO5r6TsJXau7V+nD0cvs99YYPvbO1mIovr5h1hW5aZeV146ZWJDjH0Jx95ZiFAadohumG4E+H2JdDzVutyVrPxm7qSjo699bvsl1ZZGeQOWi4sZhRTb9UfFbbAPAbLe2JRBh9QyRgaMsocC3H5orfxd2Js7/R+VluKkQmSEg3v4UL1XUCZjXuB0yG/OQ/+95A4PYZu/n1lJL8WpKuZ6O0q0Yx38f8tgJXYLe5H9 devops
    write_files:
      - path: /opt/nginx/conf.d/lastbalans.conf
        content: |
          upstream webb {
            server 10.186.40.69;
            server 10.186.40.73;
          }
          server {
            listen 80;
            location / {
              proxy_pass http://webb;
            }
          }
        owner: devops
        defer: true
      - path: /etc/fail2ban/jail.local
        content: |
          [DEFAULT]
          ignoreip: 127.0.0.1/8 192.168.1.151/24 10.186.40.1/16
          bantime: 30m
          maxretry: 5
          banaction: ufw
          banaction_allports: ufw
        owner: devops
        defer: true
      - path: /etc/docker/daemon.json
        content: |
          {
            "storage-driver": "overlay2"
          }
        defer: true
    chpasswd:
      users:
        - name: devops
          password: devopspassword
          type: text
    runcmd:
      - echo AllowUsers devops >> /etc/ssh/sshd_config
      - echo Protocol 2 >> /etc/ssh/sshd_config
      - echo PermitRootLogin no >> /etc/ssh/ssh_config
      - echo umask 066 >> /etc/profile
      - ufw allow OpenSSH
      - ufw allow in on lxdbr0
      - ufw route allow in on lxdbr0
      - ufw route allow out on lxdbr0
      - ufw allow 'Nginx HTTP'
      - ufw logging on
      - ufw disable
      - snap install docker
      - systemctl restart sshd
      - systemctl restart docker
      # - docker run -d --name load-balancer -p 8081:80 -v /opt/nginx/conf.d/lastbalans.conf:/etc/nginx/conf.d/lastbalans.conf:ro nginx
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20251001)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20251001"
  image.type: disk1.img
  image.version: "24.04"
  limits.cpu: "2"
  limits.memory: 1GiB
  security.nesting: "true"
  security.privileged: "true"
  volatile.base_image: 685c736c43c3855c96d3c00c8def22d6c848998c64d8202fcfebd8b7f2b4994b
  volatile.cloud-init.instance-id: de0e1371-8a85-4020-aada-40151fdbff94
  volatile.eth0.host_name: tap881ea7b5
  volatile.eth0.hwaddr: 00:16:3e:3e:f8:f1
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 2222ec61-b400-4298-b410-2aa8e666dc8f
  volatile.uuid.generation: 2222ec61-b400-4298-b410-2aa8e666dc8f
  volatile.vsock_id: "1415678609"
devices:
  eth0:
    ipv4.address: 10.186.40.127
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- lastbalans
stateful: false
description: ""
```
The instance list (lxc list) while the second VM was starting up:
```
+---------------+---------+------------------------+------+-----------------+-----------+
|     NAME      |  STATE  |          IPV4          | IPV6 |      TYPE       | SNAPSHOTS |
+---------------+---------+------------------------+------+-----------------+-----------+
| load-balancer | RUNNING | 10.186.40.127 (enp5s0) |      | VIRTUAL-MACHINE | 0         |
+---------------+---------+------------------------+------+-----------------+-----------+
| web-server-0  | RUNNING | 172.17.0.1 (docker0)   |      | VIRTUAL-MACHINE | 0         |
|               |         | 10.186.40.69 (enp5s0)  |      |                 |           |
+---------------+---------+------------------------+------+-----------------+-----------+
```
After a while, the second VM goes into the ERROR state, followed by the first one:
```
+---------------+-------+------+------+-----------------+-----------+
|     NAME      | STATE | IPV4 | IPV6 |      TYPE       | SNAPSHOTS |
+---------------+-------+------+------+-----------------+-----------+
| load-balancer | ERROR |      |      | VIRTUAL-MACHINE | 0         |
+---------------+-------+------+------+-----------------+-----------+
| web-server-0  | ERROR |      |      | VIRTUAL-MACHINE | 0         |
+---------------+-------+------+------+-----------------+-----------+
```
I’ve been debugging with Gemini, to no avail, and have tried a few different things:
- Specifying the storage driver in /etc/docker/daemon.json as overlay2 or btrfs
- Setting "bridge": "none", "iptables": false, "ip-forward": false, and "ip-masq": false in /etc/docker/daemon.json, along with "default-address-pools": [{"base": "192.168.20.0/24", "size": 24}] in the same file (see the sketch after this list)
- Installing Docker as a snap instead of through cloud-init’s packages
- Using the standard Ubuntu image instead of the minimal one
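For reference, this is roughly the daemon.json variant from the second bullet, as far as I remember it (the address-pool base was just an arbitrary private range):

```
# /etc/docker/daemon.json variant I tried: no default bridge, no iptables
# rules, and a custom address pool instead of Docker's 172.17.0.0/16
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "bridge": "none",
  "iptables": false,
  "ip-forward": false,
  "ip-masq": false,
  "default-address-pools": [
    { "base": "192.168.20.0/24", "size": 24 }
  ]
}
EOF
sudo systemctl restart docker
```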
I did see Stéphane’s video mentioning the potential issues with ZFS and recommending btrfs instead. The video is some four years old, though, and I assume things have improved since then. It also seems more specific to LXD containers than to VMs, and the current Docker docs list ZFS as a supported storage driver. In any case, I had security.nesting: true and security.privileged: true in my config at some point, to no avail.
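In case it matters, this is how I've been checking which storage driver Docker actually picked up inside a guest:

```
# Inside the VM: show the storage driver the daemon settled on
docker info --format '{{.Driver}}'
# ...and confirm daemon.json made it onto disk as written
cat /etc/docker/daemon.json
```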
I found by accident that the issue seems to revolve around having two VMs running Docker. As part of debugging, I removed the Docker installation from one VM and succeeded in having both VMs running at the same time. I then SSH’ed into the one without Docker and ran apt install docker.io; I lost the connection, with the terminal showing Unpacking docker.io ...
All of this happens without even running a container: the first VM has Docker installed but runs nothing, and simply installing Docker on the second one causes the crash.
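I've been trying to catch the moment it dies from the host with something along these lines (instance names as above):

```
# Stream LXD events in one terminal while reproducing the crash...
lxc monitor --pretty
# ...then, once an instance hits ERROR, pull its log
lxc info web-server-0 --show-log
```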
I’m leaning more towards a network problem than a storage one, given that simply installing Docker, and therefore bringing up a bridge with an IP address, is enough to trigger the problem. I assume a storage issue would only become relevant once images or containers actually use the file system.
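To check for the overlap, I compared the Docker bridge on both guests while they were still up:

```
# From the host, while both VMs are still RUNNING
lxc exec web-server-0 -- ip -4 addr show docker0
lxc exec load-balancer -- ip -4 addr show docker0
```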
Update: I’ve just managed to get both VMs running by using a btrfs pool for the second VM.
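Roughly what I did, in case anyone wants to reproduce it (the pool name and image alias are just what I happened to use):

```
# Create a btrfs-backed storage pool on the host...
lxc storage create docker-btrfs btrfs
# ...and put the second VM's root disk on it
lxc launch ubuntu:24.04 web-server-0 --vm -s docker-btrfs
```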
In both VMs, docker0 has the IP 172.17.0.1, which could be another indication of a network conflict.
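If that is the cause, I suppose moving one daemon off the default bridge subnet with bip would be a way to test it; something like this on one of the VMs (the 172.18.0.0/24 range is just an example, and any existing keys such as storage-driver would need to be merged in):

```
# Give this VM's docker0 a subnet that doesn't collide with the other VM
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "bip": "172.18.0.1/24"
}
EOF
sudo systemctl restart docker
ip -4 addr show docker0   # should now show 172.18.0.1
```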