VM in ERROR state

I’m getting an LXD VM in ERROR state, but the logs from lxc info --show-log are empty and lxd.log doesn’t have anything unusual either – and the VM itself won’t start up.

This is a throwaway VM so I don’t need to salvage it, but I’d be curious to know what went wrong. Any tips on troubleshooting or gathering more diagnostics?

For context, I’m deploying a large-ish Juju bundle inside that VM using containers, so it could be a resource shortage. The VM is running noble, as are the containers inside it. LXD version 5.21.2-2f4ba6b.

The main idea would be to check the kernel logs (dmesg) after trying to start the instance. There could be something helpful from QEMU there.
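For example, right after a failed start (vm1 is a placeholder instance name):

lxc start vm1
# show the most recent kernel messages with human-readable timestamps
sudo dmesg --ctime | tail -n 50
# or keep watching for new messages while starting the VM from another terminal
sudo dmesg --follow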

Run lxc monitor --pretty while trying to start the instance to check for LXD events; it can be more informative than lxd.log. Trying to start the instance with lxc start <vm-name> --debug doesn’t hurt either, although I don’t think it contains any info that isn’t already in lxc monitor.
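Something like this, in two terminals (vm1 is again a placeholder):

# terminal 1: stream LXD events as they happen
lxc monitor --pretty

# terminal 2: attempt the start with client-side debug output
lxc start vm1 --debug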

If it is a problem with resource limits, try resizing the VM’s root disk with lxc config device set <vm-name> root size=100GiB or, if that fails because the device is inherited from a profile, lxc config device override <vm-name> root size=100GiB. Make sure to adjust the size of your storage pool if needed; whether the pool is full can be checked with lxc storage info <pool_name>.
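A quick sketch of that flow (the pool name default and instance name vm1 are assumptions):

# check how full the pool is and what lives in it
lxc storage info default
lxc storage volume list default

# grow the VM's root disk; override is the variant needed when the device comes from a profile
lxc config device override vm1 root size=100GiB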

If none of that helps, you could try making a backup of the instance with lxc export vm-name ./out.tar.xz --optimized-storage and then importing it with lxc import ./out.tar.xz test-vm to check for disk integrity. Even if this doesn’t recover the instance, it could at least fail in a more informative way. If the export succeeds and the import doesn’t, importing it into another storage pool or even another LXD host could also be worth trying (if doing so, use the same storage driver as the original instance).
Lastly, what exactly happens when attempting to start the instance? If the process just hangs, stracing the QEMU process (if it is spawned) could be an option as well.
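For instance (the grep works because LXD puts the instance name on the QEMU command line):

# find the QEMU process backing the VM
pgrep -af qemu | grep vm1

# attach to it and watch which syscalls it is stuck on
sudo strace -f -p <pid>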

Hi,

I’m in a similar situation, trying to run containers inside VMs.
I have to create two VMs for web servers and one for a load balancer, plus another one for a backend server, but I haven’t gotten that far yet. These services should run as containers inside the VMs.

Initially I had a strange bug where cloud-init.user-data wasn’t being detected/parsed in a specific LXD profile. But the main underlying bug has been that I can’t get two VMs to run at the same time.

Let’s say I launch the load balancer VM; after a while I see two IP addresses (enp5s0 and docker0), so I’m assuming it started fine and is running the Docker container.
I then start another VM. It also gets its enp5s0 IP address but then fails after some time. Both VMs go into the ERROR state and nothing is logged to lxc monitor --pretty.

I assume it’s related to Docker networking but I’m having a hard time finding where to start. Any tip is highly appreciated.

Please can you show the output of lxc list and lxc config show --expanded <instance> for each of the problem instances?

Hi Thomas,

Thanks for your reply.

Here’s the config for the web server:

architecture: x86_64
config:
  cloud-init.user-data: |
    #cloud-config
    package_reboot_if_required: true
    package_update: true
    package_upgrade: true
    packages:
      - iputils-ping
      - nano
      - ufw
      - zfsutils-linux
    users:
      - name: devops
        shell: /bin/bash
        groups: sudo, docker
        sudo: ALL=(ALL) PASSWD:ALL
        ssh_authorized_keys:
          - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDH5ITVOPpD/x6hebcJrw2hwE4SuafO1lQ3yYuOp/cFadFK8VGSGsowoU71+YihsjyzX94dp9CIvIya3ioTsJWxgA2aM7iCSNUDEZVrFKl1jsh6LRO5r6TsJXau7V+nD0cvs99YYPvbO1mIovr5h1hW5aZeV146ZWJDjH0Jx95ZiFAadohumG4E+H2JdDzVutyVrPxm7qSjo699bvsl1ZZGeQOWi4sZhRTb9UfFbbAPAbLe2JRBh9QyRgaMsocC3H5orfxd2Js7/R+VluKkQmSEg3v4UL1XUCZjXuB0yG/OQ/+95A4PYZu/n1lJL8WpKuZ6O0q0Yx38f8tgJXYLe5H9 devops
    write_files:
      - path: /etc/docker/daemon.json
        content: |
          {
            "storage-driver": "zfs",
          }
        defer: true
    chpasswd:
      users:
        - name: devops
          password: devopspassword
          type: text
    runcmd:
      - echo AllowUsers devops >> /etc/ssh/sshd_config
      - echo Protocol 2 >> /etc/ssh/sshd_config
      - echo PermitRootLogin no >> /etc/ssh/ssh_config
      - echo umask 066 >> /etc/profile
      - ufw allow OpenSSH
      - ufw allow in on lxdbr0
      - ufw route allow in on lxdbr0
      - ufw route allow out on lxdbr0
      - ufw logging on
      - ufw allow in from 10.186.40.127 to any port 8080
      - ufw default deny incoming
      - ufw disable
      - snap install docker
      - systemctl restart sshd
      - systemctl restart docker
      # - docker run -d --name weather-api -p 8080:8080 uamoti/weather-api
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20251001)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20251001"
  image.type: disk1.img
  image.version: "24.04"
  limits.cpu: "1"
  limits.memory: 1GiB
  volatile.base_image: 685c736c43c3855c96d3c00c8def22d6c848998c64d8202fcfebd8b7f2b4994b
  volatile.cloud-init.instance-id: e9f5f1f8-1969-4b10-bbd7-b4770c64fb16
  volatile.eth0.host_name: tape02e4f43
  volatile.eth0.hwaddr: 00:16:3e:06:14:2e
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: f7b18fa5-77b2-4c21-8352-f031b9a5e169
  volatile.uuid.generation: f7b18fa5-77b2-4c21-8352-f031b9a5e169
  volatile.vsock_id: "3708137788"
devices:
  cloud-init:
    source: cloud-init:config
    type: disk
  eth0:
    ipv4.address: 10.186.40.69
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- webbserver
- webbserver-0
stateful: false
description: ""

And the load balancer config:

architecture: x86_64
config:
  cloud-init.user-data: |
    #cloud-config
    package_reboot_if_required: true
    package_update: true
    package_upgrade: true
    packages:
      - iputils-ping
      - nano
      - ufw
      - fail2ban
      - zfsutils-linux
    users:
      - name: devops
        shell: /bin/bash
        groups: sudo
        sudo: ALL=(ALL) PASSWD:ALL
        ssh_authorized_keys:
          - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDH5ITVOPpD/x6hebcJrw2hwE4SuafO1lQ3yYuOp/cFadFK8VGSGsowoU71+YihsjyzX94dp9CIvIya3ioTsJWxgA2aM7iCSNUDEZVrFKl1jsh6LRO5r6TsJXau7V+nD0cvs99YYPvbO1mIovr5h1hW5aZeV146ZWJDjH0Jx95ZiFAadohumG4E+H2JdDzVutyVrPxm7qSjo699bvsl1ZZGeQOWi4sZhRTb9UfFbbAPAbLe2JRBh9QyRgaMsocC3H5orfxd2Js7/R+VluKkQmSEg3v4UL1XUCZjXuB0yG/OQ/+95A4PYZu/n1lJL8WpKuZ6O0q0Yx38f8tgJXYLe5H9 devops
    write_files:
      - path: /opt/nginx/conf.d/lastbalans.conf
        content: |
          upstream webb {
            server 10.186.40.69;
            server 10.186.40.73;
          }
          server {
            listen 80;
            location / {
              proxy_pass http://webb;
            }
          }
        owner: devops
        defer: true
      - path: /etc/fail2ban/jail.local
        content: |
          [DEFAULT]
          ignoreip: 127.0.0.1/8 192.168.1.151/24 10.186.40.1/16
          bantime: 30m
          maxretry: 5
          banaction: ufw
          banaction_allports: ufw
        owner: devops
        defer: true
      - path: /etc/docker/daemon.json
        content: |
          {
            "storage-driver": "overlay2",
          }
        defer: true
    chpasswd:
      users:
        - name: devops
          password: devopspassword
          type: text
    runcmd:
      - echo AllowUsers devops >> /etc/ssh/sshd_config
      - echo Protocol 2 >> /etc/ssh/sshd_config
      - echo PermitRootLogin no >> /etc/ssh/ssh_config
      - echo umask 066 >> /etc/profile
      - ufw allow OpenSSH
      - ufw allow in on lxdbr0
      - ufw route allow in on lxdbr0
      - ufw route allow out on lxdbr0
      - ufw allow 'Nginx HTTP'
      - ufw logging on
      - ufw disable
      - snap install docker
      - systemctl restart sshd
      - systemctl restart docker
      # - docker run -d --name load-balancer -p 8081:80 -v /opt/nginx/conf.d/lastbalans.conf:/etc/nginx/conf.d/lastbalans.conf:ro nginx
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20251001)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20251001"
  image.type: disk1.img
  image.version: "24.04"
  limits.cpu: "2"
  limits.memory: 1GiB
  security.nesting: "true"
  security.privileged: "true"
  volatile.base_image: 685c736c43c3855c96d3c00c8def22d6c848998c64d8202fcfebd8b7f2b4994b
  volatile.cloud-init.instance-id: de0e1371-8a85-4020-aada-40151fdbff94
  volatile.eth0.host_name: tap881ea7b5
  volatile.eth0.hwaddr: 00:16:3e:3e:f8:f1
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 2222ec61-b400-4298-b410-2aa8e666dc8f
  volatile.uuid.generation: 2222ec61-b400-4298-b410-2aa8e666dc8f
  volatile.vsock_id: "1415678609"
devices:
  eth0:
    ipv4.address: 10.186.40.127
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- lastbalans
stateful: false
description: ""

lxc list during the second VM’s start-up:

+---------------+---------+------------------------+------+-----------------+-----------+
|     NAME      |  STATE  |          IPV4          | IPV6 |      TYPE       | SNAPSHOTS |
+---------------+---------+------------------------+------+-----------------+-----------+
| load-balancer | RUNNING | 10.186.40.127 (enp5s0) |      | VIRTUAL-MACHINE | 0         |
+---------------+---------+------------------------+------+-----------------+-----------+
| web-server-0  | RUNNING | 172.17.0.1 (docker0)   |      | VIRTUAL-MACHINE | 0         |
|               |         | 10.186.40.69 (enp5s0)  |      |                 |           |
+---------------+---------+------------------------+------+-----------------+-----------+

After a while the second VM goes into error state, followed by the first one.

+---------------+-------+------+------+-----------------+-----------+
|     NAME      | STATE | IPV4 | IPV6 |      TYPE       | SNAPSHOTS |
+---------------+-------+------+------+-----------------+-----------+
| load-balancer | ERROR |      |      | VIRTUAL-MACHINE | 0         |
+---------------+-------+------+------+-----------------+-----------+
| web-server-0  | ERROR |      |      | VIRTUAL-MACHINE | 0         |
+---------------+-------+------+------+-----------------+-----------+

I’ve been debugging with Gemini to no avail.
I’ve tried a few different things:

  • Specifying the storage driver in /etc/docker/daemon.json as overlay2 or btrfs
  • Disabling Docker’s own networking in /etc/docker/daemon.json with "bridge": "none", "iptables": false, "ip-forward": false, "ip-masq": false, and in the same file setting "default-address-pools": [{"base": "192.168.20.0/24", "size": 24}] (a consolidated sketch follows this list)
  • Installing Docker as a snap instead of using cloud-init’s packages
  • Using the standard Ubuntu image instead of the minimal one
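For reference, here’s roughly the consolidated daemon.json from those attempts (a sketch; Docker installed as a snap reads its daemon.json from under /var/snap/docker/ instead, so the path and restart command would differ there):

# write the Docker daemon config, then restart the daemon to apply it
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "bridge": "none",
  "iptables": false,
  "ip-forward": false,
  "ip-masq": false,
  "default-address-pools": [
    { "base": "192.168.20.0/24", "size": 24 }
  ]
}
EOF
sudo systemctl restart docker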

I did see Stéphane’s video mentioning potential issues with ZFS and suggesting btrfs. The video is some 4 years old though, and I assume things have improved since then. Furthermore, it seems more specific to LXD containers, not VMs, and the current Docker docs list ZFS as supported. In any case, I had security.nesting: true and security.privileged: true in my config at some point, to no avail.

I found by accident that the issue seems to revolve around having two VMs running Docker. As part of debugging, I removed the Docker installation from one VM and succeeded in having two VMs running at the same time. I then SSH’ed into the one without Docker and ran apt install docker.io; I lost the connection, with the terminal showing Unpacking docker.io ...
And all this happens without even running a container: the first VM has Docker installed without running anything, and simply installing it on the second one causes the crash.

I’m leaning more towards some network problem (rather than storage), given that simply installing Docker - and therefore getting an IP address - is enough to trigger the problem. I’m assuming a storage issue would be more relevant if we had images or containers using the file system.

Update: I’ve just managed to get both VMs running by using a btrfs pool for the second VM.
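Roughly what that looked like (the pool size is arbitrary and the instance name is a placeholder; I also applied my usual profiles, omitted here):

# create a second, btrfs-backed storage pool
lxc storage create docker btrfs size=30GiB
# place the second VM's root disk on it
lxc init ubuntu:24.04 <second-vm> --vm --storage docker
lxc start <second-vm>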

In both VMs, docker0 has the IP 172.17.0.1, which could be another indication of a network conflict.

So the general way to diagnose technical faults is to find a place where things work and then step forward, changing one thing at a time, until it stops working.

So can I suggest that you start by booting the VMs with Docker disabled inside them.
If they both appear stable at that point, you can try enabling Docker: if things then crash, you have a data point suggesting that Docker starting is causing the issue, and if the VMs aren’t stable even with Docker disabled, you have a data point that it’s not Docker related. Both are useful.
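Disabling Docker inside each VM depends on how it was installed; something along these lines:

# if installed as a snap
sudo snap stop --disable docker

# if installed from the archive (docker.io / docker-ce)
sudo systemctl disable --now docker.service docker.socket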

I also suggest you try bumping the VMs’ memory, as it’s currently set to limits.memory: 1GiB and that seems rather low if you’re planning on running containers inside them.
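For example (4GiB is an arbitrary figure; stopping first is the safe way to make sure the new limit fully applies to a VM):

lxc stop <vm-name>
lxc config set <vm-name> limits.memory=4GiB
lxc start <vm-name>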

There are a few more things that look odd to me:

  1. Your lxc list shows both instances as VMs, however one of your instances is using container-specific config keys, such as:
  security.nesting: "true"
  security.privileged: "true"

What does lxc config show <instance> (without --expanded) show?

  2. Your cloud-init seems to be referencing ufw rules for lxdbr0 even though there shouldn’t be an lxdbr0 interface inside your VMs (Ubuntu 24.04 doesn’t come with LXD out of the box).
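A quick way to check which interfaces actually exist inside a guest:

# run inside the VM; lxdbr0 would only appear if LXD itself were installed there
ip -brief link show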

This has actually been my approach.
I’m working on the second part of a project. In the first part, I had to create 4 VMs: one for the application server, two for web servers and another as a load balancer. The UFW rules come from that part, as it was (and still is) required to have a firewall enabled. The mentions of lxdbr0 come from the official documentation (and Gemini), but perhaps I got confused and carried over to the VMs what should be done on my machine.
So the setup was working: I could SSH into the VMs, they could ping each other, cloud-init executed nicely and the load balancer was working.

In this second part, the services should run in containers instead of on the VMs themselves, so the major change to the structure has been Docker. I think another hint is that I cannot have two VMs with Docker enabled; even trying to install Docker on the second VM crashes the system. And it was interesting to see it working when I use different storage pools.
But I haven’t tried installing and then disabling Docker. Thanks for the tip, I’ll see if that goes through.

I had thought about memory as well, but I don’t think it’s an issue at this point. I’m not really running anything in the VMs, but I’ll keep it in mind for the final stage.
Thanks, I had forgotten to remove the container-specific options; that’s done in the info below.

Load balancer configuration:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20251001)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20251001"
  image.type: disk1.img
  image.version: "24.04"
  volatile.base_image: 685c736c43c3855c96d3c00c8def22d6c848998c64d8202fcfebd8b7f2b4994b
  volatile.cloud-init.instance-id: 41217243-2f9d-4d2b-aea2-869e56fd8628
  volatile.eth0.host_name: tap5f858ae6
  volatile.eth0.hwaddr: 00:16:3e:31:36:59
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 41e7b4ee-0809-4aaa-9792-400d621a853a
  volatile.uuid.generation: 41e7b4ee-0809-4aaa-9792-400d621a853a
  volatile.vsock_id: "2653257896"
devices:
  root:
    path: /
    pool: docker
    type: disk
ephemeral: false
profiles:
- lastbalans
stateful: false
description: ""

Web server configuration:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (minimal release) (20251001)
  image.label: minimal release
  image.os: ubuntu
  image.release: noble
  image.serial: "20251001"
  image.type: disk1.img
  image.version: "24.04"
  volatile.base_image: 685c736c43c3855c96d3c00c8def22d6c848998c64d8202fcfebd8b7f2b4994b
  volatile.cloud-init.instance-id: 527ed8dd-f391-4038-961c-981d7af968ad
  volatile.eth0.host_name: tapa0c9939e
  volatile.eth0.hwaddr: 00:16:3e:e2:0a:71
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 54ec9b4b-5948-4fed-ab14-5a250e82f26e
  volatile.uuid.generation: 54ec9b4b-5948-4fed-ab14-5a250e82f26e
  volatile.vsock_id: "4127070636"
devices: {}
ephemeral: false
profiles:
- webbserver
- webbserver-0
stateful: false
description: ""

I would try this first: if your VMs are becoming unresponsive when running more applications in them (Docker), then this could easily be a memory issue.


You might have nailed it :wink:

Initially, the error was still happening. I then initialised the VMs with Docker stopped and could get multiple VMs running. I started Docker and they kept running :pray:
When inspecting one of the instances, I noticed a memory usage of ~1.2 GB, so the previous 1 GB limit might well have been the cause.
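The usage figure came from inspecting the instance (the name is a placeholder; lxc info reports a VM’s memory usage once the lxd-agent inside it is running):

lxc info <vm-name> | grep -iA2 memory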


This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.