LXD with nvidia-runtime fails on some nodes in a cluster

I have a fresh three-node LXD cluster; two of the nodes have GPUs. Containers with nvidia.runtime=true start fine on one of the GPU nodes but not on the other. On the second node they fail with the following messages:

$ lxc start test-cuda
Error: Failed to run: /snap/lxd/current/sbin/lxd forkstart test-cuda /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/test-cuda/lxc.conf: exit status 1
Try `lxc info --show-log test-cuda` for more info

$ lxc info --show-log test-cuda
Name: test-cuda
Status: STOPPED
Type: container
Architecture: x86_64
Location: turing
Created: 2025/01/27 15:08 CET
Last Used: 2025/01/27 15:19 CET

Log:

lxc test-cuda 20250127141936.616 ERROR    utils - ../src/src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc test-cuda 20250127141936.618 ERROR    conf - ../src/src/lxc/conf.c:lxc_setup:3940 - Failed to run mount hooks
lxc test-cuda 20250127141936.618 ERROR    start - ../src/src/lxc/start.c:do_start:1273 - Failed to setup container "test-cuda"
lxc test-cuda 20250127141936.619 ERROR    sync - ../src/src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc test-cuda 20250127141936.621 ERROR    lxccontainer - ../src/src/lxc/lxccontainer.c:wait_on_daemonized_start:837 - Received container state "ABORTING" instead of "RUNNING"
lxc test-cuda 20250127141936.657 ERROR    start - ../src/src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "test-cuda"
lxc test-cuda 20250127141936.658 WARN     start - ../src/src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 334745
lxc 20250127141936.101 ERROR    af_unix - ../src/src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20250127141936.101 ERROR    commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"
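In case more detail is needed, I can also grab the lower-level logs; something like this should show whatever the failing mount hook leaves behind (paths as used by the snap package):

$ sudo ls -l /var/snap/lxd/common/lxd/logs/test-cuda/
$ lxc monitor --pretty --type=logging   # in another terminal while retrying the start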

This is just a test container created with `lxc launch ubuntu:24.04 test-cuda -c nvidia.runtime=true` and nothing else, but the result is exactly the same with my real containers: they start on one GPU node but not on the other. All cluster nodes are freshly installed Ubuntu 24.04 with (hopefully) identical environments, and the containers I tested also include older Ubuntu releases. I started with LXD 5.21 LTS and, after fighting these issues, tried upgrading to LXD 6.2 from the latest/stable channel (I am now unable to downgrade again, which does not make me feel comfortable, since this is going to be a production cluster). The behaviour is the same with 6.2 and 5.21.
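For completeness, the full sequence on the failing node is roughly the following (the --expanded check is only there to confirm what configuration the container actually ends up with):

$ lxc launch ubuntu:24.04 test-cuda -c nvidia.runtime=true
$ lxc config show test-cuda --expanded | grep nvidia
$ lxc start test-cuda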

The only potential problem I can think of is a leftover from the trouble I had when joining this node to the cluster. I succeeded only after some 10-15 failed attempts at lxd init, with and without a join token, when I finally gave up on providing the node's DNS name (which had worked for the first two nodes) in answer to the question "What IP address or DNS name should be used to reach this server?" and let the init use the suggested IPv4 address of the new node instead.

On the surface the three nodes appear to be working fine, but when I looked at sudo lxd cluster edit (after I broke the cluster by downgrading back to 5.21), I saw four members listed there: the third node appeared twice, once under its DNS name and once under the IPv4 address it uses now. I didn't dare to change anything for fear of doing irreversible damage, and instead upgraded back to 6.2, which got the cluster working again. Could this somehow be connected to the issue described above?
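In case it helps with diagnosing this, these are the read-only checks I can run and post output from; if I understand the tooling correctly, lxd cluster list-database shows the database (raft) members without having to go through lxd cluster edit:

$ lxc cluster list
$ lxc cluster show <member-name>
$ sudo lxd cluster list-database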

Hi,

Are you able to try using the CDI mode for GPU container passthrough in LXD 6.2?

https://documentation.ubuntu.com/lxd/en/latest/reference/devices_gpu/#cdi-mode
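Roughly, with CDI you would add the GPU as a device using a CDI identifier instead of relying on nvidia.runtime; on your test container that would look something like this (adjust the index per node):

$ lxc config device add test-cuda gpu0 gpu gputype=physical id=nvidia.com/gpu=0
# or, to pass every NVIDIA GPU on the node:
# id=nvidia.com/gpu=all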

I could try, but it would probably mean losing the flexibility to move the container between nodes, since the two nodes have different GPUs (and even a different number of GPUs).

However, as shown above, the container fails to start before I even add any GPU device at all. Setting nvidia.runtime alone is enough to make it fail on the second node, while it still starts fine on the first one.

The problem is not passing a GPU, but passing the runtime environment.
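If output from the failing node would help, these are the kinds of host-side checks I can run to confirm the driver userspace itself is sane (nvidia-container-cli only if it happens to be installed on the host; as far as I understand, the snap bundles its own copy):

$ nvidia-smi
$ nvidia-container-cli info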

Problem solved, although I am not really sure about the actual cause. On the problematic machine, I installed CUDA and the drivers from the NVIDIA distribution (instead of the default Ubuntu packages) and removed the nvidia.driver.capabilities setting from the LXD configuration (I think I had tried the latter before without any success).
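For the record, removing the key boils down to something like this ("default" is only an example profile name):

$ lxc config unset test-cuda nvidia.driver.capabilities
# or, if the key was set in a profile rather than on the container:
# lxc profile unset default nvidia.driver.capabilities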

I still wonder why it had been working on the first machine all along without these modifications.

This is just ridiculous. Now nvidia.runtime has stopped working on the first node, while it continues to work on the second one, i.e. the opposite of the situation before. The only change I made was installing the NVIDIA packages instead of the Ubuntu ones on the first node as well.

I tried (re)adding the option nvidia.driver.capabilities=all and nothing happened. But after removing that option again… the container suddenly starts on the first node again. I am really puzzled about what kind of race condition is involved here. Maybe the configuration just needs some change to trigger a refresh of nvidia.runtime…?

Well, not anymore. No “magic” helps this time: the container no longer starts on the first node, only on the second one.