I have a fresh LXD cluster with three nodes, two of which have GPUs. However, containers with nvidia.runtime=true
start on only one of the GPU nodes. On the other one they fail with the usual messages:
$ lxc start test-cuda
Error: Failed to run: /snap/lxd/current/sbin/lxd forkstart test-cuda /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/test-cuda/lxc.conf: exit status 1
Try `lxc info --show-log test-cuda` for more info
$ lxc info --show-log test-cuda
Name: test-cuda
Status: STOPPED
Type: container
Architecture: x86_64
Location: turing
Created: 2025/01/27 15:08 CET
Last Used: 2025/01/27 15:19 CET
Log:
lxc test-cuda 20250127141936.616 ERROR utils - ../src/src/lxc/utils.c:run_buffer:571 - Script exited with status 1
lxc test-cuda 20250127141936.618 ERROR conf - ../src/src/lxc/conf.c:lxc_setup:3940 - Failed to run mount hooks
lxc test-cuda 20250127141936.618 ERROR start - ../src/src/lxc/start.c:do_start:1273 - Failed to setup container "test-cuda"
lxc test-cuda 20250127141936.619 ERROR sync - ../src/src/lxc/sync.c:sync_wait:34 - An error occurred in another process (expected sequence number 4)
lxc test-cuda 20250127141936.621 ERROR lxccontainer - ../src/src/lxc/lxccontainer.c:wait_on_daemonized_start:837 - Received container state "ABORTING" instead of "RUNNING"
lxc test-cuda 20250127141936.657 ERROR start - ../src/src/lxc/start.c:__lxc_start:2114 - Failed to spawn container "test-cuda"
lxc test-cuda 20250127141936.658 WARN start - ../src/src/lxc/start.c:lxc_abort:1037 - No such process - Failed to send SIGKILL via pidfd 17 for process 334745
lxc 20250127141936.101 ERROR af_unix - ../src/src/lxc/af_unix.c:lxc_abstract_unix_recv_fds_iov:218 - Connection reset by peer - Failed to receive response
lxc 20250127141936.101 ERROR commands - ../src/src/lxc/commands.c:lxc_cmd_rsp_recv_fds:128 - Failed to receive file descriptors for command "get_init_pid"
This is just a test container created with lxc launch ubuntu:24.04 test-cuda -c nvidia.runtime=true
and nothing else, but the result is exactly the same with my real containers: they start on one GPU node but not on the other. All cluster nodes are freshly installed Ubuntu 24.04 with (hopefully) the same environment. The containers I tested also include older Ubuntu releases. I started with LXD 5.21 LTS and, after fighting these issues, tried upgrading to LXD 6.2 from the latest stable channel. (I am now unable to downgrade again, which makes me uncomfortable, since this is going to be a production cluster.) The behaviour is the same with 6.2 and 5.21.
The only potential cause I can think of is the trouble I had when joining this node to the cluster. I succeeded only after some 10-15 failed attempts at lxd init
with and without a join token, when I finally gave up on providing the node's DNS name (which had worked for the first two nodes) in answer to the question What IP address or DNS name should be used to reach this server?
and let the init use the suggested IPv4 address of the new node instead. On the surface, I see only the three nodes, all working well, but when I looked at sudo lxd cluster edit
after I broke the cluster by downgrading back to 5.21, I saw four members listed there: the third one appeared twice, once with its DNS name and once with the IPv4 address used now. I didn’t dare change anything for fear of causing irreversible damage, so I upgraded back to 6.2, which made the cluster work again. Could this somehow be connected to the issue described above?
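In case it helps anyone check for the same duplicate entry without opening an editor, here is a rough, read-only sketch. The helper name duplicate_member_names is made up for illustration, and it assumes that sudo lxd cluster show prints the same YAML that lxd cluster edit opens, with one "name:" line per member; I have not verified that layout on every LXD version.

```shell
#!/bin/sh
# Print every member name that occurs more than once in the YAML on stdin.
# Assumption: each member entry carries a line of the form "name: <value>",
# optionally preceded by whitespace and a list dash.
duplicate_member_names() {
    sed -n 's/^[[:space:]-]*name:[[:space:]]*//p' | sort | uniq -d
}

# Intended use on a cluster member (read-only, nothing is modified):
#   sudo lxd cluster show | duplicate_member_names
```

An empty result would mean no member name is listed twice. A name printed here would match what I saw in the editor, where the same node appears under two different addresses.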