Unable to use NVIDIA GPU in container (Intel built-in GPU works)

I have an instance based on this profile:

config:
  environment.DISPLAY: :1
  environment.PULSE_SERVER: unix:/var/pulse-native
  nvidia.driver.capabilities: all
  nvidia.runtime: true
  security.nesting: true
  cloud-init.user-data: |
    #cloud-config
    package_upgrade: true
    runcmd:
      - 'apt-get update'
      - 'apt-get install -y x11-apps'
      - 'apt-get install -y mesa-utils'
      - 'apt-get install -y pulseaudio'
      - 'apt-get install -y pulseaudio-utils'
      - 'apt-get install -y dbus-x11'
      - 'apt-get install -y vulkan-tools'
      - 'sed -i "s/; enable-shm = yes/enable-shm = no/g" /etc/pulse/client.conf'
      - 'echo export PULSE_SERVER=unix:/var/pulse-native | tee --append /home/ubuntu/.profile'
      - 'apt-get install -y -f'
description: Steam LXD profile
devices:
  PASocket:
    bind: container
    connect: unix:/run/user/1001/pulse/native
    listen: unix:/var/pulse-native
    security.gid: "1001"
    security.uid: "1001"
    uid: "1000"
    gid: "1000"
    mode: "0777"
    type: proxy
  X0Socket:
    bind: container
    connect: unix:/tmp/.X11-unix/X2
    listen: unix:/tmp/.X11-unix/X1
    security.gid: "1001"
    security.uid: "1001"
    uid: "1000"
    gid: "1000"
    mode: "0777"
    type: proxy
  mygpu:
    type: gpu
    gid: 44
name: steam
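
For reference, a profile like this is applied when creating the instance, e.g. (a sketch; steam-profile.yaml is a placeholder for wherever the YAML above is saved):

lxc profile create steam
lxc profile edit steam < steam-profile.yaml
lxc launch ubuntu:jammy steam --profile default --profile steam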

I have two GPUs: a built-in Intel GPU and an NVIDIA GPU. vulkaninfo on the host shows:

GPU0:
	apiVersion         = 4206830 (1.3.238)
	driverVersion      = 96468996 (0x5c00004)
	vendorID           = 0x8086
	deviceID           = 0xa7a0
	deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
	deviceName         = Intel(R) Graphics (RPL-P)
	driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
	driverName         = Intel open-source Mesa driver
	driverInfo         = Mesa 23.0.4-0ubuntu1~22.04.1
	conformanceVersion = 1.3.0.0
	deviceUUID         = d05c91e4-0bd3-f728-1afe-79d1db4dec74
	driverUUID         = 49579592-3e5c-2e53-be19-7f6d726063a9
GPU1:
	apiVersion         = 4206834 (1.3.242)
	driverVersion      = 2246476096 (0x85e68140)
	vendorID           = 0x10de
	deviceID           = 0x28e0
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = NVIDIA GeForce RTX 4060 Laptop GPU
	driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
	driverName         = NVIDIA
	driverInfo         = 535.154.05
	conformanceVersion = 1.3.5.0
	deviceUUID         = e3f4e87b-d44a-d7cb-cbf1-e93f7a8eaab3
	driverUUID         = 02b61036-1a0b-5721-99e2-071d493de8ce
GPU2:
	apiVersion         = 4206830 (1.3.238)
	driverVersion      = 1 (0x0001)
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 23.0.4-0ubuntu1~22.04.1 (LLVM 15.0.7)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3233-2e30-2e34-2d3075627500
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

However, vulkaninfo in the instance shows:

GPU0:
	apiVersion         = 4206830 (1.3.238)
	driverVersion      = 96468996 (0x5c00004)
	vendorID           = 0x8086
	deviceID           = 0xa7a0
	deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
	deviceName         = Intel(R) Graphics (RPL-P)
	driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
	driverName         = Intel open-source Mesa driver
	driverInfo         = Mesa 23.0.4-0ubuntu1~22.04.1
	conformanceVersion = 1.3.0.0
	deviceUUID         = d05c91e4-0bd3-f728-1afe-79d1db4dec74
	driverUUID         = 49579592-3e5c-2e53-be19-7f6d726063a9
GPU1:
	apiVersion         = 4206830 (1.3.238)
	driverVersion      = 1 (0x0001)
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 23.0.4-0ubuntu1~22.04.1 (LLVM 15.0.7)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3233-2e30-2e34-2d3075627500
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

Also, vkcube --gpu_number 0 works but vkcube --gpu_number 1 does not.

Is it that the GPU device rule in my profile does not pass the GPUs through properly, or is this a driver issue? Installing the NVIDIA drivers fails in the instance (I tried to install the same driver version as on the host). Any advice on how to debug this?
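
(For reference, these are the quick checks that narrow down whether the problem is device passthrough or missing userspace driver files; a sketch, with paths assuming Ubuntu's default packaging:)

ls -alF /dev/dri/                # are the DRM device nodes passed through?
nvidia-smi                       # does the driver see the card?
ls /usr/share/vulkan/icd.d/      # is the NVIDIA Vulkan ICD manifest present?
grep nvidia /proc/mounts         # which driver files did nvidia.runtime mount?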

nvidia-smi in the instance shows:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   30C    P0              N/A /  60W |     14MiB /  8188MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

vkcube on the host:

❯ vkcube --gpu_number 0
Selected GPU 0: Intel(R) Graphics (RPL-P), type: 1
❯ vkcube --gpu_number 1
Selected GPU 1: NVIDIA GeForce RTX 4060 Laptop GPU, type: 2

vkcube in the instance:

ubuntu@steam:~$ vkcube --gpu_number 0
Selected GPU 0: Intel(R) Graphics (RPL-P), type: 1
ubuntu@steam:~$ vkcube --gpu_number 1
Selected GPU 1: llvmpipe (LLVM 15.0.7, 256 bits), type: 4

The host has these NVIDIA packages installed:

libnvidia-cfg1-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-common-535/jammy-updates,jammy-updates,jammy-security,jammy-security,now 535.154.05-0ubuntu0.22.04.1 all [installed,automatic]
libnvidia-compute-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-compute-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 i386 [installed,automatic]
libnvidia-decode-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-decode-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 i386 [installed,automatic]
libnvidia-encode-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-encode-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 i386 [installed,automatic]
libnvidia-extra-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-fbc1-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-fbc1-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 i386 [installed,automatic]
libnvidia-gl-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
libnvidia-gl-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 i386 [installed,automatic]
linux-modules-nvidia-535-6.5.0-14-generic/jammy-updates,jammy-security,now 6.5.0-14.14~22.04.1+5 amd64 [installed,automatic]
linux-modules-nvidia-535-6.5.0-15-generic/jammy-updates,jammy-security,now 6.5.0-15.15~22.04.1+1 amd64 [installed,automatic]
linux-modules-nvidia-535-generic-hwe-22.04/jammy-updates,jammy-security,now 6.5.0-15.15~22.04.1+1 amd64 [installed]
linux-objects-nvidia-535-6.5.0-14-generic/jammy-updates,jammy-security,now 6.5.0-14.14~22.04.1+5 amd64 [installed,automatic]
linux-objects-nvidia-535-6.5.0-15-generic/jammy-updates,jammy-security,now 6.5.0-15.15~22.04.1+1 amd64 [installed,automatic]
linux-signatures-nvidia-6.5.0-14-generic/jammy-updates,jammy-security,now 6.5.0-14.14~22.04.1+5 amd64 [installed,automatic]
linux-signatures-nvidia-6.5.0-15-generic/jammy-updates,jammy-security,now 6.5.0-15.15~22.04.1+1 amd64 [installed,automatic]
nvidia-compute-utils-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-driver-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed]
nvidia-firmware-535-535.154.05/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-common-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-kernel-source-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
nvidia-prime/jammy,jammy,now 0.8.17.1 all [installed]
nvidia-settings/jammy,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]

Do I need something else on the host?

I don't have an NVIDIA card to test this, but maybe try passing the GPU with a specific PCI address. On the host, run:

lshw -C display

Look for the bus info: line of the NVIDIA card in the output. It should look something like this:

bus info: pci@0000:05:00.0

Grab that address and add the GPU to the container with the pci option:

mygpu:
    gid: "44"
    pci: "0000:05:00.0"
    type: gpu
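
(Equivalently, the device can be added from the CLI; a sketch, with c1 as a placeholder container name:)

lxc config device add c1 mygpu gpu gputype=physical pci=0000:05:00.0 gid=44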

Thanks for the suggestion, but the result is still the same: Vulkan does not seem to recognize the NVIDIA GPU properly.

Host lshw:

lshw -C display
  *-display                 
       description: VGA compatible controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=nvidia latency=0 mode=2560x1600 visual=truecolor xres=2560 yres=1600
       resources: iomemory:600-5ff iomemory:620-61f irq:193 memory:5f000000-5fffffff memory:6000000000-61ffffffff memory:6200000000-6201ffffff ioport:3000(size=128) memory:60000000-6007ffff
  *-display
       description: VGA compatible controller
       product: Intel Corporation
       vendor: Intel Corporation
       physical id: 2
       bus info: pci@0000:00:02.0
       logical name: /dev/fb0
       version: 04
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress msi pm vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=i915 latency=0 resolution=2560,1600
       resources: iomemory:620-61f iomemory:400-3ff irq:168 memory:6202000000-6202ffffff memory:4000000000-400fffffff ioport:4000(size=64) memory:c0000-dffff memory:4010000000-4016ffffff memory:4020000000-40ffffffff

Change in the profile:

mygpu:
    type: gpu
    gid: 44
    pci: 0000:01:00.0

Vulkaninfo in the instance:

GPU0:
	apiVersion         = 4206830 (1.3.238)
	driverVersion      = 1 (0x0001)
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 23.0.4-0ubuntu1~22.04.1 (LLVM 15.0.7)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3233-2e30-2e34-2d3075627500
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

Now it shows only one GPU in the instance (it must be the NVIDIA GPU), but Vulkan somehow does not recognize it properly. Also, glxgears reports wrong information, although it does display the gears.

glxgears -info
MESA: error: Failed to query drm device.
libGL error: glx: failed to create dri3 screen
libGL error: failed to load driver: iris
libGL error: failed to open /dev/dri/card0: No such file or directory
libGL error: failed to load driver: iris
GL_RENDERER   = llvmpipe (LLVM 15.0.7, 256 bits)
GL_VERSION    = 4.5 (Compatibility Profile) Mesa 23.0.4-0ubuntu1~22.04.1
GL_VENDOR     = Mesa

lshw -C display in the instance shows the same information as on the host.

Can you show the output of this command in the container?

ls -alF /dev/dri/

It should be something like this:

crw-rw----  1 root video 226,   0 sty 26 21:08 card0
crw-rw----  1 root video 226, 128 sty 26 21:08 renderD128

Does your user in the container belong to the video group?
getent group | grep video

ubuntu@steam:~$ ls -alF /dev/dri/
total 0
drwxr-xr-x  2 root root        80 Jan 26 19:01 ./
drwxr-xr-x 10 root root       660 Jan 26 19:01 ../
crw-rw----  1 root video 226,   1 Jan 26 19:01 card1
crw-rw----  1 root video 226, 129 Jan 26 19:01 renderD129

The ubuntu user (the user in the instance) belongs to these groups:

ubuntu@steam:~$ groups
ubuntu adm dialout cdrom floppy sudo audio dip video plugdev netdev lxd

I think I found the problem: I am missing the file /usr/share/vulkan/icd.d/nvidia_icd.json

When I copied that file from the host, vkcube started to work on the NVIDIA GPU, and vulkaninfo now shows:

GPU0:
	apiVersion         = 4206834 (1.3.242)
	driverVersion      = 2246476096 (0x85e68140)
	vendorID           = 0x10de
	deviceID           = 0x28e0
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = NVIDIA GeForce RTX 4060 Laptop GPU
	driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
	driverName         = NVIDIA
	driverInfo         = 535.154.05
	conformanceVersion = 1.3.5.0
	deviceUUID         = e3f4e87b-d44a-d7cb-cbf1-e93f7a8eaab3
	driverUUID         = 02b61036-1a0b-5721-99e2-071d493de8ce
GPU1:
	apiVersion         = 4206830 (1.3.238)
	driverVersion      = 1 (0x0001)
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 23.0.4-0ubuntu1~22.04.1 (LLVM 15.0.7)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3233-2e30-2e34-2d3075627500
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

Do you know why this file is missing in the instance? Should I install some NVIDIA packages in the instance? Or is it easier to just mount these files from the host?

Probably something similar is also missing on the OpenGL side?
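
(If mounting from the host is the route, a single file can be attached as a read-only disk device; a sketch, assuming the instance is named steam:)

lxc config device add steam nvidia-icd disk source=/usr/share/vulkan/icd.d/nvidia_icd.json path=/usr/share/vulkan/icd.d/nvidia_icd.json readonly=true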

I don't know why it's missing, but a quick search showed that this was a problem for Docker too.

Maybe try installing libxext6 as in this issue. They also mention the /usr/share/glvnd/egl_vendor.d/10_nvidia.json file. Is glxgears (part of mesa-utils) working in your container?

This issue mentions two more files besides nvidia_icd.json:

/usr/share/vulkan/icd.d/nvidia_icd.json
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.*
/usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.*
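
(A quick way to check which of these are already present in your container; a sketch:)

ls -l /usr/share/vulkan/icd.d/
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so* /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so*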

Edit: If you're interested in enabling Wayland in the container, here is my profile that does that, alongside X11 and PulseAudio. There's also a profile for Arch containers if you need the most up-to-date packages.

glxgears “works” in the sense that it always displays the gears, but it may print errors depending on the value of DRI_PRIME. Also, glxgears -info never says that it uses NVIDIA as the renderer.

DRI_PRIME=1 glxgears -info
libGL error: glx: failed to create dri3 screen
libGL error: failed to load driver: nouveau
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
GL_RENDERER   = Mesa Intel(R) Graphics (RPL-P)
GL_VERSION    = 4.6 (Compatibility Profile) Mesa 23.0.4-0ubuntu1~22.04.1
GL_VENDOR     = Intel
DRI_PRIME=0 glxgears -info
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
GL_RENDERER   = Mesa Intel(R) Graphics (RPL-P)
GL_VERSION    = 4.6 (Compatibility Profile) Mesa 23.0.4-0ubuntu1~22.04.1
GL_VENDOR     = Intel

glxgears seems to try to use the nouveau driver, but I have nvidia-driver-535. At least I have that driver on the host, and parts of it are passed to the instance because of the option nvidia.runtime: true. Installing nvidia-driver-535 fails in the instance, so I cannot install it there manually.

The missing files are from the package libnvidia-gl-535:

> dpkg -S /usr/share/vulkan/icd.d/nvidia_icd.json
libnvidia-gl-535:amd64: /usr/share/vulkan/icd.d/nvidia_icd.json

> dpkg -S /usr/share/glvnd/egl_vendor.d/10_nvidia.json
libnvidia-gl-535:amd64: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
> apt search libnvidia-gl-535
Sorting... Done
Full Text Search... Done
libnvidia-gl-530/jammy-updates,jammy-security 535.154.05-0ubuntu0.22.04.1 amd64
  Transitional package for libnvidia-gl-535

libnvidia-gl-535/jammy-updates,jammy-security,now 535.154.05-0ubuntu0.22.04.1 amd64 [installed,automatic]
  NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD

libnvidia-gl-535-server/jammy-updates,jammy-security 535.154.05-0ubuntu0.22.04.1 amd64
  NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD

On instance:

ubuntu@steam:~$ xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x47 cap: 0x9, Source Output, Sink Offload crtcs: 4 outputs: 5 associated providers: 1 name:modesetting
Provider 1: id: 0x219 cap: 0x2, Sink Output crtcs: 4 outputs: 3 associated providers: 1 name:NVIDIA-G0

This is the same listing as on the host. Note: in this setup I pass all GPUs through to the instance.

I dug a little deeper into the status of the libnvidia-gl-535 package in the instance. Note: I have not installed this package in the instance. Most parts of this package are mounted, e.g.:

> cat /proc/mounts
...
/dev/mapper/vgubuntu-root /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.535.154.05 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/mapper/vgubuntu-root /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.535.154.05 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/mapper/vgubuntu-root /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.535.154.05 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/mapper/vgubuntu-root /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.535.154.05 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
...

I guess this is the result of the nvidia.runtime: true option. However, all the files under /usr/share are missing. Seems like a bug in the nvidia.runtime option?
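
(One way to see exactly what is and is not mounted is to compare the package's file list on the host with the mounts in the instance; a sketch:)

# on the host: which files does the package ship under /usr/share?
dpkg -L libnvidia-gl-535 | grep ^/usr/share

# in the instance: which NVIDIA driver files did the runtime mount?
grep nvidia /proc/mounts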

I will work on the Vulkan tools next, but I thought I would at least make sure that device passing and the NVIDIA runtime option work. I have a physical machine with a built-in GPU (Matrox) and an extra NVIDIA card with the NVIDIA driver up and running. I have tried this with LXD 5.0.2 and 5.20, and both produce the same results. I created three containers, passing the first GPU device to one, the second to another, and both to the third. Below are the commands I used to select the devices by their PCI IDs. Listing the available devices in /dev/dri and running nvidia-smi does what is expected.

Hope this helps while I look at the Vulkan stuff.

ubuntu@sm:~$ sudo snap install lxd --channel=5.0
lxd (5.0/stable) 5.0.2-d4d8da9 from Canonical✓ installed
ubuntu@sm:~$ lxd init --auto
ubuntu@sm:~$ lspci | grep VGA
03:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
09:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
ubuntu@sm:~$ ls -l /dev/dri
total 0
drwxr-xr-x  2 root root        100 Jan 31 17:13 by-path
crw-rw----+ 1 root video  226,   0 Jan 31 17:13 card0
crw-rw----+ 1 root video  226,   1 Jan 31 17:13 card1
crw-rw----+ 1 root render 226, 128 Jan 31 17:13 renderD128
ubuntu@sm:~$ lxc init ubuntu:jammy c1
Creating c1
ubuntu@sm:~$ lxc init ubuntu:jammy c2
Creating c2
ubuntu@sm:~$ lxc init ubuntu:jammy c3
Creating c3
ubuntu@sm:~$ lxc config device add c1 gpu0 gpu gputype=physical pci=0000:09:01.0
Device gpu0 added to c1
ubuntu@sm:~$ lxc config device add c2 gpu1 gpu gputype=physical pci=0000:03:00.0
Device gpu1 added to c2
ubuntu@sm:~$ lxc config device add c3 gpu0 gpu gputype=physical pci=0000:09:01.0
Device gpu0 added to c3
ubuntu@sm:~$ lxc config device add c3 gpu1 gpu gputype=physical pci=0000:03:00.0
Device gpu1 added to c3
ubuntu@sm:~$ lxc config set c1 nvidia.runtime=true
ubuntu@sm:~$ lxc config set c2 nvidia.runtime=true
ubuntu@sm:~$ lxc config set c3 nvidia.runtime=true
ubuntu@sm:~$ lxc start c1 c2 c3
ubuntu@sm:~$ lxc exec c1 -- ls -l /dev/dri
total 0
crw-rw---- 1 root root 226, 0 Feb  1 17:41 card0
ubuntu@sm:~$ lxc exec c2 -- ls -l /dev/dri
total 0
crw-rw---- 1 root root 226,   1 Feb  1 17:41 card1
crw-rw---- 1 root root 226, 128 Feb  1 17:41 renderD128
ubuntu@sm:~$ lxc exec c3 -- ls -l /dev/dri
total 0
crw-rw---- 1 root root 226,   0 Feb  1 17:41 card0
crw-rw---- 1 root root 226,   1 Feb  1 17:41 card1
crw-rw---- 1 root root 226, 128 Feb  1 17:41 renderD128
ubuntu@sm:~$ lxc exec c1 -- nvidia-smi
No devices were found
ubuntu@sm:~$ lxc exec c2 -- nvidia-smi
Thu Feb  1 17:42:17 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off | 00000000:03:00.0 Off |                  N/A |
| 49%   25C    P8               5W / 120W |      2MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
ubuntu@sm:~$ lxc exec c3 -- nvidia-smi
Thu Feb  1 17:42:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off | 00000000:03:00.0 Off |                  N/A |
| 49%   25C    P8               5W / 120W |      2MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
ubuntu@sm:~$ 

Thanks for your help. You are correct: GPU passthrough works as expected.

The problem is in these options:

nvidia.driver.capabilities: all
nvidia.runtime: true

These options do not mount all of the NVIDIA driver files from the host into the instance.

Take the NVIDIA driver package libnvidia-gl-535, for example: the files in /usr/share are not mounted into the instance, but the files in /usr/lib/x86_64-linux-gnu are, e.g.:

ubuntu@steam:~$ less /proc/mounts
...
/dev/mapper/vgubuntu-root /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.154.05 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
...

This is the reason why vulkaninfo does not show the NVIDIA GPU properly and why vkcube cannot use it. I can work around the issue by manually adding these devices:

devices:
  egl:
    source: /usr/share/egl
    path: /usr/share/egl
    type: disk
    readonly: true
  glvnd:
    source: /usr/share/glvnd
    path: /usr/share/glvnd
    type: disk
    readonly: true
  nvidia:
    source: /usr/share/nvidia
    path: /usr/share/nvidia
    type: disk
    readonly: true
  vulkan:
    source: /usr/share/vulkan
    path: /usr/share/vulkan
    type: disk
    readonly: true
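
(The same mounts can also be added from the CLI instead of editing the profile YAML; a sketch for one of them, assuming the profile is named steam:)

lxc profile device add steam vulkan disk source=/usr/share/vulkan path=/usr/share/vulkan readonly=true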

After adding these mounts, vulkaninfo and vkcube work as expected.

> lxd --version
5.20

Vulkaninfo shows correct information after adding the previous devices. However, glxinfo still seems to show incorrect information for NVIDIA:

ubuntu@steam:~$ DRI_PRIME=1 LIBGL_DEBUG=verbose glxinfo -B
name of display: :1
libGL: using driver nvidia-drm for 6
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: pci id for fd 6: 10de:28e0, driver nouveau
libGL: MESA-LOADER: dlopen(/usr/lib/x86_64-linux-gnu/dri/nouveau_dri.so)
libGL: using driver i915 for 5
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: using driver i915 for 5
libGL: pci id for fd 5: 8086:a7a0, driver iris
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL error: glx: failed to create dri3 screen
libGL error: failed to load driver: nouveau
libGL: using driver i915 for 4
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: using driver i915 for 4
libGL: pci id for fd 4: 8086:a7a0, driver iris
libGL: MESA-LOADER: dlopen(/usr/lib/x86_64-linux-gnu/dri/iris_dri.so)
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: Can't open configuration file /etc/drirc: No such file or directory.
libGL: Can't open configuration file /home/ubuntu/.drirc: No such file or directory.
libGL: Using DRI2 for screen 0
display: :1  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel (0x8086)
    Device: Mesa Intel(R) Graphics (RPL-P) (0xa7a0)
    Version: 23.0.4
    Accelerated: yes
    Video memory: 31808MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) Graphics (RPL-P)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 23.0.4-0ubuntu1~22.04.1
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

I do not know why it tries to use nouveau, why it cannot create a DRI3 screen, or why it shows Intel as the renderer. These all work if I use the Intel GPU (DRI_PRIME=0). Am I missing some other driver files in the instance?
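
(One thing that may be worth trying: the host packages are GLVND-based, so the GLX vendor library can be forced explicitly, bypassing the PRIME driver selection; a sketch, assuming libglvnd honours this variable in the container:)

__GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo -B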

ubuntu@steam:~$ xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x47 cap: 0x9, Source Output, Sink Offload crtcs: 4 outputs: 5 associated providers: 1 name:modesetting
Provider 1: id: 0x211 cap: 0x2, Sink Output crtcs: 4 outputs: 3 associated providers: 1 name:NVIDIA-G0

I came across this thread relating to NVIDIA troubleshooting, and they set up xorg.conf by hand. Maybe that will help with the nouveau driver issue.

$ cat /usr/share/X11/xorg.conf.d/xorg.conf

Section "Device"
	Identifier  "RTX3060"
	Driver      "nvidia"
	BusID       "PCI:7:0:0"
EndSection
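
(Adapted to the card earlier in this thread, whose bus info is pci@0000:01:00.0, the BusID is written in decimal, so the section would presumably look like this:)

Section "Device"
	Identifier  "RTX4060"
	Driver      "nvidia"
	BusID       "PCI:1:0:0"
EndSection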

With regards to the mounting, we rely on the NVIDIA tools to do this. As we will be moving to the CDI approach as per this issue, I believe your workaround of manually mounting the missing files will be the interim solution.

I wrote a small script which installs the missing NVIDIA configuration files directly from the NVIDIA driver libraries:

https://github.com/hkorpi/lxd-steam

When these files are added, both OpenGL and Vulkan applications work in this container with the NVIDIA GPU. Steam also works inside this container, using either the NVIDIA GPU or the integrated GPU.
