Device took too long to activate for sriov infiniband device

Perhaps someone can shed some light.
I am trying to configure a VF on an IB card with SRIOV as a device in an LXC vm.
My command to add the device is:
lxc config device add vm1 ib0 infiniband nictype=sriov parent=ibp4s0
However, when I try starting the vm, I get:
Error: Failed to start device "ib0": Device took too long to activate at "/sys/bus/pci/drivers/vfio-pci/0000:04:00.2"

The PCI address is one of the VFs. I know the VFs are good as I have succeeded in configuring them on the host (for testing) with ipoib and they are accessible from other nodes on the IB network.
Any insight is greatly appreciated.
Thanks in advance,
Brian Andrus

Try adding this to /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on amd_iommu=on pci=assign-busses pcie_aspm=off iommu=1"

and run sudo update-grub

This can help with rebinding specific VFs to the vfio-pci driver that is used to pass it to the VM guest.

Also, capturing the output of sudo dmesg may help identify the issue.
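
To check whether the IOMMU is actually active after the reboot, and which driver a VF is currently bound to, something like the following sketch might help. The function names and the optional path arguments are hypothetical (the overrides exist only so the helpers can be tried against a scratch directory); the sysfs locations are the standard ones.

```shell
# Count IOMMU groups. Zero usually means the IOMMU is not active.
# $1 optionally overrides the sysfs directory (hypothetical, for testing).
iommu_group_count() {
    dir="${1:-/sys/kernel/iommu_groups}"
    [ -d "$dir" ] || { echo 0; return 0; }
    ls -A "$dir" | wc -l | tr -d ' '
}

# Print the driver a PCI device is bound to, or "none" if it is unbound.
# $1 is a PCI address such as 0000:04:00.2.
# $2 optionally overrides the devices directory (hypothetical, for testing).
pci_driver_of() {
    link="${2:-/sys/bus/pci/devices}/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

For example, pci_driver_of 0000:04:00.2 shows whether the VF from the error message is held by a host driver, by vfio-pci, or by nothing at all.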

Thanks for the insight. It got me digging more and I think I found the issue (only to hit another).
I had to disable probing before enumerating the VFs:

echo 0 > /sys/class/infiniband/mlx5_0/device/sriov_drivers_autoprobe
echo 2 > /sys/class/infiniband/mlx5_0/device/sriov_numvfs

Then lxc was able to start and the VF showed up.
Now I am running into the mlnx_ib driver failing to load, which seems to be a known issue with LXD.
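
For anyone repeating the autoprobe workaround above, the two writes could be wrapped roughly as follows. This is only a sketch: create_vfs and its arguments are hypothetical names, and it assumes the device directory exposes the usual sriov_totalvfs, sriov_numvfs and sriov_drivers_autoprobe files.

```shell
# create_vfs DEVICE_DIR COUNT
# DEVICE_DIR is e.g. /sys/class/infiniband/mlx5_0/device; run as root.
create_vfs() {
    dev="$1"
    want="$2"
    total=$(cat "$dev/sriov_totalvfs") || return 1
    current=$(cat "$dev/sriov_numvfs") || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs but the device supports only $total" >&2
        return 1
    fi
    # Disable autoprobe first so the host driver does not claim new VFs.
    echo 0 > "$dev/sriov_drivers_autoprobe" || return 1
    # Already at the requested count: nothing to create.
    [ "$current" = "$want" ] && return 0
    # The kernel rejects a change from one non-zero count to another,
    # so drop to zero first. This destroys any existing VFs.
    if [ "$current" != "0" ]; then
        echo 0 > "$dev/sriov_numvfs" || return 1
    fi
    echo "$want" > "$dev/sriov_numvfs"
}
```

Calling create_vfs /sys/class/infiniband/mlx5_0/device 2 would then match the two echo commands above. Note that VFs which already existed before autoprobe was turned off stay bound to whatever driver claimed them.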

Is this inside the guest VM?

Try switching to sudo apt-get install linux-image-generic inside the guest.

sudo apt-get install linux-image-virtual is usually a better fit for VM guests, as it avoids, for example, the extra dependency on Intel and AMD microcode.
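
One way to see whether the kernel installed in the guest ships a given module on disk is sketched below. module_file_exists is a hypothetical helper, and mlx5_ib is only an assumed module name for a ConnectX-class card; drivers built into the kernel image will not show up this way.

```shell
# module_file_exists NAME [MODULES_DIR]
# Succeeds if a NAME.ko file (optionally compressed, e.g. .ko.zst)
# exists under the modules tree of the running kernel.
# MODULES_DIR optionally overrides the tree (hypothetical, for testing).
module_file_exists() {
    name="$1"
    dir="${2:-/lib/modules/$(uname -r)}"
    [ -d "$dir" ] || return 1
    [ -n "$(find "$dir" -name "$name.ko*" -print 2>/dev/null | head -n 1)" ]
}
```

Inside the guest, module_file_exists mlx5_ib && echo present would report whether the current kernel package includes that module file.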

I figured it out. The DOCA drivers will not fully build for the kernel I was installing.
I had updated to 6.11.0-21-generic so I could get the gpu-operator to install, but that seems to break DOCA.
On the plus side, sticking to the more standard 6.8.0-58-generic seems to work.
Now I find that LXD doesn't automatically release the VF when a VM is stopped. It does for a physical device, but not for sriov. New topic to come on that as I dig.

Seems like this card behaves quite differently to the others we've tested.

We have pre-release testing for an Intel SR-IOV card, and it does indeed release the VF.

Disabling sriov_drivers_autoprobe may be the issue causing it to not be rebound to the host.

Ah. This is an older Mellanox ConnectX-4 MCX455A-ECAT infiniband card.
I had to disable autoprobe to get lxc to grab the device. Otherwise it would not find a device to use.

There are various functions in LXD to manage SR-IOV VFs.

Example with an Intel Corporation I350 4-port card:

root:~# cat /sys/class/net/enp3s0f0/device/sriov_numvfs
0

root:~# cat /sys/class/net/enp3s0f0/device/sriov_totalvfs 
7

# No VFs activated yet
ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 98:ee:cb:f0:24:e2 brd ff:ff:ff:ff:ff:ff
3: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether a0:36:9f:6f:cb:ac brd ff:ff:ff:ff:ff:ff
4: enp3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether a0:36:9f:6f:cb:ad brd ff:ff:ff:ff:ff:ff
5: enp3s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether a0:36:9f:6f:cb:ae brd ff:ff:ff:ff:ff:ff
6: enp3s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether a0:36:9f:6f:cb:af brd ff:ff:ff:ff:ff:ff
7: wlp1s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 74:4c:a1:86:78:05 brd ff:ff:ff:ff:ff:ff
8: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:ec:f8:db brd ff:ff:ff:ff:ff:ff

Let's set up LXD:

snap install lxd
lxd (5.21/stable) 5.21.3-c5ae129 from Canonical✓ installed
lxd init --auto

lxc init ubuntu:24.04 v1 --vm
lxc config device add v1 eth0 nic nictype=sriov parent=enp3s0f0
lxc start v1

# VFs activated now, with one assigned to the VM (we can see the MAC has been set and MAC spoof checking disabled).
3: enp3s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether a0:36:9f:6f:cb:ac brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 4     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 5     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 6     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off

# Working inside the VM guest:
lxc exec v1 -- ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff

# Stop the VM
lxc stop v1


# VF returned to host (spoof checking re-enabled and the VF interface has re-appeared):
ip l
3: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether a0:36:9f:6f:cb:ac brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 4     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 5     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 6     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off

...

16: enp3s0f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff permaddr 42:f4:96:bc:21:88

Auto probe is enabled:

cat /sys/class/net/enp3s0f0/device/sriov_drivers_autoprobe 
1

Even with auto probe disabled it still seems to release OK on my test system.

I suspect my issue may be because I am doing infiniband rather than ethernet. I have to set the uid/gid before trying to start a vm (they both default to all zeros).

Wow… So I just did some more testing. I am using MAAS along with LXC for the vms.
Apparently, if I deploy using the MAAS ubuntu image, it works. If I use the LXC ubuntu image, it does not.
This is getting more and more complicated. For the time being, I am also able to remove all the VFs in order to free them up when I need to reboot a VM.
At least I have gotten far enough to find a path to what I needed.
Thanks for the insights.
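
For reference, removing all the VFs as described above amounts to writing 0 to sriov_numvfs on the parent device. A minimal sketch, with remove_vfs as a hypothetical helper name and the device path taken from earlier in the thread:

```shell
# remove_vfs DEVICE_DIR
# Destroys every VF on the device, so stop any VM using one first.
# DEVICE_DIR is e.g. /sys/class/infiniband/mlx5_0/device; run as root.
remove_vfs() {
    dev="$1"
    if [ ! -w "$dev/sriov_numvfs" ]; then
        echo "cannot write $dev/sriov_numvfs" >&2
        return 1
    fi
    echo 0 > "$dev/sriov_numvfs" || return 1
    # Confirm that the count really dropped to zero.
    [ "$(cat "$dev/sriov_numvfs")" = "0" ]
}
```

After remove_vfs /sys/class/infiniband/mlx5_0/device, the VFs can be recreated with the same two echo commands used earlier.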
