Perhaps someone can shed some light.
I am trying to configure a VF on an IB card with SR-IOV as a device in an LXC VM.
My command to add the device is: lxc config device add vm1 ib0 infiniband nictype=sriov parent=ibp4s0
However, when I try starting the vm, I get: Error: Failed to start device “ib0”: Device took too long to activate at “/sys/bus/pci/drivers/vfio-pci/0000:04:00.2”
The PCI address is one of the VFs. I know the VFs are good as I have succeeded in configuring them on the host (for testing) with ipoib and they are accessible from other nodes on the IB network.
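For anyone debugging a similar timeout, it can help to check which kernel driver the VF is bound to at the moment of the failure. This is a minimal sketch, not something from the original post: the function name pci_driver and the SYSFS_ROOT override are made up for illustration (the override only exists so the function can be exercised against a fake tree), while the sysfs "driver" symlink is the standard kernel layout.

```shell
#!/bin/sh
# Sketch: print the driver a PCI device is currently bound to, or "none".
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
pci_driver() {
    addr="$1"   # PCI address, e.g. 0000:04:00.2 from the error message
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name (e.g. vfio-pci, mlx5_core).
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

Running `pci_driver 0000:04:00.2` on the host shows whether the VF is still held by the host driver, already on vfio-pci, or unbound.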
Any insight is greatly appreciated.
Thanks in advance,
Brian Andrus
Thanks for the insight. It got me digging more and I think I found the issue (only to hit another).
I had to disable probing before enumerating the VFs:
sudo apt-get install linux-image-virtual is usually a better fit for VM guests, as it avoids, for example, the extra dependency on Intel and AMD microcode.
I figured it out. The DOCA drivers will not fully build for the kernel I was installing.
I had updated to 6.11.0-21-generic so I could get the gpu-operator to install, but that seems to break DOCA.
On the plus side, sticking to the more standard 6.8.0-58-generic seems to work.
Now I find that LXD doesn’t automatically release the VF when a VM is stopped. It does if it is a physical device, but not with SR-IOV. New topic to come on that as I dig.
Ah. This is an older Mellanox ConnectX-4 MCX455A-ECAT InfiniBand card.
I had to disable autoprobe to get lxc to grab the device. Otherwise it would not find a device to use.
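The exact commands used are not shown in the thread. A rough sketch of one way to do it through sysfs follows, assuming the kernel's standard sriov_drivers_autoprobe and sriov_numvfs attributes; the function name create_unbound_vfs, the VF count, and the SYSFS_ROOT override are illustrative assumptions, not the poster's actual steps.

```shell
#!/bin/sh
# Sketch: create VFs with driver autoprobe disabled so the host driver
# does not claim them. Must run as root against the real /sys.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
create_unbound_vfs() {
    parent="$1"   # parent interface, e.g. ibp4s0
    count="$2"    # number of VFs to create (assumed value, pick your own)
    dev="${SYSFS_ROOT:-/sys}/class/net/$parent/device"

    [ -e "$dev/sriov_numvfs" ] || { echo "no SR-IOV support on $parent" >&2; return 1; }

    # The VF count can only be changed from zero, so reset it first.
    echo 0 > "$dev/sriov_numvfs" || return 1
    # 0 = do not bind a host driver to newly created VFs.
    echo 0 > "$dev/sriov_drivers_autoprobe" || return 1
    # Now enumerate the VFs; they stay unbound on the host.
    echo "$count" > "$dev/sriov_numvfs" || return 1
}
```

For example, `create_unbound_vfs ibp4s0 4` would leave four VFs without a host driver. The setting does not persist across reboots, so it would need to be reapplied at boot.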
There are various functions in LXD to manage SR-IOV VFs:
Example with an Intel Corporation I350 4-port card:
root:~# cat /sys/class/net/enp3s0f0/device/sriov_numvfs
0
root:~# cat /sys/class/net/enp3s0f0/device/sriov_totalvfs
7
# No VFs activated yet
ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 98:ee:cb:f0:24:e2 brd ff:ff:ff:ff:ff:ff
3: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a0:36:9f:6f:cb:ac brd ff:ff:ff:ff:ff:ff
4: enp3s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether a0:36:9f:6f:cb:ad brd ff:ff:ff:ff:ff:ff
5: enp3s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether a0:36:9f:6f:cb:ae brd ff:ff:ff:ff:ff:ff
6: enp3s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether a0:36:9f:6f:cb:af brd ff:ff:ff:ff:ff:ff
7: wlp1s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 74:4c:a1:86:78:05 brd ff:ff:ff:ff:ff:ff
8: lxdbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 00:16:3e:ec:f8:db brd ff:ff:ff:ff:ff:ff
Let's set up LXD:
snap install lxd
lxd (5.21/stable) 5.21.3-c5ae129 from Canonical✓ installed
lxd init --auto
lxc init ubuntu:24.04 v1 --vm
lxc config device add v1 eth0 nic nictype=sriov parent=enp3s0f0
lxc start v1
# VFs activated now, with one assigned to the VM (we can see the MAC has been set and MAC spoof checking disabled).
3: enp3s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether a0:36:9f:6f:cb:ac brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 5 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 6 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
# Working inside the VM guest:
lxc exec v1 -- ip l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff
# Stop the VM
lxc stop v1
# VF returned to host (spoof checking re-enabled and the VF interface has re-appeared):
ip l
3: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether a0:36:9f:6f:cb:ac brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 1 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 2 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 4 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 5 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 6 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
...
16: enp3s0f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 96:e8:81:ba:30:db brd ff:ff:ff:ff:ff:ff permaddr 42:f4:96:bc:21:88
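To map the VF indexes in the output above to PCI addresses (useful when an error message only gives an address), the parent device's virtfnN symlinks in sysfs can be read. This is a small sketch under that assumption; list_vfs and the SYSFS_ROOT override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch: print "<vf index> <pci address>" for each VF of a parent interface.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
list_vfs() {
    parent="$1"   # parent interface, e.g. enp3s0f0
    dev="${SYSFS_ROOT:-/sys}/class/net/$parent/device"
    for link in "$dev"/virtfn*; do
        # Skip the unexpanded glob when no VFs exist.
        [ -L "$link" ] || continue
        idx="${link##*/virtfn}"
        addr="$(basename "$(readlink "$link")")"
        echo "$idx $addr"
    done
}
```

Calling `list_vfs enp3s0f0` prints nothing while sriov_numvfs is 0 and one line per VF once they have been activated.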
I suspect my issue may be because I am doing InfiniBand rather than Ethernet. I have to set the node GUID and port GUID on each VF before trying to start a VM (they both default to all zeros).
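Assuming the all-zero values are the VF's InfiniBand node GUID and port GUID, iproute2 can set them per VF with `ip link set ... vf N node_guid` and `port_guid`. The sketch below only prints the commands by default; the function name set_vf_guids, the RUN variable, and any GUID values passed in are placeholders, not values from this thread.

```shell
#!/bin/sh
# Sketch: set the node and port GUID of one InfiniBand VF with iproute2.
# RUN is a hypothetical switch: when unset the commands are only printed
# (dry run); set RUN to the empty string to actually execute them as root.
set_vf_guids() {
    parent="$1"   # parent interface, e.g. ibp4s0
    vf="$2"       # VF index, e.g. 0
    node="$3"     # node GUID in EUI-64 form (placeholder value)
    port="$4"     # port GUID in EUI-64 form (placeholder value)
    ${RUN-echo} ip link set dev "$parent" vf "$vf" node_guid "$node"
    ${RUN-echo} ip link set dev "$parent" vf "$vf" port_guid "$port"
}
```

A dry run such as `set_vf_guids ibp4s0 0 00:11:22:33:44:55:66:77 00:11:22:33:44:55:66:78` shows the two commands; `RUN= set_vf_guids ...` would apply them. The GUIDs need to be unique on the fabric.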
Wow… So I just did some more testing. I am using MAAS along with LXC for the VMs.
Apparently, if I deploy using the MAAS Ubuntu image, it works. If I use the LXC Ubuntu image, it does not.
This is getting more and more complicated. For the time being, I am also able to remove all the VFs in order to free them up when I need to reboot a VM.
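The thread does not say how the VFs were removed. One common way is to write 0 to the parent's sriov_numvfs attribute, sketched here under that assumption; remove_all_vfs and the SYSFS_ROOT override are illustrative names only.

```shell
#!/bin/sh
# Sketch: destroy every VF on a parent interface so they can be recreated.
# Stop any VM that still holds a VF first. Must run as root against /sys.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
remove_all_vfs() {
    parent="$1"   # parent interface, e.g. ibp4s0
    numvfs="${SYSFS_ROOT:-/sys}/class/net/$parent/device/sriov_numvfs"
    [ -e "$numvfs" ] || { echo "no SR-IOV support on $parent" >&2; return 1; }
    # Writing 0 removes all VFs of this physical function.
    echo 0 > "$numvfs"
}
```

After `remove_all_vfs ibp4s0`, the VFs have to be enumerated again before the next VM start.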
At least I have gotten far enough to find a path to what I needed.
Thanks for the insights.