OVN DPU acceleration with Mellanox/Nvidia Bluefield cards

Project LXD
Status Drafting
Author(s) @tomparrott
Approver(s) @egelinas
Release 6.x
Internal ID LX077

Abstract

Add support for using Bluefield DPU (Data Processing Unit) NIC cards for acceleration of LXD OVN networks.

Rationale

LXD already supports accelerating OVN networking flows when using an SR-IOV card that is compatible with switchdev mode. See SR-IOV hardware acceleration.

Existing SR-IOV OVN acceleration:

However, in this mode the LXD host(s) still need to run the ovn-controller and have access to an Open vSwitch OVN integration bridge on each host.

With OVN DPU acceleration it becomes possible to shift these components onto the DPU card that is attached to the LXD host. This provides both additional offloading of work away from the LXD host(s) as well as improved security/isolation, because there are fewer services running on the LXD host(s).

The Bluefield 2 card is a NIC and a separate ARM computer combined. It is connected to the LXD host using the PCIe bus.

On the LXD host the network interfaces from the card do not represent the physical ports on the card.
Instead there are physical functions (PFs) and associated virtual functions (VFs) which are joined to associated PF and VF “representor” interfaces on the DPU card itself. In this way packets can flow between the host and the DPU card, and the “representor” ports can then be connected to bridges on the DPU card to have their flows offloaded to the NIC.

In this scenario the LXD host(s) will only see the SR-IOV PF and VF interfaces and will pass them to the instances as needed. LXD will still communicate with the OVN northbound and southbound database services, but they do not necessarily need to be running on the same host.

Proposed DPU OVN acceleration:

Specification

Instance NIC connectivity

1. LXD needs to be told how to communicate with the OVN southbound database

Currently LXD is told how to communicate with the OVN northbound database by way of the server setting network.ovn.northbound_connection. For the OVN southbound database it instead derives the connection string from the output of the ovs-vsctl get open_vswitch . external_ids:ovn-remote command. This is because each Open vSwitch chassis must already be configured to communicate with the OVN southbound database, so there was no need to duplicate that configuration.
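
For reference, the existing lookup looks roughly like this on a host running Open vSwitch (the output shown is an example):

ovs-vsctl get open_vswitch . external_ids:ovn-remote
"ssl:192.0.2.11:6642,ssl:192.0.2.12:6642,ssl:192.0.2.13:6642"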

However, in DPU acceleration mode Open vSwitch will not be running on the LXD host(s), so we will need a new setting; this is being proposed as network.ovn.southbound_connection.
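
As an illustration only (the setting name is proposed and the addresses are placeholders), configuring it would mirror the existing northbound setting:

lxc config set network.ovn.northbound_connection ssl:192.0.2.11:6641,ssl:192.0.2.12:6641
lxc config set network.ovn.southbound_connection ssl:192.0.2.11:6642,ssl:192.0.2.12:6642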

2. The ovn NIC type will need to have a new acceleration mode value

Currently ovn NICs support two acceleration modes: sriov and vdpa.

We will need to extend this option to support a proposed dpu mode in order to be able to indicate that a specific instance NIC should use the DPU acceleration mode.
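
For example, assuming the proposed mode name, an instance NIC could be configured as follows (instance and device names are illustrative):

lxc config device add c1 eth0 nic network=ovn1 acceleration=dpu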

3. The associated ovn network will need to have a new per-member setting to indicate physical function (PF) interface

Currently, when an ovn NIC has acceleration enabled, the candidate VFs are selected by requiring their PF interface(s) to be connected to the host’s OVN integration bridge.

Set up OVS by enabling hardware offload and adding the PF NIC to the integration bridge (normally called br-int):
ovs-vsctl set open_vswitch . other_config:hw-offload=true
systemctl restart openvswitch-switch
ovs-vsctl add-port br-int enp9s0f0np0

This is problematic because the operator may want to use the PF interface for some other purpose (perhaps for host networking), and requiring it be added to the OVN integration bridge is restrictive.

So when using the acceleration setting we need a better way for LXD to know which VFs from which PFs are candidates for use with instance ovn NICs.

The proposal is to add a per-member setting to ovn networks, named parent.<mode>, which will indicate which PFs to use for VF allocation when ovn NICs connected to that network have acceleration enabled in that mode.

This way it will be possible for ovn NICs in the same network to use a mixture of acceleration modes.

E.g. lxc network set ovn1 parent.dpu=enp130s0f0np0

4. The DPU card will need to be configured (by the operator)

On the DPU card, Open vSwitch will need to be configured as follows:

  1. Enable hardware offload:
ovs-vsctl set open_vswitch . other_config:hw-offload=true
  2. Store the DPU card’s serial number:
    This will be automatically synced to the OVN southbound database’s chassis table.
    It can be retrieved using lspci -vv, looking for the [SN] Serial number value in the Capabilities: [48] Vital Product Data section.
ovs-vsctl set open_vswitch . \
    external_ids:ovn-cms-options=card_serial_number=<DPU_SERIAL_NUMBER>
  3. Connect it to OVN:
    The steps from https://documentation.ubuntu.com/lxd/en/latest/howto/network_ovn_setup/#set-up-a-lxd-cluster-on-ovn need to be followed to connect the DPU to OVN (a rough sketch follows below).
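
For orientation only, connecting the DPU’s Open vSwitch to OVN involves setting the standard chassis external IDs, roughly as below; the values are placeholders and the linked guide remains authoritative:

ovs-vsctl set open_vswitch . \
    external_ids:ovn-remote=ssl:<OVN_SB_IP>:6642 \
    external_ids:ovn-encap-type=geneve \
    external_ids:ovn-encap-ip=<DPU_TUNNEL_IP>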

5. Instance ovn NIC start procedure

When LXD starts an instance with an ovn NIC configured with acceleration=dpu, it will consult the NIC’s ovn network’s settings and look for the parent.dpu PF for that cluster member.

If none is found then acceleration cannot be used, so the instance should fail to start.

If a matching PF is found, LXD will need to check whether there is a free VF, and if not, try to activate VFs by modifying /sys/class/net/<PF_Interface>/device/sriov_numvfs as it does today for sriov mode.
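
As a rough illustration of the sysfs interaction involved (the PF name reuses the earlier example and the VF count is arbitrary):

cat /sys/class/net/enp130s0f0np0/device/sriov_totalvfs   # maximum VFs supported by the PF
cat /sys/class/net/enp130s0f0np0/device/sriov_numvfs     # VFs currently active
echo 4 > /sys/class/net/enp130s0f0np0/device/sriov_numvfs  # activate 4 VFs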

LXD will then also need to instruct OVN to schedule the logical switch port on the associated DPU card and connect the VF’s representor port to the integration bridge (br-int) on the DPU card.

This can be done as follows:

  1. On the LXD host, parse /sys/class/net/<PF interface>/device/uevent and extract the PCI_SLOT_NAME setting.
  2. Get the DPU card’s serial number using lspci -s <PCI_SLOT_NAME> -vv (see the sketch after this list).
    I’ve not found a good way to extract this in machine-readable format.
    But we are looking to get the [SN] Serial number value from the Capabilities: [48] Vital Product Data section.
    This data is also available in /sys/class/net/<PF interface>/device/vpd but would need to be decoded somehow. See hexdump -C /sys/class/net/<PF_Interface>/device/vpd.
  3. Consult the OVN southbound database’s chassis table to find the matching ovn-controller chassis running on the DPU card. E.g.
ovn-sbctl find chassis \
    external_ids:ovn-cms-options="card_serial_number\=<DPU_SERIAL_NUMBER>"
  4. LXD will then need to get the PF interface’s MAC address.
  5. The logical switch port then needs to be configured by LXD as follows:
sudo ovn-nbctl set logical_switch_port \
    <LOGICAL_SWITCH_PORT_NAME> \
    options:requested-chassis="<OVN_DPU_CHASSIS_NAME>" \
    options:"vif-plug\:representor\:pf-mac"="<PF_MAC_ADDRESS>" \
    options:"vif-plug\:representor\:vf-num"=<VF_NUMBER> \
    options:"vif-plug-type"=representor \
    options:"vif-plug-mtu-request"=1500

If all is well, the VF interface on the host will now be connected to the OVN logical switch port: the VF’s representor port on the DPU card is connected to the OVN integration bridge on the DPU card, and the logical switch port is scheduled on the DPU’s ovn-controller chassis.

At this point the VF interface on the host can be passed into the instance as usual.

Uplink network connectivity

The above gets instances connected to the OVN network using a DPU.
However, it does not add support for using a DPU card’s physical port to provide external uplink connectivity for ovn networks. Nor does it allow an ovn network to be created, because LXD requires a functional uplink network to exist on the LXD host.

To support this, the proposal is to add a new network type called dpu that can be used as an ovn network uplink. This new dpu network type would be very basic and would only contain IP settings (similar to the physical network type today, but without the parent or vlan settings).

It would require the operator to configure an Open vSwitch bridge on the DPU card, connect physical port(s) to it, and then set up the mapping from the uplink provider name to that bridge:

ovs-vsctl set open_vswitch . \
    external-ids:ovn-bridge-mappings=<uplinkNetName>:<bridgeName>
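
For illustration only (the dpu network type, its option names, and the addresses below are all proposals or placeholders), creating the uplink and an ovn network on top of it might look like:

lxc network create <uplinkNetName> --type=dpu \
    ipv4.gateway=192.0.2.1/24 \
    ipv4.ovn.ranges=192.0.2.100-192.0.2.200
lxc network create ovn1 --type=ovn network=<uplinkNetName>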

API changes

An API extension will be added called ovn_nic_acceleration_dpu to indicate support for the new network.ovn.southbound_connection server setting, the new dpu ovn NIC acceleration mode, and the new dpu network type.

CLI changes

None

Database changes

None

Upgrade handling

None

Further information

Diagrams reproduced and modified with the permission of @fnordahl from a presentation given at Open Infrastructure Summit.