NVIDIA GPUDirect over Infiniband Migration Paths

Starting with the 6.8 kernel, Ubuntu is changing its support for NVIDIA GPUDirect over Infiniband. Recent Linux kernels support the “dma-buf” API, which provides a native interface for sharing memory between GPUs and Infiniband devices. Ubuntu is deprecating the legacy “nvidia-peermem” interface and recommends that users transition to the “dma-buf” interface where possible. The “dma-buf” interface is not available to users of Volta and earlier architectures, but other options remain available.

Users of NVIDIA GPUs based on newer architectures such as Ampere and Hopper will continue to be supported, but may need to update their environment. This change affects users of the Ubuntu 22.04 HWE kernel, who will be upgraded from a 6.5 kernel to a 6.8 kernel with the release of Ubuntu 22.04.5 LTS, as well as users of Ubuntu 24.04 LTS.

This change does not impact users of the Ubuntu 22.04 LTS kernel based on Linux 5.15 or users of the Ubuntu 20.04 LTS kernel based on Linux 5.4.
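
As a rough way to check whether a given host falls in the affected range, the sketch below parses the running kernel release and looks for the legacy module in /proc/modules. It assumes the legacy interface shows up there as nvidia_peermem (the kernel reports module names with underscores). The function names are illustrative and not part of any Ubuntu tooling. It looks only at the version number, so it will also flag the NVIDIA optimized kernel, which retains support.

```python
import platform
import re
from pathlib import Path


def kernel_major_minor(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.8.0-40-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_in_affected_range(release: str) -> bool:
    """True for kernel 6.8 or newer. Does not look at the kernel flavour."""
    return kernel_major_minor(release) >= (6, 8)


def legacy_peermem_loaded(proc_modules_text: str) -> bool:
    """True if a module named nvidia_peermem (assumed name) is listed.

    Expects the contents of /proc/modules, where the first field of each
    line is the module name.
    """
    names = {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}
    return "nvidia_peermem" in names


if __name__ == "__main__":
    release = platform.release()
    modules = Path("/proc/modules")
    text = modules.read_text() if modules.exists() else ""
    print(f"kernel {release}: 6.8 or newer = {kernel_in_affected_range(release)}")
    print(f"nvidia_peermem loaded = {legacy_peermem_loaded(text)}")
```

A host that reports a 6.8 or newer kernel and has the legacy module loaded on its current kernel is the case that most likely needs one of the migration paths described below.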

Options for users of Volta and earlier NVIDIA GPU architectures

Users of Volta and earlier NVIDIA GPU architectures who wish to use NVIDIA GPUDirect over Infiniband with a 6.8 or newer kernel have the following options:

  • Migrate to the NVIDIA optimized kernel. The NVIDIA optimized kernel, provided by the linux-nvidia and linux-nvidia-hwe-22.04 packages, retains support for NVIDIA GPUDirect over Infiniband.
  • Install the Mellanox OFED stack from NVIDIA. The Mellanox OFED stack applied to the Ubuntu 6.8 kernel will retain support for NVIDIA GPUDirect. However, Canonical does not provide support for the Mellanox OFED stack.

Options for users of newer NVIDIA GPU architectures

Users of newer NVIDIA GPU architectures such as Ampere and Hopper who wish to use NVIDIA GPUDirect over Infiniband with a 6.8 or newer kernel have the following options:

  • Update your dependency stack to versions that support the “dma-buf” interface. This requires the following software versions:
    • CUDA toolkit >= 11.7
    • GPU driver branch >= 515. This must be the “open” variant of the GPU driver, which only supports later NVIDIA GPU architectures. Ubuntu provides signed “open” variants of the GPU drivers for the generic kernel in the “linux-modules-nvidia--server-open-” packages.
    • Machine learning applications must use NCCL >= 2.13.4.
    • HPC applications must use UCX >= 1.15.0.
  • All of the options in the section above for users of Volta and earlier NVIDIA GPU architectures are also available to users of later NVIDIA GPU architectures.
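
The minimum versions listed above can be compared against an installed stack with a small helper like the one below. This is only a sketch: the dictionary keys and function names are made up for illustration, version strings are assumed to be plain dotted numbers, and how you obtain the installed versions is left to your environment. It also cannot tell whether the installed driver is the “open” variant, which still needs to be checked separately.

```python
# Minimum versions from the list above for the "dma-buf" path.
# The keys are illustrative labels, not package names.
MINIMUMS = {
    "cuda": "11.7",
    "driver": "515",
    "nccl": "2.13.4",  # machine learning applications
    "ucx": "1.15.0",   # HPC applications
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted numeric version such as '2.13.4' into (2, 13, 4)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted versions, padding the shorter one with zeros."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def unmet_requirements(installed: dict[str, str]) -> list[str]:
    """Return the components in `installed` that are below the minimum.

    Components missing from `installed` are skipped, since not every
    application uses both NCCL and UCX.
    """
    return [
        name
        for name, minimum in MINIMUMS.items()
        if name in installed and not meets_minimum(installed[name], minimum)
    ]
```

For example, `unmet_requirements({"cuda": "11.6", "driver": "510.47", "ucx": "1.14.1"})` returns `["cuda", "driver", "ucx"]`, while a stack that meets every listed minimum returns an empty list.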
