Dell PowerFlex storage driver

Project LXD
Status Implemented
Author(s) @jpelizaeus
Approver(s) @tomp
Release 5.21.0
Internal ID LX043

Abstract

To add a Dell PowerFlex storage pool driver (powerflex) to LXD that will interact with the PowerFlex API(s) in order to manage storage volumes on the Dell storage platform.

Rationale

Dell and Canonical are carrying out various enablement activities as part of an ongoing partnership. The latest of them adds support for LXD to interface directly with Dell's PowerFlex service so that LXD instances can run on that platform. This would offer an alternative remote storage option for enterprise use cases where the currently supported storage drivers may not be sufficient.

Due to its design, PowerFlex will be another LXD storage driver offering remote storage capabilities similar to the already existing implementation for Ceph RBD. Unlike the other remote drivers, not all of its features will fall under the category of optimized storage. See the limitations section for more granular details.

Specification

Design

LXD will grow support for PowerFlex by offering a new storage pool driver called powerflex. To prevent importing proprietary software from the vendor into the LXD snap package, the driver uses the tooling from the underlying host if necessary.

Two different modes (NVMe/TCP and SDC) are planned for the driver to allow the selection of the most fitting technology when connecting to the various PowerFlex services. The design covers the specifics regarding required tooling and lifecycle management of the storage driver. The mode is selected during the creation of the driver through the respective configuration keys. This specification and the first release of the driver focus on the NVMe/TCP mode only.

The storage driver expects that the user has already set up a PowerFlex protection domain and storage pool. The configuration of those does not fit into the LXD storage interface. See the official resources from Dell on how to set up both the protection domain and the storage pool: https://www.dell.com/support/manuals/de-de/scaleio/flex-software-to-45x/storage-definitions?guid=guid-453eaecb-558e-48df-b277-ec73c8841d12&lang=en-us.

The hosts on which the PowerFlex storage driver is used need to have additional software installed. In case of NVMe/TCP two additional kernel modules are required. See the official resources from Dell on how to install those: https://www.dell.com/support/manuals/de-de/scaleio/powerflex_install_upgrade_guide_4.5.x/configure-nvme-initiators-on-hosts-for-linux-based-systems?guid=guid-d776b85a-7e95-4038-aeef-08cab27decfd&lang=en-us.

In the first release of the LXD storage driver the supported version of Dell PowerFlex is 4.X.

Pool Creation

The hierarchy of entities in PowerFlex will be used to auto-discover the necessary resources based on the chosen mode. Using only the PowerFlex API, LXD can look up all the information needed to configure both modes on the underlying host. A user has to provide at least the following parameters in order to create a new PowerFlex storage pool:

lxc storage create <name> powerflex
    powerflex.gateway=<address>
    powerflex.user.name=<user>
    powerflex.user.password=<password>
    (powerflex.pool=<poolID> | powerflex.pool=<poolName>
    powerflex.domain=<protectionDomain>)
    [powerflex.mode=<nvme>]
    [powerflex.sdt=<sdt>]
    [powerflex.gateway.verify=<true|false>]
    [powerflex.clone_copy=<true|false>]
    [volume.size=<size>]

The name relates to a user defined string for the specific storage pool. The gateway, user and password are the mandatory configuration settings. They allow LXD to discover the remainder of the configuration in the background.

If only powerflex.pool is specified without a domain, LXD interprets the value as a PowerFlex storage pool ID. Given the unique ID of the storage pool, the parent protection domain can be easily discovered. For usability reasons, both powerflex.pool and powerflex.domain can be set instead if names are preferred. LXD will then search the protection domain for a pool with the provided name.

LXD can discover available modes by looking up the requirements for the underlying system. A user can override this behavior by explicitly setting the desired mode. This spec covers the NVMe/TCP mode only since SDC won’t be supported in the first release. The discovery shall run in the following order to prefer the NVMe/TCP option if the tooling for both is present on the system:

  • NVMe/TCP: Check if both the nvme_fabrics and nvme_tcp kernel modules are present. If yes, select this mode.
  • SDC: Check if the /bin/emc/scaleio directory exists and if it contains both the drv_cfg binary and scini.ko kernel module. If yes, select this mode.
  • Default: Report an error
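
The discovery order above can be sketched as follows. This is only an illustration, not the driver's implementation: the function names are hypothetical, and checking /sys/module for the kernel modules is an assumption (the spec only says the modules must be "present"). The checks are injected as callables so that the ordering can be exercised without touching the host.

```python
import os
from typing import Callable

# Path taken from the discovery list above.
SDC_DIR = "/bin/emc/scaleio"


def kernel_module_present(name: str) -> bool:
    # Assumption: a loaded module is visible as a directory under /sys/module.
    return os.path.isdir(os.path.join("/sys/module", name))


def discover_mode(
    module_present: Callable[[str], bool] = kernel_module_present,
    path_exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Return "nvme" or "sdc", preferring NVMe/TCP when both are usable."""
    # 1. NVMe/TCP: both kernel modules have to be present.
    if module_present("nvme_fabrics") and module_present("nvme_tcp"):
        return "nvme"

    # 2. SDC: the directory has to contain both the binary and the module.
    if all(path_exists(os.path.join(SDC_DIR, f)) for f in ("drv_cfg", "scini.ko")):
        return "sdc"

    # 3. Default: report an error.
    raise RuntimeError("no supported PowerFlex mode found on this host")
```

Calling discover_mode() without arguments runs the checks against the local system.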

Having the unique ID of the storage pool and protection domain, LXD can now look up the mode-specific addresses in order to configure the storage pool on the LXD host.
When the nvme mode is selected, LXD has to look up the IP of at least one PowerFlex SDT service. Since the protection domain contains a list of all the SDT services, a single call to the PowerFlex API resolves this information. Once a connection to one SDT is established, the nvme kernel module takes care of connecting to the other SDTs as well.

The underlying tooling (regardless of the selected mode) doesn’t need to be configured as part of the storage pool creation. As soon as the first volume gets created, the current host needs to be configured in order to talk with the PowerFlex systems upstream.

Volume Creation

When creating a new volume, it’s mandatory to first check its size. PowerFlex only accepts sizes that are multiples of 8 GiB. Since the default size of a storage volume in LXD is 10 GiB, this value needs to be changed automatically if nothing else is specified. When creating a new storage pool without setting the size configuration key (or volume.size from the pool’s config), it gets set to 8 GiB.
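
As a small illustration of the size constraint, the sketch below aligns a requested size to the 8 GiB granularity. Rounding up (rather than rejecting the request) is an assumption made for the example, and the function name is hypothetical.

```python
GIB = 1024 ** 3
POWERFLEX_STEP = 8 * GIB  # sizes have to be multiples of 8 GiB


def align_to_powerflex_size(size_bytes: int) -> int:
    """Round a requested size up to the next multiple of 8 GiB."""
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    # Integer ceiling division avoids floating point rounding issues.
    steps = (size_bytes + POWERFLEX_STEP - 1) // POWERFLEX_STEP
    return steps * POWERFLEX_STEP
```

Under this assumption, LXD's usual 10 GiB default would end up as 16 GiB, while a request of exactly 8 GiB stays unchanged.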

lxc storage volume create <poolName> <volumeName>
    [size=<size>]
    [block.type=<thin|thick>]

The interface for creating any type of volume is equivalent to the other storage drivers. LXD uses the API to create the new volume in PowerFlex. Before being able to mount the volume, LXD has to configure the NVMe/TCP subsystem. This happens based on the discovered information which is stored as part of the storage pool. In the next step LXD will ensure that there is a host entry in PowerFlex for the current host. Now a mapping can be established to the volume in PowerFlex. Both the NVMe/TCP and SDC modes will discover this mapping and add the volume as a new drive to the current system. LXD has to wait for this by continuously checking the list of available devices on the system in a loop. As soon as the disk is there, the regular processes in LXD apply for volume handling.

For the NVMe/TCP mode, LXD checks /dev/disk/by-id/* and looks for any disk that matches the nvme-eui.* pattern. Since disk partitions also get an entry under this directory, the ones ending in -partX need to be excluded to mount the actual parent disk.
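
The filtering described above can be sketched like this. The function name is hypothetical and the sketch only covers the name matching; how a specific PowerFlex volume is tied to one of the resulting entries is not shown here.

```python
import fnmatch
import re

# Partition entries end in -part followed by the partition number.
PARTITION_SUFFIX = re.compile(r"-part\d+$")


def nvme_parent_disks(entries):
    """Keep nvme-eui.* entries that are whole disks, not partitions."""
    return sorted(
        name
        for name in entries
        if fnmatch.fnmatchcase(name, "nvme-eui.*")
        and not PARTITION_SUFFIX.search(name)
    )
```

On a real host the input would be the directory listing, for example nvme_parent_disks(os.listdir("/dev/disk/by-id")).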

In PowerFlex, snapshots are fully usable volumes. Therefore the logic for volumes also applies to snapshots. The only difference is their name and the relation in the PowerFlex VTree.

By default all new volumes will be thin provisioned by LXD. A user can explicitly specify block.type=thick to override this behavior.

Volume Deletion

The deletion of a volume consists of two steps. First, the mapping between each host and the volume needs to be removed. Second, the volume can be removed on the PowerFlex side using the ONLY_ME option. This deletes only the specified volume without its children or parents. Both the NVMe/TCP and SDC modes take care of removing the actual disk from the system. In the case of NVMe/TCP this is handled by the respective kernel module.

Snapshot Restore

A restore requires a volume or snapshot to restore from and a target volume. Using the PowerFlex API, LXD can trigger an overwrite of the target volume using the contents of the snapshot. The operation happens exclusively on the PowerFlex side within the same VTree.

Volume Copy

In case of volume copy LXD has to create a new volume and copy the contents from the source to the target. This cannot be performed in PowerFlex since two different VTrees are involved. The same applies to snapshots of the source volume which also need to be copied to a snapshot in the target volume’s VTree if the volume gets copied with snapshots.

There is one exception in which LXD will create a snapshot of the source volume instead of copying over the contents manually. This is the case if powerflex.clone_copy is set to false and either the source volume doesn’t have any snapshots or the copy operation is marked to not copy any snapshots. By setting powerflex.clone_copy to true a user can override this behavior.
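
One reading of the exception above, expressed as a decision function. The function and parameter names are hypothetical, and the grouping of the conditions is an interpretation of the paragraph rather than a statement about the driver's code.

```python
def use_snapshot_for_copy(
    clone_copy: bool, source_snapshot_count: int, copy_snapshots: bool
) -> bool:
    """Return True if the copy can be served by a PowerFlex snapshot."""
    if clone_copy:
        # powerflex.clone_copy=true always forces a full copy.
        return False
    # A snapshot is only an option when no snapshots have to be carried over.
    return source_snapshot_count == 0 or not copy_snapshots
```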

Creating new instances from images will always assume powerflex.clone_copy is set to true to be able to create an arbitrary number of instances without hitting the limit of 126 snapshots within a VTree. This results in the PowerFlex storage driver not having support for the optimized image storage.

Virtual Machines

For each virtual machine LXD needs to provision two storage volumes: one for the root filesystem and one for the additional configuration drive. Other LXD storage drivers allocate around 100 MiB for the configuration drive, whereas the minimal size of a storage volume in PowerFlex is 8 GiB. A single VM therefore needs two 8 GiB volumes, which results in a minimal required storage size of 16 GiB.

Modes

NVMe/TCP

This is the preferred mode since it doesn’t interfere with proprietary software on the underlying host and requires only software to be present which can be consumed through official Ubuntu sources (nvme-cli and the respective kernel modules nvme_fabrics and nvme_tcp).

LXD requires the nvme-cli package to be installed alongside the snap so that it can configure the NVMe/TCP subsystems based on the information gathered during storage pool creation. This change was already performed with https://github.com/canonical/lxd-pkg-snap/pull/182.

Using NVMe/TCP, the host requires a UUID in order to create the mapping to a storage volume. On a normal system this UUID is stored in /etc/nvme/hostnqn after installing the nvme-cli package. Since the package is already contained in the LXD snap, it would be odd to also require the package to be present on the host itself. Therefore we make use of the LXD server’s UUID, which gets introduced alongside this driver and allows us to identify every LXD host against PowerFlex by reading the UUID from the file. This change was already introduced with https://github.com/canonical/lxd/pull/12544.

In order to connect to all the available subsystems, LXD uses a single SDT that is available for the given storage pool in PowerFlex. Using nvme connect-all -t tcp -a {SDT} all the available target systems are discovered and the host will try to establish a connection to each of them. As soon as a new mapping is created in PowerFlex between the host and the target storage volume, the volume gets added as a disk to the host.

Packaging Changes

Snap

For the NVMe/TCP mode we have to extend the snap to contain the nvme-cli apt package.

Limitations

Snapshots

Any volume can only be snapshotted 126 times due to an internal limitation in the PowerFlex VTree. This also applies to any child snapshots of any of the 126 snapshots. When creating new volumes, specify powerflex.clone_copy=true in order to not snapshot the volume but to create an actual copy which uses a different PowerFlex VTree. This copy is created by mounting the parent volume and the copy on the current host and copying over the contents from one volume to the other. Currently there is no option to perform this operation in PowerFlex directly.

Images

We cannot use the optimized image storage due to the snapshot limitation in PowerFlex. Therefore the images aren’t stored in PowerFlex but are instead copied on demand from a local copy into the instance’s root volume. This has the benefit of not having to fetch the image’s data over the network only to copy it back into the instance’s volume.

Instance Copy

There are two ways an instance can be copied when backed by the PowerFlex driver. Since PowerFlex does not support creating a standalone copy on the storage side, LXD copies the instance’s volumes on the local system. This implies that if an instance gets copied from one LXD host to another, the volume contents get transferred over the network from the source to the target instance. This results in transferring the same contents three times over the network:

  1. Get the data for the source volume from PowerFlex
  2. Copy the data from the source to the target host
  3. Write the data into the target volume in PowerFlex

Volume Size

A PowerFlex volume needs to be at least 8 GiB in size. This results in having two 8 GiB volumes for virtual machine images (and instances) since both the root and config drive need to be at least 8 GiB in size. The smallest possible virtual machine will occupy 16 GiB of storage space. However, the volumes are thin provisioned by default.

Volume Names

In PowerFlex the name of a volume cannot exceed 31 characters. Therefore LXD has to grow support for a new volatile.uuid volume config key. This key has to be passed into the storage driver so that the PowerFlex driver can generate fixed-size volume names based on this UUID. This approach is taken from the Dell PowerFlex driver for OpenStack Cinder, which creates a base64 string from the byte representation of the UUID. This string always has a fixed length of 24 characters, which leaves room for identifiers for the different volume types.

A UUID of 5a2504b0-6a6c-4849-8ee7-ddb0b674fd14 will render to the base64 encoded string WiUEsGpsSEmO592wtnT9FA==.

To make those volumes easier to identify, the following identifiers get added to the base64 encoded names:

Volume Type      Identifier  Example
Container        c_          c_WiUEsGpsSEmO592wtnT9FA==
Virtual Machine  v_          v_WiUEsGpsSEmO592wtnT9FA==.b
Image (ISO)      i_          i_WiUEsGpsSEmO592wtnT9FA==.i
Custom           u_          u_WiUEsGpsSEmO592wtnT9FA==

This naming concept also fits VMs, which require two volumes to be present on the storage system: one for the root volume and the other one for the VM’s filesystem volume. By having the .b identifier for block volumes, the same volume’s UUID can be used to represent both volumes:

  • WiUEsGpsSEmO592wtnT9FA== for the root volume
  • WiUEsGpsSEmO592wtnT9FA==.b for the filesystem volume
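
The naming scheme described in this section can be reproduced with a few lines of Python. This is a sketch for illustration only: the function name and the dictionary keys are hypothetical, while the prefixes, the suffix and the 31 character limit come from the text above.

```python
import base64
import uuid

MAX_NAME_LENGTH = 31  # PowerFlex limit on volume names

# Prefixes from the table above; the keys are illustrative labels.
PREFIXES = {
    "container": "c_",
    "virtual-machine": "v_",
    "image": "i_",
    "custom": "u_",
}


def powerflex_volume_name(volume_uuid: str, volume_type: str, suffix: str = "") -> str:
    """Build a name from the base64 encoding of the UUID's 16 raw bytes."""
    encoded = base64.b64encode(uuid.UUID(volume_uuid).bytes).decode("ascii")
    name = PREFIXES[volume_type] + encoded + suffix
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError("volume name exceeds the PowerFlex limit")
    return name
```

For the UUID used as the example in this section, powerflex_volume_name("5a2504b0-6a6c-4849-8ee7-ddb0b674fd14", "virtual-machine", ".b") yields v_WiUEsGpsSEmO592wtnT9FA==.b, which is 28 characters long.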

Daemon Changes

UUID File

For the NVMe/TCP mode every LXD host needs to be identified in PowerFlex using a unique NQN ID. This ID consists of a UUIDv4 and a special NVMe/TCP prefix. To be independent from the system, every LXD daemon has to write a server.uuid file into its VarPath() which contains a unique UUID. The storage driver can then use this UUID when creating a new host entry in PowerFlex by reading it from the daemon’s state.
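
As an illustration of how such an ID can be assembled, the sketch below combines a server UUID with the UUID-based NQN prefix defined by the NVMe specification. That the driver uses exactly this prefix is an assumption, and the function name is hypothetical.

```python
import uuid

# UUID-based NQN format from the NVMe specification (assumed to be the
# "special prefix" mentioned above).
NQN_UUID_PREFIX = "nqn.2014-08.org.nvmexpress:uuid:"


def host_nqn(server_uuid: str) -> str:
    """Return the host NQN for a given LXD server UUID."""
    # uuid.UUID validates the input and normalizes it to lowercase.
    return NQN_UUID_PREFIX + str(uuid.UUID(server_uuid))
```

In practice the input would be the content of the server.uuid file rather than a literal string.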

The UUID gets exported as an additional field of the daemon so that downstream code can easily access it without needing to load the actual file from storage. The change got introduced with https://github.com/canonical/lxd/pull/12544.

API Changes

No API changes expected.

CLI Changes

Driver Configuration

Hide the PowerFlex gateway password when viewing the configuration of the storage pool using lxc storage show {pool} or lxc storage get {pool} powerflex.user.password.

Database Changes

No database changes expected.

The driver itself gets introduced to LXD with https://github.com/canonical/lxd/pull/12304.

Hi Julian,

Can I ask whether this has been tested with MicroCloud? And could it be used as an alternative storage backend to Ceph?

Thanks,

Ruairi

MicroCloud uses LXD under the hood, so it would be compatible for use with MicroCloud instances instead of Ceph.

The spec got updated to the latest state of using the volume’s volatile.uuid for PowerFlex volume names.

We might consider reverting https://github.com/canonical/lxd/pull/12568 as it is no longer required by the PowerFlex driver due to its name generation from the volume’s UUID.

Let’s leave it as is in case we do need it with PF in the future. It works fine with other drivers too.