Improve the VM live migration process

gabrielmougard · January 25, 2024, 2:05pm

Project	LXD
Status	Rejected
Author(s)	@gabrielmougard
Approver(s)	@tomp
Release
Internal ID	LX066

Improve the VM live migration process

Disclaimer: due to the lack of LaTeX support in this forum, I attached the paper version resuming this discussion here. I recommend reading it first as it explains the mathematical intuitions behind the proposed strategy.

Therefore, each time I will mention a reference to the paper (because some formulas are involved) in this Discourse discussion, I will annotate the word with a “^{* (paper section ref)}”

Thanks for the understanding. Happy reading!

Abstract

In this study, we’ll aim at improving the VM live migration from a user experience perspective. We’ll particularly focus on the live migration of VM with non-shared storage (i.e, the source VM and the destination VM do not share the same storage through a NFS technology like Ceph for example)

In order to enable the live migration for a VM, the user must specify a size.state parameter for the instance root disk. This parameter is used for two things:

In case of a stateful stop of the instance during the migration process, the instance need to save its current memory pages (plus CPU state) to the disk in order to be able to resume its execution after a stateful start. This state dump is saved in the root disk config volume, thus justifying the size.state parameter being at least equal to the amount of memory allocated to the instance in order for the root disk config volume to be able to store the state dump.
Why do we need ‘at least’ the VM memory size and not an ‘equal’ amount for our size.state parameter? Well, the stored state of the VM also contain other data. Indeed, during the live migration process, we let QEMU mirror the storage device live-writes to a copy-on-write file: a .qcow2 file. This file is written on the root disk config volume.

Rationale

Manually setting this VM live migration parameter for each VM is a tedious and error-prone task. Instead, we want to eliminate this need for the user to enter the parameter and introduce a management runtime that handle this state management in an adaptive way. This could be paving the road toward more ‘intelligent’ (by intelligent, you can understand ‘adaptive’) migration techniques (network-aware, disk IOPS-aware and even “history”-aware in the sense that some statistical models show faster migration time when we have former migration experience data points).

The following of this study will propose a strategy describing how we could get rid of the size.state parameter and make the live migration process more adaptive and user-friendly. In a second study, we’ll cover how we can get rid of the migration.stateful parameter so that the user does not have to specify any parameter at all to enable the live migration of its VM.

Specification

^{* (see the “Strategy” section to have the formatting)}

Let’s have a new shared volume (the term “buffer” will be used interchangeably but it designate the same thing) managed by LXD with a fixed size. This volume will be hidden. The size of this volume is LXD specific and could be a modifiable server parameter but will come with a chosen default value (e.g, 10GiB). The purpose of this shared volume is to act as a temporary buffer in order to store the .qcow2 files of the running VM live migration processes (we could name this volume migration-cow-buffer. This volume could be backed by ZFS by default).

I don’t think there would be a need to store any VM memory dumps in this volume because each memory dump has a determined upper bound for its size (unlike the live-writes .qcow2 file), so it is easy to resize the VM root disk at the pre-migration phase to make sure the root disk config volume is big enough to store the memory dump and the CPU state. We’d just need to resize the root disk on the target node to its original size after the migration process is done.

The volume would be partitioned into multiple sub-volumes; a partitioning function will be involved for that. To be clear, this “partitioning” logic is only conceptual. This is an algorithm to track the segmentation of the concurrent migrations. There is no physical disk partitioning operations involved. We’ll come back to this partitioning function a bit later; for now, let’s assume it exist. There would be a 1:1 mapping between the sub-volumes and the VM live migration processes so that each partition holds one .qcow2 file.

Each partition tied to a migration would have an associated throttle rate. The throttle rate is a variable limit that will be applied to a QEMU storage device through QMP (we can also use the -drive option with throttling.* parameters to set IOPS limits on the disk. For example: -drive file=/path/to/disk.img,throttling.bps-write=10485760,... would limit the write bandwidth to 10 MB/s.)

In the computation of this throttle rate, let’s point out that the current speed of the consuming link (the link between the source and the destination nodes) is also used to that it is network aware.

Objective: Our goal is to update the throttle rates (in the following pictures, these are the “gamma” functions) of each VM live migration process so that all the .qcow2 files contained in the partitions of the global shared volume keep a decent IOPS without the total size of the sharedd volume being entirely filled up.

Partitioning logic

^{* (see the “Pi : The partitioning function” section to have the formatting)}

We need to design an ideal partitioning function for dynamically managing sub-spaces (or shortly put, ‘bins’. We’ll use these two terms interchangeably) within our shared volume for .qcow2 files during multiple live migrations. This function needs to balance the requirements of each migration while ensuring the total fixed size of the shared volume is not exceeded. Let’s define a few principles and then propose a partitioning strategy.

Principles

Dynamic allocation: the partitioning function should allocate sub-spaces based on current requirements and available space, adjusting as migrations start and end.
Fairness: each migration process should get a fair share of the buffer, ideally based on its needs and the total available space.
Buffer limitation: the sum of all partitions should never exceed the size of the shared buffer.
Non-reduction of allocated space: once a space is allocated to a .qcow2 file within a partition, it can’t be reduced (this will lead to data loss / corruption).

Proposed function

Let N be the number of ongoing migrations at time t. The function can be defined as follows:

Starting a new migration - initial allocation:

When a new migration starts, allocate an initial partition size based on the available space and the number of ongoing migrations.

Completed migration - reclaiming space:

When a migration completes (one of the ‘bin’ on the diagram above is no longer needed because its underlying .qcow2 has been removed), the other ‘bins’ can reclaim the space allocated to the completed migration. The reclaimed space is then redistributed to the remaining ongoing migrations. The redistribution is done by increasing the size of each remaining ‘bin’ by the same amount. The amount is calculated by dividing the reclaimed space by the number of ongoing migrations:

‘Bins’ overlapping

Now, you are probably wondering what’s happening in the case where a new migration causes a bin to be full or to be less than the associated size of the CoW file (this is because a bin’s size is divided when a new migration process occurs). Or, what’s happening if a migration is hogging its bin and the other migrations are starving or just simply haven’t filled up their respective bins?

In such a situation, we won’t just give up and mark migration 0 as failed: instead we’ll “shard the extra data on the other sub-spaces”. ^{* (see the “3. ‘Bins’ overlapping” section in the paper to get the explanation)}

Then the bin will be marked as full and no underlying data will be grown inside it (until a migration process finishes and some space is reclaimed).

In case of “sharded” migrations, we’ll introduce the notion of priority so that this migration has a penalized throttle rate compared to the one living in its original bin. ^{* (see the “the throttle rate (sharded behavior)” section in the paper to get the explanation)}

the throttle rate (non-sharded behavior)

For a non-sharded migration, we propose the following heuristic to adjust the throttle rate ^{* (see the “the throttle rate (non-sharded behavior)” to see the formula}.

Let’s explain the terms and the underlying idea behind this heuristic alongside the following picture:

The red curve is modeling the left part of the min(.) operator and the blue curve is modeling the right part.

The blue curve is a simple, yet in practice, close to the reality model of a network bandwidth (a transitive state followed by a steady state) evolution divided by N which is constant most of the time (it is a ‘stair’ function to be exact).
The red curve show the evolution of the throttling rate with the size of the CoW file in the x-axis. At first, the throttling rate is around the value of the maximum partition size, which is, in this case, above the allocated network bandwidth U / N allocated for this bin. Then we use the value of the blue curve to adjust the throttling rate. The more the bin is filling up, the closest the red curve get to the blue curve and eventually, exponential throttling rate will be applied without reaching 0 MB/S because of the nominal throttle rate constant epsilon = 0.01 (in MB/s).

So if the network bandwidth is large, the throttling rate will very quickly follow the red curve law and will start to decrease in a somewhat linear way until the late exponential behavior appears at around 60% of the bin size (auto-convergence behavior).

If the network bandwidth is small, we won’t be able to transfer the .qcow2 data at a fast pace anyway so the throttle rate will be in phase with the blue curve.

The interesting fact about that heuristic is that this behavior happens for whatever the size of the bin (in the above picture, the bin size for this migration is 100MB) because of the exponential term.

the throttle rate (sharded behavior)

As explained in the partitioning function section, we can have a migration process that is ‘sharded’ across multiple bins. We also spoke about some priorities that some migrations have over others in the case of a bin being ‘sharded’. Let’s now explain how these priorities are expressed and explain how we can adjust the throttle rate of a migration process that is ‘sharded’ across multiple bins.

The chosen heuristic is surprisingly simple, because it is based on the same principle as the non-sharded behavior though it contains a simple addition.

Let’s picture this situation:

We have 4 concurrent migration. In this scenario, because each migration evolved differently, this translated to having 4 bins, among 2 of them are frozen.

Migration 0 is sharded across 3 bins: its ‘main’ bin b_0 (all the migrations have at least one shard) and on b_1 and b_3.
Migration 2 is also sharded across 3 bins: its ‘main’ bin b_2 and on b_1 and b_3.

You notice that the order of the shards is not the same for both migrations: In b_1, the green shard is on top of the orange one and on b_3, the orange shard is on top of the green one. Each time a shard is added to a bin, it is added on top of the other shards. The bottom shard is the one with the highest priority and then the order of priority is from bottom to top.

^{* (see the “the throttle rate (sharded behavior)” to see the formula to compute the priorities and how a throttle rate is modified when it’s penalized by this priority}

Here is what a penalized throttle rate looks like when it is sharded with its computed mean priority:

The black curve is the ‘penalized’ version of the red curve (the non-sharded behavior). We can clearly see that the effect on the throttling rate is really noticeable for this mean priority (equal to 2. ^{* (see the paper to understand how it is computed)}
) after around 40% of the capacity of the sharded bins (the non-frozen ones), the throttling rate has been halved compared to the non-sharded behavior, letting an other migration whose the bin is the main one, more room to fill it up.

Conclusion on the heuristic

Through the monitoring of the bytes being written on the .qcow2 files of concurrent live VM migrations and a live measurement of the network bandwidth between the source and its destination nodes, we can adjust the IOPS throttle rate of a VM’s disk devices. This buffer technology is network aware and adaptive to the host fixed resources while giving a fair share of live-write IOPS per concurrent migration.

It’ll allow a LXD server to manage a centralized, hidden volume of a fixed size (decided during the lxd init phase, or set by default to 10GiB with a possibility to be increased / shrink by an admin) and to allow this scheduling logic to automatically throttle any number of concurrent live migrations without the user having to specify any size.state parameter.

Now, this was the theory.

Let’s see what we’d need to introduce more practically speaking to enable that buffer management technology.

Daemon changes

The first thing is to create a new volume on the LXD management pool to allocate this fixed size volume (e.g, migration-cow-buffer)
We’ll also need a connection tracking system to evaluate the speed of the consuming link between the source and the target node during the migration, in order to make our throttle rates network aware.
We’ll need a filesystem tracker to actively monitor the size of the CoW files.
Because the .qcow2 files won’t be stored on the instance config volume anymore, we’ll need an other way to send the .qcow2 bytes over the network because we won’t be able to use QEMU’s NBD server anymore for that purpose.

CLI changes

This will need to introduce a new configuration step in lxd init (with a default value if the --auto flag is set) in order to ask the user how much space will be allocated for this volume.
The second change, and the most notable one, is the removal of the size.state option parameter.

API changes

None.