Project | LXD |
---|---|
Status | Rejected |
Author(s) | @gabrielmougard |
Approver(s) | @tomp |
Release | |
Internal ID | LX066 |
Improve the VM live migration process
Disclaimer: due to the lack of LaTeX support in this forum, I attached the paper version summarising this discussion here. I recommend reading it first as it explains the mathematical intuitions behind the proposed strategy.
Therefore, each time I mention a reference to the paper (because some formulas are involved) in this Discourse discussion, I will annotate the word with a "* (paper section ref)".
Thanks for your understanding. Happy reading!
Abstract
In this study, we'll aim at improving VM live migration from a user experience perspective. We'll particularly focus on the live migration of VMs with non-shared storage (i.e., the source VM and the destination VM do not share the same storage through a network storage technology like Ceph, for example).
In order to enable live migration for a VM, the user must specify a `size.state` parameter for the instance root disk. This parameter is used for two things:

- In case of a stateful `stop` of the instance during the migration process, the instance needs to save its current memory pages (plus CPU state) to the disk in order to be able to resume its execution after a stateful `start`. This state dump is saved in the root disk config volume, which is why the `size.state` parameter must be at least equal to the amount of memory allocated to the instance, so that the root disk config volume is able to store the state dump.
- Why do we need "at least" the VM memory size and not an "equal" amount for our `size.state` parameter? Well, the stored state of the VM also contains other data. Indeed, during the live migration process, we let QEMU mirror the storage device live-writes to a copy-on-write file: a `.qcow2` file. This file is written to the root disk config volume.
Rationale
Manually setting this VM live migration parameter for each VM is a tedious and error-prone task. Instead, we want to eliminate the need for the user to enter the parameter and introduce a management runtime that handles this state management in an adaptive way. This could pave the road toward more "intelligent" (by intelligent, you can understand "adaptive") migration techniques (network-aware, disk IOPS-aware and even "history"-aware, in the sense that some statistical models show faster migration times when we have former migration experience data points).
The rest of this study proposes a strategy describing how we could get rid of the `size.state` parameter and make the live migration process more adaptive and user-friendly. In a second study, we'll cover how we can get rid of the `migration.stateful` parameter so that the user does not have to specify any parameter at all to enable the live migration of their VM.
Specification
* (see the "Strategy" section of the paper for the formal notation)
Let's have a new shared volume (the term "buffer" will be used interchangeably; it designates the same thing) managed by LXD with a fixed size. This volume will be hidden. The size of this volume is LXD specific and could be a modifiable server parameter, but will come with a chosen default value (e.g., `10GiB`). The purpose of this shared volume is to act as a temporary buffer to store the `.qcow2` files of the running VM live migration processes (we could name this volume `migration-cow-buffer`; it could be backed by ZFS by default).
I don't think there would be a need to store any VM memory dumps in this volume, because each memory dump has a determined upper bound for its size (unlike the live-writes `.qcow2` file), so it is easy to resize the VM root disk at the pre-migration phase to make sure the root disk config volume is big enough to store the memory dump and the CPU state. We'd just need to resize the root disk on the target node back to its original size after the migration process is done.
The volume would be partitioned into multiple sub-volumes; a partitioning function will be involved for that. To be clear, this "partitioning" logic is only conceptual: it is an algorithm to track the segmentation of the concurrent migrations, with no physical disk partitioning operations involved. We'll come back to this partitioning function a bit later; for now, let's assume it exists. There would be a 1:1 mapping between the sub-volumes and the VM live migration processes, so that each partition holds one `.qcow2` file.
Each partition tied to a migration would have an associated throttle rate. The throttle rate is a variable limit that will be applied to a QEMU storage device through QMP (we can also use the `-drive` option with `throttling.*` parameters to set IOPS limits on the disk; for example, `-drive file=/path/to/disk.img,throttling.bps-write=10485760,...` would limit the write bandwidth to 10 MB/s).
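To make this concrete, here is a minimal Go sketch (not LXD's actual implementation) of how such a limit could be updated at runtime through QMP's `block_set_io_throttle` command. The socket path and the device id `lxd_root` are placeholders, and QEMU's responses are not parsed, to keep the sketch short:

```go
// Hypothetical sketch: update the write throttle of a QEMU block device at
// runtime via QMP. Socket path and device id are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

type qmpCommand struct {
	Execute   string                 `json:"execute"`
	Arguments map[string]interface{} `json:"arguments,omitempty"`
}

// setWriteThrottle sends block_set_io_throttle with only a write bandwidth
// limit; the other members are mandatory in QMP, so they are sent as 0 (no limit).
func setWriteThrottle(conn net.Conn, deviceID string, bytesPerSec int64) error {
	return json.NewEncoder(conn).Encode(qmpCommand{
		Execute: "block_set_io_throttle",
		Arguments: map[string]interface{}{
			"id":      deviceID,
			"bps":     0,
			"bps_rd":  0,
			"bps_wr":  bytesPerSec, // the adaptive limit computed by the scheduler
			"iops":    0,
			"iops_rd": 0,
			"iops_wr": 0,
		},
	})
}

func main() {
	conn, err := net.Dial("unix", "/run/qemu/vm0.qmp") // placeholder QMP socket
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// QMP requires the qmp_capabilities handshake before any other command
	// (responses are ignored here for brevity).
	if err := json.NewEncoder(conn).Encode(qmpCommand{Execute: "qmp_capabilities"}); err != nil {
		panic(err)
	}

	// Equivalent of the -drive throttling.bps-write example above: 10 MB/s.
	if err := setWriteThrottle(conn, "lxd_root", 10*1024*1024); err != nil {
		panic(err)
	}
	fmt.Println("throttle updated")
}
```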
In the computation of this throttle rate, the current speed of the consuming link (the link between the source and the destination nodes) is also used, so that the throttle rate is network aware.
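One way this link speed could be measured is sketched below: a byte-counting wrapper around the migration connection's writer, sampled over a window. The names and the one-second sampling window are illustrative assumptions, not a defined part of this spec:

```go
// Illustrative sketch: estimate the consuming link's speed by counting the
// bytes written to the migration stream over a sampling window.
package main

import (
	"fmt"
	"io"
	"sync/atomic"
	"time"
)

// meteredWriter wraps the writer of the migration connection and counts bytes.
type meteredWriter struct {
	w     io.Writer
	total atomic.Int64
}

func (m *meteredWriter) Write(p []byte) (int, error) {
	n, err := m.w.Write(p)
	m.total.Add(int64(n))
	return n, err
}

// bandwidth samples the byte counter twice, `interval` apart, and returns an
// estimate of the link speed in bytes per second.
func (m *meteredWriter) bandwidth(interval time.Duration) float64 {
	before := m.total.Load()
	time.Sleep(interval)
	after := m.total.Load()
	return float64(after-before) / interval.Seconds()
}

func main() {
	mw := &meteredWriter{w: io.Discard} // stand-in for the real migration connection
	go func() {
		buf := make([]byte, 1<<20)
		for {
			mw.Write(buf) // simulate migration traffic
		}
	}()
	fmt.Printf("link speed: %.1f MB/s\n", mw.bandwidth(time.Second)/(1<<20))
}
```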
Objective: our goal is to update the throttle rates (in the following pictures, these are the "gamma" functions) of each VM live migration process so that all the `.qcow2` files contained in the partitions of the global shared volume keep a decent IOPS without the total size of the shared volume being entirely filled up.
Partitioning logic
* (see the "Pi: The partitioning function" section of the paper for the formal notation)
We need to design an ideal partitioning function for dynamically managing sub-spaces (or, shortly put, "bins"; we'll use these two terms interchangeably) within our shared volume for `.qcow2` files during multiple live migrations. This function needs to balance the requirements of each migration while ensuring the total fixed size of the shared volume is not exceeded. Let's define a few principles and then propose a partitioning strategy.
Principles
- Dynamic allocation: the partitioning function should allocate sub-spaces based on current requirements and available space, adjusting as migrations start and end.
- Fairness: each migration process should get a fair share of the buffer, ideally based on its needs and the total available space.
- Buffer limitation: the sum of all partitions should never exceed the size of the shared buffer.
- Non-reduction of allocated space: once space is allocated to a `.qcow2` file within a partition, it can't be reduced (this would lead to data loss / corruption).
Proposed function
Let N be the number of ongoing migrations at time t. The function can be defined as follows:
- Starting a new migration - initial allocation: when a new migration starts, allocate an initial partition size based on the available space and the number of ongoing migrations.
- Completed migration - reclaiming space: when a migration completes (one of the "bins" on the diagram above is no longer needed because its underlying `.qcow2` has been removed), the other "bins" can reclaim the space allocated to the completed migration. The reclaimed space is then redistributed to the remaining ongoing migrations. The redistribution is done by increasing the size of each remaining "bin" by the same amount, calculated by dividing the reclaimed space by the number of ongoing migrations (a minimal sketch of this allocation/reclaim bookkeeping is given below).
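As an illustration, here is a hedged Go sketch of that bookkeeping. It is an assumption-laden instantiation, not the paper's exact definition: the initial allocation simply splits the buffer equally among the ongoing migrations, and no physical partitioning happens, only size tracking:

```go
// Conceptual sketch of the partitioning function: nominal bin sizes only,
// no physical partitioning. The equal-split initial allocation is an
// assumption; the exact formulas live in the paper.
package main

import "fmt"

const bufferSize = int64(10 << 30) // default 10GiB shared "migration-cow-buffer"

type partitioner struct {
	bins map[string]int64 // migration ID -> nominal bin size in bytes
}

func newPartitioner() *partitioner {
	return &partitioner{bins: make(map[string]int64)}
}

// start registers a new migration and recomputes the nominal bin sizes.
// Dividing the buffer by the new number of migrations is why an existing bin
// can end up smaller than the .qcow2 file it already holds, which triggers
// the "sharding" behaviour described in the next section.
func (p *partitioner) start(id string) {
	p.bins[id] = 0
	share := bufferSize / int64(len(p.bins))
	for m := range p.bins {
		p.bins[m] = share
	}
}

// finish reclaims the bin of a completed migration and redistributes that
// space equally among the remaining ongoing migrations, as described above.
func (p *partitioner) finish(id string) {
	reclaimed := p.bins[id]
	delete(p.bins, id)
	if len(p.bins) == 0 {
		return
	}
	extra := reclaimed / int64(len(p.bins))
	for m := range p.bins {
		p.bins[m] += extra
	}
}

func main() {
	p := newPartitioner()
	p.start("mig0")
	p.start("mig1")
	p.start("mig2")
	fmt.Println(p.bins) // each of the 3 bins gets ~3.33GiB
	p.finish("mig1")
	fmt.Println(p.bins) // the two remaining bins grow to ~5GiB each
}
```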
- "Bins" overlapping
Now, you are probably wondering what happens in the case where a new migration causes a bin to be full, or to be smaller than the associated size of the CoW file (this is because a bin's size is divided when a new migration process occurs). Or, what happens if a migration is hogging its bin while the other migrations are starving, or simply haven't filled up their respective bins?
In such a situation, we won't just give up and mark migration 0 as failed: instead, we'll "shard the extra data on the other sub-spaces". * (see the "3. 'Bins' overlapping" section in the paper to get the explanation)
Then the bin will be marked as full and no underlying data will be grown inside it (until a migration process finishes and some space is reclaimed).
In case of "sharded" migrations, we'll introduce the notion of priority, so that such a migration has a penalized throttle rate compared to the one living in its original bin. * (see the "the throttle rate (sharded behavior)" section in the paper to get the explanation)
the throttle rate (non-sharded behavior)
- For a non-sharded migration, we propose the following heuristic to adjust the throttle rate. * (see the "the throttle rate (non-sharded behavior)" section in the paper for the formula)
Let's explain the terms and the underlying idea behind this heuristic alongside the following picture:
The red curve models the left part of the `min(.)` operator and the blue curve models the right part.
- The blue curve is a simple, yet in practice close to reality, model of the network bandwidth evolution (a transient state followed by a steady state) divided by N, which is constant most of the time (it is a "stair" function, to be exact).
- The red curve shows the evolution of the throttling rate with the size of the CoW file on the x-axis. At first, the throttling rate is around the value of the maximum partition size, which is, in this case, above the network bandwidth U / N allocated for this bin. Then we use the value of the blue curve to adjust the throttling rate. The more the bin fills up, the closer the red curve gets to the blue curve, and eventually an exponential throttling rate is applied, without ever reaching 0 MB/s thanks to the nominal throttle rate constant epsilon = 0.01 (in MB/s).
So if the network bandwidth is large, the throttling rate will very quickly follow the red curve law and will start to decrease in a somewhat linear way until the late exponential behavior appears, at around 60% of the bin size (`auto-convergence` behavior).
If the network bandwidth is small, we won't be able to transfer the `.qcow2` data at a fast pace anyway, so the throttle rate will be in phase with the blue curve.
The interesting thing about this heuristic is that this behavior happens regardless of the size of the bin (in the above picture, the bin size for this migration is `100MB`), because of the exponential term.
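To make the heuristic tangible, here is a hedged Go sketch of one plausible instantiation. The exact formula is in the paper; the shape of the "red curve" and its constants below are illustrative guesses, and only epsilon = 0.01 MB/s comes from the text:

```go
// Illustrative sketch of the non-sharded throttle rate heuristic:
// gamma = min(red curve, blue curve) + epsilon.
package main

import (
	"fmt"
	"math"
)

const epsilon = 0.01 // nominal throttle rate floor, in MB/s (from the text)

// throttleRate returns the write limit (in MB/s) for one non-sharded migration.
//   cowBytes   : current size of the migration's .qcow2 file
//   binBytes   : nominal size of its bin
//   bandwidthU : measured source -> destination bandwidth, in MB/s
//   n          : number of ongoing migrations
func throttleRate(cowBytes, binBytes int64, bandwidthU float64, n int) float64 {
	fill := float64(cowBytes) / float64(binBytes) // how full the bin is, 0.0 .. 1.0
	binMB := float64(binBytes) / (1 << 20)

	// "Red curve": starts around the bin size, decreases roughly linearly,
	// then drops exponentially past ~60% of the bin (auto-convergence-like).
	// The exponent and factor are arbitrary choices for illustration.
	red := binMB * (1 - fill) * math.Exp(-10*math.Pow(fill, 4))

	// "Blue curve": the fair share of the consuming link for this migration.
	blue := bandwidthU / float64(n)

	return math.Min(red, blue) + epsilon
}

func main() {
	binSize := int64(100 << 20) // the 100MB bin from the picture above
	for _, used := range []int64{0, 30 << 20, 60 << 20, 90 << 20} {
		fmt.Printf("fill=%3dMB -> rate=%6.2f MB/s\n", used>>20, throttleRate(used, binSize, 200, 4))
	}
}
```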
the throttle rate (sharded behavior)
As explained in the partitioning function section, a migration process can be "sharded" across multiple bins. We also mentioned the priorities that some migrations have over others in the case of a bin being "sharded". Let's now explain how these priorities are expressed and how we can adjust the throttle rate of a migration process that is "sharded" across multiple bins.
The chosen heuristic is surprisingly simple: it is based on the same principle as the non-sharded behavior, with one simple addition.
Let's picture this situation:
We have 4 concurrent migrations. In this scenario, because each migration evolved differently, we end up with 4 bins, 2 of which are frozen.
- Migration 0 is sharded across 3 bins: its "main" bin b_0 (all the migrations have at least one shard) and on b_1 and b_3.
- Migration 2 is also sharded across 3 bins: its "main" bin b_2 and on b_1 and b_3.
You'll notice that the order of the shards is not the same for both migrations: in b_1, the green shard is on top of the orange one, while in b_3, the orange shard is on top of the green one. Each time a shard is added to a bin, it is added on top of the other shards. The bottom shard is the one with the highest priority, and priority decreases from bottom to top.
* (see the "the throttle rate (sharded behavior)" section in the paper for the formula used to compute the priorities and for how a throttle rate is modified when it's penalized by this priority)
Here is what a penalized throttle rate looks like when it is sharded with its computed mean priority:
- The black curve is the "penalized" version of the red curve (the non-sharded behavior). We can clearly see that the effect on the throttling rate is really noticeable for this mean priority (equal to 2; * see the paper to understand how it is computed): after around 40% of the capacity of the sharded bins (the non-frozen ones), the throttling rate has been halved compared to the non-sharded behavior, giving another migration, for which this bin is the main one, more room to fill it up.
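The exact priority and penalty formulas are in the paper; as a purely illustrative sketch, the "halved at mean priority 2" behavior above is consistent with simply dividing the non-sharded rate by the mean priority of the migration's shards (an assumption on my part, not the paper's formula):

```go
// Hypothetical sketch of the sharded penalty: divide the non-sharded throttle
// rate by the mean priority of the migration's shards. Priority 1 is assumed
// to be the bottom (highest-priority) shard of a bin, growing towards the top.
package main

import "fmt"

func penalize(nonShardedRate float64, shardPriorities []int) float64 {
	sum := 0
	for _, p := range shardPriorities {
		sum += p
	}
	mean := float64(sum) / float64(len(shardPriorities))
	return nonShardedRate / mean
}

func main() {
	// Migration 0 from the example: bottom shard in its main bin b_0
	// (priority 1), plus higher (i.e. lower-priority) shards in b_1 and b_3.
	fmt.Printf("%.2f MB/s\n", penalize(12.0, []int{1, 2, 3})) // mean priority 2 -> rate halved
}
```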
Conclusion on the heuristic
Through the monitoring of the bytes being written to the `.qcow2` files of concurrent live VM migrations, and a live measurement of the network bandwidth between the source and destination nodes, we can adjust the IOPS throttle rate of a VM's disk devices. This buffer technology is network aware and adaptive to the host's fixed resources, while giving a fair share of live-write IOPS to each concurrent migration.
It'll allow an LXD server to manage a centralized, hidden volume of a fixed size (decided during the `lxd init` phase, or set by default to `10GiB`, with the possibility for an admin to grow or shrink it) and to let this scheduling logic automatically throttle any number of concurrent live migrations without the user having to specify any `size.state` parameter.
Now, this was the theory.
Let's see what we'd need to introduce, more practically speaking, to enable that buffer management technology.
Daemon changes
- The first thing is to create a new volume on the LXD management pool to allocate this fixed-size volume (e.g., `migration-cow-buffer`).
- We'll also need a connection tracking system to evaluate the speed of the consuming link between the source and the target node during the migration, in order to make our throttle rates network aware.
- We'll need a filesystem tracker to actively monitor the size of the CoW files (a minimal sketch is given after this list).
- Because the `.qcow2` files won't be stored on the instance config volume anymore, we'll need another way to send the `.qcow2` bytes over the network, because we won't be able to use QEMU's NBD server anymore for that purpose.
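A minimal sketch of what such a filesystem tracker could look like, assuming a plain polling loop; the path under the shared volume, the interval and the channel shape are all illustrative assumptions:

```go
// Illustrative sketch: poll the size of the tracked .qcow2 files so the
// scheduler can recompute the throttle rates.
package main

import (
	"fmt"
	"os"
	"time"
)

type cowSample struct {
	path string
	size int64
}

// trackCowFiles polls the given .qcow2 paths at the given interval and pushes
// size samples on the returned channel until stop is closed.
func trackCowFiles(paths []string, interval time.Duration, stop <-chan struct{}) <-chan cowSample {
	out := make(chan cowSample)
	go func() {
		defer close(out)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				for _, p := range paths {
					if info, err := os.Stat(p); err == nil {
						out <- cowSample{path: p, size: info.Size()}
					}
				}
			}
		}
	}()
	return out
}

func main() {
	stop := make(chan struct{})
	// Placeholder path inside the hypothetical migration-cow-buffer volume.
	samples := trackCowFiles([]string{"/var/lib/lxd/migration-cow-buffer/mig0.qcow2"}, time.Second, stop)
	go func() {
		time.Sleep(5 * time.Second)
		close(stop)
	}()
	for s := range samples {
		fmt.Printf("%s: %d bytes\n", s.path, s.size)
	}
}
```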
CLI changes
- This will introduce a new configuration step in `lxd init` (with a default value if the `--auto` flag is set) in order to ask the user how much space will be allocated for this volume.
- The second change, and the most notable one, is the removal of the `size.state` option parameter.
API changes
None.