I run a MicroCloud cluster that I put together on the cheap (really cheap). It is currently using a number of old spinning disks that struggle when I’m running any significant IO workload in more than one or two instances at a time. This weekend I replaced one of the disks that hosts the MicroCloud local storage pool.
My cluster is a vanilla MicroCloud; it uses MicroOVN for networking, a seldom-used MicroCeph cluster, and one big ZFS disk on each member for local storage. I have only 3 (soon 4) machines and I use the cloud mostly for experimentation and integration testing; to keep the IO speed tolerable almost all of the instance data is kept in local storage instead of Ceph.
By default, MicroCloud configures LXD to use the local storage pool for images and backups instead of the system’s root device (via the storage.backups_volume and storage.images_volume server configuration keys). Since all of my instances are more or less ephemeral, I don’t use backups, and all the images I use can easily be re-downloaded. To replace the disk, I simply deleted these volumes and recreated them once the new disk was in place. I expect this to be a pain point for users who have custom images or use the backups volume.
The cluster member under maintenance was houston.
Outline
I replaced the backing disk/ZFS pool without recreating the pool in the LXD global database.
The LXD server will not start unless the storage volumes referenced by storage.backups_volume and storage.images_volume exist; to bring the cluster member back up after replacing the disk, I removed those configuration keys and re-added them once the new pool was in place.
Pre-shutdown
I recommend grabbing the outputs of the following and saving them for reference:
lxc storage ls
lxc storage show local
lxc storage show local --target=houston
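If you would rather capture those three outputs in one go, a small helper along these lines should do it. This is only a sketch: save_pool_info and the file names it writes are my own invention, not something LXD provides.

```shell
# Save the storage pool details into a directory for later reference.
# Usage: save_pool_info <pool> <member> <output-dir>
# (save_pool_info and the file names are hypothetical, chosen for this post.)
save_pool_info() {
    pool="$1"
    member="$2"
    outdir="$3"

    mkdir -p "$outdir" || return 1

    # Cluster-wide overview of all storage pools.
    lxc storage ls > "$outdir/storage-ls.txt" || return 1

    # Pool configuration shared by every member.
    lxc storage show "$pool" > "$outdir/${pool}.yaml" || return 1

    # Member-specific pool configuration (source device and so on).
    lxc storage show "$pool" --target="$member" > "$outdir/${pool}-${member}.yaml" || return 1
}
```

For example, save_pool_info local houston ~/houston-disk-swap would leave three files in that directory.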
An easy mistake to make (although also easy to correct) is to be in the wrong project when deleting/recreating the backups and images volumes. This should be done in the default project:
lxc project switch default
Evacuate the member so that its instances are migrated or stopped before the disk is pulled:
lxc cluster evacuate houston
Clear the image cache:
lxc image delete $(lxc image ls --columns f --format csv)
Remove the configuration keys and associated volumes. Make sure to run these commands on the correct cluster member: these keys have scope: local, which means that they are stored in the local configuration database on each LXD cluster member instead of the global Dqlite database. Removing these keys/volumes from one member will not disturb the same volumes on other members.
lxc config unset storage.backups_volume
lxc config unset storage.images_volume
lxc storage volume delete local backups --target=houston
lxc storage volume delete local images --target=houston
Before shutting down LXD, make sure that there are no volumes remaining on the local ZFS pool that you want to hold on to:
sudo zfs list
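On a busy pool the full listing can be long. A filter like the one below might help narrow it down to the datasets that hold actual volumes. It is a rough sketch that assumes LXD keeps instance and custom volumes under <pool>/containers, <pool>/virtual-machines and <pool>/custom; remaining_volumes is a made-up name, and the pattern may need adjusting for your layout.

```shell
# Print datasets that look like instance or custom volumes on the given pool.
# Reads the output of `zfs list -H -o name -r <pool>` on stdin.
# Assumption: volumes live under <pool>/containers, <pool>/virtual-machines
# and <pool>/custom. remaining_volumes is a hypothetical helper name.
remaining_volumes() {
    pool="$1"
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -E "^${pool}/(containers|virtual-machines|custom)/.+" || true
}
```

It would be used as sudo zfs list -H -o name -r local | remaining_volumes local. Empty output suggests there is nothing left in those three locations, but it is still worth a glance at the full listing before pulling the disk.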
Then disable LXD (to prevent it from trying to load the local pool on startup):
sudo snap disable lxd
Shutdown
Shut the machine down and replace the disk.
Startup
Before starting LXD, the local ZFS pool needs to exist. Initialize the pool with the following command, adapted from lxd/storage/drivers/driver_zfs.go; note that you’ll need to provide the correct device path.
sudo zpool create -m none -O compression=on -o autotrim=on local /dev/sdb
LXD will initialize its expected ZFS datasets on startup:
sudo snap enable lxd
Then reconfigure storage.backups_volume and storage.images_volume:
lxc storage volume create local backups
lxc storage volume create local images
lxc config set storage.backups_volume=local/backups
lxc config set storage.images_volume=local/images
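To double-check that both keys ended up pointing at the new volumes, something like the following could be run on the member that was just rebuilt. It is a sketch only; check_storage_keys is a hypothetical helper, and it assumes the volumes are named backups and images as above.

```shell
# Check that both server keys point at the expected volumes on this member.
# Usage: check_storage_keys <pool>
# (check_storage_keys is a hypothetical helper; it assumes the volumes are
# named "backups" and "images".)
check_storage_keys() {
    pool="$1"
    status=0
    for name in backups images; do
        want="${pool}/${name}"
        got=$(lxc config get "storage.${name}_volume")
        if [ "$got" != "$want" ]; then
            echo "storage.${name}_volume is '${got}', expected '${want}'" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_storage_keys local returns 0 when both keys match and prints a message for each key that does not.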
You may also want to reset volatile.initial_source on the storage pool. This key is only informational and is set when the pool is first created. MicroCloud used a symlink from /dev/disk/by-id during cluster initialization:
lxc storage set local volatile.initial_source=/dev/disk/by-id/scsi-SATA_SanDisk_SDSSDH3_222801A009D7
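If you only know the new disk by its kernel name (for example /dev/sdb), a loop like this can list the /dev/disk/by-id symlinks that resolve to it. This is a minimal sketch; by_id_links is a name I made up, and the optional second argument exists only so the function can be pointed at a different directory.

```shell
# Print every symlink in a by-id directory that resolves to the given device.
# Usage: by_id_links <device> [by-id-dir]
# (by_id_links is a hypothetical helper; the directory defaults to
# /dev/disk/by-id.)
by_id_links() {
    target=$(readlink -f "$1") || return 1
    dir="${2:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Running by_id_links /dev/sdb may print more than one name for the same disk; any of them can be used as the value for volatile.initial_source.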
Conclusion
This process was less painful than I expected, but still more painful than I think it should be.
In the context of a single LXD server, the loss of a disk or all the disks in a local storage pool is a contingency that the operator is expected to plan for. Furthermore, the total loss of the storage pool means that a simple lxc storage delete <pool> would be sufficient to clear the way for rebuilding the pool/server.
However, in the clustered context, the global database records for the local pool may still refer to fully functional pools on other members even if one cluster member’s pool fails. LXD/MicroCloud should provide a means to replace a single disk/pool if it fails or needs upgrading.