Recovery of instances on failed cluster member, local storage

I am investigating failure modes of LXD clusters.
It seems there is a problem if a cluster member's root drive fails (the one containing the LXD snap: database, certs, etc.) while the data drive holding the ZFS- or LVM-backed instances remains intact.

How would the instances be recovered?
On the surface it seems they cannot, because a new cluster member must have an empty storage volume and therefore could not join the cluster once the root disk is fixed / the system re-provisioned.

Options include:

  1. Install as standalone, recover instances, copy out, re-install as member
  • force-remove the failed member from the cluster
  • fix the member, install it as a standalone LXD server
  • recover the instances on the ZFS/LVM drive
  • copy the instances from the standalone server to the cluster
  • wipe the standalone LXD
  • re-provision as a cluster member
  • copy the instances back
  2. Take regular snap snapshots (rough sketch below)
  • make regular snap snapshots of the LXD snap
  • on failure of a member, restore the snapshot
  • LXD should re-join the cluster
  • the LXD database “catches up”
  • everything just works
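
For option 2, this is roughly what I had in mind; the snapshot set ID and file names are placeholders, and the exported file would have to live somewhere other than the root drive for this to survive the failure:

snap save lxd                                    # create a snapshot set and note the ID it prints
snap export-snapshot <set-id> lxd-snapshot.zip   # copy the snapshot set off the box
# after re-provisioning the member:
snap import-snapshot lxd-snapshot.zip
snap restore <set-id>                            # use the ID reported by `snap saved`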

Am I missing any other possibilities for recovering a failed member with local storage?

If the node’s system disk (the one that holds /var/snap/lxd) dies while the
data disk with the ZFS/LVM pool survives, LXD has no record of those
volumes any more. In a cluster the storage pool is owned by that node, so
a fresh install cannot simply re-join and claim the old datasets.

You have two practical recovery paths:


Rebuild the node and restore its LXD database (fastest if you have backups)

  1. Re-install Ubuntu, same hostname/IP.
  2. snap install lxd --channel=<same-version>
  3. Restore /var/snap/lxd/common/lxd/ from an off-box backup (or a nightly dump of the database directory).
  4. snap restart lxd; the node rejoins, raft syncs, and the ZFS/LVM pool is recognised because the DB entry is back.

No instance copy/move is needed.
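
For the off-box backup itself, something along these lines would do; the paths and backup host are placeholders, and a fully consistent copy of the database is easiest to take while the daemon is briefly stopped:

snap stop lxd                                         # brief stop so the database files are quiescent
tar -czf /tmp/lxd-$(hostname)-$(date +%F).tar.gz -C /var/snap/lxd/common lxd
snap start lxd
scp /tmp/lxd-*.tar.gz backup-host:/srv/lxd-backups/   # keep the copy off this machine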


Remove the dead node from the cluster and re-import the data (when no DB backup exists)

On another node:

lxc cluster remove <dead-node> --force
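
You can confirm the dead member is really gone before rebuilding:

lxc cluster list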

Rebuild the machine, install LXD, join the cluster with a new empty pool.
Then manually import the old data:

  • Mount the data disk read-only.
  • For ZFS: zfs rename ..., then lxc import <instance>.tar.gz, or lxc init -s <newpool> + disk attach + lxc start; whatever method you prefer to recreate guests from their rootfs snapshots (a rough first step is sketched below).
  • For LVM: similarly create new logical volumes and lxc import.
  • When everything runs, destroy the orphaned datasets on the old pool.

This is slower because every instance is recreated.
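
The first step for the ZFS case might look like this, assuming the surviving pool is called tank and one of the instances was web1 (adjust names to match your layout):

zpool import -o readonly=on tank                      # attach the old pool without writing to it
zfs list -r tank/containers                           # list the instance datasets that survived
mkdir -p /mnt/rescue
mount -t zfs -o ro tank/containers/web1 /mnt/rescue   # inspect or copy out one instance's files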


“snap snapshots” of the LXD snap don’t help if the whole OS drive dies, because the snapshot files live on that same drive unless you export them elsewhere; keep an external copy of /var/snap/lxd/common/lxd or use lxc export for each instance. With that in place, path 1 is a five-minute job and your node (and its local storage) comes straight back into the cluster.
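
For the per-instance route, a small loop is enough; the backup directory here is just an example:

# export each instance to a tarball that can later be restored with lxc import
for i in $(lxc list -c n --format csv); do
    lxc export "$i" "/srv/lxd-backups/$i.tar.gz"
done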

Please also see https://documentation.ubuntu.com/lxd/latest/howto/disaster_recovery/.

lxd recover can be used to scan storage pools (even those not known by the LXD database) for volumes that look like they are associated with LXD.
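
It is interactive; on the rebuilt member you would run roughly the following and answer the prompts with the old pool's name, driver (zfs or lvm) and source:

sudo lxd recover    # walks through the unknown pool's details, then offers to import the instances it finds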
