Recovery of instances on failed cluster member, local storage

I am investigating failure modes of LXD clusters.
There seems to be a problem if a cluster member's root drive fails, i.e. the drive containing the LXD snap (database, certs, etc.), while the data drive containing the ZFS- or LVM-based instances remains intact.

How would the instances be recovered?
On the surface it seems they cannot, as a new cluster member must have an empty storage volume and therefore could not join the cluster once the root disk is fixed / the system is re-provisioned.

Options include:

  1. Install as standalone, recover instances, copy out, re-install as member (see the sketch after this list):
    • force-remove the failed member from the cluster
    • fix the member, install it as a standalone LXD server
    • recover the instances on the ZFS/LVM drive
    • copy the instances from the standalone server to the cluster
    • wipe the standalone LXD
    • re-provision the machine as a cluster member
    • copy the instances back
  2. Take regular snap snapshots:
    • make regular snap snapshots of the LXD snap
    • on failure of a member, restore the snapshot
    • LXD should re-join the cluster
    • the LXD database “catches up”
    • everything just works
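
For reference, option 1 would look roughly like this (the remote, pool and instance names below are placeholders I made up):

  # on a surviving cluster member: drop the failed node
  lxc cluster remove <failed-member> --force

  # on the repaired machine, set up as a standalone server on the new root disk
  lxd init
  lxd recover                                  # scan the surviving ZFS/LVM pool for instances
  lxc remote add mycluster <cluster-address>   # add the cluster as a remote
  lxc copy web mycluster:                      # copy each recovered instance, e.g. "web"

  # then wipe the standalone install, re-provision as a cluster member,
  # and copy the instances back the same way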

Am I missing any other possibilities for recovering a failed member with local storage?

If the node’s system disk (the one that holds /var/snap/lxd) dies while the
data disk with the ZFS/LVM pool survives, LXD has no record of those
volumes any more. In a cluster the storage pool is owned by that node, so
a fresh install cannot simply re-join and claim the old datasets.

You have two practical recovery paths:


Path 1: Rebuild the node and restore its LXD database

(fastest if you have backups)

  1. Re-install Ubuntu, same hostname/IP.
  2. snap install lxd --channel=<same-version>
  3. Restore /var/snap/lxd/common/lxd/ from an off-box backup (or a nightly
    lxc cluster export dump).
  4. snap restart lxd – the node rejoins, raft syncs, and the ZFS/LVM pool
    is recognised because the DB entry is back.

No instance copy/move is needed.
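
A minimal sketch of steps 3–4, assuming the off-box backup is a tarball of /var/snap/lxd/common/lxd stored at /backup/lxd-state.tar.gz (a path I made up):

  sudo snap stop lxd
  # assumes the archive was created relative to /var/snap/lxd/common/lxd
  sudo tar -xzf /backup/lxd-state.tar.gz -C /var/snap/lxd/common/lxd
  sudo snap restart lxd
  lxc cluster list        # the rebuilt member should be listed as online again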


Path 2: Remove the dead node from the cluster and re-import the data

(when no DB backup exists)

On another node:

lxc cluster remove <dead-node> --force

Then:

  1. Rebuild the machine, install LXD, and join the cluster with a new empty pool.
  2. Manually import the old datasets:
    • mount the data disk read-only;
    • for ZFS: zfs rename ..., then lxc import <instance>.tar.gz or lxc init -s <newpool> + disk attach + lxc start – whatever method you prefer to recreate guests from their rootfs snapshots;
    • for LVM: similarly create new logical volumes and lxc import.
  3. When everything runs, destroy the orphaned datasets on the old pool.

This is slower because every instance is recreated.
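
As a rough ZFS illustration of the "manually import" step, assuming the old pool is called tank, the new LXD pool sits on tank-new, and the instance is called web (all names made up):

  sudo zpool import tank tank-old                     # import the surviving pool under a new name
  sudo zfs list -r tank-old/containers                # locate the instance datasets
  sudo zfs snapshot tank-old/containers/web@migrate   # needs a read-write import; with a
                                                      # read-only import, reuse an existing snapshot
  sudo zfs send tank-old/containers/web@migrate | sudo zfs receive tank-new/containers/web
  # then let LXD pick the copied dataset up, e.g. with lxd recover (see below)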


“Snap snapshots” of the LXD snap don’t help if the whole OS drive dies; keep an external copy of /var/snap/lxd/common/lxd or use lxc export for each instance. With that in place, path 1 is a five-minute job and your node (and its local storage) comes straight back into the cluster.
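
For example, a nightly job along these lines (the host and paths are placeholders) keeps both the server state and per-instance exports off the box:

  # copy of the LXD state directory; ideally taken while LXD is stopped or idle
  # so the database files are consistent
  sudo tar -czf /tmp/lxd-state.tar.gz -C /var/snap/lxd/common lxd
  rsync /tmp/lxd-state.tar.gz backuphost:/backups/$(hostname)/

  # per-instance backup tarballs (repeat or loop over lxc list)
  lxc export web /tmp/web.tar.gz
  rsync /tmp/web.tar.gz backuphost:/backups/$(hostname)/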

Please also see https://documentation.ubuntu.com/lxd/latest/howto/disaster_recovery/.

lxd recover can be used to scan storage pools (even those not known by the LXD database) for volumes that look like they are associated with LXD.
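
On a rebuilt node that still has (or has re-imported) the old data pool, that looks roughly like this; lxd recover asks interactively for the pool name, driver and source, which is omitted here:

  sudo zpool import tank     # assumed ZFS pool name on the surviving data disk
  sudo lxd recover           # interactive: point it at the existing pool
  lxc list                   # recovered instances should show up again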


What is this “lxc cluster export” that you are talking about? Is this something new in v6? I certainly do not have that on my 5.21/LTS cluster, and Google also doesn’t know about it; the only hit Google finds is this thread.

Sorry for the confusion: I mixed two commands together.

In LXD 6.x there is a new helper

lxd cluster export <path>

which dumps the global database and the raft log into a tarball that you can
later restore with lxd cluster import. I should have written lxd,
not lxc.

On 5.21/LTS that helper does not exist.
The equivalent is

lxd sql global export > /backup/lxd-global.sql

plus a copy of /var/snap/lxd/common/lxd’s raft/ directory. If the OS
disk dies you reinstall LXD, copy the raft directory back, and restore the
SQL dump with lxd sql global import.

So:

  • 5.21: lxd sql global export/import + raft dir backup
  • 6.0 / 6.1: lxd cluster export/import does both in one step

Either backup gives you what you need to follow “path 1” (rebuild the node,
restore the DB, restart LXD, and let it re-join the cluster).