Cluster member won’t start after upgrade

Upgraded from 4.10 (non-snap) to the latest snap version (5.15).

The cluster consists of four nodes. I did the following steps on three of them and everything went fine:

snap install lxd
lxd.migrate

The three nodes are now waiting for the last one; however, LXD won't start on the fourth.

I ran the same steps on it, but something must have failed, and now LXD won't start on that last node.

Output from `lxd --debug --group lxd`:

INFO[07-21|18:50:50] LXD 4.10 is starting in normal mode      path=/var/lib/lxd
INFO[07-21|18:50:50] Kernel uid/gid map:
INFO[07-21|18:50:50]  - u 0 0 4294967295
INFO[07-21|18:50:50]  - g 0 0 4294967295
INFO[07-21|18:50:50] Configured LXD uid/gid map:
INFO[07-21|18:50:50]  - u 0 1000000 65536
INFO[07-21|18:50:50]  - g 0 1000000 65536
INFO[07-21|18:50:50] Kernel features:
INFO[07-21|18:50:50]  - closing multiple file descriptors efficiently: yes
INFO[07-21|18:50:50]  - netnsid-based network retrieval: yes
INFO[07-21|18:50:50]  - pidfds: yes
INFO[07-21|18:50:50]  - uevent injection: yes
INFO[07-21|18:50:50]  - seccomp listener: yes
INFO[07-21|18:50:50]  - seccomp listener continue syscalls: yes
INFO[07-21|18:50:50]  - seccomp listener add file descriptors: yes
INFO[07-21|18:50:50]  - attach to namespaces via pidfds: yes
INFO[07-21|18:50:50]  - safe native terminal allocation : yes
INFO[07-21|18:50:50]  - unprivileged file capabilities: yes
INFO[07-21|18:50:50]  - cgroup layout: hybrid
WARN[07-21|18:50:50]  - Couldn't find the CGroup blkio.weight, disk priority will be ignored
INFO[07-21|18:50:50]  - shiftfs support: yes
INFO[07-21|18:50:50] Initializing local database
DBUG[07-21|18:50:50] Initializing database gateway
DBUG[07-21|18:50:50] Start database node                      role=voter id=6 address=10.0.0.3:8443
INFO[07-21|18:50:50] Starting cluster handler:
INFO[07-21|18:50:50] Starting /dev/lxd handler:
INFO[07-21|18:50:50]  - binding devlxd socket                 socket=/var/lib/lxd/devlxd/sock
INFO[07-21|18:50:50] REST API daemon:
INFO[07-21|18:50:50]  - binding Unix socket                   socket=/var/lib/lxd/unix.socket
INFO[07-21|18:50:50]  - binding TCP socket                    socket=10.0.0.3:8443
INFO[07-21|18:50:50] Initializing global database
DBUG[07-21|18:50:50] Dqlite: attempt 0: server 10.0.0.1:8443: connected
DBUG[07-21|18:50:50] Database error: &errors.errorString{s:"this node's version is behind, please upgrade"}
EROR[07-21|18:50:50] Failed to start the daemon: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade
INFO[07-21|18:50:50] Starting shutdown sequence
INFO[07-21|18:50:50] Stop database gateway
INFO[07-21|18:50:50] Stopping REST API handler:
INFO[07-21|18:50:50]  - closing socket                        socket=10.0.0.3:8443
INFO[07-21|18:50:50]  - closing socket                        socket=/var/lib/lxd/unix.socket
INFO[07-21|18:50:50] Stopping /dev/lxd handler:
INFO[07-21|18:50:50]  - closing socket                        socket=/var/lib/lxd/devlxd/sock
DBUG[07-21|18:50:50] Not unmounting temporary filesystems (containers are still running)
Error: failed to open cluster database: failed to ensure schema: this node's version is behind, please upgrade
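The error means the three migrated nodes have already upgraded the cluster's global database schema, and this node's daemon is still reporting an older version. From one of the healthy nodes, the versions each member has registered can be inspected (a sketch; `lxd sql` is available in the snap, and the column names assume the current `nodes` table layout):

```shell
# On one of the three working nodes: show cluster members and their status.
lxc cluster list

# Inspect the schema/API versions each member has registered in the
# global database; a member lagging behind here triggers the
# "this node's version is behind" error seen in the log above.
lxd sql global "SELECT name, address, schema, api_extensions FROM nodes"
```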

Are you able to restore a backup of LXD 4.0 so you can switch to the snap's 4.0 track first, before upgrading to 5.0.x or 5.x?

Unfortunately I did not take a backup of /var/lib/lxd before I ran lxd.migrate. Well, lesson learned for future migrations.
I tried copying the lxd directory back from where the snap had migrated the files and renaming the database.pre-migration folder to database, but that just made things worse.
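For future migrations, something like this before running lxd.migrate would have made a rollback possible (paths assume the default non-snap data directory, /var/lib/lxd):

```shell
# Stop the old daemon so the database is quiescent before archiving.
systemctl stop lxd lxd.socket

# Archive the whole data directory (database, certificates, config);
# -p preserves ownership and permissions.
tar -cpzf /root/lxd-pre-migrate.tar.gz /var/lib/lxd

systemctl start lxd lxd.socket

# Only then:
# snap install lxd && lxd.migrate
```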

So I ended up doing a manual recovery of the cluster. Luckily I had most configurations in git, and I was able to recover the containers that use Ceph as a storage backend with `lxd recover`.
For the others, which use LVM, it was an old and dirty `lvchange` to activate the LVs and copy the files into new containers with rsync.
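The LVM path was roughly the following (volume group, container names, and the target pool path are placeholders for my setup, not something LXD prescribes):

```shell
# Activate the old container's logical volume so it can be mounted.
lvchange -ay lxd-vg/containers_mycontainer

mkdir -p /mnt/old
mount /dev/lxd-vg/containers_mycontainer /mnt/old

# Create a fresh container on the new snap install, then copy the
# old rootfs into it, preserving hard links, ACLs and xattrs.
lxc init ubuntu:22.04 mycontainer-new
rsync -aHAX /mnt/old/rootfs/ \
  /var/snap/lxd/common/lxd/storage-pools/default/containers/mycontainer-new/rootfs/

umount /mnt/old
```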
