figolu
June 11, 2024, 4:37pm
Hi community!
I had a 3-node cluster running for 1.5 years; LXD is really great!
Some time ago, I had to remove one node (we needed the server for another use).
No problem here, works like a charm!
Now I need to send a server to another location (with some VMs on it).
So I split the VMs/containers between the 2 remaining nodes and removed a node from the cluster.
…but now the other node won’t start anymore.
It was not holding a copy of the global database.
The log says:
Failed connecting to global database" attempt=375 err="failed to create dqlite connection: no available dqlite leader server found
and
Dqlite: attempt 1: server lxd01.local:8443: no known leader
I still have access to the second node, if needed.
My LXD version: 5.21.1
Is there any solution?
I’d like to avoid reinstalling the server and importing backups (and I’m pretty sure there is a better solution!)
Thank you!
Have you tried restarting the node that was removed from the cluster?
I was even able to partially reproduce the issue:
$ lxc exec node-1 -- lxc cluster ls
+--------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
|  NAME  |            URL             |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+--------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node-1 | https://10.102.98.236:8443 | database-leader  | x86_64       | default        |             | ONLINE | Fully operational |
|        |                            | database         |              |                |             |        |                   |
+--------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node-2 | https://10.102.98.171:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+--------+----------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
$ lxc exec node-1 -- lxc cluster remove node-1
Member node-1 removed
# Node 1 becomes inaccessible
$ lxc exec node-1 -- lxc cluster ls
Error: LXD unix socket "/var/snap/lxd/common/lxd/unix.socket" not accessible: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: connection refused
$ lxc exec node-1 -- sudo cat /var/snap/lxd/common/lxd/logs/lxd.log
time="2024-06-26T12:12:51Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
time="2024-06-26T12:12:51Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
time="2024-06-26T12:12:51Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
time="2024-06-26T12:12:51Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
time="2024-06-26T12:12:52Z" level=warning msg="Transaction timed out. Retrying once" err="Failed to begin transaction: context deadline exceeded" member=1
time="2024-06-26T12:37:52Z" level=warning msg="Dqlite: attempt 1: server 10.102.98.171:8443: reported leader server is not the leader"
time="2024-06-26T12:37:52Z" level=warning msg="Dqlite: attempt 1: server 10.102.98.236:8443: no known leader"
time="2024-06-26T12:37:52Z" level=warning msg="Dqlite: attempt 2: server 10.102.98.171:8443: reported leader server is not the leader"
time="2024-06-26T12:37:53Z" level=warning msg="Dqlite: attempt 2: server 10.102.98.236:8443: no known leader"
time="2024-06-26T12:37:53Z" level=warning msg="Dqlite: attempt 3: server 10.102.98.171:8443: reported leader server is not the leader"
time="2024-06-26T12:37:53Z" level=warning msg="Dqlite: attempt 3: server 10.102.98.236:8443: no known leader"
time="2024-06-26T12:37:54Z" level=warning msg="Dqlite: attempt 4: server 10.102.98.171:8443: reported leader server is not the leader"
time="2024-06-26T12:37:54Z" level=warning msg="Dqlite: attempt 4: server 10.102.98.236:8443: no known leader"
time="2024-06-26T12:37:55Z" level=warning msg="Dqlite: attempt 5: server 10.102.98.171:8443: reported leader server is not the leader"
time="2024-06-26T12:37:55Z" level=warning msg="Dqlite: attempt 5: server 10.102.98.236:8443: no known leader"
time="2024-06-26T12:37:56Z" level=warning msg="Dqlite: attempt 6: server 10.102.98.171:8443: reported leader server is not the leader"
time="2024-06-26T12:37:56Z" level=warning msg="Dqlite: attempt 6: server 10.102.98.236:8443: no known leader"
time="2024-06-26T12:38:03Z" level=warning msg="Failed to rebalance dqlite nodes: Get current raft nodes: Not leader"
# Node 2 takes the role of the leader.
$ lxc exec node-2 -- lxc cluster ls
+--------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
|  NAME  |            URL             |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+--------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| node-2 | https://10.102.98.171:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|        |                            | database        |              |                |             |        |                   |
+--------+----------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
But restarting LXD on the removed node makes it aware that it is no longer part of the cluster:
$ lxc exec node-1 -- systemctl restart snap.lxd.daemon
$ lxc exec node-1 -- lxc cluster ls
Error: LXD server isn't part of a cluster
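If it helps to see at a glance which servers the stuck daemon keeps asking about, the Dqlite warnings in lxd.log can be condensed into one line per server and message. This is only a rough sketch: `summarize_dqlite_warnings` is a name made up for this example (it is not an LXD command), it assumes the log lines look like the ones pasted above, and it needs a `grep` that supports `-o`.

```shell
#!/bin/sh
# Hypothetical helper, not part of LXD: reads lxd.log text on stdin and
# prints "<count> <server> <message>" for each distinct Dqlite warning.
summarize_dqlite_warnings() {
    # Keep only the "Dqlite: attempt N: server HOST:PORT: message" part,
    # stopping before the closing quote of msg="..." if there is one.
    grep -o 'Dqlite: attempt [0-9]*: server [^ ]*: [^"]*' |
        # Drop the attempt number so retries of the same warning collapse.
        sed -E 's/^Dqlite: attempt [0-9]+: server ([^ ]+): (.*)$/\1 \2/' |
        sort | uniq -c |
        # uniq -c pads the count with leading spaces; strip them.
        sed -E 's/^ +//'
}
```

On the snap install shown above it could be run as `summarize_dqlite_warnings < /var/snap/lxd/common/lxd/logs/lxd.log` (as root). If the same servers keep reporting "no known leader" after a restart of the daemon, that points at a real quorum problem rather than a stale view on the removed node.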