Some nodes of our LXD cluster hang and recover automatically after several hours

equator8848 · March 26, 2024, 12:01am

Hi, I’ve been having some problems lately.
Some nodes of our LXD cluster hang and then recover automatically after several hours.

I found the error “Dqlite: attempt 1: server 172.16.0.20:8443: no known leader” in /var/snap/lxd/common/lxd/logs/lxd.log.

How can I find out why LXD throws the “no known leader” error? Can I list the history of leader changes? (I can list the current raft nodes with lxd sql local "SELECT * FROM raft_nodes", but I don’t know how to list the leader history.)

At the same time, I found many errors like these:

time="2024-03-24T16:11:39Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-49125-instance-LYN8YRcY instanceType=container pid=1829414 project=default
time="2024-03-24T16:11:39Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 1829414 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-49125-instance-LYN8YRcY instanceType=container pid=1829414 project=default

When the cluster recovers automatically, these errors disappear.

Any ideas on how to debug this?
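
One way to narrow down when the leader was lost is to count the “no known leader” messages per hour in lxd.log and compare those hours with other events on the cluster. This is only a minimal sketch: it assumes the matching lines start with time="..." in the same format as the excerpts above, and the function name count_no_leader_by_hour is made up for illustration.

```shell
# Hypothetical helper: count "no known leader" log lines per hour.
# Assumption: each line starts with time="YYYY-MM-DDTHH:MM:SSZ",
# as in the log excerpts quoted above.
count_no_leader_by_hour() {
    # $1 is the path to the log file.
    grep -F 'no known leader' "$1" \
        | sed -n 's/^time="\([0-9-]*T[0-9]*\):.*/\1/p' \
        | sort | uniq -c
}
```

It could be called as count_no_leader_by_hour /var/snap/lxd/common/lxd/logs/lxd.log on each affected node. Each output line is a count followed by the date and hour.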

tomp · March 27, 2024, 8:06am

What does snap list show on each server?

equator8848 · March 31, 2024, 10:38am

OK, tomp, I will check it the next time I hit the same problem.

equator8848 · April 4, 2024, 3:16pm

Hi tomp, all servers show this:

equator8848 · April 4, 2024, 5:58pm

After I ran sudo systemctl reload snap.lxd.daemon, all nodes went offline, but I can still connect to port 8443 with nc.

equator8848 · April 4, 2024, 8:04pm

Hi, I may have found the cause.

I configured the Ceph monitors mon5, mon8, and mon9 for LXD like this:

When mon8 restarts, some nodes of the LXD cluster hang.

How can I configure LXD so that it balances its requests across the Ceph monitors?

equator8848 · August 8, 2024, 3:46am

Today I hit this problem again. I/O is blocked on 7 nodes of my 17-node cluster.

My cluster uses Ceph as remote storage. I can run sudo rbd bench to read and write data on these nodes, but the instances on these nodes are all waiting for I/O. In the end I have to run sudo systemctl restart snap.lxd.daemon to recover, but that interrupts running jobs.

Any ideas on how to troubleshoot this? Thanks.
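
When instances appear to be waiting for I/O, a common first step on Linux is to list the processes in uninterruptible sleep (state D) on each affected node and see which ones are stuck. The following is a minimal sketch, not a diagnosis: the function name filter_d_state is made up for illustration, and it only filters ps-style output without explaining why a process is blocked.

```shell
# Hypothetical helper: keep only processes whose state starts with D
# (uninterruptible sleep, usually waiting on I/O).
# Expects "STATE PID COMMAND" lines on stdin, for example from
# ps -eo state,pid,comm; the ps header line is dropped because its
# first field is not D.
filter_d_state() {
    awk '$1 ~ /^D/'
}
```

It could be used as ps -eo state,pid,comm | filter_d_state on a node where the instances are blocked.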