Some nodes of our LXD cluster hang and recover automatically after several hours

Hi, I’ve been having some problems lately.
Some nodes of our LXD cluster hang and then recover automatically after several hours.

I found the error “Dqlite: attempt 1: server 172.16.0.20:8443: no known leader” in /var/snap/lxd/common/lxd/logs/lxd.log.

How can I find out why LXD throws the “no known leader” error? Can I list the history of leader changes? (I can list the current leader with lxd sql local "SELECT * FROM raft_nodes", but I don’t know how to list the Raft history.)

At the same time, I found many errors like these:

time="2024-03-24T16:11:39Z" level=warning msg="Failed to retrieve network information via netlink" instance=shpc-49125-instance-LYN8YRcY instanceType=container pid=1829414 project=default
time="2024-03-24T16:11:39Z" level=error msg="Error calling 'lxd forknet" err="Failed to run: /snap/lxd/current/bin/lxd forknet info -- 1829414 3: exit status 1 (Failed setns to container network namespace: No such file or directory)" instance=shpc-49125-instance-LYN8YRcY instanceType=container pid=1829414 project=default

When the cluster recovers automatically, these errors disappear.
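
To see when the errors start and stop, a small script can summarize the log. This is only a sketch: it assumes every line in lxd.log uses the same time=... level=... msg=... layout as the two lines quoted above, and the function names parse_line and summarize are made up for illustration.

```python
import re
from collections import Counter

# Matches key=value pairs where the value is either double-quoted or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Return the key/value fields of one log line as a dict (hypothetical helper)."""
    fields = {}
    for key, value in PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def summarize(lines, needle="no known leader"):
    """Count entries per log level and collect timestamps of lines containing needle."""
    levels = Counter()
    hits = []
    for line in lines:
        fields = parse_line(line)
        if not fields:
            continue  # skip blank or unparseable lines
        levels[fields.get("level", "unknown")] += 1
        if needle in line:
            hits.append(fields.get("time", "?"))
    return levels, hits
```

Calling summarize on the open log file returns the per-level counts and the timestamps of every “no known leader” line, which can then be compared with the time window in which the nodes were hung.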

Any ideas on how to debug this? :smiling_face_with_tear:

What does snap list show on each server?

OK, tomp, I will check it the next time I hit the same problem.

Hi tomp, all servers show this:
image

After I ran sudo systemctl reload snap.lxd.daemon, all nodes show as offline, but I can still connect to port 8443 with nc.

image

Hi, I may have found the cause.

I configured Ceph monitors mon5, mon8, and mon9 for LXD like this:

When mon8 restarts, some nodes of the LXD cluster hang.

How can I configure LXD so that it spreads its requests across the Ceph monitors?
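
To confirm that the hangs line up with a monitor being down, each monitor can be probed from an LXD node while the problem is happening. This is a rough sketch, not an LXD or Ceph tool: it assumes the names mon5, mon8, and mon9 resolve from the LXD node and that the monitors listen on the Ceph default ports 3300 and 6789. The helpers can_connect and check_monitors are hypothetical names. An open port only shows that the monitor accepts TCP connections, not that it is healthy or in quorum.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_monitors(hosts, ports=(3300, 6789)):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(host, port): can_connect(host, port) for host in hosts for port in ports}
```

For example, check_monitors(["mon5", "mon8", "mon9"]) run during a hang would show whether mon8 is the only monitor that is unreachable at that moment.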