Hi, I have an LXD cluster, and cluster members sometimes crash due to OOM. While the cluster database leader is changing, all cluster members hang. If I could set the LXD database leader manually, I would put it on a stable cluster member, which would make my cluster more robust.
Hi, have you tried with lxd cluster edit [1]? Make sure you’re looking at the right version of the docs for your version of LXD, as cluster recovery has seen some changes recently.
[1] https://documentation.ubuntu.com/lxd/en/stable-5.21/howto/cluster_recover/
So cool, let me try it. Thank you!
Hi, I tried it and found that it only works on a stopped cluster, but my cluster is running and I cannot stop it because users are using instances.
Try running systemctl reload snap.lxd.daemon (it will not restart your containers, don’t worry) on the current leader until you get it where you want it.
For example:
[root@lxd11 ~]# lxc cluster list --format csv
lxd3,https://lxd3:8443,database,x86_64,default,,ONLINE,Fully operational
lxd4,https://lxd4:8443,database-standby,x86_64,default,,ONLINE,Fully operational
lxd10,https://lxd10:8443,"database-leader,database",x86_64,default,,ONLINE,Fully operational
lxd11,https://lxd11:8443,database,x86_64,default,,ONLINE,Fully operational
lxd12,https://lxd12:8443,database-standby,x86_64,default,,ONLINE,Fully operational
lxd13,https://lxd13:8443,,x86_64,default,,ONLINE,Fully operational
[root@lxd11 ~]# ssh lxd10 systemctl reload snap.lxd.daemon
[root@lxd11 ~]# lxc cluster list --format csv
lxd3,https://lxd3:8443,"database-leader,database",x86_64,default,,ONLINE,Fully operational
lxd4,https://lxd4:8443,database,x86_64,default,,ONLINE,Fully operational
lxd10,https://lxd10:8443,,x86_64,default,,ONLINE,Fully operational
lxd11,https://lxd11:8443,database,x86_64,default,,ONLINE,Fully operational
lxd12,https://lxd12:8443,database-standby,x86_64,default,,ONLINE,Fully operational
lxd13,https://lxd13:8443,,x86_64,default,,ONLINE,Fully operational
[root@lxd11 ~]# ssh lxd3 systemctl reload snap.lxd.daemon
[root@lxd11 ~]# lxc cluster list --format csv
lxd3,https://lxd3:8443,,x86_64,default,,ONLINE,Fully operational
lxd4,https://lxd4:8443,database,x86_64,default,,ONLINE,Fully operational
lxd10,https://lxd10:8443,database-standby,x86_64,default,,ONLINE,Fully operational
lxd11,https://lxd11:8443,"database-leader,database",x86_64,default,,ONLINE,Fully operational
lxd12,https://lxd12:8443,database,x86_64,default,,ONLINE,Fully operational
lxd13,https://lxd13:8443,database-standby,x86_64,default,,ONLINE,Fully operational
etc… etc…
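The reload-and-check routine above could be scripted roughly as follows. This is only a sketch: find_leader and move_leader_to are made-up helper names, the 10-second wait and the retry limit are arbitrary, and it assumes passwordless ssh to each member by name, as in the session above. Nothing guarantees that a reload moves the leader role, or where it lands.

```shell
#!/bin/sh
# Print the name of the member holding the database-leader role.
# Reads `lxc cluster list --format csv` output on stdin; the member
# name is the first CSV field of the matching line.
find_leader() {
    grep 'database-leader' | cut -d, -f1
}

# Hypothetical helper: reload the daemon on the current leader until
# TARGET holds the leader role, giving up after MAX attempts (default 10).
move_leader_to() {
    target=$1
    max=${2:-10}
    i=0
    while [ "$i" -lt "$max" ]; do
        leader=$(lxc cluster list --format csv | find_leader)
        [ "$leader" = "$target" ] && return 0
        # Skip the reload if no leader is reported (election in progress).
        if [ -n "$leader" ]; then
            ssh "$leader" systemctl reload snap.lxd.daemon
        fi
        sleep 10   # arbitrary pause to let the cluster settle
        i=$((i + 1))
    done
    return 1
}
```

For example, move_leader_to lxd11 would keep reloading whichever member is the current leader until lxd11 holds the role or the attempts run out.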
Hi Aleks, when I run sudo lxd cluster edit, I get the error "Error: The LXD daemon is running, please stop it first." When I run systemctl reload snap.lxd.daemon, the leader does not change.
But I found a method to change the leader without stopping the cluster: when the cluster leader has changed and the cluster hangs, I run systemctl restart snap.lxd.daemon on node client (client is just a client); client then becomes the new database leader, and the cluster recovers soon after.
After running systemctl restart snap.lxd.daemon on node client
You cannot set the automatic cluster member roles manually:
Automatic roles are assigned by LXD itself and cannot be modified by the user.
https://documentation.ubuntu.com/lxd/en/latest/explanation/clustering/#member-roles
However, you may be able to influence LXD’s role placement decision by putting the problem server in its own failure domain and the other members in a separate failure domain, so that a member within the same failure domain gets the leader role if another member goes down.
For example, if a cluster member that currently has the database role gets shut down, LXD tries to assign its database role to another cluster member in the same failure domain, if one is available.
https://documentation.ubuntu.com/lxd/en/latest/explanation/clustering/#failure-domains
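A rough sketch of how the failure domains could be assigned without the interactive editor. The helper names are made up for illustration, and it assumes that lxc cluster show prints a top-level failure_domain field and that lxc cluster edit accepts the modified YAML on stdin; check both against your LXD version before relying on it.

```shell
#!/bin/sh
# Rewrite the failure_domain field in member YAML read from stdin.
# The domain name must not contain "/" because it is spliced into sed.
with_failure_domain() {
    sed "s/^failure_domain:.*/failure_domain: $1/"
}

# Hypothetical helper: move MEMBER into failure domain DOMAIN.
# Assumes `lxc cluster edit` reads the edited definition from stdin.
set_failure_domain() {
    member=$1
    domain=$2
    lxc cluster show "$member" | with_failure_domain "$domain" | lxc cluster edit "$member"
}

# Example (placeholder member and domain names):
#   set_failure_domain <problem-member> unstable
#   set_failure_domain <stable-member> stable
```

Running lxc cluster show on a member afterwards should confirm whether the new failure_domain value was applied.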
But in general, I think the priority here should be to fix the OOM scenario.