Cluster Configuration Issues | LXD Cluster Crashes When a Force-Removed Offline Node Is Powered Back On

Environment and Issue:

  • LXD Version: 5.21.3-c5ae129 (Snap)
  • Cluster Size: 3 nodes
  • One node went permanently offline and was removed using lxc cluster remove --force <member>
  • Some workloads were still present on the removed node (losing them is not a concern)
  • After powering on the removed node, it automatically starts snap.lxd.daemon
  • This causes the entire cluster to crash, become unstable, or go offline

Certificate Issues:

  • By default, LXD uses self-signed certificates for server and cluster communication
  • I replaced them with valid wildcard CA-signed certificates (*.example.com) in /var/snap/lxd/common/lxd/
  • These certs are valid for domains, but not for IP addresses
  • LXD cluster internally uses node URLs in the form of https://<IP>:8443
  • When running lxc cluster list, nodes are shown with IP-based URLs
  • When running lxc monitor --pretty, I receive the warning:
    cluster notification isn't using trusted server certificate
  • This indicates that internal cluster communication fails validation because the wildcard certs do not match the IP-based URLs
  • It seems LXD still uses internal IPs for inter-node communication, and does not validate certificates correctly in this setup
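
One way to confirm which names a certificate actually covers is to ask openssl directly. The following is only a sketch: the helper names `cert_covers_ip` and `cert_covers_host` are made up for illustration, and the demo uses a throwaway certificate with a single `*.example.com` DNS entry rather than the real files in /var/snap/lxd/common/lxd/. It inspects the certificate only and says nothing about how LXD itself validates its peers.

```shell
#!/bin/sh
# Hypothetical helper: print "yes" if certificate $1 covers IP address $2.
cert_covers_ip() {
    if openssl x509 -in "$1" -noout -checkip "$2" | grep -q 'does match'; then
        echo yes
    else
        echo no
    fi
}

# Hypothetical helper: print "yes" if certificate $1 covers DNS name $2.
cert_covers_host() {
    if openssl x509 -in "$1" -noout -checkhost "$2" | grep -q 'does match'; then
        echo yes
    else
        echo no
    fi
}

# Demo with a throwaway self-signed certificate whose subjectAltName
# lists only a wildcard DNS name (no IP entries).
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 -nodes \
    -keyout "$tmp/demo.key" -out "$tmp/demo.crt" -days 1 \
    -subj "/CN=demo" -addext "subjectAltName=DNS:*.example.com" 2>/dev/null

echo "node1.example.com: $(cert_covers_host "$tmp/demo.crt" node1.example.com)"
echo "172.29.21.202: $(cert_covers_ip "$tmp/demo.crt" 172.29.21.202)"
```

Pointing the same helpers at a real certificate file and a member's IP address would show whether that certificate carries a matching IP entry.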

How can I resolve these issues? Any help would be greatly appreciated.

Please can you get the logs from the other 2 nodes using journalctl -b | grep lxd.

@markylaing may be able to share some insights here.

But my understanding is that LXD always uses mutual TLS for intra-cluster communication and doesn’t validate based on domain/IP.

So that suggests one or more of your cluster members are missing valid entries in the internal trust store.
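
A rough way to check this on each member is to compute the fingerprint of its server certificate and look for it in the output of lxc config trust ls. The sketch below assumes that the fingerprint LXD displays is the first 12 hex characters of the SHA-256 digest of the certificate; the helper name `short_fingerprint` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the first 12 hex characters of the SHA-256
# fingerprint of the certificate in $1, in lowercase and without colons.
# Assumption: this is the short form shown in the FINGERPRINT column
# of `lxc config trust ls`.
short_fingerprint() {
    openssl x509 -in "$1" -noout -fingerprint -sha256 \
        | cut -d= -f2 | tr -d ':' | tr 'A-F' 'a-f' | cut -c1-12
}
```

For example, running `short_fingerprint /var/snap/lxd/common/lxd/server.crt` on a member and comparing the result with the server entries in the trust list should show whether that member's certificate is still trusted.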

I think we will need more details on what you did in this step:

  • I replaced them with valid wildcard CA-signed certificates (*.example.com) in /var/snap/lxd/common/lxd/

Question: Is this related to the fact that the cluster won't start up now? Had you restarted the cluster after replacing the certificates, and was it working before the node that was later force-removed failed?

As @tomp suggests, this is the likely cause. In a clustered setup, replacing server.{crt,key} is not supported. These self-signed certificates are used for internal traffic only, so there should be no reason to change them. These certificates also secure the DQLite connection, so if they are not trusted, DQLite (and therefore LXD) will break.

Can you please let us know what your reason was for doing this? There may be other features in LXD that help you achieve the behaviour you want.

The cluster certificate can be replaced via lxc cluster update-certificate. The cluster certificate can also be updated automatically via ACME. There are also options for self-managed PKI.
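
As a minimal sketch of the lxc cluster update-certificate route, assuming the new certificate and key already exist as PEM files: the helper names below are made up for illustration, and the wrapper simply refuses to hand LXD a certificate and key that do not belong together.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if certificate $1 and private key $2
# belong together, by comparing the public keys extracted from each.
cert_matches_key() {
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 1
    key_pub=$(openssl pkey -in "$2" -pubout) || return 1
    [ "$cert_pub" = "$key_pub" ]
}

# Hypothetical wrapper: check the pair first, then let LXD distribute
# the new cluster certificate to all members.
update_cluster_cert() {
    if ! cert_matches_key "$1" "$2"; then
        echo "certificate and key do not match" >&2
        return 1
    fi
    lxc cluster update-certificate "$1" "$2"
}
```

It would be run once on any cluster member, for example `update_cluster_cert new-cluster.crt new-cluster.key` (both file names are placeholders), rather than copying files into /var/snap/lxd/common/lxd/ by hand.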

Now I understand the certificate issue. Can you please guide me on generating new certificates? My approach is shown in the script below, but it causes errors in cluster and server communication.

The ui-cert.crt entry is a different certificate. The trust store does not contain the other server certificates, and I manually changed the certificates in /var/snap/lxd/common/lxd/. Can you please guide me on how to add the server certificates to the trust store again?

root@lx-stg1:/var/snap/lxd/common/lxd# lxc config trust ls
+--------+-------------+-------------+--------------+------------------------------+------------------------------+
|  TYPE  |    NAME     | COMMON NAME | FINGERPRINT  |          ISSUE DATE          |         EXPIRY DATE          |
+--------+-------------+-------------+--------------+------------------------------+------------------------------+
| client | ui-cert.crt |             | 0754cb5e33b4 | Mar 17, 2025 at 5:57am (UTC) | Dec 12, 2027 at 5:57am (UTC) |
+--------+-------------+-------------+--------------+------------------------------+------------------------------+

For reference, this is what the trust store looks like on my other LXD cluster:

root@lxc-test-1:~# lxc config trust ls
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+
|  TYPE  |    NAME    |   COMMON NAME   | FINGERPRINT  |          ISSUE DATE          |         EXPIRY DATE         |
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+
| server | lxc-test-1 | root@lxc-test-1 | bb11e7e11157 | May 11, 2025 at 1:37pm (UTC) | May 9, 2035 at 1:37pm (UTC) |
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+
| server | lxc-test-2 | root@lxc-test-2 | 890b4da409a1 | May 11, 2025 at 1:41pm (UTC) | May 9, 2035 at 1:41pm (UTC) |
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+

root@lx-stg1:~# cat generate_lxd_certs.sh
#!/bin/bash

# Auto-detect hostname
CN="root@$(hostname)"
HOSTNAME="$(hostname)"

# 1. Generate private key using EC P-384 curve
openssl ecparam -name secp384r1 -genkey -noout -out lxd.key

# 2. Create general-purpose OpenSSL config
cat > cert.cnf <<EOF
[req]
distinguished_name = req_distinguished_name
req_extensions = req_ext
prompt = no

[req_distinguished_name]
CN = ${CN}
O = LXD

[req_ext]
keyUsage = critical, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
basicConstraints = critical, CA:false
subjectAltName = @alt_names

[alt_names]
DNS.1 = ${HOSTNAME}
DNS.2 = lx-stg2.nayatel.com
IP.1 = 127.0.0.1
IP.2 = ::1
IP.3 = 172.29.21.202
IP.4 = 172.29.21.203
EOF

# 3. Generate self-signed certificate
openssl req -new -x509 -sha384 \
  -key lxd.key \
  -out lxd.crt \
  -days 3650 \
  -config cert.cnf \
  -extensions req_ext

# 4. Duplicate as both server and cluster certs
cp lxd.key server.key
cp lxd.key cluster.key
cp lxd.crt server.crt
cp lxd.crt cluster.crt

# 5. Permissions
chmod 600 server.key cluster.key
chmod 644 server.crt cluster.crt
chown root:root server.* cluster.*

# 6. Move to LXD cert path
mv server.key server.crt cluster.key cluster.crt /var/snap/lxd/common/lxd/

# 7. Cleanup
rm -f lxd.key lxd.crt cert.cnf

systemctl restart snap.lxd.daemon

echo "✅ LXD certs (server + cluster) generated and moved to /var/snap/lxd/common/lxd for ${HOSTNAME} and daemon restarted"

I think this is the same issue. I removed all the certificates from the trust store myself because, at the time, I did not know what they were for. I would like to know how to add them back so that TLS works again. We do not want to make any mistakes in production, which is why we are testing everything here, where we can fix problems immediately.