Cluster Configuration Issues | LXD Cluster Crashes When a Force-Removed Offline Node Is Powered Back On

Environment and Issue:

  • LXD Version: 5.21.3-c5ae129 (Snap)
  • Cluster Size: 3 nodes
  • One node went permanently offline and was removed using lxc cluster remove --force <member>
  • Some workloads were still present on the removed node, but losing them is not a concern
  • When the removed node is powered back on, it automatically starts snap.lxd.daemon
  • This causes the entire cluster to crash, become unstable, or go offline

Certificate Issues:

  • By default, LXD uses self-signed certificates for server and cluster communication
  • I replaced them with valid wildcard CA-signed certificates (*.example.com) in /var/snap/lxd/common/lxd/
  • These certs are valid for domains, but not for IP addresses
  • LXD cluster internally uses node URLs in the form of https://<IP>:8443
  • When running lxc cluster list, nodes are shown with IP-based URLs
  • When running lxc monitor --pretty, I receive the warning:
    cluster notification isn't using trusted server certificate
  • This indicates that internal cluster communication fails validation because the wildcard certs do not match the IP-based URLs
  • It seems LXD still uses internal IPs for inter-node communication, and does not validate certificates correctly in this setup
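
To confirm which names the replacement certificate actually covers, its subjectAltName entries can be inspected with openssl. This is a minimal sketch with several assumptions: the certificate is taken to be server.crt in the directory mentioned above, OpenSSL 1.1.1 or later is installed (for the -ext option), san_has_ip is a hypothetical helper name, and 10.0.0.11 is a placeholder IP.

```shell
# san_has_ip: hypothetical helper, not part of LXD or OpenSSL.
# Reads subjectAltName text on stdin and succeeds only if it
# contains an exact "IP Address:<ip>" entry for the given IP.
san_has_ip() {
    tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -qFx "IP Address:$1"
}

# Usage (path and IP are assumptions; adjust to your node):
#   openssl x509 -in /var/snap/lxd/common/lxd/server.crt -noout -ext subjectAltName \
#       | san_has_ip 10.0.0.11 && echo "IP covered" || echo "IP not covered"
```

A wildcard certificate for *.example.com would normally carry only DNS entries, so this check would be expected to report "IP not covered".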

How can I resolve these issues? Any help would be greatly appreciated.

Could you please collect the logs from the other two nodes using journalctl -b | grep lxd?
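
If it helps, something like the following could be run on each of the two remaining nodes to save the relevant lines to a file. It is only a sketch: lxd_log_lines is a hypothetical helper name and the output file name is an arbitrary choice.

```shell
# lxd_log_lines: hypothetical helper; keeps only the lines on stdin
# that mention "lxd" (case-insensitive), the same idea as `grep lxd`.
lxd_log_lines() {
    grep -i 'lxd'
}

# Usage on each remaining node (the file name is an arbitrary choice):
#   journalctl -b --no-pager | lxd_log_lines > "lxd-$(hostname).log"
```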

@markylaing may be able to share some insights here.

But my understanding is that LXD always uses mutual TLS for intra-cluster communication and doesn’t validate based on domain/IP.

So that suggests one or more of your cluster members are missing valid entries in the internal trust store.

I think we will need more details on what you did in this step:

  • I replaced them with valid wildcard CA-signed certificates (*.example.com) in /var/snap/lxd/common/lxd/

Question: Is this related to the fact that the cluster won't start up now? Had you previously restarted the cluster, and was it working before the failure of the node that was forcefully removed?

As @tomp suggests, this is the likely cause. In a clustered setup, replacing server.{crt,key} is not supported. These self-signed certificates are used for internal traffic only, so there should be no reason to change them. These certificates also secure the DQLite connection, so if they are not trusted, DQLite (and therefore LXD) will break.

Can you please let us know what your reason was for doing this? There may be other features in LXD that help you achieve the behaviour you want.

The cluster certificate can be replaced via lxc cluster update-certificate. The cluster certificate can also be updated automatically via ACME. There are also options for self-managed PKI.
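
As a rough sketch of the first two options (not a verified procedure for this cluster): the file names, domain, and email below are placeholders, the LXC variable and the two function names are hypothetical wrappers added for illustration, and the acme.* keys should be checked against the documentation for your LXD version.

```shell
# LXC lets the client binary be overridden (e.g. for a dry run); hypothetical.
LXC="${LXC:-lxc}"

# Replace the cluster certificate across the cluster.
# $1: certificate file, $2: matching private key file.
replace_cluster_cert() {
    "$LXC" cluster update-certificate "$1" "$2"
}

# Ask LXD to obtain and renew the certificate via ACME.
# $1: domain the certificate is issued for, $2: contact email.
enable_acme() {
    "$LXC" config set acme.domain "$1" &&
    "$LXC" config set acme.email "$2" &&
    "$LXC" config set acme.agree_tos true
}

# Usage (placeholders):
#   replace_cluster_cert cluster.crt cluster.key
#   enable_acme lxd.example.com admin@example.com
```

Either way, the server.{crt,key} files on disk are left alone and LXD distributes the new cluster certificate itself.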
