@markylaing may be able to share some insights here.
But my understanding is that LXD always uses mutual TLS for intra-cluster communication and doesn’t validate based on domain/IP.
So that suggests one or more of your cluster members are missing valid entries in the internal trust store.
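One way to check this is to compute the fingerprint of each member's certificate file and compare it with the FINGERPRINT column of `lxc config trust ls`. The snippet below is only a sketch: it assumes the column shows the first 12 hex characters of the SHA-256 digest of the DER-encoded certificate, and that the file holds a single PEM certificate (not a chain). The function name `short_fingerprint` is made up for this example.

```python
import hashlib
import ssl


def short_fingerprint(pem_text: str, length: int = 12) -> str:
    """Return the leading hex characters of the SHA-256 digest of one PEM certificate.

    Assumption: pem_text contains exactly one certificate. A file with a
    full chain would need to be split into individual certificates first.
    """
    # Decode the base64 body between the BEGIN/END markers into raw DER bytes.
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    # Hash the DER bytes and keep only the short prefix used for comparison.
    return hashlib.sha256(der).hexdigest()[:length]
```

For example, reading a member's certificate file into a string and passing it to `short_fingerprint` gives a value you can look for in the trust list. If it is not there, that member is probably not trusted by the rest of the cluster.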
I think we will need more details on what you did in this step:
I replaced them with valid wildcard CA-signed certificates (*.example.com) in /var/snap/lxd/common/lxd/
Question: Is this related to the fact that the cluster won't start up now? Had you previously restarted the cluster, and was it working before the node that was forcefully removed failed?
As @tomp suggests, this is the likely cause. In a clustered setup, replacing server.{crt,key} is not supported. These self-signed certificates are used for internal traffic only, so there should be no reason to change them. These certificates also secure the DQLite connection, so if they are not trusted, DQLite (and therefore LXD) will break.
Can you please let us know what your reason was for doing this? There may be other features in LXD that help you achieve the behaviour you want.
Now that I understand the certificate issue, can you please guide me on generating new certificates? Here is what I did, but it causes errors in cluster and server communication.
This ui-cert.crt is different. The trust store does not have the other server certificates, and I manually changed the certs in /var/snap/lxd/common/lxd/. Can you please guide me on how to add the server certificates to the store again?
root@lx-stg1:/var/snap/lxd/common/lxd# lxc config trust ls
+--------+-------------+-------------+--------------+------------------------------+------------------------------+
| TYPE | NAME | COMMON NAME | FINGERPRINT | ISSUE DATE | EXPIRY DATE |
+--------+-------------+-------------+--------------+------------------------------+------------------------------+
| client | ui-cert.crt | | 0754cb5e33b4 | Mar 17, 2025 at 5:57am (UTC) | Dec 12, 2027 at 5:57am (UTC) |
+--------+-------------+-------------+--------------+------------------------------+------------------------------+
For reference, this is what it looks like on my other LXD cluster:
root@lxc-test-1:~# lxc config trust ls
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+
| TYPE | NAME | COMMON NAME | FINGERPRINT | ISSUE DATE | EXPIRY DATE |
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+
| server | lxc-test-1 | root@lxc-test-1 | bb11e7e11157 | May 11, 2025 at 1:37pm (UTC) | May 9, 2035 at 1:37pm (UTC) |
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+
| server | lxc-test-2 | root@lxc-test-2 | 890b4da409a1 | May 11, 2025 at 1:41pm (UTC) | May 9, 2035 at 1:41pm (UTC) |
+--------+------------+-----------------+--------------+------------------------------+-----------------------------+
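To make the difference between the two listings explicit, here is a rough sketch that counts trust store entries per TYPE from the table output. It assumes the table layout shown above (rows starting with `|`, TYPE in the first column); the function name `count_trust_types` is hypothetical and not part of LXD.

```python
from collections import Counter


def count_trust_types(table_text: str) -> Counter:
    """Count entries per TYPE in the table printed by `lxc config trust ls`."""
    counts = Counter()
    for line in table_text.splitlines():
        line = line.strip()
        # Skip border lines (+----+) and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and rows with an empty first column.
        if cells and cells[0] not in ("", "TYPE"):
            counts[cells[0]] += 1
    return counts
```

Run against the first listing this would report one `client` entry and no `server` entries, while the second listing has one `server` entry per cluster member.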
I think this is the same issue. I removed all the certificates from the trust store myself because back then I did not know what they were for. But I want to know how to put them back so that TLS works again. We do not want to make any mistakes in production, which is why we are testing everything that we can fix immediately.