I checked the LXD service on the offline node and it is running. I then looked at the debug output on the offline node with lxd --debug --group lxd and found an invalid certificate error:
ERROR [2023-09-05T07:13:05+07:00] Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:50874
I cannot run any lxc command on the offline node, even though its lxd service is running. Each command hangs for a long time and then exits with the error Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers
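For what it's worth, the LXD API can also be probed over the unix socket directly with curl, which separates "the daemon is not answering" from "the lxc client is misbehaving". This is only a sketch: the socket path below assumes the snap install, and on a healthy daemon the request returns a JSON document instead of timing out.

```shell
# Probe the LXD unix socket directly (snap path; use
# /var/lib/lxd/unix.socket for non-snap installs). With --max-time set,
# a dead or absent socket fails quickly instead of hanging.
curl --silent --show-error --max-time 5 \
  --unix-socket /var/snap/lxd/common/lxd/unix.socket \
  http://unix.socket/1.0 || echo "socket not responding"
```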
From your neptune server, can you please verify that lxc config trust list contains the certificate for your triton server and that it is of type server? Thanks.
Can you please post the version of LXD that you are running on both servers?
Additionally, can you please run lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' on both cluster members and verify that the contents are identical, and that the public keys in the certificate column match with the public key that can be found at /var/snap/lxd/common/lxd/server.crt.
If the lxd sql local command does not work, run this command instead: sqlite3 /var/snap/lxd/common/lxd/database/local.db 'SELECT fingerprint, type, name, certificate FROM certificates'.
If you aren’t running the snapped version of LXD, substitute /var/snap/lxd/common/lxd with /var/lib/lxd or whatever the value of the LXD_DIR environment variable is on your system.
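As a sketch of that verification step, the public-key comparison can be done with openssl. Everything below is a scratch stand-in (a throwaway self-signed certificate, a temporary database, made-up fingerprint and name), not the real LXD files; on a real member you would feed it the certificate column output and the real server.crt instead.

```shell
set -eu
tmp=$(mktemp -d)

# Throwaway self-signed certificate standing in for server.crt.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 1 \
  -keyout "$tmp/server.key" -out "$tmp/server.crt" 2>/dev/null

# Scratch database with the same certificates schema; store the PEM in
# it the way a trusted/server certificate is stored.
pem=$(cat "$tmp/server.crt")
sqlite3 "$tmp/local.db" \
  "CREATE TABLE certificates (fingerprint TEXT, type INTEGER, name TEXT, certificate TEXT);"
sqlite3 "$tmp/local.db" \
  "INSERT INTO certificates VALUES ('deadbeef', 2, 'demo', '$pem');"

# The actual check: extract the public key from the DB copy and from
# the on-disk copy, then compare their digests.
db_pub=$(sqlite3 "$tmp/local.db" \
  "SELECT certificate FROM certificates WHERE name = 'demo'" \
  | openssl x509 -pubkey -noout | sha256sum)
file_pub=$(openssl x509 -in "$tmp/server.crt" -pubkey -noout | sha256sum)
[ "$db_pub" = "$file_pub" ] && echo "public keys match"
```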
Your output of lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' shows the problem. Both cluster members should return the same results here. neptune has its own certificate as well as a certificate for triton, but triton is missing the certificate for neptune.
On startup, neptune cannot get a secure connection to the DQLite cluster (requests are made over TLS via an internal endpoint). triton can connect to the “clustered” DQLite, but can’t connect to neptune.
Regarding a fix, patching the local database may work, however, the local certificate entries will be overwritten by the cluster database certificate entries as soon as the DQLite heartbeat succeeds. So if the contents of the cluster database are incorrect, we’ll be back to square one.
Since the cluster is currently not functional, I would recommend tearing it down and creating a new one. If this cluster must be fixed, I’d first run
sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT fingerprint, type, name, certificate FROM certificates WHERE type = 2'
on both nodes and verify that the contents are identical, and that the output contains the certificates of both neptune and triton.
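A minimal sketch of that cross-member comparison, using scratch databases in place of each member's db.bin (the fingerprints and certificate bodies are invented; on the real nodes you would save each node's query output to a file and diff the two files):

```shell
set -eu
tmp=$(mktemp -d)

# Scratch stand-ins for each member's db.bin, seeded with the two
# server (type = 2) rows a healthy two-node cluster should hold.
for node in neptune triton; do
  sqlite3 "$tmp/$node.bin" \
    "CREATE TABLE certificates (fingerprint TEXT, type INTEGER, name TEXT, certificate TEXT);"
  sqlite3 "$tmp/$node.bin" \
    "INSERT INTO certificates VALUES
       ('aaaa', 2, 'neptune', 'PEM-neptune'),
       ('bbbb', 2, 'triton',  'PEM-triton');"
  sqlite3 "$tmp/$node.bin" \
    "SELECT fingerprint, type, name, certificate FROM certificates WHERE type = 2" \
    > "$tmp/$node.out"
done

# Identical output on both members is the healthy state; any diff
# output points straight at the missing or mismatched entry.
diff "$tmp/neptune.out" "$tmp/triton.out" && echo "identical"
```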
I don’t really remember what steps we went through to fix the error, but it first occurred after we tried to refresh the lxd snap package. After that, the lxc command hung for a long time and then showed the error.
I don’t really remember the details of what happened. It occurred after we ran snap refresh lxd. As far as I remember, I didn’t try to revert the version.
I ran the command sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin 'SELECT version FROM schema' on triton and got:
Error: in prepare, no such table: schema (1)
I tried another approach: I opened the database with sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin and listed the tables with .tables. It looks like the database is empty.
To confirm, I ran sudo ls -alh /var/snap/lxd/common/lxd/database/global/db.bin
output: -rw-r--r-- 1 root root 0 May 6 09:17 /var/snap/lxd/common/lxd/database/global/db.bin
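A zero-byte file is enough to reproduce both symptoms above, as this scratch reproduction shows (temporary file only, nothing touches the real db.bin; the exact wording of the error varies with the sqlite3 version):

```shell
# sqlite3 happily opens a zero-byte database file, but it contains no
# tables at all, so listing tables prints nothing and any query fails
# with a "no such table" error.
empty=$(mktemp)                    # mktemp creates a zero-byte file
ls -alh "$empty"                   # size 0, like the db.bin above
sqlite3 "$empty" ".tables"         # prints nothing
sqlite3 "$empty" "SELECT version FROM schema" || echo "query failed as expected"
```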
Can you please run sudo ls -alh /var/snap/lxd/common/lxd/database/global on both cluster members?
Can I confirm that you are still able to interact with LXD on the neptune server, i.e. that it is still fully operational? I have assumed this since you posted the output of lxc cluster list from that node at the beginning of this thread.