I have an LXD cluster installed and it is behaving strangely today. When I check the cluster list, it appears that one of its members is offline.
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
| NAME    | URL                       | ROLES            | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE   | MESSAGE                                                                          |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
| neptune | https://172.30.68.16:8443 | database-leader  | x86_64       | default        |             | ONLINE  | Fully operational                                                                |
|         |                           | database         |              |                |             |         |                                                                                  |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
| triton  | https://172.30.68.18:8443 | database-standby | x86_64       | default        |             | OFFLINE | No heartbeat for 15h12m54.799128025s (2023-09-04 15:48:03.216504734 +0700 +0700) |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
I checked the LXD service on the offline node and it is running. I tried to check the debug messages on the offline node with lxd --debug --group lxd
and found an invalid certificate error.
ERROR [2023-09-05T07:13:05+07:00] Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:50874
I cannot run any lxc command on the offline node, even though its LXD service is running. It hangs for a long time and then exits with the error Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers
Can you run the following on the offline member to check if the certificate is still valid?
openssl x509 -noout -text -in /var/snap/lxd/common/lxd/server.crt
tomp
September 21, 2023, 1:12pm
What do you see in /var/snap/lxd/common/lxd/logs/lxd.log?
Run this command on both servers:
On ‘neptune’:
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            c6:e4:aa:fd:16:c5:ed:ee:44:09:69:d1:ca:65:f9:2d
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: O = linuxcontainers.org, CN = root@neptune
        Validity
            Not Before: Feb 9 07:56:58 2022 GMT
            Not After : Feb 7 07:56:58 2032 GMT
        Subject: O = linuxcontainers.org, CN = root@neptune
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (384 bit)
                pub:
                    04:1a:f6:4b:97:a8:9d:e4:81:9b:bb:f7:c6:7d:b8:
                    9b:eb:34:00:6e:d0:dc:bf:1d:2a:a1:30:0b:bb:77:
                    3e:22:ec:54:9f:7c:b3:59:39:8d:22:59:73:fe:f2:
                    7a:e7:4f:20:9e:f3:62:a6:a3:e3:e3:1f:ca:12:bb:
                    d4:0f:b8:13:26:a5:80:18:6d:4b:b9:45:dd:44:32:
                    b2:16:7e:a5:35:d4:4e:bf:2d:2c:12:20:f9:3f:46:
                    44:cf:1b:55:49:73:f9
                ASN1 OID: secp384r1
                NIST CURVE: P-384
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name:
                DNS:neptune, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
    Signature Algorithm: ecdsa-with-SHA384
    Signature Value:
        30:65:02:31:00:f5:65:f7:4a:52:82:6a:a3:39:ba:a7:17:d6:
        28:41:bb:5f:ac:b5:f4:2b:ab:3c:1b:40:48:c0:be:57:57:63:
        2c:80:7f:3c:e9:87:35:d0:68:04:fb:c5:c8:f3:19:5a:af:02:
        30:3b:0a:fd:3f:b4:0e:e3:55:67:a4:13:a5:aa:1c:db:59:d8:
        21:c1:6f:86:40:54:9b:79:d6:7b:af:ea:ad:e7:e3:a5:c0:50:
        4e:14:f1:74:b2:5e:bf:7a:47:bd:15:e3:18
On ‘triton’:
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            2e:d6:e0:9b:e8:16:6d:cd:8a:f2:88:c3:9a:8a:4a:e3
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: O = linuxcontainers.org, CN = root@triton
        Validity
            Not Before: Feb 9 08:01:25 2022 GMT
            Not After : Feb 7 08:01:25 2032 GMT
        Subject: O = linuxcontainers.org, CN = root@triton
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (384 bit)
                pub:
                    04:97:77:51:f5:8c:fd:6e:bd:f3:b6:ce:11:ce:37:
                    16:a7:19:88:33:72:d6:d4:a0:59:3c:57:2d:e5:5c:
                    78:c1:37:07:0c:94:a8:39:10:9b:cf:d7:c0:d8:c0:
                    fd:06:41:95:7a:d5:23:d2:03:84:27:81:6e:87:1c:
                    5a:78:f6:b7:09:f9:9c:69:fc:5c:fb:f1:9a:13:fc:
                    e6:d0:19:d1:5d:39:81:57:a6:45:b5:1a:05:30:70:
                    9d:4c:d3:12:cd:48:1a
                ASN1 OID: secp384r1
                NIST CURVE: P-384
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name:
                DNS:triton, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
    Signature Algorithm: ecdsa-with-SHA384
    Signature Value:
        30:65:02:31:00:b1:9a:19:07:ba:76:c3:c0:5b:6a:5e:be:b8:
        8e:58:30:52:4d:32:04:26:46:da:ac:cf:38:5f:fc:8e:3e:53:
        18:93:f1:7c:f4:53:d6:7b:2d:3b:2a:2a:ff:3e:7d:ce:d4:02:
        30:44:49:00:4b:3e:a0:72:09:a0:52:d1:60:a4:8b:1a:8c:4b:
        d6:3a:c4:a2:f1:02:22:9c:55:38:e1:3f:38:66:60:10:ba:d3:
        cf:23:23:59:49:14:64:45:9c:06:8f:89:bb
I see that both server.crt files are still valid.
lxd.log on ‘neptune’:
time="2024-03-31T10:06:49+07:00" level=warning msg="Failed heartbeat" err="Heartbeat request failed with status: 403 Forbidden" remote="172.30.68.18:8443"
time="2024-03-31T10:06:49+07:00" level=warning msg="Failed to create warning" err="Local member name not available"
time="2024-03-31T10:06:50+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55518"
time="2024-03-31T10:06:51+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55534"
time="2024-03-31T10:06:52+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55536"
time="2024-03-31T10:06:53+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55540"
time="2024-03-31T10:06:54+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55556"
time="2024-03-31T10:06:55+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55564"
time="2024-03-31T10:06:56+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:42072"
time="2024-03-31T10:06:57+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:42088"
lxd.log on ‘triton’:
time="2024-03-31T10:09:13+07:00" level=warning msg="Dqlite: attempt 8: server 172.30.68.16:8443: dial: Dialing failed: expected status code 101 got 403"
time="2024-03-31T10:09:13+07:00" level=warning msg="Dqlite: attempt 8: server 172.30.68.18:8443: no known leader"
time="2024-03-31T10:09:14+07:00" level=error msg="Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:43730"
time="2024-03-31T10:09:14+07:00" level=error msg="Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:43746"
@muhfiasbin which version of LXD are you running?
From your neptune server, can you please verify that lxc config trust list contains the certificate for your triton server and that it is of type server? Thanks.
It hangs for a long time and then times out with the error:
Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers
Can you please post the version of LXD that you are running on both servers.
Additionally, can you please run lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' on both cluster members and verify that the contents are identical, and that the public keys in the certificate column match the public key that can be found at /var/snap/lxd/common/lxd/server.crt.
If the lxd sql local command does not work, run this command instead: sqlite3 /var/snap/lxd/common/lxd/database/local.db 'SELECT fingerprint, type, name, certificate FROM certificates'.
If you aren’t running the snapped version of LXD, substitute /var/snap/lxd/common/lxd with /var/lib/lxd, or whatever the value of the LXD_DIR environment variable is on your system.
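To compare the public keys, openssl x509 -pubkey can extract them from each PEM for a byte-for-byte diff. A sketch, using a throwaway self-signed certificate in place of the real files; on the cluster, one input would be the certificate column from the SQL output saved to a file, and the other /var/snap/lxd/common/lxd/server.crt:

```shell
# Generate a throwaway P-384 cert standing in for both server.crt
# and the database copy (requires OpenSSL 1.1.1+).
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 \
  -keyout /tmp/demo.key -out /tmp/demo.crt -nodes -days 1 \
  -subj "/O=linuxcontainers.org/CN=root@demo" 2>/dev/null

# Extract the public key from each PEM and diff them; identical output
# means the database row and server.crt refer to the same key pair.
openssl x509 -noout -pubkey -in /tmp/demo.crt > /tmp/a.pub
openssl x509 -noout -pubkey -in /tmp/demo.crt > /tmp/b.pub
diff -q /tmp/a.pub /tmp/b.pub && echo "public keys match"
```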
tomp
April 9, 2024, 1:03pm
Please provide the output of snap list on each server.
This is the result from neptune:
+-------------+------+------+-------------+
| fingerprint | type | name | certificate |
+-------------+------+------+-------------+
+-------------+------+------+-------------+
and this is from triton:
+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+
| fingerprint | type | name | certificate |
+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+
| ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219 | 2 | triton | -----BEGIN CERTIFICATE----- |
| | | | MIICAzCCAYmgAwIBAgIQLtbgm+gWbc2K8ojDmopK4zAKBggqhkjOPQQDAzA0MRww |
| | | | GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRQwEgYDVQQDDAtyb290QHRyaXRv |
| | | | bjAeFw0yMjAyMDkwODAxMjVaFw0zMjAyMDcwODAxMjVaMDQxHDAaBgNVBAoTE2xp |
| | | | bnV4Y29udGFpbmVycy5vcmcxFDASBgNVBAMMC3Jvb3RAdHJpdG9uMHYwEAYHKoZI |
| | | | zj0CAQYFK4EEACIDYgAEl3dR9Yz9br3zts4RzjcWpxmIM3LW1KBZPFct5Vx4wTcH |
| | | | DJSoORCbz9fA2MD9BkGVetUj0gOEJ4FuhxxaePa3Cfmcafxc+/GaE/zm0BnRXTmB |
| | | | V6ZFtRoFMHCdTNMSzUgao2AwXjAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYI |
| | | | KwYBBQUHAwEwDAYDVR0TAQH/BAIwADApBgNVHREEIjAgggZ0cml0b26HBH8AAAGH |
| | | | EAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIxALGaGQe6dsPAW2pe |
| | | | vriOWDBSTTIEJkbarM84X/yOPlMYk/F89FPWey07Kir/Pn3O1AIwREkASz6gcgmg |
| | | | UtFgpIsajEvWOsSi8QIinFU44T84ZmAQutPPIyNZSRRkRZwGj4m7 |
| | | | -----END CERTIFICATE----- |
| | | | |
+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+
and it is identical to /var/snap/lxd/common/lxd/server.crt.
The result from neptune:
Name    Version        Rev    Tracking       Publisher   Notes
core20  20240227       2264   latest/stable  canonical✓  base
lxd     5.0.3-babaaf8  27948  5.0/stable     canonical✓  in-cohort
snapd   2.62           21465  latest/stable  canonical✓  snapd
and from triton:
Name    Version        Rev    Tracking       Publisher   Notes
core20  20240227       2264   latest/stable  canonical✓  base
lxd     5.0.3-babaaf8  27948  5.0/stable     canonical✓  in-cohort
snapd   2.61.2         21184  latest/stable  canonical✓  snapd
Your output of lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' shows the problem. Both cluster members should return the same results here. neptune should have its own certificate as well as a certificate for triton, and triton is missing the certificate for neptune.
On start-up, neptune cannot get a secure connection to the DQLite cluster (requests are made over TLS via an internal endpoint). triton can connect to the “clustered” DQLite, but can’t connect to neptune.
Regarding a fix, patching the local database may work, however, the local certificate entries will be overwritten by the cluster database certificate entries as soon as the DQLite heartbeat succeeds. So if the contents of the cluster database are incorrect, we’ll be back to square one.
Since the cluster is currently not functional, I would recommend tearing it down and creating a new one. If this cluster must be fixed, I’d first run
sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT fingerprint, type, name, certificate FROM certificates WHERE type = 2'
on both nodes and verify that the contents are identical, and that the output contains the certificates of both neptune and triton.
tomp
April 24, 2024, 3:33pm
Do you know what happened when this occurred?
I don’t really remember what steps we went through to fix the error. But it first occurred after we tried to refresh the lxd snap package. After that, the lxc command hung for a long time, then brought up the error.
markylaing:
sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT fingerprint, type, name, certificate FROM certificates WHERE type = 2'
I tried this command on both servers: neptune did not display any output, and triton output an error: Error: in prepare, no such table: certificates (1)
This is very strange. no such table: certificates (1) suggests that the schema version is not what is shipped with LXD 5.0.3. Can you please:
Tell us what happened when this occurred. Is it possible that LXD was updated to a newer version and then reverted?
Run sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT version FROM schema' on triton and share the result.
Thanks.
I don’t really remember the details of what happened. It occurred after we ran snap refresh on the lxd package. As far as I remember, I didn’t try to revert the version.
I ran the command sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin 'SELECT version FROM schema' on triton:
Error: in prepare, no such table: schema (1)
I tried another command, sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin, and listed the tables with .table. It looks like the database is empty.
I tried to confirm with sudo ls -alh /var/snap/lxd/common/lxd/database/global/db.bin; output:
-rw-r--r-- 1 root root 0 May 6 09:17 /var/snap/lxd/common/lxd/database/global/db.bin
Can you please run sudo ls -alh /var/snap/lxd/common/global on both cluster members?
Can I confirm that you are still able to interact with LXD on the neptune server, e.g. that it is still fully operational? I have assumed this since you have posted the output of lxc cluster list from this node at the beginning of this thread.
I can’t find a /var/snap/lxd/common/global directory, but if you mean /var/snap/lxd/common/lxd/database/global/, here it is:
Neptune:
total 63M
drwxr-x--- 2 root root 4.0K May 19 15:59 .
drwx------ 3 root root 4.0K May 19 16:26 ..
-rw------- 1 root root 359K May 18 15:58 0000000011097096-0000000011097183
-rw------- 1 root root 2.3M May 18 17:33 0000000011097184-0000000011097754
-rw------- 1 root root 1.4M May 18 18:28 0000000011097755-0000000011098084
-rw------- 1 root root 143K May 18 18:34 0000000011098085-0000000011098119
-rw------- 1 root root 4.1M May 18 21:24 0000000011098120-0000000011099143
-rw------- 1 root root 4.1M May 19 00:15 0000000011099144-0000000011100167
-rw------- 1 root root 4.1M May 19 03:06 0000000011100168-0000000011101191
-rw------- 1 root root 4.1M May 19 05:56 0000000011101192-0000000011102215
-rw------- 1 root root 2.1M May 19 07:22 0000000011102216-0000000011102727
-rw------- 1 root root 172K May 19 07:29 0000000011102728-0000000011102769
-rw------- 1 root root 1.9M May 19 08:47 0000000011102770-0000000011103239
-rw------- 1 root root 408K May 19 09:04 0000000011103240-0000000011103339
-rw------- 1 root root 3.7M May 19 11:37 0000000011103340-0000000011104263
-rw------- 1 root root 1.2M May 19 12:25 0000000011104264-0000000011104546
-rw------- 1 root root 3.0M May 19 14:28 0000000011104547-0000000011105287
-rw------- 1 root root 2.2M May 19 15:59 0000000011105288-0000000011105832
-rw------- 1 root root 2.1M May 6 09:17 db.bin
-rw------- 1 root root 32 Feb 9 2022 metadata1
-rw------- 1 root root 8.0M May 19 16:26 open-1
-rw------- 1 root root 8.0M May 19 15:59 open-2
-rw------- 1 root root 8.0M May 19 15:59 open-3
-rw------- 1 root root 587K May 19 11:37 snapshot-1-11104263-4257674434
-rw------- 1 root root 96 May 19 11:37 snapshot-1-11104263-4257674434.meta
-rw------- 1 root root 588K May 19 14:28 snapshot-1-11105287-4267905664
-rw------- 1 root root 96 May 19 14:28 snapshot-1-11105287-4267905664.meta
Triton:
total 39M
drwxr-x--- 2 root root 4.0K May 18 07:22 .
drwx------ 3 root root 4.0K Sep 4 2023 ..
-rw------- 1 root root 4.3M Sep 4 2023 0000000007490568-0000000007491591
-rw------- 1 root root 4.2M Sep 4 2023 0000000007491592-0000000007492615
-rw------- 1 root root 4.2M Sep 4 2023 0000000007492616-0000000007493639
-rw------- 1 root root 4.3M Sep 4 2023 0000000007493640-0000000007494663
-rw------- 1 root root 4.2M Sep 4 2023 0000000007494664-0000000007495687
-rw------- 1 root root 4.3M Sep 4 2023 0000000007495688-0000000007496711
-rw------- 1 root root 4.2M Sep 4 2023 0000000007496712-0000000007497735
-rw------- 1 root root 4.2M Sep 4 2023 0000000007497736-0000000007498759
-rw------- 1 root root 3.9M Sep 4 2023 0000000007498760-0000000007499633
-rw-r--r-- 1 root root 0 May 6 09:17 db.bin
-rw------- 1 root root 32 Feb 9 2022 metadata1
-rw------- 1 root root 629K Sep 4 2023 snapshot-1-7497735-14075030263
-rw------- 1 root root 96 Sep 4 2023 snapshot-1-7497735-14075030263.meta
-rw------- 1 root root 631K Sep 4 2023 snapshot-1-7498759-14080084005
-rw------- 1 root root 96 Sep 4 2023 snapshot-1-7498759-14080084005.meta
When I try lxc cluster list, here is the output: Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers.
I can confirm that the LXD daemon is active on both nodes by running sudo snap services:
Service          Startup  Current   Notes
lxd.activate     enabled  inactive  -
lxd.daemon       enabled  active    socket-activated
lxd.user-daemon  enabled  inactive  socket-activated
You can attempt to recover the cluster by following the steps outlined here: https://documentation.ubuntu.com/lxd/en/latest/howto/cluster_recover/#recover-from-quorum-loss
However, there has been some data loss (missing certificate data) so this might not be possible. Let us know the results. Thanks.
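For reference, the quorum-loss procedure in that guide comes down to roughly the following, run on the surviving database member (neptune here). This is a sketch of the documented steps for a snap install, not something to run blindly: it rewrites the cluster database configuration, so check the linked page for your LXD version first.

```shell
# On the surviving database member, stop the daemon first.
sudo snap stop lxd

# Reconfigure this member as a standalone database node; LXD makes a
# backup of the database directory before modifying it.
sudo lxd cluster recover-from-quorum-loss

# Start the daemon again and verify the cluster state.
sudo snap start lxd
lxc cluster list
```

Lost members can then be removed with lxc cluster remove --force and re-joined as fresh members if needed.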