LXD cluster: invalid client certificate

I have an LXD cluster installed and it started behaving strangely today. When I check the cluster list, one of its members appears to be offline.

+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
|  NAME   |            URL            |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |  STATE  |                                     MESSAGE                                      |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
| neptune | https://172.30.68.16:8443 | database-leader  | x86_64       | default        |             | ONLINE  | Fully operational                                                                |
|         |                           | database         |              |                |             |         |                                                                                  |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
| triton  | https://172.30.68.18:8443 | database-standby | x86_64       | default        |             | OFFLINE | No heartbeat for 15h12m54.799128025s (2023-09-04 15:48:03.216504734 +0700 +0700) |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+

I checked the LXD service on the offline node and it is running. I tried to check the debug messages on the offline node with lxd --debug --group lxd and found an invalid certificate error.

ERROR [2023-09-05T07:13:05+07:00] Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:50874

I cannot run any lxc command on the offline node, even though its lxd service is running. It hangs for a long time and then exits with the error Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers

Can you run the following on the offline member to check if the certificate is still valid?

openssl x509 -noout -text -in /var/snap/lxd/common/lxd/server.crt

What do you see in /var/snap/lxd/common/lxd/logs/lxd.log?
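
If it helps, the fingerprint that LXD prints in those errors (a8190c3d… in your log is the neptune certificate) should, as far as I know, simply be the SHA-256 of the certificate, so you can also compare it against each member’s server.crt with something like:

openssl x509 -noout -dates -in /var/snap/lxd/common/lxd/server.crt
openssl x509 -noout -fingerprint -sha256 -in /var/snap/lxd/common/lxd/server.crt

The openssl fingerprint is printed as colon-separated uppercase hex, but the value itself should match the fingerprint in the log if the certificate is the one the other member is presenting.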

I ran the command on both servers:

On ‘neptune’:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            c6:e4:aa:fd:16:c5:ed:ee:44:09:69:d1:ca:65:f9:2d
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: O = linuxcontainers.org, CN = root@neptune
        Validity
            Not Before: Feb  9 07:56:58 2022 GMT
            Not After : Feb  7 07:56:58 2032 GMT
        Subject: O = linuxcontainers.org, CN = root@neptune
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (384 bit)
                pub:
                    04:1a:f6:4b:97:a8:9d:e4:81:9b:bb:f7:c6:7d:b8:
                    9b:eb:34:00:6e:d0:dc:bf:1d:2a:a1:30:0b:bb:77:
                    3e:22:ec:54:9f:7c:b3:59:39:8d:22:59:73:fe:f2:
                    7a:e7:4f:20:9e:f3:62:a6:a3:e3:e3:1f:ca:12:bb:
                    d4:0f:b8:13:26:a5:80:18:6d:4b:b9:45:dd:44:32:
                    b2:16:7e:a5:35:d4:4e:bf:2d:2c:12:20:f9:3f:46:
                    44:cf:1b:55:49:73:f9
                ASN1 OID: secp384r1
                NIST CURVE: P-384
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name: 
                DNS:neptune, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
    Signature Algorithm: ecdsa-with-SHA384
    Signature Value:
        30:65:02:31:00:f5:65:f7:4a:52:82:6a:a3:39:ba:a7:17:d6:
        28:41:bb:5f:ac:b5:f4:2b:ab:3c:1b:40:48:c0:be:57:57:63:
        2c:80:7f:3c:e9:87:35:d0:68:04:fb:c5:c8:f3:19:5a:af:02:
        30:3b:0a:fd:3f:b4:0e:e3:55:67:a4:13:a5:aa:1c:db:59:d8:
        21:c1:6f:86:40:54:9b:79:d6:7b:af:ea:ad:e7:e3:a5:c0:50:
        4e:14:f1:74:b2:5e:bf:7a:47:bd:15:e3:18

On ‘triton’:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            2e:d6:e0:9b:e8:16:6d:cd:8a:f2:88:c3:9a:8a:4a:e3
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: O = linuxcontainers.org, CN = root@triton
        Validity
            Not Before: Feb  9 08:01:25 2022 GMT
            Not After : Feb  7 08:01:25 2032 GMT
        Subject: O = linuxcontainers.org, CN = root@triton
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (384 bit)
                pub:
                    04:97:77:51:f5:8c:fd:6e:bd:f3:b6:ce:11:ce:37:
                    16:a7:19:88:33:72:d6:d4:a0:59:3c:57:2d:e5:5c:
                    78:c1:37:07:0c:94:a8:39:10:9b:cf:d7:c0:d8:c0:
                    fd:06:41:95:7a:d5:23:d2:03:84:27:81:6e:87:1c:
                    5a:78:f6:b7:09:f9:9c:69:fc:5c:fb:f1:9a:13:fc:
                    e6:d0:19:d1:5d:39:81:57:a6:45:b5:1a:05:30:70:
                    9d:4c:d3:12:cd:48:1a
                ASN1 OID: secp384r1
                NIST CURVE: P-384
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name: 
                DNS:triton, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
    Signature Algorithm: ecdsa-with-SHA384
    Signature Value:
        30:65:02:31:00:b1:9a:19:07:ba:76:c3:c0:5b:6a:5e:be:b8:
        8e:58:30:52:4d:32:04:26:46:da:ac:cf:38:5f:fc:8e:3e:53:
        18:93:f1:7c:f4:53:d6:7b:2d:3b:2a:2a:ff:3e:7d:ce:d4:02:
        30:44:49:00:4b:3e:a0:72:09:a0:52:d1:60:a4:8b:1a:8c:4b:
        d6:3a:c4:a2:f1:02:22:9c:55:38:e1:3f:38:66:60:10:ba:d3:
        cf:23:23:59:49:14:64:45:9c:06:8f:89:bb

I can see that both server.crt files are still valid.

lxd.log on ‘neptune’:

time="2024-03-31T10:06:49+07:00" level=warning msg="Failed heartbeat" err="Heartbeat request failed with status: 403 Forbidden" remote="172.30.68.18:8443"
time="2024-03-31T10:06:49+07:00" level=warning msg="Failed to create warning" err="Local member name not available"
time="2024-03-31T10:06:50+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55518"
time="2024-03-31T10:06:51+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55534"
time="2024-03-31T10:06:52+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55536"
time="2024-03-31T10:06:53+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55540"
time="2024-03-31T10:06:54+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55556"
time="2024-03-31T10:06:55+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55564"
time="2024-03-31T10:06:56+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:42072"
time="2024-03-31T10:06:57+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:42088"

lxd.log on ‘triton’:

time="2024-03-31T10:09:13+07:00" level=warning msg="Dqlite: attempt 8: server 172.30.68.16:8443: dial: Dialing failed: expected status code 101 got 403"
time="2024-03-31T10:09:13+07:00" level=warning msg="Dqlite: attempt 8: server 172.30.68.18:8443: no known leader"
time="2024-03-31T10:09:14+07:00" level=error msg="Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:43730"
time="2024-03-31T10:09:14+07:00" level=error msg="Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:43746"

@muhfiasbin which version of LXD are you running?

From your neptune server, can you please verify that lxc config trust list contains the certificate for your triton server and that it is of type server? Thanks.
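
For example, something like this (the --format flag is only a suggestion to make the columns easier to scan; the plain command works too):

lxc config trust list --format csv

The triton entry should be of type server, with a fingerprint starting with ea8325a4 (the fingerprint shown in your neptune log above).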

It hangs for a long time and then times out with the error:

Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers

Can you please post the version of LXD that you are running on both servers.

Additionally, can you please run lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' on both cluster members and verify that the contents are identical, and that the public keys in the certificate column match the public key found in /var/snap/lxd/common/lxd/server.crt.

If the lxd sql local command does not work, run this command instead: sqlite3 /var/snap/lxd/common/lxd/database/local.db 'SELECT fingerprint, type, name, certificate FROM certificates'.

If you aren’t running the snapped version of LXD, substitute /var/snap/lxd/common/lxd with /var/lib/lxd or whatever the value of the LXD_DIR environment variable is on your system.
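
Putting that together, on a snap install the check would look roughly like this (only a sketch; adjust the directory as described above for non-snap installs):

sudo lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates'
sudo cat /var/snap/lxd/common/lxd/server.crt

The certificate column stores the PEM certificate, so the entry for the local member should be identical to the contents of server.crt, and each member should also have an entry for the other member.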

Please provide output of snap list on each server.

This is the result from neptune:

+-------------+------+------+-------------+
| fingerprint | type | name | certificate |
+-------------+------+------+-------------+
+-------------+------+------+-------------+

and this is from triton:

+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+
|                           fingerprint                            | type |  name  |                           certificate                            |
+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+
| ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219 | 2    | triton | -----BEGIN CERTIFICATE-----                                      |
|                                                                  |      |        | MIICAzCCAYmgAwIBAgIQLtbgm+gWbc2K8ojDmopK4zAKBggqhkjOPQQDAzA0MRww |
|                                                                  |      |        | GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRQwEgYDVQQDDAtyb290QHRyaXRv |
|                                                                  |      |        | bjAeFw0yMjAyMDkwODAxMjVaFw0zMjAyMDcwODAxMjVaMDQxHDAaBgNVBAoTE2xp |
|                                                                  |      |        | bnV4Y29udGFpbmVycy5vcmcxFDASBgNVBAMMC3Jvb3RAdHJpdG9uMHYwEAYHKoZI |
|                                                                  |      |        | zj0CAQYFK4EEACIDYgAEl3dR9Yz9br3zts4RzjcWpxmIM3LW1KBZPFct5Vx4wTcH |
|                                                                  |      |        | DJSoORCbz9fA2MD9BkGVetUj0gOEJ4FuhxxaePa3Cfmcafxc+/GaE/zm0BnRXTmB |
|                                                                  |      |        | V6ZFtRoFMHCdTNMSzUgao2AwXjAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYI |
|                                                                  |      |        | KwYBBQUHAwEwDAYDVR0TAQH/BAIwADApBgNVHREEIjAgggZ0cml0b26HBH8AAAGH |
|                                                                  |      |        | EAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIxALGaGQe6dsPAW2pe |
|                                                                  |      |        | vriOWDBSTTIEJkbarM84X/yOPlMYk/F89FPWey07Kir/Pn3O1AIwREkASz6gcgmg |
|                                                                  |      |        | UtFgpIsajEvWOsSi8QIinFU44T84ZmAQutPPIyNZSRRkRZwGj4m7             |
|                                                                  |      |        | -----END CERTIFICATE-----                                        |
|                                                                  |      |        |                                                                  |
+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+

and it is identical to /var/snap/lxd/common/lxd/server.crt.

The result from neptune:

Name    Version        Rev    Tracking       Publisher   Notes
core20  20240227       2264   latest/stable  canonical✓  base
lxd     5.0.3-babaaf8  27948  5.0/stable     canonical✓  in-cohort
snapd   2.62           21465  latest/stable  canonical✓  snapd

and from triton:

Name    Version        Rev    Tracking       Publisher   Notes
core20  20240227       2264   latest/stable  canonical✓  base
lxd     5.0.3-babaaf8  27948  5.0/stable     canonical✓  in-cohort
snapd   2.61.2         21184  latest/stable  canonical✓  snapd

Your output of lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' shows the problem. Both cluster members should return the same results here: neptune should have its own certificate as well as a certificate for triton, and triton is missing the certificate for neptune.

On startup, neptune cannot get a secure connection to the DQLite cluster (requests are made over TLS via an internal endpoint). triton can connect to the “clustered” DQLite, but can’t connect to neptune.
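
If you want to see that failure directly, a rough check (snap paths assumed) is to call one member’s API using the other member’s server certificate as the client certificate, which is essentially what the cluster-internal requests do. For example, from neptune towards triton:

sudo curl -s -k --cert /var/snap/lxd/common/lxd/server.crt --key /var/snap/lxd/common/lxd/server.key https://172.30.68.18:8443/1.0

If triton does not trust neptune’s certificate, the response should report "auth": "untrusted" instead of "trusted".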

Regarding a fix, patching the local database may work; however, the local certificate entries will be overwritten by the cluster database certificate entries as soon as the DQLite heartbeat succeeds. So if the contents of the cluster database are incorrect, we’ll be back to square one.
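
Just to illustrate what such a patch would mean (a very rough sketch only: it assumes the local certificates table needs nothing beyond the columns shown above, and as noted it can be undone as soon as the cluster database syncs back), on triton it would be something like inserting neptune’s server certificate by hand with LXD stopped:

sudo snap stop lxd
sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db \
  "INSERT INTO certificates (fingerprint, type, name, certificate) VALUES ('a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba', 2, 'neptune', '<PEM contents of the neptune server.crt>');"
sudo snap start lxd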

Since the cluster is currently not functional, I would recommend tearing it down and creating a new one. If this cluster must be fixed, I’d first run

sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT fingerprint, type, name, certificate FROM certificates WHERE type = 2'

on both nodes and verify that the contents are identical, and that the output contains the certificates of both neptune and triton.

Do you know what happened when this occurred?

I don’t really remember what steps we went through to fix the error. But it first occurred after we tried to refresh the lxd snap package. After that, the lxc command hung for a long time and then brought up the error.

I tried this command on both servers.

neptune did not display any output, and triton returned an error: Error: in prepare, no such table: certificates (1)

This is very strange. no such table: certificates (1) suggests that the schema version is not what is shipped with LXD 5.0.3. Can you please:

  1. Tell us what happened when this occurred. Is it possible that LXD was updated to a newer version and then reverted?
  2. Run sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT version FROM schema' on triton and share the result.

Thanks.

I don’t really remember the details of what happened. It occurred after we ran snap refresh on the lxd package. As far as I remember, I didn’t try to revert the version.

I ran the command sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin 'SELECT version FROM schema' on triton:

Error: in prepare, no such table: schema (1)

I tried another command, sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin, and listed the tables with .tables. It looks like the database is empty.

I confirmed this with sudo ls -alh /var/snap/lxd/common/lxd/database/global/db.bin, and the output shows the file is empty (0 bytes):
-rw-r--r-- 1 root root 0 May 6 09:17 /var/snap/lxd/common/lxd/database/global/db.bin

Can you please run sudo ls -alh /var/snap/lxd/common/global on both cluster members?

Can I confirm that you are still able to interact with LXD on the neptune server, i.e. that it is still fully operational? I have assumed this since you posted the output of lxc cluster list from that node at the beginning of this thread.

I can’t find a /var/snap/lxd/common/global directory, but if you mean /var/snap/lxd/common/lxd/database/global/, here it is:

Neptune:

total 63M
drwxr-x--- 2 root root 4.0K May 19 15:59 .
drwx------ 3 root root 4.0K May 19 16:26 ..
-rw------- 1 root root 359K May 18 15:58 0000000011097096-0000000011097183
-rw------- 1 root root 2.3M May 18 17:33 0000000011097184-0000000011097754
-rw------- 1 root root 1.4M May 18 18:28 0000000011097755-0000000011098084
-rw------- 1 root root 143K May 18 18:34 0000000011098085-0000000011098119
-rw------- 1 root root 4.1M May 18 21:24 0000000011098120-0000000011099143
-rw------- 1 root root 4.1M May 19 00:15 0000000011099144-0000000011100167
-rw------- 1 root root 4.1M May 19 03:06 0000000011100168-0000000011101191
-rw------- 1 root root 4.1M May 19 05:56 0000000011101192-0000000011102215
-rw------- 1 root root 2.1M May 19 07:22 0000000011102216-0000000011102727
-rw------- 1 root root 172K May 19 07:29 0000000011102728-0000000011102769
-rw------- 1 root root 1.9M May 19 08:47 0000000011102770-0000000011103239
-rw------- 1 root root 408K May 19 09:04 0000000011103240-0000000011103339
-rw------- 1 root root 3.7M May 19 11:37 0000000011103340-0000000011104263
-rw------- 1 root root 1.2M May 19 12:25 0000000011104264-0000000011104546
-rw------- 1 root root 3.0M May 19 14:28 0000000011104547-0000000011105287
-rw------- 1 root root 2.2M May 19 15:59 0000000011105288-0000000011105832
-rw------- 1 root root 2.1M May  6 09:17 db.bin
-rw------- 1 root root   32 Feb  9  2022 metadata1
-rw------- 1 root root 8.0M May 19 16:26 open-1
-rw------- 1 root root 8.0M May 19 15:59 open-2
-rw------- 1 root root 8.0M May 19 15:59 open-3
-rw------- 1 root root 587K May 19 11:37 snapshot-1-11104263-4257674434
-rw------- 1 root root   96 May 19 11:37 snapshot-1-11104263-4257674434.meta
-rw------- 1 root root 588K May 19 14:28 snapshot-1-11105287-4267905664
-rw------- 1 root root   96 May 19 14:28 snapshot-1-11105287-4267905664.meta

Triton:

total 39M
drwxr-x--- 2 root root 4.0K May 18 07:22 .
drwx------ 3 root root 4.0K Sep  4  2023 ..
-rw------- 1 root root 4.3M Sep  4  2023 0000000007490568-0000000007491591
-rw------- 1 root root 4.2M Sep  4  2023 0000000007491592-0000000007492615
-rw------- 1 root root 4.2M Sep  4  2023 0000000007492616-0000000007493639
-rw------- 1 root root 4.3M Sep  4  2023 0000000007493640-0000000007494663
-rw------- 1 root root 4.2M Sep  4  2023 0000000007494664-0000000007495687
-rw------- 1 root root 4.3M Sep  4  2023 0000000007495688-0000000007496711
-rw------- 1 root root 4.2M Sep  4  2023 0000000007496712-0000000007497735
-rw------- 1 root root 4.2M Sep  4  2023 0000000007497736-0000000007498759
-rw------- 1 root root 3.9M Sep  4  2023 0000000007498760-0000000007499633
-rw-r--r-- 1 root root    0 May  6 09:17 db.bin
-rw------- 1 root root   32 Feb  9  2022 metadata1
-rw------- 1 root root 629K Sep  4  2023 snapshot-1-7497735-14075030263
-rw------- 1 root root   96 Sep  4  2023 snapshot-1-7497735-14075030263.meta
-rw------- 1 root root 631K Sep  4  2023 snapshot-1-7498759-14080084005
-rw------- 1 root root   96 Sep  4  2023 snapshot-1-7498759-14080084005.meta

When I tried lxc cluster list, here is the output: Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers.

I can confirm that the LXD daemon is active on both nodes by running sudo snap services:

Service          Startup  Current   Notes
lxd.activate     enabled  inactive  -
lxd.daemon       enabled  active    socket-activated
lxd.user-daemon  enabled  inactive  socket-activated

You can attempt to recover the cluster by following the steps outlined here: https://documentation.ubuntu.com/lxd/en/latest/howto/cluster_recover/#recover-from-quorum-loss

However, there has been some data loss (missing certificate data), so this might not be possible. Let us know the results. Thanks.
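
For reference, the core of that procedure (as described in the linked guide, assuming the snap package) is run on the member whose database is most recent, with LXD stopped, and looks roughly like this; the guide also recommends backing up the database directory first:

sudo snap stop lxd
sudo lxd cluster recover-from-quorum-loss
sudo snap start lxd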