LXD cluster : invalid client certificate

I have an LXD cluster installed, and it started behaving strangely today. When I check lxc cluster list, one of its members appears to be offline.

+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
|  NAME   |            URL            |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |  STATE  |                                     MESSAGE                                      |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
| neptune | https://172.30.68.16:8443 | database-leader  | x86_64       | default        |             | ONLINE  | Fully operational                                                                |
|         |                           | database         |              |                |             |         |                                                                                  |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+
| triton  | https://172.30.68.18:8443 | database-standby | x86_64       | default        |             | OFFLINE | No heartbeat for 15h12m54.799128025s (2023-09-04 15:48:03.216504734 +0700 +0700) |
+---------+---------------------------+------------------+--------------+----------------+-------------+---------+----------------------------------------------------------------------------------+

I checked the LXD service on the offline node and it is running. I then checked the debug output on the offline node with lxd --debug --group lxd and found an invalid certificate error.

ERROR [2023-09-05T07:13:05+07:00] Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:50874

I cannot run any lxc command on the offline node, even though its LXD service is running. The command hangs for a long time and then exits with the error Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers

Can you run the following on the offline member to check if the certificate is still valid?

openssl x509 -noout -text -in /var/snap/lxd/common/lxd/server.crt

What do you see in /var/snap/lxd/common/lxd/logs/lxd.log?
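Alongside the validity dates, it can help to recompute the fingerprint of server.crt and compare it with the hex string in the "Invalid client certificate" log line. The following is a minimal Python sketch, assuming the fingerprint LXD logs is the SHA-256 digest of the DER-encoded certificate. The helper name cert_fingerprint is hypothetical and not part of LXD.

```python
import hashlib
import ssl


def cert_fingerprint(pem_text: str) -> str:
    """Return the SHA-256 hex digest of a single PEM certificate.

    Assumption: LXD's logged fingerprint is the SHA-256 of the DER bytes.
    """
    # Strip the PEM armor and base64-decode the body into DER bytes.
    der = ssl.PEM_cert_to_DER_cert(pem_text)
    return hashlib.sha256(der).hexdigest()


def fingerprint_of_file(path: str) -> str:
    """Read a PEM file (for example server.crt) and fingerprint it."""
    with open(path, "r", encoding="ascii") as handle:
        return cert_fingerprint(handle.read())
```

Under that assumption, calling fingerprint_of_file("/var/snap/lxd/common/lxd/server.crt") on a member should return the same value that the other member logs when it rejects that certificate.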

I ran this command on both servers:

On ‘neptune’ :

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            c6:e4:aa:fd:16:c5:ed:ee:44:09:69:d1:ca:65:f9:2d
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: O = linuxcontainers.org, CN = root@neptune
        Validity
            Not Before: Feb  9 07:56:58 2022 GMT
            Not After : Feb  7 07:56:58 2032 GMT
        Subject: O = linuxcontainers.org, CN = root@neptune
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (384 bit)
                pub:
                    04:1a:f6:4b:97:a8:9d:e4:81:9b:bb:f7:c6:7d:b8:
                    9b:eb:34:00:6e:d0:dc:bf:1d:2a:a1:30:0b:bb:77:
                    3e:22:ec:54:9f:7c:b3:59:39:8d:22:59:73:fe:f2:
                    7a:e7:4f:20:9e:f3:62:a6:a3:e3:e3:1f:ca:12:bb:
                    d4:0f:b8:13:26:a5:80:18:6d:4b:b9:45:dd:44:32:
                    b2:16:7e:a5:35:d4:4e:bf:2d:2c:12:20:f9:3f:46:
                    44:cf:1b:55:49:73:f9
                ASN1 OID: secp384r1
                NIST CURVE: P-384
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name: 
                DNS:neptune, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
    Signature Algorithm: ecdsa-with-SHA384
    Signature Value:
        30:65:02:31:00:f5:65:f7:4a:52:82:6a:a3:39:ba:a7:17:d6:
        28:41:bb:5f:ac:b5:f4:2b:ab:3c:1b:40:48:c0:be:57:57:63:
        2c:80:7f:3c:e9:87:35:d0:68:04:fb:c5:c8:f3:19:5a:af:02:
        30:3b:0a:fd:3f:b4:0e:e3:55:67:a4:13:a5:aa:1c:db:59:d8:
        21:c1:6f:86:40:54:9b:79:d6:7b:af:ea:ad:e7:e3:a5:c0:50:
        4e:14:f1:74:b2:5e:bf:7a:47:bd:15:e3:18

On ‘triton’ :

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            2e:d6:e0:9b:e8:16:6d:cd:8a:f2:88:c3:9a:8a:4a:e3
        Signature Algorithm: ecdsa-with-SHA384
        Issuer: O = linuxcontainers.org, CN = root@triton
        Validity
            Not Before: Feb  9 08:01:25 2022 GMT
            Not After : Feb  7 08:01:25 2032 GMT
        Subject: O = linuxcontainers.org, CN = root@triton
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (384 bit)
                pub:
                    04:97:77:51:f5:8c:fd:6e:bd:f3:b6:ce:11:ce:37:
                    16:a7:19:88:33:72:d6:d4:a0:59:3c:57:2d:e5:5c:
                    78:c1:37:07:0c:94:a8:39:10:9b:cf:d7:c0:d8:c0:
                    fd:06:41:95:7a:d5:23:d2:03:84:27:81:6e:87:1c:
                    5a:78:f6:b7:09:f9:9c:69:fc:5c:fb:f1:9a:13:fc:
                    e6:d0:19:d1:5d:39:81:57:a6:45:b5:1a:05:30:70:
                    9d:4c:d3:12:cd:48:1a
                ASN1 OID: secp384r1
                NIST CURVE: P-384
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name: 
                DNS:triton, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
    Signature Algorithm: ecdsa-with-SHA384
    Signature Value:
        30:65:02:31:00:b1:9a:19:07:ba:76:c3:c0:5b:6a:5e:be:b8:
        8e:58:30:52:4d:32:04:26:46:da:ac:cf:38:5f:fc:8e:3e:53:
        18:93:f1:7c:f4:53:d6:7b:2d:3b:2a:2a:ff:3e:7d:ce:d4:02:
        30:44:49:00:4b:3e:a0:72:09:a0:52:d1:60:a4:8b:1a:8c:4b:
        d6:3a:c4:a2:f1:02:22:9c:55:38:e1:3f:38:66:60:10:ba:d3:
        cf:23:23:59:49:14:64:45:9c:06:8f:89:bb

I see that both server.crt files are still valid.

lxd.log in ‘neptune’ :

time="2024-03-31T10:06:49+07:00" level=warning msg="Failed heartbeat" err="Heartbeat request failed with status: 403 Forbidden" remote="172.30.68.18:8443"
time="2024-03-31T10:06:49+07:00" level=warning msg="Failed to create warning" err="Local member name not available"
time="2024-03-31T10:06:50+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55518"
time="2024-03-31T10:06:51+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55534"
time="2024-03-31T10:06:52+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55536"
time="2024-03-31T10:06:53+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55540"
time="2024-03-31T10:06:54+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55556"
time="2024-03-31T10:06:55+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:55564"
time="2024-03-31T10:06:56+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:42072"
time="2024-03-31T10:06:57+07:00" level=error msg="Invalid client certificate CN=root@triton,O=linuxcontainers.org (ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219) from 172.30.68.18:42088"

lxd.log in ‘triton’ :

time="2024-03-31T10:09:13+07:00" level=warning msg="Dqlite: attempt 8: server 172.30.68.16:8443: dial: Dialing failed: expected status code 101 got 403"
time="2024-03-31T10:09:13+07:00" level=warning msg="Dqlite: attempt 8: server 172.30.68.18:8443: no known leader"
time="2024-03-31T10:09:14+07:00" level=error msg="Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:43730"
time="2024-03-31T10:09:14+07:00" level=error msg="Invalid client certificate CN=root@neptune,O=linuxcontainers.org (a8190c3d03e0239a0ef02933996922c80f87bbee825a16d2c41fff3c458436ba) from 172.30.68.16:43746"
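The two logs show each member rejecting the other's certificate. A small Python sketch like the one below can tally the rejections per certificate and source address, which makes it easier to see whether only one fingerprint is involved. The regular expression is modelled on the log lines quoted in this thread and may not match other LXD versions. The function name tally_rejections is hypothetical.

```python
import re
from collections import Counter

# Pattern based on the "Invalid client certificate" lines quoted above.
REJECTED = re.compile(
    r"Invalid client certificate (?P<subject>\S+) "
    r"\((?P<fingerprint>[0-9a-f]{64})\) from (?P<ip>[0-9.]+):\d+"
)


def tally_rejections(lines):
    """Count rejected client certificates by (subject, fingerprint, source IP)."""
    counts = Counter()
    for line in lines:
        match = REJECTED.search(line)
        if match:
            key = (match["subject"], match["fingerprint"], match["ip"])
            counts[key] += 1
    return counts
```

Passing the contents of lxd.log line by line, for instance tally_rejections(open(path)), returns a Counter whose keys identify each rejected certificate.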

@muhfiasbin which version of LXD are you running?

From your neptune server, can you please verify that lxc config trust list contains the certificate for your triton server and that it is of type server? Thanks.

It hangs for a long time and then times out with the error:

Error: Get "http://unix.socket/1.0": net/http: timeout awaiting response headers

Can you please post the version of LXD that you are running on both servers?

Additionally, can you please run lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' on both cluster members and verify that the contents are identical, and that the public keys in the certificate column match with the public key that can be found at /var/snap/lxd/common/lxd/server.crt.

If the lxd sql local command does not work, run this command instead: sqlite3 /var/snap/lxd/common/lxd/database/local.db 'SELECT fingerprint, type, name, certificate FROM certificates'.

If you aren’t running the snapped version of LXD, substitute /var/snap/lxd/common/lxd with /var/lib/lxd or whatever the value of the LXD_DIR environment variable is on your system.
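To compare the certificates tables without reading the output by eye, a short Python sketch along these lines could be used. It assumes a copy of each member's local.db has been gathered on one machine. The file names in the usage note and the helper names load_certs and diff_certs are hypothetical.

```python
import sqlite3

# Same columns as the query above, minus the PEM body, to keep the diff short.
QUERY = "SELECT fingerprint, type, name FROM certificates"


def load_certs(db_path):
    """Return {fingerprint: (type, name)} from a copy of local.db."""
    # Open read-only so the check cannot modify the database file.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return {fp: (ctype, name) for fp, ctype, name in conn.execute(QUERY)}
    finally:
        conn.close()


def diff_certs(first, second):
    """Return the fingerprints found only in first and only in second."""
    only_first = sorted(set(first) - set(second))
    only_second = sorted(set(second) - set(first))
    return only_first, only_second
```

For example, diff_certs(load_certs("neptune-local.db"), load_certs("triton-local.db")) returns two lists. If both lists are empty, the two members hold the same set of fingerprints.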

Please provide output of snap list on each server.

This is the result from neptune :

+-------------+------+------+-------------+
| fingerprint | type | name | certificate |
+-------------+------+------+-------------+
+-------------+------+------+-------------+

and this is from triton :

+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+
|                           fingerprint                            | type |  name  |                           certificate                            |
+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+
| ea8325a48771d0c9ea4420b63c17b074c772ebf092b2bdbfd73d8466a375b219 | 2    | triton | -----BEGIN CERTIFICATE-----                                      |
|                                                                  |      |        | MIICAzCCAYmgAwIBAgIQLtbgm+gWbc2K8ojDmopK4zAKBggqhkjOPQQDAzA0MRww |
|                                                                  |      |        | GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRQwEgYDVQQDDAtyb290QHRyaXRv |
|                                                                  |      |        | bjAeFw0yMjAyMDkwODAxMjVaFw0zMjAyMDcwODAxMjVaMDQxHDAaBgNVBAoTE2xp |
|                                                                  |      |        | bnV4Y29udGFpbmVycy5vcmcxFDASBgNVBAMMC3Jvb3RAdHJpdG9uMHYwEAYHKoZI |
|                                                                  |      |        | zj0CAQYFK4EEACIDYgAEl3dR9Yz9br3zts4RzjcWpxmIM3LW1KBZPFct5Vx4wTcH |
|                                                                  |      |        | DJSoORCbz9fA2MD9BkGVetUj0gOEJ4FuhxxaePa3Cfmcafxc+/GaE/zm0BnRXTmB |
|                                                                  |      |        | V6ZFtRoFMHCdTNMSzUgao2AwXjAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYI |
|                                                                  |      |        | KwYBBQUHAwEwDAYDVR0TAQH/BAIwADApBgNVHREEIjAgggZ0cml0b26HBH8AAAGH |
|                                                                  |      |        | EAAAAAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIxALGaGQe6dsPAW2pe |
|                                                                  |      |        | vriOWDBSTTIEJkbarM84X/yOPlMYk/F89FPWey07Kir/Pn3O1AIwREkASz6gcgmg |
|                                                                  |      |        | UtFgpIsajEvWOsSi8QIinFU44T84ZmAQutPPIyNZSRRkRZwGj4m7             |
|                                                                  |      |        | -----END CERTIFICATE-----                                        |
|                                                                  |      |        |                                                                  |
+------------------------------------------------------------------+------+--------+------------------------------------------------------------------+

and it is identical to /var/snap/lxd/common/lxd/server.crt.

The result from neptune :

Name    Version        Rev    Tracking       Publisher   Notes
core20  20240227       2264   latest/stable  canonical✓  base
lxd     5.0.3-babaaf8  27948  5.0/stable     canonical✓  in-cohort
snapd   2.62           21465  latest/stable  canonical✓  snapd

and from triton :

Name    Version        Rev    Tracking       Publisher   Notes
core20  20240227       2264   latest/stable  canonical✓  base
lxd     5.0.3-babaaf8  27948  5.0/stable     canonical✓  in-cohort
snapd   2.61.2         21184  latest/stable  canonical✓  snapd
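The two snap list outputs can also be compared programmatically. The Python sketch below parses the plain-text table shown above and reports snaps whose version or revision differ between members. It assumes the column layout quoted in this thread. The function names parse_snap_list and differing_snaps are hypothetical.

```python
def parse_snap_list(text):
    """Map snap name to (version, revision) from `snap list` text output."""
    rows = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3:
            rows[fields[0]] = (fields[1], fields[2])
    return rows


def differing_snaps(first, second):
    """Return {name: (first_entry, second_entry)} for snaps that differ."""
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }
```

With the outputs above, the lxd snap is at the same version and revision on both members, and only snapd differs.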

Your output of lxd sql local 'SELECT fingerprint, type, name, certificate FROM certificates' shows the problem. Both cluster members should return the same results here, with each holding its own certificate as well as the other member's. Instead, neptune returns no certificates at all, and triton is missing the certificate for neptune.

On start up, neptune cannot get a secure connection to the DQLite cluster (requests are made over TLS via an internal endpoint). triton can connect to the “clustered” DQLite, but can’t connect to neptune.

Regarding a fix, patching the local database may work, however, the local certificate entries will be overwritten by the cluster database certificate entries as soon as the DQLite heartbeat succeeds. So if the contents of the cluster database are incorrect, we’ll be back to square one.

Since the cluster is currently not functional, I would recommend tearing it down and creating a new one. If this cluster must be fixed, I’d first run

sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT fingerprint, type, name, certificate FROM certificates WHERE type = 2'

on both nodes and verify that the contents are identical, and that the output contains the certificates of both neptune and triton.

Do you know what happened when this occurred?

I don’t really remember what steps we went through to fix the error. But it first occurred after we tried to refresh the lxd snap package. After that, lxc commands hung for a long time and then returned the error.

I tried this command on both servers.

neptune did not display any output, and triton returned an error: Error: in prepare, no such table: certificates (1)

This is very strange. no such table: certificates (1) suggests that the schema version is not what is shipped with LXD 5.0.3. Can you please:

  1. Tell us what happened when this occurred. Is it possible that LXD was updated to a newer version and then reverted?
  2. Run sudo sqlite3 /var/lib/lxd/database/global/db.bin 'SELECT version FROM schema' on triton and share the result.

Thanks.

I don’t really remember the details of what happened. It occurred after we ran snap refresh on the lxd package. As far as I remember, I didn’t try to revert the version.

I ran the command sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin 'SELECT version FROM schema' on triton:

Error: in prepare, no such table: schema (1)

I tried another command sudo sqlite3 /var/snap/lxd/common/lxd/database/global/db.bin and list table with .table. It looks like it’s empty.

I tried to confirm with sudo ls -alh /var/snap/lxd/common/lxd/database/global/db.bin
output:
-rw-r--r-- 1 root root 0 May 6 09:17 /var/snap/lxd/common/lxd/database/global/db.bin
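A zero-byte db.bin would explain the "no such table" errors, since sqlite3 treats an empty file as a new, empty database. A minimal Python sketch to classify such a file is shown below. It relies on the standard 16-byte SQLite header string. The function names classify_header and classify_file are hypothetical.

```python
# Every non-empty SQLite 3 database file starts with this 16-byte header.
SQLITE_MAGIC = b"SQLite format 3\x00"


def classify_header(first_bytes: bytes) -> str:
    """Classify a file from its first bytes: 'empty', 'sqlite' or 'unknown'."""
    if not first_bytes:
        return "empty"
    if first_bytes.startswith(SQLITE_MAGIC):
        return "sqlite"
    return "unknown"


def classify_file(path: str) -> str:
    """Read the first 16 bytes of path and classify them."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(16))
```

Running classify_file on the db.bin listed above would be expected to return "empty", given its size of 0 bytes.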

Can you please run sudo ls -alh /var/snap/lxd/common/global on both cluster members?

Can I confirm that you are still able to interact with LXD on the neptune server, i.e. that it is still fully operational? I have assumed this since you posted the output of lxc cluster list from this node at the beginning of this thread.