Cannot issue ACME certificate with LXD 5.21.4

I have a fresh install of an LXD 5.21.4 cluster on physical hosts, deployed with MicroCloud using a preseed file. The cluster uses an internal network:

lookup_subnet: 10.0.1.0/24

I am trying to obtain an ACME certificate using the acme.* configuration keys and HAProxy with the example config provided in the documentation:

# lxc config show
config:
  acme.agree_tos: "true"
  acme.ca_url: https://acme-staging-v02.api.letsencrypt.org/directory
  acme.domain: node2.example.net
  acme.email: my@email.com
  cluster.https_address: 10.0.1.12:8443
  core.https_address: '[::]:8443'
  core.https_trusted_proxy: 10.0.1.12
  instances.migration.stateful: "true"
  network.ovn.northbound_connection: ssl:10.0.1.11:6641,ssl:10.0.1.13:6641,ssl:10.0.1.14:6641
  user.microcloud: 2.1.1
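
As a sanity check that HAProxy really forwards the challenge path to LXD (assuming, as in the documented setup, that HAProxy listens on port 80 for the HTTP-01 challenge), a request like the following from outside the cluster should show up in the LXD debug log as an "Allowing untrusted GET" entry, just like the Let's Encrypt validation requests further down:

# curl -i http://node2.example.net/.well-known/acme-challenge/test

(The path component is a dummy value; outside of an active challenge the response is just an error, but it confirms the routing.)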

Here are my cluster nodes:

# lxc cluster list
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| NAME  |          URL           |      ROLES       | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node1 | https://10.0.1.11:8443 | database-leader  | x86_64       | default        |             | ONLINE | Fully operational |
|       |                        | database         |              |                |             |        |                   |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node2 | https://10.0.1.12:8443 | database-standby | x86_64       | default        |             | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node3 | https://10.0.1.13:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+
| node4 | https://10.0.1.14:8443 | database         | x86_64       | default        |             | ONLINE | Fully operational |
+-------+------------------------+------------------+--------------+----------------+-------------+--------+-------------------+

When trying to (re-)issue the certificate (by changing the acme.* keys), I get the following error:

# lxc config unset acme.domain
# lxc config set acme.domain node2.example.net
Error: Failed to notify peer node1 at 10.0.1.11:8443: Put "https://10.0.1.11:8443/1.0": EOF

The LXD log shows the following messages right after the updated configuration was sent to all cluster members (not copying that part because the config is too large):
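
(The messages below were captured by streaming the daemon log events:)

# lxc monitor --type=logging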

location: node2
metadata:
  context:
    ip: 66.133.109.36:36851
    url: /.well-known/acme-challenge/HbtigqBnLPHAq_yTU7M3Lq6tl5PlLmUokxGtdUf7c1Q
  level: debug
  message: Allowing untrusted GET
timestamp: "2025-09-11T14:29:27.609006324Z"
type: logging


location: node2
metadata:
  context:
    url: https://10.0.1.11:8443
  level: debug
  message: Connecting to a remote LXD over HTTPS
timestamp: "2025-09-11T14:29:27.621640394Z"
type: logging


location: node2
metadata:
  context:
    ip: 13.229.94.34:38162
    url: /.well-known/acme-challenge/HbtigqBnLPHAq_yTU7M3Lq6tl5PlLmUokxGtdUf7c1Q
  level: debug
  message: Allowing untrusted GET
timestamp: "2025-09-11T14:29:28.844969483Z"
type: logging


location: node2
metadata:
  context:
    url: https://10.0.1.11:8443
  level: debug
  message: Connecting to a remote LXD over HTTPS
timestamp: "2025-09-11T14:29:28.857482946Z"
type: logging


location: node2
metadata:
  context:
    raftMembers: '[{{1 10.0.1.11:8443 voter} node1} {{2 10.0.1.13:8443 voter} node3}
      {{3 10.0.1.14:8443 voter} node4} {{4 10.0.1.12:8443 stand-by} node2}]'
  level: debug
  message: Replace current raft nodes
timestamp: "2025-09-11T14:29:33.016031003Z"
type: logging


location: node2
metadata:
  context:
    fingerprint: 03b182c2dda539d788883af041f3989b90fe3dc5ee033f6f73abf7b2da9efdea
    subject: CN=root@node1,O=LXD
  level: debug
  message: Matched trusted cert
timestamp: "2025-09-11T14:29:33.015816784Z"
type: logging


location: node2
metadata:
  context:
    local: 10.0.1.12:8443
    name: dqlite
    remote: 10.0.1.11:40906
  level: info
  message: Dqlite proxy stopped
timestamp: "2025-09-11T14:29:41.256171002Z"
type: logging


location: node2
metadata:
  context:
    listener: 1fbe1dd7-1e7c-4831-b204-8de975f1f8f5
    local: 10.0.1.12:8443
    remote: 10.0.1.11:40922
  level: debug
  message: Event listener server handler stopped
timestamp: "2025-09-11T14:29:41.256072234Z"
type: logging


location: node2
metadata:
  context:
    local: 10.0.1.12:48966
    name: raft
    remote: 10.0.1.11:8443
  level: info
  message: Dqlite proxy stopped
timestamp: "2025-09-11T14:29:41.256458036Z"
type: logging


location: node2
metadata:
  context:
    err: 'Failed to notify peer node1 at 10.0.1.11:8443: Put "https://10.0.1.11:8443/1.0":
      EOF'
  level: error
  message: Failed to notify other members about config change
timestamp: "2025-09-11T14:29:41.256484546Z"
type: logging


location: node2
metadata:
  context: {}
  level: debug
  message: 'Dqlite: EOF detected: call exec-sql (budget 9.999970569s): receive: header:
    EOF'
timestamp: "2025-09-11T14:29:41.256742606Z"
type: logging


location: node2
metadata:
  context: {}
  level: debug
  message: 'Dqlite: network connection lost: write tcp 10.0.1.12:48958->10.0.1.11:8443:
    write: broken pipe'
timestamp: "2025-09-11T14:29:41.257002785Z"
type: logging


location: node2
metadata:
  context: {}
  level: debug
  message: 'Dqlite: network connection lost: write tcp 10.0.1.12:48958->10.0.1.11:8443:
    write: broken pipe'
timestamp: "2025-09-11T14:29:41.257156986Z"
type: logging


location: node2
metadata:
  context: {}
  level: debug
  message: 'Dqlite: network connection lost: write tcp 10.0.1.12:48958->10.0.1.11:8443:
    write: broken pipe'
timestamp: "2025-09-11T14:29:41.257275028Z"
type: logging

Followed by many similar “broken pipe” messages.

Repeated attempts to re-issue the certificate show similar errors, but for different nodes.

With LXD 5.21.3 (same fresh setup), the same process finishes successfully. Here is the relevant part of the LXD log for comparison:

location: node2
metadata:
  context:
    ip: 34.208.39.181:60360
    url: /.well-known/acme-challenge/_KQ9NCVOIqpQTPbitr5lbj1pJ389oR3yhAzqxcDMRxw
  level: debug
  message: Allowing untrusted GET
timestamp: "2025-09-11T13:53:28.561107051Z"
type: logging


location: node2
metadata:
  context:
    url: https://10.0.1.11:8443
  level: debug
  message: Connecting to a remote LXD over HTTPS
timestamp: "2025-09-11T13:53:28.573394067Z"
type: logging


location: node2
metadata:
  context:
    fingerprint: 306bea6f374d7803d10abaaa7a33f75d4f644241166f22aa777e086adeebbb24
    subject: CN=root@node1,O=LXD
  level: debug
  message: Matched trusted cert
timestamp: "2025-09-11T13:53:32.046090206Z"
type: logging


location: node2
metadata:
  context:
    fingerprint: 306bea6f374d7803d10abaaa7a33f75d4f644241166f22aa777e086adeebbb24
    ip: 10.0.1.11:55870
    method: PUT
    protocol: cluster
    url: /1.0/cluster/certificate
  level: debug
  message: Handling API request
timestamp: "2025-09-11T13:53:32.046171888Z"
type: logging


location: node2
metadata:
  context: {}
  level: info
  message: 'http: TLS handshake error from 10.0.1.11:34562: remote error: tls: bad
    certificate'
timestamp: "2025-09-11T13:53:34.178952287Z"
type: logging


location: node1
metadata:
  context:
    err: 'Failed to send heartbeat request: Put "https://10.0.1.12:8443/internal/database":
      tls: failed to verify certificate: x509: certificate signed by unknown authority'
    remote: 10.0.1.12:8443
  level: warning
  message: Failed heartbeat
timestamp: "2025-09-11T13:53:34.176864386Z"
type: logging


location: node1
metadata:
  context:
    err: 'Failed to send heartbeat request: Put "https://10.0.1.13:8443/internal/database":
      tls: failed to verify certificate: x509: certificate signed by unknown authority'
    remote: 10.0.1.13:8443
  level: warning
  message: Failed heartbeat
timestamp: "2025-09-11T13:53:34.911198392Z"
type: logging

I guess the "tls: failed to verify certificate" errors are due to the use of the staging Let's Encrypt certificate (I already exhausted the rate limit for my domain while trying to debug the problem).
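
To double-check which certificate the members are actually serving (issuer and validity dates), something like this against any node's HTTPS address does the trick:

# openssl s_client -connect node2.example.net:8443 </dev/null 2>/dev/null | openssl x509 -noout -issuer -dates

With the staging CA the issuer is a Let's Encrypt staging intermediate, which is of course not trusted by default.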

Also, is there a way to force re-issuing the ACME certificate without changing the acme.* keys? I noticed that after a while, changing the keys no longer triggers the certificate renewal.

With LXD 5.21.3 (same fresh setup), the same process finishes successfully.

Having the accompanying LXD logs would probably help understand this error you get with 5.21.4:

# lxc config unset acme.domain
# lxc config set acme.domain node2.example.net
Error: Failed to notify peer node1 at 10.0.1.11:8443: Put "https://10.0.1.11:8443/1.0": EOF

I noticed that after a while, changing the keys no longer triggers the certificate renewal.

Indeed, the code hardcodes a limit of 5 requests per hour. What you noticed is probably due to this.

Are you positive that the same HAProxy config worked well in conjunction with 5.21.3, and that it's when switching to 5.21.4 that things stop working?

I have already provided the LXD logs (the output of lxc monitor --type=logging). Which other logs would you like me to include?
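
If it helps, I can also pull the daemon-side journal (assuming the snap packaging that MicroCloud sets up), e.g.:

# journalctl -u snap.lxd.daemon --since today

and turn on daemon debug logging first if needed:

# snap set lxd daemon.debug=true
# systemctl reload snap.lxd.daemon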

Thanks for pointing to the source code! It looks like the retries variable is used in a loop to limit the number of connection attempts (with a pause of 10 seconds between them). It does not seem to limit the number of retries within any particular time period.

Yes. I made fresh installs with both 5.21.3 and 5.21.4 several times (without any containers or further changes, straight after the MicroCloud deploy), and it always worked with 5.21.3 and always failed with 5.21.4.

I think the certificate update also failed with 6.5, but I am not 100% certain.