6.5 cluster won't start after power failure

Hey,

Four-node cluster on LXD 6/stable. We had a power failure last night and one of the nodes, sapphire, now has a broken NIC.

lxc cluster ls
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
|   NAME   |          URL           |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |   STATE   |            MESSAGE             |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| diamond  | https://10.0.0.54:8443 | database-leader | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
|          |                        | database        |              |                |             |           |                                |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| emerald  | https://10.0.0.52:8443 | database        | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| ruby     | https://10.0.0.53:8443 | database        | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| sapphire | https://10.0.0.51:8443 |                 | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+

I think they auto-evacuated because of cluster.healing_threshold, which I set to 15 the other day. Wish I hadn’t. I’m unable to restore any of the members; every restore fails with the following error:

root@diamond:/home/vos# lxc cluster restore diamond
Are you sure you want to restore cluster member "diamond"? (yes/no) [default=no]: yes
Error: Failed restoring network: Failed adding OVS chassis "diamond" with priority 11300 to chassis group "lxd-net7": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-add-chassis lxd-net7 diamond 11300: signal: alarm clock (2025-10-10T10:46:32Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))

It seems OVN is the reason (microovn 24.03/stable):

root@emerald:/home/vos# microovn status

MicroOVN deployment summary:
- diamond (10.0.0.54)
  Services: chassis, switch
- emerald (10.0.0.52)
  Services: central, chassis, switch
- ruby (10.0.0.53)
  Services: central, chassis, switch
- sapphire (10.0.0.51)
  Services: central, chassis, switch
OVN Database summary:

OVN Northbound: Upgrade or attention required!
Currently active schema: 7.3.0
Cluster report (expected schema versions):
        emerald: 7.3.0
        diamond: 7.3.0
        ruby: 7.3.0
        sapphire: Error. Failed to contact member

Error creating OVN Southbound Database summary: failed to get OVN Southbound active schema version
OVN Southbound: Upgrade or attention required!
Currently active schema:
Cluster report (expected schema versions):
        emerald: 20.33.0
        ruby: 20.33.0
        diamond: 20.33.0
        sapphire: Error. Failed to contact member

After the power failure my servers came back online, but the whole cluster is down.

It’s a new cluster (I’m new to OVN, so I wanted to test extensively before moving production here) with only a few instances that aren’t that important. A power failure is something I didn’t plan for, but I’m glad it happened now.

Is a cluster unable to recover a cluster member while another member is down when using OVN? That seems like a massive risk :frowning:. It’s not uncommon for hardware to fail after a power failure. I expected LXD to self-heal and simply carry on with 3 out of 4 nodes.

What would’ve happened if I hadn’t configured cluster.healing_threshold? I’m guessing LXD would be up but OVN would still be down.
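If I want to rule the setting out while testing, I assume checking and clearing the key is as simple as this (sketch only; the 15 is the value I had set, and as far as I understand an unset/0 value disables automatic healing):

# check the current value
lxc config get cluster.healing_threshold

# clear it again to disable auto-evacuation while experimenting
lxc config unset cluster.healing_threshold

For context, here’s LXD’s startup log from emerald after the power failure: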

Oct 10 09:19:02 emerald lxd.daemon[2430]: - pidfds
Oct 10 09:19:03 emerald lxd.daemon[2185]: => Starting LXD
Oct 10 09:19:03 emerald lxd.daemon[2565]: time="2025-10-10T09:19:03Z" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
Oct 10 09:19:04 emerald lxd.daemon[2565]: time="2025-10-10T09:19:04Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.52:8443: no known leader"
Oct 10 09:19:04 emerald lxd.daemon[2565]: time="2025-10-10T09:19:04Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.53:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.53:8443\": dial tcp 10.0.0.53:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.54:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.54:8443\": dial tcp 10.0.0.54:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.52:8443: no known leader"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.54:8443: no known leader"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.53:8443: no known leader"
Oct 10 09:19:08 emerald lxd.daemon[2565]: time="2025-10-10T09:19:08Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:08 emerald lxd.daemon[2565]: time="2025-10-10T09:19:08Z" level=warning msg="Dqlite: attempt 3: server 10.0.0.53:8443: no known leader"
Oct 10 09:19:11 emerald lxd.daemon[2565]: time="2025-10-10T09:19:11Z" level=warning msg="Dqlite: attempt 3: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:17 emerald lxd.daemon[2565]: time="2025-10-10T09:19:17Z" level=warning msg="Could not notify all nodes of database upgrade" err="Failed to notify peer sapphire at 10.0.0.51:8443: failed to notify node about completed upgrade: Patch \"https://10.0.0.51:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Oct 10 09:19:29 emerald lxd.daemon[2565]: time="2025-10-10T09:19:29Z" level=error msg="Failed mounting storage pool" err="Failed to run: rbd --id admin --cluster ceph --pool lxd info lxd_lxd: signal: killed" pool=ceph
Oct 10 09:19:31 emerald lxd.daemon[2565]: time="2025-10-10T09:19:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"cloud\" setup: Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb --format=csv --no-headings --data=bare --columns=_uuid,name,acl find port_group name=lxd_net8: exit status 1 (ovn-nbctl: ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641: database connection failed (Connection refused))" network=cloud project=default
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"infra\" setup: Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb --format=csv --no-headings --data=bare --columns=_uuid,name,acl find port_group name=lxd_net7: exit status 1 (ovn-nbctl: ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641: database connection failed (Connection refused))" network=infra project=default
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=warning msg="cannot set TTL (255) for TCPListener" Err="operation not supported" Key="[2001:db8:3cc4:3::52]:179" Topic=Peer
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=info msg="Add a peer configuration" Key="2001:db8:3cc4:3::1" Topic=Peer
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=error msg="No active cluster event listener clients" local="192.168.3.52:8443"
Oct 10 09:19:32 emerald lxd.daemon[2185]: => LXD is ready
Oct 10 09:19:34 emerald lxd.daemon[2565]: time="2025-10-10T09:19:34Z" level=info msg="Peer Up" Key="2001:db8:3cc4:3::1" State=BGP_FSM_OPENCONFIRM Topic=Peer
Oct 10 09:20:44 emerald lxd.daemon[2565]: time="2025-10-10T09:20:44Z" level=error msg="Failed initializing network" err="Failed starting: Failed deleting OVS chassis \"emerald\" from chassis group \"lxd-net8\": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-remove-chassis lxd-net8 emerald: signal: alarm clock (2025-10-10T09:20:44Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock))" network=cloud project=default
Oct 10 09:20:56 emerald lxd.daemon[2565]: time="2025-10-10T09:20:56Z" level=error msg="Failed initializing network" err="Failed starting: Failed deleting OVS chassis \"emerald\" from chassis group \"lxd-net7\": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-remove-chassis lxd-net7 emerald: signal: alarm clock (2025-10-10T09:20:56Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))" network=infra project=default

I’m thinking this being a 4-node cluster might be causing quorum/raft issues? Even though 3 out of 4 is still a majority…
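To see which members actually hold database roles, the ROLES column of lxc cluster ls should be enough; I believe the recovery tooling can also dump the raft configuration directly, but treat the exact subcommand below as an assumption on my part:

# which members are dqlite voters/stand-bys (run on any member)
lxc cluster ls

# lower-level view of the dqlite/raft members (assumption: available as shown with the snap)
sudo lxd cluster list-database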

The microovn db service logs on all nodes were complaining about missing certificates/keys. They contained nothing other than hundreds of lines like:

ovs|00136|stream_ssl|ERR|Private key must be configured to use SSL
ovs|00137|stream_ssl|ERR|Certificate must be configured to use SSL
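For anyone chasing the same symptom, the snap journal is where I’d expect to find those lines:

# tail the logs of all MicroOVN snap services
sudo snap logs microovn -n 200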

I issued microovn certificate reissue all on all three working nodes, and now the database services are able to start.

After this worked on one node, I forgot to check the state of the certs on the other nodes; in my enthusiasm I quickly executed the same command there as well.

Emerald has assigned itself the leader role:

root@emerald:/home/vos# sudo ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
9852
Name: OVN_Northbound
Cluster ID: 894a (894a6969-1c76-4de8-94a3-cf8df00c0e09)
Server ID: 9852 (9852d4b2-5f7f-450a-9585-5854168afd77)
Address: ssl:10.0.0.52:6643
Status: cluster member
Role: leader
Term: 38
Leader: self
Vote: self

Ruby is following nicely:

root@ruby:/home/vos# sudo ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
e7e4
Name: OVN_Northbound
Cluster ID: 894a (894a6969-1c76-4de8-94a3-cf8df00c0e09)
Server ID: e7e4 (e7e45665-8b6d-4855-9e68-c8e7ec417c5d)
Address: ssl:10.0.0.53:6643
Status: cluster member
Role: follower
Term: 38
Leader: 9852
Vote: 9852

And diamond is trying to contact sapphire to join the cluster. That obviously won’t work. I wonder why it doesn’t try to contact any of the other members.

ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
ea16
Name: OVN_Northbound
Cluster ID: not yet known
Server ID: ea16 (ea168c37-4f7f-4983-8f6c-81097088121d)
Address: ssl:10.0.0.54:6643
Status: joining cluster
Remotes for joining: ssl:10.0.0.51:6643
Role: follower
Term: 0
Leader: unknown
Vote: unknown
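The Southbound database can presumably be checked the same way; I’m assuming the control socket path mirrors the Northbound one in the MicroOVN snap:

# same status check against the Southbound database (socket path assumed analogous to ovnnb_db.ctl)
sudo ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound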

After reissuing the certificates, lxc cluster restore failed with a different error:

root@diamond:/home/vos# lxc cluster restore diamond
Are you sure you want to restore cluster member "diamond"? (yes/no) [default=no]: yes
Error: Failed to start instance "c5": Failed to start device "eth0": Failed setting up OVN port: Failed setting DNS for "c5.example.com": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb set dns 840a561f-1f28-4617-8959-bb9dcd61b030 external_ids:lxd_switch=lxd-net7-ls-int external_ids:lxd_switch_port=lxd-net7-instance-3fc9aab5-d860-4091-95d5-7cf8d3ebcc3a-eth0 records={"c5.example.com"="172.16.0.7 2001:db8:3cc4:1ab0:216:3eff:fe79:fd67" "7.0.16.172.in-addr.arpa"="c5.example.com" "7.6.d.f.9.7.e.f.f.f.e.3.6.1.2.0.0.b.a.1.4.c.c.3.8.b.d.1.0.0.2.ip6.arpa"="c5.example.com"}: signal: alarm clock (2025-10-10T11:15:39Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))

^ this instance lives on the diamond host.

So I tried to restore the members with --action skip, and that worked to get my three working members out of the evacuated state.

I was then able to manually start the instances that reside on emerald, ruby and diamond.
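Roughly what I ran, for anyone hitting the same thing (the instance name is just one of mine, as an example):

# bring each reachable member out of EVACUATED without touching its instances
lxc cluster restore diamond --action skip
lxc cluster restore emerald --action skip
lxc cluster restore ruby --action skip

# then start the instances by hand, e.g.
lxc start c5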

Very strange that I had to reissue certificates before the microovn services would start. The cluster was up and running yesterday; it was created on September 25th and the certificates were still valid for another two years.

        Validity
            Not Before: Sep 25 09:39:42 2025 GMT
            Not After : Sep 25 09:39:42 2027 GMT
        Subject: O = MicroOVN, OU = ovnnb, CN = sapphire
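That snippet is the kind of thing openssl prints when inspecting the issued cert; something like the command below should reproduce the check, although the file path is only my assumption of where MicroOVN keeps its PKI, so adjust it for your system:

# print subject and validity dates of a MicroOVN-issued certificate
# (the path is an assumption, not a documented location)
sudo openssl x509 -noout -subject -dates \
    -in /var/snap/microovn/common/data/pki/ovnnb-cert.pem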

Edit:
I was able to replace the NIC on sapphire and the node booted up. (micro)OVN worked straight away, LXD came up without issues, and I was able to restore the node; its instances started right away.

But I still wonder why it didn’t self-heal when a node was missing after a cold cluster startup :frowning:. 3 out of 4 should have been enough; that’s why we have multiple nodes.