Hey,
I'm running a 4-node cluster on 6/stable. We had a power failure last night and one of the nodes (sapphire) now has a broken NIC.
lxc cluster ls
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
|   NAME   |          URL           |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |   STATE   |            MESSAGE             |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| diamond  | https://10.0.0.54:8443 | database-leader | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
|          |                        | database        |              |                |             |           |                                |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| emerald  | https://10.0.0.52:8443 | database        | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| ruby     | https://10.0.0.53:8443 | database        | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| sapphire | https://10.0.0.51:8443 |                 | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
I think they auto-evacuated because of the cluster.healing_threshold I set to 15 the other day (wish I hadn't). I'm unable to restore any of the members; every restore fails with the following error:
root@diamond:/home/vos# lxc cluster restore diamond
Are you sure you want to restore cluster member "diamond"? (yes/no) [default=no]: yes
Error: Failed restoring network: Failed adding OVS chassis "diamond" with priority 11300 to chassis group "lxd-net7": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-add-chassis lxd-net7 diamond 11300: signal: alarm clock (2025-10-10T10:46:32Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))
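For reference, this is the setting I mentioned above and how I've cleared it again while I debug (plain lxc config commands; 15 is just the value I had picked):

# how I originally enabled automatic healing (offline threshold before members get evacuated)
lxc config set cluster.healing_threshold 15

# check the current value, then unset it to disable automatic healing again
lxc config get cluster.healing_threshold
lxc config unset cluster.healing_threshold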
OVN seems to be the culprit (microovn 24.03/stable):
root@emerald:/home/vos# microovn status
MicroOVN deployment summary:
- diamond (10.0.0.54)
  Services: chassis, switch
- emerald (10.0.0.52)
  Services: central, chassis, switch
- ruby (10.0.0.53)
  Services: central, chassis, switch
- sapphire (10.0.0.51)
  Services: central, chassis, switch
OVN Database summary:
OVN Northbound: Upgrade or attention required!
  Currently active schema: 7.3.0
  Cluster report (expected schema versions):
    emerald: 7.3.0
    diamond: 7.3.0
    ruby: 7.3.0
    sapphire: Error. Failed to contact member
Error creating OVN Southbound Database summary: failed to get OVN Southbound active schema version
OVN Southbound: Upgrade or attention required!
  Currently active schema:
  Cluster report (expected schema versions):
    emerald: 20.33.0
    ruby: 20.33.0
    diamond: 20.33.0
    sapphire: Error. Failed to contact member
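In case it helps, this is roughly how I plan to check whether the NB/SB databases still have Raft quorum on the two surviving central members. I'm assuming the MicroOVN snap exposes the usual microovn.ovn-appctl wrapper, and the socket paths below are a guess, hence the find first:

# locate the actual database control sockets shipped by the MicroOVN snap
find /var/snap/microovn -name 'ovn*_db.ctl'

# ask each database for its Raft state; a healthy cluster reports a leader and a connected majority
microovn.ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
microovn.ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound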
After the power failure my servers came back online, but the whole cluster is down.
It's a new cluster (I'm new to OVN, so I wanted to test it thoroughly before moving production onto it) with only a few instances that aren't important. A power failure wasn't something I had planned to test for, but I'm glad it happened now rather than later.
Is a cluster unable to recover a cluster member if another member is down when using OVN? That seems like a massive risk. It's not uncommon for hardware to fail after a power failure. I expected LXD to self-heal and simply continue running with 3 out of 4 nodes.
What would have happened if I hadn't configured cluster.healing_threshold? I'm guessing LXD would be up but OVN would still be down. For reference, here is the LXD startup log from emerald after the power failure:
Oct 10 09:19:02 emerald lxd.daemon[2430]: - pidfds
Oct 10 09:19:03 emerald lxd.daemon[2185]: => Starting LXD
Oct 10 09:19:03 emerald lxd.daemon[2565]: time="2025-10-10T09:19:03Z" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
Oct 10 09:19:04 emerald lxd.daemon[2565]: time="2025-10-10T09:19:04Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.52:8443: no known leader"
Oct 10 09:19:04 emerald lxd.daemon[2565]: time="2025-10-10T09:19:04Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.53:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.53:8443\": dial tcp 10.0.0.53:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.54:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.54:8443\": dial tcp 10.0.0.54:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.52:8443: no known leader"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.54:8443: no known leader"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.53:8443: no known leader"
Oct 10 09:19:08 emerald lxd.daemon[2565]: time="2025-10-10T09:19:08Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:08 emerald lxd.daemon[2565]: time="2025-10-10T09:19:08Z" level=warning msg="Dqlite: attempt 3: server 10.0.0.53:8443: no known leader"
Oct 10 09:19:11 emerald lxd.daemon[2565]: time="2025-10-10T09:19:11Z" level=warning msg="Dqlite: attempt 3: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:17 emerald lxd.daemon[2565]: time="2025-10-10T09:19:17Z" level=warning msg="Could not notify all nodes of database upgrade" err="Failed to notify peer sapphire at 10.0.0.51:8443: failed to notify node about completed upgrade: Patch \"https://10.0.0.51:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Oct 10 09:19:29 emerald lxd.daemon[2565]: time="2025-10-10T09:19:29Z" level=error msg="Failed mounting storage pool" err="Failed to run: rbd --id admin --cluster ceph --pool lxd info lxd_lxd: signal: killed" pool=ceph
Oct 10 09:19:31 emerald lxd.daemon[2565]: time="2025-10-10T09:19:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"cloud\" setup: Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb --format=csv --no-headings --data=bare --columns=_uuid,name,acl find port_group name=lxd_net8: exit status 1 (ovn-nbctl: ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641: database connection failed (Connection refused))" network=cloud project=default
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"infra\" setup: Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb --format=csv --no-headings --data=bare --columns=_uuid,name,acl find port_group name=lxd_net7: exit status 1 (ovn-nbctl: ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641: database connection failed (Connection refused))" network=infra project=default
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=warning msg="cannot set TTL (255) for TCPListener" Err="operation not supported" Key="[2001:db8:3cc4:3::52]:179" Topic=Peer
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=info msg="Add a peer configuration" Key="2001:db8:3cc4:3::1" Topic=Peer
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=error msg="No active cluster event listener clients" local="192.168.3.52:8443"
Oct 10 09:19:32 emerald lxd.daemon[2185]: => LXD is ready
Oct 10 09:19:34 emerald lxd.daemon[2565]: time="2025-10-10T09:19:34Z" level=info msg="Peer Up" Key="2001:db8:3cc4:3::1" State=BGP_FSM_OPENCONFIRM Topic=Peer
Oct 10 09:20:44 emerald lxd.daemon[2565]: time="2025-10-10T09:20:44Z" level=error msg="Failed initializing network" err="Failed starting: Failed deleting OVS chassis \"emerald\" from chassis group \"lxd-net8\": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-remove-chassis lxd-net8 emerald: signal: alarm clock (2025-10-10T09:20:44Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock))" network=cloud project=default
Oct 10 09:20:56 emerald lxd.daemon[2565]: time="2025-10-10T09:20:56Z" level=error msg="Failed initializing network" err="Failed starting: Failed deleting OVS chassis \"emerald\" from chassis group \"lxd-net7\": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-remove-chassis lxd-net7 emerald: signal: alarm clock (2025-10-10T09:20:56Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))" network=infra project=default
I'm also wondering whether this being a 4-node cluster is causing quorum/Raft issues, even though 3 out of 4 is still a majority…
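To narrow down where the quorum actually sits, this is my reading of the outputs above (please correct me if I'm wrong):

# LXD side: only diamond, emerald and ruby carry a database role in lxc cluster ls,
# so dqlite should still have its default 3 voters even with sapphire gone
lxc config get cluster.max_voters

# OVN side: 'central' runs on emerald, ruby and sapphire only, i.e. a 3-member
# NB/SB Raft cluster; with sapphire dead that's 2 of 3, still a majority on paper,
# which is why the cluster/status check further up should tell the real story
microovn status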