LXD 6.5 cluster won't start after power failure

Hey,

I have a four-node cluster on the 6/stable snap channel. We had a power failure last night and one of the nodes (sapphire) now has a broken NIC.

lxc cluster ls
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
|   NAME   |          URL           |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION |   STATE   |            MESSAGE             |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| diamond  | https://10.0.0.54:8443 | database-leader | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
|          |                        | database        |              |                |             |           |                                |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| emerald  | https://10.0.0.52:8443 | database        | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| ruby     | https://10.0.0.53:8443 | database        | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+
| sapphire | https://10.0.0.51:8443 |                 | x86_64       | default        |             | EVACUATED | Unavailable due to maintenance |
+----------+------------------------+-----------------+--------------+----------------+-------------+-----------+--------------------------------+

I think they auto-evacuated because of cluster.healing_threshold, which I set to 15 the other day. I wish I hadn't. I'm unable to restore any of the members; every restore attempt fails with the following error:

root@diamond:/home/vos# lxc cluster restore diamond
Are you sure you want to restore cluster member "diamond"? (yes/no) [default=no]: yes
Error: Failed restoring network: Failed adding OVS chassis "diamond" with priority 11300 to chassis group "lxd-net7": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-add-chassis lxd-net7 diamond 11300: signal: alarm clock (2025-10-10T10:46:32Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))

It seems OVN is the culprit (microovn 24.03/stable):

root@emerald:/home/vos# microovn status

MicroOVN deployment summary:
- diamond (10.0.0.54)
  Services: chassis, switch
- emerald (10.0.0.52)
  Services: central, chassis, switch
- ruby (10.0.0.53)
  Services: central, chassis, switch
- sapphire (10.0.0.51)
  Services: central, chassis, switch
OVN Database summary:

OVN Northbound: Upgrade or attention required!
Currently active schema: 7.3.0
Cluster report (expected schema versions):
        emerald: 7.3.0
        diamond: 7.3.0
        ruby: 7.3.0
        sapphire: Error. Failed to contact member

Error creating OVN Southbound Database summary: failed to get OVN Southbound active schema version
OVN Southbound: Upgrade or attention required!
Currently active schema:
Cluster report (expected schema versions):
        emerald: 20.33.0
        ruby: 20.33.0
        diamond: 20.33.0
        sapphire: Error. Failed to contact member

After the power failure my servers came back online but the whole cluster is down.

It's a new cluster (I'm new to OVN, so I wanted to test extensively before moving production here) with only a few instances that aren't that important. A power failure is something I didn't plan for, but I'm glad it happened now rather than later.

Is the cluster unable to restore a member when another member is down while using OVN? That seems like a massive risk :frowning: . It's not uncommon for hardware to fail after a power failure. I expected LXD to self-heal and simply carry on with 3 out of 4 nodes.

What would've happened if I hadn't configured cluster.healing_threshold? I'm guessing LXD would be up but OVN would still be down.
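
For reference, the threshold in question was set with the standard cluster config key and can be cleared again the same way (shown here just to make it concrete):

lxc config set cluster.healing_threshold 15
# back to the default of 0, which disables automatic healing
lxc config unset cluster.healing_threshold

Here is the LXD startup log from emerald after the power failure: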

Oct 10 09:19:02 emerald lxd.daemon[2430]: - pidfds
Oct 10 09:19:03 emerald lxd.daemon[2185]: => Starting LXD
Oct 10 09:19:03 emerald lxd.daemon[2565]: time="2025-10-10T09:19:03Z" level=warning msg=" - Couldn't find the CGroup network priority controller, per-instance network priority will be ignored. Please use per-device limits.priority instead"
Oct 10 09:19:04 emerald lxd.daemon[2565]: time="2025-10-10T09:19:04Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.52:8443: no known leader"
Oct 10 09:19:04 emerald lxd.daemon[2565]: time="2025-10-10T09:19:04Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.53:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.53:8443\": dial tcp 10.0.0.53:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 1: server 10.0.0.54:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.54:8443\": dial tcp 10.0.0.54:8443: connect: no route to host"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.52:8443: no known leader"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.54:8443: no known leader"
Oct 10 09:19:05 emerald lxd.daemon[2565]: time="2025-10-10T09:19:05Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.53:8443: no known leader"
Oct 10 09:19:08 emerald lxd.daemon[2565]: time="2025-10-10T09:19:08Z" level=warning msg="Dqlite: attempt 2: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:08 emerald lxd.daemon[2565]: time="2025-10-10T09:19:08Z" level=warning msg="Dqlite: attempt 3: server 10.0.0.53:8443: no known leader"
Oct 10 09:19:11 emerald lxd.daemon[2565]: time="2025-10-10T09:19:11Z" level=warning msg="Dqlite: attempt 3: server 10.0.0.51:8443: dial: Failed connecting to HTTP endpoint \"10.0.0.51:8443\": dial tcp 10.0.0.51:8443: connect: no route to host"
Oct 10 09:19:17 emerald lxd.daemon[2565]: time="2025-10-10T09:19:17Z" level=warning msg="Could not notify all nodes of database upgrade" err="Failed to notify peer sapphire at 10.0.0.51:8443: failed to notify node about completed upgrade: Patch \"https://10.0.0.51:8443/internal/database\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Oct 10 09:19:29 emerald lxd.daemon[2565]: time="2025-10-10T09:19:29Z" level=error msg="Failed mounting storage pool" err="Failed to run: rbd --id admin --cluster ceph --pool lxd info lxd_lxd: signal: killed" pool=ceph
Oct 10 09:19:31 emerald lxd.daemon[2565]: time="2025-10-10T09:19:31Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"cloud\" setup: Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb --format=csv --no-headings --data=bare --columns=_uuid,name,acl find port_group name=lxd_net8: exit status 1 (ovn-nbctl: ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641: database connection failed (Connection refused))" network=cloud project=default
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=error msg="Failed initializing network" err="Failed starting: Failed getting port group UUID for network \"infra\" setup: Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb --format=csv --no-headings --data=bare --columns=_uuid,name,acl find port_group name=lxd_net7: exit status 1 (ovn-nbctl: ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641: database connection failed (Connection refused))" network=infra project=default
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=warning msg="cannot set TTL (255) for TCPListener" Err="operation not supported" Key="[2001:db8:3cc4:3::52]:179" Topic=Peer
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=info msg="Add a peer configuration" Key="2001:db8:3cc4:3::1" Topic=Peer
Oct 10 09:19:32 emerald lxd.daemon[2565]: time="2025-10-10T09:19:32Z" level=error msg="No active cluster event listener clients" local="192.168.3.52:8443"
Oct 10 09:19:32 emerald lxd.daemon[2185]: => LXD is ready
Oct 10 09:19:34 emerald lxd.daemon[2565]: time="2025-10-10T09:19:34Z" level=info msg="Peer Up" Key="2001:db8:3cc4:3::1" State=BGP_FSM_OPENCONFIRM Topic=Peer
Oct 10 09:20:44 emerald lxd.daemon[2565]: time="2025-10-10T09:20:44Z" level=error msg="Failed initializing network" err="Failed starting: Failed deleting OVS chassis \"emerald\" from chassis group \"lxd-net8\": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-remove-chassis lxd-net8 emerald: signal: alarm clock (2025-10-10T09:20:44Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock))" network=cloud project=default
Oct 10 09:20:56 emerald lxd.daemon[2565]: time="2025-10-10T09:20:56Z" level=error msg="Failed initializing network" err="Failed starting: Failed deleting OVS chassis \"emerald\" from chassis group \"lxd-net7\": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb ha-chassis-group-remove-chassis lxd-net7 emerald: signal: alarm clock (2025-10-10T09:20:56Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))" network=infra project=default

I'm wondering whether this being a 4-node cluster might be causing quorum/Raft issues, even though 3 out of 4 is still a majority…

The MicroOVN DB service logs on all nodes were complaining about missing certificates/keys. They contained nothing but hundreds of lines like:

ovs|00136|stream_ssl|ERR|Private key must be configured to use SSL
ovs|00137|stream_ssl|ERR|Certificate must be configured to use SSL

I ran microovn certificate reissue all on all three working nodes, and after that the database services were able to start.

After this worked on one node, I forgot to check the state of the certs on the other nodes; in my enthusiasm I quickly ran the same command there as well.
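
In hindsight, the saner order would have been to check the certificate state first and only reissue where something was actually wrong. Something along these lines, assuming the microovn certificates subcommands (the exact spelling may differ per channel):

# inspect what MicroOVN thinks about the local certificates
microovn certificates list
# only if something is reported as missing or expired, reissue everything on this node
microovn certificates reissue all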

Emerald has assigned itself the leader role:

root@emerald:/home/vos# sudo ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
9852
Name: OVN_Northbound
Cluster ID: 894a (894a6969-1c76-4de8-94a3-cf8df00c0e09)
Server ID: 9852 (9852d4b2-5f7f-450a-9585-5854168afd77)
Address: ssl:10.0.0.52:6643
Status: cluster member
Role: leader
Term: 38
Leader: self
Vote: self

Ruby is nicely following:

root@ruby:/home/vos# sudo ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
e7e4
Name: OVN_Northbound
Cluster ID: 894a (894a6969-1c76-4de8-94a3-cf8df00c0e09)
Server ID: e7e4 (e7e45665-8b6d-4855-9e68-c8e7ec417c5d)
Address: ssl:10.0.0.53:6643
Status: cluster member
Role: follower
Term: 38
Leader: 9852
Vote: 9852

And diamond is trying to contact sapphire to join the cluster. That obviously won't work. I wonder why it doesn't try to contact any of the other members.

ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
ea16
Name: OVN_Northbound
Cluster ID: not yet known
Server ID: ea16 (ea168c37-4f7f-4983-8f6c-81097088121d)
Address: ssl:10.0.0.54:6643
Status: joining cluster
Remotes for joining: ssl:10.0.0.51:6643
Role: follower
Term: 0
Leader: unknown
Vote: unknown

After reissuing the certificates, lxc cluster restore failed with a different error:

root@diamond:/home/vos# lxc cluster restore diamond
Are you sure you want to restore cluster member "diamond"? (yes/no) [default=no]: yes
Error: Failed to start instance "c5": Failed to start device "eth0": Failed setting up OVN port: Failed setting DNS for "c5.example.com": Failed to run: ovn-nbctl --timeout=10 --db ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641 -c /proc/self/fd/3 -p /proc/self/fd/4 -C /proc/self/fd/5 --wait=sb set dns 840a561f-1f28-4617-8959-bb9dcd61b030 external_ids:lxd_switch=lxd-net7-ls-int external_ids:lxd_switch_port=lxd-net7-instance-3fc9aab5-d860-4091-95d5-7cf8d3ebcc3a-eth0 records={"c5.example.com"="172.16.0.7 2001:db8:3cc4:1ab0:216:3eff:fe79:fd67" "7.0.16.172.in-addr.arpa"="c5.example.com" "7.6.d.f.9.7.e.f.f.f.e.3.6.1.2.0.0.b.a.1.4.c.c.3.8.b.d.1.0.0.2.ip6.arpa"="c5.example.com"}: signal: alarm clock (2025-10-10T11:15:39Z|00003|fatal_signal|WARN|terminating with signal 14 (Alarm clock))

^ this instance lives on the diamond host.

So I tried restoring the members with --action skip, which got my three working members out of the evacuated state.

I was then able to manually start the instances that reside on emerald, ruby and diamond.
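
Concretely, it was along these lines, using the --action flag on the restore and then starting the instances by hand (c5 is the instance from the error above):

lxc cluster restore emerald --action=skip
lxc cluster restore ruby --action=skip
lxc cluster restore diamond --action=skip
lxc start c5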

Very strange that I had to reissue certificates before MicroOVN would allow the services to start. The cluster was up and running yesterday; it was created on September 25th and the certificates were still valid for another two years:

        Validity
            Not Before: Sep 25 09:39:42 2025 GMT
            Not After : Sep 25 09:39:42 2027 GMT
        Subject: O = MicroOVN, OU = ovnnb, CN = sapphire

Edit:
I was able to replace the NIC on sapphire and the node booted up. (micro)OVN worked straight away. LXD came up without issues. I was able to restore the node and instances started right away.

But I still wonder why it didn't self-heal when a node was missing after a cold cluster start :frowning: . Three out of four should have been enough; that's why we run multiple nodes.

Hi @vosdev, I'm asking the OVN team internally to look into this one. Thanks.

Can you please share which versions of the snaps you are using at the moment? You can run snap list on all of the systems.

Hi @vosdev,
This is an unfortunate situation; MicroOVN (and OVN) are certainly meant to survive power cycling and outages of cluster members (within Raft's fault tolerance of (N-1)/2 failed members, rounded down).

We’d like to get to the bottom of this if you could help us by providing some additional info.

1. Who manages MicroOVN?
Are you using MicroCloud, or is this a plain LXD cluster? If you are using only LXD, you probably deployed MicroOVN manually yourself, right?
If it is a manual setup, do you recall making any configuration changes to MicroOVN after you deployed it?

2. Overall cluster health
You posted output from the ovn-appctl cluster/status command, which tells us part of the story, but unfortunately it lacks the bottom half of the output (the part that shows Servers:). Would you mind reposting the complete cluster/status output? It would help us see what every cluster member thinks the cluster should look like.
Please also include:

  • contents of /var/snap/microovn/common/data/env/ovn.env
  • output of snap services microovn

from each node (see the example commands below).
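
For example, something like this on every node (the Northbound control socket path is the one you already used; I'm assuming the Southbound one follows the usual ovnsb_db.ctl naming):

ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
cat /var/snap/microovn/common/data/env/ovn.env
snap services microovn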

3. OVN central services on the diamond node.
This part is a bit confusing to me. You mention:

And diamond is trying to contact sapphire to join the cluster. That obviously won't work. I wonder why it doesn't try to contact any of the other members.

But from the output of the microovn status:

MicroOVN deployment summary:
- diamond (10.0.0.54)
  Services: chassis, switch
- emerald (10.0.0.52)
  Services: central, chassis, switch
- ruby (10.0.0.53)
  Services: central, chassis, switch
- sapphire (10.0.0.51)
  Services: central, chassis, switch

It shows that the diamond host should not be running any central services.
The one thing I can think of that would put the system in this state is if you manually enabled the microovn.ovn-ovsdb-server-nb, microovn.ovn-ovsdb-server-sb and microovn.ovn-northd snap services. But again, that only applies if you are managing MicroOVN yourself.

4. The timeout of lxc cluster restore
We did have a bug that could be a possible explanation for this. Long story short: if the node that joined the "central" cluster last became the database leader and also acquired the northd database lock, data would stop flowing between the Northbound and Southbound databases.
It has since been fixed and backported to 24.03/stable, but it was not in the stable channel when you experienced this issue.
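
If you want to make sure you are now running with that fix, refreshing the snap on the channel you already track should pull it in:

snap refresh microovn --channel=24.03/stable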

Another suspicious part of that log is that LXD tries to connect to all four nodes, when only three of them should be running the central database according to microovn status. The extra IP address in there wouldn't cause any problems in itself, but it would still be useful to see the outputs of:

  • ovs-vsctl get Open_vSwitch . external_ids:ovn-remote on each node
  • lxc config get network.ovn.northbound_connection on any one node.

5. The missing certificate
This one is a complete mystery to me. The certificates are stored on disk, and the options that specify the paths to them are hard-coded in the wrappers that start the OVN services. Did you observe these errors on all the nodes? Were there also lines like these in the log?

2025-10-30T12:37:12Z|00001|stream_ssl|ERR|SSL_use_certificate_file: error:80000002:system library::No such file or directory
2025-10-30T12:37:12Z|00002|stream_ssl|ERR|SSL_use_PrivateKey_file: error:10080002:BIO routines::system lib
2025-10-30T12:37:12Z|00003|stream_ssl|ERR|failed to load client certificates from /cacert.pem: error:0A080002:SSL routines::system lib

Thanks in advance,
Martin.


Hey Martin, thank you for the very detailed response!

I'll try to answer where I can, but I can't provide output from those commands because the cluster is no longer in that broken state.

1. Who manages MicroOVN?

I'm using LXD and installed MicroOVN manually, the reason being that I want to manage my own Ceph cluster rather than use the more limited MicroCeph.

do you recall making any configuration changes to MicroOVN after you deployed it?

None. Install the snap, create the cluster, point network.ovn.northbound_connection at the IPs, configure the certs under network.ovn.ca_cert/client_cert/client_key, and done.
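
Roughly, the setup was the sketch below: the standard MicroOVN bootstrap/add/join flow plus the LXD config keys mentioned above. The certificate paths are from memory, so treat those as assumptions.

# on every node
snap install microovn
# on the first node
microovn cluster bootstrap
# for each additional node: generate a token on an existing member, then join with it
microovn cluster add <member-name>     # run on an existing member, prints a join token
microovn cluster join <token>          # run on the new node
# point LXD at the Northbound DB and hand it the MicroOVN client certs
lxc config set network.ovn.northbound_connection ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641
lxc config set network.ovn.ca_cert="$(cat /var/snap/microovn/common/data/pki/cacert.pem)"
lxc config set network.ovn.client_cert="$(cat /var/snap/microovn/common/data/pki/client-cert.pem)"
lxc config set network.ovn.client_key="$(cat /var/snap/microovn/common/data/pki/client-privkey.pem)"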

2. Overall cluster health

Unfortunately I no longer have the output from back then. It's been three weeks and I have since restored the cluster, so I cannot supply the output with the Servers: part.

3. OVN central services on the diamond node.

All I could see in the cluster/status output was that the 10.0.0.54 node had Remotes for joining: ssl:10.0.0.51:6643. That won't work, because 10.0.0.51 was the one node that was offline. I'm not familiar enough with OVN, but shouldn't it try to join via multiple nodes, or fall back to another node if the first one fails?

But again, this is only if you are managing the MicroOVN yourself.

I am not :slight_smile: I chose MicroOVN for its plug-and-play simplicity, and I must say that apart from this issue it really has been plug and play! So thank you for that!

4. The timeout of lxc cluster restore

That's a simple one: it's because I configured all 4 MicroOVN nodes in the network.ovn.northbound_connection option :slight_smile: ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.54:6641.

The output from the ovs-vsctl command is "ssl:10.0.0.52:6642,ssl:10.0.0.53:6642,ssl:10.0.0.51:6642" on all nodes.

I’m glad to hear that the additional IP address won’t cause any problems. But perhaps this is valuable information that can be added to the LXD+microovn documentation that I requested. Routed OVN set up guide and microovn guide · Issue #15844 · canonical/lxd · GitHub

Are you sure this won't cause any problems? I configured all 4 nodes, of which only 3 provide the central service. One of them was down, so that leaves 2 out of 4. That is not a majority, so I can understand if LXD is not happy with this. I will remove 10.0.0.54 from network.ovn.northbound_connection to be sure.
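
Something like this, I assume, with just the fourth address dropped from the same key:

lxc config set network.ovn.northbound_connection ssl:10.0.0.51:6641,ssl:10.0.0.52:6641,ssl:10.0.0.53:6641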

5. The missing certificate

I had a working MicroOVN setup. The power failure happened and I was left with a broken system. All I could see in the logs was a certificate/key issue, so I went looking for ways to solve that. My virtualization cluster was down, so I was in a bit of a rush.

Just like you, I cannot think of any reason why these certificates/keys would go missing on all 4 nodes (or why the config suddenly insisted on a certificate/key it couldn't find, because the errors I saw did not mention No such file or directory like your example errors do).

My journal has since rotated (the oldest logs are from October 27), so I cannot check for the errors you mentioned. I don't remember seeing them though, only the ones I posted in my original post.


I’ve created When using MicroOVN LXD should get candidate DB addresses from ovn.env · Issue #16856 · canonical/lxd · GitHub so that LXD can get its OVN central service candidate addresses directly from MicroOVN and avoid needing to populate network.ovn.northbound_connection at all. This way MicroOVN can keep LXD using a fresh set of candidates automatically.


Thanks for the reply @vosdev

This is a perfectly valid way to run things. The only reason I was asking was to eliminate possible sources of the problem.

Just FYI, this shouldn't be necessary AFAIK. LXD should detect the presence of MicroOVN and use the snap interface to get the relevant certificates. I'm not that well versed in LXD, so perhaps @tomp can chime in here, but I don't think the setting hurts if it's set.
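
If you want to double-check that plumbing on a node, listing the snap connections should show whether the lxd and microovn snaps are actually wired together (I have not verified the exact interface name, so this is just the generic check):

snap connections lxd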

The output (along with snap services microovn) would still be valuable for the current state of the cluster. If there's an orphaned OVN database trying to run on one of the nodes that the rest of the cluster doesn't know about, it would be best to disable it. It would also give us an indication of the issue at hand.

Right, but my point was that the central database, along with the local unix socket you pointed the ovn-appctl command at, should not be there at all according to microovn status. That's why seeing the output of cluster/status, snap services microovn and the contents of the /var/snap/microovn/common/data/env/ovn.env files (on all four nodes) would help us get some insight into why the central database is even running on the diamond node.

So this might be going a little bit into the weeds of OVN, but since you asked :smiley:. There are two ways to tell a clustered OVN database process (OVN_Northbound and OVN_Southbound) where to connect:

  • Specify list of all --remote endpoints via cli argument
  • Point to a singular --db-nb-cluster-remote-addr and use the connections table to figure out the list of servers in the cluster.

The first option is very rigid and doesn't allow changing the remotes at runtime; the whole process needs to be restarted with a new set of --remote arguments whenever the cluster changes. That's why we use the second option. It may seem like a single point of failure, but the value of --db-nb-cluster-remote-addr is really used only during the initial joining of the cluster. After that, on service start, the database uses its local copy of the connections table to find its peers. So if you are seeing the "Northbound database" on diamond trying to connect to a single node, it's likely that it never really joined the cluster in the first place.
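
To make that a bit more concrete, here is a minimal sketch using the plain ovsdb-tool commands that sit underneath all of this (not the exact invocations MicroOVN's wrappers use, and the paths are only illustrative):

# first central member: create a brand-new clustered Northbound DB listening on its own address
ovsdb-tool create-cluster /var/lib/ovn/ovnnb_db.db /usr/share/ovn/ovn-nb.ovsschema ssl:10.0.0.52:6643
# subsequent members: join via one (or more) existing members; the remote address is only needed
# for this initial join, afterwards the local copy of the cluster metadata is used to find peers
ovsdb-tool join-cluster /var/lib/ovn/ovnnb_db.db OVN_Northbound ssl:10.0.0.53:6643 ssl:10.0.0.52:6643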

Good, that value is consistent with what MicroOVN thinks about the cluster layout.

It's not a problem for the client commands (e.g. ovn-nbctl, ovn-sbctl) that LXD uses under the hood. The client tries to find the cluster leader and talks only to it, disregarding any non-leaders. So as long as one of the IP addresses in the list hosts the leader, the client is fine.

It would be a problem if the members of the cluster thought the size was supposed to be 4 while 2 of the members were unreachable; that would be beyond Raft's fault tolerance. We will see what the cluster members think from the full cluster/status, but given that you have the cluster working now, my guess is that emerald, ruby and sapphire expect only three members in the cluster. Otherwise the cluster would have stopped accepting writes.

On the topic of certificates: this is the biggest mystery to me and I have no follow-up yet; I'll keep thinking about it. The paths to the certificates are pretty much hard-coded in the service startup scripts, so that leaves only the option that the files were not there.
I could see this happening on diamond, because if MicroOVN doesn't think the central service should be running there, it would not generate service certificates on that node. However, you say that those SSL problems occurred on all nodes.


Here is the relevant part of the LXD snap that does this:

Just FYI, this shouldn't be necessary AFAIK. LXD should detect the presence of MicroOVN and use the snap interface to get the relevant certificates. I'm not that well versed in LXD, so perhaps @tomp can chime in here, but I don't think the setting hurts if it's set.

That is interesting. It didn't work for me at the time because the connection was refused due to missing certs/keys; it worked after I configured them. But perhaps it required a restart of LXD or a reboot to pick up the certs/keys dynamically. Another reason to add documentation on using LXD with MicroOVN :slight_smile:
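
For next time: reloading the LXD daemon without a full reboot might be enough to re-trigger that detection, though I have not verified that this alone would have picked up the certs:

sudo systemctl reload snap.lxd.daemon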

The output (along with snap services microovn) would still be valuable for the current state of the cluster. If there's an orphaned OVN database trying to run on one of the nodes that the rest of the cluster doesn't know about, it would be best to disable it. It would also give us an indication of the issue at hand.

sapphire, emerald and ruby have:

Service                          Startup  Current   Notes
microovn.chassis                 enabled  active    -
microovn.daemon                  enabled  active    -
microovn.ovn-northd              enabled  active    -
microovn.ovn-ovsdb-server-nb     enabled  active    -
microovn.ovn-ovsdb-server-sb     enabled  active    -
microovn.refresh-expiring-certs  enabled  inactive  timer-activated
microovn.switch                  enabled  active    -

For diamond the DB services are disabled:

root@diamond:/home/vos# snap services microovn
Service                          Startup   Current   Notes
microovn.chassis                 enabled   active    -
microovn.daemon                  enabled   active    -
microovn.ovn-northd              disabled  inactive  -
microovn.ovn-ovsdb-server-nb     disabled  inactive  -
microovn.ovn-ovsdb-server-sb     disabled  inactive  -
microovn.refresh-expiring-certs  enabled   inactive  timer-activated
microovn.switch                  enabled   active    -

So that seems to be correct!

Ha. That exact command fails on diamond right now.

root@diamond:/home/vos# ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
2025-11-03T11:30:52Z|00001|unixctl|WARN|failed to connect to /var/snap/microovn/common/run/ovn/ovnnb_db.ctl
ovn-appctl: cannot connect to "/var/snap/microovn/common/run/ovn/ovnnb_db.ctl" (No such file or directory)

Here’s the content of the env file on all 4 nodes:

root@sapphire:/home/vos# cat /var/snap/microovn/common/data/env/ovn.env
# # Generated by MicroOVN, DO NOT EDIT.
OVN_INITIAL_NB="10.0.0.52"
OVN_INITIAL_SB="10.0.0.52"
OVN_NB_CONNECT="ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.51:6641"
OVN_SB_CONNECT="ssl:10.0.0.52:6642,ssl:10.0.0.53:6642,ssl:10.0.0.51:6642"
OVN_LOCAL_IP="10.0.0.51"
root@emerald:/home/vos# cat /var/snap/microovn/common/data/env/ovn.env
# # Generated by MicroOVN, DO NOT EDIT.
OVN_INITIAL_NB="10.0.0.53"
OVN_INITIAL_SB="10.0.0.53"
OVN_NB_CONNECT="ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.51:6641"
OVN_SB_CONNECT="ssl:10.0.0.52:6642,ssl:10.0.0.53:6642,ssl:10.0.0.51:6642"
OVN_LOCAL_IP="10.0.0.52"
root@ruby:/home/vos# cat /var/snap/microovn/common/data/env/ovn.env
# # Generated by MicroOVN, DO NOT EDIT.
OVN_INITIAL_NB="10.0.0.52"
OVN_INITIAL_SB="10.0.0.52"
OVN_NB_CONNECT="ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.51:6641"
OVN_SB_CONNECT="ssl:10.0.0.52:6642,ssl:10.0.0.53:6642,ssl:10.0.0.51:6642"
OVN_LOCAL_IP="10.0.0.53"
root@diamond:/home/vos# cat /var/snap/microovn/common/data/env/ovn.env
# # Generated by MicroOVN, DO NOT EDIT.
OVN_INITIAL_NB="10.0.0.52"
OVN_INITIAL_SB="10.0.0.52"
OVN_NB_CONNECT="ssl:10.0.0.52:6641,ssl:10.0.0.53:6641,ssl:10.0.0.51:6641"
OVN_SB_CONNECT="ssl:10.0.0.52:6642,ssl:10.0.0.53:6642,ssl:10.0.0.51:6642"
OVN_LOCAL_IP="10.0.0.54"

And here is the full cluster/status output from the three nodes that run the central databases:

root@sapphire:/home/vos# ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
68fb
Name: OVN_Northbound
Cluster ID: 894a (894a6969-1c76-4de8-94a3-cf8df00c0e09)
Server ID: 68fb (68fb5af9-4b3f-4cc7-b3a3-1531738ec209)
Address: ssl:10.0.0.51:6643
Status: cluster member
Role: leader
Term: 89
Leader: self
Vote: self

Last Election started 81446864 ms ago, reason: leadership_transfer
Last Election won: 81446860 ms ago
Election timer: 16000
Log: [11036, 11037]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->e7e4 ->9852 <-9852 <-e7e4
Disconnections: 1
Servers:
    e7e4 (e7e4 at ssl:10.0.0.53:6643) next_index=11037 match_index=11036 last msg 3050 ms ago
    68fb (68fb at ssl:10.0.0.51:6643) (self) next_index=11036 match_index=11036
    9852 (9852 at ssl:10.0.0.52:6643) next_index=11037 match_index=11036 last msg 3050 ms ago
root@emerald:/home/vos# ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
9852
Name: OVN_Northbound
Cluster ID: 894a (894a6969-1c76-4de8-94a3-cf8df00c0e09)
Server ID: 9852 (9852d4b2-5f7f-450a-9585-5854168afd77)
Address: ssl:10.0.0.52:6643
Status: cluster member
Role: follower
Term: 89
Leader: 68fb
Vote: 68fb

Last Election started 167873471 ms ago, reason: leadership_transfer
Last Election won: 167873465 ms ago
Election timer: 16000
Log: [11035, 11037]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->e7e4 ->68fb <-68fb <-e7e4
Disconnections: 2
Servers:
    e7e4 (e7e4 at ssl:10.0.0.53:6643) last msg 81455131 ms ago
    68fb (68fb at ssl:10.0.0.51:6643) last msg 1066 ms ago
    9852 (9852 at ssl:10.0.0.52:6643) (self)
root@ruby:/home/vos# ovn-appctl -t /var/snap/microovn/common/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
e7e4
Name: OVN_Northbound
Cluster ID: 894a (894a6969-1c76-4de8-94a3-cf8df00c0e09)
Server ID: e7e4 (e7e45665-8b6d-4855-9e68-c8e7ec417c5d)
Address: ssl:10.0.0.53:6643
Status: cluster member
Role: follower
Term: 89
Leader: 68fb
Vote: 68fb

Last Election started 81892176 ms ago, reason: leadership_transfer
Last Election won: 81892165 ms ago
Election timer: 16000
Log: [11036, 11037]
Entries not yet committed: 0
Entries not yet applied: 0
Connections: ->68fb ->9852 <-68fb <-9852
Disconnections: 0
Servers:
    e7e4 (e7e4 at ssl:10.0.0.53:6643) (self)
    68fb (68fb at ssl:10.0.0.51:6643) last msg 686 ms ago
    9852 (9852 at ssl:10.0.0.52:6643) last msg 81460089 ms ago

Hehe, thank you for the explanation! It makes sense to use the cluster's connections table to keep the peer list dynamic; that allows growing and shrinking the cluster without downtime.

So if you are seeing the “Northbound database” on diamond trying to connect to a singular node, it’s likely that it never really joined the cluster before.

That is scary. I'm glad it is now fixed, at least for my situation. As you can see in the microovn status output in the initial post, it was supposed to be a member of the cluster. As a MicroOVN user, that status output is all I really have, or am supposed to need.

I wish I could provide more information. It is interesting to see that the cluster/status command now fails on diamond because the DB isn't there, so that part seems to be fixed now.

But even with the 4th node in a broken state, there were still 2 out of 3 central nodes online, yet the services failed to start.

Thanks again for your thorough response!