MicroCloud init fails on MicroOVN clustering for 4+ nodes (OpenSSL DSO/AppArmor crash), but manual setup works perfectly

Hi everyone,

I wanted to share a recent experience and open a discussion regarding the scalability and reliability of microcloud init, specifically when handling MicroOVN in a cluster larger than 3 nodes.

The Environment:

  • OS: Tested on both Ubuntu 24.04 LTS and Ubuntu 22.04 LTS (the issue persists identically across both versions).

  • Subscription: All nodes are running with Ubuntu Pro enabled.

  • Hardware: 5 Bare-metal servers (Node 1 to 5).

  • Network: 10G bonded networking for Underlay/Ceph, dedicated VLANs for OVN Uplink.

  • Time Sync: NTP synchronized perfectly across all nodes via Chrony (sub-millisecond offsets).

  • Snaps: All on stable channels, strictly held with --cohort="+": lxd, microceph, microovn, microcloud.

The Issue:

Bootstrapping a 3-node cluster using microcloud init works flawlessly. However, when attempting to scale and join the 4th and 5th nodes, the automation fails entirely during the OVN distributed networking configuration.

The cluster falls into a timeout state, and digging into sudo snap logs microovn reveals a sequence of fatal OpenSSL and AppArmor/Snap confinement errors on the newly joining nodes:

ovsdb-server: EVP_DigestInit_ex failed: error:12800067:DSO support routines::could not load the shared library
ovsdb-tool: ovsdb error: /var/snap/microovn/common/data/central/db/ovnsb_db.db: cannot identify file type
systemd[1]: snap.microovn.ovn-ovsdb-server-nb.service: Main process exited, code=exited, status=1/FAILURE
ovsdb-client: connection attempt failed (No such file or directory)

Because the ovsdb-server crashes abruptly due to the DSO/library load failure, the generated .db files become 0-byte (corrupted), throwing the nodes into a restart-loop zombie state.
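For anyone who wants to check for the same symptom, here is a minimal sketch that lists 0-byte .db files. The directory comes from the ovsdb-tool error above; the helper name find_empty_db_files is made up for illustration, and reading that directory may require root.

```python
from pathlib import Path

# Directory taken from the ovsdb-tool error message in this post.
DB_DIR = Path("/var/snap/microovn/common/data/central/db")


def find_empty_db_files(db_dir: Path) -> list[Path]:
    """Return the *.db files in db_dir that are 0 bytes, sorted by name.

    Hypothetical helper for illustration; it only inspects file sizes and
    never modifies anything.
    """
    if not db_dir.is_dir():
        return []
    return sorted(
        p for p in db_dir.glob("*.db")
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    for path in find_empty_db_files(DB_DIR):
        print(f"empty database file: {path}")
```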

The Plot Twist (Why this isn’t a hardware/OS/network issue):

Initially, we suspected hardware/BIOS compatibility, OS regressions, or network/firewall drops. However, we performed the following workaround:

  1. Purged and reinstalled the snaps on all nodes.

  2. Ran microcloud init but answered no to the "Configure distributed networking?" prompt. LXD and MicroCeph clustered perfectly across all 3 nodes.

  3. Configured LXD, MicroCeph, and MicroOVN manually (microceph cluster add, microovn cluster add, and lxd init on the rest).

Result: The manual OVN clustering worked perfectly across all 5 nodes without a single timeout or OpenSSL crash.

Discussion Points / Questions:

  1. Race Condition / Timeout: Is microcloud enforcing a strict context timeout that is simply too aggressive for OVSDB Raft elections when transitioning past a 3-node quorum?

  2. Snap Confinement Bug: Why does the snap sandbox/AppArmor block OpenSSL library loading (DSO support routines) specifically when triggered by MicroCloud’s automated API calls, but allows it when executed manually via CLI?

  3. Marketing vs Reality: Is MicroCloud strictly designed and tested only for 3-node Edge computing? For 10-20 node deployments in an Enterprise/Ubuntu Pro environment, should the community avoid MicroCloud and default to MAAS/Juju/Ansible for reliable LXD+Ceph+OVN orchestration?

Would love to hear the maintainers’ thoughts or if anyone else has hit this 3-node ceiling with MicroCloud!

@rizkyana Thank you for sharing your experience. I’m the product manager, so I’ll let the engineers answer the technical bits, but I can answer the 3rd point you mention.

MicroCloud is designed and tested on the scale of 1-50 nodes, so if something is happening when running more than 3, that is a bug that must be fixed.

We have many customers running larger MicroClouds in production, and this is the first time I have heard of something like this. MicroCloud is designed to orchestrate the deployment itself, specifically so that you don’t need the additional tooling you describe.

Reporting these kinds of issues is valuable, so that we can discover these kinds of edge use cases and address them.

Hi @mionaalex

Thank you for the prompt response and for clearing that up! It is incredibly reassuring to know that MicroCloud is officially designed and tested for up to 50 nodes out of the box. This gives us the confidence to fully rely on it for our production environment once this specific edge-case is ironed out.

I also realized that I forgot to include the exact Snap versions and revisions we are currently running in my original post. I have listed them below and have also attached a screenshot for the engineering team’s reference:

Just a quick heads-up for the engineering team: because we had a deployment deadline to meet, this 5-node cluster is now fully live in production using the manual cluster add workaround I mentioned earlier. As a result, I won’t be able to tear it down or run disruptive debug commands to reproduce the issue on our end anymore.

Hello, can you please share the exact snap versions, either snap list or snap info micro*.

the automation fails entirely during the OVN distributed networking configuration.

In your case MicroCloud hasn’t yet reached this stage. It failed during the formation of the MicroOVN cluster.
The CoreTokenRecord not found error originates from Microcluster. It indicates that a token is used which isn’t valid. In this case I suspect it expired.

MicroCloud creates those tokens with a lifetime of 5 minutes. If forming the clusters takes longer than that, the Microcluster of each Micro* service will start cleaning them up again. That is what MicroOVN did in your case.

So could you please share some more details about the time it took to form the cluster (peers -05, -02 and -03) until you saw this error message? Can it be that it took >5 mins?
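As a rough way to answer that question from timestamps, here is a minimal sketch. The 5-minute lifetime is the value stated above; the function name and the example timestamps are hypothetical and are not taken from any log in this thread.

```python
from datetime import datetime, timedelta

# Lifetime of MicroCloud join tokens as stated in this thread.
TOKEN_LIFETIME = timedelta(minutes=5)


def token_expired(
    issued_at: datetime,
    used_at: datetime,
    lifetime: timedelta = TOKEN_LIFETIME,
) -> bool:
    """Return True if the token was used after its lifetime had elapsed."""
    return used_at - issued_at > lifetime


if __name__ == "__main__":
    # Hypothetical timestamps: when cluster formation started and when
    # the join error first appeared in the logs.
    started = datetime(2024, 1, 1, 12, 0, 0)
    failed = datetime(2024, 1, 1, 12, 6, 30)
    print("token expired:", token_expired(started, failed))
```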

But let’s get to the bottom of it.
What I see is that you have selected 20 disks to be used as OSDs on all of the members. This results in 100 OSDs to be added to the MicroCloud in total.
Technically that should be fine, but I suspect setting up the 100 OSDs requires more than the five minutes.
First MicroCloud forms the MicroCloud Microcluster, second it will try MicroCeph (together with the OSDs) and MicroOVN in parallel. Last it will do LXD.
This block of four services gets executed for each joining member (in random order). This means that if the MicroCeph setup takes too long in one of the blocks (many OSDs), it will have a knock-on effect on the follow-up blocks (the members still to be joined).
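The knock-on effect can be illustrated with a small timing model. This is a sketch under loud assumptions, not MicroCloud's actual code: it assumes all join tokens are issued at time zero, that the per-member blocks run one after another, and that a token is consumed when the MicroOVN step of a block starts. The JoinBlock type and all durations are hypothetical.

```python
from dataclasses import dataclass

TOKEN_LIFETIME_S = 5 * 60  # 5-minute token lifetime stated above


@dataclass
class JoinBlock:
    """Hypothetical per-member step durations, in seconds."""
    name: str
    microcloud_s: float
    microceph_s: float
    microovn_s: float
    lxd_s: float

    def duration(self) -> float:
        # MicroCeph and MicroOVN run in parallel, so the slower one dominates.
        return self.microcloud_s + max(self.microceph_s, self.microovn_s) + self.lxd_s


def expired_joins(blocks: list[JoinBlock], token_lifetime_s: float = TOKEN_LIFETIME_S) -> list[str]:
    """Names of members whose MicroOVN step would start after token expiry.

    Assumes tokens are issued at t=0 and blocks run sequentially.
    """
    expired = []
    elapsed = 0.0
    for block in blocks:
        ovn_start = elapsed + block.microcloud_s
        if ovn_start > token_lifetime_s:
            expired.append(block.name)
        elapsed += block.duration()
    return expired


def make_blocks(osds_per_member: int, seconds_per_osd: float) -> list[JoinBlock]:
    # Four joining members; every duration here is an illustrative guess.
    return [
        JoinBlock(f"member-{i}", 5, osds_per_member * seconds_per_osd, 10, 10)
        for i in range(1, 5)
    ]


if __name__ == "__main__":
    print(expired_joins(make_blocks(osds_per_member=20, seconds_per_osd=8)))
    print(expired_joins(make_blocks(osds_per_member=5, seconds_per_osd=8)))
```

In this model the later members fail even though no single block exceeds the token lifetime, because the delay accumulates across blocks.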

As a temporary solution, lowering the number of OSDs per member will likely resolve the issue.

The MicroOVN errors are likely an effect of the failed join operation.

Added Increase the join token timeout · Issue #1339 · canonical/microcloud · GitHub

Hi @jpelizaeus

Thank you for looking into this and for the explanation!

Yes, the 5-minute token timeout is very likely the culprit here. We noticed previously that when we tried adding the Ceph disks manually, occasionally one of the disks would throw an error on the first attempt, but re-adding it immediately after would succeed. It is highly probable that the sheer volume of initializing 100 OSDs across the nodes was simply too slow and exceeded that 5-minute window.

By the way, thank you for opening the GitHub issue (#1339) to increase the timeout limit!

I also have a question regarding our network infrastructure and whether it might be exacerbating this delay. We are using 2 NICs configured in an active-backup bond for the cluster network. However, we left the frame MTU at the default 1500 because we currently don’t have administrative access to the physical switches to safely enable Jumbo Frames (MTU 9000).

Could this default 1500 MTU also be a contributing factor to the overall slowness during the Ceph OSD setup and the initial cluster synchronization, pushing it past the 5-minute mark?
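For a rough sense of scale, the framing overhead at each MTU can be estimated as in the sketch below. It assumes TCP over IPv4 without options and untagged Ethernet framing, and it only counts bytes on the wire; per-packet CPU cost, which a larger MTU also reduces, is not modelled. The function name wire_efficiency is made up for illustration.

```python
# Per-frame Ethernet overhead in bytes: 14 header + 4 FCS
# + 8 preamble/SFD + 12 inter-frame gap (untagged frames assumed).
ETH_OVERHEAD = 14 + 4 + 8 + 12
# IPv4 and TCP headers without options (assumption).
IP_TCP_HEADERS = 20 + 20


def wire_efficiency(mtu: int) -> float:
    """Fraction of on-wire bytes that carry TCP payload at the given MTU."""
    payload = mtu - IP_TCP_HEADERS
    return payload / (mtu + ETH_OVERHEAD)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {wire_efficiency(mtu):.2%} payload efficiency")
```

By this byte-count estimate alone, the two MTU settings differ by only a few percentage points.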

Because we had a very strict deadline to go live by a specific date, this cluster is already running in production using the manual workaround. Unfortunately, I’m now limited in providing further access or extracting deeper debug logs, and I completely forgot to take screenshots at the time as we were rushing to meet the deadline.

I hope the details provided so far are still helpful for the team to reproduce the issue.

The fix was merged earlier today in Service: Create join tokens with a lifetime of 1 hour by roosterfish · Pull Request #1341 · canonical/microcloud · GitHub and will be included in the upcoming 2.1.3 release of MicroCloud 2 LTS.

The fix is included in the latest 2.1.3 LTS release https://documentation.ubuntu.com/microcloud/default/reference/release-notes/release-notes-2.1.3/#join-token-expiry-extension.
