Add support for trusted non-cluster members

Project microcluster
Status In Review
Author @masnax
Internal ID LX071

Abstract

This adds the concept of cluster roles to a microcluster. A role can be one of cluster or non-cluster, and will determine whether a microcluster daemon will join the dqlite cluster, or simply serve the microcluster public REST API (/cluster/1.0 and /1.0).

Rationale

Both the database and API of microcluster are highly extensible, but both currently have to be set up on every node of the cluster. To facilitate large-scale deployments of 500+ systems without the added overhead on dqlite, microcluster should be able to serve only the REST API on a node, without including that node in the dqlite cluster. The dqlite cluster can continue to keep a record of these nodes for establishing and revoking trust.

The dqlite database allows direct database actions from any node, and keeps the nodes’ database-dependent internal state in sync. In many cases database actions are not necessary on every node after it joins the cluster, as a node may simply serve another service with a custom REST API on top. On these nodes, we can serve just the REST API, with the option to later promote the node to join dqlite.

Specification

Design

When a node joins an existing microcluster, it serves the REST API with the given API extensions, and joins the dqlite cluster. We can introduce the concept of a role to determine whether it should be included in the dqlite cluster. This role can be included in the local truststore YAML files as one of:

  • cluster for a node that should join dqlite, and be considered a dqlite cluster member
  • non-cluster for a node that does not set up dqlite and is not considered a dqlite cluster member, but is trusted to communicate with the cluster over the REST API.

A microcluster establishes trust with cluster members by recording information about them locally in the state directory under /state/truststore/node01.yaml. We can add a role like so:

name: node01
address: 127.0.0.1:9000
role: cluster     # or non-cluster
certificate: |
  ...
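
As an illustration, the role could be modelled in a trust package roughly as follows. This is only a sketch: the Cluster and NonCluster values match the trust.Cluster and trust.NonCluster constants referenced in the Interface section below, but everything else here is an assumption.

// Sketch of a possible Role type; only the Cluster and NonCluster values are
// taken from this spec, the rest is illustrative.
package trust

import "fmt"

// Role determines whether a trusted node joins dqlite or only serves the REST API.
type Role string

const (
	// Cluster marks a node that joins dqlite and is a full cluster member.
	Cluster Role = "cluster"

	// NonCluster marks a node that is trusted over the REST API but does not
	// set up dqlite.
	NonCluster Role = "non-cluster"
)

// Validate returns an error for unknown role values.
func (r Role) Validate() error {
	switch r {
	case Cluster, NonCluster:
		return nil
	default:
		return fmt.Errorf("invalid cluster role %q", r)
	}
}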

Interface

  • state.Remotes() --> state.Remotes(trust.Role):

    • Updates this function to take a trust.Role argument, returning the set of cluster members with the given role (trust.Cluster or trust.NonCluster).
  • microcluster.IssueToken(string) --> microcluster.IssueToken(string, bool)

    • Updates this function to take a boolean value indicating whether the node should be considered a non-cluster member, in addition to the node name to issue the token for. When the token is used by the node to join the cluster, the cluster will detect that the token is authorized for a particular role, and set the node up accordingly.
  • client.UpgradeClusterMember(ctx, types.ClusterMemberUpgrade)

    • Upgrades the non-cluster member with the given args to a cluster member. Any node, whether cluster or non-cluster, including the target itself, can request that a non-cluster node be upgraded (see the usage sketch below).
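
The sketch below shows how a consuming project might call the updated functions. The function names and argument lists follow the spec above, but the import paths, receiver types and surrounding plumbing are assumptions and may differ from the final implementation.

// Hypothetical usage of the updated interface; only the function names and
// argument lists come from this spec.
package example

import (
	"context"

	"github.com/canonical/microcluster/client"
	"github.com/canonical/microcluster/microcluster"
	"github.com/canonical/microcluster/rest/types"
)

// addThenUpgrade issues a join token for a non-cluster member and later
// promotes that member to a full dqlite member.
func addThenUpgrade(ctx context.Context, app *microcluster.MicroCluster, c *client.Client, name string) error {
	// The second argument marks the token as authorized for the non-cluster role.
	token, err := app.IssueToken(name, true)
	if err != nil {
		return err
	}

	// The token is handed to the joining node out of band (e.g. via the CLI).
	_ = token

	// Later, any trusted node (including the target itself) can request the
	// upgrade to a full cluster member.
	return c.UpgradeClusterMember(ctx, types.ClusterMemberUpgrade{Name: name})
}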

For the CLI, the recommended implementation will be:

# Adds a `--non-cluster` flag. This will also be reported in `microctl token list`
token=$(microctl token add node02 --non-cluster --state-dir ${node01_dir})   

# The join process is unchanged, as the role information is stored in the token.
microctl init --join ${token} --state-dir ${node02_dir}

# Upgrading the node to `cluster` role will add it to the dqlite cluster and fire the appropriate hooks.
microctl cluster upgrade node02 --state-dir ${node02_dir} # or ${node01_dir}

API Upgrades

The default role will be cluster. Existing systems upgrading from a version prior to this change will have their missing roles treated as cluster, since these nodes will all have dqlite already set up. After receiving the corresponding schema update, they will update the truststore schema to include the cluster role.

When an already-initialized non-cluster member is restarted, it will automatically reach out to a cluster member (one of the few times it does so) to determine whether it needs to wait for all other cluster and non-cluster members to upgrade their API versions, or whether the node itself still needs upgrading.
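
As a rough sketch of the truststore side of this upgrade, loading an entry could treat a missing role as cluster. The yaml field names match the truststore example above; the struct and function are otherwise illustrative.

// Sketch of defaulting the role for truststore entries written before this
// change; pre-existing members are all dqlite members, so they map to cluster.
package example

import "gopkg.in/yaml.v3"

type truststoreEntry struct {
	Name        string `yaml:"name"`
	Address     string `yaml:"address"`
	Role        string `yaml:"role"`
	Certificate string `yaml:"certificate"`
}

func parseTruststoreEntry(data []byte) (*truststoreEntry, error) {
	entry := &truststoreEntry{}
	if err := yaml.Unmarshal(data, entry); err != nil {
		return nil, err
	}

	// Entries from before this change have no role field; treat them as
	// cluster members since they already have dqlite set up.
	if entry.Role == "" {
		entry.Role = "cluster"
	}

	return entry, nil
}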

REST API

  • PUT /cluster/1.0/cluster/{name}/upgrade will be added. This endpoint elevates the non-cluster member with the specified name to cluster status and adds it to dqlite. It can be run on any node, whether a cluster or non-cluster member, as long as it targets a non-cluster member. The request is forwarded to the node to be upgraded, which runs its PreUpgrade hook before forwarding the request on to a random cluster member, which in turn forwards it to the dqlite leader. The leader prepares the existing cluster for the incoming node, which then attempts to join dqlite. At this point, all cluster members run their OnUpgradedMember hooks, and the newly upgraded node runs its PostUpgrade hook.

  • GET /cluster/1.0/cluster will have a role={roleName} query parameter indicating which role of cluster members to fetch; the default will be cluster. If the role is non-cluster, the members’ connection status will not be updated (see the example request after this list).

  • PUT /cluster/1.0/cluster will have role={roleName} and upgrade={boolean} query parameters to indicate what role the registered cluster member should have, and whether it should be re-registered as a cluster role (if upgrade=true).

  • POST /cluster/1.0/cluster will have an upgrade={boolean} query parameter to indicate whether the node should be added or upgraded.

  • POST /cluster/1.0/tokens will now accept payloads with the Role field set, to determine the role of the token to generate.
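
For illustration, a raw request for the list of non-cluster members could be built as follows. This uses plain net/http for clarity; a real caller would go through the microcluster client package, and TLS and certificate trust setup is omitted.

// Sketch of a GET /cluster/1.0/cluster request using the role query parameter.
package example

import (
	"context"
	"net/http"
	"net/url"
)

func listNonClusterMembers(ctx context.Context, baseURL string) (*http.Response, error) {
	u, err := url.Parse(baseURL + "/cluster/1.0/cluster")
	if err != nil {
		return nil, err
	}

	// role=non-cluster selects non-cluster members; their connection status is
	// not refreshed as part of this request.
	query := u.Query()
	query.Set("role", "non-cluster")
	u.RawQuery = query.Encode()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, err
	}

	return http.DefaultClient.Do(req)
}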

Non-cluster members will be trusted over the /cluster/1.0 (internal) and /1.0 (external) public APIs. If an endpoint in this set requires database access, it can handle this in one of two ways:

  • Manually handle the endpoint, checking state.Role() to change the behaviour based on the cluster role of the node (see the sketch after this list).
  • Use access.AllowClusterMembers or access.AllowNonClusterMembers as the rest.EndpointAction.AccessHandler to restrict access to the endpoint method to a particular role.
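
A rough sketch of the first option (the manual check) is shown below. state.Role() and the role names come from this spec; the handler shape, package path and error text are assumptions. The second option would instead simply set access.AllowClusterMembers (or access.AllowNonClusterMembers) as the endpoint method's AccessHandler.

// Sketch of manually gating an endpoint on the node's cluster role; the
// proposed state.Role() accessor is assumed to return a comparable role value.
package example

import (
	"net/http"

	"github.com/canonical/microcluster/state"
)

// handleDatabaseWrite is a hypothetical handler for an endpoint that needs the
// dqlite database and therefore only makes sense on cluster members.
func handleDatabaseWrite(s *state.State, w http.ResponseWriter, r *http.Request) {
	// Non-cluster members have no dqlite; reject early rather than failing at
	// the first database call.
	if s.Role() != "cluster" {
		http.Error(w, "endpoint requires a cluster member", http.StatusForbidden)
		return
	}

	// ...cluster members carry on with the database transaction here...
}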

By default, only cluster members will be able to use DELETE /cluster/1.0/cluster/{name}. All other public endpoints will be accessible by non-cluster members. POST and PUT /cluster/1.0/cluster will be re-used for upgrade purposes by non-cluster member nodes.

Modified and new API types:

// ClusterMemberUpgrade represents information about upgrading a non-cluster member to a dqlite member.
type ClusterMemberUpgrade struct {
  Name string
  SchemaVersion int
  InitConfig map[string]string
}

// Role is a string representing the cluster role for the token.
types.TokenRecord.Role
types.Token.Role

// NonClusterMembers is a []ClusterMemberLocal representing the list of non-cluster members that the joining node should record locally in its truststore.
types.TokenResponse.NonClusterMembers 

Database

In addition to the truststore update above, this adds an internal_non_cluster_members table, similar to internal_cluster_members, to keep track of the API versions of non-cluster members and to facilitate joining. The internal_token_records table will gain a new role column.

-- SQLite requires a default when adding a NOT NULL column; pre-existing tokens default to the cluster role.
ALTER TABLE internal_token_records ADD COLUMN role TEXT NOT NULL DEFAULT 'cluster';

CREATE TABLE internal_non_cluster_members (
  id                       INTEGER  PRIMARY KEY AUTOINCREMENT NOT NULL,
  name                     TEXT     NOT NULL,
  address                  TEXT     NOT NULL,
  certificate              TEXT     NOT NULL,
  internal_api_extensions  TEXT,
  external_api_extensions  TEXT,
  UNIQUE(name),
  UNIQUE(certificate)
);

Action Hooks

Microcluster supports various action hooks that are specified by the project using microcluster, and are run at specific times during cluster creation.

  • cluster member only hooks:

    • PreBootstrap and PostBootstrap are only run on the bootstrapping node, which is necessarily a cluster member.
    • OnHeartbeat requires dqlite and will only run on cluster members.
    • OnNewMember will run only on pre-existing cluster members, even when non-cluster members join the cluster.
    • PostRemove will run only on pre-existing cluster members, even when a non-cluster member is removed from the cluster.
    • OnUpgradedMember is a new hook that will run only on pre-existing cluster members when a non-cluster member is upgraded to cluster.
    • PostUpgrade is a new hook that runs on a newly upgraded cluster member, which was previously a non-cluster member (see the registration sketch after this list).
  • non-cluster member only hooks:

    • PreUpgrade is a new hook that runs only on a non-cluster member before it is upgraded to cluster member.
  • role-agnostic hooks:

    • OnStart runs on a node when the daemon starts, regardless of cluster role.
    • PreRemove runs on a node before it is removed from the cluster, regardless of its cluster role.
    • PreJoin runs on a node before it joins the cluster, regardless of its cluster role.
    • PostJoin runs on a node after it first joins the cluster, regardless of its cluster role.
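
The sketch below shows how a project might wire up the new hooks when configuring its daemon. The hook names come from the list above; the config.Hooks type exists in microcluster today, but the exact signatures of these new hooks are assumptions.

// Sketch of registering the proposed upgrade-related hooks; signatures are
// assumed to match the existing func(*state.State) error hook shape.
package example

import (
	"github.com/canonical/microcluster/config"
	"github.com/canonical/microcluster/state"
)

func upgradeHooks() *config.Hooks {
	return &config.Hooks{
		// Runs on the non-cluster member itself, before it joins dqlite.
		PreUpgrade: func(s *state.State) error {
			// e.g. stop or reconfigure role-specific services.
			return nil
		},

		// Runs on the newly upgraded member once it has joined dqlite.
		PostUpgrade: func(s *state.State) error {
			return nil
		},

		// Runs on every pre-existing cluster member when a non-cluster member
		// is upgraded.
		OnUpgradedMember: func(s *state.State) error {
			return nil
		},
	}
}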

The pull request implementing this feature can be found here:
https://github.com/canonical/microcluster/pull/77


Interesting, thanks for this spec @masnax

I was wondering about the scalability of dqlite – you mention 500+ nodes in the “Rationale” section, would you say that dqlite scales up to this number of nodes?

And a second question, aiui from the spec non-cluster nodes would not be able to perform database queries, is that right? Would a non-cluster node then access DB data only via an API?

cheers,
peter.

I haven’t stress-tested dqlite myself (perhaps it might be useful to spin up something large on PS6 or testflinger), but I recall that from some large scale tests of LXD during NorthSec that dqlite began to run into issues at around 100 cluster members. @colemiller do you have any input here? (question is how dqlite handles large-scale (500+) deployments)

Regarding your second question about db access by non-cluster nodes, they would indeed have no direct database access. An endpoint that requires database access would have to do one of the following (a rough sketch of the forwarding option follows the list):

  • Detect the cluster role and forward the request to a full cluster member with dqlite enabled.
  • Detect the cluster role and reject the request outright, based on a flag set on the endpoint handler for that endpoint or REST method.
  • Do neither of the above, in which case the request would return a descriptive error at the point where it tries to access the database.
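
To make the first option concrete, forwarding could look roughly like the sketch below, which proxies the request to a known cluster member. Member selection, TLS client certificates and error handling are left out; none of this is part of the current implementation.

// Sketch of a non-cluster member forwarding a database-backed request to a
// full cluster member instead of handling it locally.
package example

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

func forwardToClusterMember(w http.ResponseWriter, r *http.Request, memberAddress string) error {
	target, err := url.Parse("https://" + memberAddress)
	if err != nil {
		return err
	}

	// Rewrites the request URL to the chosen cluster member and streams the
	// response back to the original caller. A real implementation would also
	// configure the proxy transport with the cluster TLS certificates.
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.ServeHTTP(w, r)

	return nil
}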

I would prefer to avoid anything too obscure and/or complicated, like a special mechanism that automatically forwards any attempt to access the database to a dqlite-enabled node. From an outside view the non-cluster member would then have the same functionality as a full cluster member, but this would reintroduce overhead in the form of potentially high dqlite traffic concentrated on a smaller subset of nodes, as well as the added delay of forwarding each request.

The non-cluster nodes will have the option to upgrade and join dqlite if they need to manage the database, but should otherwise be discouraged from frequently reaching out to a cluster member for database access.

I have not seen the results of the NorthSec testing but would be very interested in seeing what goes wrong at that scale! (i.e. +1 to spinning up a big LXD cluster at some point to see what happens.) The more nodes you have the greater the chance that some of them are misbehaving at any given time (slow connection to the leader, etc.) Of course dqlite is designed to gracefully handle this kind of misbehavior from up to 50% of voting members but I could certainly believe that testing with hundreds of nodes exposed a pathological case that we aren’t aware of yet. Maybe the leader is sending a lot of snapshots (a slow and resource-intensive operation)? It would be useful to know how many voters/standbys/spares there were in this huge cluster.

Regarding the actual proposal here, I have not yet read it in detail, but I wonder if we could make some changes to go-dqlite and/or dqlite that would solve the problem you’re tackling without requiring an extra “role” concept that’s layered atop dqlite’s voter/standby/spare distinction? Hopefully this would be a win for simplicity. For example, if we could add a knob to go-dqlite that would prevent a given spare node from forwarding requests to the cluster leader, would that be adequate? (Sorry if I’m oversimplifying.)


@masnax @chrome0 do these non-cluster members require DB access?
Has this been factored into the design (& associated implementation)?

@masnax I’m aware this was requested for microceph, could you please give a concrete example in the rationale section?

My understanding is as follows:

  1. Microcluster allows easy bootstrapping of microceph.
  2. Ceph itself can grow to hundreds of nodes but dqlite can’t.
  3. Some API endpoints being served by microceph (via the microcluster package) do not require access to the database (presumably they interact with ceph directly).
  4. By adding non-cluster “members”, we can serve non-DB related APIs on those “members”. This would allow ceph to scale.
  5. The dqlite members then essentially become a HA management layer for the full ceph deployment.

If that’s correct, what happens if a non-cluster “member” goes down? Do we have health checks?

Regarding health checks: I didn’t include them in the spec, but it would be trivial to include a scan of non-cluster nodes as part of the heartbeat mechanism of microcluster. Any such health check would have to be managed by the dqlite members.

Regarding your request for a concrete example, I’m not familiar enough with ceph to present an example that would fully capture what @chrome0 and @utkarshbhatthere have in mind.

@chrome0 @utkarshbhatthere could we please have a rough outline of how you expect to use these non-dqlite cluster members?

Adding a sort of low-overhead option to spare dqlite nodes does sound like a promising alternative at a glance, I’ll get back to you with more details once I get a large-scale test up and running on testflinger.


The idea of a non-cluster member is more like a worker node with the cluster members forming the control plane. As an initial design this could be used to spawn OSD services (the backbone of storage in Ceph) at scale with the remaining services (Mon, Mgr, Mds, RGW) located on the cluster member nodes (+1 to @markylaing’s observations).

For health checks, Ceph has its own ways to discover unhealthy services, but it would be great to hook in some node-local information and send it periodically to the cluster members for action or for alerting the user.

The discussion around spare dqlite members has also sparked my interest. Looking forward to reading your observations @masnax

So I’ve managed to play with this a bit on a beefy testflinger machine with 200 containers. I installed a super minimal build of microcloud that doesn’t set anything up except itself, so basically just a REST API and a dqlite database with ~30 second interval heartbeats.

To be honest, cluster performance isn’t too bad at all. Everything does work well. Fetching and writing to the database is always <1s.

But even while the cluster is idle, CPU usage is pretty high. On voters it’s pretty much always above 10%, and it jumps around a lot, from 10% to 50% to 100% or even 200%. On spare nodes it’s lighter, usually around 2%, but it does occasionally spike to 25-50%.

I thought it was heartbeats at first but the heartbeat interval is much larger than the interval of CPU spikes, so perhaps this is some internal dqlite communication?


This is definitely not expected! I wonder if it’s the same problem that has been plaguing MicroK8s (on much much smaller clusters). You’ve never seen this happen on normal-sized LXD clusters, right?

If you can get me SSH access to a testflinger machine in the cluster I can try to at least capture a CPU flamegraph and see if anything sticks out. Anatoliy (@just-now on MM and GitHub) will definitely also be interested and will probably have more ideas for data we can capture.