Scaling up the cluster

Scaling up the cluster refers to the addition of cluster members.

Create a registration token

A registration token is needed before a new member can be added. Generate one by connecting to an existing cluster member and running the cluster add command against the FQDN of the new node:

sunbeam cluster add --name <new_machine_fqdn>

Clustering does not support base hostnames. A node is only known by its FQDN.

Keep the token in a safe place. It will be used in a future step.
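For example, on an existing member (the FQDN and file name below are hypothetical), and assuming the token is written to standard output, you could capture it to a file as it is generated:

sunbeam cluster add --name sunbeam2.example.com | tee ~/sunbeam2-token.txt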

Provision the new machine

Several steps are needed to provision the new machine.

Install the openstack snap

You will need to install the openstack snap on the new machine in order to perform various commands:

sudo snap install openstack --channel <channel>

The snap channel must be common across all cluster members.
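As an illustration only (confirm the channel your cluster actually uses), if the existing members track the 2024.1/stable channel, the new machine would install from the same one:

sudo snap install openstack --channel 2024.1/stable

The channel an existing member is tracking can be checked with 'snap list openstack' on that member.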

Install cloud software

Install cloud software on the new machine with the following command:

sunbeam prepare-node-script | bash -x
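If you prefer to inspect the script before executing it, a simple variation (functionally equivalent, just split into two steps) is:

sunbeam prepare-node-script | tee prepare-node.sh
bash -x prepare-node.sh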

Join the machine to the cluster

On the new machine, join the cluster with the cluster join command, supplying the registration token obtained earlier:

sunbeam cluster join [--role <role> [--role <role>...]] --token <registration_token>

Multiple roles can be assigned per node. Available role values are ‘control’, ‘compute’, and ‘storage’. If no role is specified, it defaults to the combination of ‘control’ and ‘compute’.
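For instance, to join a machine that should act as both a compute and a storage node (the token value is a placeholder for the one generated earlier):

sunbeam cluster join --role compute --role storage --token <registration_token>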

Resize the cluster

After the node has joined, the cluster needs to be resized with the cluster resize command. This command can be invoked on an existing node or on the new node:

sunbeam cluster resize

If multiple nodes are to be added, you can join all nodes first and resize the cluster once. This can save a significant amount of processing time.
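Once the resize has completed, membership can be verified from any member. Assuming your snap revision provides the cluster list subcommand, it shows each node together with its roles and status:

sunbeam cluster list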

Troubleshooting: resize times out waiting for cinder-ceph

After running ‘sunbeam cluster resize’, the deployment got stuck at the point shown below and eventually timed out. I have already set ‘MAX_FRAME_SIZE = 2**25’ but it is still stuck.

⠦ Deploying OpenStack Control Plane to Kubernetes (this may take a while) ... waiting for services to come online (31/32)
[05:22:53] DEBUG    Waiting for 'cinder-ceph' cancelled                                                      juju.py:1024
           WARNING  Timed out while waiting for model 'openstack' to be ready                                openstack.py:265
           DEBUG    Finished running step 'Deploying OpenStack Control Plane'. Result: ResultType.FAILED     common.py:261
Error: Timed out while waiting for model 'openstack' to be ready
node1@node1:~$

The cinder-ceph application status is ‘waiting’:

node1@node1:~$ juju status | grep cinder-ceph
cinder-ceph            waiting  local  node1.ip.com/openstack.cinder-ceph

I am not sure where to start troubleshooting.

I managed to fix the problem. It was related to OSD pool replication. Keep in mind that the number of PGs per OSD must not exceed 250. I had to tune pg_num and pgp_num so that ‘microceph.ceph status’ changed from HEALTH_WARN to HEALTH_OK.
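As a rough sketch of that fix (the pool name and PG counts below are examples only; pick values that keep PGs per OSD under the 250 limit for your OSD count), the pools can be inspected and re-tuned with the standard Ceph commands exposed by the microceph snap:

# Check overall health and current per-pool PG counts
sudo microceph.ceph status
sudo microceph.ceph osd pool ls detail

# Lower pg_num/pgp_num on an oversized pool ('cinder-ceph' is a hypothetical pool name)
sudo microceph.ceph osd pool set cinder-ceph pg_num 32
sudo microceph.ceph osd pool set cinder-ceph pgp_num 32

# Re-check until the cluster reports HEALTH_OK
sudo microceph.ceph status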