Scaling down the cluster

Scaling down the cluster refers to the removal of cluster members.

These instructions show how to remove any node except the primary (bootstrap) node. For instructions on removing that node, see the Removing the primary node page. Note that the primary node must always be the last node to be removed.

Remove the node

On the primary node:

sunbeam cluster remove --name <node FQDN>
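
For example, assuming the departing node is named node3.example.com (an illustrative FQDN; substitute the name of the node you are removing):

sunbeam cluster remove --name node3.example.com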

Note: A current software issue (fixed in the edge risk level) causes the cluster remove command to fail with a message like:

ERROR removing machine failed: machine 1 has unit "sunbeam-machine/1" assigned.

To work around this, manually remove the unit named in the error message:

juju remove-unit sunbeam-machine/1

The cluster remove command can then be reissued.

Repeat the workaround as needed.
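
Continuing the earlier example, a removal that runs into this issue would proceed as follows (the unit name is taken from the actual error output, not chosen in advance):

sunbeam cluster remove --name node3.example.com
juju remove-unit sunbeam-machine/1
sunbeam cluster remove --name node3.example.com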

Remove components from the node

Software components now need to be removed from the target node. Perform all of the steps below on the target node.

Remove the Juju agent:

sudo /sbin/remove-juju-services

Remove the juju snap:

sudo snap remove --purge juju

Remove Juju configuration:

rm -rf ~/.local/share/juju

Remove the openstack-hypervisor and openstack snaps:

sudo snap remove --purge openstack-hypervisor
sudo snap remove --purge openstack

Remove openstack snap configuration:

rm -rf ~/.local/share/openstack

Leave the MicroK8s cluster and remove the microk8s snap:

sudo microk8s leave
sudo snap remove --purge microk8s

The above steps can take a few minutes to complete.

Remove the disk(s) used by microceph on this node:

sudo microceph disk list
sudo microceph disk remove <OSD on this node>
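
For example, if the disk list shows OSD 1 located on this node (the OSD identifier here is illustrative; use the value reported for this node, and note that the exact argument format may vary between microceph versions):

sudo microceph disk remove 1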

Remove the microceph snap:

sudo snap remove --purge microceph

If required, clean the disk(s) identified by the earlier disk list command:

sudo dd if=/dev/zero of=<DISK PATH> bs=4M count=10

Caution: The dd command results in the permanent erasure of data. It is vital that you specify the correct disk path to avoid unintended data loss.
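
For example, with a purely illustrative by-id path (status=progress is an optional GNU dd flag that reports write progress):

sudo dd if=/dev/zero of=/dev/disk/by-id/scsi-example-disk bs=4M count=10 status=progress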

Clear the remaining network configuration with a reboot:

sudo reboot

Can we add steps to clean the disks used by microceph?
Is it required to remove $HOME/.local/share/juju?

Thanks @hemanth-n, I’ve changed the guide and added those steps.

Can we use the commands below before removing the microceph snap and cleaning up the disks?

microceph disk remove <>
microceph cluster remove <>

EDIT: Thinking aloud, maybe both of these commands should be part of the sunbeam cluster remove <> command. The same goes for microk8s leave.

This clean-up script helps me clean up my lab nodes (:warning: Use at your own risk :warning:)

#!/bin/bash

# Function to clean up each disk listed by microceph
clean_microceph_disks() {
    echo "Listing microceph disks..."
    # Extract the disk paths from the table printed by `microceph disk list`
    # (the table uses '|' separators, so `awk '{print $1}'` would only grab the border)
    disks=$(sudo microceph disk list | grep -o '/dev/disk/by-id/[^ |]*')
    
    if [ -z "$disks" ]; then
        echo "No disks found by microceph."
        return
    fi

    echo "Cleaning microceph disks..."
    for disk in $disks; do
        echo "Cleaning disk: $disk"
        sudo dd if=/dev/zero of="$disk" bs=4M count=10 status=progress
    done
    echo "Disk clean-up completed."
}

# Step 1: Destroy the Juju model with storage
echo "Destroying Juju model: openstack"
juju destroy-model --destroy-storage --no-prompt --force --no-wait openstack

# Step 2: Destroy the Juju controller
echo "Destroying Juju controller: sunbeam-controller"
juju destroy-controller --no-prompt --destroy-storage --force --no-wait sunbeam-controller

# Step 3: Remove the Juju agent services
echo "Removing Juju agent services"
sudo /sbin/remove-juju-services

# Step 4: Remove the Juju snap
echo "Removing Juju snap"
sudo snap remove --purge juju

# Step 5: Remove Juju configuration
echo "Removing Juju configuration"
rm -rf ~/.local/share/juju
sudo rm -rf /var/lib/juju/dqlite
sudo rm -rf /var/lib/juju/system-identity
sudo rm -rf /var/lib/juju/bootstrap-params

# Step 6: Remove the OpenStack hypervisor and OpenStack snaps
echo "Removing OpenStack hypervisor and OpenStack snaps"
sudo snap remove --purge openstack-hypervisor
sudo snap remove --purge openstack

# Step 7: Remove OpenStack snap configuration
echo "Removing OpenStack snap configuration"
rm -rf ~/.local/share/openstack

# Step 8: Leave and remove MicroK8s
echo "Leaving and removing MicroK8s snap"
sudo microk8s leave
sudo snap remove --purge microk8s

# Step 9: Clean up MicroCeph disks
clean_microceph_disks

# Step 10: Remove the MicroCeph snap
echo "Removing MicroCeph snap"
sudo snap remove --purge microceph

echo "Cleanup completed!"

When attempting to remove an unreachable, errored out node:

⠧ Remove openstack-hypervisor unit from machine ...
DEBUG    Connector: closing controller connection                     connector.py:124
DEBUG    Application 'openstack-hypervisor' is in status: 'error'     juju.py:770
DEBUG    Waiting for app status to be: error ['active', 'unknown']

sunbeam cluster remove hangs.

I needed to locate the problematic unit using

juju status -m admin/controller

and then remove it with these flags:

juju remove-unit openstack-hypervisor/0 --force --destroy-storage --no-wait

Very useful, considering how often joining nodes/configuring nodes seems to fail. It’s often easier to start all over than to troubleshoot.

Really helped me to re-create the lab environments and focus on the actual thing I was testing.