Removing OSDs

pmatulis · May 10, 2022, 10:35pm

This guide describes the procedure for removing an OSD from a Ceph cluster.

Note:

This method makes use of the ceph-osd charm’s remove-disk action, which appeared in the charm’s quincy/stable channel. There is a pre-Quincy version of this page available.

Before removing an OSD, first ensure that the cluster is healthy:
```
juju ssh ceph-mon/leader sudo ceph status
```
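
When scripting this check, a small helper can gate the rest of the procedure on the health summary. The sketch below is only an illustration: the function name `cluster_is_healthy` is hypothetical, and it assumes the one-line summary printed by `ceph health` (which begins with `HEALTH_OK`, `HEALTH_WARN` or `HEALTH_ERR`) rather than the full `ceph status` report.

```shell
# Succeed only when the health summary starts with HEALTH_OK.
# $1: the text printed by `ceph health`
cluster_is_healthy() {
    # juju ssh may append carriage returns, so strip them first
    case "$(printf '%s' "$1" | tr -d '\r')" in
        HEALTH_OK*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Possible usage (not run here, requires a live model):
#   health=$(juju ssh ceph-mon/leader sudo ceph health)
#   cluster_is_healthy "$health" || { echo "not healthy: $health" >&2; exit 1; }
```

Anything other than `HEALTH_OK` is treated as a reason to stop and investigate before removing an OSD.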

Identify the target OSD

Check the OSD tree to map OSDs to their host machines:

```
juju ssh ceph-mon/leader sudo ceph osd tree
```

Sample output:

```
ID  CLASS  WEIGHT   TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-1         0.09357  root default
-5         0.03119      host finer-shrew
2    hdd  0.03119          osd.2             up   1.00000  1.00000
...
```

Suppose we want to remove osd.2. As the output shows, it is hosted on the machine finer-shrew.
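
The lookup from OSD to host can also be done mechanically. The following is a minimal sketch, assuming the plain-text layout of `ceph osd tree` shown in the sample output, where each `host` line precedes the OSDs it contains. The function name `host_of_osd` is hypothetical.

```shell
# Print the host that contains the given OSD.
# $1: OSD name such as osd.2; the `ceph osd tree` text is read from stdin
host_of_osd() {
    awk -v osd="$1" '
        {
            # Remember the most recent host bucket seen so far
            for (i = 1; i < NF; i++) if ($i == "host") host = $(i + 1)
            # When the target OSD appears, report that host and stop
            for (i = 1; i <= NF; i++) if ($i == osd) { print host; exit }
        }
    '
}

# Possible usage (not run here, requires a live model):
#   juju ssh ceph-mon/leader sudo ceph osd tree | host_of_osd osd.2
```

The function prints nothing if the OSD is not listed in the tree.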

Check which unit is deployed on this machine:

```
juju status
```

Sample output:

```
...
Unit         Workload  Agent  Machine  Public address  Ports  Message
...
ceph-osd/1*  blocked   idle   1        192.168.122.48         No block devices detected using current configuration
...

Machine  State    DNS             Inst id              Series  AZ       Message
...
1        started  192.168.122.48  finer-shrew          jammy   default  Deployed
...
```

In this case, ceph-osd/1 is the unit that hosts the target OSD.
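
The same mapping can be sketched as a filter over the tabular `juju status` output. This is a rough illustration under several assumptions: the column layout matches the sample above, the "Inst id" column carries the hostname (as it does in this example), and the application is named `ceph-osd`. The function name `unit_on_host` is hypothetical.

```shell
# Print the ceph-osd unit deployed on the machine with the given hostname.
# $1: hostname as shown in the "Inst id" column; `juju status` text on stdin
unit_on_host() {
    awk -v host="$1" '
        $1 == "Unit"    { section = "unit"; next }
        $1 == "Machine" { section = "machine"; next }
        NF == 0         { section = ""; next }
        # Unit rows: column 4 is the machine number; drop the leader marker "*"
        section == "unit" && $1 ~ /^ceph-osd\// { sub(/\*$/, "", $1); unit[$4] = $1 }
        # Machine rows: column 4 is the instance id
        section == "machine" && $4 == host { machine = $1 }
        END { if (machine in unit) print unit[machine] }
    '
}

# Possible usage (not run here, requires a live model):
#   juju status | unit_on_host finer-shrew
```

Parsing the tabular output is fragile, so treat the result as a hint and confirm it against the status output by eye.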

Therefore, the target OSD can be identified by the following properties:

```
OSD_UNIT=ceph-osd/1
OSD=osd.2
OSD_ID=2
```
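
To use these properties in the commands that follow, they can be set as shell variables. In this small sketch the numeric ID is derived from the OSD name with standard parameter expansion rather than typed by hand. The values are the ones from this example.

```shell
OSD_UNIT=ceph-osd/1   # unit found in the juju status output
OSD=osd.2             # OSD name from the ceph osd tree output
OSD_ID=${OSD#osd.}    # strip the "osd." prefix, leaving the numeric ID

echo "unit=$OSD_UNIT osd=$OSD id=$OSD_ID"
```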

Remove the OSD disk

Remove the OSD disk using the remove-disk action:
```
juju run $OSD_UNIT remove-disk osd-ids=$OSD purge=true --wait=5m
```
Note:
For Juju versions < 3.0, use the juju run-action command.
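
As an illustration of the version difference, the sketch below picks the subcommand from the client version string reported by `juju version` and prints the resulting command for review instead of running it. The function name `remove_disk_cmd` is hypothetical, and it assumes that `juju run-action` accepts the same arguments and `--wait` value as the `juju run` form shown above.

```shell
# Print the remove-disk command suited to the given Juju client version.
# $1: version string (e.g. from `juju version`), $2: unit, $3: OSD name
remove_disk_cmd() {
    major=${1%%.*}    # keep only the major version number
    if [ "$major" -ge 3 ]; then
        sub=run
    else
        sub=run-action
    fi
    printf 'juju %s %s remove-disk osd-ids=%s purge=true --wait=5m\n' "$sub" "$2" "$3"
}

# Possible usage (not run here, requires the juju client):
#   remove_disk_cmd "$(juju version)" "$OSD_UNIT" "$OSD"
```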

Note:
The remove-disk action attempts to safely remove the target OSD from the cluster. The action fails with a timeout error if the OSD cannot be safely removed within the timeout period (the default is 5 minutes, but you can set it to 10s, 2m, 10m, etc.), or if too few OSDs would remain in the cluster after the removal to meet the pool-level replication requirements. To remove the disk even though doing so is considered unsafe, add force=true to the command when running the action.

(Optional) If the unit hosting the target OSD has no other active OSDs attached and you would like to remove it, run:
```
juju remove-unit $OSD_UNIT
```
Caution:
If there are active OSDs on the unit, removing it will produce unexpected errors.
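
One way to check for remaining active OSDs before removing the unit is to list the OSDs that `ceph osd tree` still reports as up under the unit's host. This is a sketch only, assuming the tree layout shown earlier in this guide. The function name `active_osds_on_host` is hypothetical.

```shell
# Print the OSDs reported as "up" under the given host.
# $1: hostname; the `ceph osd tree` text is read from stdin
active_osds_on_host() {
    awk -v host="$1" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "host" && i < NF)
                    cur = $(i + 1)          # entering a new host bucket
                else if ($i ~ /^osd\.[0-9]+$/ && cur == host && $(i + 1) == "up")
                    print $i                # OSD on the target host that is up
            }
        }
    '
}

# Possible usage (not run here, requires a live model):
#   juju ssh ceph-mon/leader sudo ceph osd tree | active_osds_on_host finer-shrew
```

If the function prints any OSD names, the unit still hosts active OSDs and should not be removed yet.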

Ensure the cluster is in a healthy state after being scaled down:
```
juju ssh ceph-mon/leader sudo ceph status
```
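
The cluster may need some time to settle after an OSD is removed, so a single check might not be enough. The sketch below polls a health command until it reports `HEALTH_OK` or a retry limit is reached. The function name `wait_until_healthy` and the retry values are hypothetical, and it assumes the command prints a summary line like that of `ceph health`.

```shell
# Poll a health command until it reports HEALTH_OK.
# $1: maximum attempts, $2: seconds between attempts, rest: command to run
wait_until_healthy() {
    attempts=$1
    delay=$2
    shift 2
    n=0
    while [ "$n" -lt "$attempts" ]; do
        # Strip carriage returns that juju ssh may append
        health=$("$@" 2>/dev/null | tr -d '\r')
        case "$health" in
            HEALTH_OK*) return 0 ;;
        esac
        n=$((n + 1))
        # Pause only if another attempt will follow
        if [ "$n" -lt "$attempts" ]; then sleep "$delay"; fi
    done
    echo "cluster still not healthy: $health" >&2
    return 1
}

# Possible usage (not run here, requires a live model):
#   wait_until_healthy 30 10 juju ssh ceph-mon/leader sudo ceph health
```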

bran-castillo · July 29, 2024, 4:57pm

In this documentation, for step 3 it was noticed that --wait does not have a value in this example.