Replacing OSD disks

pmatulis · October 14, 2020, 7:25pm

This guide shows how to recreate a Ceph OSD disk within a Charmed Ceph deployment. It does so by combining the remove-disk and add-disk actions while preserving the OSD Id. Preserving the Id is typically desirable because operators become accustomed to certain OSDs having specific roles.

Note:

This method makes use of the ceph-osd charm’s remove-disk action, which appeared in the charm’s quincy/stable channel. There is a pre-Quincy version of this page available.

Identifying the target OSD

We’ll check the OSD tree to map OSDs to their host machines:

juju ssh ceph-mon/leader sudo ceph osd tree

Sample output:

ID   CLASS  WEIGHT   TYPE NAME               STATUS     REWEIGHT  PRI-AFF
 -1         0.11198  root default                                        
 -7         0.00980      host direct-ghost                               
  4    hdd  0.00980          osd.4                  up   1.00000  1.00000
 -9         0.00980      host famous-cattle                              
  3    hdd  0.00980          osd.3                  up   1.00000  1.00000
  5    ssd  0.00980          osd.5                  up   1.00000  1.00000
 -5         0.07280      host osd-01                                     
  0    hdd  0.07280          osd.0                  up   1.00000  1.00000
 -3         0.00980      host sure-tarpon                                
  1    hdd  0.00980          osd.1                  up   1.00000  1.00000
-11         0.00980      host valued-fly                                 
  2    hdd  0.00980          osd.2                  up   1.00000  1.00000

Let’s assume that we want to replace osd.5. As shown in the output, it’s hosted on the machine famous-cattle.
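The same lookup can be scripted. The following is a minimal sketch, not part of the charm: the function name osd_host is hypothetical, and it assumes the column layout of the sample output above, in particular that every OSD row has a CLASS value.

```shell
# osd_host OSD_ID
# Read `ceph osd tree` text on stdin and print the name of the host bucket
# that contains the given OSD. Assumes the sample layout shown above:
#   host rows: ID WEIGHT TYPE NAME
#   OSD rows:  ID CLASS WEIGHT NAME STATUS ...
osd_host() {
    awk -v want="osd.$1" '
        { sub(/\r$/, "") }                  # drop a trailing carriage return, if any
        $3 == "host" { host = $4 }          # remember the most recent host bucket
        $4 == want   { print host; exit }   # the OSD belongs to that bucket
    '
}

# Possible usage (not executed here):
#   juju ssh ceph-mon/leader sudo ceph osd tree | osd_host 5
```

If the function prints nothing, the OSD was not found or the output layout differs from the assumption above.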

Next, we check which unit is deployed on that machine:

juju status

Sample output:

Unit         Workload  Agent  Machine  Public address  Ports  Message
...
ceph-osd/1   active    idle   4        192.168.122.8          Unit is ready (2 OSD)
...

Machine  State    Address         Inst id        Series  AZ       Message
...
4        started  192.168.122.8   famous-cattle  focal   default  Deployed
...

In this case, ceph-osd/1 is the target unit.
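For larger models, the match between host name and unit can be done with a small filter. This is only a sketch under stated assumptions: the function name units_on_host is hypothetical, it parses the default tabular juju status layout shown above, and it assumes the machine's Inst id equals the host name reported by ceph osd tree (true in the sample, not guaranteed on every cloud).

```shell
# units_on_host HOSTNAME
# Read tabular `juju status` text on stdin and print every unit placed on
# the machine whose "Inst id" column equals HOSTNAME. Exits non-zero if no
# such machine is listed.
units_on_host() {
    awk -v host="$1" '
        $1 == "Unit"    { section = "unit"; next }
        $1 == "Machine" { section = "machine"; next }
        NF == 0         { section = ""; next }     # a blank line ends a section
        section == "unit" && NF >= 4 {
            unit = $1
            sub(/\*$/, "", unit)                   # the leader unit is marked with "*"
            placed[unit] = $4                      # column 4 is the machine number
        }
        section == "machine" && $4 == host { machine = $1 }
        END {
            if (machine == "") exit 1
            for (u in placed) if (placed[u] == machine) print u
        }
    '
}

# Possible usage (not executed here):
#   juju status | units_on_host famous-cattle
```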

Therefore, the target OSD can be identified by the following properties:

OSD_UNIT=ceph-osd/1
OSD=osd.5
OSD_ID=5

Replacing the disk

We’ll start by removing the disk. The command to run is the following:

juju run-action $OSD_UNIT --wait remove-disk osd-ids=$OSD

If successful, the output should contain the following:

1 disk(s) was removed
To replace them, run:
juju run-action ceph-osd/1 add-disk osd-devices=/dev/vdb osd-ids=osd.5

The output includes instructions on how to replace the disk. Importantly, the OSD Id can be reused because we didn’t set the purge flag during removal.
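The suggested command also records which device the removed OSD was using. As a sketch, that path can be pulled out of saved action output. The function name suggested_osd_device is hypothetical, and the parsing assumes the message keeps the format shown above.

```shell
# suggested_osd_device
# Read remove-disk action output on stdin and print the osd-devices value
# from the first suggested add-disk command.
suggested_osd_device() {
    sed -n 's/.*add-disk.*osd-devices=\([^ ]*\).*/\1/p' | head -n 1
}

# Possible usage (not executed here), with remove.log holding the saved
# output of the remove-disk action:
#   OSD_BACKING_DEVICE=$(suggested_osd_device < remove.log)
```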

Now, let’s assume that we are replacing the disk in order to add a bcache device for better performance. We can do this easily with the add-disk action.

For example, if the caching device is /dev/pmem0 and the backing device is the one we kept (i.e. /dev/vdb), we can identify the following properties:

OSD_CACHE_DEVICE=/dev/pmem0
OSD_BACKING_DEVICE=/dev/vdb

We can now run the following to replace the disk:

juju run-action --wait $OSD_UNIT add-disk osd-devices=$OSD_BACKING_DEVICE cache-devices=$OSD_CACHE_DEVICE osd-ids=$OSD

This creates a new OSD on a bcache device built from the caching and backing devices, and reuses the OSD Id of the original disk.
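Because the action changes disks, it can help to print the full command and review it before running it. The following is a sketch only: the helper name add_disk_cmd is hypothetical, and it uses just the add-disk parameters that appear above (osd-devices, cache-devices, osd-ids).

```shell
# add_disk_cmd UNIT OSD BACKING_DEVICE [CACHE_DEVICE]
# Print (rather than run) the add-disk invocation. The cache-devices
# parameter is included only when a caching device is given.
add_disk_cmd() {
    cmd="juju run-action --wait $1 add-disk osd-devices=$3"
    if [ -n "${4:-}" ]; then
        cmd="$cmd cache-devices=$4"
    fi
    printf '%s osd-ids=%s\n' "$cmd" "$2"
}

# Possible usage (prints the command, does not run it):
#   add_disk_cmd "$OSD_UNIT" "$OSD" "$OSD_BACKING_DEVICE" "$OSD_CACHE_DEVICE"
```

Omitting the fourth argument yields the plain replacement command without a caching device.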

We can check that this is the case by running the following:

juju ssh $OSD_UNIT -- sudo ceph-volume lvm list

And checking that the output contains:

====== osd.5 =======

  [block]       /dev/ceph-55245f64-30ab-4544-88d7-dbdfb88521c3/osd-block-55245f64-30ab-4544-88d7-dbdfb88521c3

      block device              /dev/ceph-55245f64-30ab-4544-88d7-dbdfb88521c3/osd-block-55245f64-30ab-4544-88d7-dbdfb88521c3
      block uuid                i4vyZI-7m5D-36kp-Duy9-kx52-0UvG-Hz16SG
      cephx lockbox secret      
      cluster fsid              35a48462-7125-11ed-8c2f-e78ff76f8d8b
      cluster name              ceph
      crush device class        
      encrypted                 0
      osd fsid                  55245f64-30ab-4544-88d7-dbdfb88521c3
      osd id                    5
      osdspec affinity          
      type                      block
      vdo                       0
      devices                   /dev/bcache0

In other words, the OSD Id is the same (osd.5) and the device is using bcache.
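This check can also be automated. Below is a minimal sketch, assuming the ceph-volume lvm list layout shown above (a "====== osd.N =======" header followed by a "devices" line); the function name osd_devices is hypothetical.

```shell
# osd_devices OSD_ID
# Read `ceph-volume lvm list` text on stdin and print the "devices" value
# found in the section for the given OSD.
osd_devices() {
    awk -v want="osd.$1" '
        { sub(/\r$/, "") }                         # drop a trailing carriage return, if any
        /^=+ osd\.[0-9]+ =+/ { current = $2 }      # section header, e.g. "====== osd.5 ======="
        current == want && $1 == "devices" { print $2 }
    '
}

# Possible usage (not executed here):
#   dev=$(juju ssh $OSD_UNIT -- sudo ceph-volume lvm list | osd_devices 5)
#   case "$dev" in
#       /dev/bcache*) echo "osd.5 is backed by bcache ($dev)" ;;
#       *)            echo "osd.5 is not on bcache: $dev" ;;
#   esac
```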

We can further check that the cluster shows the right number of OSDs by running:

juju status

And checking that the unit ceph-osd/1 shows 2 OSDs.
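A sketch of the same check in script form follows. It assumes the unit's workload message has the "Unit is ready (N OSD)" shape seen in the earlier sample output; the function name unit_osd_count is hypothetical.

```shell
# unit_osd_count UNIT
# Read tabular `juju status` text on stdin and print the number N from that
# unit's "Unit is ready (N OSD)" message. Prints nothing if no such message
# is found for the unit.
unit_osd_count() {
    awk -v unit="$1" '
        { name = $1; sub(/\*$/, "", name) }        # the leader unit is marked with "*"
        name == unit && match($0, /\([0-9]+ OSD\)/) {
            # skip the "(" and drop the trailing " OSD)"
            print substr($0, RSTART + 1, RLENGTH - 6)
            exit
        }
    '
}

# Possible usage (not executed here):
#   [ "$(juju status | unit_osd_count ceph-osd/1)" = "2" ] && echo "OSD count OK"
```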

guoqiao · March 10, 2021, 10:19pm

This doc is awesome, but I guess the above is a typo? Should it be something like sda (root) and sd{b,c,d} for backing devices?

pmatulis · March 10, 2021, 11:34pm

Thanks. It looks like sdd is the root device. I’ll change it now.

thogarre · February 14, 2022, 5:42pm

Thanks for this document! It’s pretty great. One thing that might be worth noting here is LP#1955110: after the purging action, Nagios will alert for the systemd service that stopped running. Until the bug is resolved, manual cleanup of the relevant Ceph directory is required:

sudo umount /var/lib/ceph/osd/ceph-$OSD_ID/
sudo rm -rf /var/lib/ceph/osd/ceph-$OSD_ID

thogarre · February 14, 2022, 5:45pm

The osd-devices config option for ceph-osd may have these directories changed from bcache to something else (like osddata), so it might be good to reference this as a place to look.

pmatulis · February 17, 2022, 4:54pm

Thanks. I’ve included a bug citation at the purge-osd action step.

pmatulis · February 17, 2022, 5:04pm

Well the preceding line does say:

…the actual device is bcache1

So I think we’re good. Unless I misunderstand you? There are a lot of variables in the instructions so we do need to assume that the reader is following along. Otherwise it gets messy. Let’s discuss.

thogarre · February 17, 2022, 5:39pm

Thanks for the quick response!

The actual device may be bcache1, but if it’s symlinked, the command as shown won’t work. It will work only if the symlink from osd-devices is taken into account, which might stump readers if they’re not aware that it needs to be checked.

juju config ceph-osd osd-devices
/dev/disk/by-dname/osddata0 /dev/disk/by-dname/osddata1 .....

# on the host in question
$ ls -la /dev/disk/by-dname/bcache* | egrep "bcache1$"
ls: cannot access '/dev/disk/by-dname/bcache*': No such file or directory
$ ls -la /dev/disk/by-dname/osddata* | egrep 'bcache1$'
lrwxrwxrwx 1 root root 13 Feb 14 17:18 /dev/disk/by-dname/osddata1 -> ../../bcache1
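One way to avoid depending on the naming convention at all is to resolve every symlink in the directory and compare targets. This is only a sketch: the function name dname_for_device is hypothetical, and it relies on readlink -f being available on the host (it is part of GNU coreutils).

```shell
# dname_for_device DIR DEVICE
# Print each symlink in DIR whose fully resolved target is DEVICE,
# whatever naming convention the links use (bcache*, osddata*, ...).
dname_for_device() {
    target=$(readlink -f "$2") || return 1
    for link in "$1"/*; do
        [ -L "$link" ] || continue                 # skip anything that is not a symlink
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Possible usage on the host in question (not executed here):
#   dname_for_device /dev/disk/by-dname /dev/bcache1
```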

pmatulis · March 4, 2022, 11:04pm

Thanks for your interest in improving the documentation. If you could propose the exact wording you would like to see then I will add it.

thogarre · March 14, 2022, 2:20pm

Thanks! How about this:

We can use that in the below command to find the by-dname entry.

First, look at the ceph-osd osd-devices configuration option to determine which convention is used for the by-dname symlinks. Typically this is bcache, but it may have been changed during the cloud’s deployment.

$ juju config ceph-osd osd-devices
/dev/disk/by-dname/bcache0 ....

Then, on a ceph-osd unit, look for that by-dname symlink:

ls -la /dev/disk/by-dname/bcache* | egrep "bcache1$"

Sample output:

lrwxrwxrwx 1 root root 13 Oct  1 04:28 /dev/disk/by-dname/bcache3 -> ../../bcache1

pmatulis · March 14, 2022, 6:52pm

I added an admonition.

nkoltsov · February 13, 2023, 1:20pm

Hi, the old version of this manual is still valid and pretty useful for old charm versions. Could you put it somewhere and add a link to it (in the note block above)?

pmatulis · February 13, 2023, 3:00pm

Sure. I’ll do that soon.

This was done today.