Recovering from an outage with multi-site replication

pmatulis · May 29, 2023, 2:53pm

If the primary site of a RADOS Gateway multi-site deployment (us-east in this example) suffers an outage, the secondary endpoint can be used to maintain access to the object store. However, the primary metadata zone must first be failed over to the secondary site. This is done with the promote action:

juju run -m smodel secondary-ceph-radosgw/0 promote

Once this action completes, the secondary site becomes the new primary for metadata updates, and the deployment accepts new data uploads.
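To confirm that the promotion took effect, one option is to inspect the output of `radosgw-admin sync status` on the promoted unit. The following is a minimal sketch, not part of the charm: the helper name `is_metadata_master` is hypothetical, and the matched text ("zone is master") is an assumption about the command's output that may differ between Ceph releases.

```shell
#!/bin/sh
# Hypothetical helper: reads `radosgw-admin sync status` output on stdin and
# succeeds if the local zone reports itself as the metadata master.
# Assumption: the master zone prints a line containing "zone is master".
is_metadata_master() {
    grep -q 'zone is master'
}

# Example usage (not executed here). Assumes radosgw-admin can authenticate
# on the unit; it may need extra options such as --id in your deployment.
#   juju ssh -m smodel secondary-ceph-radosgw/0 sudo radosgw-admin sync status \
#       | is_metadata_master && echo "secondary is now the metadata master"
```

The helper only parses text, so it can be tried against saved command output before being used on a live deployment.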

Once the failed site has been recovered, it will resync and resume as a secondary to the promoted primary site (secondary-ceph-radosgw in this example).
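One way to check whether the recovered site has finished resyncing is to look at `radosgw-admin sync status` on the recovered unit. The sketch below is illustrative only: the helper name `sync_caught_up` is hypothetical, and the phrases it matches ("behind", "caught up") are assumptions about the command's output that may vary between Ceph releases.

```shell
#!/bin/sh
# Hypothetical helper: reads `radosgw-admin sync status` output on stdin and
# succeeds only if nothing is reported as behind and at least one
# "caught up" line is present.
sync_caught_up() {
    status=$(cat)
    # Any "behind" line means some shards are still syncing.
    if printf '%s\n' "$status" | grep -q 'behind'; then
        return 1
    fi
    printf '%s\n' "$status" | grep -q 'caught up'
}

# Example usage (not executed here). Assumes radosgw-admin can authenticate
# on the unit; it may need extra options such as --id in your deployment.
#   juju ssh -m pmodel primary-ceph-radosgw/0 sudo radosgw-admin sync status \
#       | sync_caught_up && echo "resync complete"
```

Treat a passing check as a hint rather than a guarantee, and review the full status output by hand as well.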

Once resync has completed, the primary metadata zone can be failed back to its original location, again using the promote action:

juju run -m pmodel primary-ceph-radosgw/0 promote