Charmed Ceph Alerts Guide

This guide explains how to respond to and troubleshoot alerts generated by a Charmed Ceph cluster. The Charmed Ceph ceph-mon application ships with a default set of metrics and alerts; this document expands on that default alert set with approaches to investigate, analyze, and mitigate the issues behind those alerts.

Alert: CephDaemonCrash (generic)

Overview

The CephDaemonCrash alert indicates that one or more Ceph daemons have experienced a crash recently. The purpose of this alert is to ensure that crashes are promptly noticed and addressed to maintain the stability and reliability of the Ceph storage cluster. Firstly, the cause of the crash needs to be diagnosed and the crashed service recovered. Then, these crashes need to be acknowledged by an administrator to be archived and to prevent them from triggering repeated alerts. Acknowledging a crash typically involves reviewing the crash details to determine the cause and taking appropriate action to prevent future occurrences. The recommended command for acknowledging a crash is ceph crash archive <id>, where <id> is the unique identifier of the crash report.

To tune what Ceph considers "recent", adjust the mgr/crash/warn_recent_interval option. Set it to zero to disable this warning.

Troubleshooting

Before proceeding with troubleshooting, ensure you have access to the Ceph cluster via a user with appropriate permissions and access to Juju commands for interacting with the Ceph Charms.

Step 1: Identify Crashed Daemons and Gather Crash IDs

  • Use the Juju command to get a list of crashed daemons and their crash IDs:
    juju exec --unit ceph-mon/leader -- ceph crash ls
    
    Replace ceph-mon/leader with the appropriate application if your monitor daemons are named differently.

Step 2: Review Crash Reports

  • For each crash ID obtained in the previous step, review the crash report to understand the context and potential cause of the crash:
    juju exec --unit ceph-mon/leader -- ceph crash info <id>
    
    Make sure to replace <id> with the actual crash ID.

Step 3: Check Logs

  • Examine the logs of the crashed daemons for errors or warnings that could indicate the cause of the crash. Logs can be found in /var/log/ceph on the Ceph nodes. Use SSH to access the nodes if necessary:
    grep -i error /var/log/ceph/<daemon-log-file>
    
    Replace <daemon-log-file> with the log file name of the crashed daemon, for example ceph-osd.<osd-id>.log or ceph-mon.<hostname>.log.

Step 4: System Health and Resource Utilization

  • Check the overall health of the Ceph cluster to see if there are any other issues that might have contributed to the daemon crashes:
    juju exec --unit ceph-mon/leader -- ceph health detail
    
  • Check the system resource utilization (CPU, memory, disk I/O) on the nodes where the daemons crashed to identify any resource constraints.
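    For example, assuming the crashed daemon ran on a ceph-osd unit (adjust the application name and unit number to match your deployment), take a quick look at load, memory, and disk usage on that node:
    juju ssh ceph-osd/X -- 'uptime; free -h; df -h'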

Step 5: Acknowledge and Archive the Crash

  • Once you have reviewed the crash report(s) and taken any necessary corrective action, acknowledge and archive the crash using the following Juju command:
    juju exec --unit ceph-mon/leader -- ceph crash archive <id>
    
    Replace <id> with the crash ID. Repeat this for each crash that needs to be acknowledged.

Step 6: Monitor and Follow-up

  • Continue to monitor the Ceph cluster’s health and logs to ensure that the issue has been resolved and does not recur.
  • If crashes continue to occur, consider further investigation into system configuration, hardware issues, or consult Ceph documentation and community forums for similar issues.

Alert: CephDeviceFailurePredicted (osd)

Overview

The alert CephDeviceFailurePredicted (osd) indicates that the Ceph cluster’s device health monitoring system has predicted an imminent failure of one or more storage devices (OSDs). This prediction is based on various health and performance metrics collected by the cluster. The system triggers this alert to prompt preemptive action, typically taking out and replacing the offending OSD. By default, if the mgr/devicehealth/self_heal option is enabled, the cluster will automatically take out the corresponding OSD.

Troubleshooting

1. Identify Failing Device(s)

First, identify the predicted failing devices using the Ceph CLI.

juju exec --unit ceph-mon/leader -- ceph device ls

This command lists all devices in the cluster, including their health status.

2. Review Specific Device Information

For detailed information about a specific device, including its health metrics and failure predictions, use:

juju exec --unit ceph-mon/leader -- ceph device info <dev_id>

Replace <dev_id> with the ID of the device of interest as obtained from the previous step.

3. Remove OSD

If applicable, remove failing OSDs from the cluster. Be sure to replace <unit> and <osd-id> below with the correct values.

juju run ceph-osd/<unit> remove-disk osd-ids="<osd-id>"

4. Monitor Data Migration

Monitor the data migration progress. Ensure that the cluster reaches a healthy state before proceeding with hardware replacement.

juju exec --unit ceph-mon/leader -- ceph -s

5. Replace Hardware

Physically replace the failing device. Ensure the new device is correctly installed and recognized by the system.
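
As a quick check, confirm that the operating system recognizes the new disk before redeploying the OSD (the unit number below is a placeholder):

juju ssh ceph-osd/<unit> -- lsblk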

6. Redeploy OSD

Add replaced disks back into the cluster. Be sure to replace <unit> and <osd-devices> with correct values:

juju run ceph-osd/<unit> add-disk osd-devices="<osd-devices>"

7. Verify Cluster Health

Finally, verify the health of the Ceph cluster to ensure it has returned to a healthy state with the new device integrated.

juju exec --unit ceph-mon/leader -- ceph health detail

Alert: CephDeviceFailurePredictionTooHigh (osd)

Overview

The alert CephDeviceFailurePredictionTooHigh (osd) indicates a serious issue within a Ceph cluster, where the device health monitoring module has predicted that a number of devices (OSDs) are at risk of failing soon. The specific concern here is that the number of devices predicted to fail is so high that automatic remediation efforts, typically involving removing these devices from the cluster, cannot be performed without risking the cluster’s performance and availability. This situation demands immediate attention to prevent data integrity issues. The recommended action is to add new OSDs to the cluster so that data can be safely relocated from the failing devices.

Troubleshooting

Step 1: Verify Cluster Health

Begin by checking the overall health of the Ceph cluster using the Juju command: juju exec --unit ceph-mon/0 -- ceph status. This will give you an overview of the health status, number of OSDs, and any immediate issues.

Step 2: Identify Predicted Failing OSDs

To get a list of devices that are predicted to fail, execute: juju exec --unit ceph-mon/0 -- ceph device ls. Look for devices reporting a limited life expectancy or otherwise flagged as likely to fail.

Step 3: Check OSD Logs

Inspect the Ceph OSD logs for any errors or warnings related to disk health. Logs can be found in /var/log/ceph. Look for messages related to device health or SMART data anomalies.
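
For example, assuming the default log file naming, the following shows recent health-related messages from the OSD logs on a given unit (adjust the unit number):

juju ssh ceph-osd/X -- 'grep -iE "error|smart" /var/log/ceph/ceph-osd.*.log | tail -n 50'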

Step 4: SMART Data Analysis

For any OSD predicted to fail, check the SMART data manually to confirm the prediction. This can be done using the smartctl utility.
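
For example (the unit number and device path below are placeholders; smartctl is provided by the smartmontools package):

juju ssh ceph-osd/X -- sudo smartctl -a /dev/sdX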

Step 5: Add New OSDs

If the predictions are confirmed and the cluster is at risk, the immediate action is to add new OSDs to the cluster to allow data relocation. Use the appropriate Charmed Ceph commands to deploy new OSDs via Juju.

For example, add new OSDs via the add-disk action:

juju run ceph-osd/X add-disk osd-devices="/dev/my-disk"

Ensure the new devices are healthy and properly integrated into the cluster.

Step 6: Data Relocation and Rebalancing

After adding new OSDs, ensure data is being relocated from the failing devices. Monitor the progress with: juju exec --unit ceph-mon/0 -- ceph status

Alert: CephDeviceFailureRelocationIncomplete (osd)

Overview

The CephDeviceFailureRelocationIncomplete alert indicates that the Ceph cluster has predicted a device (typically a hard drive or SSD) failure in one of its OSDs. The health module within Ceph has identified that this device is at risk of failing imminently. However, the automatic mechanism designed to safeguard data by relocating it from the at-risk device to other OSDs within the cluster is unable to proceed. This situation often arises due to insufficient free space in the cluster to accommodate the data from the failing device. It highlights the need for immediate intervention either by adding more storage capacity to the cluster or by optimizing current storage utilization.


Troubleshooting

Step 1: Verify Cluster Free Space

First, ensure there is enough free space in the cluster to facilitate data migration. Use the following command to check the overall cluster health and storage utilization:

juju exec --unit ceph-mon/leader -- ceph osd df

This command provides information on the storage utilization of your Ceph cluster, including total space, used space, and available space.

Step 2: Identify At-risk Devices

Next, identify which device(s) are predicted to fail using the Ceph device health monitoring feature:

juju exec --unit ceph-mon/leader -- ceph device ls

Look for devices marked with a health status indicating imminent failure or issues.

Step 3: Check for OSD Fullness

Inspect the fullness of each OSD to determine if specific OSDs are blocking data relocation:

juju exec --unit ceph-mon/leader -- ceph osd df

This command lists each OSD’s storage usage. Overly full OSDs might prevent data migration.

Step 4: Adjust Balancer (if applicable)

If the cluster has a balancer enabled, it may require adjustments to improve data distribution:

juju exec --unit ceph-mon/leader -- ceph balancer status
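
If the balancer is disabled, it can be enabled, for example in upmap mode:

juju exec --unit ceph-mon/leader -- ceph balancer mode upmap
juju exec --unit ceph-mon/leader -- ceph balancer on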

Step 5: Add Capacity (if necessary)

If free space is critically low, the long-term solution is adding more storage to the cluster. Use the Ceph Charms for expanding capacity, ensuring you consult the documentation for your specific deployment model.
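
For example (values below are placeholders), capacity can be added either by attaching an unused disk to an existing ceph-osd unit or by adding a new ceph-osd unit:

juju run ceph-osd/<unit> add-disk osd-devices="/dev/<device>"
juju add-unit ceph-osd -n 1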

Alert: CephFilesystemDamaged (mds)

Overview

The alert “CephFilesystemDamaged (mds)” indicates that the Ceph Filesystem (CephFS) has encountered damage to its metadata, which is crucial for managing and accessing the data stored in the filesystem. This situation can arise for various reasons, such as improper shutdowns, disk failures, or software bugs. Corrupted filesystem metadata means that the system may not be able to correctly locate and manage the affected files; depending on the type of damage, the MDS may continue operating in a degraded state or the affected rank may be marked as damaged and stopped.

Troubleshooting

The troubleshooting plan for resolving the “CephFilesystemDamaged (mds)” alert involves several steps to diagnose and potentially repair the damaged CephFS metadata. Follow these steps to identify and rectify the issue:

Check Ceph Health and Status

Begin by checking the overall health of your Ceph cluster using the command:

juju exec --unit ceph-mon/0 -- ceph -s

This command will provide a general overview of the cluster’s health, including any health warnings or errors that could be related to the damaged filesystem, such as scrub errors or device health issues.

Perform PG repairs or OSD maintenance if applicable.
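
For example, if ceph health detail reports an inconsistent placement group, a repair can be requested for it (replace <pg-id> with the reported PG):

juju exec --unit ceph-mon/0 -- ceph pg repair <pg-id>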

Retrieve Ceph MDS Diagnostics

Use the damage ls command (via ceph tell) to get more detail on the damage:

juju exec --unit ceph-mon/0 -- ceph tell mds.* damage ls

Contact support with those diagnostics.

Alert: CephFilesystemDegraded (mds)

Overview

The alert CephFilesystemDegraded (mds) indicates that the health of the CephFS (Ceph Filesystem) is compromised due to one or more of its metadata daemons (MDS) being in a failed or damaged state. Metadata daemons are crucial for the operation of CephFS, as they store and manage the metadata associated with files stored in the filesystem (e.g., directory structure, file access permissions, etc.). When an MDS rank fails or becomes damaged, it can lead to partial or, in severe cases, complete unavailability of the filesystem.

Troubleshooting

To troubleshoot the CephFilesystemDegraded (mds) alert, follow the steps below:

Step 1: Identify the Failed MDS Daemons

  • Run the following command to list the status of all MDS daemons:
    juju exec --unit ceph-mon/0 -- ceph mds stat
    
    Note down the MDS ranks that are in a failed or damaged state.

Step 2: Check MDS Daemon Logs

  • Inspect the logs of the affected MDS daemons to identify errors or warnings that might indicate the cause of the problem. Log files are located in /var/log/ceph on the MDS (ceph-fs) units; look for files named ceph-mds.<id>.log.
    juju exec --unit ceph-fs/X -- tail -n 100 /var/log/ceph/ceph-mds.<id>.log
    
  • Replace X with the number of the ceph-fs unit hosting the daemon and <id> with the ID of the failed MDS daemon.

Step 3: Health and Status of the Cluster

  • Check the overall health and status of the Ceph cluster:
    juju exec --unit ceph-mon/0 -- ceph status
    
  • This can provide more insights into why the MDS daemon(s) might be failing.

Step 4: Restart Failed MDS Daemons

  • If an MDS daemon is stuck in a failed state, restarting it might resolve the issue:
    juju exec --unit ceph-fs/<unit-id> -- systemctl restart ceph-mds@<daemon-id>.service
    
  • Replace <unit-id> with the number of the ceph-fs unit hosting the daemon and <daemon-id> with the ID of the MDS daemon you wish to restart.

Step 5: Contact Support

If after performing the above steps the issue persists, consider contacting support with detailed logs and findings from your troubleshooting efforts.

Alert: CephFilesystemFailureNoStandby (mds)

Overview

The alert CephFilesystemFailureNoStandby (mds) indicates a significant issue within the Ceph filesystem, specifically related to the Metadata Server (MDS) component. MDS daemons are crucial for the proper functioning of the Ceph Filesystem (CephFS), as they manage metadata, which includes file names, directory structures, permissions, and other information necessary for the filesystem to operate. Each CephFS usually has one active MDS and can have one or more standby MDS daemons ready to take over in case the active MDS fails.

This alert signifies that an MDS daemon has failed, and there is only one active rank left with no standby MDS daemons available. This situation is precarious because if the remaining active MDS also fails, it could lead to a total loss of access to the filesystem until an MDS is recovered or a new one is promoted, which could result in significant downtime.

Troubleshooting

Step 1: Check MDS Status

  • Execute juju exec --unit ceph-mon/leader -- ceph mds stat to get an overview of the MDS daemons and their states.
  • Confirm that there’s indeed only one active MDS and no standbys.

Step 2: Review MDS Logs

  • Logs are located in /var/log/ceph on the MDS nodes. Look for recent entries in the MDS logs with grep -i error /var/log/ceph/ceph-mds.*.log and grep -i warning /var/log/ceph/ceph-mds.*.log.
  • Pay special attention to entries close to the time the issue was first observed.

Step 3: Verify Cluster Health

  • Ensure that the rest of the Ceph cluster is healthy, as issues elsewhere could impact the MDS.
  • Run juju exec --unit ceph-mon/leader -- ceph health detail to get a detailed health report of the entire cluster.

Step 4: Check for Adequate Resources

  • Use tools like top, htop, iostat, and free to monitor the resources on the MDS nodes.

Step 5: Add a Standby MDS

  • If the cluster is indeed lacking a standby MDS, consider deploying another one.
  • Use the ceph-fs charm to deploy a new MDS daemon. If the application is already deployed, execute juju add-unit ceph-fs --to <node>, where <node> is the identifier of the machine where you wish to place the new MDS; otherwise deploy it with juju deploy ceph-fs --to <node> and relate it to ceph-mon.
  • After deployment, verify that the new MDS has joined as a standby by running juju exec --unit ceph-mon/leader -- ceph mds stat again.

Step 6: Monitor for Recovery

  • Re-run juju exec --unit ceph-mon/leader -- ceph health detail periodically to check if the cluster’s health improves.

Step 7: Plan for Redundancy

  • To prevent future occurrences, consider adding more standby MDS daemons for increased redundancy.
  • Evaluate the workload and size of your CephFS to decide on the optimal number of standby MDS daemons.

Alert: CephFilesystemInsufficientStandby (mds)

Overview

The alert CephFilesystemInsufficientStandby pertains to the situation where the Ceph filesystem (MDS - Metadata Server) doesn’t have a sufficient number of standby daemons in relation to the desired count specified by standby_count_wanted. This situation can lead to potential metadata service interruptions if active MDS daemons fail without adequate standby replacements. It’s crucial for the reliability and availability of the Ceph filesystem to have an adequate number of standby MDS daemons to take over should an active daemon experience issues.

Troubleshooting

Step 1: Verify Current Number of MDS Daemons

First, you need to ascertain the current count of MDS daemons, both active and standby. Execute the following command to accomplish this:

juju exec --unit ceph-mon/leader -- ceph mds stat

This command should provide an overview of the MDS cluster status, indicating how many daemons are up and in standby mode.

Step 2: Check standby_count_wanted

Next, determine the desired number of standby MDS daemons (standby_count_wanted) by running:

juju exec --unit ceph-mon/leader -- ceph fs get <fs-name> 

Replace <fs-name> with the name of your Ceph filesystem. This command fetches detailed information about the filesystem, including the standby_count_wanted parameter.

Step 3: Compare the Numbers

At this point, you should compare the current number of standby MDS daemons to standby_count_wanted. If the current number is less, proceed to the next step to adjust.

Step 4: Adjust Standby Count or Add MDS Daemons

Depending on your specific needs and resources, you either need to adjust the standby_count_wanted to match the current capabilities or add more MDS daemons to meet the desired standby count.

  • To Adjust standby_count_wanted:

    Utilize the following command to lower the standby_count_wanted:

    juju exec --unit ceph-mon/leader -- 'ceph fs set <fs-name> standby_count_wanted <new-value>'
    

    Ensure <fs-name> is your filesystem name and <new-value> is the adjusted standby count based on your current MDS daemons.

  • To Add More MDS Daemons:

    Deploy additional MDS daemons by scaling out your MDS service with Juju:

    juju add-unit ceph-fs --num-units=<additional-units-required>
    

    Replace <additional-units-required> with the number of MDS daemons you wish to add.

Step 5: Verify Adjustments

After making adjustments, re-execute the commands from Steps 1 and 2 to ensure that the number of standby MDS daemons now meets or exceeds standby_count_wanted.

Step 6: Inspect Logs

If issues persist, consult the MDS logs for potential errors or issues:

  • Navigate to /var/log/ceph on the MDS servers.
  • Look for recent log entries in ceph-mds.<id>.log files that might indicate problems or issues with becoming a standby or promoting to active.

Alert: CephFilesystemMDSRanksLow (mds)

Overview

The alert “CephFilesystemMDSRanksLow” indicates that the number of active Metadata Server (MDS) daemons is lower than the expected number defined by the filesystem’s max_mds setting. The MDS daemons are crucial for the Ceph Filesystem (CephFS) as they manage metadata operations, such as directory listings and file locations, enabling efficient file access across the distributed storage cluster. A shortfall in the number of active MDS daemons can impair the filesystem’s performance and availability.

Troubleshooting

Step 1: Verify MDS Daemon Status

  1. To get an overview of the current MDS daemons and their states, use the following command:
juju exec --unit ceph-mon/X -- ceph fs status

This command provides detailed information about the MDS daemons, including which ones are active, standby, or in any other state.

Step 2: Check the max_mds Setting

  1. To confirm the configured max_mds setting for your CephFS, execute:
juju exec --unit ceph-mon/X -- ceph fs get cephfs  # Replace 'cephfs' with your filesystem's name if different

Look for the max_mds field in the output.

Step 3: Inspect MDS Logs for Errors

  1. Examine the MDS daemon logs for any obvious errors or warnings that could indicate why the daemons are not reaching the desired max_mds count.
less /var/log/ceph/ceph-mds.*.log

Replace * with the identifier of your MDS daemons as necessary to check the logs of each.

Step 4: Review Ceph Cluster Overall Health

  1. It’s also important to review the overall health of the Ceph cluster as issues elsewhere can indirectly affect the MDS daemons:
juju exec --unit ceph-mon/X -- ceph status

Step 5: Adjust max_mds Setting If Necessary

  1. If the number of MDS daemons needs adjustment, you can modify the max_mds setting for your filesystem:
juju exec --unit ceph-mon/X -- ceph fs set cephfs max_mds <desired_number>

Replace <desired_number> with the number of MDS daemons you wish to set as the maximum, and cephfs with the name of your CephFS filesystem.

Step 6: Resolve Any Identified Issues

  • Based on the findings from the previous steps, take the necessary actions to resolve any identified issues. This might involve restarting failed daemons, adjusting configuration settings, or addressing any cluster-wide problems that could be impacting MDS functionality.

Alert: CephFilesystemOffline (mds)

Overview

The alert “CephFilesystemOffline” indicates a critical situation where the CephFS (Ceph Filesystem) is inaccessible due to all Metadata Server (MDS) daemons being down. The MDS daemons are responsible for managing filesystem metadata, which includes file names, directories, permissions, and other file attributes. When these daemons are offline, clients cannot access the filesystem, leading to a disruption in services relying on CephFS.

Troubleshooting

To resolve the “CephFilesystemOffline” alert, follow this troubleshooting plan:

Step 1: Verify MDS Daemon Status

  • Execute the command juju status to check the status of all units and applications in the Juju environment. Pay special attention to the Ceph units to see if they are active or in an error state.

Step 2: Check MDS Cluster Health

  • Run juju exec --unit ceph-mon/X -- ceph -s to get the overall cluster health and status. This command checks if there are any reported issues with the MDS or other parts of the cluster. An unhealthy cluster can impact MDS functionality.

Step 3: Inspect MDS Logs

  • Logs can provide detailed error messages or hints on why the MDS daemons are down. Logs are located in /var/log/ceph on the MDS servers.
  • Use SSH to access the MDS servers (the specific command would depend on your setup, typically something along the lines of juju ssh ceph-fs/X), and then inspect the MDS logs with tail -f /var/log/ceph/ceph-mds.<mds-id>.log.

Step 4: Restart MDS Daemons

  • If the MDS services are in a failed state, attempt to restart them.
  • To restart an MDS daemon, you can use systemd directly, for example juju exec --unit ceph-fs/X -- systemctl restart ceph-mds@<mds-id>.service, or a charm action such as restart-services if your ceph-fs charm revision provides one (juju run ceph-fs/X restart-services).

Step 5: Verify Network Connectivity

  • Network issues can prevent MDS daemons from communicating with other cluster components. Verify connectivity between the MDS daemons and other Ceph nodes (MONs, OSDs).
  • You can use the ping command from within the MDS servers to test connectivity to other cluster nodes.

Step 6: Check for Resource Constraints

  • Ensure that the MDS servers are not running out of resources such as CPU, memory, or disk. High resource utilization can lead to service instability or crashes.
  • Commands like top, htop, or df -h can be used to monitor these resources.

Step 7: Re-deploy MDS Daemons (If Necessary)

  • If the issue cannot be resolved through restarts or configuration adjustments, consider re-deploying the MDS daemons.
  • Use the juju remove-unit and juju add-unit commands to remove and add ceph-fs units.

Alert: CephFilesystemReadOnly (mds)

Overview

The alert CephFilesystemReadOnly indicates that the CephFS (Ceph Filesystem) has transitioned into a read-only state. This protective measure usually occurs to prevent further data corruption or loss after encountering an error related to writing data to the metadata pool, or if the administrator forces the MDS into read-only mode. The metadata pool in Ceph is crucial as it stores all the metadata for the Ceph Filesystem, including directory trees and file attributes. An issue with the metadata pool can severely affect the integrity of the filesystem.

Troubleshooting

Step 1: Identify the Affected MDS Daemon(s)

Start by listing the status of all MDS daemons in your cluster to identify which ones are in a read-only state.

juju exec --unit ceph-mon/X -- ceph mds stat

Step 2: Check MDS Daemon Logs

Next, inspect the logs of the affected MDS daemon(s). This can help identify any errors or warnings that could have caused the transition to read-only mode.

juju exec --unit ceph-fs/X -- tail -n 100 /var/log/ceph/ceph-mds.<mds-id>.log

Replace <mds-id> with the identifier of the MDS daemon.

Step 3: Check Cluster Health

A broader issue within the cluster could also impact the MDS daemon. Run the following to check the overall health and especially look for issues related to the metadata pool.

juju exec --unit ceph-mon/X -- ceph status

Step 4: Inspect Metadata Pool for Issues

Verify the status and configuration of the metadata pool, which could provide insights into the root cause of the issue.

juju exec --unit ceph-mon/X -- ceph osd pool stats cephfs_metadata

Ensure to replace cephfs_metadata with the actual name of your metadata pool if it is different.

Step 5: Perform Remedial Actions

Perform any remedial actions to address issues found during the investigation, such as crashed daemons, network connectivity or resource issues.

Alert: CephHealthError (cluster health)

Overview

The alert “CephHealthError” indicates that the Ceph cluster has encountered a significant issue and has been in the HEALTH_ERROR state for more than 5 minutes. This state is critical and suggests that there are immediate problems that need to be addressed to ensure the cluster’s stability and data integrity. The HEALTH_ERROR state can be triggered by a variety of issues, such as OSDs (Object Storage Daemons) being down, network connectivity problems, disk failures, or misconfigurations, among others.

To resolve this issue, a detailed investigation is required to identify the root cause(s) and apply the necessary fixes.

Troubleshooting

Step 1: Check Cluster Health Details

  • First, obtain detailed information about the cluster’s health by running the following commands:
    juju exec --unit ceph-mon/X -- ceph status
    juju exec --unit ceph-mon/X -- ceph health detail
    
  • These commands output detailed information about any health issues. Make a note of all error messages and warnings.

Step 2: Review Ceph Monitor Logs

  • Logs can also provide additional detail.
  • Inspect the Ceph monitor logs for any anomalies or errors that might indicate what caused the HEALTH_ERROR state. Use the following command to view the latest entries:
    juju exec --unit ceph-mon/X -- 'tail -n 100 /var/log/ceph/ceph.log'
    

Step 3: Check OSD Status

  • A common cause for HEALTH_ERROR is OSDs being down or in an unhealthy state.

  • Refer to the diagnostics output above to see if OSD issues are listed.

  • Ensure all OSDs are up and in. If any OSD is down, investigate the logs for that specific OSD by running:

    juju ssh ceph-osd/X 'tail -n 100 /var/log/ceph/ceph-osd.<osd-id>.log'
    

    Replace X with the unit id and <osd-id> with the id of the OSD in question.

Step 4: Inspect Network Connectivity

  • Network issues can also lead to a HEALTH_ERROR status. Test network connectivity between Ceph nodes using tools like ping and traceroute.
  • Ensure there are no significant latencies or packet losses.

Step 5: Verify Disk Health

  • Disk failures or errors can cause OSDs to go down, leading to health errors. Check the disk health on OSD nodes using e.g. the smartctl tool.

Step 6: Consult Ceph Documentation

  • For specific error messages encountered during troubleshooting, consult the Ceph documentation or the community forums for guidance and potential solutions.

Alert: CephHealthWarning (cluster health)

Overview

The alert CephHealthWarning indicates that the Ceph cluster’s health status has been in a WARNING state for more than 15 minutes. This state signifies that while the cluster is still functional, there are issues that need to be addressed to prevent any potential degradation in performance or data availability. The warning state can be triggered by a variety of conditions such as OSDs down, reduced data redundancy, slow requests, etc. As this is a generic alert, it is necessary to investigate the specifics of the warning to understand the underlying cause and take appropriate action.

Troubleshooting

Step 1: Check Cluster Health Details

  • First, obtain detailed information about the cluster’s health by running the following commands:
    juju exec --unit ceph-mon/X -- ceph status
    juju exec --unit ceph-mon/X -- ceph health detail
    
  • These commands output detailed information about any health issues. Make a note of all error messages and warnings.

Step 2: Review Ceph Monitor Logs

  • Logs can also provide additional detail.
  • Inspect the Ceph monitor logs for any anomalies or errors that might indicate what caused the HEALTH_WARN state. Use the following command to view the latest entries:
    juju exec --unit ceph-mon/X -- 'tail -n 100 /var/log/ceph/ceph.log'
    

Step 3: Check OSD Status

  • A common cause for HEALTH_WARN is OSDs being down or in an unhealthy state.

  • Refer to the diagnostics output above to see if OSD issues are listed.

  • Ensure all OSDs are up and in. If any OSD is down, investigate the logs for that specific OSD by running:

    juju ssh ceph-osd/X 'tail -n 100 /var/log/ceph/ceph-osd.<osd-id>.log'
    

    Replace X with the unit id and <osd-id> with the id of the OSD in question.

Step 4: Inspect Network Connectivity

  • Network issues can also lead to a HEALTH_WARN status. Test network connectivity between Ceph nodes using tools like ping and traceroute.
  • Ensure there are no significant latencies or packet losses.

Step 5: Verify Disk Health

  • Disk failures or errors can cause OSDs to go down, leading to health errors. Check the disk health on OSD nodes using e.g. the smartctl tool.

Step 6: Consult Ceph Documentation

  • For specific error messages encountered during troubleshooting, consult the Ceph documentation or the community forums for guidance and potential solutions.

Alert: CephMgrModuleCrash (mgr)

Overview

The alert CephMgrModuleCrash indicates that one or more of the manager (mgr) modules within the Charmed Ceph storage system have encountered a crash. Manager modules are responsible for providing additional services and interfaces to the Ceph storage cluster, such as monitoring, dashboard interfaces, and REST APIs. A crash in one of these modules can lead to partial or complete loss of functionality in the services they provide. The alert advises to utilize the ceph crash command to identify the crashed module(s) and suggests archiving the crash information to acknowledge the failure.

Troubleshooting

Step 1: Identifying the Crashed Module

  1. Log into one of the machines hosting the Ceph Manager daemon
    juju ssh ceph-mon/X
    
  2. Execute the following command to list the recent crashes:
    sudo ceph crash ls
    
    This command will provide details about crashes that have occurred, including the crashed module.

Step 2: Gathering Crash Information

After identifying the crashed module, gather more detailed information about the crash for diagnosis:

  1. Use the crash ID obtained from the previous step to get detailed information:

    sudo ceph crash info <crash_id>
    
  2. Archive the crash report to acknowledge it:

    sudo ceph crash archive <crash_id>
    

Step 3: Checking Logs

Inspect the log files for any error messages or warnings related to the manager module:

Log files for Ceph are located in /var/log/ceph/. Use commands like grep, less, or tail to examine ceph*.log files for relevant entries.
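
For example, assuming the default log file naming, the following shows recent error entries from the manager log (adjust the unit number):

juju ssh ceph-mon/X -- 'grep -i error /var/log/ceph/ceph-mgr.*.log | tail -n 50'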

Step 4: Restarting the Ceph Manager Daemon

Attempt to restart the Ceph Manager daemon to see if the problem resolves:

Use Juju to restart the service:

juju exec --unit ceph-mon/<unit_number> -- systemctl restart ceph-mgr@<mgr_id>

Replace <unit_number> with the unit number of the Ceph Manager daemon and <mgr_id> with the ID of the manager instance.

Step 5: Contacting Support

If the issue persists, gather all the diagnostic information, including the output from the previous steps, and contact support for further assistance.

Alert: CephMonClockSkew (mon)

Overview

The CephMonClockSkew alert indicates a significant issue within a Ceph cluster, particularly among the Ceph monitors (mons). Ceph relies heavily on synchronized timekeeping across all nodes in the cluster to ensure data consistency, stability, and the maintenance of quorum among the monitors. When the time on one or more monitors drifts significantly from the others—beyond the configured threshold—this alert is triggered. Such time discrepancies can lead to errors in cluster operation, including but not limited to, loss of quorum or split-brain scenarios, which can have severe implications for data availability and integrity.

Troubleshooting

The troubleshooting process involves identifying the affected monitors, verifying and synchronizing the system time on all monitor hosts, and ensuring that time synchronization services are correctly configured and operational. Follow these steps:

Step 1: Identify Affected Monitors

  • Use juju exec --unit ceph-mon/<unit> -- ceph -s to review the cluster status and identify which monitors have clock skew issues. Replace <unit> with the appropriate unit number.
  • Pay attention to the “health” section of the output for messages regarding clock skew.

Step 2: Check Time Sync Status

  • For each monitor identified in step 1, check the time synchronization status with juju exec --unit ceph-mon/<unit> -- ceph time-sync-status. This command will help you understand if the time synchronization mechanism (NTP or Chrony) is functioning correctly.

Step 3: Verify System Time

  • Manually check the system time on each monitor node by executing juju exec --unit ceph-mon/<unit> -- date. Compare the outputs to ensure they are closely synchronized.

Step 4: Ensure NTP or Chrony is Running

  • Depending on whether your system uses NTP or Chrony for time synchronization, check the service status.
    • For NTP, use juju exec --unit ceph-mon/<unit> -- systemctl status ntp.
    • For Chrony, use juju exec --unit ceph-mon/<unit> -- systemctl status chrony.
  • Make sure the service is active and running.

Step 5: Synchronize System Time

  • If time discrepancies are found, synchronize the system time on the affected monitors.
    • For NTP, force a synchronization with juju exec --unit ceph-mon/<unit> -- sudo ntpdate -u <ntp-server>, replacing <ntp-server> with your NTP server address.
    • For Chrony, use juju exec --unit ceph-mon/<unit> -- chronyc makestep to correct significant offsets.

Step 6: Review the Configuration of Time Synchronization Services

  • Ensure that NTP or Chrony is correctly configured to start at boot and to synchronize with reliable time sources.
  • Review the configuration files (e.g., /etc/ntp.conf for NTP or /etc/chrony/chrony.conf for Chrony) for each monitor node using juju exec.
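    For example, assuming Chrony is in use:
    juju exec --unit ceph-mon/<unit> -- cat /etc/chrony/chrony.conf
    juju exec --unit ceph-mon/<unit> -- chronyc sources -v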

Step 7: Re-check Cluster Status

  • After resolving any time synchronization issues, re-check the cluster status with juju exec --unit ceph-mon/<unit> -- ceph -s to ensure there are no longer any clock skew warnings.

Alert: CephMonDiskspaceCritical (mon)

Overview

The alert CephMonDiskspaceCritical indicates that the filesystem space available to at least one Ceph monitor (mon) is critically low. This situation requires immediate attention because monitors are crucial for the functioning of the Ceph cluster, storing essential data about the cluster’s state. The monitor database is stored under /var/lib/ceph/mon (typically /var/lib/ceph/mon/ceph-<hostname>/store.db in a Charmed Ceph deployment). Critically low space can lead to the Ceph cluster becoming unstable or going into a read-only state to prevent data corruption.

Troubleshooting

To troubleshoot and resolve the CephMonDiskspaceCritical alert, follow these steps:

Step 1: Identify Affected Monitors

  • To identify which monitors are experiencing issues:

    juju exec --unit ceph-mon/X 'ceph health detail'
    

Step 2: Check Disk Usage

  • For each affected monitor, check the disk usage with the command:
    juju ssh <mon-unit> -- 'df -h /var/lib/ceph'
    
    Replace <mon-unit> with the actual monitor unit name, such as ceph-mon/0.

Step 3: Inspect Log Files and Other Space Consumers

  • Inspect log files and other large files
  • Look for old, rotated versions of log files (*.log)
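    For example, to see which files consume the most space under /var/log/ceph (the unit name below is a placeholder):
    juju ssh <mon-unit> -- 'sudo du -sh /var/log/ceph/* | sort -h | tail'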

Step 4: Clean Up Unnecessary Files

  • Before deleting any files, ensure they are not required by Ceph. You can safely remove rotated log files that are no longer in use. Use the juju ssh command to execute file removal commands on the specific monitor unit.
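    For example, rotated and compressed logs can usually be removed safely (this is only an illustration; double-check the paths before deleting anything):
    juju ssh <mon-unit> -- 'sudo rm /var/log/ceph/*.gz'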

Step 5: Monitor Disk Space After Cleanup

  • After cleaning up unnecessary files, check the disk space again to ensure the issue is resolved:
    juju ssh <mon-unit> -- 'df -h /var/lib/ceph'
    

Step 6: Consider Increasing Disk Space

  • If cleaning does not free up sufficient space, consider adding more disk space to the affected monitors. This may involve resizing the underlying volumes or attaching additional storage, depending on your infrastructure.

Step 7: Verify Monitor Health

  • Once you’ve addressed the disk space issue, ensure the health of the Ceph cluster with:
    juju exec --unit ceph-mon/X 'ceph health detail'
    

Alert: CephMonDiskspaceLow (mon)

Overview

The alert “CephMonDiskspaceLow” indicates that the disk space available to at least one of the Ceph monitor nodes is running low, exceeding 70% usage by default. This situation can lead to performance degradation or even cluster failure if not addressed promptly. Monitors play a crucial role in the Ceph cluster by maintaining maps of the cluster state, including the monitor map, the OSD map, the placement group (PG) map, and the CRUSH map. Therefore, ensuring monitors have sufficient disk space is vital for the stable operation of a Ceph cluster.

Troubleshooting

Step 1: Identify Affected Monitors

  • To identify which monitors are experiencing issues:

    juju exec --unit ceph-mon/X 'ceph health detail'
    

Step 2: Check Disk Usage

  • For each affected monitor, check the disk usage with the command:
    juju ssh <mon-unit> -- 'df -h /var/lib/ceph'
    
    Replace <mon-unit> with the actual monitor unit name, such as ceph-mon/0.

Step 3: Inspect Log Files and Other Space Consumers

  • Inspect log files and other large files
  • Look for old, rotated versions of log files (*.log)

Step 4: Clean Up Unnecessary Files

  • Before deleting any files, ensure they are not required by Ceph. You can safely remove rotated log files that are no longer in use. Use the juju ssh command to execute file removal commands on the specific monitor unit.

Step 5: Monitor Disk Space After Cleanup

  • After cleaning up unnecessary files, check the disk space again to ensure the issue is resolved:
    juju ssh <mon-unit> -- 'df -h /var/lib/ceph'
    

Step 6: Consider Increasing Disk Space

  • If cleaning does not free up sufficient space, consider adding more disk space to the affected monitors. This may involve resizing the underlying volumes or attaching additional storage, depending on your infrastructure.

Step 7: Verify Monitor Health

  • Once you’ve addressed the disk space issue, ensure the health of the Ceph cluster with:
    juju exec --unit ceph-mon/X 'ceph health detail'
    

Alert: CephMonDown (mon)

Overview

The alert “CephMonDown” indicates that one or more of the Ceph monitors within your cluster are down or unreachable. Monitors are crucial for managing the state of the Ceph cluster, including membership, configuration, and the current state of the distributed storage. They work in a quorum to ensure consistency and availability. While the cluster can function with a monitor down, it is at risk since losing an additional monitor could result in the cluster becoming inoperable. This alert serves as an early warning to prevent a potential full cluster outage.

Troubleshooting

Step 1: Verify Juju Status

- Use the `juju status` command to get an overview of the current state of all components in your deployment, including the monitor units. Look for any units in an error state or that are not active.

Step 2: Check the Monitors in Quorum

- Run `juju exec --unit ceph-mon/0 -- ceph mon stat` to see the status of monitors and which ones are in quorum. This can help identify which monitor(s) are down.

Step 3: Inspect Ceph Monitor Logs

- Investigate the logs of the monitor that is reported down. You can find these logs in `/var/log/ceph` on the affected mon nodes. Use commands like `grep` or `tail` to inspect the recent entries for any errors or warnings.
- Example: `tail -n 100 /var/log/ceph/ceph-mon.<hostname>.log`

Step 4: Check Network Connectivity

- Ensure there is network connectivity between the monitors and other nodes in the cluster. Use `ping` or `traceroute` from the monitor nodes to other cluster nodes and vice versa.

Step 5: Review Ceph Cluster Health

- From one of the mon nodes, run `juju exec --unit ceph-mon/0 -- ceph health detail` to get detailed health information about the cluster. This will provide insights into the specific issues affecting the monitor(s).

Step 6: Restart Down Monitors

- If a monitor is not responsive, consider restarting it. This can be done through Juju by running `juju exec --unit ceph-mon/X 'sudo systemctl restart ceph-mon@<hostname>'` where X is the unit number and `<hostname>` is the name of the node hosting the down monitor.

Alert: CephMonDownQuorumAtRisk (mon)

Overview

The alert “CephMonDownQuorumAtRisk” indicates a critical situation in a Ceph cluster where the quorum of Ceph Monitor (mon) daemons is at risk. In a Ceph cluster, the monitors maintain a master copy of the cluster map and require a majority (quorum) to agree on decisions. This quorum is essential for the cluster’s operation; without it, the cluster cannot make decisions or update the cluster state, leading to the cluster becoming inoperable. This situation can affect all services relying on the Ceph cluster and any connected clients, potentially leading to data unavailability or loss.

Quorum is calculated based on the total number of monitor daemons in the cluster, and a majority of these monitors need to be operational and communicating with each other to maintain quorum. For example, in a cluster with three monitors, at least two monitors need to be up and in communication with each other to maintain quorum. Losing quorum could be due to network partitions, hardware failures, misconfigurations, or other operational issues that result in monitor daemons being unable to communicate with each other.

Troubleshooting

Step 1: Verify Juju Status

- Use the `juju status` command to get an overview of the current state of all components in your deployment, including the monitor units. Look for any units in an error state or that are not active.

Step 2: Check the Monitors in Quorum

- Run `juju exec --unit ceph-mon/leader -- ceph mon stat` to see the status of monitors and which ones are in quorum. This can help identify which monitor(s) are down.

Step 3: Inspect Ceph Monitor Logs

- Investigate the logs of the monitor that is reported down. You can find these logs in `/var/log/ceph` on the affected mon nodes. Use commands like `grep` or `tail` to inspect the recent entries for any errors or warnings.
- Example: `tail -n 100 /var/log/ceph/ceph-mon.<hostname>.log`

Step 4: Check Network Connectivity

- Ensure there is network connectivity between the monitors and other nodes in the cluster. Use `ping` or `traceroute` from the monitor nodes to other cluster nodes and vice versa.

Step 5: Review Ceph Cluster Health

- Run `juju exec --unit ceph-mon/leader -- ceph health detail` to get detailed health information about the cluster. This will provide insights into the specific issues affecting the monitor(s).

Step 6: Restart Down Monitors

- If a monitor is not responsive, consider restarting it. This can be done through Juju by running `juju exec --unit ceph-mon/X 'systemctl restart ceph-mon@<hostname>'` where `X` is the unit number and `<hostname>` is the name of the node hosting the down monitor.

Alert: CephObjectMissing (rados)

Overview

The CephObjectMissing (rados) alert indicates that one or more objects in the RADOS (Reliable Autonomic Distributed Object Store) layer of the Ceph storage system have been marked as UNFOUND. This condition arises when the latest version of a RADOS object cannot be located on currently running OSDs. This situation is problematic because client I/O requests targeting the missing object will block or hang, potentially leading to degraded performance or application timeouts.

One possible cause is OSDs that crashed before updates could be replicated. In this case, the issue can often be resolved by bringing those OSDs back online.

Otherwise, the resolution often involves manual intervention to roll the object back to a previous version and ensure its integrity.

Troubleshooting

Step 1: Check Cluster Health

  • Run juju exec --unit ceph-mon/0 -- ceph -s for diagnostics.
  • Make a note of any OSDs that are unexpectedly down.

Step 2: Review OSD Logs

  • Examine the OSD logs for any anomalies or errors that might indicate why the object is missing. The logs are located in /var/log/ceph on the respective OSD nodes. Focus on log entries around the time the issue was first reported.

Step 3: Rescue down OSDs

  • Attempt to rescue OSDs that might have crashed.
  • Sometimes, simply restarting an OSD can resolve temporary issues and bring it back online.
  • To restart a specific OSD, use juju exec --unit ceph-osd/<unit-number> -- systemctl restart ceph-osd@<osd-id>.service, ensuring you replace <unit-number> and <osd-id> with the appropriate values.

Step 4: Check for Under-replicated or Degraded Objects

  • Execute juju exec --unit ceph-mon/0 -- ceph -s to check for any warnings about under-replicated or degraded objects that could point to potential issues with data distribution or availability.

Step 5: Attempt Object Recovery

  • If reviving OSDs did not resolve the issue, it might be necessary to recover the object manually.
  • Engage manual recovery procedures if a specific object version is known to be good. This could involve rolling the object back to that version. Note that this step requires a deep understanding of Ceph internals and should be performed with caution.
  • See https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#failures-osd-unfound for details around this procedure

Alert: CephOSDBackfillFull (osd)

Overview

The CephOSDBackfillFull alert indicates that one or more OSDs (Object Storage Daemons) in a Ceph cluster have reached the BACKFILL FULL threshold. When an OSD reaches this threshold, it is too full to accept backfill operations, which are crucial for data rebalancing and recovery within the cluster. This situation can lead to data distribution imbalances and potentially affect the cluster’s redundancy.

This condition triggers when the amount of data stored on an OSD approaches its capacity limit, preventing Ceph from moving or copying data to that OSD as part of its normal rebalancing and recovery processes. Such a scenario requires immediate attention to prevent data loss or downtime.

Troubleshooting

To effectively troubleshoot and resolve the CephOSDBackfillFull alert, follow the steps below:

Step 1: Identify Affected OSDs

  • Execute the command juju exec --unit ceph-mon/X -- ceph health detail to get detailed health information and identify which OSDs are affected by the BACKFILL FULL condition.
  • Run juju exec --unit ceph-mon/X -- ceph osd df to display the disk usage of all OSDs. Look for OSDs with high utilization percentages.

Step 2: Review Cluster Capacity

  • Analyze the overall capacity and usage of the cluster to understand if the issue is isolated or widespread.
  • Consider using juju exec --unit ceph-mon/X -- ceph df to review the cluster’s total, used, and available space.

Step 3: Add Capacity

  • If capacity is the issue, add more storage to your Ceph cluster. This might involve adding more OSDs or increasing the storage capacity of existing OSDs.
  • Use the appropriate Juju commands to scale your cluster, depending on your deployment configuration.
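    For example, a new ceph-osd unit can be added, or an unused disk attached to an existing unit via the add-disk action shown earlier in this guide:
    juju add-unit ceph-osd -n 1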

Step 4: Restore Down or Out OSDs

  • Check for any OSDs that are down or marked out. These OSDs can contribute to the problem by reducing the available capacity.
  • Use juju exec --unit ceph-mon/X -- ceph osd tree to check the status of OSDs and identify any that are down or out.
  • If any OSDs need to be restored, follow the necessary steps to bring them back online.
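    For example, to start a stopped OSD daemon and mark it back in (placeholders shown):
    juju exec --unit ceph-osd/<unit> -- systemctl start ceph-osd@<osd-id>.service
    juju exec --unit ceph-mon/X -- ceph osd in osd.<osd-id>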

Step 5: Delete Unwanted Data

  • If adding capacity is not feasible, consider deleting unwanted or unnecessary data from the cluster to free up space.
  • Prioritize the cleanup of data that is safe to remove, such as old snapshots or redundant copies of data.
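    For example, assuming the RBD client tools are available on the monitor unit, old RBD snapshots can be listed and removed (pool, image, and snapshot names are placeholders; confirm the data is no longer needed first):
    juju exec --unit ceph-mon/X -- rbd snap ls <pool>/<image>
    juju exec --unit ceph-mon/X -- rbd snap rm <pool>/<image>@<snapshot>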

Alert: CephOSDDownHigh (osd)

Overview

The alert “CephOSDDownHigh” is triggered when more than 10% of the Object Storage Daemons (OSDs) in a Ceph cluster are down. OSDs are responsible for storing data, replication, recovery, rebalancing, and providing information to Ceph monitors about the storage state. Having a significant portion of OSDs down compromises the redundancy, performance, and potentially the availability of data stored in the cluster. This situation requires immediate attention to identify the root cause, bring the OSDs back online, and ensure the health and integrity of the Ceph cluster.

Troubleshooting

Step 1: Verify the Current Status of OSDs

First, confirm the alert by checking the current status of the OSDs in your Ceph cluster. Execute the following command using juju exec to run it on the appropriate unit:

juju exec --unit ceph-mon/leader -- ceph osd stat

This command provides a summary of the OSDs’ status, indicating how many are up and in. Confirm that the number of OSDs down matches the alert details.

Step 2: Identify Down OSDs

To list the OSDs that are currently down, use the following command:

juju exec --unit ceph-mon/leader -- ceph osd tree | grep down

This will filter the output of the ceph osd tree command to show only the OSDs that are down. Note the IDs of these OSDs for further investigation.

Step 3: Check for Common Issues

For each OSD identified in step 2, check for common issues that might cause an OSD to be down.

a. Review OSD Logs

Inspect the logs for any OSDs that are down. The logs may contain errors or warnings that could indicate why the OSD is not online.

juju exec --unit ceph-osd/X -- tail -n 100 /var/log/ceph/ceph-osd.<osd-id>.log

Replace X with the unit number of the ceph-osd unit and <osd-id> with the ID of the OSD you are investigating. Repeat this step for each down OSD, modifying the unit number as necessary to match the OSD’s host.

b. Check Disk Health

Hardware issues, such as disk failures, can cause an OSD to go down. Check the health of the physical disks on the nodes hosting the down OSDs, for instance using the smartctl program, depending on the type of disks employed.

c. Verify Network Connectivity

Network issues can also cause OSDs to become unreachable. Verify network connectivity and configuration on the nodes hosting the down OSDs.

One possible check is to ping a monitor:

juju exec --unit ceph-osd/X -- ping -c 3 <monitor-ip>

Replace X with the unit number of the ceph-osd unit and <monitor-ip> with the IP address of one of your Ceph monitors. This checks connectivity from an OSD node to a monitor node.

Step 4: Attempt to Restart Down OSDs

If no hardware or network issues are identified, attempt to restart the down OSDs.

To restart a specific OSD, use juju exec --unit ceph-osd/X -- systemctl restart ceph-osd@<osd-id>.service, ensuring you replace X and <osd-id> with the appropriate values.

Monitor the logs for any errors during the restart process.

Step 5: Escalate if Necessary

If the OSDs remain down after attempting a restart and no clear cause has been identified, escalate the issue. Consider reaching out to the support community or consulting the official Ceph documentation for further assistance.

Alert: CephOSDDown (osd)

Overview

The CephOSDDown alert indicates that one or more OSD (Object Storage Daemon) instances in the Ceph cluster have been marked as “down” for over 5 minutes. OSDs are responsible for storing data, handling data replication, recovery, rebalancing, and providing information to Ceph monitors about changes. When an OSD is marked down, it means it is not communicating with the rest of the cluster, which can lead to decreased data redundancy, potential data loss, and reduced cluster performance. This condition necessitates prompt investigation and resolution to ensure the health and integrity of the Ceph storage system.

Troubleshooting

Troubleshooting a CephOSDDown alert involves several steps to identify and resolve the underlying issues causing the OSD to be marked down.

Step 1: Verify the Current Status of OSDs

First, confirm the alert by checking the current status of the OSDs in your Ceph cluster. Execute the following command using juju exec to run it on the appropriate unit:

juju exec --unit ceph-mon/0 -- ceph osd stat

This command provides a summary of the OSDs’ status, indicating how many are up and in. Confirm that the number of OSDs down matches the alert details.

Step 2: Identify Down OSDs

To list the OSDs that are currently down, use the following command:

juju exec --unit ceph-mon/0 -- ceph osd tree | grep down

This will filter the output of the ceph osd tree command to show only the OSDs that are down. Note the IDs of these OSDs for further investigation.

Step 3: Check for Common Issues

For each OSD identified in step 2, check for common issues that might cause an OSD to be down.

a. Review OSD Logs

Inspect the logs for any OSDs that are down. The logs may contain errors or warnings that could indicate why the OSD is not online.

juju exec --unit ceph-osd/0 -- tail -n 100 /var/log/ceph/ceph-osd.<osd-id>.log

Replace <osd-id> with the ID of the OSD you are investigating. Repeat this step for each down OSD, modifying the unit number as necessary to match the OSD’s host.

b. Check Disk Health

Hardware issues, such as disk failures, can cause an OSD to go down. Check the health of the physical disks on the nodes hosting the down OSDs, for instance using the smartctl program, depending on the type of disks employed.

c. Verify Network Connectivity

Network issues can also cause OSDs to become unreachable. Verify network connectivity and configuration on the nodes hosting the down OSDs.

One possible check is to ping a monitor:

juju exec --unit ceph-osd/0 -- ping -c 3 <monitor-ip>

Replace <monitor-ip> with the IP address of one of your Ceph monitors. This checks connectivity from an OSD node to a monitor node.

Step 4: Attempt to Restart Down OSDs

If no hardware or network issues are identified, attempt to restart the down OSDs.

To restart a specific OSD, use juju exec --unit ceph-osd/<unit-number> -- systemctl restart ceph-osd@<osd-id>.service, ensuring you replace <unit-number> and <osd-id> with the appropriate values.

Monitor the logs for any errors during the restart process.

Step 5: Escalate if Necessary

If the OSDs remain down after attempting a restart and no clear cause has been identified, escalate the issue. Consider reaching out to the support community or consulting the official Ceph documentation for further assistance.

Alert: CephOSDFlapping (osd)

Overview

The alert CephOSDFlapping (osd) indicates that an Object Storage Daemon (OSD) in the Ceph storage cluster has been repeatedly marked down and then back up within a short period, a behavior known as “flapping”; the alert fires when this has happened roughly once a minute for 5 minutes. This problem is often caused by network-related issues such as high latency, packet loss, or an MTU (Maximum Transmission Unit) mismatch, either on the cluster network (used for internal Ceph cluster communication) or on the public network (if no separate cluster network is deployed).

Flapping of OSDs is a significant concern because it can lead to instability in the storage cluster, potential data unavailability, or performance degradation. Identifying and resolving the underlying network issues is crucial for maintaining the health and performance of the Ceph storage system.

Troubleshooting

Step 1: Identify Affected OSDs and Hosts

Before diving into network diagnostics, you need to pinpoint which OSDs and their corresponding hosts are experiencing the flapping issue. While the alert should mention specific OSDs, you can also use the following command to list the current status of all OSDs in the cluster:

juju exec --unit ceph-mon/X -- ceph osd status

Step 2: Check Network Latency and Packet Loss

High network latency or packet loss between OSDs can cause flapping. Use ping and mtr (My Traceroute) tools to diagnose these issues:

  • To check for packet loss or high latency, you can ping the affected hosts from each other. For example:
ping -c 10 <affected-host-IP>
  • To get more detailed insights on the path and potential network issues, use the mtr command:
mtr <affected-host-IP>

Step 3: Verify MTU Settings

Mismatched MTU settings can lead to packet loss and OSD flapping. Verify the MTU settings on all interfaces involved in the cluster or public network communication:

ip a | grep mtu

Ensure the MTU values are consistent across your network devices and interfaces.
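
One quick path-MTU check (a sketch, assuming 9000-byte jumbo frames; the 8972-byte payload leaves room for the 28 bytes of IP and ICMP headers) is to ping a peer host with fragmentation disallowed:

ping -M do -s 8972 -c 3 <peer-host-IP>

If this fails while smaller payloads succeed, an MTU mismatch along the path is likely.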

Step 4: Inspect Network Interface and Daemon Logs

Examine logs for any signs of network issues or errors on the affected hosts. Review the Ceph daemon logs located at /var/log/ceph and system network interface logs for clues.

Investigate system logs for network-related messages, focusing on the time frames when flapping occurred.
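
As a sketch (the exact log strings vary by release, so treat the patterns as illustrative), heartbeat failures and recent NIC link events can be surfaced with:

grep -i 'heartbeat_check' /var/log/ceph/ceph-osd.<osd-id>.log
journalctl -k --since "1 hour ago" | grep -i 'link'

Replace <osd-id> with the ID of an affected OSD.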

Step 5: Review Ceph Cluster Health and Peering Status

An overall health check can provide insights into related issues that might contribute to OSD flapping:

juju exec --unit ceph-mon/X -- ceph health detail

Look for any warnings or errors related to OSDs, network connectivity, or other potential issues impacting the cluster.

Step 6: Examine Physical Network Infrastructure

If the above steps do not reveal the issue, consider examining the physical network infrastructure. This includes checking network cables, switch ports, and router configurations for any signs of problems that could affect connectivity between the OSDs.

Final Steps

After identifying and resolving the underlying network issue, monitor the cluster’s stability and the previously affected OSDs to ensure the problem does not recur. Implementing network monitoring tools can help detect and prevent future issues.

Alert: CephOSDFull (osd)

Overview

The “CephOSDFull” alert indicates a critical state in the Ceph storage cluster where one or more Object Storage Daemons (OSDs) have reached their capacity limit, as defined by the “FULL” threshold. When an OSD hits this threshold, it cannot accept any more write operations, effectively blocking writes to the Ceph pools utilizing that OSD. This state not only affects performance but can also lead to data availability issues if not addressed promptly.

The underlying causes can vary, but include insufficient cluster capacity due to physical storage limitations, an unexpected surge in data being written to the cluster, or failure to properly balance data across the cluster’s OSDs.

Troubleshooting

To troubleshoot and resolve the “CephOSDFull (osd)” alert in a Charmed Ceph cluster, follow these steps:

Step 1: Identify Affected OSDs

  • Use the Juju command to execute ceph health detail on one of the monitor (mon) units:
    juju exec --unit ceph-mon/0 -- ceph health detail
    
  • This command will provide detailed health information about the cluster, including which OSD(s) are full.

Step 2: Check OSD Storage Utilization

  • Check the disk usage of all OSDs to understand the storage utilization better:
    juju exec --unit ceph-mon/0 -- ceph osd df
    
  • This command lists all OSDs along with their storage utilization metrics.

Step 3: Balance the Cluster (If Necessary)

  • If certain OSDs are disproportionately full, consider rebalancing the cluster. This can be initiated with:
    juju exec --unit ceph-mon/0 -- ceph osd reweight-by-utilization
    

Step 4: Add More Storage

  • If the cluster is genuinely at capacity, consider adding more OSDs to the cluster to increase storage capacity. This involves deploying new OSD units with the Ceph OSD Charm and adding physical disks or volumes.
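  • As a sketch (assuming a spare device /dev/sdX on an existing unit; both names are placeholders), capacity can be grown either by attaching a disk to an existing unit via the charm's add-disk action or by adding a whole new unit:
    juju run ceph-osd/<unit-number> add-disk osd-devices=/dev/sdX
    juju add-unit ceph-osd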

Step 5: Clean Up Unwanted Data

  • If applicable, delete unwanted or unnecessary data from the cluster to free up space. This might involve purging old data, reducing replication factors temporarily, or compressing data.

Step 6: Monitor Recovery and Health

  • After taking corrective actions, monitor the cluster’s health and recovery process:
    juju ssh ceph-mon/0 -- 'sudo ceph -w'
    
  • This command provides real-time updates on the cluster’s health and recovery status.

Alert: CephOSDHostDown (osd)

Overview

The “CephOSDHostDown” alert indicates that one or more OSD (Object Storage Daemon) hosts within your Ceph cluster are currently offline or unreachable. This impacts the storage cluster’s redundancy and performance. When an OSD host goes offline, it means that all OSD daemons on that host are not communicating with the rest of the cluster, potentially leading to data unavailability and risking the loss of data if the situation is not addressed promptly.

Troubleshooting

Step 1: Identify the Offline OSDs

Start by identifying which OSDs are offline. Run the following command:

juju exec --unit ceph-mon/leader -- ceph osd tree

This will display the state of all OSDs in the cluster. Offline or down OSDs will be marked as “down”.

Step 2: Check the Status of the Host

Use the juju status command to check the status of the Charmed Ceph units. This will help in identifying if the issue is with the host or a specific OSD service.

juju status

Step 3: Inspect Logs for Errors

Review the Ceph logs for any errors related to the OSDs that are down. Specifically, focus on recent entries in the OSD log files located in /var/log/ceph on the affected host. Look for messages indicating why the OSD has been marked as down.
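
For example (a sketch; the grep pattern is illustrative and <osd-id> is a placeholder):

juju exec --unit ceph-osd/N -- grep -iE 'error|fail' /var/log/ceph/ceph-osd.<osd-id>.log

Replace N with the unit hosting the affected OSD. If the host itself is unreachable, inspect the logs from its console or out-of-band management interface instead.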

Step 4: Check Network Connectivity

Ensure the OSD host has proper network connectivity to the rest of the Ceph cluster. You can use tools like ping or traceroute (from the OSD host or another host in the cluster) to check connectivity to the OSD host and other cluster components.
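
For example, from a monitor unit towards the affected host (a sketch; <osd-host-ip> is a placeholder, and traceroute may need to be installed first):

juju exec --unit ceph-mon/leader -- ping -c 3 <osd-host-ip>
juju exec --unit ceph-mon/leader -- traceroute <osd-host-ip>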

Step 5: Review Hardware and System Health

Check the health of the OSD host’s hardware. This includes reviewing the status of the disks, memory, CPU, and network interfaces. Use commands like smartctl for disk health, top or htop for CPU/memory usage, and ip a or ifconfig to check network interface status.

Step 6: Restart the OSD Service

If the host and network appear healthy, and the OSD is still marked as down without clear reasons from the logs, consider restarting the OSD service on the affected host using Charmed Ceph functionalities:

juju exec --unit ceph-osd/N -- systemctl restart ceph-osd@ID

Replace N with the unit number and ID with the OSD ID that is down. It’s crucial to ensure that only the OSD service is restarted, and the command is targeted correctly to prevent unintended service disruptions.

Step 7: Monitor the Cluster Health

After addressing the issue(s), monitor the cluster’s health to ensure it returns to a healthy state:

juju exec --unit ceph-mon/leader -- ceph -s

This command provides an overview of the Ceph cluster’s health, including the status of OSDs, MONs, and other critical components.

Alert: CephOSDInternalDiskSizeMismatch (osd)

Overview

The “CephOSDInternalDiskSizeMismatch” alert indicates a serious inconsistency issue within the Charmed Ceph storage system. Specifically, it points out that the metadata recorded by one or more OSDs does not match the actual size of the underlying storage devices. Such discrepancies can lead to significant problems, including the potential for the affected OSD(s) to crash unexpectedly. This mismatch can arise from several causes, such as incorrect OSD initialization, hardware changes not reflected in the metadata, or filesystem corruption.

Troubleshooting

Given the critical nature of this alert, it’s crucial to follow a systematic approach to identify the affected OSDs and take corrective action. Below is a troubleshooting plan designed specifically for Ubuntu systems running Charmed Ceph.

Step 1: Identify Affected OSDs

  • Use the Juju command to list all the units and identify the ones related to Ceph OSDs:
    juju status | grep osd
    

Step 2: Review OSD Logs for Errors

  • Access the logs for each identified OSD to look for any error messages related to disk size mismatches. The logs are typically located in /var/log/ceph on the OSD nodes. You can use the following command to search for relevant errors:
    grep -i "size mismatch" /var/log/ceph/ceph-osd.*.log
    

Step 3: Check Disk and Metadata Sizes

  • For each affected OSD, verify the actual disk size and compare it with the metadata size recorded by Ceph. Use the following commands to gather this information:
    sudo fdisk -l /dev/<osd-device>
    ceph osd df | grep <osd-id>
    
    Replace <osd-device> with the actual device name and <osd-id> with the specific ID of the OSD being investigated.

Step 4: Redeploy Affected OSDs

  • Once you’ve confirmed which OSDs are affected, you’ll need to redeploy them to correct the mismatch. Use the Charmed Ceph functionality for this purpose, avoiding direct usage of cephadm or ceph-deploy.
  • To redeploy an OSD, run the following. Note that this will keep the OSD id intact.
    juju run <osd-unit-name> remove-disk osd-ids=<osd-id>
    juju run <osd-unit-name> add-disk osd-devices=<osd-device> osd-ids=<osd-id> 
    

Step 5: Verify the Resolution

  • After redeploying the affected OSDs, monitor the cluster’s health status and logs to ensure the issue is resolved:
    ceph health detail
    grep -i "size mismatch" /var/log/ceph/ceph-osd.*.log
    
  • Use the ceph osd df command again to verify that the metadata now correctly matches the disk sizes.

Alert: CephOSDNearFull (osd)

Overview

The “CephOSDNearFull” alert indicates that one or more OSDs in your Ceph cluster are approaching their storage capacity limits. The NEARFULL threshold is a configurable limit intended to warn administrators before the OSDs become entirely full, potentially leading to cluster instability or data write failures. When an OSD approaches its capacity, it can affect the cluster’s ability to balance data and respond to failures.

Troubleshooting

Step 1: Identify Near Full OSD(s)

  • Use the command juju exec --unit ceph-mon/0 -- ceph health detail to get detailed health information about your Ceph cluster. This command will show which OSDs are nearing full capacity.
  • Additionally, run juju exec --unit ceph-mon/0 -- ceph osd df to display the disk usage for all OSDs. Look for OSDs with high utilization percentages.

Step 2: Check Cluster Balancing

  • Ensure the cluster is rebalancing correctly. Sometimes, imbalances can cause certain OSDs to fill up quicker. To address such imbalances, you can run juju exec --unit ceph-mon/0 -- ceph osd reweight-by-utilization.

Step 3: Increase Capacity

  • If possible, add more storage to the affected failure domain (e.g., add more disks or more OSDs). Use the ceph-osd charm to add more OSDs dynamically.
  • If you instead increase the size of existing disks, do so carefully to prevent data loss. In either case, adding disks or OSDs requires careful planning and execution to preserve data integrity and cluster stability.

Step 4: Restore Down or Out OSDs

  • Sometimes, OSDs might be marked down or out. Use juju exec --unit ceph-mon/0 -- ceph osd tree to check the status of all OSDs. If any OSD is down, investigate the cause, and bring it back online.
  • You can bring an OSD back by ensuring it is properly connected and healthy, and then marking it in with juju exec --unit ceph-mon/0 -- ceph osd in <osd.id>, where <osd.id> is the ID of the down OSD.

Step 5: Delete Unwanted Data

  • If increasing capacity is not an option, consider deleting unwanted or unnecessary data (files in CephFS, volumes in an RBD pool, etc). Do this carefully, verifying that the data is indeed no longer needed.

Alert: CephOSDReadErrors (osd)

Overview

The “CephOSDReadErrors” alert signals that an OSD in the Ceph storage system has encountered read errors while attempting to access data stored on its physical storage device. Although the system has managed to recover from these errors by retrying the read operations, the occurrence of such errors is concerning as it might point to underlying issues with the hardware (such as disk failure or corruption) or problems within the kernel that manages disk access.

Troubleshooting

To troubleshoot the CephOSDReadErrors alert, follow this plan:

Step 1: Identify the Affected OSD(s)

  • Use the command juju exec --unit ceph-mon/0 -- ceph osd status to get the status of all OSDs and identify any in an ERR or DOWN state.

Step 2: Check OSD Logs

  • Inspect the logs for the affected OSDs in /var/log/ceph on the machines running the problematic OSDs. Look for errors or warnings related to disk reads. This can be done using commands like grep -i error /var/log/ceph/ceph-osd.<id>.log or grep -i 'read error' /var/log/ceph/ceph-osd.<id>.log where <id> is the OSD ID.

Step 3: Hardware Checks

  • Perform hardware diagnostics to check the health of the disks. This can include using smartctl for S.M.A.R.T. tests (smartctl -t long /dev/<osd-device>) and reviewing the drive’s error counters (smartctl -a /dev/<osd-device>), or fdisk -l to confirm the device is still visible to the kernel.

Step 4: Check for Kernel Messages

  • Look for relevant kernel messages that might indicate hardware issues or file system problems by using dmesg | grep -i error.

Step 5: Check Ceph Health

  • Execute juju exec --unit ceph-mon/0 -- ceph health detail to check the overall health of the Ceph cluster and see if there are any additional warnings or errors that need addressing.

Step 6: Storage Device Performance

  • Test the storage device’s performance to ensure it is within expected parameters. Tools like hdparm can be used to test read speed. For example, you can use hdparm -Tt /dev/<osd-device>.

Step 7: Ceph OSD Reinitialization

  • If the hardware checks out and no other issues are found, consider reinitializing the OSD. This should be done with caution and as a last resort. Use the Ceph Charms to remove and then re-add the OSD to the cluster. This can be done using Juju actions specific to the ceph-osd charm.
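  • A minimal sketch of that flow, assuming the OSD lives on unit ceph-osd/0 and its device is <osd-device> (both placeholders), using the same remove-disk and add-disk actions shown elsewhere in this guide:
    juju run ceph-osd/0 remove-disk osd-ids=<osd-id>
    juju run ceph-osd/0 add-disk osd-devices=<osd-device> osd-ids=<osd-id>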

Alert: CephOSDTimeoutsClusterNetwork (osd)

Overview

The alert CephOSDTimeoutsClusterNetwork indicates that the OSD (Object Storage Daemon) heartbeats within the Ceph cluster’s backend or “cluster” network are experiencing delays. These heartbeat messages are crucial for the Ceph cluster as they are used to ascertain the health and availability of OSDs. If these messages are delayed, it can indicate network latency or packet loss issues within the cluster network that can affect the overall performance and reliability of the Ceph storage system.

This specific alert suggests that there might be underlying network issues affecting the communication between OSDs on the designated cluster network subnet. It is essential to investigate and resolve these network issues to ensure the Ceph cluster operates efficiently and reliably.

Troubleshooting

To address and troubleshoot the CephOSDTimeoutsClusterNetwork alert, follow the comprehensive plan outlined below:

Step 1: Verify the Alert

- First, confirm the alert's validity by running `ceph health detail` using Juju:
    ```shell
    juju exec --unit ceph-mon/0 -- ceph health detail
    ```
- Look for messages regarding OSD heartbeat delays to identify the affected OSDs.

Step 2: Check Network Configuration

- Ensure that the network configuration for the Ceph cluster network is correctly set up and that the subnet designated for the cluster network is not experiencing congestion or is misconfigured.
- Use `juju config` to review the network settings of the Ceph charms:
    ```shell
    juju config ceph-osd
    juju config ceph-mon
    ```

Step 3: Investigate Network Latency

- Utilize network diagnostic tools such as `ping` or `traceroute` from one of the Ceph nodes to check for latency or packet loss within the cluster network:
    ```shell
    juju exec --unit ceph-osd/0 -- ping <CEPH_OSD_IP>
    juju exec --unit ceph-osd/0 -- traceroute <CEPH_OSD_IP>
    ```
- Replace `<CEPH_OSD_IP>` with the IP address of another Ceph OSD node within the cluster network to test connectivity and latency.

Step 4: Examine OSD Logs

- Inspect the logs of the affected OSDs for any errors or warnings that might indicate network issues. This can be done by accessing the log files located at `/var/log/ceph` on the OSD nodes:
    ```shell
    juju ssh ceph-osd/0
    sudo cat /var/log/ceph/ceph-osd.<OSD_ID>.log | grep -i 'heartbeat'
    ```
- Replace `<OSD_ID>` with the ID of the affected OSD to filter for relevant log messages.

Step 5: Review Ceph Network Performance

- Use the `ceph osd perf` command to review the performance of each OSD, which can help identify any OSDs experiencing higher than average latency:
    ```shell
    juju exec --unit ceph-mon/0 -- ceph osd perf
    ```

Step 6: Assess Physical Network Components

- If the issue persists, it might be related to physical network hardware such as switches, routers, or network interfaces on the Ceph nodes. Check the physical connections, and review the switch and router logs for any signs of issues.

Step 7: Engage Network Team

- If the problem is confirmed to be a network issue beyond the immediate Ceph configuration or if physical hardware issues are suspected, engaging your organization's network team for further investigation and resolution is advisable.

Step 8: Apply Fixes and Monitor

- After applying any configuration changes or fixes, continuously monitor the cluster's health using `ceph health` and check if the OSD heartbeat delays are resolved.

By systematically following these steps, you should be able to diagnose and resolve the network issues causing OSD heartbeat delays in the Ceph cluster network.

Alert: CephOSDTimeoutsPublicNetwork (osd)

Overview

The “CephOSDTimeoutsPublicNetwork” alert indicates that the OSD heartbeats within the Ceph cluster are experiencing delays over the public network interface. These heartbeats are critical for maintaining the health and synchronization of the cluster, as they signal the operational status of each OSD to the rest of the cluster. Slow or delayed heartbeats can be symptomatic of underlying network issues such as high latency or packet loss.

Troubleshooting

Follow the steps below to troubleshoot the CephOSDTimeoutsPublicNetwork (osd) alert:

Step 1: Verify Current Ceph Health Status

  • Begin by checking the current health status of your Ceph cluster to identify which OSDs are affected and to confirm the presence of network-related issues.
    juju exec --unit ceph-mon/0 -- ceph health detail
    
  • This command will provide detailed health information about your Ceph cluster, including which OSDs are experiencing heartbeat delays.

Step 2: Check Network Latency and Packet Loss

  • Use the ping command from different nodes in the cluster to check for network latency and packet loss to the affected OSDs.
    ping <OSD_IP_ADDRESS>
    
  • Consider using tools such as iperf or mtr to perform more comprehensive network diagnostics.
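  • As a sketch (assuming iperf3 is installed on both hosts), start a server on one node and point a client at it to measure throughput and spot packet loss:
    iperf3 -s                    # on the remote node
    iperf3 -c <OSD_IP_ADDRESS>   # on the local node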

Step 3: Review Ceph OSD Logs

  • Inspect the logs for the affected OSDs to look for any errors or warnings related to networking.
    tail -n 100 /var/log/ceph/ceph-osd.<OSD_ID>.log
    
  • This step can help identify specific issues that may not be apparent from the Ceph health status alone.

Step 4: Check for Interface Configuration Issues

  • Ensure that the network interfaces for the public network on the OSD nodes are correctly configured and operational.
    juju exec --unit ceph-osd/0 -- ip addr show
    
  • Verify the network configuration settings such as IP addresses, netmasks, and gateways to ensure they are correctly assigned and consistent across the cluster.

Step 5: Evaluate Physical Network Infrastructure

  • If the above steps indicate network issues, inspect the physical network infrastructure (e.g., switches, routers, cables) for faults or misconfigurations that could be causing network delays or packet loss.

Step 6: Ceph OSD Reconfiguration (if needed)

  • Based on the findings from the above steps, it may be necessary to reconfigure the Ceph OSDs to use an alternative public network interface or to adjust network settings.
    • Use Juju to reconfigure the Ceph charms with updated network settings if required.

Alert: CephOSDTooManyRepairs (osd)

Overview

The “CephOSDTooManyRepairs” alert indicates that an OSD within the Ceph cluster is consistently encountering read errors to the point where it has to rely on secondary placement groups (PGs) to fetch data. This behavior is a strong indicator of a potentially failing physical drive. Ceph’s design allows for data redundancy across multiple OSDs; however, frequent access to secondary PGs for data retrieval is not optimal and can significantly impact the performance and reliability of the storage cluster.

Troubleshooting

To troubleshoot the CephOSDTooManyRepairs alert, follow this comprehensive plan:

Step 1: Identify Affected OSDs

  • Use the Juju command to check the cluster health and identify the affected OSDs:
    juju exec --unit ceph-mon/0 -- 'ceph health detail'
    
  • Note the IDs of OSDs marked as down or having a high number of repairs.

Step 2: Check OSD Logs

  • Inspect the Ceph OSD logs for any error messages related to read failures or hardware issues. The logs can be found at /var/log/ceph/ on the node hosting the problematic OSD.
    cat /var/log/ceph/ceph-osd.<osd-id>.log | grep -i error
    

Step 3: Examine Hardware Health

  • On the server hosting the affected OSD, run SMART tests to check the health of the physical drive:
    smartctl -a /dev/<osd-device>
    
    Replace /dev/<osd-device> with the appropriate device identifier for the drive in question.

Step 4: Check Ceph Health Detail

  • Get detailed health information from the Ceph cluster:
    juju exec --unit ceph-mon/0 -- 'ceph health detail'
    
  • Look for messages related to the OSDs identified in step 1.

Step 5: Repair OSD

  • If a specific OSD is identified as problematic, attempt to repair it using Ceph’s repair utilities.
    • Mark the OSD out (temporarily remove it from the cluster to redistribute its data):
      juju exec --unit ceph-mon/0 -- 'ceph osd out osd.<osd-id>'
      
    • After confirming that the data has been redistributed, you can instruct the OSD to repair its placement groups:
      juju exec --unit ceph-mon/0 -- 'ceph osd repair osd.<osd-id>'
      

Step 6: Monitor the OSD Rebalance

  • Keep an eye on the cluster rebalancing after marking the OSD out:
    juju ssh ceph-mon/0 -- 'sudo ceph -w'
    
  • Ensure that the cluster returns to a healthy state with no data distribution issues.

Step 7: Replace Hardware if Necessary

  • If hardware tests indicate a failing drive, plan to replace the drive as soon as possible. After replacing the drive, add the new OSD to the cluster, while maintaining the OSD id:
    juju run ceph-osd/0 remove-disk osd-ids=<osd-id>
    juju run ceph-osd/0 add-disk osd-devices=<new-device> osd-ids=<osd-id>
    

Step 8: Verify Cluster Health

  • Finally, verify that the Ceph cluster is healthy:
    juju exec --unit ceph-mon/0 -- 'ceph health'
    

Alert: CephPGBackfillAtRisk

Overview

The “CephPGBackfillAtRisk” alert indicates a critical condition within a Ceph cluster. This alert signifies that the cluster is unable to perform backfill operations, which are crucial for maintaining data redundancy and availability. This blockage is due to the cluster reaching or exceeding the ‘backfillfull’ threshold on one or more OSDs. This condition can be triggered by various factors, including unexpected data growth, failure to add new storage in a timely manner, or inefficiencies in data distribution across the cluster.

Troubleshooting

The troubleshooting process for a CephPGBackfillAtRisk alert involves several diagnostic steps and corrective actions:

Step 1: Verify Cluster Health and OSD Status

  • Use the juju exec command to run ceph status on one of the MON nodes to get an overview of the cluster’s health and to identify which OSDs are affected.
    juju exec --unit ceph-mon/0 -- ceph status
    
  • List all OSDs and their statuses with:
    juju exec --unit ceph-mon/0 -- ceph osd tree
    

Step 2: Check OSD Utilization

  • Check the detailed utilization of all OSDs to pinpoint those reaching or exceeding the backfillfull threshold:
    juju exec --unit ceph-mon/0 -- ceph osd df
    

Step 3: Inspect OSD Logs for Errors

  • Check the OSD logs on affected nodes for any warnings or errors that might indicate why the OSD has reached the backfillfull threshold. Logs are located in /var/log/ceph.
    less /var/log/ceph/ceph-osd.<osd-id>.log
    

Replace <osd-id> with the ID of the OSD you wish to inspect.

Step 4: Review Cluster Capacity and Plan for Expansion

  • Evaluate the overall capacity and usage of the cluster to determine if additional storage needs to be added or if data can be deleted to free up space.
  • Consider using the ceph osd pool set-quota command to cap the size (max_bytes) or object count (max_objects) of specific pools to better manage capacity.

Step 5: Add More Storage

  • If additional capacity is required, plan to add more OSDs to the cluster. Utilize the Ceph Charms to add new storage resources efficiently.
    • To add new OSDs, you can use the ceph-osd charm and relate it to the existing cluster.
      juju deploy ceph-osd --config osd-devices=<device list>
      juju relate ceph-osd ceph-mon
      
    • Ensure you replace <device list> with the actual devices you’re adding.

Step 6: Purge Unwanted Data

  • If adding more storage is not feasible, identify and delete unnecessary data or snapshots taking up space. Use the rados and rbd tools to manage data within pools and images, respectively.
    juju exec --unit ceph-mon/0 -- rados df
    juju exec --unit ceph-mon/0 -- rbd du
    

Step 7: Monitor Backfill Progress

  • After taking corrective action, monitor the cluster’s backfill progress to ensure that the issue is resolved.
    juju ssh ceph-mon/0 -- 'sudo ceph -w'
    

Alert: CephPGImbalance (osd)

Overview

The “CephPGImbalance” alert indicates an imbalance in the distribution of Placement Groups (PGs) across the OSDs within a Ceph cluster. Specifically, it means that one or more OSDs have a PG count that deviates by more than 30% from the average PG count across all OSDs. This imbalance can lead to inefficient distribution of workload and may affect the performance and reliability of the Ceph storage cluster.

Such an imbalance can occur for various reasons, including:

  • Changes to the cluster, such as adding or removing OSDs.
  • Inadequate initial PG count configuration.
  • Unequal storage capacity across OSDs.

Troubleshooting

The plan to troubleshoot a CephPGImbalance alert involves several steps to diagnose and correct the imbalance of PGs across OSDs. Follow these steps systematically to identify and resolve the issue:

Step 1: Identify Affected OSD(s)

- Execute the following command to list PG distribution across OSDs: `juju exec --unit ceph-mon/leader "ceph osd df tree"`

Step 2: Review Cluster Configuration

- Inspect the Ceph configuration for any potential misconfigurations that might lead to imbalance: `juju exec --unit ceph-mon/leader "ceph osd pool autoscale-status"`
- Ensure that the autoscaling parameters are correct for all relevant pools.

Step 3: Analyze Historical Changes

- Review the history of OSD additions or removals that might have contributed to the imbalance: Inspect log files in `/var/log/ceph` for relevant entries.
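- As a sketch (assuming the cluster log is available at `/var/log/ceph/ceph.log` on the monitor unit; the grep pattern is illustrative), recent OSD state changes can be surfaced with: `juju exec --unit ceph-mon/leader -- grep -iE 'osd\.[0-9]+.*(boot|marked down|marked out)' /var/log/ceph/ceph.log`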

Step 4: Adjust OSD Weights

- If specific OSDs are overloaded, manually adjust their weights: `juju exec --unit ceph-mon/leader "ceph osd reweight <osd-id> <new-weight>"`

Step 5: Rebalance PGs

- If necessary, trigger a manual rebalance of PGs across OSDs: `juju exec --unit ceph-mon/leader "ceph osd reweight-by-pg"`

Step 6: Monitor the Rebalancing Process

- Continuously monitor the rebalancing process and check the cluster's health: `juju exec --unit ceph-mon/leader "ceph health detail"`

Step 7: Verify Resolution

- Once the rebalancing process completes, verify that the PG distribution is now balanced: `juju exec --unit ceph-mon/leader "ceph osd df tree"`

Alert: CephPGNotDeepScrubbed (pgs)

Overview

The “CephPGNotDeepScrubbed” alert signifies that one or more Placement Groups (PGs) in your Ceph cluster have not undergone a deep scrubbing process within the expected time frame. Deep scrubbing is a crucial maintenance operation designed to safeguard against bit rot and data corruption. Failing to perform deep scrubs as scheduled can be symptomatic of issues such as an inadequately configured scrubbing interval or PGs being in an unclean state, preventing the deep scrub from initiating.

Troubleshooting

To address the “CephPGNotDeepScrubbed” alert, follow this comprehensive troubleshooting plan:

Step 1: Verify Cluster Health

First, assess the overall health of your Charmed Ceph cluster to ensure there are no broader issues affecting its operations.

  • Execute: juju exec --unit ceph-mon/0 -- ceph health detail. This will provide detailed information about the cluster’s health, including any warnings or errors that might be impacting the deep scrub operations.

Step 2: Check Scrubbing Intervals

Analyze the configuration for the scrubbing intervals to ensure they are set appropriately.

  • Execute: juju exec --unit ceph-mon/0 -- ceph tell osd.* config show and look for the osd_deep_scrub_interval value, which determines the frequency of deep scrubbing.

Step 3: Review PG States

Ensure that the PGs identified are in a ‘clean’ state, which is required for deep scrubbing to proceed.

  • Execute: juju exec --unit ceph-mon/0 -- ceph pg stat to view the current states of all PGs. PGs must be in a clean state to be eligible for deep scrubbing.

Step 4: Manually Trigger Deep Scrub

For PGs identified as not being deep scrubbed, you can manually trigger a deep scrub.

  • Execute: juju exec --unit ceph-mon/0 -- ceph pg deep-scrub {pg-id}. Replace {pg-id} with the actual ID of the PG you intend to deep scrub. This should be done cautiously and preferably during off-peak hours to minimize impact on the cluster’s performance.

Step 5: Inspect Logs

  • Review the Ceph logs for any warnings or errors related to deep scrub operations. Focus on logs located in /var/log/ceph on the Ceph OSD units.
  • Look for messages indicating issues with deep scrubbing, such as failures to initiate a deep scrub or errors during the scrubbing process.

Step 6: Adjust Configuration If Needed

If the scrubbing intervals are identified as being set too narrowly, causing PGs to miss their deep scrub windows, consider adjusting these intervals.

  • Utilize the juju config command to adjust the osd_deep_scrub_interval setting for the Ceph OSDs accordingly.
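  • If the charm does not expose this setting directly, the interval can also be raised cluster-wide at the Ceph level (a sketch, assuming a two-week interval of 1209600 seconds): juju exec --unit ceph-mon/0 -- ceph config set osd osd_deep_scrub_interval 1209600.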

Step 7: Monitor for Improvement

After making configuration changes or manually triggering deep scrubs, continuously monitor the cluster’s health and the status of the PGs to ensure that they are being deep scrubbed as expected.

Alert: CephPGNotScrubbed (pgs)

Overview

The “CephPGNotScrubbed” alert indicates that one or more Placement Groups (PGs) within the Ceph cluster have not undergone the scheduled scrubbing process recently. Scrubbing is a critical maintenance task that checks the integrity of metadata and ensures consistency across data replicas. Failure to scrub PGs as scheduled can result from an overly narrow scrub window or from PGs not being in a ‘clean’ state when they are due to be scrubbed. Not addressing this alert can lead to undetected data corruption, risking the overall health and data integrity of the Ceph cluster.

Troubleshooting

Follow the steps below to diagnose and resolve issues related to the “CephPGNotScrubbed” alert:

Step 1: Verify PGs Status

juju exec --unit ceph-mon/leader -- ceph pg stat

Look for PGs marked as stale or not in a clean state.

Step 2: Check Scrubbing Configuration

juju exec --unit ceph-mon/leader -- ceph tell osd.* config show

Look for the relevant values as described here: https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#scrubbing

Step 3: Manually Initiate Scrubbing

For PGs that have not been scrubbed recently and are in a clean state, manually initiate a scrub.

juju exec --unit ceph-mon/0 -- ceph pg scrub <pgid>

Replace <pgid> with the actual ID of the Placement Group. Repeat for each PG that requires scrubbing.

Step 4: Monitor Scrubbing Progress

juju ssh ceph-mon/0 -- 'sudo ceph -w'

This command provides real-time updates on cluster activities, including scrubbing processes.

Step 5: Review Logs for Errors

Check the Ceph logs for any errors related to the scrubbing process.

grep -i scrub /var/log/ceph/*.log

This step helps identify any underlying issues that could be preventing successful scrubbing.

Step 6: Adjust Scrubbing Windows if Necessary

juju config ceph-osd osd_scrub_min_interval=<value>
juju config ceph-osd osd_scrub_max_interval=<value>

Replace <value> with the desired time in seconds.
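
If the ceph-osd charm in your deployment does not expose these options, the same intervals can be set at the Ceph level instead (a sketch; <value> is again a time in seconds):

juju exec --unit ceph-mon/leader -- ceph config set osd osd_scrub_min_interval <value>
juju exec --unit ceph-mon/leader -- ceph config set osd osd_scrub_max_interval <value>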

Step 7: Verify Cluster Health

juju exec --unit ceph-mon/leader -- ceph health detail

Ensure there are no outstanding issues related to PG scrubbing.

Alert: CephPGRecoveryAtRisk (pgs)

Overview

The “CephPGRecoveryAtRisk” alert indicates a critical situation where data redundancy within the Ceph cluster is compromised due to high storage utilization on one or more OSDs. Any OSD that hits or surpasses the ‘full’ threshold is unable to participate in data recovery processes, which are essential for maintaining data redundancy and integrity.

Troubleshooting

Step 1: Verify OSD Storage Utilization

  • Use the juju exec command to run ceph osd df on a mon node to list OSDs and their storage utilization.
juju exec --unit ceph-mon/0 -- ceph osd df

This command provides a detailed view of each OSD’s storage utilization, helping identify which OSDs are at or near the ‘full’ threshold.

Step 2: Check Cluster Health

  • Run the following command to check the overall health of the Ceph cluster:
juju exec --unit ceph-mon/0 -- ceph health detail

This will provide insights into the health status and highlight any issues beyond OSD fullness that may need addressing.

Step 3: Review OSD Logs

  • Inspect the OSD logs located in /var/log/ceph on the OSD nodes for any errors or warnings that might indicate issues affecting storage utilization or performance.
juju exec --application ceph-osd -- tail -n 100 /var/log/ceph/ceph-osd.*.log

This command tails the last 100 lines of each OSD log file, which can be useful for identifying recent errors or warnings.

Step 4: Add More Storage

  • If the cluster is genuinely running low on storage, consider adding more OSDs to the cluster to increase capacity. Use the Ceph charms (ceph-osd) to add more storage by deploying new OSD units or adding disks to existing ones.

Step 5: Restore Down/Out OSDs

  • Identify any OSDs that are down and attempt to bring them back online. Use the following command to list down OSDs:
juju exec --unit ceph-mon/0 -- ceph osd tree | grep down
  • Investigate why these OSDs are down by reviewing their logs and system health. Attempt to resolve any issues and bring them back into the cluster.

Step 6: Purge Unnecessary Data

  • If adding more storage is not an immediate option, identify and remove unnecessary data from the cluster to free up space. This may involve deleting old snapshots, reducing the number of replicas for non-critical data, or cleaning up any test data.

Step 7: Adjust Full Ratios

  • As a temporary measure, consider adjusting the osd full and nearfull ratios to provide breathing space for recovery actions. This should be done cautiously and only as a temporary measure until additional storage can be added or data can be cleaned up.
juju exec --unit ceph-mon/0 -- ceph osd set-nearfull-ratio <ratio>
juju exec --unit ceph-mon/0 -- ceph osd set-full-ratio <ratio>

Replace <ratio> with the desired threshold values.

Step 8: Further troubleshooting

- For additional information, please see: https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/

Alert: CephPGsDamaged (pgs)

Overview

The alert “CephPGsDamaged” indicates that during the data consistency checks, commonly known as a scrub, one or more Placement Groups (PGs) have been identified as damaged or inconsistent. This is a critical alert because PGs are fundamental to Ceph’s data distribution and redundancy model. Damaged or inconsistent PGs can lead to data loss or corruption, impacting the overall health and performance of the Ceph cluster.

Troubleshooting

Step 1: Identify Affected Placement Groups

Before attempting any repair, you need to identify which PGs are affected. Use the rados list-inconsistent-pg <pool> command. To run this command on all nodes, you can use Juju’s juju exec command to execute it cluster-wide.

juju exec --unit ceph-mon/leader -- rados list-inconsistent-pg <pool>

Replace <pool> with the name of the pool you want to check. If the pool name is unknown, you can list all pools by running:

juju exec --unit ceph-mon/leader -- ceph osd lspools

Step 2: Inspect Logs for Additional Insights

Before proceeding with repairs, it’s beneficial to gather more context on the nature of the damage. Check the Ceph logs for any warnings, errors, or anomalies related to the identified PGs.

Logs can be found in /var/log/ceph on the Ceph OSD nodes. Use Juju to access a specific OSD node and inspect the logs. For example:

juju ssh ceph-osd/0

Once logged in, use commands like grep, less, or tail to search through the log files for relevant entries.
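
For example (a sketch; the grep pattern is illustrative), scrub errors are typically recorded by the primary OSD, and the inconsistent objects within a damaged PG can be listed from a monitor unit:

sudo grep -i 'scrub' /var/log/ceph/ceph-osd.*.log
juju exec --unit ceph-mon/leader -- rados list-inconsistent-obj <pg_num> --format=json-pretty

Replace <pg_num> with the ID of an affected PG from step 1.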

Step 3: Repair the Affected Placement Groups

After identifying the damaged PGs and understanding the extent of the damage, proceed with the repair. Use the ceph pg repair <pg_num> command for this. As this operation can impact data availability, ensure you understand the potential consequences.

To run the repair command, use:

juju exec --unit ceph-mon/leader -- ceph pg repair <pg_num>

Replace <pg_num> with the actual PG number identified in step 1.

Step 4: Verify the Repair

After the repair, it’s critical to verify that the PGs are now healthy and that the cluster is stable. Use the ceph health command to check the overall health status of the cluster.

juju exec --unit ceph-mon/leader -- ceph health

If the repair was successful, the cluster status should return to HEALTH_OK. Otherwise, further investigation might be needed.

Step 5: Monitor the Cluster

Continue to monitor the cluster closely for a while after the repair. Check for any new alerts or unusual behavior. It’s also a good idea to perform a full cluster check using:

juju exec --unit ceph-mon/leader -- ceph health detail

Additional Considerations

  • Ensure that there are recent backups of critical data before attempting any repair operations.
  • Consult the Ceph documentation and community for guidance on handling severe or complex issues.
  • Consider engaging with professional support if the issue persists or if there is uncertainty about the repair process.

Alert: CephPGsHighPerOSD (pgs)

Overview

The “CephPGsHighPerOSD” alert indicates that the number of placement groups (PGs) per OSD in your Ceph cluster exceeds the recommended limit set by the mon_max_pg_per_osd setting. Ceph uses placement groups to manage object storage and distribute data across the OSDs in the cluster evenly. However, having too many PGs per OSD can lead to performance issues and increased resource consumption.

The alert suggests that the cause might be due to the pg_autoscaler being disabled for one or more pools, or the autoscaler settings (such as target_size_ratio or pg_num) not being appropriately configured. The pg_autoscaler is a Ceph feature that automatically adjusts the number of PGs based on the pool’s size and the overall cluster capacity to optimize performance and resource usage.

Troubleshooting

Follow the steps below to troubleshoot and resolve the issue of having too many placement groups per OSD:

Step 1: Verify pg_autoscaler Status

  • Use the juju exec command to run ceph osd pool autoscale-status on one of your Ceph MON nodes to check if the pg_autoscaler is enabled and functioning correctly for all pools.
juju exec --unit ceph-mon/0 -- ceph osd pool autoscale-status

Step 2: Check for Disabled pg_autoscaler

  • If the output from the above command indicates that the pg_autoscaler has been disabled for any pool, re-enable it using the Ceph commands through juju exec or juju run against the appropriate Ceph MON or OSD unit.
juju exec --unit ceph-mon/0 -- ceph osd pool set <pool_name> pg_autoscale_mode on

Step 3: Adjust Target Size Ratio

  • If specific pools are consuming a disproportionate amount of resources, adjust their target_size_ratio to guide the autoscaler. This can be done with the following command:
juju exec --unit ceph-mon/0 -- ceph osd pool set <pool_name> target_size_ratio .1

Replace <pool_name> with the name of your pool, and .1 with the appropriate ratio based on the pool’s expected relative size.

Step 4: Change pg_autoscaler Mode to ‘warn’

  • If you prefer to review changes before applying them manually, set the pg_autoscaler mode to ‘warn’ so that it only reports recommended adjustments instead of making them automatically:
juju exec --unit ceph-mon/0 -- ceph osd pool set <pool_name> pg_autoscale_mode warn

Step 5: Adjust pg_num Appropriately

  • Manually adjust pg_num for one or more pools if you believe the autoscaler’s decisions are not optimal. This is a more advanced operation and should be done with caution:
juju exec --unit ceph-mon/0 -- ceph osd pool set <pool_name> pg_num <new_pg_num>

Replace <new_pg_num> with the desired number of placement groups. This action should be performed carefully, as incorrect settings can adversely affect cluster performance and stability.

Step 6: Review Ceph Logs

  • Inspect the Ceph logs located in /var/log/ceph on the Ceph MON and OSD nodes for any warnings or errors related to PGs and OSDs.
juju exec --unit ceph-mon/0 -- cat /var/log/ceph/ceph.log
juju exec --unit ceph-osd/0 -- cat /var/log/ceph/ceph-osd.*.log

Step 7: Additional strategies

- If everything else fails, refer to: https://docs.ceph.com/en/quincy/rados/operations/placement-groups/ for additional ways to fine tune PG distribution and usage.

Alert: CephPGsInactive (pgs)

Overview

The CephPGsInactive alert indicates that one or more placement groups (PGs) within the Ceph storage cluster are currently inactive. When a PG becomes inactive, it is unable to serve any read or write requests. This situation can severely affect the performance and availability of data within the affected pool(s). An inactive status for more than 5 minutes suggests a significant issue within the cluster that requires immediate attention to restore functionality and ensure data availability and integrity.

Troubleshooting

Step 1: Check Cluster Health

  • Firstly, assess the overall health of the Ceph cluster using the Charmed Ceph tools:

    juju exec --unit ceph-mon/0 -- ceph health detail
    
  • This command will provide a detailed health report of the cluster, highlighting any potential issues including those related to inactive PGs.

Step 2: Identify Inactive PGs

  • To specifically identify the inactive PGs, use the following command:

    juju exec --unit ceph-mon/0 -- ceph pg dump_stuck inactive
    
  • This command lists all the PGs that are currently inactive, which can help in pinpointing the specific pools affected.

Step 3: Check for OSD Issues

  • Inactive PGs can often be due to OSD (Object Storage Daemon) issues. Execute the following to check the status of OSDs:

    juju exec --unit ceph-mon/0 -- ceph osd status
    
  • Look for any OSDs that are down or in an unhealthy state, as these can directly impact PG availability.

Step 4: Review Ceph Logs

  • Analyzing log files can provide insights into why PGs are inactive. Focus on logs related to OSDs and monitors (MONs):

    • For OSD logs:

      cat /var/log/ceph/ceph-osd.{osd-id}.log
      
    • For MON logs:

      cat /var/log/ceph/ceph-mon.{mon-id}.log
      
  • Replace {osd-id} and {mon-id} with the actual IDs of the daemons you’re investigating.

Step 5: Check for Pool Configuration Issues

  • Misconfigurations at the pool level can also lead to inactive PGs. Use the following command to inspect the configuration of the affected pool(s):

    juju exec --unit ceph-mon/0 -- ceph osd pool get {pool-name} all
    
  • Replace {pool-name} with the name of the pool that contains the inactive PGs. Look for any configurations that might be affecting the PGs’ activity status.

Step 6: Engage with the Charmed Ceph Community

  • If the issue persists after performing the above steps, engaging with the Charmed Ceph community or seeking professional support may be beneficial. Sharing specific logs and error messages can help in diagnosing the problem more effectively.

Alert: CephPGsUnclean (pgs)

Overview

The “CephPGsUnclean” alert indicates a critical issue related to Placement Groups (PGs) being marked as “unclean”. When a PG is marked as “unclean,” it means that the data within that PG has not been successfully replicated according to the policy defined for the pool it belongs to. This can occur due to a variety of reasons such as hardware failure, network issues, or misconfiguration, which prevents the cluster from achieving data replication or recovery tasks successfully.

The alert fires when PGs have been unclean for more than 15 minutes, indicating a persistent issue that the cluster has not resolved on its own, as it typically would for transient problems. This situation requires immediate attention as it can lead to data loss or unavailability if not addressed promptly.

Troubleshooting

Step 1: Identify the Affected PGs and Pools

To begin troubleshooting, identify which PGs are marked as unclean and the pools they belong to.

  1. Use the juju exec command to run ceph pg stat on one of the ceph-mon units to get an overview of the placement group status.

    juju exec --unit ceph-mon/0 -- ceph pg stat
    
  2. For a more detailed view, list unclean PGs specifically:

    juju exec --unit ceph-mon/0 -- ceph pg dump_stuck unclean
    

Step 2: Check Cluster Health and OSD Status

Understanding the overall health of the cluster and the status of OSDs (Object Storage Daemons) is crucial for diagnosing issues with unclean PGs.

  1. Check the overall cluster health:

    juju exec --unit ceph-mon/0 -- ceph health detail
    
  2. List all OSDs and their status:

    juju exec --unit ceph-mon/0 -- ceph osd tree
    

Step 3: Inspect Logs for Errors

Errors or warnings in the Ceph logs can provide further insight into the issue.

  1. Inspect the Ceph MON and OSD logs located in /var/log/ceph on the corresponding units for any error messages or warnings related to the unclean PGs.

    cat /var/log/ceph/ceph-mon.<hostname>.log
    cat /var/log/ceph/ceph-osd.<id>.log
    

Step 4: Manual Intervention and Recovery

If the issue persists after verifying the cluster’s health and inspecting logs, manual intervention might be necessary.

  • If specific OSDs are down or out, consider marking them back in if they are healthy:

    juju exec --unit ceph-mon/0 -- ceph osd in <osd.id>
    

Step 5: Monitor Recovery Progress

After taking corrective actions, continuously monitor the recovery process and cluster health.

  1. Watch the recovery and backfilling process:

    juju ssh ceph-mon/leader -- sudo ceph -w
    

Alert: CephPGUnavilableBlockingIO (pgs)

Overview

The “CephPGUnavailableBlockingIO” alert is indicative of a critical issue within the Ceph storage cluster. It signifies that one or more Placement Groups (PGs) are in a state which prevents them from servicing I/O operations. This state can be caused by various issues such as OSD failures, network disruptions, misconfigurations, or hardware malfunctions, leading to reduced data availability and potentially impacting the cluster’s overall performance and reliability.

Troubleshooting

The troubleshooting plan for resolving the “CephPGUnavailableBlockingIO” alert involves several diagnostic steps and remedial actions. Please follow these steps carefully:

Step 1: Identify Unavailable PGs

``` bash
juju exec --unit ceph-mon/leader -- ceph pg stat
```

Look for PGs that are in states like `down`, `incomplete`, `stale`, or `peering`.

Step 2: Check Cluster Health

``` bash
juju exec --unit ceph-mon/leader -- ceph health detail
```

This command will provide detailed health information, which can offer clues about the root cause.

Step 3: Review OSD Status

The issue might be related to OSDs. Check the status of the OSDs to identify any that are down or in an error state, alongside the Ceph health output from the previous step:

``` bash
juju exec --unit ceph-mon/leader -- ceph osd tree
```

Step 4: Inspect Logs

Check the Ceph logs for any errors or warnings that might indicate what caused the PGs to become unavailable. Focus on logs around the time the issue was first observed.
- OSD Logs: /var/log/ceph/ceph-osd.<id>.log
- Monitor Logs: /var/log/ceph/ceph-mon.<hostname>.log
Make sure to replace <id> and <hostname> with the actual OSD IDs and monitor hostnames.
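
For example (a sketch; the grep pattern is illustrative and the placeholders follow the paths above):

``` bash
juju exec --unit ceph-osd/0 -- grep -iE 'error|slow|stuck' /var/log/ceph/ceph-osd.<id>.log
```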

Step 5: PG Query

``` bash
juju exec --unit ceph-mon/leader -- ceph pg <pgid> query
```

Replace `<pgid>` with the ID of one of the problematic PGs. This will provide insights into why the PG is unavailable.

Step 6: Recover Down OSDs

``` bash
juju exec --unit ceph-osd/<unit-number> -- systemctl restart ceph-osd@<osd-id>.service
```

Replace `<unit-number>` with the number of the ceph-osd unit hosting the daemon and `<osd-id>` with the OSD's ID.

Step 7: Ensure Data Redundancy and Rebalance

Once the immediate issues are addressed, make sure that data redundancy is restored and the cluster is rebalanced. This might involve waiting for Ceph to automatically recover and rebalance PGs, which can be monitored with:

``` bash
juju ssh ceph-mon/leader -- sudo ceph -w
```

Alert: CephPoolBackfillFull (pools)

Overview

The alert CephPoolBackfillFull indicates that the free space available in a Ceph storage pool is critically low. This situation hampers the system’s ability to perform recovery or backfill operations, which are essential for data redundancy and integrity. When a pool’s utilization crosses the configured threshold, Ceph raises this alert to notify administrators that the pool’s capacity needs expansion to maintain healthy system operation and ensure data is not at risk.

Troubleshooting

To effectively troubleshoot and resolve the CephPoolBackfillFull alert, follow the steps outlined below:

Step 1: Verify Pool Capacity and Utilization

  • Use Juju to check the current utilization and capacity of your pools.

    juju exec --unit ceph-mon/leader -- ceph df detail
    
  • This will give you an overview of each pool’s usage and help identify which pool(s) are nearing or at the full threshold.

Step 2: Examine Cluster Health

  • To get a broader view of the cluster’s health and identify any other underlying issues, execute:

    juju exec --unit ceph-mon/leader "ceph health detail"
    
  • Look for any warnings or errors that might indicate issues beyond pool capacity, such as OSDs down.

Step 3: Check OSD Status

  • Verify the status of your OSDs to ensure they are all up and in.

    juju exec --unit ceph-mon/leader "ceph osd stat"
    
  • An OSD being down could lower the available capacity.

Step 4: Inspect Logs for Additional Insights

  • Review the Ceph logs for any errors or warnings that could indicate issues related to the backfill/full situation.

    less /var/log/ceph/ceph.log
    
  • Pay special attention to any messages related to OSDs or pools nearing full capacity.

Step 5: Expand Pool Capacity

  • If the investigation confirms that the pool is indeed near or at capacity, consider adding more OSDs to the cluster to increase storage capacity.

  • This will likely involve provisioning new hardware or adding disks and then using the ceph-osd charm to integrate the new capacity into the cluster.

Alert: CephPoolFull (pools)

Overview

The “CephPoolFull” alert indicates that a storage pool within the Charmed Ceph cluster has reached its maximum capacity, either by hitting its predefined quota or because the OSDs supporting the pool have reached their full threshold. When this happens, the system prevents any further write operations to the affected pool to avoid data loss or corruption. This is a critical alert that requires immediate attention to prevent service disruptions and potential data loss scenarios. The alert also provides a breakdown of the top 5 pools by usage, highlighting the ones that are most affected and need urgent intervention.

Troubleshooting

To troubleshoot and resolve the “CephPoolFull” alert, follow this comprehensive plan:

Step 1: Identify the Full Pool

```bash
juju exec --unit ceph-mon/leader -- ceph df
```

This command will provide a summary of the pool usage, including which pools are nearing or at capacity.

Step 2: Check OSD Status

```bash
juju exec --unit ceph-mon/leader -- ceph osd df
```

Look for any OSDs with high usage. This indicates that the physical storage backing these OSDs is at capacity.

Step 3: Review Pool Quotas

```bash
juju exec --unit ceph-mon/leader -- ceph osd pool get-quota <pool_name>
```

Replace `<pool_name>` with the name of the pool identified in step 1. This will show if the pool has a set quota and if it has been exceeded.

Step 4: Inspect Log Files

Inspect the Ceph log files located in /var/log/ceph for any error messages or warnings related to full pools or OSDs. This can give additional context to the issue.
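
For example (a sketch; the grep pattern is illustrative):

```bash
juju exec --unit ceph-mon/leader -- grep -iE 'full|nearfull' /var/log/ceph/ceph.log
```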

Step 5: Increasing Pool Quota

- If the pool has hit its quota, consider increasing the quota to allow more data. This is done via:

```bash
juju exec --unit ceph-mon/leader -- ceph osd pool set-quota <pool_name> max_bytes <new_bytes>
```

Replace `<pool_name>` with the name of the affected pool and `<new_bytes>` with the new quota in bytes.

Step 6: Adding More Capacity

- If the overall cluster is running out of space, consider adding more OSDs to increase capacity. This involves deploying more `ceph-osd` units via Juju and ensuring they are correctly added to the cluster.

```bash
juju add-unit ceph-osd --to <node>
```

Replace `<node>` with the target node where you wish to add the new OSD. Make sure the new OSDs are properly integrated into the cluster and that rebalancing occurs.

Step 7: Monitor Cluster Health

```bash
juju exec --unit ceph-mon/leader -- ceph health detail
```

This command provides detailed health information about the Ceph cluster, including the status of the pools and OSDs.

Alert: CephPoolGrowthWarning (pools)

Overview

The CephPoolGrowthWarning alert indicates that a specific storage pool within the Charmed Ceph cluster is at risk of exceeding its storage capacity within the next five days, based on the growth rate observed over the past 48 hours. This could potentially lead to a scenario where new data cannot be stored, and the overall health and performance of the Ceph cluster may be adversely affected. Immediate attention is required to prevent the pool from running out of space, which may involve data cleanup, pool expansion, or reconfiguration.

Troubleshooting

Follow the steps below to troubleshoot and resolve the CephPoolGrowthWarning alert.

Step 1: Identify the Pool

```sh
juju exec --unit ceph-mon/leader -- ceph df
```

Step 2: Check Pool Utilization

```sh
juju exec --unit ceph-mon/leader -- ceph osd pool stats <pool_name>
```

Step 3: Review Pool Configuration

```sh
juju exec --unit ceph-mon/leader -- ceph osd pool get <pool_name> all
```

Step 4: Check for Unusual Activity

Determine if there has been an unusual spike in data ingestion or a change in data patterns that could explain the rapid growth. Review log files in /var/log/ceph for any unusual activity or errors related to the pool in question.

Step 5: Storage Expansion

Consider expanding cluster capacity by adding more storage capacity to the cluster. This will likely involve provisioning new hardware or adding disks and then using the ceph-osd charm to integrate the new capacity into the cluster.

Alert: CephPoolNearFull (pools)

Overview

The CephPoolNearFull alert indicates that one or more pools in your Ceph cluster have exceeded the configured warning threshold for space utilization. This scenario can arise either because the actual data stored in the pool is approaching its maximum capacity or because the underlying OSDs that store the pool’s data are nearing their full capacity (reaching the NEARFULL threshold). When this happens, the cluster is at risk of moving into a read-only state to prevent data corruption due to insufficient space for writes. This alert serves as an early warning to take corrective action before the situation escalates.

Troubleshooting

To address the CephPoolNearFull alert, follow this troubleshooting plan for Charmed Ceph:

Step 1: Determine the Affected Pool

Use Juju to run the ceph df detail command on one of the Ceph monitor units to identify which pool(s) are nearing full capacity. Look for QUOTA BYTES and STORED columns to understand the capacity and usage.

```bash
juju exec --unit ceph-mon/leader -- ceph df detail
```

Step 2: Check Overall Cluster Health and OSD Utilization

To understand if the near-full state is due to overall cluster capacity issues, check the cluster health and OSD utilization:

```bash
juju exec --unit ceph-mon/0 -- ceph health detail
juju exec --unit ceph-mon/0 -- ceph osd df
```

Step 3: Increase Pool Quota or Cluster Capacity

  • If the pool quota is the limiting factor, adjust it upwards (see the example after this list). This might be necessary if the application’s data needs have grown.

  • If the cluster itself is running out of capacity, consider adding more storage to the OSDs or adding additional OSDs to the cluster. This will likely involve provisioning new hardware or adding disks and then using the ceph-osd charm to integrate the new capacity into the cluster.
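
For example, the pool quota can be raised or the cluster grown with commands such as the following; the pool name, byte value, and target machine are placeholders:

```bash
# Raise the byte quota on the affected pool.
juju exec --unit ceph-mon/leader -- ceph osd pool set-quota <pool_name> max_bytes <new_bytes>

# Or add another ceph-osd unit to grow overall cluster capacity.
juju add-unit ceph-osd --to <node>
```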

Step 4: Ensure the Balancer is Active

The Ceph balancer helps to distribute data evenly across the OSDs, which can prevent individual OSDs from reaching the NEARFULL state prematurely. Verify that the balancer is enabled and active:

```bash
juju exec --unit ceph-mon/0 -- ceph balancer status
```
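
If the status output shows the balancer is disabled, it can usually be enabled as follows; upmap is a common mode choice, but verify it is appropriate for your cluster before switching:

```bash
# Turn the balancer on and select the upmap balancing mode.
juju exec --unit ceph-mon/0 -- ceph balancer on
juju exec --unit ceph-mon/0 -- ceph balancer mode upmap
```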

Alert: CephRGWMultisiteFetchErrorCritical

Overview

The “CephRGWMultisiteFetchErrorCritical” alert indicates a critical condition within a Ceph multisite configuration, specifically regarding the replication of objects between zones. When this alert triggers, it signifies that the number of unsuccessful object replications from the source zone has surpassed a predefined threshold, set to 50 errors within 15 minutes by default. This situation can impact data consistency across the Ceph multisite deployment and requires immediate attention to diagnose and resolve the underlying issues causing replication failures.

Troubleshooting

Note: `radosgw-admin` commands expect an `--id` parameter, which should be provided as `rgw.<hostname>`. You can use Juju to fetch this hostname and reuse it in all such invocations, as in the following example:

```bash
hn=$(juju exec --unit ceph-radosgw/leader -- hostname)
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
```

Step 1: Verify Network Connectivity

  • Ensure network connectivity between zones is intact. Use ping or traceroute from the Ceph nodes to verify network paths are operational.
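
A simple connectivity probe from the gateway unit might look like this; the remote zone endpoint is a placeholder for the peer zone's RGW address:

```bash
# Check basic reachability of the peer zone's RGW endpoint.
juju exec --unit ceph-radosgw/leader -- ping -c 4 <remote-zone-endpoint>
```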

Step 2: Inspect RGW Logs for Errors

  • Inspect the RGW logs for errors related to replication. The logs can be found in /var/log/ceph on the RGW nodes. Use commands like grep -i error /var/log/ceph/rgw.log to filter for error messages.

Step 3: Examine Multisite Configuration

Ensure the multisite configuration is correctly set up. This includes checking zone groups, zones, and period configurations.

``` bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zonegroup get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zone get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn period get
```

Step 4: Check for Sync Status

- Check the sync status of buckets failing to replicate.

    ``` bash
    juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn bucket sync status --bucket=<bucket_name>
    ```

- Or the overall sync status with

    ``` bash
    juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
    ```

Step 5: Monitor Replication Errors

  • Use radosgw-admin sync error list to identify objects failing to replicate. Analyze the output for patterns or specific errors that could indicate the root cause.
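
Using the hostname variable from the note above, the error list can be retrieved with:

```bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync error list
```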

Step 6: Retry Failed Replications

  • After resolving the underlying issues, use radosgw-admin sync error trim to clear old errors and retry replication for previously failed objects.
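
For example:

```bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync error trim
```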

Step 7: Inspecting Network Traffic with tcpdump

  • If you suspect network issues, use tcpdump on the RGW instances to monitor traffic and identify if objects are failing to transfer (see the capture sketch after this list). This can help pinpoint network-related issues.

  • Based on findings from the previous steps, address any network connectivity issues or misconfigurations identified. This may involve adjusting network settings, fixing endpoint configurations, or updating firewall rules.
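
A minimal capture sketch, assuming the peer zone endpoints are reached over HTTP/HTTPS on ports 80/443 (adjust the filter to your endpoint configuration):

```bash
# Capture a short sample of traffic to and from the replication endpoints.
juju exec --unit ceph-radosgw/leader -- tcpdump -ni any -c 200 'port 80 or port 443'
```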

Step 8: Review Ceph Configuration for RGW

- Check the Ceph configuration files on RGW nodes for any misconfigurations that might be affecting replication.
- Look into `/etc/ceph/ceph.conf` and verify that the settings are correct for your multisite setup.
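
One way to review the RGW-related settings on a gateway unit is shown below; section names and options vary with your deployment:

```bash
# Show any rgw-related sections and options in ceph.conf.
juju exec --unit ceph-radosgw/leader -- grep -iA5 "rgw" /etc/ceph/ceph.conf
```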

Alert: CephRGWMultisiteFetchError

Overview

The “CephRGWMultisiteFetchError” alert indicates a problem with the replication of objects between zones in a Ceph multisite configuration. Specifically, it signifies that the number of unsuccessful object replications from the source zone has surpassed a pre-defined threshold, which is set to 2 errors every 15 minutes by default. This issue can impact data consistency across zones, potentially leading to data unavailability or loss in some scenarios.

This could be caused by several underlying issues such as network instability, misconfigurations in the Ceph multisite setup, problems with the source or destination Ceph Realms, Zones, or Gateways.

Troubleshooting

Note: `radosgw-admin` commands expect an `--id` parameter, which should be provided as `rgw.<hostname>`. You can use Juju to fetch this hostname and reuse it in all such invocations, as in the following example:

```bash
hn=$(juju exec --unit ceph-radosgw/leader -- hostname)
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
```

Step 1: Verify Network Stability and Connectivity

- Ensure there's stable network connectivity between the source and destination zones.
- Use tools like `ping`, `traceroute`, or `mtr` to check for network latency or packet loss.
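
For a continuous view of per-hop latency and packet loss, mtr can be run from a gateway unit; the remote zone endpoint is a placeholder, and mtr may need to be installed first:

```bash
# Report per-hop latency and loss towards the peer zone endpoint, then exit.
juju exec --unit ceph-radosgw/leader -- mtr -rwc 20 <remote-zone-endpoint>
```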

Step 2: Inspect RGW Logs for Errors

- Check the log files of the Ceph RADOS Gateway in `/var/log/ceph` on the machines that host the RGW instances.
- Look for any errors or warnings related to object replication.

Step 3: Examine Multisite Configuration

Ensure the multisite configuration is correctly set up. This includes checking zone groups, zones, and period configurations.

``` bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zonegroup get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zone get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn period get
```

Step 4: Check for Sync Status

- Check the sync status of buckets failing to replicate.

    ``` bash
    juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn bucket sync status --bucket=<bucket_name>
    ```

- Or the overall sync status with

    ``` bash
    juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
    ```

Step 5: Monitor Replication Errors

  • Use radosgw-admin sync error list to identify objects failing to replicate. Analyze the output for patterns or specific errors that could indicate the root cause.

Step 6: Retry Failed Replications

  • After resolving the underlying issues, use radosgw-admin sync error trim to clear old errors and retry replication for previously failed objects.

Step 7: Inspecting Network Traffic with tcpdump

- If you suspect network issues, use `tcpdump` on the RGW instances to monitor traffic and identify if objects are failing to transfer. This can help pinpoint network-related issues.

Step 8: Review Ceph Configuration for RGW

- Check the Ceph configuration files on RGW nodes for any misconfigurations that might be affecting replication.
- Look into `/etc/ceph/ceph.conf` and verify that the settings are correct for your multisite setup.

Alert: CephRGWMultisitePollErrorCritical

Overview

The “CephRGWMultisitePollErrorCritical” alert is triggered when the number of unsuccessful replication log request errors exceeds a predefined threshold, set to 50 errors within 15 minutes by default. This indicates a critical issue with the replication process in a Ceph multisite configuration, where one or more secondary sites are unable to successfully request or retrieve replication logs from the primary site. This failure can have significant implications for data consistency and availability across the Ceph multisite deployment.

Troubleshooting

Note: `radosgw-admin` commands expect an `--id` parameter, which should be provided as `rgw.<hostname>`. You can use Juju to fetch this hostname and reuse it in all such invocations, as in the following example:

```bash
hn=$(juju exec --unit ceph-radosgw/leader -- hostname)
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
```

Step 1: Verify Replication Status

First, assess the overall health and status of the replication process. Use the radosgw-admin tool with the sync status command.

```bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
```

Step 2: Verify Network Stability and Connectivity

- Ensure there's stable network connectivity between the source and destination zones.
- Use tools like `ping`, `traceroute`, or `mtr` to check for network latency or packet loss.

Step 3: Assess Ceph Health

Check the overall health of the Ceph cluster, particularly focusing on the primary site.

```bash
juju exec --unit ceph-mon/0 -- ceph health detail
```

Look for any health warnings or errors that could be impacting the availability or performance of replication logs.

Step 4: Examine Multisite Configuration

Ensure the multisite configuration is correctly set up. This includes checking zone groups, zones, and period configurations.

``` bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zonegroup get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zone get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn period get
```

Misconfigurations here could lead to replication issues.

Step 5: Check Logs for Errors

Inspect the Ceph RADOS Gateway logs for errors related to replication. The logs can be found in /var/log/ceph on the gateway nodes.

```bash
grep -i "error" /var/log/ceph/ceph-client.rgw.*.log
```

Alert: CephRGWMultisitePollError

Overview

The CephRGWMultisitePollError alert indicates that the threshold for unsuccessful replication log request errors in a Ceph multi-site configuration has been surpassed. This threshold is set to 2 errors every 15 minutes by default. This situation can hinder data synchronization between sites, potentially leading to data inconsistency issues.

Troubleshooting

Note: `radosgw-admin` commands expect an `--id` parameter, which should be provided as `rgw.<hostname>`. You can use Juju to fetch this hostname and reuse it in all such invocations, as in the following example:

```bash
hn=$(juju exec --unit ceph-radosgw/leader -- hostname)
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
```

Step 1: Verify Replication Status

First, assess the overall health and status of the replication process. Use the radosgw-admin tool with the sync status command.

```bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn sync status
```

Step 2: Verify Network Stability and Connectivity

- Ensure there's stable network connectivity between the source and destination zones.
- Use tools like `ping`, `traceroute`, or `mtr` to check for network latency or packet loss.

Step 3: Assess Ceph Health

Check the overall health of the Ceph cluster, particularly focusing on the primary site.

```bash
juju exec --unit ceph-mon/0 -- ceph health detail
```

Look for any health warnings or errors that could be impacting the availability or performance of replication logs.

Step 4: Examine Multisite Configuration

Ensure the multisite configuration is correctly set up. This includes checking zone groups, zones, and period configurations.

``` bash
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zonegroup get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn zone get
juju exec --unit ceph-radosgw/leader -- radosgw-admin --id rgw.$hn period get
```

Misconfigurations here could lead to replication issues.

Step 5: Check Logs for Errors

Inspect the Ceph RADOS Gateway logs for errors related to replication. The logs can be found in /var/log/ceph on the gateway nodes.

```bash
grep -i "error" /var/log/ceph/ceph-client.rgw.*.log
```

Alert: CephRGWMultisitePollLatency

Overview

The “CephRGWMultisitePollLatency” alert indicates that the latency for poll requests in a Ceph storage cluster has exceeded a predefined threshold, set at 600 seconds of latency over each 15-minute window by default. This suggests that operations related to multisite synchronization or data replication tasks within the Ceph RADOS Gateway (RGW) are experiencing significant delays. Such latency issues can impact the overall performance of the storage system, particularly in scenarios involving data synchronization across geographical locations.

Troubleshooting

Step 1: Verify System Health and Load

  • Check the overall health of the Ceph cluster using the Juju command: juju exec --unit ceph-mon/0 -- ceph status.
  • Inspect the load on the Rados Gateway (RGW) instances by executing: juju exec --unit ceph-radosgw/leader -- top -bn1 | grep Cpu.

Step 2: Review RGW Logs

  • Access the Rados Gateway logs located in /var/log/ceph on the RGW nodes to identify any error messages or warnings related to latency or timeouts. Use: juju exec --unit ceph-radosgw/leader -- tail -n 100 /var/log/ceph/ceph-client.rgw.<rgw-id>.log.

Step 3: Check Network Latency

  • Measure network latency between the RGW instances, especially if they are geographically distributed. Use tools like ping or mtr to identify network delays. Example: juju exec --application ceph-radosgw -- ping <other-rgw-instance-ip>.

Step 4: Review Hardware and Storage Performance

  • Check the hardware and storage performance metrics to ensure there are no underlying issues causing the latency. This involves inspecting disk I/O, CPU, and memory usage on the RGW units.
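
For example, disk and memory pressure on a gateway unit can be sampled with standard tools; iostat is part of the sysstat package, which may need to be installed on the unit:

```bash
# Sample extended disk I/O statistics three times at 5-second intervals,
# then show current memory usage.
juju exec --unit ceph-radosgw/leader -- iostat -x 5 3
juju exec --unit ceph-radosgw/leader -- free -m
```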

Step 5: Adjust Polling Intervals

  • Consider adjusting the polling intervals to accommodate for network delays or synchronization complexities. This can be done by modifying the relevant configuration settings in the RGW setup.

Alert: CephSlowOps (healthchecks)

Overview

The “CephSlowOps” alert is an indication that the OSD operations within the Ceph storage cluster are experiencing delays. Specifically, this alert is triggered when the time taken to process OSD requests exceeds a predefined threshold known as osd_op_complaint_time. In simple terms, the read, write, or other data operation requests sent to the OSDs are taking longer than expected, which can hint at underlying performance issues or bottlenecks within the cluster.

Troubleshooting

To troubleshoot the “CephSlowOps” alert, follow the steps outlined below.

Step 1: Verify Cluster Health

Use Juju to check the overall health of the cluster and identify any immediate issues.

```bash
juju exec --unit ceph-mon/0 -- ceph status
```

Step 2: Check OSD Performance Metrics

Identify slow OSDs by checking the OSD performance metrics.

```bash
juju exec --unit ceph-mon/leader -- ceph osd perf
```

Step 3: Review OSD Logs

  • Inspect the logs of the OSDs showing slow performance. Logs are located in /var/log/ceph/ on the OSD nodes. Look for any errors or warnings that might indicate problems.
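
For example, slow-request complaints can be filtered out of the OSD logs on an affected node; the log file names depend on the OSD IDs hosted there:

```bash
# Find slow request warnings logged by the OSD daemons on this node.
grep -i "slow request" /var/log/ceph/ceph-osd.*.log
```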

Step 4: Monitor Network Latency

  • Network latency can significantly affect OSD performance. Use tools like ping and iperf to measure latency between OSD nodes.
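
A rough throughput check between two OSD nodes can be done with iperf3; the address below is a placeholder, and iperf3 may need to be installed on both nodes:

```bash
# On the first OSD node, start an iperf3 server:
iperf3 -s

# On the second OSD node, run a test against it (10 seconds by default):
iperf3 -c <first-osd-node-ip>
```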

Step 5: Inspect Disk Health and Performance

  • Run smartctl -a /dev/sdX on OSD nodes to check for any hardware issues with the disks.
  • Utilize iostat and sar to monitor disk I/O performance and identify bottlenecks.

Step 6: Balance the Cluster

  • If some OSDs are under more load than others, rebalance the cluster using the following command:

```bash
juju exec --unit ceph-mon/leader -- ceph osd reweight-by-utilization
```

Step 7: Increase OSD Resources

  • If the hardware is identified as a bottleneck, consider increasing resources (CPU, RAM) or adding more OSD nodes to the cluster.