CephState #

Ceph instance unhealthy

Alert Rule
alert: CephState
annotations:
  description: |-
    Ceph instance unhealthy
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephstate/
  summary: Ceph State (instance {{ $labels.instance }})
expr: ceph_health_status != 0
for: 0m
labels:
  severity: critical

Meaning #

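The CephState alert fires when the ceph_health_status metric is anything other than 0, that is, when the Ceph cluster reports a state other than HEALTH_OK. The metric, exported by the Ceph manager’s Prometheus module, takes the values 0 (HEALTH_OK), 1 (HEALTH_WARN), and 2 (HEALTH_ERR), so the alert covers warnings as well as errors. Because the rule uses for: 0m, it fires on the first evaluation that sees a non-zero value. The underlying issue should be addressed promptly to prevent data loss or unavailability.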
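As a quick reference, the numeric value can be translated back into Ceph's health names. The sketch below assumes the mapping used by the Ceph manager's Prometheus module (0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR). The function names health_name and alert_fires are illustrative and not part of any Ceph or Prometheus tooling.

```shell
#!/usr/bin/env bash
# Translate a ceph_health_status sample (the alert's VALUE) into a health name.
# Assumed mapping: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR.
health_name() {
  case "$1" in
    0) echo "HEALTH_OK" ;;
    1) echo "HEALTH_WARN" ;;
    2) echo "HEALTH_ERR" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Mirror the alert expression `ceph_health_status != 0`:
# succeeds (exit 0) for any value other than 0.
alert_fires() {
  [ "$1" != "0" ]
}
```

For example, health_name 2 prints HEALTH_ERR, and alert_fires 1 succeeds because HEALTH_WARN is already enough to trigger the rule.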

Impact #

The alert carries severity critical because an unhealthy Ceph cluster can lead to:

  • Data loss or corruption
  • Unavailability of storage resources
  • Disruption to dependent services and applications

Diagnosis #

To diagnose the issue, follow these steps:

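  1. Check the cluster’s overall status with ceph -s, then run ceph health detail to see which health checks are failing
  2. Review the Ceph log files (typically under /var/log/ceph/ or in the system journal, depending on how the cluster was deployed) for errors or warnings
  3. Verify that all Ceph nodes are online and reachable, that the monitors are in quorum (ceph mon stat), and that the OSDs are up (ceph osd tree)
  4. Check for any ongoing maintenance or upgrade activities that may be causing the issue; cluster flags such as noout, which are often set during maintenance, are themselves reported as a health warning
  5. Review the LABELS and VALUE provided in the alert: the instance label identifies the endpoint that reported the metric, and VALUE is 1 for HEALTH_WARN or 2 for HEALTH_ERR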
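The diagnosis steps above can be sketched as a single read-only pass. This is a minimal sketch, assuming it runs on a host with the ceph CLI and an admin keyring; the function name diagnose_ceph is hypothetical, and the commands only read state and change nothing on the cluster.

```shell
#!/usr/bin/env bash
# Read-only diagnosis pass over a Ceph cluster.
# Assumes the `ceph` CLI is installed and can authenticate to the cluster.
diagnose_ceph() {
  echo "== cluster status =="
  ceph -s
  echo "== failing health checks =="
  ceph health detail
  echo "== OSD tree (up/down, in/out) =="
  ceph osd tree
  echo "== monitor quorum =="
  ceph mon stat
  echo "== stuck placement groups =="
  ceph pg dump_stuck
  echo "== cluster flags (e.g. noout set for maintenance) =="
  ceph osd dump | grep '^flags'
  echo "== recent daemon crashes =="
  ceph crash ls-new
}
```

Calling diagnose_ceph prints each section in turn; the output of ceph health detail is usually the most direct pointer to the failing check.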

Mitigation #

To mitigate the issue, follow these steps:

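  1. Identify and address the root cause reported by ceph health detail (e.g., hardware faults, software issues, or network connectivity problems)
  2. Restart any failed Ceph daemons and bring failed nodes back online
  3. Let recovery run: Ceph starts recovery and backfill on its own once the affected daemons are back, and progress is visible in ceph -s. For placement groups reported as inconsistent, run ceph pg repair <pgid>
  4. Verify that the Ceph cluster’s health status has returned to HEALTH_OK (ceph health) and that the alert has resolved
  5. If the issue persists, consider escalating to a senior engineer or Ceph expert for further assistance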
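The verification step can be sketched as a small polling helper. This is a sketch under stated assumptions: the function name wait_for_health_ok and its default timeout and interval are hypothetical, and it relies only on ceph health printing a line that starts with HEALTH_OK when the cluster is healthy.

```shell
#!/usr/bin/env bash
# Poll `ceph health` until it reports HEALTH_OK or the timeout expires.
# $1: timeout in seconds (hypothetical default 600)
# $2: polling interval in seconds (hypothetical default 10; use a value >= 1)
wait_for_health_ok() {
  local timeout="${1:-600}" interval="${2:-10}" waited=0 status
  while [ "$waited" -le "$timeout" ]; do
    # An unreachable cluster yields an empty status instead of aborting.
    status="$(ceph health 2>/dev/null || true)"
    case "$status" in
      HEALTH_OK*) echo "cluster healthy after ${waited}s"; return 0 ;;
    esac
    echo "still ${status:-unreachable}; waiting ${interval}s"
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s; last status: ${status:-unreachable}" >&2
  return 1
}
```

After restarting a failed daemon (for example with systemctl restart ceph-osd@<id> on package-based deployments, or ceph orch daemon restart osd.<id> under cephadm, where <id> is a placeholder), wait_for_health_ok 900 15 waits up to 15 minutes for recovery to finish. A non-zero exit code is a reasonable signal to escalate.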

Remember to update the runbook with any additional steps or procedures specific to your environment.