CephState #

Ceph instance unhealthy

Alert Rule

alert: CephState
annotations:
  description: |-
    Ceph instance unhealthy
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephstate/
  summary: Ceph State (instance {{ $labels.instance }})
expr: ceph_health_status != 0
for: 0m
labels:
  severity: critical

Here is a runbook for the Prometheus alert rule CephState:

Meaning #

The CephState alert is triggered when the Ceph cluster’s health status is not healthy (i.e., ceph_health_status != 0). This indicates that there is an issue with the Ceph cluster that needs to be addressed promptly to prevent data loss or unavailability.

Impact #

The impact of this alert is critical, as an unhealthy Ceph cluster can lead to:

Data loss or corruption
Unavailability of storage resources
Disruption to dependent services and applications

Diagnosis #

To diagnose the issue, follow these steps:

Check the Ceph cluster’s status using ceph -s command
Review the Ceph log files for errors or warnings
Verify that all Ceph nodes are online and reachable
Check for any ongoing maintenance or upgrade activities that may be causing the issue
Review the LABELS and VALUE provided in the alert to identify the specific instance and error code

Mitigation #

To mitigate the issue, follow these steps:

Identify and address the root cause of the health issue (e.g., fix any hardware or software issues, resolve network connectivity problems, etc.)
Restart any failed Ceph services or nodes
Run ceph heal command to initiate the healing process
Verify that the Ceph cluster’s health status has returned to normal
If the issue persists, consider escalating to a senior engineer or Ceph expert for further assistance.

Remember to update the runbook with any additional steps or procedures specific to your environment.