CephState #
Ceph instance unhealthy
Alert Rule
alert: CephState
annotations:
  description: |-
    Ceph instance unhealthy
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephstate/
  summary: Ceph State (instance {{ $labels.instance }})
expr: ceph_health_status != 0
for: 0m
labels:
  severity: critical
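If this rule is maintained as part of a Prometheus rule file, it can be checked for syntax errors before deployment. A minimal sketch, assuming the rule is wrapped in a standard groups: block and saved as ceph-alerts.yml (a hypothetical file name):

```bash
# Validate the rule file before loading it into Prometheus
# (ceph-alerts.yml is an assumed name; the alert must sit under groups: -> rules:)
promtool check rules ceph-alerts.yml
```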
Meaning #
The CephState alert fires when the Ceph cluster reports an unhealthy state, i.e. ceph_health_status != 0. The ceph-mgr Prometheus module exposes this metric as 0 for HEALTH_OK, 1 for HEALTH_WARN, and 2 for HEALTH_ERR, so any non-zero value means the cluster has active health checks that need to be addressed promptly to prevent data loss or unavailability.
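As a quick first check, it can help to compare what Prometheus sees with what the cluster itself reports. A minimal sketch, assuming a Prometheus server reachable at http://prometheus:9090 (an illustrative address) and ceph CLI access on a monitor or admin node:

```bash
# Current value of the alerting metric per instance, straight from Prometheus
# (http://prometheus:9090 is an assumed address - replace with your server)
curl -s 'http://prometheus:9090/api/v1/query?query=ceph_health_status' \
  | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'

# The cluster's own view: overall status plus the specific health checks firing
ceph health detail
```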
Impact #
The impact of this alert is critical, as an unhealthy Ceph cluster can lead to:
- Data loss or corruption
- Unavailability of storage resources
- Disruption to dependent services and applications
Diagnosis #
To diagnose the issue, follow these steps (example commands are shown after this list):
- Check the Ceph cluster's status with the `ceph -s` command
- Review the Ceph log files for errors or warnings
- Verify that all Ceph nodes are online and reachable
- Check for any ongoing maintenance or upgrade activities that may be causing the issue
- Review the LABELS and VALUE provided in the alert to identify the affected instance and the reported health value
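A minimal set of diagnostic commands, assuming CLI access on a monitor or admin node and a systemd-managed deployment (unit names and the time window are illustrative):

```bash
# Overall cluster state: mon quorum, OSD counts, PG states
ceph -s

# Detailed list of the active health checks behind the non-zero status
ceph health detail

# Confirm all OSDs and hosts are up and in their expected positions
ceph osd tree

# Recent daemon logs on a Ceph node; adjust the unit pattern for your daemons
# (e.g. ceph-mon@<hostname>, ceph-osd@<id>)
journalctl -u 'ceph-osd@*' --since '1 hour ago' --no-pager | tail -n 100
```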
Mitigation #
To mitigate the issue, follow these steps (example commands are shown after this list):
- Identify and address the root cause of the health issue (e.g., fix hardware or software faults, resolve network connectivity problems)
- Restart any failed Ceph services or nodes
- Let the cluster recover and rebalance; for inconsistent placement groups, a repair can be triggered with `ceph pg repair <pgid>`
- Verify that the Ceph cluster's health status has returned to HEALTH_OK
- If the issue persists, consider escalating to a senior engineer or Ceph expert for further assistance.
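A minimal mitigation sketch, assuming a systemd-managed (non-containerized) Ceph deployment; the OSD id and PG id below are placeholders taken from `ceph health detail` output:

```bash
# Restart a failed OSD daemon (replace 12 with the OSD id reported as down)
sudo systemctl restart ceph-osd@12

# If health detail reports inconsistent PGs, ask Ceph to repair one
# (replace 2.1f with the affected PG id)
ceph pg repair 2.1f

# Watch recovery progress and confirm the cluster returns to HEALTH_OK
ceph -s
ceph health
```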
Remember to update the runbook with any additional steps or procedures specific to your environment.