CephPgUnavailable #
Some Ceph placement groups are unavailable.
Alert Rule
alert: CephPgUnavailable
annotations:
  description: |-
    Some Ceph placement groups are unavailable.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephpgunavailable/
  summary: Ceph PG unavailable (instance {{ $labels.instance }})
expr: ceph_pg_total - ceph_pg_active > 0
for: 0m
labels:
  severity: critical
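As a rough illustration of how the rule's expression behaves, the following Python sketch evaluates `ceph_pg_total - ceph_pg_active > 0` against made-up sample values. The function names and the numbers are hypothetical; Prometheus evaluates the expression itself, and with `for: 0m` the alert fires on the first evaluation where the result is positive.

```python
def unavailable_pgs(pg_total: int, pg_active: int) -> int:
    """Mirror the left-hand side of the alert expression: total minus active."""
    return pg_total - pg_active


def alert_fires(pg_total: int, pg_active: int) -> bool:
    """The rule fires when at least one PG is not active."""
    return unavailable_pgs(pg_total, pg_active) > 0


# Hypothetical samples: (ceph_pg_total, ceph_pg_active)
samples = [(256, 256), (256, 250)]
results = [(t, a, unavailable_pgs(t, a), alert_fires(t, a)) for t, a in samples]
for total, active, missing, fires in results:
    print(f"total={total} active={active} unavailable={missing} fires={fires}")
```

The `$value` shown in the alert description corresponds to the difference computed here, that is, the number of PGs that are not active.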
Meaning #
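The CephPgUnavailable alert fires when one or more Ceph placement groups (PGs) are not in the active state, that is, when ceph_pg_total - ceph_pg_active is greater than 0. Client I/O to an inactive PG blocks until the PG becomes active again, so the data it holds is unavailable in the meantime. This is a critical alert: if the OSDs holding every copy of a PG are permanently lost, the unavailability can turn into data loss.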
Impact #
The impact is high because clients cannot read or write the data stored in the affected PGs. If left unresolved, this issue can lead to:
- Data unavailability
- Data loss
- Downtime for critical systems and applications
- Revenue loss due to system unavailability
Diagnosis #
To diagnose the root cause of this issue, follow these steps:
- Check the Ceph cluster status using the `ceph -s` or `ceph status` command.
- Verify the PG status using the `ceph pg dump` command.
- Check the Ceph logs for any errors or warnings related to PG unavailability.
- Identify the specific PGs that are unavailable using the `ceph pg ls` command, or list only the stuck ones with `ceph pg dump_stuck inactive`.
- Verify the health of the Ceph nodes and OSDs using the `ceph osd tree` command (shows which OSDs are up or down) and the `ceph osd df` command (shows utilization per OSD).
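The step of identifying the unavailable PGs can be sketched in Python as a filter over JSON output. This is a minimal sketch, not a definitive tool: the sample data is invented, and it assumes the output of `ceph pg ls --format json` is either a bare list of PG entries or an object with a `pg_stats` list, with `pgid`, `state` and `acting` fields on each entry. Check the shape against your Ceph release before relying on it.

```python
import json

# Hypothetical sample shaped like `ceph pg ls --format json` output.
# Assumption: some releases wrap the list in a "pg_stats" key, others
# return a bare list; only "pgid", "state" and "acting" are used here.
SAMPLE = """
{"pg_stats": [
  {"pgid": "1.0", "state": "active+clean", "acting": [0, 1, 2]},
  {"pgid": "1.1", "state": "down+peering", "acting": [2]},
  {"pgid": "1.2", "state": "active+undersized+degraded", "acting": [0, 1]},
  {"pgid": "1.3", "state": "peering", "acting": [1, 2]}
]}
"""


def inactive_pgs(raw: str) -> list:
    """Return PG entries whose state string lacks the 'active' flag."""
    data = json.loads(raw)
    pgs = data["pg_stats"] if isinstance(data, dict) else data
    # States are joined with '+', so split before testing for 'active';
    # this keeps 'activating' from being mistaken for 'active'.
    return [pg for pg in pgs if "active" not in pg["state"].split("+")]


for pg in inactive_pgs(SAMPLE):
    print(pg["pgid"], pg["state"], "acting OSDs:", pg["acting"])
```

Against a live cluster, the `SAMPLE` string would be replaced by the captured output of the command. The acting OSD IDs of each inactive PG point to the OSDs worth checking first.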
Mitigation #
To mitigate this issue, follow these steps:
- Identify the root cause of the PG unavailability (e.g. node failure, network issues, disk errors).
- Resolve the underlying issue (e.g. replace the failed node, fix the network issue, replace the failing disk) so that the affected OSDs come back up and the PGs can peer again.
- If a PG is reported as inconsistent, run the `ceph pg repair <pgid>` command for that PG. The command takes a PG ID and repairs scrub inconsistencies; it does not by itself make an inactive PG active.
- Verify the PG status using the `ceph pg dump` command to ensure that all PGs are now available.
- Monitor the Ceph cluster status and PG status to ensure that the issue is resolved and does not recur.
No manual status update is needed in Prometheus: the alert resolves on its own once ceph_pg_total - ceph_pg_active returns to 0.
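The verification and monitoring steps above can be sketched as a small polling loop. This is only an illustration under stated assumptions: `fetch_counts` is a hypothetical callable that returns the total and active PG counts (for example, read from the `ceph_pg_total` and `ceph_pg_active` metrics), and the demonstration uses canned readings instead of a live cluster.

```python
import time
from typing import Callable, Tuple


def wait_until_all_active(
    fetch_counts: Callable[[], Tuple[int, int]],
    attempts: int = 10,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until pg_total - pg_active reaches 0 or attempts run out.

    fetch_counts is a hypothetical callable returning (pg_total, pg_active).
    """
    for attempt in range(1, attempts + 1):
        total, active = fetch_counts()
        missing = total - active
        print(f"attempt {attempt}: {missing} PG(s) not active")
        if missing == 0:
            return True
        if attempt < attempts:
            sleep(delay_s)  # wait before the next reading
    return False


# Demonstration with canned readings; the sleep is stubbed out so it runs instantly.
readings = iter([(256, 250), (256, 254), (256, 256)])
recovered = wait_until_all_active(
    lambda: next(readings), attempts=5, sleep=lambda _: None
)
print("recovered:", recovered)
```

Injecting `fetch_counts` and `sleep` keeps the loop independent of how the counts are obtained, so the same function can be exercised with test data before pointing it at a real cluster.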