CephPgDown #
Some Ceph placement groups are down. Please ensure that all the data are available.
Alert Rule #
alert: CephPgDown
annotations:
  description: |-
    Some Ceph placement groups are down. Please ensure that all the data are available.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephpgdown/
  summary: Ceph PG down (instance {{ $labels.instance }})
expr: ceph_pg_down > 0
for: 0m
labels:
  severity: critical
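To confirm what the alert is seeing, you can evaluate the same expression against the Prometheus HTTP API. This is a sketch; the `localhost:9090` endpoint is an assumption, so substitute your own Prometheus address:

```shell
# Evaluate the alert expression via the Prometheus HTTP API
# (endpoint is an assumption; adjust to your deployment).
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ceph_pg_down > 0'
```

A non-empty `result` array in the JSON response means at least one instance is currently reporting down PGs.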
Meaning #
The CephPgDown alert is triggered when one or more Ceph placement groups (PGs) are in a down state. A PG is down when the OSDs holding the data it needs are unavailable, so that PG cannot serve reads or writes. Because PGs are the units in which Ceph stores and manages data, down PGs put data availability directly at risk.
Impact #
The impact of this alert is high, as it can lead to data unavailability and potential data loss. If left unaddressed, this issue can cause:
- Data inaccessibility
- Application downtime
- Data loss or corruption
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Ceph cluster's overall health using the Ceph dashboard or command-line tools.
- Identify the specific PGs that are down using the Ceph command `ceph pg dump`.
- Check the Ceph nodes' system logs for any error messages related to the down PGs.
- Verify that all Ceph nodes are running and reachable.
- Check the network connectivity between Ceph nodes.
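The steps above map onto a handful of standard Ceph CLI commands; a sketch, to be run on a node with admin credentials (the PG id `1.2f` is a placeholder):

```shell
# Overall cluster health and a detailed breakdown of any warnings/errors.
ceph status
ceph health detail

# List only the PGs currently in the "down" state, with their acting OSDs.
ceph pg ls down

# Full PG dump if you need the complete picture (verbose).
ceph pg dump

# Query one problem PG in depth (replace 1.2f with an actual PG id).
ceph pg 1.2f query

# Confirm all OSDs are up and in, and spot unreachable hosts.
ceph osd tree
```

`ceph pg 1.2f query` is usually the most informative: its `recovery_state` section explains why the PG is down and which OSDs it is waiting for.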
Look for common causes of PG downtime, such as:
- Node failures or restarts
- Network connectivity issues
- Disk failures or corruption
- Configuration errors
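Each of these common causes can be checked from the shell. The hostname, device name, and log filters below are illustrative assumptions; adapt them to your environment:

```shell
# Node failures or restarts: uptime and recent Ceph daemon crashes.
uptime
ceph crash ls

# Network connectivity: reach the other Ceph nodes (hostname is an example).
ping -c 3 ceph-node2

# Disk failures or corruption: kernel I/O errors and SMART health
# (device name is an example).
dmesg | grep -iE 'i/o error|ata|scsi'
smartctl -H /dev/sda

# Configuration errors: review all non-default settings in effect.
ceph config dump
```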
Mitigation #
To mitigate the issue, follow these steps:
- Identify the root cause of the PG downtime and address it accordingly.
- If a node is down, restart it, or replace it if necessary.
- If network connectivity is the issue, restore connectivity between nodes.
- If disk failures or corruption are the cause, replace the faulty disks and run `ceph pg repair` on the affected PGs to recover data.
- If configuration errors are the cause, correct the configuration and restart the affected nodes.
- Once the root cause is addressed, down PGs should recover as the OSDs holding their data rejoin the cluster; run `ceph pg repair` on any PGs that remain inconsistent.
- Monitor the Ceph cluster's health and verify that all PGs are up and running.
- Perform a data scrub to ensure data integrity.
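The recovery steps above can be sketched as follows; the OSD id `7` and PG id `1.2f` are placeholders for the failed OSD and affected PG you identified during diagnosis:

```shell
# Bring a failed OSD daemon back into service (id 7 is a placeholder).
systemctl restart ceph-osd@7

# Watch recovery/backfill progress; down PGs should transition to active.
ceph -s
ceph pg ls down

# Repair a PG that remains inconsistent after recovery.
ceph pg repair 1.2f

# Verify data integrity with a deep scrub of the affected PG.
ceph pg deep-scrub 1.2f
```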
Remember to follow proper change management procedures and communicate with stakeholders before making any changes to the Ceph cluster.