CephPgInconsistent #
Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.
Alert Rule #
```yaml
alert: CephPgInconsistent
annotations:
  description: |-
    Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephpginconsistent/
  summary: Ceph PG inconsistent (instance {{ $labels.instance }})
expr: ceph_pg_inconsistent > 0
for: 0m
labels:
  severity: warning
```
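The expression can also be evaluated directly against Prometheus to see the current per-instance values. A minimal sketch; the server URL is a placeholder for your own Prometheus endpoint:

```sh
# Query the alert expression via the Prometheus HTTP API.
# http://prometheus:9090 is an assumed address; substitute your server.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=ceph_pg_inconsistent > 0'
```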
Meaning #
The CephPgInconsistent alert is triggered when one or more placement groups (PGs) in the Ceph cluster are flagged inconsistent, which happens when scrubbing finds replicas of the same object that disagree (for example, checksum mismatches, size mismatches, or missing objects). The data is still available, but it is inconsistent across nodes, which can lead to data discrepancies and, if left unrepaired, potential data loss.
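Since inconsistencies are normally caught by Ceph's periodic scrubs, you can re-check a PG before acting on the alert by triggering a deep scrub manually. The PG ID below is a placeholder:

```sh
# Ask the primary OSD to deep-scrub a PG, re-verifying object data
# across all replicas. Replace 2.5 with the PG ID from the alert.
ceph pg deep-scrub 2.5
```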
Impact #
The impact of this alert is moderate to high, as it can lead to:
- Data inconsistencies across nodes
- Potential data loss
- Performance degradation
- Increased risk of cluster instability
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Ceph cluster’s health using the `ceph health` (or `ceph health detail`) command.
- Identify the PGs that are inconsistent using the `ceph pg dump` command.
- Investigate the nodes that are reporting inconsistencies (see the command sketch after this list).
- Check the Ceph logs for any error messages related to the inconsistent PGs.
- Verify that network connectivity between nodes is stable.
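A minimal diagnosis sketch using standard Ceph tooling; the PG ID 2.5 is a placeholder for whatever `ceph health detail` reports:

```sh
# Cluster health summary, including which PGs are inconsistent
ceph health detail

# List all PGs currently in the inconsistent state
ceph pg ls inconsistent

# Inspect which objects/shards inside a PG disagree (placeholder PG ID)
rados list-inconsistent-obj 2.5 --format=json-pretty
```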
Mitigation #
To mitigate the issue, follow these steps:
- Run `ceph pg repair <pgid>` on each inconsistent PG (see the command sketch after this list).
- Verify that the repair was successful by running `ceph pg dump` (or `ceph health detail`) again.
- If inconsistencies persist, investigate the underlying OSDs and disks for hardware errors, and consider enabling the balancer module (`ceph balancer on`) to even out PG placement.
- Check the Ceph cluster’s configuration to ensure it’s properly set up for data replication and consistency.
- Consider adding more nodes or increasing the replication factor to improve data durability and reduce the risk of inconsistencies.
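A mitigation sketch under stated assumptions: 2.5 is a placeholder PG ID, and `mypool` and the size value are illustrative names to adapt to your cluster. Note that in some Ceph versions `ceph pg repair` can overwrite replicas with the primary’s copy, so rule out failing disks before repairing:

```sh
# Repair a specific inconsistent PG (placeholder PG ID)
ceph pg repair 2.5

# Watch cluster events while the repair runs
ceph -w

# Confirm no PGs remain in the inconsistent state
ceph pg ls inconsistent

# Optional hardening: raise the replica count on a pool
# ("mypool" and "3" are illustrative values)
ceph osd pool set mypool size 3
```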
Remember to monitor the Ceph cluster’s health and performance closely after mitigation to ensure the issue is fully resolved.