CephPgInconsistent #
Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.
Alert Rule #
```yaml
alert: CephPgInconsistent
annotations:
  description: |-
    Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephpginconsistent/
  summary: Ceph PG inconsistent (instance {{ $labels.instance }})
expr: ceph_pg_inconsistent > 0
for: 0m
labels:
  severity: warning
```
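As a quick sanity check, the metric behind the alert expression can be queried directly from the Prometheus HTTP API. The host and port below are placeholders for your Prometheus endpoint, and `jq` is assumed to be installed:

```sh
# Inspect the raw metric the alert is built on (adjust the Prometheus URL)
curl -s 'http://prometheus:9090/api/v1/query?query=ceph_pg_inconsistent' | jq .
```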
Meaning #
The CephPgInconsistent alert fires when one or more placement groups (PGs) in the Ceph cluster are in the inconsistent state. A PG is marked inconsistent when scrubbing finds that copies of an object (replicas or erasure-coded shards) disagree across OSDs: the data is still available, but it no longer matches across nodes, which can lead to data discrepancies and, if left unrepaired, data loss.
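Inconsistencies are normally surfaced by Ceph's periodic scrubs. A deep scrub can be triggered manually to re-check a suspect PG; the PG ID below is a placeholder:

```sh
# Force a deep scrub of a specific PG; findings show up in 'ceph health detail'
ceph pg deep-scrub 2.1f
```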
Impact #
The impact of this condition is moderate to high; inconsistent PGs can lead to:
- Data inconsistencies across nodes
- Potential data loss
- Performance degradation
- Increased risk of cluster instability
Diagnosis #
To diagnose the issue, follow these steps (the commands are collected in the sketch after this list):
- Check the Ceph cluster’s health using the `ceph health` command (`ceph health detail` lists the affected PG IDs).
- Identify the inconsistent PGs using the `ceph pg dump` command, filtering for the `inconsistent` state.
- Investigate the OSD nodes that are reporting inconsistencies.
- Check the Ceph OSD logs for error messages related to the inconsistent PGs, such as scrub errors.
- Verify that the network connectivity between nodes is stable and not experiencing any issues.
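A minimal diagnostic sequence, assuming a standard `ceph` CLI with admin credentials and default log locations; the PG ID `2.1f` and log path are illustrative:

```sh
# Overall health; 'detail' lists the PGs marked inconsistent
ceph health detail

# List only the PGs currently in the inconsistent state
ceph pg ls inconsistent

# For a specific PG, show which objects/shards disagree
# (requires the PG to have been scrubbed recently)
rados list-inconsistent-obj 2.1f --format=json-pretty

# Check OSD logs on the hosts serving the PG for scrub errors
grep -i 'scrub.*error\|inconsistent' /var/log/ceph/ceph-osd.*.log
```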
Mitigation #
To mitigate the issue, follow these steps (a command sketch follows the list):
- Run `ceph pg repair <pgid>` on each inconsistent PG. Repair rebuilds the damaged copies from the replica Ceph considers authoritative, so it is prudent to rule out failing disks on the affected OSDs first.
- Verify that the repair was successful by running `ceph pg dump` (or `ceph health detail`) again.
- If inconsistencies keep recurring, consider rebalancing data across OSDs with the balancer module (`ceph balancer on`).
- Check the Ceph cluster’s configuration to ensure it’s properly set up for data replication and consistency (for example, pool `size` and `min_size`).
- Consider adding more OSD nodes or increasing the pool replication factor to improve data durability and reduce the risk of inconsistencies.
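A minimal repair sequence under the same assumptions as above; the PG ID `2.1f` and pool name `mypool` are placeholders:

```sh
# Repair a specific inconsistent PG (substitute the ID from 'ceph health detail')
ceph pg repair 2.1f

# Confirm the PG has left the inconsistent state
ceph pg ls inconsistent

# If inconsistencies recur, enable the balancer to even out data placement
ceph balancer on

# Optionally raise the replication factor of the affected pool
ceph osd pool set mypool size 3
```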
Remember to monitor the Ceph cluster’s health and performance closely after mitigation to ensure the issue is fully resolved.
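One simple way to keep watch after the repair, assuming shell access to a monitor node:

```sh
# Follow the cluster log in real time
ceph -w

# Or poll overall status periodically
watch -n 30 ceph status
```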