CephPgInconsistent

Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.

Alert Rule

alert: CephPgInconsistent
annotations:
  description: |-
    Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephpginconsistent/
  summary: Ceph PG inconsistent (instance {{ $labels.instance }})
expr: ceph_pg_inconsistent > 0
for: 0m
labels:
  severity: warning
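
To see which series would trip the rule before an alert reaches you, the same expression can be evaluated through the Prometheus HTTP API. The following is a minimal sketch, not part of the original rule: the function names and the Prometheus base URL in the usage note are hypothetical, and it assumes the standard instant-query response shape of /api/v1/query.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, expr):
    """Run an instant query against the Prometheus HTTP API and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": expr})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def firing_series(response):
    """Return a list of (labels, value) pairs for every series in an instant-vector response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in response["data"]["result"]]
```

A call such as firing_series(query_prometheus("http://prometheus.example:9090", "ceph_pg_inconsistent > 0")) would then list the labels and inconsistent-PG counts of the series that satisfy the rule; an empty list means the alert condition is not met.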


Meaning

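The CephPgInconsistent alert fires as soon as at least one placement group (PG) in the Ceph cluster is flagged inconsistent, meaning a scrub found that the copies of one or more objects in that PG do not match. The data is still available, but the copies disagree across OSDs. Left unrepaired, this can lead to data discrepancies and, if the remaining good copies are lost, to data loss.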

Impact

The impact of this alert is moderate to high, as it can lead to:

Data inconsistencies across nodes
Potential data loss
Performance degradation
Increased risk of cluster instability

Diagnosis

To diagnose the issue, follow these steps:

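Check the Ceph cluster’s health using the ceph health detail command, which lists the affected PGs (typically under the PG_DAMAGED health check).
Identify the PGs that are inconsistent using rados list-inconsistent-pg <pool>, or by filtering the output of ceph pg dump for PGs whose state includes inconsistent.
Investigate the OSDs and nodes that are reporting inconsistencies; rados list-inconsistent-obj <pgid> --format=json-pretty shows which objects and shards disagree.
Check the Ceph cluster log and the logs of the affected OSDs for scrub errors related to the inconsistent PGs.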
Verify that the network connectivity between nodes is stable and not experiencing any issues.
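
The diagnosis steps can be partly scripted. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that ceph health detail --format json reports affected PGs in detail messages of the form "pg <pgid> is <state>, acting [...]" and that rados list-inconsistent-obj returns an "inconsistents" list with per-shard "osd" and "errors" fields. Check both against the output of your Ceph release before relying on it.

```python
import json
import re
import subprocess

# Matches detail lines such as "pg 1.2f is active+clean+inconsistent, acting [1,2,3]".
_PG_LINE = re.compile(r"^pg (\S+) is (\S+)")


def run_json(*cmd: str):
    """Run a CLI command and decode its JSON output (needs a node with Ceph admin access)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def inconsistent_pgs(health: dict) -> list:
    """Extract IDs of PGs whose state includes 'inconsistent' from health detail JSON."""
    pgs = []
    for check in health.get("checks", {}).values():
        for item in check.get("detail", []):
            match = _PG_LINE.match(item.get("message", ""))
            if match and "inconsistent" in match.group(2).rstrip(",").split("+"):
                pgs.append(match.group(1))
    return pgs


def shard_errors(report: dict) -> dict:
    """Map object name -> {osd id: [errors]} for shards that report errors."""
    summary = {}
    for entry in report.get("inconsistents", []):
        name = entry["object"]["name"]
        summary[name] = {
            shard["osd"]: shard["errors"]
            for shard in entry.get("shards", [])
            if shard.get("errors")
        }
    return summary


# Example usage on an admin node (not executed here):
#   health = run_json("ceph", "health", "detail", "--format", "json")
#   for pgid in inconsistent_pgs(health):
#       report = run_json("rados", "list-inconsistent-obj", pgid, "--format=json")
#       print(pgid, shard_errors(report))
```

Grouping errors by OSD makes it easier to spot whether a single OSD, and therefore probably a single disk, is behind most of the mismatches.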

Mitigation

To mitigate the issue, follow these steps:

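Run ceph pg repair <pgid> for each inconsistent PG; the command requires an explicit PG ID.
Verify that the repair was successful: once the repair completes, the PG should no longer show the inconsistent state in ceph health detail or ceph pg dump.
If the issue persists, re-check the PG with ceph pg deep-scrub <pgid> and inspect the disks behind the affected OSDs for hardware errors, since recurring inconsistencies often point to failing media.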
Check the Ceph cluster’s configuration to ensure it’s properly set up for data replication and consistency.
Consider adding more nodes or increasing the replication factor to improve data durability and reduce the risk of inconsistencies.
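
The repair-and-verify loop can be sketched as follows. This is an assumption-laden outline rather than a definitive procedure: the helper names and the timeout and polling defaults are hypothetical, and list_inconsistent stands for any callable you supply that returns the currently inconsistent PG IDs (for example, one built on ceph health detail). The only Ceph command it issues is ceph pg repair <pgid>.

```python
import subprocess
import time


def repair_pg(pgid, run=subprocess.run):
    """Ask Ceph to repair one PG; `ceph pg repair` needs an explicit PG ID."""
    run(["ceph", "pg", "repair", pgid], check=True)


def wait_until_consistent(pgid, list_inconsistent, timeout_s=3600, poll_s=30, sleep=time.sleep):
    """Poll until `pgid` is no longer reported inconsistent; return False on timeout."""
    waited = 0
    while True:
        if pgid not in list_inconsistent():
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s


# Example usage (not executed here), with a caller-provided lookup function:
#   for pgid in current_inconsistent_pgs():
#       repair_pg(pgid)
#       if not wait_until_consistent(pgid, current_inconsistent_pgs):
#           print(f"{pgid} still inconsistent, check the OSD disks")
```

Repairs are queued like scrubs and can take a while on large PGs, which is why the sketch polls with a timeout instead of checking once. Reviewing which shards are in error before repairing is prudent, so that you know which copy is expected to be overwritten.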

Remember to monitor the Ceph cluster’s health and performance closely after mitigation to ensure the issue is fully resolved.