CassandraRepairPendingTasks #

Some Cassandra repair tasks are pending

Alert Rule

alert: CassandraRepairPendingTasks
annotations:
  description: |-
    Some Cassandra repair tasks are pending
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/criteo-cassandra-exporter/cassandrarepairpendingtasks/
  summary: Cassandra repair pending tasks (instance {{ $labels.instance }})
expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"}
  > 2
for: 2m
labels:
  severity: warning


Meaning #

The CassandraRepairPendingTasks alert fires when the number of pending tasks in Cassandra's AntiEntropyStage thread pool (the pool that handles repair work) stays above 2 for 2 minutes. A persistent backlog suggests that repair work is being queued faster than the node can process it, which can leave replicas inconsistent for longer and eventually affect availability.
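The firing condition can be sketched in a few lines of Python. This is a simplified model of the rule's threshold and `for` clause, not Prometheus's actual evaluation loop; the function name `alert_would_fire` and the sample format are assumptions made for illustration.

```python
from typing import Iterable, Tuple

THRESHOLD = 2      # from the alert expr: value > 2
FOR_SECONDS = 120  # from the alert rule: for: 2m


def alert_would_fire(samples: Iterable[Tuple[float, float]]) -> bool:
    """Return True if pending tasks stay above THRESHOLD for FOR_SECONDS.

    samples: (unix_timestamp, pending_tasks) pairs in ascending time order.
    Simplification: each sample stands in for one rule evaluation.
    """
    breach_start = None
    for timestamp, pending in samples:
        if pending > THRESHOLD:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
            if timestamp - breach_start >= FOR_SECONDS:
                return True
        else:
            breach_start = None  # any dip to 2 or below resets the timer
    return False
```

Note that a single sample at or below the threshold resets the timer, so short spikes in pending repair tasks do not trigger the alert.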

Impact #

Data inconsistencies: Data that remains unrepaired can diverge between replicas, which can cause issues with data accuracy and integrity.
Availability issues: If pending tasks keep accumulating, the added load can destabilize nodes and, in the worst case, make part of the cluster unavailable.
Performance degradation: A large backlog of repair tasks competes with regular traffic for CPU, disk, and network, which can slow Cassandra down and increase latency.

Diagnosis #

To diagnose the issue, follow these steps:

Check the Cassandra logs for any errors or exceptions related to repair tasks.
Verify that the Cassandra nodes are properly configured and have sufficient resources (CPU, memory, disk space) to handle the repair tasks.
Check the Cassandra metrics for signs of high latency, high CPU usage, or high disk usage.
Verify that your repair tooling (for example, nodetool repair or a scheduler such as Cassandra Reaper) is correctly configured and running as expected.
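When checking a node directly, the same pending count that drives the alert is visible in the output of `nodetool tpstats`. The sketch below extracts it, assuming the usual column order (Pool Name, Active, Pending, ...); the helper name `antientropy_pending` is hypothetical, the sample output is illustrative rather than taken from a real cluster, and the exact layout may vary between Cassandra versions.

```python
from typing import Optional


def antientropy_pending(tpstats_output: str) -> Optional[int]:
    """Return the Pending count for AntiEntropyStage, or None if absent.

    Assumes the columns are: Pool Name, Active, Pending, Completed, ...
    """
    for line in tpstats_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "AntiEntropyStage":
            return int(fields[2])  # third column is Pending
    return None


# Illustrative sample only; real values will differ.
SAMPLE_TPSTATS = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0         123456         0                 0
AntiEntropyStage                  1         5            789         0                 0
"""

if __name__ == "__main__":
    # On a node, the input could come from:
    #   subprocess.run(["nodetool", "tpstats"], capture_output=True,
    #                  text=True, check=True).stdout
    print(antientropy_pending(SAMPLE_TPSTATS))
```

A value that stays above 2 across several checks matches the condition the alert is reporting.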

Mitigation #

To mitigate the issue, follow these steps:

If repair tasks appear stuck, restart the affected Cassandra node to clear them.
Verify that your repair tooling (for example, nodetool repair or a scheduler such as Cassandra Reaper) is correctly configured and running as expected.
Increase the resources (CPU, memory, disk space) allocated to the Cassandra nodes to handle the repair tasks.
Consider scaling out the Cassandra cluster to distribute the repair tasks across more nodes.
If the issue persists, consider running a manual repair with nodetool repair.

Remember to monitor the Cassandra metrics closely to ensure that the issue is resolved and does not recur.
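One way to confirm recovery is to run the alert's metric selector as an instant query against Prometheus (`/api/v1/query`) and list the instances still above the threshold. The sketch below only parses the JSON response body; the function name `instances_over_threshold` and the instance names in the example are hypothetical, and fetching the response is left to whatever HTTP client you already use.

```python
import json
from typing import Dict

# Metric selector taken from the alert rule's expr.
QUERY = (
    'cassandra_stats{name="org:apache:cassandra:metrics:threadpools:'
    'internal:antientropystage:pendingtasks:value"}'
)


def instances_over_threshold(response_body: str, threshold: float = 2) -> Dict[str, float]:
    """Map instance label to pending count for series above the threshold.

    response_body: JSON text returned by a Prometheus instant query.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    over = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        value = float(series["value"][1])  # value is [timestamp, "string"]
        if value > threshold:
            over[instance] = value
    return over


# Illustrative response with hypothetical instance names.
EXAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "cassandra-1:8080"}, "value": [1700000000, "4"]},
            {"metric": {"instance": "cassandra-2:8080"}, "value": [1700000000, "0"]},
        ],
    },
})
```

An empty result from `instances_over_threshold` suggests the backlog has drained on every instance that reports the metric.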