CassandraRepairPendingTasks #
Some Cassandra repair tasks are pending
Alert Rule
alert: CassandraRepairPendingTasks
annotations:
  description: |-
    Some Cassandra repair tasks are pending
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/criteo-cassandra-exporter/cassandrarepairpendingtasks/
  summary: Cassandra repair pending tasks (instance {{ $labels.instance }})
expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2
for: 2m
labels:
  severity: warning
Meaning #
The CassandraRepairPendingTasks alert fires when the AntiEntropyStage thread pool, which handles repair work, reports more than 2 pending tasks for at least 2 minutes. This indicates that repair work is being queued faster than the node completes it, so replicas may stay out of sync for longer than intended.
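The interaction between the threshold and the 2-minute window can be sketched in a few lines of Python. This is a simplified model of how a rule with a for clause behaves, not Prometheus code: the function name first_firing_time and the (timestamp, value) sample format are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

THRESHOLD = 2       # from the rule's `> 2`
FOR_SECONDS = 120   # from the rule's `for: 2m`


def first_firing_time(
    samples: Iterable[Tuple[float, float]],
    threshold: float = THRESHOLD,
    for_seconds: float = FOR_SECONDS,
) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    `samples` is a time-ordered iterable of (unix_timestamp, pending_tasks)
    pairs, one pair per rule evaluation (hypothetical input format).
    """
    active_since: Optional[float] = None
    for ts, pending_tasks in samples:
        if pending_tasks > threshold:
            if active_since is None:
                # Condition just became true: the hold timer starts here.
                active_since = ts
            if ts - active_since >= for_seconds:
                # Condition held for the whole window: the alert fires.
                return ts
        else:
            # Any evaluation at or below the threshold resets the timer.
            active_since = None
    return None
```

A short spike that drops back to 2 or fewer pending tasks before the window elapses resets the timer, so the alert only fires on a sustained backlog.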
Impact #
- Data inconsistencies: Replicas that are not repaired can diverge, so reads at low consistency levels may return stale data, and deleted data can reappear if a repair does not complete within gc_grace_seconds.
- Availability issues: If pending tasks keep accumulating, the added memory and disk pressure can, in severe cases, make nodes unresponsive and reduce the availability of the cluster.
- Performance degradation: Repair competes with client traffic for CPU, disk I/O, and network bandwidth, so a large backlog can increase read and write latency.
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Cassandra logs for any errors or exceptions related to repair tasks.
- Verify that the Cassandra nodes are properly configured and have sufficient resources (CPU, memory, disk space) to handle the repair tasks.
- Check the Cassandra metrics for signs of high latency, high CPU usage, or high disk usage.
- Verify that the repair tooling in use (for example nodetool repair run from cron, or Cassandra Reaper) is configured correctly and is not starting overlapping repairs.
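To confirm the backlog directly on a node, the AntiEntropyStage row printed by nodetool tpstats can be read programmatically. The sketch below assumes the column order that nodetool tpstats commonly prints (pool name, active, pending, completed); check it against the output of your Cassandra version. The function names parse_tpstats and read_local_tpstats are illustrative, not part of any Cassandra API.

```python
import subprocess
from typing import Dict, Optional


def parse_tpstats(output: str, pool: str = "AntiEntropyStage") -> Optional[Dict[str, int]]:
    """Extract the active/pending/completed counters for one thread pool
    from the text printed by `nodetool tpstats`.

    Assumes rows of the form: <pool name> <active> <pending> <completed> ...
    Returns None if the pool is not present in the output.
    """
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == pool:
            try:
                active, pending, completed = (int(f) for f in fields[1:4])
            except ValueError:
                continue  # not a numeric pool row; keep scanning
            return {"active": active, "pending": pending, "completed": completed}
    return None


def read_local_tpstats() -> str:
    """Run `nodetool tpstats` on the local node (nodetool must be on PATH)."""
    result = subprocess.run(
        ["nodetool", "tpstats"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

On a Cassandra node, parse_tpstats(read_local_tpstats()) returns the counters for AntiEntropyStage. A pending value that stays above 2 across several runs matches the condition this alert checks.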
Mitigation #
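To mitigate the issue, work through these steps, starting with the least disruptive:
- Correct any misconfiguration found in the repair tooling (for example nodetool repair run from cron, or Cassandra Reaper), such as overlapping schedules that queue new repairs before earlier ones finish.
- Increase the resources (CPU, memory, disk space) allocated to the Cassandra nodes if they are saturated while repairs run.
- Consider scaling out the Cassandra cluster to distribute the repair work across more nodes.
- If repair tasks appear stuck and the backlog does not shrink, restart the affected Cassandra node to clear them. Restart one node at a time.
- Once the backlog has cleared, consider running a manual repair with nodetool repair for any ranges that were left unrepaired.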
Remember to monitor the Cassandra metrics closely to ensure that the issue is resolved and does not recur.
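One way to check that the backlog has really cleared is to poll the pending count until it stays at or below the alert threshold for several consecutive readings. The sketch below is a minimal example under stated assumptions: read_pending is a hypothetical callable that you supply (for instance one backed by nodetool tpstats or by a Prometheus query), and the default interval and check counts are arbitrary starting points.

```python
import time
from typing import Callable


def wait_until_drained(
    read_pending: Callable[[], int],
    threshold: int = 2,
    stable_checks: int = 3,
    interval_seconds: float = 30.0,
    max_checks: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until pending tasks stay at or below `threshold` for
    `stable_checks` consecutive readings.

    Returns True once the backlog is stable, or False if that does not
    happen within `max_checks` readings.
    """
    consecutive_ok = 0
    for check in range(max_checks):
        if read_pending() <= threshold:
            consecutive_ok += 1
            if consecutive_ok >= stable_checks:
                return True
        else:
            # Backlog is above the threshold again: start counting afresh.
            consecutive_ok = 0
        if check < max_checks - 1:
            sleep(interval_seconds)
    return False
```

Requiring several consecutive readings avoids declaring success on a single low sample while repair work is still arriving in bursts.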