CassandraCommitlogPendingTasks #
Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}
Alert Rule
alert: CassandraCommitlogPendingTasks
annotations:
  description: |-
    Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/instaclustr-cassandra-exporter/cassandracommitlogpendingtasks/
  summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
expr: cassandra_commit_log_pending_tasks > 15
for: 2m
labels:
  severity: warning
Meaning #
The CassandraCommitlogPendingTasks alert fires when cassandra_commit_log_pending_tasks stays above 15 for at least 2 minutes. The metric counts commit log messages that have been written but not yet synced to disk, so a sustained backlog usually means the node is receiving more writes than its commit log disk can sync, or that disk I/O on the node is slow.
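To make the threshold and hold time concrete, here is a minimal Python sketch of how the rule's `> 15` and `for: 2m` interact. It is a simplification of Prometheus' behavior, not its implementation: it assumes one sample per rule evaluation and ignores missing or stale samples. The function name `firing_times` and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple

THRESHOLD = 15     # from the rule: cassandra_commit_log_pending_tasks > 15
FOR_SECONDS = 120  # from the rule: for: 2m


def firing_times(samples: Iterable[Tuple[float, float]],
                 threshold: float = THRESHOLD,
                 for_seconds: float = FOR_SECONDS) -> List[float]:
    """Return the evaluation timestamps at which the alert would be firing.

    `samples` is a time-ordered iterable of (timestamp_seconds, value) pairs,
    one per rule evaluation.
    """
    firing = []
    breach_start = None  # time of the first evaluation in the current breach
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            # The alert is only pending until the breach has lasted `for_seconds`.
            if ts - breach_start >= for_seconds:
                firing.append(ts)
        else:
            # Any evaluation at or below the threshold resets the timer.
            breach_start = None
    return firing


# Made-up samples, one evaluation every 30 seconds.
samples = [(0, 10), (30, 20), (60, 20), (90, 20),
           (120, 20), (150, 20), (180, 5), (210, 20)]
print(firing_times(samples))
```

In this sketch a short spike that drops back to 15 or below before two minutes have passed never fires, which is why the alert points to a sustained backlog rather than a momentary burst.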
Impact #
A high number of pending tasks in the commit log can cause:
- Increased latency for writes and reads
- Increased memory usage on the Cassandra node
- Risk of losing writes on that node that have not yet been synced to disk if it crashes or is restarted uncleanly
- Performance degradation of the entire Cassandra cluster
Diagnosis #
To diagnose the cause of the alert, follow these steps:
- Check the Cassandra node’s system metrics (e.g., CPU, memory, disk usage) to identify any resource bottlenecks.
- Review the Cassandra logs to identify any errors or exceptions related to write operations.
- Check the Cassandra cluster’s configuration and topology to ensure that the node is properly configured and not overloaded.
- Check whether the Cassandra node is receiving an unusually high volume of write traffic.
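As a starting point for the checks above, it can help to see which nodes are actually over the threshold. The sketch below assumes you have already fetched the JSON body of a Prometheus instant query (`GET /api/v1/query`) for `cassandra_commit_log_pending_tasks`. The function name `nodes_over_threshold`, the instance addresses, and the values are hypothetical.

```python
from typing import Dict, List, Tuple

THRESHOLD = 15  # same threshold as the alert rule


def nodes_over_threshold(response: Dict,
                         threshold: float = THRESHOLD) -> List[Tuple[str, float]]:
    """Return (instance, pending_tasks) pairs above the threshold, worst first.

    `response` is the decoded JSON body of a Prometheus instant query
    for `cassandra_commit_log_pending_tasks`.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    over = []
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "<unknown>")
        # Instant-vector samples have the form [timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value > threshold:
            over.append((instance, value))
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Hypothetical response for a three-node cluster (made-up instances and values).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "10.0.0.1:9500", "cassandra_cluster": "demo"},
             "value": [1700000000, "22"]},
            {"metric": {"instance": "10.0.0.2:9500", "cassandra_cluster": "demo"},
             "value": [1700000000, "40"]},
            {"metric": {"instance": "10.0.0.3:9500", "cassandra_cluster": "demo"},
             "value": [1700000000, "3"]},
        ],
    },
}
print(nodes_over_threshold(sample))
```

As a rough heuristic, a single node over the threshold suggests a local bottleneck such as a slow disk, while most or all nodes over the threshold suggests cluster-wide write pressure.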
Mitigation #
To mitigate the alert, follow these steps:
- Reduce the write load on the Cassandra node by:
  - Load balancing write traffic across multiple nodes
  - Implementing rate limiting or queuing mechanisms for writes
  - Optimizing application code to reduce write frequency or size
- Increase the resources available to the Cassandra node, such as:
  - Adding more CPU or memory resources
  - Upgrading the node’s hardware or infrastructure
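- Consider enabling commit log compression (the commitlog_compression setting in cassandra.yaml) to reduce the volume of data written to the commit log disk. Note that compaction operates on SSTables and does not shrink the commit log.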
- Consider implementing a more robust backup and recovery strategy to minimize data loss in case of node failure.
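One way to apply the rate-limiting step above on the client side is a token bucket. This is a minimal sketch, not a recommendation for any particular driver: the class name `TokenBucket` and its parameters are hypothetical, and the rate and burst size would need tuning against what the cluster can actually absorb.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side write throttle.

    Allows `rate` writes per second on average, with bursts of up to
    `capacity` writes.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the bucket can be tested
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False if the caller should wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Usage: allow roughly 500 writes per second with bursts of up to 100.
bucket = TokenBucket(rate=500, capacity=100)
if bucket.try_acquire():
    pass  # issue the write with your Cassandra driver here
else:
    pass  # queue the write or retry after a short delay
```

When `try_acquire` returns False, the application can queue the write or back off briefly, which smooths bursts before they reach the node's commit log.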
Remember to investigate the root cause of the issue and implement a long-term solution to prevent the alert from recurring.