CassandraConnectionTimeoutsTotal #
Some connection between nodes are ending in timeout
Alert Rule
alert: CassandraConnectionTimeoutsTotal
annotations:
description: |-
Some connection between nodes are ending in timeout
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/criteo-cassandra-exporter/cassandraconnectiontimeoutstotal/
summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m])
> 5
for: 2m
labels:
severity: critical
Here is a runbook for the CassandraConnectionTimeoutsTotal alert rule:
Meaning #
The CassandraConnectionTimeoutsTotal alert is triggered when the rate of Cassandra connection timeouts exceeds 5 in a 1-minute period, sustained for 2 minutes. This indicates that some connections between nodes are ending in timeouts, which can impact the performance and reliability of the Cassandra cluster.
Impact #
- Connection timeouts can lead to data inconsistencies and loss of availability
- Impacts read and write performance, leading to slower response times and errors
- Can cause cascading failures and affect overall system reliability
- May indicate underlying issues with network connectivity, Cassandra configuration, or node health
Diagnosis #
- Check Cassandra logs for errors and warnings related to connection timeouts
- Verify network connectivity between nodes using tools like
ping
andtelnet
- Run
nodetool
commands to check node status and connection metrics - Review Cassandra configuration files (e.g.
cassandra.yaml
) for any misconfigurations - Check system resource utilization (e.g. CPU, memory, disk space) to rule out resource-related issues
Mitigation #
- Investigate and resolve any network issues or misconfigurations
- Tune Cassandra configuration settings to optimize connection timeouts and retries
- Implement connection pooling or load balancing to reduce the load on individual nodes
- Consider upgrading Cassandra versions or patching known issues related to connection timeouts
- Restart nodes or restart Cassandra service if necessary to clear out stuck connections
- Monitor Cassandra metrics closely to detect any recurrences of the issue and take proactive measures to prevent it.