CassandraHintsCount #
Cassandra hints count has changed on {{ $labels.instance }}; some nodes may go down.
Alert Rule
alert: CassandraHintsCount
annotations:
  description: |-
    Cassandra hints count has changed on {{ $labels.instance }} some nodes may go down
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/criteo-cassandra-exporter/cassandrahintscount/
  summary: Cassandra hints count (instance {{ $labels.instance }})
expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3
for: 0m
labels:
  severity: critical
Meaning #
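The CassandraHintsCount alert fires when the total hints count reported for an instance changes more than 3 times within 1 minute. Cassandra stores a hint when a coordinator cannot deliver a write to a replica, so a rapidly changing count suggests that one or more nodes are down or unreachable, which puts data consistency at risk.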
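To make the threshold concrete, the following shell sketch mimics what the changes() function in the alert expression counts: the number of times a value differs from the previous sample inside the window. The count_changes helper is hypothetical and written only for illustration; it is not how Prometheus implements the function.

```shell
# count_changes: read one sample value per line on stdin and print how many
# times the value differs from the previous sample. Hypothetical helper that
# only illustrates the idea behind PromQL changes().
count_changes() {
  awk 'NR > 1 && $1 != prev { n++ } { prev = $1 } END { print n + 0 }'
}

# Example with seven samples of the total hints count:
#   printf '10\n10\n12\n15\n15\n19\n23\n' | count_changes   # prints 4
```

In this example the value changes four times, which would exceed the threshold of 3. A window holding N samples can yield at most N - 1 changes, so whether the alert can fire at all depends on how many scrapes fall inside the 1m range.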
Impact #
If left unaddressed, this issue can lead to:
- Data inconsistencies across nodes in the Cassandra cluster
- Unavailable nodes, which a rising hints count often signals, resulting in downtime and possible data loss
- Performance degradation and increased latency for applications relying on Cassandra
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Cassandra cluster’s overall health and performance using metrics such as node up/down status, CPU usage, and disk usage.
- Investigate the specific node(s) where the hints count changed rapidly using tools like nodetool or the Cassandra GUI.
- Review the Cassandra logs for any error messages or warnings related to hints or node communication.
- Verify that the cluster is properly configured and that there are no network connectivity issues.
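The checks above might translate into commands like the following. This is a sketch under assumptions: the down_nodes helper is hypothetical, the awk filter assumes the usual column layout of nodetool status output (status code first, address second), and the log path is the common location for package installs and may differ on your hosts.

```shell
# down_nodes: read `nodetool status` output on stdin and print the address
# of every node whose status/state code starts with D (Down).
# Hypothetical helper; assumes the status code is column 1 and the address column 2.
down_nodes() {
  awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}

# Typical usage on a Cassandra host (shown as comments, not executed here):
#   nodetool status | down_nodes     # which nodes this host sees as down
#   nodetool statushandoff           # whether hinted handoff is running
#   nodetool tpstats                 # thread pool backlog and dropped messages
#   grep -i hint /var/log/cassandra/system.log | tail -n 50   # assumed log path
```

Running the status check from more than one node helps distinguish a node that is really down from a network partition, because each node reports its own view of the cluster.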
Mitigation #
To mitigate the issue, follow these steps:
- Identify and fix any underlying issues causing nodes to go down or hints to build up, such as:
  - Network connectivity problems
  - Disk space issues
  - High CPU usage
  - Misconfigured Cassandra settings
- Run nodetool repair to ensure data consistency across nodes.
- Consider increasing the phi_convict_threshold value to prevent nodes from being incorrectly marked as down.
- Monitor the cluster closely for any further issues and adjust the cassandra.yaml configuration as needed to prevent similar issues in the future.
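Before changing phi_convict_threshold, it helps to know what is currently configured. The sketch below reads the value from a cassandra.yaml file. The phi_threshold helper is hypothetical, and the configuration path in the usage comments is an assumption that depends on how Cassandra was installed.

```shell
# phi_threshold: print the phi_convict_threshold set in the given
# cassandra.yaml, or "unset" when the key is absent or commented out
# (in which case Cassandra's built-in default applies).
# Hypothetical helper; handles only a simple top-level "key: value" line.
phi_threshold() {
  awk '/^phi_convict_threshold:/ {
         sub(/^[^:]*:[ \t]*/, "")        # drop the key and the colon
         sub(/[ \t]*(#.*)?$/, "")        # drop trailing spaces and comments
         v = $0
       }
       END { print (v == "" ? "unset" : v) }' "$1"
}

# Typical usage on a Cassandra host (shown as comments, not executed here):
#   phi_threshold /etc/cassandra/cassandra.yaml   # assumed config path
#   nodetool repair                               # once all nodes are back up
```

Raising the threshold makes the failure detector slower to mark a node as down, so it trades fewer false positives for slower detection of real outages.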
By following these steps, you should be able to diagnose and mitigate the issue causing the CassandraHintsCount alert.