CassandraTombstoneDump #
Too many tombstones scanned in queries
Alert Rule
alert: CassandraTombstoneDump
annotations:
  description: |-
    Too much tombstones scanned in queries
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/criteo-cassandra-exporter/cassandratombstonedump/
  summary: Cassandra tombstone dump (instance {{ $labels.instance }})
expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000
for: 0m
labels:
  severity: critical
Meaning #
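This alert fires when the 99th percentile of tombstones scanned per read query in Cassandra exceeds 1000. Because the rule uses `for: 0m`, it fires on the first evaluation that crosses the threshold. Tombstones are the markers Cassandra writes in place of deleted data. They stay on disk until compaction purges them, which can only happen after the table's `gc_grace_seconds` has elapsed, and every read that covers them has to scan past them. A high scan count therefore usually points to a delete-heavy data model or to queries that range over large amounts of deleted data.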
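To make the tombstone lifetime concrete, the following minimal sketch computes the earliest time at which compaction may drop a tombstone. It is a simplification: it assumes the only constraint is `gc_grace_seconds` (default 864000 seconds, i.e. 10 days) and ignores other conditions such as overlapping SSTables. The function names and the sample date are hypothetical and chosen for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Cassandra's default table-level gc_grace_seconds (10 days).
DEFAULT_GC_GRACE_SECONDS = 864_000


def earliest_purge_time(deleted_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Earliest moment at which a compaction may drop this tombstone."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def must_still_be_kept(deleted_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True while the tombstone is inside its grace period and cannot be purged."""
    return now < earliest_purge_time(deleted_at, gc_grace_seconds)


# Hypothetical delete issued on 1 January 2024 (UTC).
deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(earliest_purge_time(deleted_at))
```

Until that time has passed and a compaction has actually rewritten the SSTable, reads over the deleted range keep scanning the tombstone, which is what the alert's histogram measures.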
Impact #
- Increased latency and slower query performance due to excessive tombstone scanning
- Potential for query timeouts or failures
- Impact on overall Cassandra cluster performance and stability
Diagnosis #
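- Identify the affected table or tables: run `nodetool tablestats` and compare the "Average tombstones per slice" and "Maximum tombstones per slice" values across tables, and search the Cassandra system log for tombstone warnings, which name the query that crossed the warning threshold.
- Investigate recent changes to the data model or query patterns, such as new delete-heavy workloads, TTL changes, or wider range scans.
- Review the affected table's `gc_grace_seconds` and compaction settings, and confirm that compaction is keeping up (`nodetool compactionstats`).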
- Verify that there are no issues with data consistency or data corruption that could be contributing to the high tombstone count.
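As a starting point for the first diagnosis step, the sketch below queries the Prometheus HTTP API and ranks the series that exceed the alert threshold. It assumes that the exporter embeds keyspace and table in the metric `name` label for per-table series, which is why the query uses a regex match; verify this against your own exporter output. The function names and the Prometheus URL in the usage comment are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Regex variant of the alert's metric selector. Assumption: per-table series
# carry keyspace and table inside the "name" label.
QUERY = (
    'cassandra_stats{name=~"org:apache:cassandra:metrics:table:'
    '.*tombstonescannedhistogram:99thpercentile"}'
)


def fetch_instant_query(prom_url, query=QUERY, timeout=10):
    """Run an instant query against the Prometheus HTTP API and return the JSON."""
    params = urllib.parse.urlencode({"query": query})
    url = prom_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def rank_offenders(payload, threshold=1000.0):
    """Return (value, instance, name) tuples above the threshold, worst first."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    offenders = []
    for series in payload["data"]["result"]:
        # Instant vectors encode each sample as [timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value > threshold:
            labels = series["metric"]
            offenders.append(
                (value, labels.get("instance", "?"), labels.get("name", "?"))
            )
    return sorted(offenders, reverse=True)


# Usage (hypothetical Prometheus address):
#   for value, instance, name in rank_offenders(
#           fetch_instant_query("http://prometheus.example:9090")):
#       print(f"{value:>10.0f}  {instance}  {name}")
```

Keeping the ranking separate from the HTTP call makes it possible to run the same logic on a saved API response during a post-incident review.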
Mitigation #
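- Fix the data model or query patterns that generate tombstones: avoid queue-like tables with frequent deletes, avoid writing explicit nulls (each null is stored as a tombstone), and narrow range queries so that they do not scan over deleted rows.
- Do not rely on raising the tombstone thresholds: `tombstone_warn_threshold` and `tombstone_failure_threshold` in cassandra.yaml only control when Cassandra logs a warning or aborts a read, and do not reduce the number of tombstones scanned.
- Help compaction purge tombstones sooner: tune the table's compaction options (for example `tombstone_threshold` or `unchecked_tombstone_compaction`), lower `gc_grace_seconds` only if repairs reliably complete within that window (otherwise deleted data can reappear), or run a manual compaction with `nodetool compact` on the affected table.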
- Monitor Cassandra server metrics closely to ensure the issue is resolved and performance returns to normal.
- If the issue persists, consider escalating to Cassandra experts or seeking additional support.
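The table-level tuning mentioned above can be sketched as a small helper that builds the corresponding CQL `ALTER TABLE` statement. This is an illustration, not a recommended configuration: the helper name, the `shop.orders` table in the usage note, and the chosen values are hypothetical. Note that `ALTER TABLE ... WITH compaction` replaces the whole compaction map, so `compaction_class` must match the strategy the table already uses unless you intend to change it.

```python
import re

# Conservative check for unquoted CQL identifiers.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def tombstone_tuning_cql(
    keyspace,
    table,
    gc_grace_seconds,
    compaction_class="SizeTieredCompactionStrategy",
    tombstone_threshold=0.2,  # Cassandra's default ratio
    unchecked=False,
):
    """Build an ALTER TABLE statement for tombstone-related table options."""
    for ident in (keyspace, table, compaction_class):
        if not _IDENTIFIER.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    compaction = {
        "class": compaction_class,
        "tombstone_threshold": str(tombstone_threshold),
        "unchecked_tombstone_compaction": str(unchecked).lower(),
    }
    # CQL map literal: {'key': 'value', ...}
    options = ", ".join(f"'{k}': '{v}'" for k, v in compaction.items())
    return (
        f"ALTER TABLE {keyspace}.{table} "
        f"WITH gc_grace_seconds = {int(gc_grace_seconds)} "
        f"AND compaction = {{{options}}};"
    )
```

For example, `tombstone_tuning_cql("shop", "orders", 86400, unchecked=True)` produces a statement that sets a one-day grace period and allows single-SSTable tombstone compactions without the overlap check. Review the statement and run it through `cqlsh` only after confirming that repairs complete within the new grace period.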