CassandraTombstoneDump #

Too many tombstones scanned in queries

Alert Rule

alert: CassandraTombstoneDump
annotations:
  description: |-
    Too much tombstones scanned in queries
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/criteo-cassandra-exporter/cassandratombstonedump/
  summary: Cassandra tombstone dump (instance {{ $labels.instance }})
expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"}
  > 1000
for: 0m
labels:
  severity: critical

Here is a sample runbook for the Prometheus alert rule “CassandraTombstoneDump”:

Meaning #

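This alert fires when the 99th percentile of tombstones scanned per read query exceeds 1000. Because the rule uses for: 0m, it fires on the first evaluation that crosses the threshold. Tombstones are the markers Cassandra writes in place of deleted or TTL-expired data; they stay on disk until gc_grace_seconds has elapsed and compaction removes them, and every read has to scan past them to reach live data. A high count usually points to a data model or query pattern that deletes heavily or reads across ranges containing many deleted rows.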

Impact #

Increased latency and slower query performance due to excessive tombstone scanning
Potential for query timeouts or failures
Impact on overall Cassandra cluster performance and stability
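
The query failures mentioned above can be made concrete. Cassandra logs a warning when a single read scans more tombstones than tombstone_warn_threshold and aborts the read when it scans more than tombstone_failure_threshold. The values below (1000 and 100000) are the commonly shipped cassandra.yaml defaults and are assumed here, so check your own cluster's configuration. The function name and return labels are hypothetical; this is a minimal sketch of the decision, not Cassandra's actual code:

```python
# Assumed defaults from cassandra.yaml; a cluster may override them.
TOMBSTONE_WARN_THRESHOLD = 1_000
TOMBSTONE_FAILURE_THRESHOLD = 100_000


def classify_read(tombstones_scanned: int,
                  warn: int = TOMBSTONE_WARN_THRESHOLD,
                  fail: int = TOMBSTONE_FAILURE_THRESHOLD) -> str:
    """Return how a single read would be treated (hypothetical labels)."""
    if tombstones_scanned > fail:
        return "aborted"  # the read fails instead of returning rows
    if tombstones_scanned > warn:
        return "warned"   # the read succeeds but a warning is logged
    return "ok"
```

Under these assumed defaults, the alert threshold of 1000 matches the warning threshold, so the alert fires while reads still succeed and before they start being aborted.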

Diagnosis #

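Check the Cassandra server metrics to identify which table or tables are affected; nodetool tablestats reports the average and maximum tombstones per slice for each table.
Investigate recent changes to the data model or query patterns that may be contributing, such as bulk deletes, short TTLs, writes of null values, or range scans over partitions with many deleted rows.
Review the tombstone-related configuration: tombstone_warn_threshold and tombstone_failure_threshold in cassandra.yaml, and gc_grace_seconds and the compaction strategy on the affected tables.
Rule out data consistency or corruption problems (for example failed repairs or failed compactions) that could be contributing to the high tombstone count.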
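
The first diagnosis step can be sketched as a small script that asks Prometheus which series are currently over the alert threshold. It uses the standard Prometheus HTTP API instant-query endpoint (/api/v1/query) and the metric selector from the alert rule. The function names and the base URL in the usage note are assumptions for illustration, and the labels returned depend on how your exporter is configured:

```python
import json
import urllib.parse
import urllib.request

# Metric selector copied from the alert rule.
QUERY = ('cassandra_stats{name="org:apache:cassandra:metrics:table:'
         'tombstonescannedhistogram:99thpercentile"}')


def series_over_threshold(payload: dict, threshold: float = 1000.0) -> list:
    """Return (labels, value) pairs above the threshold, highest first."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    hits = []
    for sample in payload["data"]["result"]:
        # Instant-vector samples look like {"metric": {...}, "value": [ts, "123"]}.
        value = float(sample["value"][1])
        if value > threshold:
            hits.append((sample["metric"], value))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def fetch(base_url: str) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": QUERY})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

A call such as series_over_threshold(fetch("http://localhost:9090")) would list the offending instances, assuming Prometheus is reachable at that address.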

Mitigation #

Immediately investigate and address any data model or query pattern changes that may be contributing to the issue.
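Tune the tombstone-related settings with care. Raising tombstone_warn_threshold or tombstone_failure_threshold in cassandra.yaml only changes when reads warn or abort; it does not reduce the number of tombstones scanned. Lowering gc_grace_seconds on the affected table lets compaction purge tombstones sooner, but repairs must then complete within that window or deleted data can reappear.
Run compaction on the affected tables to purge tombstones that are past gc_grace_seconds, for example with nodetool compact or, on versions that provide it, nodetool garbagecollect.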
Monitor Cassandra server metrics closely to ensure the issue is resolved and performance returns to normal.
If the issue persists, consider escalating to Cassandra experts or seeking additional support.
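
When planning a compaction, it helps to know whether the tombstones in question are old enough to be purged at all. The sketch below assumes the default gc_grace_seconds of 864000 (10 days); the function names are hypothetical. Passing this check is necessary but not sufficient: compaction also has to include the SSTables that hold the data the tombstone shadows.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default: 10 days


def purgeable_after(deleted_at: datetime,
                    gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> datetime:
    """Time after which compaction is allowed to drop the tombstone."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


def is_purgeable(deleted_at: datetime, now: datetime,
                 gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """True once the grace period for this tombstone has fully elapsed."""
    return now > purgeable_after(deleted_at, gc_grace_seconds)
```

For example, a row deleted on 1 January with the default setting cannot be purged before 11 January, so a compaction run earlier than that will not lower the tombstone count for that row.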