PrometheusTsdbCompactionsFailed

Prometheus encountered {{ $value }} TSDB compaction failures

Alert Rule

alert: PrometheusTsdbCompactionsFailed
annotations:
  description: |-
    Prometheus encountered {{ $value }} TSDB compactions failures
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheustsdbcompactionsfailed/
  summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
for: 0m
labels:
  severity: critical

Meaning

The PrometheusTsdbCompactionsFailed alert fires when prometheus_tsdb_compactions_failed_total increases within the last minute, meaning Prometheus failed at least one compaction of its time series database (TSDB). Compaction writes the in-memory head block to disk as a persistent block and merges smaller blocks into larger ones, which keeps storage and queries efficient. If compactions keep failing, query performance degrades, disk and memory usage grow, and Prometheus may eventually crash.

Impact

This alert is rated critical because failed compactions can cause:

Performance degradation: An uncompacted TSDB can lead to slower query responses and increased latency.
Storage usage increase: Uncompacted data occupies more disk space, which can lead to storage capacity issues.
Instability: In extreme cases, TSDB compaction failures can cause Prometheus to crash or become unresponsive.

Diagnosis

To diagnose the root cause of the TSDB compaction failures, follow these steps:

Check the Prometheus logs for error messages related to TSDB compactions.
Verify that the file system holding the TSDB data directory (set by --storage.tsdb.path) has enough free space and is not full.
Check for any recent changes or updates to the Prometheus configuration or environment.
Review the TSDB compaction metrics, such as prometheus_tsdb_compactions_failed_total, to identify any patterns or trends.
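
The metric review in the last step can be scripted against the Prometheus HTTP API. The following is a minimal Python sketch using only the standard library. It assumes Prometheus is reachable at http://localhost:9090; that address, the one-hour window, and the function names are illustrative choices, not part of the alert rule.

```python
import json
import urllib.parse
import urllib.request

# Assumed address; replace with the URL of the affected Prometheus instance.
PROMETHEUS_URL = "http://localhost:9090"

# Same metric as the alert rule, over a wider (illustrative) one-hour window.
QUERY = "increase(prometheus_tsdb_compactions_failed_total[1h])"


def build_query_url(base_url, promql):
    """Return the instant-query URL of the Prometheus HTTP API for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def failed_compactions(response):
    """Map each instance label to its sample value from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("query failed: " + str(response.get("error")))
    results = {}
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # Each instant-vector sample is [unix_timestamp, "value as string"].
        results[instance] = float(series["value"][1])
    return results


def report(base_url=PROMETHEUS_URL, promql=QUERY):
    """Query Prometheus and print the failed-compaction increase per instance."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        payload = json.load(resp)
    for instance, count in sorted(failed_compactions(payload).items()):
        print(f"{instance}: {count:g} failed compactions in the last hour")
```

Calling report() prints one line per instance. A value that keeps rising across repeated runs suggests a persistent cause, such as a full disk, rather than a one-off failure.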

Mitigation

To mitigate the TSDB compaction failures, follow these steps:

Check the Prometheus configuration to ensure that the TSDB compaction settings are correct and sufficient for the current data volume.
Verify that the Prometheus instance has sufficient resources (CPU, memory, and disk space) to perform compactions efficiently.
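If disk space is the constraint, consider lowering --storage.tsdb.retention.time or setting --storage.tsdb.retention.size so that old blocks are deleted sooner. The storage.local.* options sometimes suggested for this belong to the Prometheus 1.x storage engine and are not recognized by Prometheus 2.x and later.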
If the issue persists, consider restarting the Prometheus instance to recover from any potential internal state issues.
If none of the above steps resolve the issue, consider seeking assistance from the Prometheus community or a qualified administrator.
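
The disk part of the resource check above can be sketched in Python. Compaction writes a new block before the blocks it replaces are deleted, so the data directory needs free headroom. The path /var/lib/prometheus and the 10% threshold below are assumptions for illustration, not Prometheus defaults; use the directory configured with --storage.tsdb.path.

```python
import shutil

# Assumed location of --storage.tsdb.path; adjust to the actual data directory.
DATA_DIR = "/var/lib/prometheus"

# Illustrative threshold, not a Prometheus default.
MIN_FREE_RATIO = 0.10


def free_ratio(total_bytes, free_bytes):
    """Return the fraction of the file system that is still free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def has_headroom(total_bytes, free_bytes, min_free_ratio=MIN_FREE_RATIO):
    """Return True when the free fraction meets the chosen minimum."""
    return free_ratio(total_bytes, free_bytes) >= min_free_ratio


def check_data_dir(path=DATA_DIR):
    """Print free space for the file system holding `path` and return the verdict."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    ok = has_headroom(usage.total, usage.free)
    free_gib = usage.free / 2**30
    total_gib = usage.total / 2**30
    print(f"{path}: {free_gib:.1f} GiB free of {total_gib:.1f} GiB ({'ok' if ok else 'low'})")
    return ok
```

Running check_data_dir() on the Prometheus host reports whether the data directory's file system meets the chosen threshold. The threshold itself is a judgment call that depends on block sizes and retention.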

Once the cause is found, apply the fix to the Prometheus configuration and environment so that the compaction failures do not recur.