PrometheusTsdbCheckpointDeletionFailures #

Prometheus encountered {{ $value }} checkpoint deletion failures

Alert Rule
alert: PrometheusTsdbCheckpointDeletionFailures
annotations:
  description: |-
    Prometheus encountered {{ $value }} checkpoint deletion failures
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheustsdbcheckpointdeletionfailures/
  summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
for: 0m
labels:
  severity: critical
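
The expression fires as soon as the failure counter increases within a one-minute window, and for: 0m makes the alert fire immediately. To see how often deletions have actually been failing on an instance, an ad hoc query such as the following (an illustrative check, not part of the rule) can be run in the expression browser:

  # Total deletion failures over the last day, per instance
  increase(prometheus_tsdb_checkpoint_deletions_failed_total[1d])

  # Compare against attempted deletions to gauge the failure ratio
  increase(prometheus_tsdb_checkpoint_deletions_total[1d])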

Meaning #

The PrometheusTsdbCheckpointDeletionFailures alert is triggered when Prometheus fails to delete old checkpoint directories from its TSDB (Time Series Database) storage. A checkpoint is written inside the WAL (Write-Ahead Log) directory whenever the WAL is truncated; it preserves the series records and samples that the in-memory head block still needs. Once a new checkpoint has been written, the previous checkpoint and the truncated WAL segments are deleted. If that deletion fails, stale checkpoint data accumulates on disk, so cleaning it up is crucial to prevent disk space exhaustion and to keep the Prometheus instance healthy.
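
A minimal sketch for inspecting the checkpoints on disk, assuming the TSDB data directory is /prometheus (set by --storage.tsdb.path; the path is an assumption, adjust it to your deployment):

  # Checkpoints live inside the WAL directory as checkpoint.NNNNNNNN subdirectories;
  # several of them piling up is a symptom of failed deletions
  ls -ld /prometheus/wal/checkpoint.*

  # Total size of the WAL directory, including leftover checkpoints
  du -sh /prometheus/wal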

Impact #

If this alert is not addressed, it can lead to:

  • Disk space exhaustion: stale checkpoint directories accumulate until Prometheus runs out of storage capacity, degrading performance and potentially crashing the instance (queries to watch the remaining headroom are sketched after this list).
  • Inconsistent data: repeated deletion failures often point to underlying storage problems (permissions, I/O errors) that can also affect the TSDB, leading to incorrect query results or even data loss.
  • Increased memory usage: the instance may retain more WAL and checkpoint state than necessary, further exacerbating performance issues.
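
The queries below are a minimal sketch for tracking the blast radius. They assume node_exporter is running on the Prometheus host and that the data directory is mounted at /prometheus; the mountpoint label is an assumption to adapt to your environment.

  # Rate of checkpoint deletion failures per instance
  rate(prometheus_tsdb_checkpoint_deletions_failed_total[5m])

  # On-disk size of the WAL (including checkpoints) as reported by Prometheus itself
  prometheus_tsdb_wal_storage_size_bytes

  # Free space on the data volume (node_exporter metric; mountpoint is an assumption)
  node_filesystem_avail_bytes{mountpoint="/prometheus"}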

Diagnosis #

To diagnose the issue, follow these steps:

  1. Check the Prometheus server’s disk usage and the free space remaining on the TSDB data volume (a command sketch for steps 1–3 follows this list).
  2. Verify that the TSDB is functioning correctly and that the WAL is being truncated and checkpointed without errors.
  3. Review the Prometheus server logs for errors or warnings related to checkpoint deletion, such as permission or I/O failures.
  4. Check the instance label on the alert to identify the specific Prometheus instance affected.
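
A minimal diagnostic sketch, assuming a systemd-managed Prometheus with its data directory at /prometheus; the path and unit name are assumptions, and on Kubernetes you would use kubectl logs against the Prometheus pod instead:

  # Step 1: disk usage on the data volume and size of accumulated checkpoints
  df -h /prometheus
  du -sh /prometheus/wal /prometheus/wal/checkpoint.* 2>/dev/null

  # Step 2: WAL and checkpoint health (PromQL, run in the expression browser)
  #   rate(prometheus_tsdb_wal_truncations_failed_total[15m])
  #   rate(prometheus_tsdb_checkpoint_creations_failed_total[15m])

  # Step 3: recent log lines mentioning checkpoints
  journalctl -u prometheus --since "1 hour ago" | grep -i checkpoint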

Mitigation #

To mitigate the issue, follow these steps:

  1. Immediately free up disk space by deleting unnecessary files outside the TSDB data directory (never hand-delete live TSDB blocks).
  2. Restart the Prometheus server so it can recover and retry the checkpoint deletion on the next WAL truncation (see the sketch after this list).
  3. Verify that the TSDB is functioning correctly and that the WAL is being truncated and checkpointed without errors.
  4. Implement a regular maintenance schedule so that disk space is monitored and cleaned up and the TSDB stays healthy.
  5. Consider increasing the storage capacity or moving the data directory to a more robust storage solution to prevent future disk space exhaustion.
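
A minimal mitigation sketch, assuming a systemd-managed Prometheus; the unit name, namespace, and StatefulSet name in the Kubernetes variant are assumptions:

  # Restart Prometheus so the next WAL truncation retries the checkpoint deletion
  systemctl restart prometheus
  # Kubernetes variant (names are assumptions):
  # kubectl -n monitoring rollout restart statefulset prometheus

  # Afterwards, confirm the failure counter has stopped increasing (PromQL):
  #   increase(prometheus_tsdb_checkpoint_deletions_failed_total[10m]) == 0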