PrometheusTsdbCheckpointCreationFailures #

Prometheus encountered {{ $value }} checkpoint creation failures

Alert Rule

alert: PrometheusTsdbCheckpointCreationFailures
annotations:
  description: |-
    Prometheus encountered {{ $value }} checkpoint creation failures
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheustsdbcheckpointcreationfailures/
  summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance
    }})
expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
for: 0m
labels:
  severity: critical
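
The rule's expression can also be evaluated ad hoc through the Prometheus HTTP API, which helps confirm which instances are affected. The following is a minimal sketch, not part of the original rule: the helper names (`build_query_url`, `parse_failures`, `fetch_failures`) and the server address are assumptions for illustration, and the `> 0` filter is dropped so that healthy instances are listed too.

```python
import json
import urllib.parse
import urllib.request

# The alert expression without the "> 0" filter, so zero values are returned too.
QUERY = "increase(prometheus_tsdb_checkpoint_creations_failed_total[1m])"


def build_query_url(base_url: str, query: str = QUERY) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_failures(payload: dict) -> dict:
    """Map each instance label to its value from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    failures = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "<no instance label>")
        # Each vector sample carries [unix_timestamp, "value-as-string"].
        failures[instance] = float(sample["value"][1])
    return failures


def fetch_failures(base_url: str, timeout: float = 10.0) -> dict:
    """Run the query against a live server; base_url is supplied by the caller."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=timeout) as resp:
        return parse_failures(json.load(resp))
```

Calling `fetch_failures("http://localhost:9090")` (a hypothetical address) would return a dictionary keyed by instance; any value above zero corresponds to a firing alert for that instance.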


Meaning #

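The PrometheusTsdbCheckpointCreationFailures alert fires when the prometheus_tsdb_checkpoint_creations_failed_total counter increases, meaning Prometheus failed to create a TSDB checkpoint within the last minute. Prometheus creates a checkpoint when it truncates the write-ahead log (WAL): the checkpoint keeps the records from the removed segments that are still needed to rebuild in-memory data after a restart. When checkpoint creation fails, the WAL is not truncated and keeps growing, and the underlying cause (often a disk or storage problem) can also lead to data loss or corruption.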

Impact #

The impact of this alert is critical, as it can result in:

Data loss or corruption
Incomplete or inaccurate metrics data
Downtime or slow performance of Prometheus and dependent services
Inability to recover data in case of a failure or restart

Diagnosis #

To diagnose the issue, follow these steps:

Check the Prometheus logs for errors related to TSDB checkpoint creation.
Verify that the volume holding the Prometheus data directory has enough free disk space.
Check the underlying storage system for any issues or errors.
Investigate any recent changes to the Prometheus configuration or deployment.
Check the value of the prometheus_tsdb_checkpoint_creations_failed_total metric to see how many checkpoint creations have failed.
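
The disk-space and storage checks above can be partly scripted. The sketch below is illustrative only: the function name `inspect_wal` is hypothetical, and it assumes the usual on-disk layout in which the data directory contains a `wal` subdirectory with numerically named segments, `checkpoint.NNNNNNNN` directories, and `.tmp`-suffixed directories left behind by an interrupted checkpoint. Verify that layout against your Prometheus version before relying on it.

```python
import os
import shutil


def inspect_wal(data_dir: str) -> dict:
    """Summarize free disk space and checkpoint directories under <data_dir>/wal."""
    usage = shutil.disk_usage(data_dir)
    wal_dir = os.path.join(data_dir, "wal")
    entries = sorted(os.listdir(wal_dir)) if os.path.isdir(wal_dir) else []

    # Assumed naming: completed checkpoints are "checkpoint.<number>",
    # in-progress or abandoned ones carry an extra ".tmp" suffix.
    checkpoints = [
        e for e in entries if e.startswith("checkpoint.") and not e.endswith(".tmp")
    ]
    leftover_tmp = [
        e for e in entries if e.startswith("checkpoint.") and e.endswith(".tmp")
    ]
    # Assumed naming: WAL segments are purely numeric file names.
    segments = [e for e in entries if e.isdigit()]

    return {
        "free_fraction": usage.free / usage.total,
        "wal_segments": len(segments),
        "checkpoints": checkpoints,
        "leftover_tmp": leftover_tmp,
    }
```

A low `free_fraction`, a steadily growing `wal_segments` count, or entries in `leftover_tmp` would each be consistent with failing checkpoints, although none of them proves the cause on its own; the Prometheus logs remain the primary source.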

Mitigation #

To mitigate the issue, follow these steps:

Check the Prometheus startup flags and ensure that the --storage.tsdb.retention.time setting is correct and not too aggressive for the available disk capacity.
Increase the disk space available to Prometheus if it’s running low.
Investigate and resolve any underlying storage system issues.
Restart the Prometheus service to attempt to recreate the checkpoints.
If the issue persists, consider rolling back recent changes to the Prometheus configuration or deployment.
If none of the above steps resolve the issue, seek assistance from a Prometheus expert or the development team.
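
For the retention check in the first step, the flags a running server was started with can be read from the `/api/v1/status/flags` endpoint of the HTTP API. The following is a minimal sketch under stated assumptions: the helper names and the server address are hypothetical, and only the storage-related flags listed in `RETENTION_FLAGS` are extracted.

```python
import json
import urllib.request

# Storage-related flags worth reviewing; names appear without leading dashes
# in the API response.
RETENTION_FLAGS = (
    "storage.tsdb.path",
    "storage.tsdb.retention.time",
    "storage.tsdb.retention.size",
)


def retention_settings(payload: dict) -> dict:
    """Pick the storage-related flags out of a /api/v1/status/flags response."""
    if payload.get("status") != "success":
        raise ValueError(f"request failed: {payload.get('error', 'unknown error')}")
    flags = payload["data"]
    return {name: flags[name] for name in RETENTION_FLAGS if name in flags}


def fetch_retention_settings(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch the flags from a live server; base_url is supplied by the caller."""
    url = base_url.rstrip("/") + "/api/v1/status/flags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return retention_settings(json.load(resp))
```

Calling `fetch_retention_settings("http://localhost:9090")` (a hypothetical address) would return the effective values, which can then be compared against the capacity of the volume behind `storage.tsdb.path`.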