PrometheusTsdbCheckpointCreationFailures #
Prometheus encountered {{ $value }} checkpoint creation failures
Alert Rule
alert: PrometheusTsdbCheckpointCreationFailures
annotations:
  description: |-
    Prometheus encountered {{ $value }} checkpoint creation failures
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheustsdbcheckpointcreationfailures/
  summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
for: 0m
labels:
  severity: critical
Meaning #
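The PrometheusTsdbCheckpointCreationFailures alert fires when the prometheus_tsdb_checkpoint_creations_failed_total counter increases within a one-minute window, that is, when Prometheus fails to create a TSDB checkpoint. Prometheus writes a checkpoint when it truncates the write-ahead log (WAL): the checkpoint keeps the records from older WAL segments that are still needed, so those segments can be deleted. Checkpoints therefore matter for data durability and recovery. Repeated failures usually let old WAL segments accumulate and often point to an underlying disk or filesystem problem, which can in turn lead to data loss or corruption.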
Impact #
The impact of this alert is critical, as it can result in:
- Data loss or corruption
- Incomplete or inaccurate metrics data
- Downtime or slow performance of Prometheus and dependent services
- Inability to recover data in case of a failure or restart
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Prometheus logs for errors related to TSDB checkpoint creation.
- Verify that the volume holding the Prometheus data directory has enough free disk space.
- Check the underlying storage system for any issues or errors.
- Investigate any recent changes to the Prometheus configuration or deployment.
- Check the value of the prometheus_tsdb_checkpoint_creations_failed_total metric to understand the extent of the issue.
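The disk-space and storage checks above can be sketched as a small script run on the Prometheus host. This is only an illustration under stated assumptions: the data directory `/prometheus` and the 10% free-space threshold are placeholders, and the WAL layout (numeric segment files plus `checkpoint.N` directories under `wal/`, with a `.tmp` suffix on unfinished checkpoints) is assumed from the layout described in the Prometheus storage documentation.

```python
import os
import re
import shutil

# Assumption: Prometheus data directory; set this to your --storage.tsdb.path.
DATA_DIR = "/prometheus"
# Assumption: illustrative threshold, not a Prometheus default.
MIN_FREE_FRACTION = 0.10

# Assumed naming: "checkpoint.<number>", optionally followed by ".tmp".
CHECKPOINT_RE = re.compile(r"^checkpoint\.(\d+)(\.tmp)?$")


def free_fraction(path):
    """Return the free fraction of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def summarize_wal(wal_dir):
    """Count WAL segment files and list checkpoint directories in `wal_dir`."""
    summary = {"segments": 0, "checkpoints": [], "leftover_tmp": []}
    for name in sorted(os.listdir(wal_dir)):
        full = os.path.join(wal_dir, name)
        match = CHECKPOINT_RE.match(name)
        if match and os.path.isdir(full):
            if match.group(2):
                # A ".tmp" directory suggests a checkpoint that was not finalized.
                summary["leftover_tmp"].append(name)
            else:
                summary["checkpoints"].append(name)
        elif name.isdigit() and os.path.isfile(full):
            summary["segments"] += 1
    return summary


if __name__ == "__main__":
    wal_dir = os.path.join(DATA_DIR, "wal")
    if not os.path.isdir(wal_dir):
        print("no WAL directory found at", wal_dir)
    else:
        free = free_fraction(DATA_DIR)
        print("free disk fraction: %.2f" % free)
        if free < MIN_FREE_FRACTION:
            print("WARNING: free space is below the assumed threshold")
        print(summarize_wal(wal_dir))
```

A segment count that keeps growing, or leftover `.tmp` checkpoint directories, would be consistent with checkpoints failing to complete; the Prometheus logs remain the authoritative source for the actual error.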
Mitigation #
To mitigate the issue, follow these steps:
- Check the Prometheus configuration and ensure that the retention setting (the --storage.tsdb.retention.time flag) is correct and not too aggressive.
- Increase the disk space available to Prometheus if it’s running low.
- Investigate and resolve any underlying storage system issues.
- Restart the Prometheus service to attempt to recreate the checkpoints.
- If the issue persists, consider rolling back recent changes to the Prometheus configuration or deployment.
- If none of the above steps resolve the issue, seek assistance from a Prometheus expert or the development team.
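After applying a mitigation, it helps to confirm that the failure counter has stopped increasing. The sketch below queries the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) with the alert expression and reports the instances that still show failures. The function names and the base URL used in the usage note are assumptions for illustration; the script is not part of the original runbook.

```python
import json
import urllib.parse
import urllib.request

# The alert expression without the "> 0" filter, so healthy series are returned too.
EXPR = "increase(prometheus_tsdb_checkpoint_creations_failed_total[1m])"


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def failing_instances(payload):
    """Map instance label to increase, for series whose value is above zero."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    failing = {}
    for series in payload["data"]["result"]:
        # Instant-vector samples have the form [timestamp, "value-as-string"].
        value = float(series["value"][1])
        if value > 0:
            instance = series["metric"].get("instance", "<unknown>")
            failing[instance] = value
    return failing


def check_failures(base_url):
    """Query Prometheus and return the instances still reporting failures."""
    url = build_query_url(base_url, EXPR)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return failing_instances(json.load(resp))
```

For example, `check_failures("http://localhost:9090")` (assuming Prometheus listens on that address) returns an empty dictionary once no instance has recorded a new checkpoint failure in the last minute.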