PrometheusTsdbWalCorruptions #
Prometheus encountered {{ $value }} TSDB WAL corruptions
Alert Rule
alert: PrometheusTsdbWalCorruptions
annotations:
  description: |-
    Prometheus encountered {{ $value }} TSDB WAL corruptions
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheustsdbwalcorruptions/
  summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
for: 0m
labels:
  severity: critical
Meaning #
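The PrometheusTsdbWalCorruptions alert fires when the prometheus_tsdb_wal_corruptions_total counter increases within a one-minute window, which means Prometheus hit a corruption while reading its write-ahead log (WAL). Because the rule uses for: 0m, a single increase is enough to fire it. The WAL records incoming samples on disk before they are compacted into persistent blocks, so that data still held in memory can be recovered after a restart or crash. A corrupted WAL can therefore lead to data loss and inconsistencies in the Prometheus storage.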
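To make the alert condition concrete, the following is a simplified sketch of how a counter increase over a window can be computed. It is not Prometheus's actual increase() implementation, which also extrapolates to the window boundaries; the function names simple_increase and would_fire are hypothetical and exist only for this illustration.

```python
from typing import Sequence


def simple_increase(samples: Sequence[float]) -> float:
    """Sum the growth of a counter across consecutive samples.

    Simplified on purpose: no extrapolation to the window edges,
    unlike the real PromQL increase() function.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop is treated as a counter reset (for example after a
            # Prometheus restart), so the new value counts from zero.
            total += cur
    return total


def would_fire(samples: Sequence[float]) -> bool:
    """Mirror the rule's condition: increase(...) > 0."""
    return simple_increase(samples) > 0


# Counter values scraped during one window (made-up numbers).
window = [3.0, 3.0, 4.0, 4.0]
print(simple_increase(window))  # 1.0
print(would_fire(window))       # True
```

A flat counter such as [5.0, 5.0, 5.0] yields an increase of 0 and would not fire, no matter how many corruptions were counted before the window started.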
Impact #
The alert is rated critical because WAL corruption can result in:
- Data loss: Samples held in corrupted WAL segments may be lost, leaving gaps that reduce the reliability of monitoring and alerting.
- Inconsistent data: Corrupted WAL data can also leave the Prometheus storage in an inconsistent state, making it difficult to trust the accuracy of the data.
Diagnosis #
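To diagnose the issue, follow these steps:
- Check the Prometheus server logs for errors related to WAL corruption. They typically name the affected segment and the read error.
- Verify that the storage path (the --storage.tsdb.path flag) is configured correctly and points to the intended volume.
- Check disk usage on the volume that holds the data directory and ensure that enough free space is available.
- Run promtool tsdb subcommands, for example promtool tsdb list or promtool tsdb analyze, against the data directory to inspect the on-disk blocks. These subcommands report on blocks, so the server logs remain the primary source for identifying corrupted WAL segments.
- Check the prometheus_tsdb_wal_corruptions_total metric to see the total number of corruptions encountered.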
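The log and disk checks above can be sketched as a small script. This is a minimal illustration under stated assumptions: the helper names free_fraction and wal_log_lines are hypothetical, the sample log lines are illustrative rather than captured from a real server, and the keyword match is a rough heuristic rather than a complete list of Prometheus log messages.

```python
import shutil
from typing import Iterable, List


def free_fraction(path: str) -> float:
    """Return the fraction of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def wal_log_lines(lines: Iterable[str]) -> List[str]:
    """Return log lines that mention the WAL together with a corruption or repair."""
    hits = []
    for line in lines:
        lower = line.lower()
        if "wal" in lower and ("corrupt" in lower or "repair" in lower):
            hits.append(line.rstrip("\n"))
    return hits


# Illustrative log lines (assumed format, not real output).
sample_log = [
    'level=info msg="Server is ready to receive web requests."',
    'level=warn msg="Encountered WAL read error, attempting repair"',
]
for hit in wal_log_lines(sample_log):
    print(hit)

# "." is a placeholder; in practice pass the Prometheus data directory.
print(f"free space: {free_fraction('.'):.1%}")
```

In practice the log lines would come from the server's log file or journal, and the path would be the directory configured as the TSDB storage path.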
Mitigation #
To mitigate the issue, follow these steps:
- Stop the Prometheus instance to prevent further data corruption.
- Identify and fix the underlying cause of the corruption, such as disk space issues or configuration errors.
- Restart the Prometheus instance. Prometheus attempts to repair a corrupted WAL on its own while replaying it at startup; promtool tsdb is useful for inspecting the resulting data but has no documented WAL repair subcommand.
- Verify in the logs that the WAL replay completes and that the instance is functioning correctly.
- Monitor the prometheus_tsdb_wal_corruptions_total metric to ensure that no further corruptions occur.
Note: It is essential to act quickly to mitigate this issue to prevent further data loss and inconsistencies.
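The final monitoring step can be sketched as a check against the Prometheus HTTP API instant-query endpoint (/api/v1/query). This is a minimal sketch, assuming the server is reachable at a base URL you supply; the helper names fetch and corrupted_instances and the instance names in the sample response are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List

QUERY = "increase(prometheus_tsdb_wal_corruptions_total[1m])"


def fetch(base_url: str, query: str = QUERY) -> Dict[str, Any]:
    """Run an instant query; base_url is an assumption, e.g. http://localhost:9090."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def corrupted_instances(response: Dict[str, Any]) -> List[str]:
    """Return the instances whose corruption counter grew in the window."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    bad = []
    for series in response["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        if float(raw_value) > 0:
            bad.append(series["metric"].get("instance", "<unknown>"))
    return bad


# Sample response in the documented instant-vector shape (made-up data).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "prom-a:9090"}, "value": [1700000000.0, "1"]},
            {"metric": {"instance": "prom-b:9090"}, "value": [1700000000.0, "0"]},
        ],
    },
}
print(corrupted_instances(sample))  # ['prom-a:9090']
```

Against a live server, corrupted_instances(fetch("http://localhost:9090")) should return an empty list once the corruption has stopped recurring.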