ThanosCompactorHighCompactionFailures #
Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.
Alert Rule
alert: ThanosCompactorHighCompactionFailures
annotations:
description: |-
Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-compactor/thanoscompactorhighcompactionfailures/
summary: Thanos Compactor High Compaction Failures (instance {{ $labels.instance
}})
expr: (sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m]))
/ sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m]))
* 100 > 5)
for: 15m
labels:
severity: warning
Meaning #
The ThanosCompactorHighCompactionFailures alert is triggered when the percentage of failed compactions in a Thanos compactor exceeds 5% over a 5-minute period. This alert indicates that the compactor is experiencing issues with executing compactions, which can lead to data inconsistencies and affect the overall performance of the system.
Impact #
The impact of this alert includes:
- Data inconsistencies: Failed compactions can result in incomplete or corrupted data, leading to inaccuracies in metrics and dashboards.
- Performance degradation: High compaction failure rates can cause the compactor to slow down, leading to increased latency and decreased system performance.
- Alert fatigue: Frequent or persistent compaction failures can lead to alert fatigue, causing teams to ignore or overlook critical issues.
Diagnosis #
To diagnose the root cause of the ThanosCompactorHighCompactionFailures alert, follow these steps:
- Check the Thanos compactor logs for errors or exceptions related to compaction failures.
- Verify that the compactor has sufficient resources (e.g., CPU, memory, and disk space) to execute compactions successfully.
- Investigate any recent changes to the compactor configuration or deployment that may be contributing to the failures.
- Check the status of the underlying storage system to ensure it is healthy and responsive.
Mitigation #
To mitigate the ThanosCompactorHighCompactionFailures alert, follow these steps:
- Investigate and resolve the underlying cause of the compaction failures, as identified during diagnosis.
- Increase the resources allocated to the compactor to ensure it can handle the compaction workload.
- Implement retries or fallback mechanisms to handle temporary failures or exceptions during compaction.
- Consider scaling out the compactor deployment to distribute the compaction workload and improve system resilience.
- Update the alert threshold or notification settings to reduce alert fatigue and ensure that critical issues are addressed promptly.