ThanosRuleRuleEvaluationLatencyHigh #
Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.
Alert Rule
alert: ThanosRuleRuleEvaluationLatencyHigh
annotations:
  description: |-
    Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-ruler/thanosruleruleevaluationlatencyhigh/
  summary: Thanos Rule Rule Evaluation Latency High (instance {{ $labels.instance }})
expr: (sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})
  > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"}))
for: 5m
labels:
  severity: warning
The following runbook covers the Prometheus alert rule ThanosRuleRuleEvaluationLatencyHigh.
Meaning #
The ThanosRuleRuleEvaluationLatencyHigh alert fires when the last evaluation of a Thanos Rule rule group (prometheus_rule_group_last_duration_seconds) has taken longer than the group’s configured interval (prometheus_rule_group_interval_seconds) for 5 minutes. The group cannot finish one evaluation before the next one is due, so its alerting and recording rules are evaluated later than scheduled.
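As a rough sketch, the same comparison can be written as a ratio, which shows how far over its interval each group is running. It reuses the metric names and the job matcher from the alert expression above; the matcher may need adjusting for your job names.

```promql
# Ratio of last evaluation duration to configured interval, per rule group.
# Values above 1 correspond to the alert condition.
  sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})
/
  sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
```

A value of 2, for example, would mean the group takes about twice its interval to evaluate.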
Impact #
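When a rule group cannot finish within its interval, the consequences are:
- Delayed alerting and notification, because alert conditions are checked later than scheduled
- Stale recording-rule results, because new samples are written less often than the interval implies
- Potential for missed alerts or notifications for short-lived conditions, if scheduled evaluations are skipped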
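To check whether evaluations are actually being skipped, a query along these lines may help. It assumes the Thanos Rule instance exposes the prometheus_rule_group_iterations_missed_total counter with a rule_group label, as the Prometheus rule manager does; the 15m window is an arbitrary choice.

```promql
# Scheduled evaluations skipped per rule group over the last 15 minutes.
# Assumption: this counter is exposed by the Thanos Rule instance.
sum by (job, instance, rule_group) (
  increase(prometheus_rule_group_iterations_missed_total{job=~".*thanos-rule.*"}[15m])
) > 0
```

An empty result suggests that evaluations are late but not yet being dropped.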
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Thanos rule group’s configuration and ensure that the interval is set correctly.
- Verify that the rule group’s evaluation latency is not excessively high by checking the prometheus_rule_group_last_duration_seconds metric.
- Investigate any recent changes to the rule group’s configuration or the underlying system that may have caused the evaluation latency to increase.
- Check the resource utilization (e.g., CPU, memory) of the Thanos Rule instance and of the Thanos Query endpoints it evaluates rules against, to ensure that neither is overloaded.
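The checks above can be sketched as the following queries, run one at a time. They reuse the job matcher from the alert expression; process_cpu_seconds_total and process_resident_memory_bytes are the standard process metrics and are assumed to be scraped from the Thanos Rule instance under the same job label.

```promql
# Ten slowest rule groups by last evaluation duration, in seconds.
topk(10, prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"})

# CPU cores used by each Thanos Rule process, averaged over 5 minutes.
rate(process_cpu_seconds_total{job=~".*thanos-rule.*"}[5m])

# Resident memory of each Thanos Rule process, in bytes.
process_resident_memory_bytes{job=~".*thanos-rule.*"}
```

If only one or two groups stand out in the first query, the rules in those groups are the likely cause; if all groups are slow, the instance or its query backend is the more likely bottleneck.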
Mitigation #
To mitigate the issue, follow these steps:
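- Increase the rule group’s interval so that an evaluation can finish before the next one is due. Note that this makes the group’s alerting and recording rules update less often.
- Optimize the rule group’s configuration to reduce the evaluation latency (e.g., simplify expensive rule expressions, reduce the number of rules, or split a large group into several smaller ones).
- Scale up Thanos Rule or the Thanos Query endpoints it queries to increase capacity and reduce the evaluation latency.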
- Consider implementing a caching mechanism to reduce the load on the system and improve evaluation latency.