PrometheusRuleEvaluationFailures #
Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.
Alert Rule
alert: PrometheusRuleEvaluationFailures
annotations:
description: |-
Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheusruleevaluationfailures/
summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
for: 0m
labels:
severity: critical
Here is the runbook for the PrometheusRuleEvaluationFailures alert:
Meaning #
The PrometheusRuleEvaluationFailures alert is triggered when there is an increase in the total number of Prometheus rule evaluation failures within a 3-minute window. This means that Prometheus is experiencing issues evaluating rules, which can lead to missed or delayed alerts.
Impact #
The impact of this alert is critical, as it can result in:
- Missed alerts: Rules that are not evaluated may not trigger alerts, leading to potential issues going undetected.
- Delayed alerts: Even if rules are eventually evaluated, the delay can cause alerts to be triggered late, reducing the effectiveness of monitoring and incident response.
- Lack of visibility: Unevaluated rules can lead to a lack of visibility into the system’s state, making it harder to identify and respond to issues.
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Prometheus logs for errors related to rule evaluation.
- Verify that the Prometheus instance is running correctly and not experiencing high CPU usage or memory issues.
- Review the rules that are failing to evaluate and check for any syntax errors or conflicts with other rules.
- Check the Prometheus configuration file for any issues or typos that could be causing rule evaluation failures.
Mitigation #
To mitigate the issue, follow these steps:
- Restart the Prometheus instance to clear any temporary issues.
- Verify that the Prometheus configuration file is correct and up-to-date.
- Review and correct any syntax errors or conflicts with other rules that are causing evaluation failures.
- Consider increasing the resources (e.g., CPU, memory) allocated to the Prometheus instance to ensure it can handle the load.
- Implement additional monitoring and alerting to detect rule evaluation failures earlier and reduce the impact.