PrometheusRuleEvaluationFailures #

Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.

Alert Rule

alert: PrometheusRuleEvaluationFailures
annotations:
  description: |-
    Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheusruleevaluationfailures/
  summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
expr: increase(prometheus_rule_evaluation_failures_total[3m]) &gt; 0
for: 0m
labels:
  severity: critical

Here is the runbook for the PrometheusRuleEvaluationFailures alert:

Meaning #

The PrometheusRuleEvaluationFailures alert is triggered when there is an increase in the total number of Prometheus rule evaluation failures within a 3-minute window. This means that Prometheus is experiencing issues evaluating rules, which can lead to missed or delayed alerts.

Impact #

The impact of this alert is critical, as it can result in:

Missed alerts: Rules that are not evaluated may not trigger alerts, leading to potential issues going undetected.
Delayed alerts: Even if rules are eventually evaluated, the delay can cause alerts to be triggered late, reducing the effectiveness of monitoring and incident response.
Lack of visibility: Unevaluated rules can lead to a lack of visibility into the system’s state, making it harder to identify and respond to issues.

Diagnosis #

To diagnose the issue, follow these steps:

Check the Prometheus logs for errors related to rule evaluation.
Verify that the Prometheus instance is running correctly and not experiencing high CPU usage or memory issues.
Review the rules that are failing to evaluate and check for any syntax errors or conflicts with other rules.
Check the Prometheus configuration file for any issues or typos that could be causing rule evaluation failures.

Mitigation #

To mitigate the issue, follow these steps:

Restart the Prometheus instance to clear any temporary issues.
Verify that the Prometheus configuration file is correct and up-to-date.
Review and correct any syntax errors or conflicts with other rules that are causing evaluation failures.
Consider increasing the resources (e.g., CPU, memory) allocated to the Prometheus instance to ensure it can handle the load.
Implement additional monitoring and alerting to detect rule evaluation failures earlier and reduce the impact.