PrometheusRuleEvaluationSlow #

Prometheus rule evaluation took longer than the scheduled interval. This indicates slow storage backend access or an overly complex query.

Alert Rule

alert: PrometheusRuleEvaluationSlow
annotations:
  description: |-
    Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheusruleevaluationslow/
  summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
for: 5m
labels:
  severity: warning


Meaning #

The PrometheusRuleEvaluationSlow alert fires when the last evaluation of a rule group has taken longer than that group's scheduled interval for 5 minutes. This can indicate slow storage backend access or overly complex queries.

Impact #

If left unchecked, slow rule evaluation can lead to:

Delays in alerting and notification delivery
Increased load on the Prometheus server
Potential gaps in rule results (recording-rule series and alert state) when evaluation iterations are skipped because the previous one is still running
Inability to detect issues in a timely manner

Diagnosis #

To diagnose the issue, follow these steps:

Check the Prometheus server logs for any errors or slowdowns related to rule evaluation.
Investigate the storage backend performance, such as disk I/O or database query times.
Review the complexity of the rules being evaluated, looking for any unnecessary or inefficient queries.
Check the prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds metrics to understand the scale of the issue.
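
The last diagnosis step can be sketched as a small rule file. This is an illustration under stated assumptions, not part of the original alert: the group name and the record names are hypothetical, and it assumes the Prometheus version in use exposes prometheus_rule_group_iterations_missed_total. The expr values can also be pasted directly into the expression browser for an ad hoc check.

```yaml
# Hypothetical diagnostic rule file; group and record names are illustrative only.
groups:
  - name: rule-evaluation-diagnostics
    rules:
      # Share of the scheduled interval used by the last evaluation of each group.
      # A value above 1 means the group overran its interval.
      - record: rule_group:evaluation_duration:ratio
        expr: prometheus_rule_group_last_duration_seconds / prometheus_rule_group_interval_seconds
      # Evaluation iterations skipped per group over the last hour
      # (assumes this metric is exposed by the running Prometheus version).
      - record: rule_group:iterations_missed:increase1h
        expr: increase(prometheus_rule_group_iterations_missed_total[1h])
```

Sorting the first expression in descending order shows which rule groups are closest to, or already past, their interval, which narrows down the rules to review.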

Mitigation #

To mitigate the issue, follow these steps:

Optimize the storage backend configuration for better performance.
Simplify or optimize complex rules to reduce evaluation time.
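Increase the evaluation interval of the affected rule groups (the per-group interval setting in the rule file, or the global evaluation_interval) to give the server more time to evaluate rules. prometheus_rule_group_interval_seconds is a metric that reports this setting; it cannot be changed directly.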
Consider scaling up the Prometheus server or distributing the load across multiple servers.
Implement a more efficient storage solution, such as SSDs or a distributed database.
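
A minimal sketch of the interval and rule-simplification steps above, assuming a rule file that mixes expensive aggregations with cheap alerts. All group names, record names, and expressions (including the http_requests_total metric) are placeholders. Rules within a group are evaluated sequentially while separate groups run independently, so moving heavy rules into their own group with a longer interval may keep them from delaying the rest.

```yaml
# Hypothetical rules file; names and expressions are placeholders, not from the original runbook.
groups:
  # Expensive aggregations in their own group with a longer, per-group interval.
  - name: heavy-aggregations
    interval: 5m
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
  # Cheap, latency-sensitive alerts omit "interval" and so use the global evaluation_interval.
  - name: fast-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
```

After reloading the configuration, prometheus_rule_group_interval_seconds should report the new interval for the changed group, and the alert expression can be re-checked against it.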