ThanosQueryHttpRequestQueryRangeErrorRateHigh #
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests.
Alert Rule #
alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
annotations:
  description: |-
    Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-query/thanosqueryhttprequestqueryrangeerrorratehigh/
  summary: Thanos Query Http Request Query Range Error Rate High (instance {{ $labels.instance }})
expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m])) / sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))) * 100 > 5
for: 5m
labels:
  severity: critical
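Before deploying a change to this rule, the file can be validated with promtool, which ships with Prometheus. The file name below is a placeholder:

```sh
# Check the rule file for syntax and semantic errors.
# "thanos-query-alerts.yaml" is a placeholder; use your actual rules file.
promtool check rules thanos-query-alerts.yaml
```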
Meaning #
This alert fires when more than 5% of "query_range" requests to Thanos Query return 5xx responses over a 5-minute window. The error rate is the ratio of 5xx responses to total "query_range" requests, aggregated by job.
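To see which instances and status codes contribute to the error rate, you can run the numerator of the alert expression broken down by instance and code. This is a sketch against the Prometheus HTTP API; the URL is a placeholder for your environment:

```sh
# Break the 5xx rate down by instance and status code.
# "http://prometheus:9090" is a placeholder; point it at your Prometheus.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum by (instance, code) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))'
```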
Impact #
A high error rate for "query_range" requests can indicate a problem with Thanos Query, which may lead to:
- Incomplete or inaccurate query results
- Increased latency or timeouts for query requests
- Degraded performance or unavailability of dependent systems
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Thanos Query logs for errors related to "query_range" requests (see the command sketch after this list)
- Verify that the Thanos Query instance is running with the correct configuration and resources (e.g., CPU, memory)
- Investigate any recent changes to the Thanos Query configuration, infrastructure, or dependent systems
- Check the system metrics (e.g., CPU usage, memory usage, disk usage) to identify any resource bottlenecks
- Verify that network connectivity and firewall rules allow traffic to and from the Thanos Query instance and its StoreAPI endpoints
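As a sketch of the first steps, assuming Thanos Query runs as a Kubernetes Deployment named thanos-query in the monitoring namespace (names and labels are assumptions; adjust to your environment):

```sh
# Recent errors in the Thanos Query logs (deployment/namespace names are assumptions;
# "kubectl logs deploy/..." streams from one pod of the deployment).
kubectl -n monitoring logs deploy/thanos-query --since=15m | grep -iE 'err|query_range'

# Current CPU/memory usage of the query pods (requires metrics-server; label is an assumption).
kubectl -n monitoring top pods -l app.kubernetes.io/name=thanos-query

# Configured flags, resource requests/limits, and recent events.
kubectl -n monitoring describe deploy/thanos-query
```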
Mitigation #
To mitigate the issue, follow these steps:
- Restart the Thanos Query instance to clear transient failures in the "query_range" request handler (see the command sketch after this list)
- Investigate and resolve any underlying issues identified during diagnosis (e.g., configuration errors, resource bottlenecks)
- Implement retry mechanisms or circuit breakers in clients to handle temporary failures of "query_range" requests
- Consider scaling the Thanos Query deployment up (more CPU/memory) or out (more replicas) to handle increased query load
- Monitor the error rate and system metrics closely to ensure the issue is resolved and does not recur.
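If a restart or scale-out is warranted, the following sketch shows the corresponding commands, again assuming a Kubernetes Deployment named thanos-query in the monitoring namespace:

```sh
# Rolling restart of the query pods (deployment/namespace names are assumptions).
kubectl -n monitoring rollout restart deploy/thanos-query
kubectl -n monitoring rollout status deploy/thanos-query

# Scale out to handle more load; the replica count is illustrative.
kubectl -n monitoring scale deploy/thanos-query --replicas=3
```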