ThanosQueryHttpRequestQueryErrorRateHigh #
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of “query” requests.
Alert Rule
alert: ThanosQueryHttpRequestQueryErrorRateHigh
annotations:
  description: |-
    Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-query/thanosqueryhttprequestqueryerrorratehigh/
  summary: Thanos Query Http Request Query Error Rate High (instance {{ $labels.instance }})
expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*",
  handler="query"}[5m])) / sum by (job) (rate(http_requests_total{job=~".*thanos-query.*",
  handler="query"}[5m]))) * 100 > 5
for: 5m
labels:
  severity: critical
Meaning #
The ThanosQueryHttpRequestQueryErrorRateHigh alert fires when more than 5% of “query” requests handled by Thanos Query return an HTTP 5xx status code. The percentage is computed per job from 5-minute request rates, and it must stay above the threshold for 5 minutes before the alert fires. This indicates that Thanos Query is failing to serve queries, which can lead to failed, delayed, or incomplete query results.
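To see the current error percentage rather than only whether the threshold is crossed, the rule's expression can be run without the `> 5` comparison. This is a minimal sketch that reuses the metric name and label matchers from the alert rule unchanged. The job regex may need adjusting for your environment.

```promql
# Current 5xx percentage for the "query" handler, per job.
# Same expression as the alert rule, minus the "> 5" comparison.
(
  sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))
/
  sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))
) * 100
```

As an illustration, a job serving 40 “query” requests per second with 3 of them returning 5xx evaluates to 7.5, which is above the threshold of 5.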
Impact #
- Dashboards and API clients receive errors or missing data for failed queries
- Delayed processing or incomplete results
- Potential impact on dependent services or applications
- Sustained errors reduce the reliability of anything that relies on Thanos Query for reads
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Thanos Query logs for error messages related to query handling.
- Verify that every Thanos Query instance is running and configured correctly.
- Check system resources (CPU, memory, disk space) to rule out resource exhaustion.
- Verify that the query load is within expected limits and that there are no sudden spikes.
- Check network connectivity between Thanos Query and the store endpoints it reads from.
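Several of the checks above can be approximated with PromQL, as sketched below. The first three queries rely only on the labels used in the alert rule and on the `up` series that Prometheus records for every scrape target. The last two assume that Thanos Query exposes the default process metrics (`process_resident_memory_bytes`, `process_cpu_seconds_total`), which may not hold on every platform.

```promql
# Run each query separately.

# Are the Thanos Query targets up? (1 = last scrape succeeded, 0 = failed)
up{job=~".*thanos-query.*"}

# Which 5xx status codes are being returned, and at what rate?
sum by (job, code) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))

# Total "query" request rate, to spot load spikes.
sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))

# Memory and CPU per instance (assumption: default process metrics are exposed).
process_resident_memory_bytes{job=~".*thanos-query.*"}
rate(process_cpu_seconds_total{job=~".*thanos-query.*"}[5m])
```

Graphing these over the hours before the alert fired helps separate a gradual load increase from a sudden failure.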
Mitigation #
To mitigate the issue, follow these steps:
- Review the Thanos Query configuration, for example the query timeout and concurrency limits (`--query.timeout`, `--query.max-concurrent`), and adjust it if necessary to handle the query load.
- Scale Thanos Query up (more CPU or memory) or out (more replicas) to handle the increased query load.
- Reduce query cost, for example by narrowing the time ranges or label selectors of expensive queries.
- Check for any software or configuration updates that may resolve the issue.
- Consider implementing a load balancer or queueing mechanism to handle sudden spikes in query load.
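After scaling out or adding a load balancer, one way to check the effect is to compare the per-instance request rate before and after the change. The query below is a sketch that assumes the standard `instance` target label is present on `http_requests_total`. The other matchers come from the alert rule.

```promql
# "query" request rate per instance; roughly equal values suggest load is spread evenly.
sum by (job, instance) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))
```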
Additional resources:
- Refer to the Thanos Query documentation for configuration and optimization guidelines.
- Consult with the development team to identify any application-level issues that may be contributing to the error rate.
- Review system monitoring and logging to identify any other potential issues.