ThanosQueryRangeLatencyHigh #
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.
Alert Rule
alert: ThanosQueryRangeLatencyHigh
annotations:
  description: |-
    Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-query/thanosqueryrangelatencyhigh/
  summary: Thanos Query Range Latency High (instance {{ $labels.instance }})
expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0)
for: 10m
labels:
  severity: critical
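If the rule is edited locally, it can be validated with promtool before deploying; the rule file name below is only an example.

# Validate the alerting rule file (file name is illustrative)
promtool check rules thanos-query-alerts.yml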
Meaning #
The ThanosQueryRangeLatencyHigh alert is triggered when the 99th percentile latency of range queries in Thanos Query exceeds 90 seconds for a given job (the expression aggregates by job) while that job is still receiving query_range traffic, and the condition holds for 10 minutes. It indicates that range-query requests are experiencing high latency, which degrades the performance and responsiveness of everything that reads through this Thanos Query endpoint.
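To see the value currently driving the alert, the same expression can be run against the Thanos Query (or Prometheus) HTTP API; the endpoint below is a placeholder for your own deployment.

# p99 latency of range queries per job over the last 5 minutes
# (thanos-query.example.com:9090 is a placeholder endpoint)
curl -s -G 'http://thanos-query.example.com:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m])))' \
  | jq '.data.result'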
Impact #
The high latency of range queries can have a significant impact on the overall system performance, leading to:
- Slow query responses, resulting in delayed decision-making and potentially affecting business operations
- Cascading failures, as dependent services time out and retry slow range queries, potentially widening into a broader system outage
- Poor user experience, especially for dashboards and applications that rely heavily on range queries
Diagnosis #
To diagnose the root cause of the high latency, follow these steps:
- Check the Thanos Query logs for any error messages or exceptions related to query range requests
- Analyze the system metrics, such as CPU utilization, memory usage, and disk I/O, to identify any resource bottlenecks
- Verify if there are any recent changes or updates to the Thanos Query configuration, deployment, or underlying infrastructure that may be contributing to the high latency
- Use tools like Prometheus and Grafana to visualize the query latency and identify any trends or patterns; example commands and queries are sketched below
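A minimal diagnosis sketch, assuming the Thanos Query pods run in Kubernetes in the monitoring namespace with the label app=thanos-query (the selectors and API endpoint are assumptions; adjust them to your environment):

# 1. Scan recent Thanos Query logs for errors and timeouts around range queries
kubectl -n monitoring logs -l app=thanos-query --since=30m | grep -iE 'error|timeout|query_range'

# 2. Check CPU and memory usage of the Thanos Query pods for resource bottlenecks
kubectl -n monitoring top pods -l app=thanos-query

# 3. Compare request rates per handler to confirm that query_range is the slow path
curl -s -G 'http://thanos-query.example.com:9090/api/v1/query' \
  --data-urlencode 'query=sum by (handler) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*"}[5m]))' \
  | jq '.data.result'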
Mitigation #
To mitigate the high latency of range queries, follow these steps:
- Investigate and resolve any underlying infrastructure issues: Identify and address any resource bottlenecks, such as high CPU utilization or disk I/O issues, and ensure that the underlying infrastructure is properly scaled and configured.
- Optimize Thanos Query configuration: Review and tune the Thanos Query configuration, such as the query timeout, query concurrency, and caching settings, to improve performance (example flags are sketched after this list).
- Optimize expensive queries: Rewrite or split heavy range queries, reduce their time range or resolution, and cache frequently repeated results.
- Scale horizontally: Add Thanos Query replicas to distribute query load and improve performance.
- Monitor and analyze query patterns: Continuously monitor and analyze query patterns to identify opportunities for optimization and improvement.
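A sketch of the configuration and scaling levers above, assuming a Kubernetes deployment named thanos-query in the monitoring namespace; the flag values are illustrative, not recommendations.

# Thanos Query flags that bound range-query load (values are examples only;
# keep your existing store/endpoint flags alongside them)
thanos query \
  --http-address=0.0.0.0:9090 \
  --query.timeout=2m \
  --query.max-concurrent=20

# Horizontal scaling: add replicas to spread query load
kubectl -n monitoring scale deployment thanos-query --replicas=4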
Remember to follow the runbook’s guidelines and take corrective actions to resolve the issue and prevent future occurrences.