ThanosQueryHighDNSFailures #
Thanos Query {{$labels.job}} have {{$value | humanize}}% of failing DNS queries for store endpoints.
Alert Rule
alert: ThanosQueryHighDNSFailures
annotations:
description: |-
Thanos Query {{$labels.job}} have {{$value | humanize}}% of failing DNS queries for store endpoints.
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-query/thanosqueryhighdnsfailures/
summary: Thanos Query High D N S Failures (instance {{ $labels.instance }})
expr: (sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m]))
/ sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m])))
* 100 > 1
for: 15m
labels:
severity: warning
Meaning #
The ThanosQueryHighDNSFailures alert is triggered when the percentage of failing DNS queries for Thanos Query store endpoints exceeds 1%. This alert indicates that there is an issue with DNS resolution for Thanos Query instances, which may impact the overall performance and reliability of the system.
Impact #
- High DNS failure rates can lead to timeouts and errors in Thanos Query, resulting in delayed or failed queries.
- This may cause:
- Increased latency and decreased performance for dependent systems.
- Incomplete or incorrect data being returned to users.
- Potential data loss or corruption.
Diagnosis #
To diagnose the issue, follow these steps:
- Check DNS resolution: Verify that DNS resolution is working correctly for the affected Thanos Query instance(s).
- Investigate DNS server logs: Analyze DNS server logs to identify any errors or issues that may be causing the high failure rate.
- Verify network connectivity: Ensure that there are no network connectivity issues between the Thanos Query instance and the DNS servers.
- Check Thanos Query configuration: Review the Thanos Query configuration to ensure that it is correctly configured for DNS resolution.
Mitigation #
To mitigate the issue, follow these steps:
- Restart DNS service: Restart the DNS service to ensure it is running correctly.
- Check and update DNS server configuration: Review and update the DNS server configuration to resolve any issues identified during diagnosis.
- Increase DNS timeout: Temporarily increase the DNS timeout to allow for more time to resolve DNS queries.
- Implement DNS caching: Consider implementing DNS caching to reduce the load on DNS servers and improve performance.
- Monitor and investigate: Continuously monitor the situation and investigate the root cause of the issue to prevent future occurrences.