ThanosRuleQueryHighDNSFailures

Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.

Alert Rule

alert: ThanosRuleQueryHighDNSFailures
annotations:
  description: |-
    Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-ruler/thanosrulequeryhighdnsfailures/
  summary: Thanos Rule Query High DNS Failures (instance {{ $labels.instance }})
expr: (sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
  / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
  * 100 > 1)
for: 15m
labels:
  severity: warning

Meaning

The ThanosRuleQueryHighDNSFailures alert fires when, for a given job and instance, more than 1% of Thanos Rule's DNS lookups for its query endpoints fail, measured as a ratio of 5-minute rates, and the condition persists for 15 minutes. It points to a problem with DNS resolution for the query endpoints, which can keep Thanos Rule from reaching the queriers it evaluates rules against and so degrade its availability and performance.
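To make the threshold concrete, here is a minimal Python sketch of the arithmetic in the alert expression. The function names `dns_failure_percent` and `would_fire` are hypothetical and exist only for illustration; the sketch covers the ratio and the 1% comparison, not the `for: 15m` hold period.

```python
from typing import Optional


def dns_failure_percent(failures_per_sec: float, lookups_per_sec: float) -> Optional[float]:
    """Mirror the alert expression: failures / lookups * 100.

    Returns None when there were no lookups in the window. In PromQL, 0/0
    yields NaN, which never satisfies the `> 1` comparison.
    """
    if lookups_per_sec == 0:
        return None
    return failures_per_sec / lookups_per_sec * 100


def would_fire(failures_per_sec: float, lookups_per_sec: float,
               threshold_percent: float = 1.0) -> bool:
    """True when the failure percentage is strictly above the threshold."""
    percent = dns_failure_percent(failures_per_sec, lookups_per_sec)
    return percent is not None and percent > threshold_percent
```

For example, 3 failed lookups per second out of 200 is 1.5%, which is above the threshold, while 1 out of 200 is 0.5% and is not.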

Impact

Failed DNS queries can lead to:

Delayed or failed rule evaluations when Thanos Rule cannot reach a query endpoint, so alerts may fire late or not at all
Increased latency and errors in Thanos Rule, potentially impacting downstream systems
Incomplete or missing rule results, which make later troubleshooting harder

If left unchecked, a high DNS failure rate can leave Thanos Rule unable to evaluate its rules at all, so the alerting that depends on it stops working.

Diagnosis

To diagnose the issue, follow these steps:

Check the Thanos Rule logs for errors related to DNS resolution or query endpoint connectivity.
Verify that the DNS servers are functioning correctly and are reachable from the Thanos Rule instances.
Investigate any recent changes to the network configuration, DNS setup, or query endpoint URLs.
Check the system resources (e.g., CPU, memory, and disk space) to ensure they are within acceptable limits.
Review the Thanos Rule configuration, in particular the query endpoint addresses, to ensure the hostnames are correct and up to date.
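The DNS check in the steps above can be sketched in Python as follows. This is a rough illustration, not part of Thanos: `check_dns` is a hypothetical helper, and it only performs the A/AAAA lookups that the standard library's `socket.getaddrinfo` supports, so it does not cover SRV-based discovery. It is only meaningful when run from the same host, pod, or network namespace as the Thanos Rule instance.

```python
import socket


def check_dns(hosts, resolver=socket.getaddrinfo):
    """Resolve each hostname and split the results into successes and failures.

    Returns a pair of dicts: {host: [sorted unique addresses]} and
    {host: error message}. `resolver` is injectable so the function can be
    exercised without network access.
    """
    resolved, failed = {}, {}
    for host in hosts:
        try:
            infos = resolver(host, None)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address for both IPv4 and IPv6.
            resolved[host] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            failed[host] = str(exc)
    return resolved, failed
```

Calling `check_dns` with the hostnames taken from the Thanos Rule query endpoint configuration (for instance a placeholder such as `thanos-query.monitoring.svc.cluster.local`) returns the addresses that resolved and the error text for those that did not.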

Mitigation

To mitigate the issue, follow these steps:

Check the DNS server configuration used by the Thanos Rule instances, correct it where needed, and confirm the servers are reachable.
Verify that the query endpoint URLs are correct and reachable.
Implement retry mechanisms or circuit breakers to handle temporary DNS failures.
Consider implementing a DNS cache or proxy to reduce the load on the DNS servers.
Review and optimize the Thanos Rule configuration to reduce the load on the system resources.
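As an illustration of the retry idea in the steps above, the following Python sketch retries a DNS lookup with exponential backoff. It is meant for a client or helper script that performs its own lookups and does not describe how Thanos Rule itself retries. The name `resolve_with_retry` and its default values are assumptions chosen for the example.

```python
import socket
import time


def resolve_with_retry(host, attempts=4, base_delay=0.5,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve `host`, retrying on DNS errors with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last socket.gaierror once all attempts are used up.
    `resolver` and `sleep` are injectable to allow testing without a network.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            infos = resolver(host, None)
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            last_error = exc
            # Do not sleep after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Backoff keeps a script from hammering a DNS server that is already struggling, but it only masks short outages; a persistent failure still surfaces as an exception.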

Remember to investigate and address any underlying causes of the DNS failures to prevent future occurrences.