ThanosRuleAlertmanagerHighDNSFailures #

Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.

Alert Rule

alert: ThanosRuleAlertmanagerHighDNSFailures
annotations:
  description: |-
    Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-ruler/thanosrulealertmanagerhighdnsfailures/
  summary: Thanos Rule Alertmanager High DNS Failures (instance {{ $labels.instance }})
expr: (sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
  / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
  * 100 > 1)
for: 15m
labels:
  severity: warning

Meaning #

The ThanosRuleAlertmanagerHighDNSFailures alert fires when more than 1% of a Thanos Rule instance's DNS lookups for Alertmanager endpoints fail, measured as a 5-minute rate, and the condition persists for 15 minutes. It indicates a problem with DNS resolution for the Alertmanager endpoints, which can cause alert delivery to fail.
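
The arithmetic behind the threshold can be sketched in a few lines of Python. This is only an illustration of the alert expression, not code from Thanos or Prometheus: the function names and the sample counter increases are hypothetical. Because both counters are rated over the same window, the ratio of their rates equals the ratio of their increases.

```python
def dns_failure_percent(failures_increase: float, lookups_increase: float) -> float:
    """Percentage of failed DNS lookups over one window, mirroring the alert expression."""
    if lookups_increase == 0:
        # PromQL evaluates 0/0 to NaN, and NaN > 1 is false, so the alert stays silent.
        return float("nan")
    return 100.0 * failures_increase / lookups_increase


def breaches_threshold(percent: float, threshold: float = 1.0) -> bool:
    """True only when the percentage is strictly above the threshold (NaN compares false)."""
    return percent > threshold


# Hypothetical counter increases over one 5-minute window.
print(dns_failure_percent(3, 200))                      # 1.5
print(breaches_threshold(dns_failure_percent(3, 200)))  # True
print(breaches_threshold(dns_failure_percent(2, 200)))  # False: exactly 1% is not above 1
```

Note that the comparison is strict, and that the alert only fires once the condition has held continuously for the 15-minute `for` period.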

Impact #

Delayed or failed alert delivery to notification channels
Incomplete or inaccurate alerting and notifications
Increased latency or timeouts in alert processing
Potential impact on incident response and resolution times

Diagnosis #

Check the Thanos Rule instance logs for DNS resolution errors or timeouts
Verify the DNS configuration and settings for the Alertmanager endpoints
Check the network connectivity and latency between the Thanos Rule instance and the Alertmanager endpoints
Investigate any recent changes to the DNS infrastructure or Alertmanager configuration
Review related Thanos Rule metrics in Prometheus, such as thanos_alert_sender_alerts_dropped_total, to see whether alerts are also being dropped
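
One way to check DNS resolution by hand is a small probe run from the same host or pod as Thanos Rule. The sketch below is an assumption-laden illustration: `probe_dns` is a hypothetical helper, and it uses the system resolver through Python's `socket.getaddrinfo`, which may not behave identically to the resolver Thanos Rule uses.

```python
import socket
from typing import Callable


def probe_dns(host: str, port: int, attempts: int = 5,
              resolver: Callable = socket.getaddrinfo) -> float:
    """Resolve host repeatedly and return the percentage of failed attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            resolver(host, port)
        except socket.gaierror as err:
            # gaierror is what getaddrinfo raises when name resolution fails.
            failures += 1
            print(f"lookup failed for {host}: {err}")
    return 100.0 * failures / attempts
```

For example, `probe_dns("alertmanager.example.internal", 9093)` would probe a placeholder hostname on Alertmanager's default port; substitute the address actually configured for Thanos Rule. Local resolver caching can mask intermittent failures, so a clean result here does not rule out a DNS problem.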

Mitigation #

Investigate and resolve any DNS resolution errors or timeouts on the Thanos Rule instance
Verify and update the DNS configuration to ensure correct resolution of Alertmanager endpoints
Check and optimize the network connectivity and latency between the Thanos Rule instance and the Alertmanager endpoints
Consider implementing redundancy or fallback mechanisms for DNS resolution
Monitor the alerting system for any continued issues or errors and take corrective action as needed