ThanosRuleAlertmanagerHighDNSFailures #
Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.
Alert Rule
alert: ThanosRuleAlertmanagerHighDNSFailures
annotations:
description: |-
Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-ruler/thanosrulealertmanagerhighdnsfailures/
summary: Thanos Rule Alertmanager High D N S Failures (instance {{ $labels.instance
}})
expr: (sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m]))
/ sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m]))
* 100 > 1)
for: 15m
labels:
severity: warning
Meaning #
The ThanosRuleAlertmanagerHighDNSFailures alert is triggered when the percentage of failing DNS queries for Alertmanager endpoints exceeds 1% for a Thanos Rule instance. This alert indicates a potential issue with the DNS resolution for Alertmanager endpoints, which may lead to alert delivery failures.
Impact #
- Delayed or failed alert delivery to notification channels
- Incomplete or inaccurate alerting and notifications
- Increased latency or timeouts in alert processing
- Potential impact on incident response and resolution times
Diagnosis #
- Check the Thanos Rule instance logs for DNS resolution errors or timeouts
- Verify the DNS configuration and settings for the Alertmanager endpoints
- Check the network connectivity and latency between the Thanos Rule instance and the Alertmanager endpoints
- Investigate any recent changes to the DNS infrastructure or Alertmanager configuration
- Review the Prometheus metrics for other related errors or issues
Mitigation #
- Investigate and resolve any DNS resolution errors or timeouts on the Thanos Rule instance
- Verify and update the DNS configuration to ensure correct resolution of Alertmanager endpoints
- Check and optimize the network connectivity and latency between the Thanos Rule instance and the Alertmanager endpoints
- Consider implementing redundancy or fallback mechanisms for DNS resolution
- Monitor the alerting system for any continued issues or errors and take corrective action as needed.