HaproxyRetryHigh #
High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend
Alert Rule
alert: HaproxyRetryHigh
annotations:
description: |-
High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/haproxy-exporter-v1/haproxyretryhigh/
summary: HAProxy retry high (instance {{ $labels.instance }})
expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
for: 2m
labels:
severity: warning
Meaning #
The HaproxyRetryHigh alert is triggered when the rate of retry warnings for a HAProxy backend exceeds 10 per minute. This alert indicates that there is a high rate of retries on a specific backend, which can lead to performance issues, increased latency, and decreased system reliability.
Impact #
- Increased latency and decreased response times for users
- Decreased system reliability and potential for errors
- Performance issues on the backend, potentially leading to cascading failures
- Potential for resource exhaustion and increased load on the backend
Diagnosis #
To diagnose the issue, follow these steps:
- Check the HAProxy logs for errors and retry reasons
- Verify the backend configuration and check for any misconfigurations
- Investigate any recent changes to the backend or HAProxy configuration
- Check the system resources (CPU, memory, and disk usage) to rule out resource constraints
- Verify the network connectivity and latency between the HAProxy instance and the backend
Mitigation #
To mitigate the issue, follow these steps:
- Investigate and address the root cause of the retries (e.g., backend errors, network issues, or misconfigurations)
- Implement retry throttling or exponential backoff to reduce the load on the backend
- Consider load balancing or distributing the traffic across multiple backends to reduce the load on a single backend
- Monitor the HAProxy and backend performance metrics to detect any potential issues before they escalate
- Consider implementing circuit breakers or fallbacks to handle failures and reduce the impact on the system
Note: For more detailed steps and specific procedures, refer to the runbook linked in the alert annotations: https://github.com/srerun/prometheus-alerts/blob/main/content/runbooks/haproxy-exporter-v1/HaproxyRetryHigh.md