HaproxyHighHttp5xxErrorRateBackend #
Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}
Alert Rule
alert: HaproxyHighHttp5xxErrorRateBackend
annotations:
description: |-
Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/haproxy-exporter-v1/haproxyhighhttp5xxerrorratebackend/
summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance
}})
expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m]))
/ sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5
for: 1m
labels:
severity: critical
Meaning #
The HaproxyHighHttp5xxErrorRateBackend
alert is triggered when the rate of HTTP requests with a 5xx status code on a specific backend exceeds 5% of the total requests to that backend over a 1-minute period. This indicates a significant error rate on the backend, which can impact the overall availability and performance of the service.
Impact #
- High error rates can lead to a degradation of service quality, causing frustration for users and potential revenue loss.
- Increased error rates can also put additional load on the system, leading to performance issues and potentially even crashes.
- If left unaddressed, high error rates can lead to a loss of customer trust and reputation damage.
Diagnosis #
- Check the HAProxy logs to identify the specific backend and instance affected by the high error rate.
- Investigate the root cause of the 5xx errors using the HAProxy logs, server logs, and application logs.
- Verify that the backend is correctly configured and that there are no issues with the underlying infrastructure.
- Check for any recent changes or deployments that may have introduced the error.
- Use tools like
curl
orwget
to test the backend and verify that the 5xx errors are still occurring.
Mitigation #
- Identify and address the root cause of the 5xx errors, which may involve:
- Restarting or reloading the backend service.
- Adjusting server or application configurations.
- Implementing retries or circuit breakers to reduce the impact of errors.
- Implement a temporary fix to reduce the error rate, such as:
- Decreasing the traffic to the affected backend.
- Implementing a load balancing strategy to distribute traffic to other backends.
- Perform a thorough investigation and create a permanent fix to prevent similar issues in the future.
- Monitor the error rate and adjust the alert threshold as necessary to ensure it is effective in detecting issues.
- Update the runbook and documentation to reflect the root cause and mitigation steps taken to address the issue.