HaproxyHighHttp5xxErrorRateServer #

Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}

Alert Rule

alert: HaproxyHighHttp5xxErrorRateServer
annotations:
  description: |-
    Too many HTTP requests with status 5xx (&gt; 5%) on server {{ $labels.server }}
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/haproxy-exporter-v1/haproxyhighhttp5xxerrorrateserver/
  summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
expr: sum by (server) (rate(haproxy_server_http_responses_total{code=&#34;5xx&#34;}[1m]) *
  100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) &gt; 5
for: 1m
labels:
  severity: critical

Here is a runbook for the Prometheus alert rule HaproxyHighHttp5xxErrorRateServer:

Meaning #

The HaproxyHighHttp5xxErrorRateServer alert is triggered when the rate of HTTP requests with a 5xx status code on a specific HAProxy server exceeds 5% of the total HTTP requests on that server. This indicates a high error rate for the server, which can impact the overall availability and performance of the application.

Impact #

High error rates can lead to a poor user experience, resulting in decreased customer satisfaction and potential revenue loss.
The increased error rate may indicate underlying issues with the application or backend services, which can lead to further escalation and downtime if left unaddressed.
The alert may trigger unnecessary escalations and notifications to teams, causing fatigue and decreasing the overall effectiveness of the monitoring system.

Diagnosis #

To diagnose the issue, follow these steps:

Check the HAProxy server logs for errors and exceptions related to the 5xx status code.
Investigate the backend services and application logs for errors and exceptions that may be contributing to the high error rate.
Verify that the HAProxy server is properly configured and that there are no issues with the load balancing or routing.
Check the system resources (e.g., CPU, memory, disk usage) to ensure that the server is not experiencing any resource constraints.

Mitigation #

To mitigate the issue, follow these steps:

Identify and address the root cause of the high error rate, whether it’s an issue with the application, backend services, or HAProxy configuration.
Implement temporary mitigations, such as redirecting traffic to alternative servers or implementing rate limiting, to reduce the impact on users.
Perform a rolling restart of the HAProxy server to ensure that the issue is not related to a specific process or thread.
Consider implementing additional error handling and retry mechanisms to improve the overall resilience of the application.

If the issue persists, please refer to the HAProxy documentation and Prometheus documentation for further guidance.