NginxHighHttp5xxErrorRate

Too many HTTP requests with status 5xx (> 5%)

Alert Rule

alert: NginxHighHttp5xxErrorRate
annotations:
  description: |-
    Too many HTTP requests with status 5xx (> 5%)
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/knyar-nginx-exporter/nginxhighhttp5xxerrorrate/
  summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m]))
  * 100 > 5
for: 1m
labels:
  severity: critical


Meaning

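This alert fires when HTTP requests with a 5xx status code (server-side errors) exceed 5% of all HTTP requests, with both rates computed over a 1-minute window, and the condition holds for a further minute (the "for: 1m" clause). Because the expression sums over all series, the ratio is aggregated across every scraped Nginx instance rather than evaluated per instance. A breach may indicate a problem with the Nginx server or with the application it is serving.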
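To make the threshold arithmetic concrete, here is a minimal sketch of the same calculation in Python. The function names and the sample request rates are hypothetical; only the formula (5xx rate divided by total rate, times 100, compared against 5) comes from the alert expression.

```python
THRESHOLD_PERCENT = 5.0  # taken from the "> 5" in the alert expression


def error_rate_percent(errors_per_sec: float, total_per_sec: float) -> float:
    """Mirror the alert expression: 5xx rate / total rate * 100."""
    if total_per_sec == 0:
        # With no traffic the ratio is 0/0, which is NaN and never exceeds 5.
        return float("nan")
    return errors_per_sec / total_per_sec * 100


def breaches_threshold(errors_per_sec: float, total_per_sec: float) -> bool:
    """Return True when the 5xx share is above the alert threshold."""
    return error_rate_percent(errors_per_sec, total_per_sec) > THRESHOLD_PERCENT


if __name__ == "__main__":
    # Hypothetical rates: 12 failing requests/s out of 200 requests/s overall.
    print(breaches_threshold(12, 200))  # roughly 6%, above the threshold
    print(breaches_threshold(4, 200))   # roughly 2%, below the threshold
```

Note that the comparison is on the ratio, so a low-traffic server can breach the threshold with only a handful of failing requests.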

Impact

High error rates can lead to a poor user experience, as users may encounter errors when trying to access the application.
If left unaddressed, this issue can result in a loss of user trust and revenue.
It can also indicate a potential security issue or misconfiguration of the Nginx server.

Diagnosis

Check the Nginx error logs to identify the specific errors and their causes.
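Verify that the upstream application is healthy and reachable. Responses of 502, 503, or 504 from Nginx usually point at the upstream (refused connections, no available backends, or timeouts) rather than at Nginx itself.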
Investigate recent changes to the Nginx configuration or the application code that may have caused the issue.
Check the system resources (CPU, memory, disk space) to ensure they are not overwhelmed.
Review the access logs to identify any patterns or trends in the errors.
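As one way to look for patterns in the access logs, the sketch below counts 5xx responses by status code and request path. It assumes the log uses Nginx's predefined "combined" format; the regular expression, the function name, and the sample lines are all hypothetical and would need adjusting for a custom log_format.

```python
import re
from collections import Counter

# Assumption: lines follow Nginx's "combined" log format.
LINE_RE = re.compile(
    r'^(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def summarize_5xx(lines):
    """Return (parsed_line_count, Counter keyed by (status, path)) for 5xx lines."""
    counts = Counter()
    total = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        total += 1
        status = match.group("status")
        if status.startswith("5"):
            counts[(status, match.group("path"))] += 1
    return total, counts


# Hypothetical sample data, for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [01/Jan/2024:12:00:00 +0000] "GET /api/orders HTTP/1.1" 502 157 "-" "curl/8.0"',
    '203.0.113.8 - - [01/Jan/2024:12:00:01 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '203.0.113.7 - - [01/Jan/2024:12:00:02 +0000] "GET /api/orders HTTP/1.1" 502 157 "-" "curl/8.0"',
    '203.0.113.9 - - [01/Jan/2024:12:00:03 +0000] "POST /api/report HTTP/1.1" 504 167 "-" "curl/8.0"',
]

if __name__ == "__main__":
    total, counts = summarize_5xx(SAMPLE_LINES)
    print(f"parsed {total} lines")
    for (status, path), n in counts.most_common():
        print(status, path, n)
```

If most errors cluster on one path or one status code, that narrows the search to a single upstream or endpoint; errors spread evenly across paths suggest a problem with Nginx or the host itself.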

Mitigation

Identify the root cause using the diagnosis steps above and fix it; the remaining steps only limit the impact in the meantime.
Implement a temporary fix, such as increasing the resources available to the Nginx server or load balancing the traffic.
Perform a rolling restart of the Nginx servers to ensure that any stuck processes are terminated.
Consider implementing retries or circuit breakers in the application to reduce the impact of errors on users.
Review and refine the Nginx configuration to prevent similar issues in the future.
Monitor the error rate closely to ensure that it returns to a normal level.
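The retry and circuit-breaker suggestion can be sketched as follows. This is an illustrative outline under stated assumptions, not a production implementation: UpstreamError, CircuitOpenError, call_with_retries, and CircuitBreaker are hypothetical names, and the attempt counts and delays are arbitrary defaults.

```python
import time


class UpstreamError(Exception):
    """Hypothetical error standing in for a 5xx response from the server."""


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying on UpstreamError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except UpstreamError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open: skipping call")
            # Cool-down elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = func()
        except UpstreamError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Retries help with brief, isolated failures, while the breaker keeps clients from piling more load onto a server that is already failing. Retrying without a cap or backoff can make an overload worse, which is why both limits appear in the sketch.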