NginxHighHttp5xxErrorRate

NginxHighHttp5xxErrorRate #

Too many HTTP requests with status 5xx (> 5%)

Alert Rule
alert: NginxHighHttp5xxErrorRate
annotations:
  description: |-
    Too many HTTP requests with status 5xx (> 5%)
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/knyar-nginx-exporter/nginxhighhttp5xxerrorrate/
  summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m]))
  * 100 > 5
for: 1m
labels:
  severity: critical

Here is a runbook for the Prometheus alert rule “NginxHighHttp5xxErrorRate”:

Meaning #

This alert is triggered when the rate of HTTP requests with a 5xx status code (indicating an error on the server side) exceeds 5% of the total HTTP requests over a 1-minute period. This may indicate a problem with the Nginx server or the application it is serving.

Impact #

  • High error rates can lead to a poor user experience, as users may encounter errors when trying to access the application.
  • If left unaddressed, this issue can result in a loss of user trust and revenue.
  • It can also indicate a potential security issue or misconfiguration of the Nginx server.

Diagnosis #

  • Check the Nginx error logs to identify the specific errors and their causes.
  • Verify that the application is functioning correctly and not returning errors.
  • Investigate recent changes to the Nginx configuration or the application code that may have caused the issue.
  • Check the system resources (CPU, memory, disk space) to ensure they are not overwhelmed.
  • Review the access logs to identify any patterns or trends in the errors.

Mitigation #

  • Immediately investigate and address the root cause of the errors.
  • Implement a temporary fix, such as increasing the resources available to the Nginx server or load balancing the traffic.
  • Perform a rolling restart of the Nginx servers to ensure that any stuck processes are terminated.
  • Consider implementing retries or circuit breakers in the application to reduce the impact of errors on users.
  • Review and refine the Nginx configuration to prevent similar issues in the future.
  • Monitor the error rate closely to ensure that it returns to a normal level.