TraefikHighHttp4xxErrorRateService #

Traefik service 4xx error rate is above 5%

Alert Rule

alert: TraefikHighHttp4xxErrorRateService
annotations:
  description: |-
    Traefik service 4xx error rate is above 5%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/embedded-exporter-v2/traefikhighhttp4xxerrorrateservice/
  summary: Traefik high HTTP 4xx error rate service (service {{ $labels.service
    }})
expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m]))
  by (service) * 100 > 5
for: 1m
labels:
  severity: critical


Meaning #

The TraefikHighHttp4xxErrorRateService alert fires when HTTP 4xx responses make up more than 5% of a Traefik service's requests, measured per service as a request rate over a 3-minute window, and the condition holds for 1 minute. 4xx codes mark client-side errors such as bad requests, failed authentication, or missing routes, so the alert often points to misbehaving clients, broken links, or a routing or authentication misconfiguration for that service.
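
The arithmetic behind the expression can be sketched in a few lines of Python. This is an illustration only: the service names and per-second rates below are made-up sample data standing in for `rate(traefik_service_requests_total[3m])`, and the helper names are hypothetical.

```python
import re

# Hypothetical per-second request rates keyed by (service, status code).
# These numbers are sample data, not real measurements.
RATES = {
    ("api@docker", "200"): 9.0,
    ("api@docker", "404"): 0.6,
    ("api@docker", "401"): 0.4,
    ("web@docker", "200"): 19.8,
    ("web@docker", "404"): 0.2,
}

THRESHOLD_PERCENT = 5.0  # same threshold as the alert rule


def error_percent_by_service(rates):
    """Return {service: share of requests answered with a 4xx code, in percent}."""
    total = {}
    errors = {}
    for (service, code), rate in rates.items():
        total[service] = total.get(service, 0.0) + rate
        # Prometheus regex matchers are fully anchored, so code=~"4.*"
        # matches any code label that starts with "4".
        if re.fullmatch(r"4.*", code):
            errors[service] = errors.get(service, 0.0) + rate
    return {s: 100.0 * errors.get(s, 0.0) / t for s, t in total.items() if t > 0}


def firing_services(rates, threshold=THRESHOLD_PERCENT):
    """Return the services whose 4xx share exceeds the threshold."""
    shares = error_percent_by_service(rates)
    return sorted(s for s, p in shares.items() if p > threshold)
```

With this sample data, `api@docker` serves 1.0 of its 10.0 requests per second with a 4xx code and would exceed the threshold, while `web@docker` would not.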

Impact #

A high HTTP 4xx error rate can have a significant impact on the service:

Users may experience errors and failures when accessing the service
The service may become unavailable or unstable
The high error rate may indicate an underlying issue with the service, such as a configuration problem or a resource constraint

Diagnosis #

To diagnose the root cause of the high HTTP 4xx error rate, follow these steps:

Check the Traefik access logs for 4xx responses from the service, and note which status codes and request paths dominate
Verify the service configuration and check for any recent changes
Check the resource utilization of the service, such as CPU and memory usage
Check for any network connectivity issues or firewall rules blocking traffic to the service
Use tools like curl or a web browser to test the service and reproduce the error
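
The log check above can be sketched as a small tally script. It assumes access logs are enabled in JSON format and that entries carry `ServiceName`, `DownstreamStatus`, and `RequestPath` fields. Verify the field names against your Traefik version. The sample lines and the `tally_4xx` helper are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample lines; in practice, read them from the access log file.
SAMPLE_LINES = [
    '{"ServiceName": "api@docker", "DownstreamStatus": 404, "RequestPath": "/old"}',
    '{"ServiceName": "api@docker", "DownstreamStatus": 404, "RequestPath": "/old"}',
    '{"ServiceName": "api@docker", "DownstreamStatus": 401, "RequestPath": "/admin"}',
    '{"ServiceName": "api@docker", "DownstreamStatus": 200, "RequestPath": "/"}',
    '{"ServiceName": "web@docker", "DownstreamStatus": 404, "RequestPath": "/x"}',
    'not a json line',
]


def tally_4xx(lines, service):
    """Count 4xx responses per (status, path) for one service."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not log objects
        if entry.get("ServiceName") != service:
            continue
        status = entry.get("DownstreamStatus")
        if isinstance(status, int) and 400 <= status <= 499:
            counts[(status, entry.get("RequestPath"))] += 1
    return counts
```

Calling `tally_4xx(SAMPLE_LINES, "api@docker").most_common()` lists the most frequent status and path pairs first. This shows quickly whether the errors cluster on one route (for example a removed endpoint) or on one status code (for example 401 after a credential change).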

Mitigation #

To mitigate the high HTTP 4xx error rate, follow these steps:

Review and update the service configuration to ensure it is correct and optimal
Investigate and resolve any underlying issues causing the errors, such as resource constraints or network connectivity problems
Implement retry mechanisms or circuit breakers to handle temporary failures
Consider implementing rate limiting or traffic shaping to prevent overwhelming the service
Monitor the service closely to ensure the error rate returns to a normal level
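
As a rough illustration of the retry step, the sketch below retries a request with exponential backoff. It rests on two assumptions that are not from the alert definition: only 408 and 429 are treated as transient (most other 4xx responses will fail again unchanged, so retrying them only adds load), and `send` is a hypothetical zero-argument callable that returns an HTTP status code.

```python
import time

# Assumption: only these statuses are worth retrying. Adjust to your service.
RETRYABLE_STATUSES = {408, 429}


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical zero-argument callable returning an HTTP status code.
    The sleep function is injectable so the backoff can be tested without waiting.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE_STATUSES:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    return status
```

A client would wrap its request function, for example `call_with_retry(lambda: do_request().status_code)`, where `do_request` is whatever HTTP call the client already makes. Capping `max_attempts` keeps retries from amplifying traffic against a service that is already rejecting requests.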