TraefikHighHttp4xxErrorRateService #
Traefik service 4xx error rate is above 5%
Alert Rule
alert: TraefikHighHttp4xxErrorRateService
annotations:
  description: |-
    Traefik service 4xx error rate is above 5%
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/embedded-exporter-v2/traefikhighhttp4xxerrorrateservice/
  summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }})
expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
for: 1m
labels:
  severity: critical
Meaning #
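The TraefikHighHttp4xxErrorRateService alert fires when, for any single Traefik service, HTTP 4xx responses make up more than 5% of all requests (measured as request rates over a 3-minute window) and that condition holds for 1 minute. 4xx codes are client errors, so the alert usually means that a noticeable share of requests to that service is being rejected, for example because of malformed requests, failed authentication, missing routes, or rate limiting. A sustained rate at this level may affect the availability and reliability of the service for its users.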
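As a worked illustration of the arithmetic in the alert expression, the following minimal Python sketch computes the 4xx share for one service from per-code request rates. The function name and the sample rates are hypothetical; only the `4.*` matcher and the 5% threshold come from the rule above.

```python
import re

THRESHOLD_PERCENT = 5.0  # from the alert rule: "> 5"


def four_xx_percent(rates_by_code):
    """Return the share of 4xx requests, in percent, for one service.

    rates_by_code maps an HTTP status code (as a string) to its per-second
    request rate, i.e. what rate(traefik_service_requests_total[3m]) would
    yield for each `code` label value of that service.
    """
    total = sum(rates_by_code.values())
    if total == 0:
        return None  # no traffic, so there is nothing to evaluate
    # PromQL regex matchers are fully anchored, so code=~"4.*" matches
    # every code that starts with "4"; re.fullmatch mirrors that.
    client_errors = sum(
        rate for code, rate in rates_by_code.items()
        if re.fullmatch(r"4.*", code)
    )
    return 100.0 * client_errors / total


# Hypothetical sample: 94 req/s of 200s, 4 req/s of 404s, 2 req/s of 401s.
sample = {"200": 94.0, "404": 4.0, "401": 2.0}
percent = four_xx_percent(sample)
print(percent, percent > THRESHOLD_PERCENT)
```

With these sample rates, 6 of every 100 requests are 4xx responses, so the service would be above the threshold; the alert would fire only if that stayed true for the full `for: 1m` period.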
Impact #
A high HTTP 4xx error rate can have a significant impact on the service:
- Users may experience errors and failures when accessing the service
- The service may become unavailable or unstable
- The high error rate may indicate an underlying issue with the service, such as a configuration problem or a resource constraint
Diagnosis #
To diagnose the root cause of the high HTTP 4xx error rate, follow these steps:
- Check the Traefik logs, and the access logs if they are enabled, for the failing requests to the service; note which status codes and paths dominate
- Verify the service configuration and check for any recent changes
- Check the resource utilization of the service, such as CPU and memory usage
- Check for any network connectivity issues or firewall rules blocking traffic to the service
- Use tools like curl or a web browser to test the service and reproduce the error
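A useful first step in the diagnosis above is to see which 4xx codes dominate for the affected service, since a spike of 401s points somewhere very different from a spike of 404s or 429s. The sketch below builds such a query for the Prometheus instant-query endpoint (`/api/v1/query`) and parses the response. The helper names, the base URL, and the service name are assumptions for illustration; the metric and its `service` and `code` labels are taken from the alert expression.

```python
import json
from urllib.parse import urlencode


def breakdown_query(service):
    """PromQL for the per-status-code 4xx request rate of one service."""
    return (
        "sum by (code) (rate(traefik_service_requests_total"
        f'{{service="{service}",code=~"4.*"}}[3m]))'
    )


def query_url(prometheus_base_url, service):
    """URL of the Prometheus instant-query endpoint for that query."""
    params = urlencode({"query": breakdown_query(service)})
    return f"{prometheus_base_url}/api/v1/query?{params}"


def rates_by_code(response_body):
    """Parse an instant-query response into (code, rate) pairs, highest first."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    pairs = [
        # Each sample value is [timestamp, "<value as string>"].
        (series["metric"].get("code", "unknown"), float(series["value"][1]))
        for series in payload["data"]["result"]
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

To run it against a live server, fetch `query_url("http://localhost:9090", "my-service@docker")` with `urllib.request.urlopen` and pass the response body to `rates_by_code`; both the address and the service name here are placeholders to replace with your own.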
Mitigation #
To mitigate the high HTTP 4xx error rate, follow these steps:
- Review and update the service configuration to ensure it is correct and optimal
- Investigate and resolve any underlying issues causing the errors, such as resource constraints or network connectivity problems
- Implement retry mechanisms or circuit breakers to handle temporary failures
- Consider implementing rate limiting or traffic shaping to prevent overwhelming the service
- Monitor the service closely to ensure the error rate returns to a normal level
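For the retry step above, note that most 4xx responses are deterministic: repeating the same request to a 404 or 401 returns the same error and only adds load. The following minimal sketch therefore retries, with exponential backoff, only on 408 and 429. That choice of retryable codes, the `send` callable, and the delay values are assumptions for illustration, not part of the original runbook.

```python
import time

# Assumption: among 4xx codes, only these indicate a temporary condition.
RETRYABLE_STATUSES = {408, 429}


def call_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical zero-argument callable that performs the request
    and returns the HTTP status code. sleep is injectable so the helper can
    be exercised without actually waiting.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE_STATUSES:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return status
```

Bounding the number of attempts and backing off between them keeps a struggling service from being hit harder by its own clients, which matters when the 4xx responses come from rate limiting in the first place.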