IstioHigh4xxErrorRate
High percentage of HTTP 4xx responses in Istio (> 5%).
Alert Rule
alert: IstioHigh4xxErrorRate
annotations:
  description: |-
    High percentage of HTTP 4xx responses in Istio (> 5%).
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/istio-internal/istiohigh4xxerrorrate/
  summary: Istio high 4xx error rate (instance {{ $labels.instance }})
expr: sum(rate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5
for: 1m
labels:
  severity: warning
Meaning
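The IstioHigh4xxErrorRate alert fires when HTTP 4xx responses exceed 5% of all requests reported by destination-side proxies, measured as a rate over a 5-minute window, and the condition holds for 1 minute. A high rate of 4xx responses points to client-side errors such as malformed requests, failed authentication or authorization, or requests to routes that do not exist. The cause may lie in the Istio configuration or in the underlying application.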
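The arithmetic behind the alert expression can be sketched in a few lines of Python. This is only an illustration of the ratio, not how Prometheus evaluates the rule; the function names and the input mapping (per-second request rates keyed by response code) are assumptions made for the example.

```python
import math
from typing import Mapping

THRESHOLD_PERCENT = 5.0  # matches the "> 5" comparison in the alert expression


def error_percentage(rates_by_code: Mapping[str, float]) -> float:
    """Return 4xx responses as a percentage of all requests.

    rates_by_code maps a response code (as a string label value) to its
    per-second request rate, mirroring rate(istio_requests_total[5m]).
    """
    total = sum(rates_by_code.values())
    if total == 0:
        # With no traffic PromQL divides 0 by 0, which yields NaN; NaN never
        # satisfies "> 5", so the alert does not fire.
        return math.nan
    # The regex "4.*" is fully anchored in PromQL, so it matches any code
    # that starts with "4".
    client_errors = sum(
        rate for code, rate in rates_by_code.items() if code.startswith("4")
    )
    return client_errors * 100 / total


def exceeds_threshold(rates_by_code: Mapping[str, float]) -> bool:
    """True when the 4xx share is above the alert threshold."""
    return error_percentage(rates_by_code) > THRESHOLD_PERCENT
```

For example, rates of 90 req/s for 200, 6 req/s for 404, 2 req/s for 401 and 2 req/s for 503 give 8 req/s of 4xx traffic out of 100 req/s, which is 8% and above the threshold.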
Impact
A high 4xx error rate can have a significant impact on the user experience and the overall reliability of the application. It can lead to:
- Increased latency and timeouts
- Decreased throughput and performance
- Frustrated users and potential revenue loss
- Increased load on upstream services and resources
Diagnosis
To diagnose the root cause of the high 4xx error rate, follow these steps:
- Check the Istio logs (istiod and the istio-proxy sidecar of the affected workload) for any errors or issues related to the affected service.
- Verify that the service is correctly configured and deployed.
- Check the application logs for any errors or exceptions that may be causing the 4xx errors.
- Use tools like istioctl or kubectl to inspect the Istio configuration and verify that it is correct.
- Check for any known issues or bugs in the Istio version being used.
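Before reading logs, it often helps to see which workloads and which status codes account for the 4xx traffic. The sketch below builds a breakdown query from the same metric and selector as the alert, plus a URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query). The helper names and the base URL http://localhost:9090 are hypothetical; destination_workload and response_code are standard labels on istio_requests_total.

```python
from urllib.parse import urlencode


def breakdown_query(group_by, window="5m"):
    """Build a PromQL query for the 4xx request rate grouped by the given labels."""
    labels = ", ".join(group_by)
    return (
        f"sum by ({labels}) (rate(istio_requests_total{{"
        f'reporter="destination", response_code=~"4.*"}}[{window}]))'
    )


def query_url(base_url, promql):
    """Return an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    promql = breakdown_query(["destination_workload", "response_code"])
    print(promql)
    # Hypothetical Prometheus address; replace with the one used in your cluster.
    print(query_url("http://localhost:9090", promql))
```

A single dominant code usually narrows the search: mostly 404 suggests routing or path problems, while mostly 401 or 403 suggests authentication or authorization policy.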
Mitigation
To mitigate the high 4xx error rate, follow these steps:
- Investigate and resolve any underlying issues with the service or application.
- Verify that the Istio configuration is correct and up-to-date.
- Check for any misconfigured or incorrect routing rules.
- Consider retries or circuit breakers for transient errors only; most 4xx responses (for example 400, 401, 403, 404) are deterministic and will fail again when retried.
- Monitor the error rate and performance metrics to ensure that the issue is resolved.
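To confirm that a fix worked, the current 4xx percentage can be read from a Prometheus instant-query response. The sketch below assumes the query is the alert expression without the trailing "> 5" comparison, because with the comparison Prometheus returns an empty result whenever the alert is not firing. The function names are hypothetical; the response layout (status, data.result, value as [timestamp, "string"]) follows the Prometheus HTTP API.

```python
import json
from typing import Optional

THRESHOLD_PERCENT = 5.0  # same threshold as the alert rule


def current_error_percentage(response_body: str) -> Optional[float]:
    """Extract the first sample value from an instant-query JSON response.

    Returns None when the result vector is empty (for example, no traffic).
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def is_resolved(percentage: Optional[float]) -> bool:
    """True only when a sample exists and is at or below the threshold."""
    # A missing sample or NaN cannot confirm recovery, so both count as unresolved.
    return percentage is not None and percentage <= THRESHOLD_PERCENT
```

Treating a missing sample as unresolved is a deliberately cautious choice: an empty result may mean the service receives no traffic at all, which is worth checking separately.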
Additional resources:
- Istio documentation: Troubleshooting
- Istio documentation: Configuring routing rules