LokiRequestErrors
The {{ $labels.job }} and {{ $labels.route }} are experiencing errors
Alert Rule
alert: LokiRequestErrors
annotations:
  description: |-
    The {{ $labels.job }} and {{ $labels.route }} are experiencing errors
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/loki-internal/lokirequesterrors/
  summary: Loki request errors (instance {{ $labels.instance }})
expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m]))
  by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m]))
  by (namespace, job, route) > 10
for: 15m
labels:
  severity: critical
Meaning
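The LokiRequestErrors alert fires when more than 10% of Loki requests for a given namespace, job, and route return a 5xx status code. The percentage is computed from the per-second rate of loki_request_duration_seconds_count over a 1-minute window, and the condition must hold for 15 minutes before the alert fires. A sustained error rate at this level indicates a significant problem with Loki request processing on that route, which can cause log writes or queries to fail.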
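As a worked example of the threshold arithmetic, the following sketch reproduces the ratio the expression computes for one (namespace, job, route) group. The function name and the sample rates are hypothetical and only illustrate the calculation; Prometheus itself evaluates the rule.

```python
def error_percentage(error_rate: float, total_rate: float) -> float:
    """Return the share of requests that failed, as a percentage.

    error_rate: per-second rate of 5xx responses for one group.
    total_rate: per-second rate of all responses for the same group.
    """
    if total_rate == 0:
        # In PromQL, 0/0 is NaN and "NaN > 10" does not match, so a group
        # with no traffic does not trigger the alert.
        return float("nan")
    return 100 * error_rate / total_rate


THRESHOLD_PERCENT = 10  # from the "> 10" comparison in the rule

# Hypothetical sample: 3 failing requests/s out of 20 requests/s in total.
pct = error_percentage(3.0, 20.0)
print(f"{pct:.1f}% errors, above threshold: {pct > THRESHOLD_PERCENT}")
```

With these sample rates the group sits at 15%, so it would fire only if it stayed above 10% for the full 15 minutes required by the rule.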
Impact
- Requests that fail with a 5xx status may leave log data unprocessed, leading to incomplete logs or failed queries.
- This can impair the ability to troubleshoot issues, detect security threats, and monitor system performance.
- Services and applications that depend on the affected namespace, job, and route may see failures or delays of their own.
Diagnosis
- Check the logs of the Loki component named in the job label for errors and exceptions around the time the alert started.
- Investigate the root cause of the errors, such as:
  - Network connectivity issues between Loki and the client.
  - Configuration errors in Loki or the client.
  - Resource constraints or overload on Loki or the client.
- Verify that the Loki instance is properly configured and healthy.
- Check for any recent changes or deployments that may have introduced the issue.
- Review the Loki request metrics, breaking down loki_request_duration_seconds_count by status_code, namespace, job, and route, to identify which requests are failing and when the errors began.
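To review the request metrics, one option is to break the 5xx rate down by status code and route through the Prometheus HTTP API. The sketch below is a minimal example under assumptions: the Prometheus address, the 5-minute window, and the helper names are hypothetical, and only the metric and label names come from the alert rule.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this address; adjust for your setup.
PROMETHEUS_URL = "http://localhost:9090"

# Per-second rate of 5xx responses, split by the labels used in the alert
# rule plus status_code. The 5m window is an illustrative choice.
QUERY = (
    'sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) '
    "by (namespace, job, route, status_code)"
)


def parse_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "number"]}.
    return [(r["metric"], float(r["value"][1])) for r in results]


def top_error_sources(base_url: str = PROMETHEUS_URL) -> list[tuple[dict, float]]:
    """Run QUERY against Prometheus and sort groups by error rate, highest first."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    return sorted(parse_vector(payload), key=lambda item: item[1], reverse=True)
```

Calling top_error_sources() returns the failing groups ordered by 5xx rate, which helps show whether the errors are concentrated on one route or status code or spread across the whole job.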
Mitigation
- Resolve the root cause of the errors identified during diagnosis.
- Implement temporary workarounds to reduce the error rate, such as:
  - Load balancing or distributing traffic to healthy instances.
  - Increasing resource capacity or scaling Loki instances.
  - Implementing retries or circuit breakers to handle temporary errors.
- If instances appear stuck or unhealthy, perform a rolling restart of the Loki instances so that each one comes back healthy and up to date.
- Verify that the error rate has decreased and the alert has cleared.
- Implement long-term fixes to prevent similar issues from occurring in the future, such as:
  - Improving Loki instance configuration and resource allocation.
  - Enhancing logging and monitoring to detect issues earlier.
  - Developing automated recovery procedures to minimize downtime.
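The retry workaround listed among the temporary measures can be sketched on the client side as follows. This is a generic illustration, not the built-in behavior of Loki or of any log shipper: the function name, the attempt count, and the delays are hypothetical, and the send callable stands in for whatever issues the request and returns an HTTP status code.

```python
import random
import time
from typing import Callable


def send_with_retries(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical callable that performs one request and returns
    its HTTP status code. The last status code seen is returned.
    """
    status = 0
    for attempt in range(max_attempts):
        status = send()
        if not 500 <= status <= 599:
            return status  # not a server error, so stop retrying
        if attempt < max_attempts - 1:
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry in lockstep against a struggling Loki.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    return status
```

Retries only mask transient failures; if the error rate stays above the threshold, the backoff mainly keeps clients from adding load while the root cause is fixed.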