LokiRequestLatency #
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency
Alert Rule
alert: LokiRequestLatency
annotations:
  description: |-
    The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/loki-internal/lokirequestlatency/
  summary: Loki request latency (instance {{ $labels.instance }})
expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le))) > 1
for: 5m
labels:
  severity: critical
Meaning #
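The LokiRequestLatency alert fires when the 99th percentile latency of Loki requests, computed from request rates over a 5-minute window, stays above 1 second for 5 minutes. Routes whose name contains "tail" (case-insensitive) are excluded, because tail requests are long-lived streaming connections. The alert has critical severity, indicating a severe performance issue with Loki.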
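One detail of the rule is worth spelling out: the expression aggregates with by (le), so the resulting series carries no job, route, or instance label, and the matching template fields in the annotations render empty. A minimal sketch of a per-route variant follows, assuming you want one alert per job and route. The alert name LokiRequestLatencyPerRoute is hypothetical; the threshold and durations are copied from the rule above.

```yaml
# Sketch only: per-route variant of the rule above (alert name is hypothetical).
- alert: LokiRequestLatencyPerRoute
  # Keeping job and route in the aggregation preserves them as alert labels.
  expr: |
    histogram_quantile(0.99,
      sum by (le, job, route) (
        rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])
      )
    ) > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Loki request latency ({{ $labels.job }} {{ $labels.route }})
    description: |-
      The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency
```

Grouping by job and route produces one alert per slow combination, which is more specific but may be noisier than the single cluster-wide alert.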
Impact #
- High latency can cause delays in log processing and querying, leading to poor user experience and delayed insights.
- Slow requests hold resources for longer, which can increase memory usage and, in the worst case, crash Loki components.
- Services relying on Loki for logging and monitoring may be impacted, causing cascading failures.
Diagnosis #
To diagnose the root cause of this issue:
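- Check the Loki instances for signs of high load, GC pauses, or resource contention. The alert expression aggregates away the instance, job, and route labels, so use per-instance and per-route queries to find the affected component.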
- Investigate recent changes to the Loki configuration, deployed applications, or infrastructure that may be contributing to the increased latency.
- Review the Loki request logs to identify patterns or anomalies in the requests that may be causing the latency.
- Verify that the Loki cluster is properly sized and scaled to handle the current load.
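The checks above can be supported with a few breakdown queries. The following is a sketch of a Prometheus rule group, not part of the original alert: the group name, the recorded series names, and the job matcher are assumptions to adapt to your setup. The expressions can also be pasted directly into the Prometheus expression browser.

```yaml
groups:
  - name: loki-latency-diagnosis   # hypothetical group name
    rules:
      # p99 latency per job and route, to see which route is slow.
      - record: job_route:loki_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job, route) (
              rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])
            )
          )
      # Request rate per job, route, and status code, to spot load spikes or errors.
      - record: job_route_status:loki_requests:rate5m
        expr: |
          sum by (job, route, status_code) (
            rate(loki_request_duration_seconds_count[5m])
          )
      # Longest observed GC pause per instance, from the standard Go runtime metrics.
      # The job matcher is an assumption; replace it with your Loki job names.
      - record: instance:go_gc_duration_seconds:max
        expr: max by (instance) (go_gc_duration_seconds{job=~".*loki.*", quantile="1"})
```

A single route dominating the first series usually points at a specific workload (for example heavy queries), while elevated latency across all routes suggests resource pressure on the instances themselves.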
Mitigation #
To mitigate this issue:
- Immediately investigate and address any underlying resource contention or configuration issues in the Loki instance(s).
- Consider scaling the Loki cluster vertically (more CPU or memory per instance) or horizontally (more replicas) to handle the increased load.
- Optimize Loki configuration to improve performance, such as adjusting the ingestion rate, batch size, or query concurrency.
- Implement caching or other optimization techniques to reduce the load on Loki.
- Roll back recent changes to the Loki configuration or deployed applications if deemed necessary.
- Keep monitoring the Loki request latency after the fix. If the sustained latency turns out to be normal for this deployment, adjust the alert threshold (currently 1 second) to a value that reflects it.
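As an illustration of the configuration and caching items above, here is a sketch of the relevant Loki configuration sections. The values are placeholders rather than recommendations, and option names and defaults vary between Loki versions, so check each key against the configuration reference for the version you run.

```yaml
# Sketch only: values are illustrative, not tuned for any real deployment.
limits_config:
  ingestion_rate_mb: 8          # per-tenant ingestion rate limit
  ingestion_burst_size_mb: 16   # per-tenant burst allowance
  max_query_parallelism: 32     # sub-queries scheduled in parallel per query

querier:
  max_concurrent: 8             # queries processed concurrently per querier

query_range:
  cache_results: true           # reuse results of previous range queries
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100
```

Change one setting at a time and watch the 99th percentile latency afterwards, so that the effect of each adjustment can be attributed and rolled back if it makes things worse.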