PromtailRequestLatency #
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
Alert Rule #

```yaml
alert: PromtailRequestLatency
annotations:
  description: |-
    The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/promtail-internal/promtailrequestlatency/
  summary: Promtail request latency (instance {{ $labels.instance }})
expr: histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1
for: 5m
labels:
  severity: critical
```
Meaning #
The PromtailRequestLatency alert fires when the 99th percentile latency of requests to Promtail exceeds 1 second, sustained over a 5-minute window. This means a subset of requests is taking significantly longer than expected to process, which can degrade performance or cause operational issues.
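As a mental model for the alert expression, `histogram_quantile` estimates the p99 by linear interpolation across the cumulative `le` buckets of `promtail_request_duration_seconds_bucket`. Below is a minimal Python sketch of that interpolation, for illustration only; Prometheus's own implementation additionally handles the `+Inf` bucket and other edge cases:

```python
# Illustrative sketch of histogram_quantile(q, ...) over cumulative "le" buckets.
# Not Prometheus's exact code; it ignores +Inf handling and edge cases.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Example: 1000 requests; 15 slower than 0.5s, 5 slower than 1s.
buckets = [(0.25, 900.0), (0.5, 985.0), (1.0, 995.0), (2.0, 1000.0)]
p99 = histogram_quantile(0.99, buckets)  # 0.75 -> below the 1s threshold
```

With this distribution the estimated p99 is 0.75s, so the alert would not fire; push the slow tail past 1s and it would.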
Impact #
High latency in Promtail can result in delayed ingestion and processing of logs. This may cause:
- Delayed visibility into system and application logs.
- Potential loss of critical debugging or monitoring information.
- Increased resource utilization, leading to cascading failures in downstream systems.
Diagnosis #
Confirm the Alert:
- Review the alert details and verify the affected instance(s), job, and route.
- Check the VALUE provided in the alert description to understand the severity.
Examine Metrics:
- Query the Prometheus expression:
```promql
histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le))
```
- Identify which instances or routes are contributing to high latency.
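To see which instances or routes are driving the latency, the same quantile can be grouped by additional labels (standard PromQL, assuming the default `instance` and `route` labels on this metric):

```promql
histogram_quantile(0.99,
  sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le, instance, route))
```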
Review Logs:
- Access the logs for the Promtail instances reporting high latency.
- Look for error messages, timeouts, or resource exhaustion indicators.
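For a Kubernetes deployment, log review might look like the following sketch; the namespace and label selector are assumptions to adapt to your install:

```shell
# Hypothetical namespace/selector -- adjust to your deployment.
kubectl -n loki logs -l app.kubernetes.io/name=promtail --since=15m \
  | grep -Ei 'error|timeout|deadline exceeded|429|5[0-9]{2}'
```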
Check Resource Utilization:
- Inspect CPU, memory, and disk usage for the affected Promtail instances.
- Ensure sufficient resources are allocated to handle the current log ingestion volume.
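A quick resource check could be sketched as follows (Kubernetes names are illustrative; on a plain host, inspect the process directly):

```shell
# Kubernetes: per-pod CPU/memory for Promtail (requires metrics-server).
kubectl -n loki top pods -l app.kubernetes.io/name=promtail

# Plain host: process usage and disk headroom for the positions/log files.
top -b -n 1 | grep -i promtail
df -h /var/log
```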
Network Issues:
- Investigate potential network latency or connectivity issues between Promtail and its upstream or downstream systems (e.g., Loki).
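A simple reachability and latency probe from a Promtail host toward Loki can be sketched with curl timing variables (the Loki URL is an assumption):

```shell
curl -s -o /dev/null \
  -w 'connect=%{time_connect}s total=%{time_total}s code=%{http_code}\n' \
  http://loki:3100/ready
```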
Configuration Changes:
- Review recent configuration or deployment changes that might impact Promtail’s performance.
Mitigation #
Scale Resources:
- Increase CPU, memory, or disk resources for the affected Promtail instances.
- Add additional Promtail replicas to distribute the load.
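If Promtail runs as a Deployment, adding replicas is one way to spread load; the resource names below are assumptions. DaemonSet installs instead scale with the node pool, so raise the container resource limits there:

```shell
kubectl -n loki scale deployment/promtail --replicas=4
```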
Optimize Configuration:
- Adjust scrape or ingestion settings to balance performance and resource usage.
- Ensure rate-limiting and backoff mechanisms are correctly configured.
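As one example, Promtail's client backoff toward Loki can be tuned in the `clients` section; the values below are illustrative, not recommendations:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    backoff_config:
      min_period: 500ms   # initial retry delay
      max_period: 5m      # cap on exponential backoff
      max_retries: 10     # give up after this many attempts
```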
Resolve Bottlenecks:
- Address issues in upstream systems sending logs to Promtail (e.g., excessive log volume or high-frequency writes).
- Investigate and resolve any bottlenecks in downstream systems (e.g., Loki performance).
Restart or Redeploy:
- Restart affected Promtail instances to clear transient issues.
- Roll back recent changes if the issue correlates with a new deployment.
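For a Kubernetes DaemonSet install, restart and rollback might look like this sketch (resource names are assumptions):

```shell
kubectl -n loki rollout restart daemonset/promtail
kubectl -n loki rollout undo daemonset/promtail   # revert the most recent change
```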
Long-Term Improvements:
- Implement monitoring dashboards to track Promtail performance metrics.
- Perform a root cause analysis (RCA) to address systemic issues and prevent recurrence.
Additional Information #
- Prometheus Query:

```promql
histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1
```