PromtailRequestLatency #
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
Alert Rule #

```yaml
alert: PromtailRequestLatency
annotations:
  description: |-
    The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/promtail-internal/promtailrequestlatency/
  summary: Promtail request latency (instance {{ $labels.instance }})
expr: histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1
for: 5m
labels:
  severity: critical
```
Meaning #
The PromtailRequestLatency alert fires when the 99th percentile latency of requests to Promtail exceeds 1 second, sustained over a 5-minute window. This means a subset of requests is taking significantly longer than expected to process, which can degrade performance or cause operational issues.
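As a mental model for the alert expression, `histogram_quantile` estimates the p99 by linear interpolation across the cumulative `le` buckets of `promtail_request_duration_seconds_bucket`. Below is a minimal Python sketch of that interpolation, for illustration only; Prometheus's own implementation additionally handles the `+Inf` bucket and other edge cases:

```python
# Illustrative sketch of histogram_quantile(q, ...) over cumulative "le" buckets.
# Not Prometheus's exact code; it ignores +Inf handling and edge cases.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Example: 1000 requests; 15 slower than 0.5s, 5 slower than 1s.
buckets = [(0.25, 900.0), (0.5, 985.0), (1.0, 995.0), (2.0, 1000.0)]
p99 = histogram_quantile(0.99, buckets)  # 0.75 -> below the 1s threshold
```

With this distribution the estimated p99 is 0.75s, so the alert would not fire; push the slow tail past 1s and it would.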
Impact #
High latency in Promtail can result in delayed ingestion and processing of logs. This may cause:
- Delayed visibility into system and application logs.
- Potential loss of critical debugging or monitoring information.
- Increased resource utilization, leading to cascading failures in downstream systems.
Diagnosis #
Confirm the Alert:
- Review the alert details and verify the affected instance(s), job, and route.
- Check the VALUE provided in the alert description to understand the severity.
Examine Metrics:
- Query the Prometheus expression:
```promql
histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le))
```
- Identify which instances or routes are contributing to high latency.
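To see which instances or routes are driving the latency, the same quantile can be grouped by additional labels (standard PromQL, assuming the default `instance` and `route` labels on this metric):

```promql
histogram_quantile(0.99,
  sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le, instance, route))
```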
Review Logs:
- Access the logs for the Promtail instances reporting high latency.
- Look for error messages, timeouts, or resource exhaustion indicators.
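For a Kubernetes deployment, log review might look like the following sketch; the namespace and label selector are assumptions to adapt to your install:

```shell
# Hypothetical namespace/selector -- adjust to your deployment.
kubectl -n loki logs -l app.kubernetes.io/name=promtail --since=15m \
  | grep -Ei 'error|timeout|deadline exceeded|429|5[0-9]{2}'
```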
Check Resource Utilization:
- Inspect CPU, memory, and disk usage for the affected Promtail instances.
- Ensure sufficient resources are allocated to handle the current log ingestion volume.
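A quick resource check could be sketched as follows (Kubernetes names are illustrative; on a plain host, inspect the process directly):

```shell
# Kubernetes: per-pod CPU/memory for Promtail (requires metrics-server).
kubectl -n loki top pods -l app.kubernetes.io/name=promtail

# Plain host: process usage and disk headroom for the positions/log files.
top -b -n 1 | grep -i promtail
df -h /var/log
```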
Network Issues:
- Investigate potential network latency or connectivity issues between Promtail and its upstream or downstream systems (e.g., Loki).
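A simple reachability and latency probe from a Promtail host toward Loki can be sketched with curl timing variables (the Loki URL is an assumption):

```shell
curl -s -o /dev/null \
  -w 'connect=%{time_connect}s total=%{time_total}s code=%{http_code}\n' \
  http://loki:3100/ready
```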
Configuration Changes:
- Review recent configuration or deployment changes that might impact Promtail’s performance.
Mitigation #
Scale Resources:
- Increase CPU, memory, or disk resources for the affected Promtail instances.
- Add additional Promtail replicas to distribute the load.
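If Promtail runs as a Deployment, adding replicas is one way to spread load; the resource names below are assumptions. DaemonSet installs instead scale with the node pool, so raise the container resource limits there:

```shell
kubectl -n loki scale deployment/promtail --replicas=4
```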
Optimize Configuration:
- Adjust scrape or ingestion settings to balance performance and resource usage.
- Ensure rate-limiting and backoff mechanisms are correctly configured.
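As one example, Promtail's client backoff toward Loki can be tuned in the `clients` section; the values below are illustrative, not recommendations:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    backoff_config:
      min_period: 500ms   # initial retry delay
      max_period: 5m      # cap on exponential backoff
      max_retries: 10     # give up after this many attempts
```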
Resolve Bottlenecks:
- Address issues in upstream systems sending logs to Promtail (e.g., excessive log volume or high-frequency writes).
- Investigate and resolve any bottlenecks in downstream systems (e.g., Loki performance).
Restart or Redeploy:
- Restart affected Promtail instances to clear transient issues.
- Roll back recent changes if the issue correlates with a new deployment.
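For a Kubernetes DaemonSet install, restart and rollback might look like this sketch (resource names are assumptions):

```shell
kubectl -n loki rollout restart daemonset/promtail
kubectl -n loki rollout undo daemonset/promtail   # revert the most recent change
```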
Long-Term Improvements:
- Implement monitoring dashboards to track Promtail performance metrics.
- Perform a root cause analysis (RCA) to address systemic issues and prevent recurrence.
Additional Information #
- Prometheus Query:

```promql
histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1
```