PromtailRequestErrors #
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.
Alert Rule #
alert: PromtailRequestErrors
annotations:
  description: |-
    The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/promtail-internal/promtailrequesterrors/
  summary: Promtail request errors (instance {{ $labels.instance }})
expr: 100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m]))
  by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m]))
  by (namespace, job, route, instance) > 10
for: 5m
labels:
  severity: critical
Meaning #
This alert triggers when more than 10% of Promtail requests result in 5xx errors or failed responses over a 1-minute window, sustained for at least 5 minutes. It indicates that a significant portion of requests are failing, potentially impacting log ingestion and monitoring pipelines.
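For example, if an instance handles 2 requests per second on a given route and 0.3 of those per second return 5xx, the expression evaluates to 100 × 0.3 / 2 = 15%, which exceeds the 10% threshold and fires once it has held for 5 minutes.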
Impact #
- Log Ingestion Failure: Logs may not be ingested properly, leading to gaps in observability.
- Service Degradation: Downstream services relying on logs for monitoring, debugging, or auditing could be affected.
- Increased Latency: The issue may lead to bottlenecks or retries, further stressing the system.
Diagnosis #
Key Metrics to Investigate #
- Error Rate:
- Inspect the rate of Promtail requests that return 5xx or fail:
sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance)
- Compare it with the total request rate (a combined percentage query is sketched after this list):
sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance)
- Affected Instances:
- Identify the specific instance(s), namespace(s), job(s), or route(s) showing high error rates from alert labels.
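To see the same percentage the alert computes, the full ratio (the alert expression without its threshold) can be run in the Prometheus console or Grafana Explore; this is simply the rule's expression reassembled:
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance)
  / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance)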
Logs and Debugging Steps #
- Promtail Logs:
- Look for errors or warnings in Promtail logs related to the affected instance:
kubectl logs -n <namespace> <promtail-pod> | grep -i error
- Status Codes:
- Analyze the nature of 5xx errors (a breakdown query is sketched after this list):
- Are they consistent or intermittent?
- Do they originate from a specific route or endpoint?
- Downstream Dependencies:
- Check if external services or storage backends (e.g., Loki) are causing these failures.
- Network Issues:
- Verify network connectivity between Promtail and dependent services (a quick check is sketched after this list).
- Resource Constraints:
- Check if the Promtail pods are resource-starved:
kubectl top pods -n <namespace>
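As referenced under Status Codes above, a breakdown by status code and route shows whether failures are concentrated on one endpoint or spread across all of them. This query is a sketch built from the same metric and labels the alert already uses:
sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[5m])) by (status_code, route, instance)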
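For the Downstream Dependencies and Network Issues checks, one quick test is to probe Loki's readiness endpoint from inside a Promtail pod. This is a sketch only: the loki-gateway host and port 3100 are placeholders for your actual Loki endpoint, and it assumes wget is available in the pod image:
kubectl exec -n <namespace> <promtail-pod> -- wget -qO- http://loki-gateway:3100/ready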
Mitigation #
Immediate Actions #
- Restart Promtail Pods:
- Restart the affected Promtail pods to resolve transient issues (if Promtail runs as a DaemonSet, see the note after this list):
kubectl rollout restart deployment <promtail-deployment> -n <namespace>
- Increase Resources:
- If Promtail is resource-starved, scale up resources or increase limits in the deployment:
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1
    memory: 1Gi
- Route Isolation:
- Temporarily disable or throttle problematic routes, if identifiable.
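Note that Promtail is often deployed as a DaemonSet rather than a Deployment (for example by the official Helm chart). If that is the case in your cluster, the equivalent restart is:
kubectl rollout restart daemonset <promtail-daemonset> -n <namespace>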
Long-term Fixes #
- Error Budget Monitoring:
- Set up monitoring for error budgets and thresholds to catch issues earlier.
- Improve Retries and Backoff:
- Optimize retry and backoff logic for Promtail requests (a sample backoff_config is sketched after this list).
- Dependency Health:
- Monitor and improve the stability of downstream systems (e.g., Loki, storage backends).
- Scaling:
- Ensure Promtail deployments are scaled appropriately for the workload.
- Alert Refinement:
- Review alert thresholds and adjust if necessary to reduce false positives.
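For the retries and backoff item above, Promtail's client configuration exposes a backoff_config block. The snippet below is an illustrative sketch, not a recommendation: the push URL is a placeholder, and the values are close to Promtail's defaults and should be tuned to your workload:
clients:
  - url: http://loki-gateway:3100/loki/api/v1/push
    backoff_config:
      min_period: 500ms
      max_period: 5m
      max_retries: 10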