PromtailRequestErrors #
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.
Alert Rule #
alert: PromtailRequestErrors
annotations:
  description: |-
    The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/promtail-internal/promtailrequesterrors/
  summary: Promtail request errors (instance {{ $labels.instance }})
expr: 100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m]))
  by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m]))
  by (namespace, job, route, instance) > 10
for: 5m
labels:
  severity: critical
Meaning #
This alert triggers when more than 10% of Promtail requests result in 5xx errors or failed responses over a 1-minute window, sustained for at least 5 minutes. It indicates that a significant portion of requests are failing, potentially impacting log ingestion and monitoring pipelines.
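For example, if an instance handles 2 requests per second on a given route and 0.3 of those per second return 5xx, the expression evaluates to 100 × 0.3 / 2 = 15%, which exceeds the 10% threshold and fires once it has held for 5 minutes.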
Impact #
- Log Ingestion Failure: Logs may not be ingested properly, leading to gaps in observability.
- Service Degradation: Downstream services relying on logs for monitoring, debugging, or auditing could be affected.
- Increased Latency: The issue may lead to bottlenecks or retries, further stressing the system.
Diagnosis #
Key Metrics to Investigate #
- Error Rate:
- Inspect the rate of Promtail requests that return 5xx or fail:
sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance)
- Compare it with the total request rate (a combined percentage query is sketched after this list):
sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance)
- Affected Instances:
- Identify the specific instance(s), namespace(s), job(s), or route(s) showing high error rates from alert labels.
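To see the same percentage the alert computes, the full ratio (the alert expression without its threshold) can be run in the Prometheus console or Grafana Explore; this is simply the rule's expression reassembled:
100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance)
  / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance)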
Logs and Debugging Steps #
- Promtail Logs:
- Look for errors or warnings in Promtail logs related to the affected instance:
kubectl logs -n <namespace> <promtail-pod> | grep -i error
- Status Codes:
- Analyze the nature of 5xx errors (a breakdown query is sketched after this list):
- Are they consistent or intermittent?
- Do they originate from a specific route or endpoint?
- Downstream Dependencies:
- Check if external services or storage backends (e.g., Loki) are causing these failures.
- Network Issues:
- Verify network connectivity between Promtail and dependent services (a quick check is sketched after this list).
- Resource Constraints:
- Check if the Promtail pods are resource-starved:
kubectl top pods -n <namespace>
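As referenced under Status Codes above, a breakdown by status code and route shows whether failures are concentrated on one endpoint or spread across all of them. This query is a sketch built from the same metric and labels the alert already uses:
sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[5m])) by (status_code, route, instance)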
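For the Downstream Dependencies and Network Issues checks, one quick test is to probe Loki's readiness endpoint from inside a Promtail pod. This is a sketch only: the loki-gateway host and port 3100 are placeholders for your actual Loki endpoint, and it assumes wget is available in the pod image:
kubectl exec -n <namespace> <promtail-pod> -- wget -qO- http://loki-gateway:3100/ready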
Mitigation #
Immediate Actions #
- Restart Promtail Pods:
- Restart the affected Promtail pods to resolve transient issues (if Promtail runs as a DaemonSet, see the note after this list):
kubectl rollout restart deployment <promtail-deployment> -n <namespace>
- Increase Resources:
- If Promtail is resource-starved, scale up resources or increase limits in the deployment:
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1
    memory: 1Gi
- Route Isolation:
- Temporarily disable or throttle problematic routes, if identifiable.
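Note that Promtail is often deployed as a DaemonSet rather than a Deployment (for example by the official Helm chart). If that is the case in your cluster, the equivalent restart is:
kubectl rollout restart daemonset <promtail-daemonset> -n <namespace>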
Long-term Fixes #
- Error Budget Monitoring:
- Set up monitoring for error budgets and thresholds to catch issues earlier.
- Improve Retries and Backoff:
- Optimize retry and backoff logic for Promtail requests (a sample backoff_config is sketched after this list).
- Dependency Health:
- Monitor and improve the stability of downstream systems (e.g., Loki, storage backends).
- Scaling:
- Ensure Promtail deployments are scaled appropriately for the workload.
- Alert Refinement:
- Review alert thresholds and adjust if necessary to reduce false positives.
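For the retries and backoff item above, Promtail's client configuration exposes a backoff_config block. The snippet below is an illustrative sketch, not a recommendation: the push URL is a placeholder, and the values are close to Promtail's defaults and should be tuned to your workload:
clients:
  - url: http://loki-gateway:3100/loki/api/v1/push
    backoff_config:
      min_period: 500ms
      max_period: 5m
      max_retries: 10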