LokiProcessTooManyRestarts #
A loki process had too many restarts (target {{ $labels.instance }})
Alert Rule
alert: LokiProcessTooManyRestarts
annotations:
  description: |-
    A loki process had too many restarts (target {{ $labels.instance }})
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/loki-internal/lokiprocesstoomanyrestarts/
  summary: Loki process too many restarts (instance {{ $labels.instance }})
expr: changes(process_start_time_seconds{job=~".*loki.*"}[15m]) > 2
for: 0m
labels:
  severity: warning
Meaning #
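The “LokiProcessTooManyRestarts” alert fires when process_start_time_seconds for a target whose job name contains "loki" changes more than 2 times within a 15-minute window, which means the process has restarted at least 3 times. Because the rule uses for: 0m, the alert fires as soon as the expression becomes true. Restarts this frequent indicate that the Loki process is unstable: it is repeatedly crashing or being stopped rather than running continuously.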
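The counting logic behind the rule can be illustrated with a minimal sketch. It assumes a list of scraped process_start_time_seconds values from one 15-minute window; the function names and sample values are hypothetical, and this is an approximation of what PromQL changes() does, not Prometheus's implementation. Several restarts that happen between two scrapes would show up as a single change.

```python
def count_changes(samples):
    """Count how many times the value differs between consecutive samples.

    Each restart gives the process a new start time, so each observed
    restart appears as one value change in the series.
    """
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def alert_fires(samples, threshold=2):
    """True when the window holds more changes than the rule's threshold (> 2)."""
    return count_changes(samples) > threshold


# Hypothetical scrapes of process_start_time_seconds within one 15-minute window.
two_restarts = [1000.0, 1000.0, 1300.0, 1300.0, 1600.0]
three_restarts = [1000.0, 1300.0, 1300.0, 1600.0, 1850.0]

print(alert_fires(two_restarts))    # False: 2 changes is not > 2
print(alert_fires(three_restarts))  # True: 3 changes
```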
Impact #
While a Loki process is restarting, it may be unable to ingest and store log data correctly. This can leave gaps in log data and cause log-based alerting and monitoring to miss events or behave unreliably. Repeated restarts can also increase latency and reduce performance across the logging pipeline.
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Loki process logs, especially the entries just before each restart, for errors or panics that may be causing the restarts.
- Verify that the Loki configuration is correct and that there are no issues with the underlying infrastructure (e.g. disk space, network connectivity).
- Check the system metrics (e.g. CPU, memory, disk usage) to see if there are any resource constraints that may be contributing to the restarts.
- Check the Loki process status to confirm whether it is currently running and, where available, how it last exited.
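As a starting point for the steps above, it can help to list which targets are restarting and how often. The sketch below runs the alert expression without its "> 2" threshold through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The server address and the function names are assumptions for illustration; adjust them to your environment.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this address; adjust for your setup.
PROMETHEUS_URL = "http://localhost:9090"

# Same selector and window as the alert rule, without the "> 2" threshold,
# so every matching Loki target is listed with its change count.
QUERY = 'changes(process_start_time_seconds{job=~".*loki.*"}[15m])'


def build_query_url(base_url, query):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def restarts_by_instance(payload):
    """Map each instance label to its change count from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    counts = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<unknown>")
        # Instant-vector samples are [timestamp, "value-as-string"].
        counts[instance] = float(series["value"][1])
    return counts


def fetch_restarts(base_url=PROMETHEUS_URL, query=QUERY, timeout=10):
    """Run the query against a live Prometheus server (needs network access)."""
    url = build_query_url(base_url, query)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return restarts_by_instance(json.load(resp))
```

Calling fetch_restarts() against a live server returns a dictionary of instance to restart count, which shows whether one target is affected or all of them. A single failing instance tends to point at a local problem such as disk or memory on that host, while restarts across all instances more often suggest a shared cause such as configuration.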
Mitigation #
To mitigate the issue, follow these steps:
- Restart the Loki process and verify that it starts cleanly and stays running.
- Check and update the Loki configuration to ensure that it is correct and up-to-date.
- Investigate and resolve any underlying infrastructure issues (e.g. disk space, network connectivity).
- Consider increasing the resources allocated to the Loki process (e.g. increasing the available memory or CPU).
- Implement additional logging and monitoring to detect and alert on Loki process restarts.
Note: For more detailed steps and troubleshooting guides, please refer to the Loki documentation and the referenced runbook URL.