PrometheusTooManyRestarts #
Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.
Alert Rule
alert: PrometheusTooManyRestarts
annotations:
  description: |-
    Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/prometheus-self-monitoring-internal/prometheustoomanyrestarts/
  summary: Prometheus too many restarts (instance {{ $labels.instance }})
expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
for: 0m
labels:
  severity: warning
Meaning #
This alert triggers when a Prometheus component (Prometheus server, Pushgateway, or Alertmanager) restarts more than twice within a 15-minute window. Frequent restarts can indicate underlying issues such as configuration errors, resource constraints, or software bugs.
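To make the rule's logic concrete, here is a minimal Python sketch of what the expression evaluates. It mimics the PromQL changes() function on a list of scraped process_start_time_seconds samples and applies the > 2 threshold. The helper names and the sample values are illustrative assumptions, not part of Prometheus.

```python
from typing import Sequence


def count_changes(samples: Sequence[float]) -> int:
    """Count how often consecutive samples differ, as PromQL changes() does."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def too_many_restarts(samples: Sequence[float], threshold: int = 2) -> bool:
    """Mirror the alert condition: changes(...) > 2."""
    return count_changes(samples) > threshold


# Hypothetical start-time samples from one 15-minute window.
# Every new value means the process came up again with a new start time.
samples = [1000.0, 1000.0, 1200.0, 1200.0, 1450.0, 1700.0]
print(count_changes(samples))      # 3
print(too_many_restarts(samples))  # True
```

Because the threshold is strictly greater than 2, two restarts inside the window do not fire the alert; the third one does.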
Impact #
Frequent restarts of Prometheus components can lead to:
- Gaps in metrics collection, causing incomplete data.
- Delayed or missed alerts, impacting monitoring reliability.
- Potential instability in the monitoring infrastructure.
Diagnosis #
Confirm the Alert:
- Check the alert details to identify the affected instance ({{ $labels.instance }}) and the specific component (job).
- Note the value of process_start_time_seconds changes to confirm frequent restarts.
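The confirmation step can be sketched as a small script against the Prometheus HTTP API instant-query endpoint (/api/v1/query). The base URL is an assumption, and the function names are hypothetical helpers. If the restarting component is the Prometheus server you would normally query, point the script at another healthy instance instead.

```python
import json
import urllib.parse
import urllib.request

# Left-hand side of the alert expression, without the "> 2" filter,
# so that every matching series is returned with its restart count.
ALERT_EXPR = (
    'changes(process_start_time_seconds'
    '{job=~"prometheus|pushgateway|alertmanager"}[15m])'
)


def build_query_url(base_url: str, expr: str = ALERT_EXPR) -> str:
    """Build the instant-query URL for the given expression."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_restarts(payload: dict) -> dict:
    """Map (job, instance) to the restart count in an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    counts = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # the value arrives as a string
        counts[(labels.get("job", ""), labels.get("instance", ""))] = float(value)
    return counts


def fetch_restarts(base_url: str) -> dict:
    """Run the query against a live server (base_url is an assumption)."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=10) as resp:
        return parse_restarts(json.load(resp))
```

A call such as fetch_restarts("http://localhost:9090") would return one entry per component instance; any count above 2 matches the alert condition.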
Review Logs:
- Access the logs of the affected component (Prometheus server, Pushgateway, or Alertmanager).
- Look for recurring errors or exceptions that might indicate the cause of the restarts.
Check Resource Utilization:
- Monitor CPU, memory, and disk usage on the affected instance.
- Look for spikes or exhaustion of resources.
Review Configuration:
- Check for recent changes to the component’s configuration files.
- Validate configurations for syntax errors or invalid settings.
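For the validation step, a minimal sketch that wraps the checkers shipped with the components: promtool check config for the Prometheus server and amtool check-config for Alertmanager. Pushgateway is configured through command-line flags only, so there is no config file to validate. The helper names and file paths are assumptions, and the tools must be on the PATH of the machine that runs the script.

```python
import subprocess
from typing import List

# Validation commands per job label; Pushgateway is deliberately absent
# because it has no configuration file.
VALIDATORS = {
    "prometheus": ["promtool", "check", "config"],
    "alertmanager": ["amtool", "check-config"],
}


def validation_command(job: str, config_path: str) -> List[str]:
    """Return the command that validates config_path for the given job."""
    if job not in VALIDATORS:
        raise ValueError(f"no config validator known for job {job!r}")
    return VALIDATORS[job] + [config_path]


def validate(job: str, config_path: str) -> bool:
    """Run the validator and report whether it exited with status 0."""
    result = subprocess.run(
        validation_command(job, config_path),
        capture_output=True,
        text=True,
    )
    print(result.stdout + result.stderr)
    return result.returncode == 0
```

For example, validate("prometheus", "/etc/prometheus/prometheus.yml") uses a typical but assumed path; substitute the file your deployment actually loads, and run the check before restarting the component.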
Inspect Dependencies:
- Verify network connectivity and DNS resolution for any external dependencies.
- Ensure that required storage backends or remote write endpoints are operational.
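The dependency checks above can be approximated with a short probe that separates DNS failures from connection failures. This is a sketch using only the standard library. The function name is hypothetical, and a successful TCP connect only shows that the port is reachable, not that the remote service is healthy.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> str:
    """Return "ok", or a short reason when DNS or the TCP connect fails."""
    try:
        # Step 1: name resolution only.
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        return f"dns failure: {exc}"
    try:
        # Step 2: open and immediately close a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return f"connect failure: {exc}"
```

Run it from the affected host or container against each remote write endpoint or storage backend the component depends on, so that the probe takes the same network path as the component itself.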
Mitigation #
Resolve Immediate Issues:
- If resource constraints are identified, scale up the resources (e.g., increase CPU, memory, or disk).
- If configuration errors are found, correct and validate the configurations before restarting the component.
Apply Fixes:
- Address any software bugs by upgrading to a stable and supported version of the component.
- Fix network or dependency issues if applicable.
Restart the Component:
- After addressing the root cause, restart the affected component manually to stabilize its state.
Monitor Post-Mitigation:
- Ensure the component stabilizes and the alert clears.
- Monitor for recurring issues to validate the effectiveness of the fix.
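One simple way to judge stabilization is the component's uptime, which in PromQL is time() - process_start_time_seconds. The sketch below does the same arithmetic in Python, assuming both inputs are Unix timestamps in seconds. The function names and the example timestamps are illustrative.

```python
def uptime_seconds(process_start_time: float, now: float) -> float:
    """Seconds since the process started."""
    return now - process_start_time


def restart_free_for(process_start_time: float, now: float,
                     window_seconds: float = 15 * 60) -> bool:
    """True when the process has been up for at least the alert window."""
    return uptime_seconds(process_start_time, now) >= window_seconds


# Hypothetical values: the process started 20 minutes before "now".
print(restart_free_for(1_700_000_000, 1_700_001_200))  # True
```

Once uptime exceeds the 15-minute window, no restart remains inside the range the alert expression looks at, so the alert should have cleared.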
Document Findings:
- Record the root cause, mitigation steps, and any follow-up actions in your incident management system.