CortexIngesterUnhealthy
Cortex has an unhealthy ingester
Alert Rule
alert: CortexIngesterUnhealthy
annotations:
  description: |-
    Cortex has an unhealthy ingester
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/cortex-internal/cortexingesterunhealthy/
  summary: Cortex ingester unhealthy (instance {{ $labels.instance }})
expr: cortex_ring_members{state="Unhealthy", name="ingester"} > 0
for: 0m
labels:
  severity: critical
Meaning
The CortexIngesterUnhealthy alert fires when the cortex_ring_members metric reports one or more members of the ingester ring in the Unhealthy state. Because the rule uses for: 0m, it fires on the first evaluation that sees a nonzero value, with no grace period. Treat it as critical: unhealthy ingesters can cause failed writes, data loss, and query errors.
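To make the rule's matching behavior concrete, here is a minimal Python sketch of how the expression selects series. The sample data, instance names, and the `firing_series` helper are all hypothetical; only the label matchers and the `> 0` comparison come from the rule itself.

```python
# Illustrative only: mimics the label matchers and "> 0" filter of the alert
# expression over hand-written samples. This is not how Prometheus is implemented.

def firing_series(samples):
    """Return samples matching state="Unhealthy", name="ingester" with value > 0."""
    return [
        s for s in samples
        if s["labels"].get("state") == "Unhealthy"
        and s["labels"].get("name") == "ingester"
        and s["value"] > 0
    ]

# Hypothetical snapshot of cortex_ring_members at one evaluation.
samples = [
    {"labels": {"name": "ingester", "state": "ACTIVE", "instance": "distributor-0"}, "value": 3},
    {"labels": {"name": "ingester", "state": "Unhealthy", "instance": "distributor-0"}, "value": 1},
    {"labels": {"name": "ingester", "state": "Unhealthy", "instance": "distributor-1"}, "value": 0},
    {"labels": {"name": "ruler", "state": "Unhealthy", "instance": "ruler-0"}, "value": 2},
]

for s in firing_series(samples):
    print(s["labels"]["instance"], s["value"])
```

Only the second sample matches: the first is in the wrong state, the third has a value of zero, and the fourth belongs to a different ring.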
Impact
The impact of an unhealthy ingester can be severe, leading to:
- Data loss or corruption
- Errors in query results
- Increased latency and timeouts
- Reduced system reliability and availability
Diagnosis
To diagnose the issue, follow these steps:
- Check the Cortex ring membership to identify the unhealthy ingester(s)
- Review the ingester logs to determine the cause of the unhealthy state
- Check the system metrics (e.g. CPU, memory, disk usage) to identify any resource issues
- Verify that the ingester is configured correctly and is running the expected version
- Check for any network connectivity issues between the ingester and other Cortex components
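The first diagnosis step can be partly scripted by running the alert expression against the Prometheus HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses a response body, so it runs without network access. The base URL, the sample response, and both helper names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode

# The expression from the alert rule.
ALERT_EXPR = 'cortex_ring_members{state="Unhealthy", name="ingester"} > 0'

def build_query_url(base_url, expr=ALERT_EXPR):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

def unhealthy_by_instance(response_body):
    """Map each reporting instance to the unhealthy member count it sees."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    counts = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "<unknown>")
        _timestamp, value = series["value"]  # the value arrives as a string
        counts[instance] = int(float(value))
    return counts

# Hypothetical response body in the instant-query vector format.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"name": "ingester", "state": "Unhealthy",
                           "instance": "distributor-0"},
                "value": [1700000000, "1"],
            }
        ],
    },
})

print(build_query_url("http://prometheus.example:9090"))
print(unhealthy_by_instance(sample_body))
```

Note that the `instance` label likely identifies the component that reports the ring state rather than the unhealthy ingester itself, so the ring membership view is still needed to find the affected ingester.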
Mitigation
To mitigate the issue, follow these steps:
- Restart the unhealthy ingester(s) and check whether they return to a healthy state in the ring
- If the issue persists, investigate and resolve any underlying causes (e.g. resource issues, configuration errors)
- If necessary, replace the unhealthy ingester with a new instance
- Verify that the Cortex cluster is functioning correctly and data is being ingested properly
- Monitor the system closely to ensure the issue does not recur
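The last two steps, verifying recovery and watching for recurrence, could be sketched as a small polling helper that reports success only after the unhealthy count has stayed at zero for several consecutive checks. The function name, its parameters, and the canned readings are hypothetical. In practice `get_unhealthy_count` would wrap a real metrics query.

```python
import time

def wait_until_stable(get_unhealthy_count, required_clean_checks=3,
                      interval_s=30.0, max_checks=20, sleep=time.sleep):
    """Return True once the count is 0 for `required_clean_checks` checks in a row.

    Returns False if that does not happen within `max_checks` checks.
    """
    clean = 0
    for attempt in range(max_checks):
        if get_unhealthy_count() == 0:
            clean += 1
            if clean >= required_clean_checks:
                return True
        else:
            clean = 0  # a relapse resets the streak
        if attempt < max_checks - 1:
            sleep(interval_s)
    return False

# Canned readings standing in for successive metric queries (no real waiting).
readings = iter([1, 0, 0, 1, 0, 0, 0])
recovered = wait_until_stable(lambda: next(readings), sleep=lambda _seconds: None)
print("stable" if recovered else "still unhealthy")
```

Requiring a streak of clean checks, rather than a single zero reading, avoids declaring recovery while an ingester is still flapping between states.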
Note: This is a sample runbook; tailor it to your specific environment and requirements.