CortexNotificationAreBeingDropped #
Cortex notifications are being dropped due to errors (instance {{ $labels.instance }})
Alert Rule
alert: CortexNotificationAreBeingDropped
annotations:
  description: |-
    Cortex notification are being dropped due to errors (instance {{ $labels.instance }})
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/cortex-internal/cortexnotificationarebeingdropped/
  summary: Cortex notification are being dropped (instance {{ $labels.instance }})
expr: rate(cortex_prometheus_notifications_dropped_total[5m]) > 0
for: 0m
labels:
  severity: critical
Meaning #
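The CortexNotificationAreBeingDropped alert fires when rate(cortex_prometheus_notifications_dropped_total[5m]) is greater than 0, that is, when the counter of dropped notifications has increased within the last 5 minutes. Because the rule sets for: 0m, it fires on the first evaluation that meets this condition. A non-zero rate indicates that some Cortex notifications are not being processed successfully, which can lead to missed alerts or incomplete data.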
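To make the condition concrete, the following is a minimal sketch of how a per-second rate can be derived from counter samples and compared against 0. It is a simplification: PromQL's rate() also extrapolates to the window boundaries, which this sketch does not attempt. The names `Sample`, `simple_rate`, and `alert_fires` are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value); hypothetical type for this sketch
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Approximate per-second increase of a counter across the given samples.

    Simplified on purpose: counter resets are handled, but there is no
    extrapolation to the window boundaries as in PromQL's rate().
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A decrease means the counter was reset (for example after a
        # restart), so the new value is counted from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def alert_fires(samples: List[Sample]) -> bool:
    """Mirror the rule's condition: rate(...) > 0."""
    return simple_rate(samples) > 0
```

With samples covering a 5-minute window, a single dropped notification is enough to make `alert_fires` return True, which matches the rule's behavior of firing on any increase.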
Impact #
This alert has severity critical because dropped notifications may result in:
- Missed alerting opportunities, leading to delayed or missed responses to critical issues
- Incomplete or inaccurate data, affecting the reliability of monitoring and analytics
- Increased risk of system downtime or performance degradation due to undetected issues
Diagnosis #
To diagnose the root cause of the issue, follow these steps:
- Check the Cortex logs for errors related to notification processing
- Investigate the instance specified in the alert label ({{ $labels.instance }}) for any configuration issues or errors
- Verify that the notification pipeline is properly configured and functional
- Check the network connectivity and infrastructure supporting the notification pipeline
- Review the cortex_prometheus_notifications_dropped_total metric to identify any patterns or trends in the dropped notifications
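The metric review in the last step can be scripted. The sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and extracts the instances with a non-zero drop rate from the JSON response. The helper names, the `sum by (instance)` grouping, and any base URL passed in are assumptions for illustration; adjust them to the query endpoint of your deployment.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# PromQL: per-instance drop rate over the same 5m window the alert uses.
DROP_RATE_QUERY = (
    "sum by (instance) "
    "(rate(cortex_prometheus_notifications_dropped_total[5m]))"
)


def build_query_url(base_url: str, query: str = DROP_RATE_QUERY) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def dropping_instances(payload: str) -> Dict[str, float]:
    """Return {instance: drop rate} for instances with a non-zero rate."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    rates: Dict[str, float] = {}
    for series in body["data"]["result"]:
        # Instant vectors carry "value": [timestamp, "<value as string>"].
        rate = float(series["value"][1])
        if rate > 0:
            rates[series["metric"].get("instance", "unknown")] = rate
    return rates
```

The response body can be fetched with, for example, `urllib.request.urlopen(build_query_url(base_url)).read()` and passed to `dropping_instances`. Keeping URL construction and parsing separate from the network call makes both easy to check against a saved response.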
Mitigation #
To mitigate the issue, take the following steps:
- Fix any configuration issues or errors in the Cortex instance specified in the alert label
- Restart the Cortex service to ensure a clean slate for notification processing
- Verify that the notification pipeline is properly configured and functional
- Implement additional logging or monitoring to detect and alert on notification processing errors
- Consider increasing the notification queue capacity or retry mechanisms to reduce the likelihood of dropped notifications
- Review and optimize the notification pipeline to minimize processing errors and improve overall reliability
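To illustrate why queue capacity matters for the steps above, here is a toy model of a bounded notification queue. It assumes a queue that discards its oldest entries once full. That is an assumption made for illustration, not a description of Cortex internals, and the class name `BoundedNotificationQueue` is hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List


class BoundedNotificationQueue:
    """Toy queue that discards the oldest entries once it is full."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: Deque[str] = deque()
        # Plays the role of a *_dropped_total counter in this model.
        self.dropped_total = 0

    def enqueue(self, notifications: Iterable[str]) -> None:
        """Add notifications, evicting the oldest ones when the queue is full."""
        for notification in notifications:
            if len(self.items) == self.capacity:
                self.items.popleft()
                self.dropped_total += 1
            self.items.append(notification)

    def drain(self, batch_size: int) -> List[str]:
        """Remove and return up to batch_size notifications, as a sender would."""
        batch: List[str] = []
        while self.items and len(batch) < batch_size:
            batch.append(self.items.popleft())
        return batch
```

In this model, a burst of five notifications into a queue of capacity 3 drops two of them, while a capacity of 5 drops none. A larger queue only absorbs bursts, though: if the sender cannot drain the queue fast enough over time, drops resume, so the underlying delivery errors still need to be fixed.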