CortexNotificationAreBeingDropped #

Cortex notifications are being dropped due to errors (instance {{ $labels.instance }})

Alert Rule

alert: CortexNotificationAreBeingDropped
annotations:
  description: |-
    Cortex notifications are being dropped due to errors (instance {{ $labels.instance }})
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/cortex-internal/cortexnotificationarebeingdropped/
  summary: Cortex notifications are being dropped (instance {{ $labels.instance }})
expr: rate(cortex_prometheus_notifications_dropped_total[5m]) > 0
for: 0m
labels:
  severity: critical

Meaning #

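The CortexNotificationAreBeingDropped alert fires when the per-second rate of cortex_prometheus_notifications_dropped_total, computed over the last 5 minutes, is greater than 0. Because the rule sets for: 0m, it fires on the first evaluation in which at least one dropped notification falls inside that window. A dropped notification is typically an alert that the Cortex ruler could not deliver to Alertmanager because of an error, so the people who should be notified about the underlying problem may never hear about it.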
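
As a hedged illustration of this threshold, the sketch below uses the promtool rule unit-test file format to show that a single increment of the counter inside the 5-minute window makes the expression positive, while a counter that stays flat does not. The instance names and sample values are invented for the example. It uses `> bool 0` instead of the alert's `> 0` only so that both series return a result of 0 or 1.

```yaml
# Sketch of a promtool unit test (run with: promtool test rules <file>).
# Instance names and sample values are made up for illustration.
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One notification is dropped at the 3-minute mark.
      - series: 'cortex_prometheus_notifications_dropped_total{instance="ruler-a:8080"}'
        values: '0 0 0 1 1 1'
      # The counter is non-zero but never increases, so nothing was dropped in the window.
      - series: 'cortex_prometheus_notifications_dropped_total{instance="ruler-b:8080"}'
        values: '7 7 7 7 7 7'
    promql_expr_test:
      - expr: rate(cortex_prometheus_notifications_dropped_total[5m]) > bool 0
        eval_time: 5m
        exp_samples:
          # The rate is positive, so the comparison yields 1 and the alert would fire.
          - labels: '{instance="ruler-a:8080"}'
            value: 1
          # The rate is zero, so the comparison yields 0 and the alert would not fire.
          - labels: '{instance="ruler-b:8080"}'
            value: 0
```

The absolute value of the counter does not matter here. Only an increase inside the window triggers the alert.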

Impact #

This alert has critical severity because dropped notifications can lead to:

Missed or delayed alerts, so critical issues are responded to late or not at all
Incomplete or inaccurate alert data downstream, reducing the reliability of monitoring and analytics
Increased risk of system downtime or performance degradation due to undetected issues

Diagnosis #

To diagnose the root cause of the issue, follow these steps:

Check the Cortex logs for errors related to sending or processing notifications
Investigate the instance named in the alert's instance label ({{ $labels.instance }}) for configuration issues or errors
Verify that the notification pipeline is configured correctly and working end to end, including the Alertmanager endpoint that Cortex sends notifications to
Check the network connectivity and infrastructure supporting the notification pipeline
Review the cortex_prometheus_notifications_dropped_total metric to identify any patterns or trends in the dropped notifications
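
The last step can be sketched as a small group of recording rules that break the drops down by instance and put them next to send errors and queue pressure. This is only an illustration: the group and rule names are hypothetical, and every metric other than cortex_prometheus_notifications_dropped_total is assumed to follow the Prometheus notifier naming with a cortex_ prefix, so confirm on the instance's /metrics endpoint that they exist before relying on them.

```yaml
groups:
  - name: cortex-notification-diagnosis   # hypothetical group name
    rules:
      # Which instances are dropping notifications, and how fast.
      - record: instance:cortex_prometheus_notifications_dropped:rate5m
        expr: sum by (instance) (rate(cortex_prometheus_notifications_dropped_total[5m]))

      # Fraction of send attempts that failed, per instance and Alertmanager
      # (assumed metric names; a high ratio points at the Alertmanager side or the network).
      - record: instance_alertmanager:cortex_prometheus_notifications_errors:ratio_rate5m
        expr: |
          sum by (instance, alertmanager) (rate(cortex_prometheus_notifications_errors_total[5m]))
          /
          sum by (instance, alertmanager) (rate(cortex_prometheus_notifications_sent_total[5m]))

      # How full the notification queue is (assumed metric names;
      # values near 1 suggest the queue is overflowing).
      - record: instance:cortex_prometheus_notifications_queue:utilization
        expr: |
          sum by (instance) (cortex_prometheus_notifications_queue_length)
          /
          sum by (instance) (cortex_prometheus_notifications_queue_capacity)
```

The same expressions can be run as ad hoc queries instead of recording rules. A high error ratio with a mostly empty queue suggests delivery failures, while a queue close to capacity suggests that notifications are produced faster than they can be sent.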

Mitigation #

To mitigate the issue, take the following steps:

Fix any configuration issues or errors in the Cortex instance specified in the alert label
If drops continue after the configuration is fixed, restart the affected Cortex instance so that notification processing starts from a clean state
Confirm that the notification pipeline is working again by checking that the rate of cortex_prometheus_notifications_dropped_total has returned to 0
Add logging or monitoring that detects and alerts on notification processing errors before notifications are dropped
Consider increasing the notification queue capacity or tuning retry behavior to reduce the likelihood of dropped notifications
Review and optimize the notification pipeline to minimize processing errors and improve overall reliability
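
As a rough sketch of the queue-capacity suggestion above, the fragment below shows where these settings are assumed to live in the ruler section of a Cortex configuration file. The Alertmanager address is hypothetical and the values are illustrative rather than recommendations. Check the key names against the configuration reference for the Cortex version in use.

```yaml
# Sketch of the ruler section of a Cortex configuration file.
# Values are illustrative, not recommendations.
ruler:
  # Hypothetical Alertmanager address; replace with the real endpoint.
  alertmanager_url: http://alertmanager.example.internal:9093
  # Capacity of the queue for notifications waiting to be sent
  # (CLI flag: -ruler.notification-queue-capacity).
  notification_queue_capacity: 20000
  # How long to wait when sending a notification before giving up
  # (CLI flag: -ruler.notification-timeout).
  notification_timeout: 15s
```

A larger queue only buys time during bursts. If the Alertmanager endpoint is unreachable or rejecting requests, notifications will still be dropped until that is fixed.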