RabbitmqDown #
RabbitMQ node down
Alert Rule
alert: RabbitmqDown
annotations:
  description: |-
    RabbitMQ node down
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/kbudde-rabbitmq-exporter/rabbitmqdown/
  summary: RabbitMQ down (instance {{ $labels.instance }})
expr: rabbitmq_up == 0
for: 0m
labels:
  severity: critical
Meaning #
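The RabbitmqDown alert fires when the rabbitmq_up metric has a value of 0, indicating that the RabbitMQ node is down. Because the rule sets for: 0m, it fires on the first rule evaluation that sees this value, with no grace period. This alert is critical and requires immediate attention.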
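To confirm which nodes the alert expression currently matches, the same expression can be run against the Prometheus HTTP API. The sketch below is a minimal illustration, not part of the original runbook: it assumes a Prometheus server reachable at a base URL you supply, and the function names `query_prometheus` and `down_instances` are hypothetical helpers. Only the `/api/v1/query` endpoint and its instant-vector response shape come from Prometheus itself.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def down_instances(payload: dict) -> List[str]:
    """Return the sorted instance labels of every series whose sample value is 0.

    `payload` is the decoded JSON body of a Prometheus instant query.
    """
    if payload.get("status") != "success":
        raise ValueError(
            "Prometheus query failed: %s" % payload.get("error", "unknown error")
        )
    down = []
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value as string"].
        _timestamp, value = series["value"]
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<no instance label>"))
    return sorted(down)


def query_prometheus(base_url: str, expr: str = "rabbitmq_up == 0",
                     timeout: float = 5.0) -> dict:
    """Run an instant query; `base_url` is an assumption about your setup."""
    url = base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode(
        {"query": expr}
    )
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With a reachable server, `down_instances(query_prometheus("http://prometheus.example:9090"))` would list the affected instances; the host name here is a placeholder.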
Impact #
The impact is severe because a node outage affects the overall availability and reliability of the RabbitMQ cluster. A down RabbitMQ node can lead to:
- Unavailable queues on the affected node, with potential loss of non-durable queues and non-persistent messages
- Unavailability of critical business services that rely on RabbitMQ
- Increased latency and errors in dependent applications
- Potential security breaches due to unprocessed messages
Diagnosis #
To diagnose the issue, follow these steps:
- Check the RabbitMQ node’s logs for errors or warning signs of failure
- Verify that the RabbitMQ node is reachable and responding to requests
- Check the RabbitMQ cluster’s overall health and status
- Investigate any recent changes or deployments that may have caused the issue
- Check for any signs of resource exhaustion (e.g., CPU, memory, disk space)
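The reachability step above can be sketched as a plain TCP probe. This is a minimal sketch under stated assumptions: it uses RabbitMQ's default listener ports (5672 for AMQP, 15672 for the management plugin), which your deployment may override, and the helper names `port_open` and `probe_node` are hypothetical. A successful connect only shows that something is listening; it does not prove the broker is healthy, so log and cluster-status checks (for example `rabbitmq-diagnostics status` and `rabbitmqctl cluster_status` on the host) are still needed.

```python
import socket
from typing import Dict

# Default RabbitMQ listener ports; adjust if your deployment overrides them.
DEFAULT_PORTS = {"amqp": 5672, "management": 15672}


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_node(host: str, ports: Dict[str, int] = DEFAULT_PORTS,
               timeout: float = 3.0) -> Dict[str, bool]:
    """Probe each named port on `host` and report which ones accept connections."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}
```

For example, `probe_node("rmq-1.internal")` (a placeholder host name) returns a mapping such as `{"amqp": ..., "management": ...}`. If both ports are closed while the host still answers on other ports, the broker process itself has likely stopped.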
Mitigation #
To mitigate the issue, follow these steps:
- Immediate Action: Restart the RabbitMQ node to attempt to restore service
- Short-term Solution: Failover to a standby node or activate a backup cluster (if available)
- Root Cause Analysis: Perform a thorough analysis to identify the root cause of the failure
- Corrective Action: Implement fixes or patches to prevent similar failures in the future
- Verify Recovery: Monitor the RabbitMQ node and cluster to ensure that they have fully recovered and are functioning normally
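For the recovery check, a single successful probe right after a restart can be misleading if the node is flapping. One possible approach, sketched here as an assumption rather than a prescribed procedure, is to require several consecutive successful checks before declaring recovery. The function `wait_until_stable` and its parameters are hypothetical; `check` can be any zero-argument callable that returns True when the node looks healthy.

```python
import time
from typing import Callable


def wait_until_stable(check: Callable[[], bool],
                      required_successes: int = 3,
                      interval: float = 10.0,
                      max_attempts: int = 30,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` until it succeeds `required_successes` times in a row.

    Returns True once the streak is reached, or False if `max_attempts`
    polls are used up first. Any failed check resets the streak to zero.
    """
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a single failure means the node is not yet stable
        if attempt < max_attempts - 1:
            sleep(interval)  # injectable so the wait can be skipped in tests
    return False
```

The defaults (three successes, ten seconds apart) are illustrative values, not recommendations from the original runbook; tune them to how quickly your nodes normally rejoin the cluster.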
For more detailed steps and procedures specific to your environment, see the runbook source (https://github.com/srerun/prometheus-alerts/blob/main/content/runbooks/kbudde-rabbitmq-exporter/RabbitmqDown.md).