RabbitmqClusterPartition #
Cluster partition
Alert Rule
alert: RabbitmqClusterPartition
annotations:
  description: |-
    Cluster partition
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/kbudde-rabbitmq-exporter/rabbitmqclusterpartition/
  summary: RabbitMQ cluster partition (instance {{ $labels.instance }})
expr: rabbitmq_partitions > 0
for: 0m
labels:
  severity: critical
Meaning #
The RabbitMQ Cluster Partition alert fires when the rabbitmq_partitions metric is greater than 0, which indicates a network partition in the RabbitMQ cluster. Because the rule uses for: 0m, it fires on the first evaluation that sees a non-zero value. A partition means that one or more nodes can no longer communicate with the others, so the cluster has split into groups that operate independently.
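To make the alert condition concrete, here is a minimal sketch that applies the same comparison as the rule's expr to exporter output in the Prometheus text format. The sample text, the node label, and the function name firing_series are hypothetical, chosen only for illustration; the sketch also assumes sample lines carry no trailing timestamp.

```python
# Hypothetical scrape output; a real exporter exposes more labels and metrics.
SAMPLE_METRICS = """\
# TYPE rabbitmq_partitions gauge
rabbitmq_partitions{node="rabbit@node1"} 0
rabbitmq_partitions{node="rabbit@node2"} 1
"""


def firing_series(text, metric="rabbitmq_partitions", threshold=0.0):
    """Return (series, value) pairs for which `metric > threshold` holds."""
    firing = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated token (no timestamp).
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric and float(value) > threshold:
            firing.append((series, float(value)))
    return firing


if __name__ == "__main__":
    for series, value in firing_series(SAMPLE_METRICS):
        print(f"would fire: {series} = {value}")
```

Any series returned here corresponds to a node that currently reports at least one partition, which is the condition the alert treats as critical.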
Impact #
A cluster partition in RabbitMQ can have significant consequences, including:
- Message loss or duplication
- Incorrect message ordering
- Decreased system availability
- Increased latency
The impact of a cluster partition can be severe: RabbitMQ may become unavailable or behave erratically. Address the partition promptly to prevent further damage to the system.
Diagnosis #
To diagnose the cause of the cluster partition, follow these steps:
- Check the RabbitMQ logs for error messages related to node connections or timeouts.
- Verify the status of each node in the cluster using the RabbitMQ management UI or the rabbitmqctl command-line tool.
- Check for network connectivity issues between nodes in the cluster.
- Review the RabbitMQ configuration to ensure that it is correct and consistent across all nodes.
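The node status check above can be sketched against the management plugin's HTTP API, assuming the plugin is enabled and that each entry returned by /api/nodes carries name, running, and partitions fields. The base URL, credentials, and the helper names fetch_nodes and summarize are placeholders, not part of this runbook.

```python
import base64
import json
import urllib.request


def fetch_nodes(base_url, user, password):
    """GET <base_url>/api/nodes using HTTP basic auth and return the parsed JSON."""
    req = urllib.request.Request(base_url.rstrip("/") + "/api/nodes")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def summarize(nodes):
    """Return (names of nodes not running, {node: peers it is partitioned from})."""
    stopped = [n["name"] for n in nodes if not n.get("running", False)]
    partitioned = {n["name"]: n["partitions"] for n in nodes if n.get("partitions")}
    return stopped, partitioned


if __name__ == "__main__":
    # Placeholder URL and credentials; replace with values for your cluster.
    stopped, partitioned = summarize(
        fetch_nodes("http://localhost:15672", "guest", "guest")
    )
    print("not running:", stopped)
    for node, peers in partitioned.items():
        print(f"{node} is partitioned from {peers}")
```

Running this against a node on each side of a suspected split can help, because each side only reports the partition as it sees it.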
Mitigation #
To mitigate the effects of a cluster partition, follow these steps:
- Identify the affected nodes and isolate them from the rest of the cluster.
- Restart the RabbitMQ service on the affected nodes to attempt to reconnect them to the cluster.
- If necessary, manually reconnect nodes to the cluster using the rabbitmqctl command-line tool.
- Verify that the cluster is stable and messages are being processed correctly once the nodes have been reconnected.
- Perform a thorough investigation to determine the root cause of the cluster partition and take steps to prevent it from happening again in the future.
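The restart step above can be sketched as a small planner that builds rabbitmqctl commands for the nodes you have decided to restart. This is a sketch under assumptions, not a definitive procedure: it uses rabbitmqctl stop_app and start_app (which stop and start the RabbitMQ application on a node without stopping the Erlang runtime) instead of a full service restart, it assumes rabbitmqctl can reach each node with -n, and the function names and node names are hypothetical. Choosing which side of the partition to keep is an operator decision that the code does not make.

```python
import subprocess


def restart_plan(nodes_to_restart):
    """Build argv lists: stop every listed node first, then start them again."""
    stops = [["rabbitmqctl", "-n", node, "stop_app"] for node in nodes_to_restart]
    starts = [["rabbitmqctl", "-n", node, "start_app"] for node in nodes_to_restart]
    return stops + starts


def execute(plan, dry_run=True):
    """Print each command, or run it when dry_run is False."""
    for argv in plan:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True aborts the plan if any command exits non-zero.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical node names; dry run only prints the commands.
    execute(restart_plan(["rabbit@node2", "rabbit@node3"]))
```

Keeping dry_run as the default lets you review the exact commands before anything is stopped. After the nodes rejoin, re-check the rabbitmq_partitions metric to confirm it has returned to 0.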