RabbitmqClusterPartition #

Cluster partition

Alert Rule

alert: RabbitmqClusterPartition
annotations:
  description: |-
    Cluster partition
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/kbudde-rabbitmq-exporter/rabbitmqclusterpartition/
  summary: RabbitMQ cluster partition (instance {{ $labels.instance }})
expr: rabbitmq_partitions > 0
for: 0m
labels:
  severity: critical


Meaning #

The RabbitMQ Cluster Partition alert fires when the rabbitmq_partitions metric is greater than 0, which indicates that the cluster has partitioned. In a partition, some nodes in the RabbitMQ cluster can no longer communicate with the others, so the cluster splits into separate groups. Because the rule uses for: 0m, the alert fires as soon as the condition is observed.
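
The alert condition can also be checked by hand against the Prometheus HTTP API. The following is a minimal sketch, not part of this runbook's tooling: the base URL http://localhost:9090 and the helper names build_query_url, partitioned_instances, and fetch are assumptions made for illustration. It builds an instant-query URL for the rule's expression and extracts the affected instances from the JSON response.

```python
import json
import urllib.parse
import urllib.request

# Same expression as the alert rule.
QUERY = "rabbitmq_partitions > 0"


def build_query_url(base_url, query=QUERY):
    """Build a Prometheus instant-query URL (/api/v1/query)."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def partitioned_instances(response_body):
    """Map each instance label to its current rabbitmq_partitions value."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    result = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "<unknown>")
        # An instant-vector sample is [timestamp, "value-as-string"].
        result[instance] = float(sample["value"][1])
    return result


def fetch(base_url="http://localhost:9090"):
    """Query Prometheus; the default base URL is an assumption."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=10) as resp:
        return partitioned_instances(resp.read().decode("utf-8"))
```

Calling fetch() returns a dictionary keyed by instance; an empty dictionary means no instance currently satisfies the alert expression.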

Impact #

A cluster partition in RabbitMQ can have significant consequences, including:

Message loss or duplication
Incorrect message ordering
Decreased system availability
Increased latency

A cluster partition can cause RabbitMQ to become unavailable or behave erratically, so address it promptly to prevent further damage to the system.

Diagnosis #

To diagnose the cause of the cluster partition, follow these steps:

Check the RabbitMQ logs for error messages related to node connections or timeouts.
Verify the status of each node in the cluster using the RabbitMQ management UI or the rabbitmqctl command-line tool (for example, rabbitmqctl cluster_status).
Check for network connectivity issues between nodes in the cluster.
Review the RabbitMQ configuration to ensure that it is correct and consistent across all nodes.
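
The node status check above can be scripted. The sketch below assumes that rabbitmqctl cluster_status --formatter json is available on the node and that its output contains partitions (a map from node name to the peers it considers partitioned), running_nodes, disk_nodes, and ram_nodes keys; verify those key names against your RabbitMQ version before relying on them. The function names are hypothetical helpers.

```python
import json
import subprocess


def find_partitions(status_json):
    """Return {node: [peers it reports as partitioned]} from cluster_status JSON."""
    status = json.loads(status_json)
    partitions = status.get("partitions") or {}
    if not isinstance(partitions, dict):
        # Assumption: "partitions" is a node -> peers map; fail loudly otherwise.
        raise ValueError("unexpected 'partitions' format: %r" % (partitions,))
    # Keep only nodes that actually list unreachable peers.
    return {node: sorted(peers) for node, peers in partitions.items() if peers}


def missing_nodes(status_json):
    """Return cluster members that are known but not currently running."""
    status = json.loads(status_json)
    known = set(status.get("disk_nodes", [])) | set(status.get("ram_nodes", []))
    running = set(status.get("running_nodes", []))
    return sorted(known - running)


def fetch_cluster_status():
    """Run rabbitmqctl locally and return its JSON output as text."""
    completed = subprocess.run(
        ["rabbitmqctl", "cluster_status", "--formatter", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

Running find_partitions(fetch_cluster_status()) on each node in turn shows which peers that node has lost contact with, which helps narrow down where the network fault lies.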

Mitigation #

To mitigate the effects of a cluster partition, follow these steps:

Identify which nodes are on each side of the partition and decide which side holds the state you trust most.
Restart the RabbitMQ service on the nodes outside the trusted side so that they attempt to reconnect to the cluster.
If necessary, manually reconnect nodes to the cluster using the rabbitmqctl command-line tool.
Verify that the cluster is stable and messages are being processed correctly once the nodes have been reconnected.
Investigate the root cause of the cluster partition and take steps to prevent it from recurring.
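
The restart step can be sketched as a small helper. This is an illustration under stated assumptions, not a definitive procedure: it assumes you have already chosen which nodes to restart, that rabbitmqctl on the current host can reach them with the -n option, and that stopping all of those nodes with stop_app before starting them with start_app is acceptable in your environment. The names restart_plan and run_plan are hypothetical.

```python
import subprocess


def restart_plan(nodes_to_restart):
    """Build rabbitmqctl commands: stop every listed node, then start them again."""
    if not nodes_to_restart:
        raise ValueError("no nodes given")
    stops = [["rabbitmqctl", "-n", node, "stop_app"] for node in nodes_to_restart]
    starts = [["rabbitmqctl", "-n", node, "start_app"] for node in nodes_to_restart]
    return stops + starts


def run_plan(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True aborts the plan on the first failing command.
            subprocess.run(argv, check=True)
```

Calling run_plan(restart_plan(["rabbit@node2"])) only prints the commands; review the output, then pass dry_run=False to execute them. Afterwards, confirm that rabbitmq_partitions has returned to 0.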