RedisClusterFlapping

Changes have been detected in the number of connected Redis replicas. This can occur when replica nodes repeatedly lose their connection to the master and reconnect (a.k.a. flapping).

Alert Rule
alert: RedisClusterFlapping
annotations:
  description: |-
    Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/oliver006-redis-exporter/redisclusterflapping/
  summary: Redis cluster flapping (instance {{ $labels.instance }})
expr: changes(redis_connected_slaves[1m]) > 1
for: 2m
labels:
  severity: critical

Meaning #

The RedisClusterFlapping alert fires when redis_connected_slaves, the number of replicas connected to a master as reported by the exporter, changes more than once within a 1-minute window and that condition persists for 2 minutes. This indicates that replica nodes are repeatedly losing and re-establishing their connection to the master node, which makes replication unstable.
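
To make the alert condition concrete, the following minimal sketch mimics how PromQL's changes() counts value changes across the samples in a window. The sample windows and the 15-second scrape interval are assumptions for illustration only; the helper names are hypothetical and are not part of Prometheus or the exporter.

```python
from typing import Sequence


def changes(samples: Sequence[float]) -> int:
    """Count how often the value differs from the previous sample,
    which is what PromQL's changes() reports for a range vector."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def condition_met(samples: Sequence[float], threshold: int = 1) -> bool:
    """Mirror the rule's comparison: changes(...) > 1."""
    return changes(samples) > threshold


# Hypothetical 1-minute windows of redis_connected_slaves (15s scrape interval assumed).
stable = [2, 2, 2, 2]          # no change
single_drop = [2, 2, 1, 1]     # one replica lost and not yet back
one_flap = [2, 1, 2, 2]        # one disconnect followed by a reconnect
flapping = [2, 1, 2, 1]        # repeated disconnects

for name, window in [("stable", stable), ("single_drop", single_drop),
                     ("one_flap", one_flap), ("flapping", flapping)]:
    print(name, changes(window), condition_met(window))
```

Note that a single disconnect followed by a reconnect inside the window already counts as two changes, so it satisfies the expression; the alert only fires if the expression stays true for the full 2-minute for period.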

Impact #

Flapping Redis replicas can lead to:

  • Stale or inconsistent reads: Redis replication is asynchronous, so a replica that has just reconnected may lag behind the master until it finishes resynchronizing, and clients reading from it may see out-of-date data.
  • Performance issues: Repeated reconnecting and resynchronizing of replicas can increase latency and reduce throughput.
  • Increased load on the master node: Every reconnect triggers a partial or full resynchronization. A full resynchronization requires the master to produce and transfer a snapshot of its dataset, which costs CPU, memory, and network bandwidth.

Diagnosis #

To diagnose the issue, follow these steps:

  • Check the Redis logs on the master and replica nodes for errors or warnings related to replica connections.
  • Verify that the network connectivity between the master node and the replica nodes is stable.
  • Check the system resources (CPU, memory, disk space) of the master node and replica nodes to ensure they are not overloaded.
  • Review the Redis configuration on the master and on each replica to ensure that replication is set up correctly.
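
As a starting point for the checks above, the sketch below parses the text returned by the Redis INFO replication command on the master and flags replicas that are not in the online state, report a high lag, or trail the master's replication offset. The sample output, the IP addresses, and the 5-second lag limit are assumptions chosen for illustration; in practice the text would come from the affected master.

```python
from typing import Dict, List


def parse_info(text: str) -> Dict[str, str]:
    """Turn 'key:value' lines from INFO output into a dict, skipping section headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_findings(info: Dict[str, str], max_lag_seconds: int = 5) -> List[str]:
    """Return human-readable findings for replicas that look unhealthy."""
    findings: List[str] = []
    master_offset = int(info.get("master_repl_offset", "0"))
    for i in range(int(info.get("connected_slaves", "0"))):
        raw = info.get(f"slave{i}")
        if raw is None:
            findings.append(f"slave{i}: counted in connected_slaves but no detail line")
            continue
        # Detail lines look like: ip=...,port=...,state=...,offset=...,lag=...
        attrs = dict(pair.split("=", 1) for pair in raw.split(","))
        name = f"{attrs.get('ip', '?')}:{attrs.get('port', '?')}"
        if attrs.get("state") != "online":
            findings.append(f"{name}: state={attrs.get('state')}")
        if int(attrs.get("lag", "0")) > max_lag_seconds:
            findings.append(f"{name}: lag={attrs['lag']}s")
        behind = master_offset - int(attrs.get("offset", "0"))
        if behind > 0:
            findings.append(f"{name}: {behind} bytes behind master offset")
    return findings


# Hypothetical INFO replication output captured from a master.
SAMPLE = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
slave1:ip=10.0.0.3,port=6379,state=wait_bgsave,offset=0,lag=12
master_repl_offset:1000
"""

for finding in replica_findings(parse_info(SAMPLE)):
    print(finding)
```

Running this against output captured a few seconds apart can show whether the same replica keeps dropping out, which helps narrow the problem to one host or network path.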

Mitigation #

To mitigate the issue, follow these steps:

  • Investigate and resolve any underlying network issues that may be causing the replicas to lose their connection to the master node.
  • Check and adjust the Redis configuration to ensure that the replica nodes are properly configured and that the master node is not overloaded.
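  • If brief reconnects are expected in your environment, consider reducing the sensitivity of the alert, for example by raising the threshold in the expression changes(redis_connected_slaves[1m]) > 1 or by lengthening the for duration.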
  • Implement measures to reduce the load on the master node, such as load balancing or sharding.
  • Consider upgrading the Redis version or using a more stable Redis cluster setup.
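
One configuration setting that is often worth reviewing when replicas flap is the master's replication backlog (the repl-backlog-size directive), because a replica can only perform a cheap partial resynchronization if the writes it missed still fit in that backlog. The sketch below is a rough sizing estimate under stated assumptions: the write rate, the disconnect duration, and the safety factor are hypothetical inputs, and the function names are illustrative rather than part of Redis.

```python
def required_backlog_bytes(write_bytes_per_sec: float,
                           max_disconnect_sec: float,
                           safety_factor: float = 2.0) -> int:
    """Estimate a backlog size that covers the writes missed during a disconnect."""
    return int(write_bytes_per_sec * max_disconnect_sec * safety_factor)


def can_partial_resync(backlog_bytes: int,
                       write_bytes_per_sec: float,
                       disconnect_sec: float) -> bool:
    """True if the writes missed during the disconnect still fit in the backlog."""
    return write_bytes_per_sec * disconnect_sec <= backlog_bytes


# Assumed workload: about 500 kB/s of replicated writes, 10-second disconnects.
WRITE_RATE = 500_000
DISCONNECT = 10

one_mib = 1024 * 1024  # Redis ships with a 1 MB default backlog
print(can_partial_resync(one_mib, WRITE_RATE, DISCONNECT))
print(required_backlog_bytes(WRITE_RATE, DISCONNECT))
```

The write rate can be approximated by sampling master_repl_offset from INFO replication twice and dividing the difference by the elapsed seconds. Treat the result as a starting point for tuning, not a definitive value, since a larger backlog also consumes more memory on the master.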