RedisReplicationBroken #

Redis instance lost a slave

Alert Rule

alert: RedisReplicationBroken
annotations:
  description: |-
    Redis instance lost a slave
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/oliver006-redis-exporter/redisreplicationbroken/
  summary: Redis replication broken (instance {{ $labels.instance }})
expr: delta(redis_connected_slaves[1m]) < 0
for: 0m
labels:
  severity: critical

Meaning #

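The RedisReplicationBroken alert fires when the redis_connected_slaves metric reported for a Redis master decreases within a 1-minute window, meaning one or more slaves have disconnected from that master. Because the rule uses for: 0m, it fires on the first evaluation that sees the drop. A disconnected slave stops receiving updates from the master, so it can serve stale data, and fewer up-to-date copies exist if the master fails.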
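
To make the trigger condition concrete, here is a minimal Python sketch of the comparison behind the expression. It is a simplification, not Prometheus code: the real delta() also extrapolates to the window boundaries, which is assumed here to scale the result without changing its sign. The function name replica_count_dropped and the sample lists are hypothetical.

```python
from typing import Sequence


def replica_count_dropped(samples: Sequence[float]) -> bool:
    """Approximate `delta(redis_connected_slaves[1m]) < 0`.

    `samples` are the gauge values scraped inside the 1-minute window,
    oldest first. Only the first and last sample are compared; the
    extrapolation Prometheus applies is deliberately left out.
    """
    if len(samples) < 2:
        # With fewer than two samples there is nothing to compare.
        return False
    return samples[-1] - samples[0] < 0


if __name__ == "__main__":
    print(replica_count_dropped([2, 2, 1]))  # one slave lost inside the window
    print(replica_count_dropped([2, 2, 2]))  # slave count unchanged
```

In this simplified model a window such as [2, 1, 2] does not trigger, so a slave that drops and reconnects between the first and last sample may go unnoticed by the rule.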

Impact #

The alert has critical severity because a lost slave can result in:

Data loss if the master fails while fewer slaves hold a current copy of the data
Increased latency or errors in applications relying on Redis
Stale or inconsistent data served to clients that read from the disconnected slave
Downtime or reduced performance of dependent services

Diagnosis #

To diagnose the issue, follow these steps:

Check the Redis instance’s log files for errors or warnings related to replication or connections.
Verify the network connectivity between the Redis master and slave instances.
Check the Redis configuration to ensure that replication is properly configured.
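Review the redis_connected_slaves metric to see which master lost a slave and when. The metric is only a count per master, so run INFO replication on that master to see which slaves are still connected and work out which one is missing.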
Investigate any recent changes or updates to the Redis cluster or underlying infrastructure.
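
As an illustration of the diagnosis steps above, the sketch below parses the text returned by INFO replication on a master (for example from redis-cli INFO replication) and lists the slaves the master still sees. The helper names parse_info_replication and connected_replicas are hypothetical, and the sample output uses made-up addresses and offsets; the field layout is assumed to follow the usual slaveN:ip=...,port=...,state=...,offset=...,lag=... form.

```python
from typing import Dict, Union

ReplicaFields = Dict[str, str]
InfoValue = Union[str, ReplicaFields]

# Hypothetical sample of `INFO replication` output from a master.
SAMPLE = """# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.2,port=6379,state=online,offset=1234,lag=0
master_repl_offset:1234
"""


def parse_info_replication(text: str) -> Dict[str, InfoValue]:
    """Turn `INFO replication` text into a dict.

    Plain fields map to strings; slaveN lines map to a dict of their
    comma-separated key=value pairs.
    """
    info: Dict[str, InfoValue] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the section header
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key:value line
        if key.startswith("slave") and "=" in value:
            pairs = (p.split("=", 1) for p in value.split(",") if "=" in p)
            info[key] = dict(pairs)
        else:
            info[key] = value
    return info


def connected_replicas(info: Dict[str, InfoValue]) -> Dict[str, ReplicaFields]:
    """Return only the per-slave entries (slave0, slave1, ...)."""
    return {k: v for k, v in info.items() if isinstance(v, dict)}


if __name__ == "__main__":
    parsed = parse_info_replication(SAMPLE)
    for name, fields in connected_replicas(parsed).items():
        print(name, fields.get("ip"), fields.get("state"))
```

Comparing the slaves listed here with the set you expect (from inventory or configuration) shows which one dropped off; a slave that is listed but not in state online is also worth a closer look.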

Mitigation #

To mitigate the issue, follow these steps:

Investigate and resolve the underlying cause of the slave disconnection (for example, a network issue or a configuration error).
Re-establish connectivity between the Redis master and slave instances.
Verify that replication is working correctly and data is being synced between instances.
Consider increasing the number of Redis slaves to improve redundancy and fault tolerance.
Review and adjust the Redis cluster’s configuration to prevent similar issues in the future.
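
One way to check that replication has caught up after reconnecting is to compare the master's master_repl_offset with the offset reported for each slave in INFO replication. The sketch below assumes those numbers have already been collected; the function name lagging_replicas, the max_lag_bytes threshold, and the example values are all hypothetical.

```python
from typing import Dict


def lagging_replicas(
    master_offset: int,
    replica_offsets: Dict[str, int],
    max_lag_bytes: int = 0,
) -> Dict[str, int]:
    """Return the slaves whose offset trails the master by more than
    `max_lag_bytes`, mapped to how many bytes they are behind."""
    behind = {}
    for name, offset in replica_offsets.items():
        lag = master_offset - offset
        if lag > max_lag_bytes:
            behind[name] = lag
    return behind


if __name__ == "__main__":
    # Hypothetical offsets read from INFO replication on the master.
    print(lagging_replicas(1000, {"slave0": 1000, "slave1": 900}))
```

On a busy master the offsets move constantly, so a small non-zero lag is normal; what matters is that the gap stays small and does not keep growing. The right threshold depends on your write volume and is an assumption you would need to tune.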