RedisReplicationBroken #
Redis instance lost a slave
Alert Rule
alert: RedisReplicationBroken
annotations:
  description: |-
    Redis instance lost a slave
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/oliver006-redis-exporter/redisreplicationbroken/
  summary: Redis replication broken (instance {{ $labels.instance }})
expr: delta(redis_connected_slaves[1m]) < 0
for: 0m
labels:
  severity: critical
Meaning #
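The RedisReplicationBroken alert fires when redis_connected_slaves, the number of slaves connected to a Redis master, decreases over a 1-minute window. Because the rule uses for: 0m, it fires on the first evaluation that sees the drop. A decrease means that one or more slaves have disconnected from the master, which reduces redundancy, can leave the disconnected slave serving stale data, and risks data loss if the master fails before the slave resynchronizes.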
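To make the rule's expression concrete, here is a minimal sketch that applies the same comparison to a list of gauge samples. It is a simplification: Prometheus's delta() also extrapolates to the edges of the range window, which this sketch ignores. The names simple_delta and replication_broken and the sample values are hypothetical, chosen only for illustration.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, gauge value)


def simple_delta(samples: Sequence[Sample]) -> float:
    """Last value minus first value in the window (no extrapolation)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    return samples[-1][1] - samples[0][1]


def replication_broken(samples: Sequence[Sample]) -> bool:
    """Rough equivalent of `delta(redis_connected_slaves[1m]) < 0`."""
    return simple_delta(samples) < 0


# Hypothetical 1-minute window scraped every 15 seconds: one slave drops off.
window = [(0.0, 2.0), (15.0, 2.0), (30.0, 2.0), (45.0, 1.0), (60.0, 1.0)]
print(replication_broken(window))
```

Note that once the drop slides out of the 1-minute window the delta returns to zero, so the alert can resolve even though the slave has not reconnected.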
Impact #
The impact of this alert is critical, as it can result in:
- Data loss if the master fails before the disconnected slave resynchronizes
- Stale reads, increased latency, or errors in applications that read from the disconnected slave
- Reduced failover capacity, since fewer slaves are available for promotion
- Downtime or reduced performance of dependent services
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Redis instance’s log files for errors or warnings related to replication or connections.
- Verify the network connectivity between the Redis master and slave instances.
- Check the Redis configuration to ensure that replication is properly configured.
- Review the redis_connected_slaves metric to identify which master lost a slave and when. The metric is only a count, so run INFO replication on that master to see which slaves are still connected.
- Investigate any recent changes or updates to the Redis cluster or underlying infrastructure.
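As a rough illustration of the metric review step, the sketch below parses the text returned by INFO replication on a master and reports which expected slaves are absent. The sample output, the addresses, and the helper names (parse_info, connected_slaves, missing_slaves) are made up for this example; in practice the text would come from redis-cli or a client library.

```python
import re
from typing import Dict, List


def parse_info(text: str) -> Dict[str, str]:
    """Parse `key:value` lines, skipping blank lines and `# Section` headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def connected_slaves(fields: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one dict per `slaveN` entry (ip, port, state, offset, lag)."""
    slaves = []
    for key, value in fields.items():
        if re.fullmatch(r"slave\d+", key):
            slaves.append(dict(pair.split("=", 1) for pair in value.split(",")))
    return slaves


def missing_slaves(fields: Dict[str, str], expected: List[str]) -> List[str]:
    """List expected `ip:port` addresses that are not reported as online."""
    online = {
        f"{s['ip']}:{s['port']}"
        for s in connected_slaves(fields)
        if s.get("state") == "online"
    }
    return [addr for addr in expected if addr not in online]


# Hypothetical INFO replication output from a master that lost one slave.
SAMPLE = """\
# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.12,port=6379,state=online,offset=1048576,lag=0
master_repl_offset:1048576
"""

fields = parse_info(SAMPLE)
print(missing_slaves(fields, ["10.0.0.12:6379", "10.0.0.13:6379"]))
```

The list of expected addresses has to come from your own inventory, since the master only reports the slaves that are currently connected.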
Mitigation #
To mitigate the issue, follow these steps:
- Investigate and resolve the underlying cause of the slave disconnection (e.g., a network issue or configuration error).
- Re-establish connectivity between the Redis master and slave instances.
- Verify that replication is working correctly and data is being synced between instances, for example by confirming that INFO replication on the slave reports master_link_status:up.
- Consider increasing the number of Redis slaves to improve redundancy and fault tolerance.
- Review and adjust the Redis cluster’s configuration to prevent similar issues in the future.
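The verification step above could be scripted along these lines. This is a sketch under assumptions: the sample text stands in for INFO replication output collected from a slave, and replica_problems is a hypothetical helper that checks only three fields (role, master_link_status, master_sync_in_progress).

```python
from typing import Dict, List


def replica_problems(info_text: str) -> List[str]:
    """Return human-readable problems found in a slave's INFO replication text."""
    # Keep only `key:value` lines; section headers start with '#'.
    fields: Dict[str, str] = dict(
        line.strip().split(":", 1)
        for line in info_text.splitlines()
        if ":" in line and not line.lstrip().startswith("#")
    )
    problems = []
    if fields.get("role") != "slave":
        problems.append(f"role is {fields.get('role')!r}, expected 'slave'")
    if fields.get("master_link_status") != "up":
        problems.append("master_link_status is not 'up'")
    if fields.get("master_sync_in_progress") != "0":
        problems.append("sync with master still in progress")
    return problems


# Hypothetical output from a slave that has not yet reconnected.
BROKEN = """\
# Replication
role:slave
master_host:10.0.0.11
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1048000
"""

# Hypothetical output from the same slave after replication is restored.
HEALTHY = BROKEN.replace("master_link_status:down", "master_link_status:up")

print(replica_problems(BROKEN))
print(replica_problems(HEALTHY))
```

An empty list means the three checks passed. It does not prove the slave is fully caught up, so comparing replication offsets on the master and the slave remains a useful follow-up.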