ZookeeperMissingLeader #
Zookeeper cluster has no node marked as leader
Alert Rule
alert: ZookeeperMissingLeader
annotations:
  description: |-
    Zookeeper cluster has no node marked as leader
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/dabealu-zookeeper-exporter/zookeepermissingleader/
  summary: Zookeeper missing leader (instance {{ $labels.instance }})
expr: sum(zk_server_leader) == 0
for: 0m
labels:
  severity: critical
Meaning #
The ZookeeperMissingLeader alert fires when Prometheus finds no node marked as leader in the Zookeeper cluster. The check is based on the zk_server_leader metric, which is expected to be 1 on the node currently acting as leader and 0 on every other node, so the sum across the cluster should be non-zero. A sum of zero means no node is acting as leader, which can have severe consequences for the cluster’s availability and integrity. Because the rule uses for: 0m, the alert fires on the first evaluation that sees no leader, which can include a brief leader election.
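The rule's condition can be sketched in a few lines of Python. This is only an illustration of how sum(zk_server_leader) == 0 behaves, not how Prometheus implements it; the instance names and port are hypothetical. One point worth noting is that PromQL's sum() over no series returns an empty result rather than 0, so the rule does not fire when no zk_server_leader samples exist at all (for example, if every exporter is down).

```python
from typing import Mapping, Optional


def leader_count(samples: Mapping[str, float]) -> Optional[float]:
    """Mimic sum(zk_server_leader): total across instances, or None when there are no series."""
    if not samples:
        # PromQL's sum() over an empty input yields an empty result, not 0.
        return None
    return sum(samples.values())


def alert_fires(samples: Mapping[str, float]) -> bool:
    """Mimic the rule expression: sum(zk_server_leader) == 0."""
    total = leader_count(samples)
    return total is not None and total == 0


# Hypothetical instances; 1 marks the leader, 0 marks every other node.
healthy = {"zk-1:9141": 0, "zk-2:9141": 1, "zk-3:9141": 0}
no_leader = {"zk-1:9141": 0, "zk-2:9141": 0, "zk-3:9141": 0}

print(alert_fires(healthy))    # False
print(alert_fires(no_leader))  # True
print(alert_fires({}))         # False: no series, so the comparison returns nothing
```

If missing exporters should also page, a separate rule on scrape health would be needed; that is outside what this rule checks.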
Impact #
The absence of a Zookeeper leader node can lead to:
- Cluster instability and potential data loss
- Inability to perform writes or updates to the Zookeeper database
- Failure of dependent systems and applications that rely on Zookeeper for coordination and configuration management
- Increased latency and errors in distributed systems that use Zookeeper
Diagnosis #
To diagnose the issue, follow these steps:
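- Check the status of each Zookeeper node using the Zookeeper CLI or a monitoring tool, for example with zkServer.sh status or the srvr four-letter command, both of which report the node's mode (leader, follower, or standalone).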
- Verify that all Zookeeper nodes are running and reachable.
- Check the Zookeeper logs for any error messages related to leader election or node connectivity.
- Use Prometheus and Grafana to visualize the zk_server_leader metric and identify any trends or patterns that may indicate the cause of the issue.
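The per-node status check from the steps above can be scripted. The following is a minimal sketch that sends the srvr four-letter command to each node over the client port and extracts the Mode line. The host names are hypothetical, port 2181 is assumed to be the client port, and the command must be allowed by the server's four-letter-word whitelist.

```python
import socket
from typing import Dict, Iterable, Optional


def four_letter_word(host: str, port: int = 2181, cmd: str = "srvr",
                     timeout: float = 3.0) -> str:
    """Send a four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output: str) -> Optional[str]:
    """Return the value of the 'Mode:' line, or None if the reply has none."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def ensemble_modes(hosts: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each host to its reported mode, or to an 'unreachable' note."""
    modes: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            modes[host] = parse_mode(four_letter_word(host))
        except OSError as exc:  # covers refused connections, DNS errors, timeouts
            modes[host] = f"unreachable ({exc})"
    return modes


# Example call with hypothetical host names:
# print(ensemble_modes(["zk-1.example.internal", "zk-2.example.internal"]))
```

A healthy ensemble should show exactly one node with mode leader. A node that returns no Mode line is likely not serving requests, which is worth correlating with its logs.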
Mitigation #
To mitigate the issue, follow these steps:
- Identify the cause of the leader node failure and resolve it (e.g., restart the node, fix network connectivity issues, etc.).
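- If the leader node is down, confirm that a majority (quorum) of the nodes are running and can reach each other. Zookeeper elects a new leader automatically once a quorum is available; there is no CLI or API command to promote a node manually, so focus on restoring enough nodes to regain quorum.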
- Verify that the zk_server_leader metric returns a non-zero value, indicating the presence of a leader node.
- Monitor the Zookeeper cluster for any further issues and take corrective action if necessary.
- Consider measures to reduce the chance of losing the leader again, such as running an odd number of nodes (at least three) on separate hosts so that the cluster keeps a quorum and can elect a new leader automatically when a single node fails.
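Zookeeper can only elect a leader while a majority of its voting members are up and can reach each other, so it helps to know how many failures a given cluster size tolerates. The sketch below shows that arithmetic; the function names are illustrative, and the counts assume voting members only (observers are not included).

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of voting members that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many voting members can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


def can_elect_leader(ensemble_size: int, reachable: int) -> bool:
    """True if the reachable voting members still form a majority."""
    return reachable >= quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-node cluster tolerates no more failures than a three-node one, which is why odd cluster sizes are usually preferred. During an incident, can_elect_leader(3, 1) returning False is a quick reminder that a single surviving node out of three cannot elect a leader on its own.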