EtcdNoLeader #

The etcd cluster has no leader.

Alert Rule

alert: EtcdNoLeader
annotations:
  description: |-
    Etcd cluster have no leader
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdnoleader/
  summary: Etcd no Leader (instance {{ $labels.instance }})
expr: etcd_server_has_leader == 0
for: 0m
labels:
  severity: critical
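
As a rough illustration of what the rule evaluates, the following Python sketch reads the text that an etcd member exposes on its metrics endpoint and reports whether the expression etcd_server_has_leader == 0 would match. The function names and the sample text are hypothetical and exist only for this example; in practice Prometheus scrapes the metric and evaluates the rule itself.

```python
from typing import Optional

METRIC = "etcd_server_has_leader"


def parse_has_leader(metrics_text: str) -> Optional[float]:
    """Return the value of etcd_server_has_leader, or None if it is absent."""
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Sample with a label set: name{labels} value [timestamp]
            head, _, tail = line.rpartition("}")
            name = head.split("{", 1)[0]
        else:
            # Sample without labels: name value [timestamp]
            name, _, tail = line.partition(" ")
        if name == METRIC:
            return float(tail.split()[0])
    return None


def would_fire(metrics_text: str) -> bool:
    """Mirror the alert expression: match only when the value is exactly 0."""
    return parse_has_leader(metrics_text) == 0


# Hypothetical scrape output from a member that currently has no leader.
SAMPLE = """\
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 0
"""

if __name__ == "__main__":
    print("alert would fire:", would_fire(SAMPLE))
```

Note that a missing metric (for example, an unreachable member) does not match the expression, so this rule alone does not cover members that are down.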


Meaning #

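The EtcdNoLeader alert fires when an etcd member reports that it has no leader (etcd_server_has_leader == 0). etcd is a distributed key-value store that replicates data across its members using the Raft consensus algorithm. The leader orders all writes and replicates them to the followers. Because the rule uses for: 0m, the alert fires on the first evaluation that sees the value 0, so even a brief leader election can trigger it. A member that stays leaderless has usually lost contact with a majority of the cluster. While a member has no leader, it cannot serve writes or linearizable reads.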
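
Leader election in etcd is based on Raft, which needs a majority (quorum) of members to agree before a leader can be elected. The arithmetic can be sketched in Python as follows; the function names are hypothetical and chosen only for this example.

```python
def quorum_size(total_members: int) -> int:
    """Smallest number of members that forms a majority."""
    return total_members // 2 + 1


def can_elect_leader(total_members: int, reachable_members: int) -> bool:
    """A leader can only be elected while a majority can talk to each other."""
    return reachable_members >= quorum_size(total_members)


if __name__ == "__main__":
    for total in (1, 3, 5):
        tolerated = total - quorum_size(total)
        print(f"{total} members: quorum {quorum_size(total)}, tolerates {tolerated} failure(s)")
```

For example, a three-member cluster keeps a leader with one member down, but loses it as soon as two members are unreachable.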

Impact #

The impact of this alert is critical. While the cluster has no leader, the following issues can occur:

Write unavailability: etcd cannot commit writes without a leader, so requests to create, update, or delete keys fail or time out. Linearizable reads fail as well.
Cluster instability: Repeated or failed leader elections leave the cluster unable to make progress, leading to request timeouts and errors.
Service disruptions: Services that rely on etcd, such as the Kubernetes API server, can return errors or become unavailable until a leader is elected.

Diagnosis #

To diagnose the issue, follow these steps:

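Check the etcd cluster status using the etcdctl command-line tool, for example with etcdctl endpoint status -w table and etcdctl endpoint health, passing every member in --endpoints. The IS LEADER and RAFT TERM columns show whether the members agree on a leader.
Verify that every etcd member process is running and healthy, and that a majority of members is up, since a leader can only be elected with a quorum.
Check the etcd logs on each member for error messages related to leader election, lost leaders, or timeouts.
Verify that the network connectivity between the etcd nodes is working correctly, in particular on the peer port (2380 by default).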
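
The first diagnosis step can be partly automated. The sketch below parses the JSON printed by etcdctl endpoint status -w json and lists the endpoints that report no leader. It assumes each entry carries an "Endpoint" field and a "Status" object whose "leader" field is 0 or missing when the member sees no leader; verify these field names against the output of your etcdctl version. The function name and the sample data are hypothetical.

```python
import json
from typing import List


def endpoints_without_leader(status_json: str) -> List[str]:
    """Return the endpoints whose status reports no leader.

    Assumption: a leader ID of 0 means "no leader", and the field may be
    omitted from the JSON entirely when it is 0.
    """
    leaderless = []
    for entry in json.loads(status_json):
        leader = entry.get("Status", {}).get("leader", 0)
        if leader == 0:
            leaderless.append(entry.get("Endpoint", "<unknown>"))
    return leaderless


# Hypothetical output for a two-endpoint query; real output has more fields.
SAMPLE = json.dumps([
    {"Endpoint": "https://10.0.0.1:2379", "Status": {"leader": 1, "raftTerm": 7}},
    {"Endpoint": "https://10.0.0.2:2379", "Status": {"raftTerm": 8}},
])

if __name__ == "__main__":
    for endpoint in endpoints_without_leader(SAMPLE):
        print("no leader seen by", endpoint)
```

In practice the input would come from running etcdctl against all member endpoints and passing its standard output to the function.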

Mitigation #

To mitigate the issue, follow these steps:

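Check the etcd configuration of each member, in particular the peer URLs and initial cluster settings, and the election timeout and heartbeat interval if they were changed from the defaults.
Verify that all etcd nodes are running with the same, correct configuration and that there are no network connectivity issues between them.
If the issue persists, focus on restoring quorum rather than forcing an election. etcdctl move-leader only transfers leadership from an existing leader, so it does not help when there is none. Once a majority of members is running and can reach each other, the cluster elects a leader on its own.
If the issue still persists, restart the affected etcd members one at a time. If quorum is permanently lost, the cluster may have to be restored from a snapshot; seek assistance from a qualified administrator before doing so.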

Refer to the etcd documentation for more detailed information on troubleshooting and mitigating this issue.