EtcdInsufficientMembers #

Etcd cluster should have an odd number of members

Alert Rule

alert: EtcdInsufficientMembers
annotations:
  description: |-
    Etcd cluster should have an odd number of members
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdinsufficientmembers/
  summary: Etcd insufficient Members (instance {{ $labels.instance }})
expr: count(etcd_server_id) % 2 == 0
for: 0m
labels:
  severity: critical

Here is a runbook for the EtcdInsufficientMembers alert:

Meaning #

The EtcdInsufficientMembers alert is triggered when the number of etcd members in the cluster is even. Etcd, a distributed key-value store, requires an odd number of members to maintain a quorum and ensure the cluster’s availability. With an even number of members, the cluster is at risk of becoming unavailable or inconsistent in case of a node failure.

Impact #

If the EtcdInsufficientMembers alert is not addressed, it can lead to:

Cluster unavailability: With an even number of members, etcd may not be able to achieve a quorum, leading to cluster downtime and data inconsistency.
Data loss: In the event of a node failure, the cluster may not be able to recover, resulting in data loss or corruption.

Diagnosis #

To diagnose the issue, follow these steps:

Check the etcd cluster membership: Run the command etcdctl member list to verify the current number of members in the cluster.
Identify the missing member: Check the etcd server logs to determine which member is missing or not participating in the cluster.
Verify etcd configuration: Review the etcd configuration files to ensure that the member count is correctly set and matches the expected number of members.

Mitigation #

To mitigate the issue, follow these steps:

Add or replace the missing member: Bring up a new etcd member or replace the failed node to restore the odd number of members in the cluster.
Verify etcd cluster health: Run the command etcdctl cluster to verify that the cluster is healthy and all members are participating.
Monitor etcd metrics: Keep a close eye on etcd metrics, such as etcd_server_id, to ensure that the cluster remains healthy and available.

Remember to update the etcd configuration files and deployment scripts to prevent similar issues in the future.