EtcdHighNumberOfFailedProposals #
The etcd server recorded more than 5 failed proposals in the past hour.
Alert Rule
alert: EtcdHighNumberOfFailedProposals
annotations:
  description: |-
    Etcd server got more than 5 failed proposals past hour
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedproposals/
  summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
expr: increase(etcd_server_proposals_failed_total[1h]) > 5
for: 2m
labels:
  severity: warning
Meaning #
The EtcdHighNumberOfFailedProposals alert fires when etcd_server_proposals_failed_total increases by more than 5 over a 1-hour window and the condition holds for 2 minutes. A proposal is a write request submitted to etcd's Raft consensus layer, and a failed proposal is one the cluster could not commit. Such failures typically stem from leader elections or from a loss of quorum, so a sustained increase indicates that etcd is struggling to propose and commit changes to the cluster.
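The threshold check in the rule can be illustrated with a minimal Python sketch. It is a simplification: Prometheus's increase() also extrapolates to the window boundaries, so real values may differ slightly. The function names and sample values are hypothetical and chosen only for illustration.

```python
def counter_increase(samples):
    """Approximate the increase of a counter over one window.

    `samples` is a chronological list of raw counter values. A drop between
    consecutive samples is treated as a counter reset (for example, a process
    restart), so counting resumes from zero rather than going negative.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def would_fire(samples, threshold=5):
    """Mirror the rule's comparison: increase over the window > threshold."""
    return counter_increase(samples) > threshold


# Hypothetical samples of etcd_server_proposals_failed_total over one hour.
steady = [40, 40, 41, 42, 42]      # 2 failures in the window
with_restart = [40, 44, 0, 1, 3]   # 4 failures, a restart, then 3 more
```

Because resets are accounted for, restarting etcd zeroes the raw counter but does not hide failures that already occurred inside the window.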
Impact #
A high number of failed proposals can have a significant impact on the overall health and stability of the etcd cluster. It can lead to:
- Increased latency and slowed performance
- Rejected or timed-out write requests from clients
- Loss of changes that clients do not retry after a failed write
- Decreased availability and reliability of the etcd service
If left unaddressed, this issue can have a cascading effect on dependent systems and services, leading to broader system instability and downtime.
Diagnosis #
To diagnose the root cause of the EtcdHighNumberOfFailedProposals alert, follow these steps:
- Check etcd server logs for errors and warnings related to proposal failures.
- Investigate the etcd cluster’s current state, including the number of nodes, their health, and any ongoing maintenance or upgrades.
- Verify that the etcd servers have sufficient resources (e.g., CPU, memory, and disk space) to handle the current workload.
- Check for network connectivity issues between etcd nodes and other components in the system.
- Review etcd timing settings, such as the heartbeat interval (--heartbeat-interval) and election timeout (--election-timeout), to ensure they suit the network latency between members.
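Part of this triage can be sketched in Python by inspecting the text served by an etcd member's /metrics endpoint. The metric names below are ones etcd exposes, but the parser is deliberately simplified, and the sample text, function names, and thresholds are illustrative assumptions rather than recommended values.

```python
# Made-up sample of /metrics output, used only to exercise the functions.
SAMPLE = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 0
etcd_server_leader_changes_seen_total 7
etcd_server_proposals_failed_total 12
etcd_server_proposals_pending 9
"""


def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}.

    Simplification: only unlabelled samples are kept; comment lines and
    labelled series (such as histogram buckets) are skipped.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            try:
                values[parts[0]] = float(parts[1])
            except ValueError:
                continue
    return values


def triage(metrics, max_pending=5, max_leader_changes=3):
    """Return a list of findings; thresholds are illustrative, not official."""
    findings = []
    if metrics.get("etcd_server_has_leader") == 0:
        findings.append("member reports no leader")
    pending = metrics.get("etcd_server_proposals_pending", 0)
    if pending > max_pending:
        findings.append(f"{pending:.0f} proposals pending")
    changes = metrics.get("etcd_server_leader_changes_seen_total", 0)
    if changes > max_leader_changes:
        findings.append(f"{changes:.0f} leader changes seen")
    return findings
```

Calling triage(parse_metrics(SAMPLE)) on the sample above reports all three conditions. A missing leader or frequent leader changes would point toward network or resource problems rather than the proposals themselves.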
Mitigation #
To mitigate the EtcdHighNumberOfFailedProposals alert, take the following steps:
- Investigate and address the root cause of the failed proposals, based on the diagnosis steps above.
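- Restart an etcd member only if diagnosis points to a stuck or misbehaving process, and restart one member at a time so the cluster keeps quorum. A restart resets the failure counter but does not fix the underlying cause.
- Consider tuning the heartbeat interval (--heartbeat-interval) and election timeout (--election-timeout) to match the measured latency between members if leader elections are frequent.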
- Verify that etcd servers have sufficient resources to handle the current workload, and consider scaling up or optimizing resource allocation as needed.
- Implement additional monitoring and alerting to detect and respond to etcd issues more quickly in the future.
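When reviewing etcd timing settings, a quick sanity check against the measured peer round-trip time (RTT) can help. The sketch below encodes rules of thumb as commonly stated in etcd's tuning guidance (election timeout at least 10x the RTT and at least 5x the heartbeat interval, heartbeat interval roughly 0.5x to 1.5x the RTT). Treat these ratios as assumptions to verify against the documentation for the etcd version in use. The function name is hypothetical; the defaults of 100 ms and 1000 ms match etcd's default heartbeat interval and election timeout.

```python
def check_timing(rtt_ms, heartbeat_ms=100, election_ms=1000):
    """Compare etcd timing settings against a measured peer RTT.

    Returns a list of warnings; an empty list means none of the
    assumed rules of thumb were violated.
    """
    warnings = []
    if election_ms < 10 * rtt_ms:
        warnings.append("election timeout is below 10x the peer RTT")
    if election_ms < 5 * heartbeat_ms:
        warnings.append("election timeout is below 5x the heartbeat interval")
    if heartbeat_ms < 0.5 * rtt_ms:
        warnings.append("heartbeat interval is below 0.5x the peer RTT")
    return warnings
```

For example, check_timing(rtt_ms=10) returns no warnings with the default settings, whereas a cross-region cluster with a 150 ms RTT would be flagged for an election timeout that is too short.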