EtcdHighNumberOfFailedHttpRequests #
More than 1% HTTP failure detected in Etcd
Alert Rule
- alert: EtcdHighNumberOfFailedHttpRequests
  annotations:
    description: |-
      More than 1% HTTP failure detected in Etcd
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
    runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedhttprequests/
    summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
  expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01
  for: 2m
  labels:
    severity: warning
- alert: EtcdHighNumberOfFailedHttpRequests
  annotations:
    description: |-
      More than 5% HTTP failure detected in Etcd
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
    runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedhttprequests/
    summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
  expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05
  for: 2m
  labels:
    severity: critical
Meaning #
The EtcdHighNumberOfFailedHttpRequests alert fires when, for any HTTP method, failed HTTP requests to Etcd exceed 1% (warning) or 5% (critical) of the HTTP requests received. The ratio is computed from 1-minute rates and must stay above the threshold for 2 minutes. This indicates a potential issue with Etcd’s ability to handle requests successfully.
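As a rough illustration of how the two thresholds in the alert rule relate, the sketch below applies the same failed/received ratio to per-method request rates. The function names and the sample rates are hypothetical and chosen only for illustration; Prometheus evaluates the real expression itself.

```python
from typing import Optional

WARNING_THRESHOLD = 0.01   # from the rule: ratio > 0.01 -> severity: warning
CRITICAL_THRESHOLD = 0.05  # from the rule: ratio > 0.05 -> severity: critical


def failure_ratio(failed_per_sec: float, received_per_sec: float) -> Optional[float]:
    """Failed/received ratio for one HTTP method; None when nothing was received."""
    if received_per_sec <= 0:
        return None  # no traffic for this method, so there is nothing to alert on
    return failed_per_sec / received_per_sec


def severity(ratio: Optional[float]) -> Optional[str]:
    """Highest severity whose threshold the ratio exceeds, or None."""
    if ratio is None:
        return None
    if ratio > CRITICAL_THRESHOLD:
        return "critical"  # the warning rule matches as well at this level
    if ratio > WARNING_THRESHOLD:
        return "warning"
    return None


# Hypothetical per-method rates in requests/second: (failed, received).
rates = {"GET": (0.2, 40.0), "PUT": (0.9, 12.0), "DELETE": (0.0, 0.0)}
for method, (failed, received) in rates.items():
    print(method, severity(failure_ratio(failed, received)))
```

Because the expression groups by method, a low-traffic method with a few failures can cross the threshold even when the overall failure ratio is small.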
Impact #
A high number of failed HTTP requests to Etcd can have several consequences:
- Increased latency and errors for applications relying on Etcd for data storage and retrieval
- Lost updates or stale reads in clients, because failed writes are not applied and failed reads return no data
- Decreased overall system reliability and availability
Diagnosis #
To diagnose the root cause of the issue, follow these steps:
- Check the Etcd server logs for errors or warnings related to HTTP requests
- Verify the network connectivity and configuration between the Etcd server and the clients making HTTP requests
- Check the Etcd server’s resource utilization (CPU, memory, disk space) to ensure it is not overloaded
- Review the Etcd cluster’s health and ensure that all members are in a healthy state
- Check for any recent changes or deployments that may have introduced a regression
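To narrow down which requests are failing, it can help to break the two counters from the alert expression down by method using a raw metrics scrape. The sketch below parses Prometheus text-format output and sums `etcd_http_failed_total` and `etcd_http_received_total` per `method` label. The function name and the sample text are hypothetical, and the `code` label in the sample is an assumption about how the failed counter is labeled.

```python
import re
from collections import defaultdict

# Matches labeled samples such as: etcd_http_failed_total{code="404",method="GET"} 3
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def counters_by_method(metrics_text: str) -> dict:
    """Sum the failed and received counters per HTTP method."""
    totals = defaultdict(lambda: {"failed": 0.0, "received": 0.0})
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and unlabeled metrics
        name, raw_labels, value = match.groups()
        try:
            number = float(value)
        except ValueError:
            continue  # not a numeric sample
        method = dict(LABEL_RE.findall(raw_labels)).get("method", "")
        if name == "etcd_http_failed_total":
            totals[method]["failed"] += number
        elif name == "etcd_http_received_total":
            totals[method]["received"] += number
    return dict(totals)


# Illustrative scrape output, not real data.
sample = """\
# TYPE etcd_http_failed_total counter
etcd_http_failed_total{code="404",method="GET"} 3
etcd_http_failed_total{code="500",method="PUT"} 2
etcd_http_received_total{method="GET"} 200
etcd_http_received_total{method="PUT"} 50
"""
print(counters_by_method(sample))
```

These counters are cumulative since the process started, so a single scrape shows totals rather than the 1-minute rate the alert uses; comparing two scrapes taken a known interval apart gives a closer picture.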
Mitigation #
To mitigate the issue, follow these steps:
- Restart the affected Etcd server to reset its client connections and clear transient issues; in a multi-member cluster, restart one member at a time so the cluster keeps quorum
- Investigate and resolve any underlying issues with network connectivity or configuration
- Implement retries and exponential backoff strategies in clients to handle temporary failures
- Consider increasing the resources (e.g., CPU, memory) allocated to the Etcd server to handle increased load
- Verify that the Etcd cluster is properly configured and that all members are in a healthy state
Remember to also address the root cause of the issue to prevent it from happening again in the future.
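The retry and exponential backoff step in the mitigation list can be sketched as follows. This is a minimal, client-agnostic illustration: `retry_with_backoff` and its parameters are hypothetical names, and treating `ConnectionError` and `TimeoutError` as the retryable failures is an assumption, since the exceptions worth retrying depend on the client library in use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay ceiling doubles per attempt, capped at max_delay;
            # a random delay below the ceiling (jitter) spreads out retries.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


# Usage with a simulated operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


print(retry_with_backoff(flaky_request, base_delay=0.01))
```

The jitter keeps many clients from retrying in lockstep against a server that is already struggling, and the attempt cap ensures a persistent failure is reported rather than retried indefinitely.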