EtcdHttpRequestsSlow #
HTTP requests slowing down, 99th percentile is over 0.15s
Alert Rule
alert: EtcdHttpRequestsSlow
annotations:
  description: |-
    HTTP requests slowing down, 99th percentile is over 0.15s
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhttprequestsslow/
  summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15
for: 2m
labels:
  severity: warning
Meaning #
The EtcdHttpRequestsSlow alert fires when the 99th percentile of successful etcd HTTP request durations, estimated from bucket rates over a 1-minute window, stays above 0.15 seconds for 2 minutes. This indicates that etcd is processing HTTP requests more slowly than expected, which can degrade the performance and reliability of the etcd cluster.
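To make the threshold concrete, the sketch below mirrors how histogram_quantile estimates a quantile from cumulative histogram buckets: it finds the bucket containing the target rank and interpolates linearly inside it. This is a simplified reimplementation for illustration, not the Prometheus source, and the bucket bounds and rates are hypothetical values rather than real etcd output.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Assumes at least two buckets, the last one being +Inf, and 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second bucket rates, as rate(...[1m]) would produce them.
example_buckets = [
    (0.05, 90.0),
    (0.1, 97.0),
    (0.2, 99.5),
    (0.4, 100.0),
    (math.inf, 100.0),
]
p99 = histogram_quantile(0.99, example_buckets)
print(f"p99 = {p99:.3f}s, above threshold: {p99 > 0.15}")  # roughly 0.18s, above 0.15
```

With these numbers only 0.5% of requests are slower than 0.2 seconds, yet the estimated p99 still exceeds 0.15 seconds, because the estimate can only be as precise as the bucket boundaries allow.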
Impact #
- Slower etcd request processing can lead to increased latency and decreased responsiveness in dependent systems.
- Prolonged periods of slow request processing can cause etcd to become unavailable or even lead to cluster instability.
- This can have a cascading effect on the overall system, causing failures or errors in dependent applications and services.
Diagnosis #
To diagnose the root cause of the slow etcd HTTP requests, follow these steps:
- Review etcd logs for any errors or warnings related to request processing.
- Check the etcd cluster’s resource utilization (CPU, memory, disk) to ensure it is within acceptable limits.
- Verify that the etcd cluster is properly configured and that no network issues are affecting communication between nodes.
- Investigate any recent changes to the etcd configuration, network topology, or dependent systems that may be contributing to the slow request processing.
- Use tools like etcdctl or curl to verify etcd’s response times and latency.
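As a complement to etcdctl or curl, the last diagnosis step can be sketched as a small probe that times repeated requests and reports the 99th percentile. This is a minimal sketch under assumptions: the URL `http://127.0.0.1:2379/health` assumes an etcd member serving plain HTTP on the default client port, and the helper names (`time_calls`, `nearest_rank_percentile`, `fetch_health`) are made up for this example.

```python
import math
import time
import urllib.request


def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile (integer 1..100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fetch, attempts):
    """Call fetch() repeatedly and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def fetch_health(url="http://127.0.0.1:2379/health"):
    """Fetch the health endpoint; the URL is an assumption, adjust it per cluster."""
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.read()
```

Running `nearest_rank_percentile(time_calls(fetch_health, 100), 99)` on a host that can reach the member gives a client-side latency figure to compare against the 0.15 second threshold. It includes network round-trip time, so it will not match the server-side histogram exactly.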
Mitigation #
To mitigate the impact of slow etcd HTTP requests, follow these steps:
- Check for clients issuing stuck or slow requests to etcd and cancel those requests on the client side if necessary.
- Implement request timeouts to prevent slow requests from blocking other requests.
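- Consider increasing the resources available to each etcd member (e.g., faster disks, more CPU or memory) to handle increased load. Adding more members improves fault tolerance but does not raise write throughput, because every write must still be replicated to a quorum.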
- Review the etcd configuration, for example checking that the --listen-peer-urls and --listen-client-urls settings bind peer and client traffic to the intended network interfaces.
- Implement load balancing or proxying to reduce the load on individual etcd nodes.
- Consider upgrading etcd to a newer version that includes performance enhancements.
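The request-timeout step in the list above can be illustrated on the client side. The sketch below bounds how long a caller waits for an HTTP response and treats a timeout or connection failure as a missing result instead of blocking indefinitely. The function name `fetch_with_timeout` and the example URL are assumptions for illustration. Real clients would normally rely on the timeout options of their etcd client library.

```python
import urllib.request


def fetch_with_timeout(url, timeout_s):
    """Return the response body, or None if the request fails or times out.

    URLError, HTTPError, and socket timeouts all derive from OSError,
    so one except clause covers refused connections and slow responses.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except OSError:
        return None
```

For example, `fetch_with_timeout("http://127.0.0.1:2379/health", 0.5)` gives up after about half a second per blocking socket operation, so a slow member cannot stall the caller for long.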
Remember to investigate and address the root cause of the slow request processing to prevent future occurrences of this alert.