EtcdHttpRequestsSlow

HTTP requests slowing down, 99th percentile is over 0.15s

Alert Rule

alert: EtcdHttpRequestsSlow
annotations:
  description: |-
    HTTP requests slowing down, 99th percentile is over 0.15s
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhttprequestsslow/
  summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m]))
  > 0.15
for: 2m
labels:
  severity: warning

Here is a runbook for the EtcdHttpRequestsSlow alert:

Meaning

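The EtcdHttpRequestsSlow alert fires when the 99th percentile of etcd HTTP request durations, computed by histogram_quantile over a 1-minute rate of the etcd_http_successful_duration_seconds histogram, stays above 0.15 seconds for 2 minutes. Because the metric only covers successful requests, failed requests do not contribute to this latency figure. The alert indicates that etcd is processing HTTP requests more slowly than expected, which can degrade the performance and reliability of the etcd cluster.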
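To make the threshold concrete, the following is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the approach Prometheus documents for histogram_quantile on classic histograms. It is a simplified illustration, not the Prometheus implementation, and the bucket bounds and counts in `sample` are hypothetical values chosen for the example. In the alert the inputs are per-second bucket rates rather than raw counts, but the interpolation works the same way.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is an iterable of (upper_bound, cumulative_count) pairs and
    must include a bucket whose upper bound is math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least one finite bucket and a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite bound.
                return prev_bound
            span = count - prev_count
            if span == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count


# Hypothetical cumulative counts for illustration only.
sample = [(0.05, 900), (0.1, 970), (0.2, 995), (0.4, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)
print(f"{p99:.3f}")   # prints 0.180
print(p99 > 0.15)     # prints True, so this sample would exceed the threshold
```

Note that the estimate depends on the bucket boundaries: here the 99th percentile falls in the 0.1 to 0.2 second bucket, so the reported value is an interpolation within that range rather than an exact measurement.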

Impact

Slower etcd request processing can lead to increased latency and decreased responsiveness in dependent systems.
Prolonged periods of slow request processing can cause etcd to become unavailable or even lead to cluster instability.
This can have a cascading effect on the overall system, causing failures or errors in dependent applications and services.

Diagnosis

To diagnose the root cause of the slow etcd HTTP requests, follow these steps:

Review etcd logs for any errors or warnings related to request processing.
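Check the etcd cluster’s resource utilization (CPU, memory, disk) to ensure it is within acceptable limits. Pay particular attention to disk latency, which etcd exposes through the etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds histograms.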
Verify that the etcd cluster is properly configured and that no network issues are affecting communication between nodes.
Investigate any recent changes to the etcd configuration, network topology, or dependent systems that may be contributing to the slow request processing.
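Use etcdctl (for example, etcdctl endpoint health and etcdctl endpoint status --write-out=table) or curl against the /health endpoint to measure etcd’s response times directly.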
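The direct measurement in the last step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the URL `http://127.0.0.1:2379/health` assumes an etcd member listening on the default client port without TLS, and the helper names `nearest_rank`, `time_probe`, and `get_health` are hypothetical. A cluster that requires client certificates would need an SSL context added.

```python
import math
import time
import urllib.request


def nearest_rank(samples, q):
    """Return the q-quantile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def time_probe(probe, attempts=50):
    """Call probe() repeatedly and return each call's duration in seconds."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        probe()
        durations.append(time.perf_counter() - start)
    return durations


def get_health(url="http://127.0.0.1:2379/health", timeout=2.0):
    """Issue one GET against the assumed etcd health endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
```

On a host that can reach etcd, `nearest_rank(time_probe(get_health), 0.99)` gives a client-side 99th percentile to compare against the 0.15 second threshold. The figure includes network round-trip time, so it will usually read higher than the server-side histogram.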

Mitigation

To mitigate the impact of slow etcd HTTP requests, follow these steps:

Identify clients that issue slow or expensive requests (for example, very large range reads) and stop or throttle them if necessary.
Implement request timeouts to prevent slow requests from blocking other requests.
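Consider increasing the resources available to each etcd member (e.g., faster disks, more CPU or memory) to handle increased load. Adding more members improves fault tolerance but does not speed up writes, because every write must still be replicated to a majority of members.
Optimize etcd configuration for better performance, such as tuning --heartbeat-interval and --election-timeout to match the network round-trip time between members, or adjusting --snapshot-count. The --listen-peer-urls and --listen-client-urls settings only control which addresses etcd binds to and do not affect request latency.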
Implement load balancing or proxying to reduce the load on individual etcd nodes.
Consider upgrading etcd to a newer version that includes performance enhancements.
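The request-timeout step above can be illustrated with a small client-side wrapper. This is a sketch under assumptions, not a prescribed implementation: `call_with_timeout` is a hypothetical helper, and `request` stands for any callable that accepts a timeout in seconds and raises `TimeoutError` when it is exceeded. Retries are bounded and spaced with exponential backoff so that a slow etcd is not hit with an immediate burst of repeat requests.

```python
import time


def call_with_timeout(request, timeout=1.0, attempts=3, backoff=0.2, sleep=time.sleep):
    """Call request(timeout), retrying on TimeoutError with exponential backoff.

    The final TimeoutError is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request(timeout)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            # Wait 0.2s, 0.4s, 0.8s, ... (with the default backoff) before retrying.
            sleep(backoff * 2 ** attempt)
```

The `sleep` parameter exists only so the wait can be replaced in tests. Keeping `attempts` small matters here, since each retry adds load to a cluster that is already responding slowly.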

Remember to investigate and address the root cause of the slow request processing to prevent future occurrences of this alert.