EtcdHighNumberOfFailedHttpRequests #

More than 1% HTTP failure detected in Etcd

Alert Rule

alert: EtcdHighNumberOfFailedHttpRequests
annotations:
  description: |-
    More than 1% HTTP failure detected in Etcd
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedhttprequests/
  summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance
    }})
expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m]))
  BY (method) > 0.01
for: 2m
labels:
  severity: warning

alert: EtcdHighNumberOfFailedHttpRequests
annotations:
  description: |-
    More than 5% HTTP failure detected in Etcd
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedhttprequests/
  summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance
    }})
expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m]))
  BY (method) > 0.05
for: 2m
labels:
  severity: critical


Meaning #

The EtcdHighNumberOfFailedHttpRequests alert fires when, for any HTTP method, the rate of failed HTTP requests to Etcd stays above 1% (warning) or 5% (critical) of the rate of HTTP requests received, for at least 2 minutes. Both rates are computed over a 1-minute window. This indicates a potential issue with Etcd’s ability to handle requests successfully.
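
As a rough illustration of how the two rules evaluate, the Python sketch below reproduces the ratio check for a single HTTP method. The function name `classify` and the sample rates are hypothetical; only the 0.01 and 0.05 thresholds come from the alert rules above. It does not model the `for: 2m` clause, so it only shows which threshold a given pair of rates would cross.

```python
import math
from typing import Optional

WARNING_THRESHOLD = 0.01   # "> 0.01" in the warning rule
CRITICAL_THRESHOLD = 0.05  # "> 0.05" in the critical rule


def classify(failed_per_sec: float, received_per_sec: float) -> Optional[str]:
    """Return the highest severity whose threshold the failure ratio exceeds."""
    if received_per_sec == 0:
        if failed_per_sec == 0:
            # 0 / 0 is NaN in PromQL, which never satisfies the comparison.
            return None
        ratio = math.inf
    else:
        ratio = failed_per_sec / received_per_sec
    # A ratio above 5% matches both rules; report the more severe one.
    if ratio > CRITICAL_THRESHOLD:
        return "critical"
    if ratio > WARNING_THRESHOLD:
        return "warning"
    return None


# Hypothetical per-method rates: (failed requests/s, received requests/s).
sample_rates = {
    "GET": (0.2, 50.0),
    "POST": (0.3, 10.0),
    "PUT": (1.5, 20.0),
    "DELETE": (0.0, 0.0),
}

for method, (failed, received) in sample_rates.items():
    print(method, classify(failed, received))
```

Because the expression groups by `method`, a single noisy method can fire the alert even when the overall failure ratio is low.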

Impact #

A high number of failed HTTP requests to Etcd can have several consequences:

Increased latency and errors for applications relying on Etcd for data storage and retrieval
Potential data inconsistencies or lost updates in client applications when failed writes or reads are not detected and retried
Decreased overall system reliability and availability

Diagnosis #

To diagnose the root cause of the issue, follow these steps:

Check the Etcd server logs for errors or warnings related to HTTP requests
Verify the network connectivity and configuration between the Etcd server and the clients making HTTP requests
Check the Etcd server’s resource utilization (CPU, memory, disk space) to ensure it is not overloaded
Review the Etcd cluster’s health and ensure that all members are in a healthy state
Check for any recent changes or deployments that may have introduced a regression
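
The member health check above can be sketched in Python by polling each member's `/health` HTTP endpoint. This is a minimal sketch under assumptions: the helper names (`parse_health`, `fetch_health`, `check_members`) are hypothetical, the endpoint URLs must be supplied by the operator, and clusters that require TLS client certificates would need an SSL context that is not shown here.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable


def parse_health(body: str) -> bool:
    """Interpret the JSON body returned by an etcd member's /health endpoint."""
    try:
        value = json.loads(body).get("health")
    except (ValueError, AttributeError):
        # Not JSON, or not a JSON object: treat as unhealthy.
        return False
    # etcd reports the flag as the string "true"; a JSON boolean is accepted too.
    return value is True or value == "true"


def fetch_health(endpoint: str, timeout: float = 2.0) -> str:
    """Fetch <endpoint>/health and return the response body as text."""
    with urllib.request.urlopen(f"{endpoint}/health", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def check_members(
    endpoints: Iterable[str],
    fetch: Callable[[str], str] = fetch_health,
) -> Dict[str, bool]:
    """Map each endpoint to True (healthy) or False (unhealthy or unreachable)."""
    results: Dict[str, bool] = {}
    for endpoint in endpoints:
        try:
            results[endpoint] = parse_health(fetch(endpoint))
        except OSError:
            # Covers connection errors, timeouts, and HTTP error statuses
            # raised by urllib (URLError and HTTPError subclass OSError).
            results[endpoint] = False
    return results
```

For example, `check_members(["http://127.0.0.1:2379"])` would return a dictionary with one entry for that (placeholder) endpoint. The `fetch` parameter is injectable so the logic can be exercised without a live cluster.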

Mitigation #

To mitigate the issue, follow these steps:

Restart the affected Etcd member to refresh its connections to the clients and clear any temporary issues; in a multi-member cluster, restart one member at a time so the cluster keeps quorum
Investigate and resolve any underlying issues with network connectivity or configuration
Implement retries and exponential backoff strategies in clients to handle temporary failures
Consider increasing the resources (e.g., CPU, memory) allocated to the Etcd server to handle increased load
Verify that the Etcd cluster is properly configured and that all members are in a healthy state
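
The client-side retry step above could look roughly like the following Python sketch, which retries an operation with exponential backoff and full jitter. The function name `retry_with_backoff`, its default values, and the choice of retryable exceptions are assumptions made for illustration; a real client would wrap its own Etcd call and retry only on errors it knows to be transient.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on the given exceptions with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # The delay cap doubles each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time in [0, cap] so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Retries are only safe for idempotent requests or requests guarded by a precondition; blindly retrying a non-idempotent write can apply it twice.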

Remember to also address the root cause of the issue to prevent it from happening again in the future.