EtcdHighNumberOfFailedGrpcRequests #

More than 1% GRPC request failure detected in Etcd

Alert Rule

alert: EtcdHighNumberOfFailedGrpcRequests
annotations:
  description: |-
    More than 1% GRPC request failure detected in Etcd
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedgrpcrequests/
  summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance
    }})
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service,
  grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method)
  > 0.01
for: 2m
labels:
  severity: warning

alert: EtcdHighNumberOfFailedGrpcRequests
annotations:
  description: |-
    More than 5% GRPC request failure detected in Etcd
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedgrpcrequests/
  summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance
    }})
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service,
  grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method)
  > 0.05
for: 2m
labels:
  severity: critical

Meaning #

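The EtcdHighNumberOfFailedGrpcRequests alert fires when, for any GRPC service and method pair in Etcd, the share of requests that complete with a code other than OK stays above a threshold for 2 minutes. The ratio is computed from 1-minute request rates: above 1% raises a warning, and above 5% raises a critical alert. Because the expression aggregates by grpc_service and grpc_method, it is evaluated across all instances together, so the {{ $labels.instance }} placeholder in the summary renders empty. Every non-OK code counts as a failure, including codes caused by clients, so the alert can indicate either a server-side problem in Etcd or misbehaving clients.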
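
As a rough illustration of the arithmetic behind the two rules, the sketch below recomputes the ratio for a single service and method pair and maps it to a severity. The function names are hypothetical and exist only for this example; the thresholds are the ones in the rules above.

```python
from typing import Optional

WARNING_THRESHOLD = 0.01   # from the rule with severity: warning
CRITICAL_THRESHOLD = 0.05  # from the rule with severity: critical


def failure_ratio(failed_per_sec: float, total_per_sec: float) -> Optional[float]:
    """Return the share of non-OK requests, or None when there is no traffic.

    With no traffic there is no meaningful ratio, and the alert does not fire.
    """
    if total_per_sec == 0:
        return None
    return failed_per_sec / total_per_sec


def severity(ratio: Optional[float]) -> Optional[str]:
    """Return the highest severity whose threshold the ratio exceeds.

    Both comparisons are strict, matching the '>' in the rule expressions.
    Above 5% both rules match; only the higher severity is reported here.
    """
    if ratio is None:
        return None
    if ratio > CRITICAL_THRESHOLD:
        return "critical"
    if ratio > WARNING_THRESHOLD:
        return "warning"
    return None


if __name__ == "__main__":
    # 3 failed requests per second out of 100 is a 3% failure ratio.
    print(severity(failure_ratio(3.0, 100.0)))
```

The ratio must also stay above the threshold for the full 2-minute "for" period before the alert is sent, which this sketch does not model.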

Impact #

If this alert is not addressed, it can lead to:

Failed or retried reads and writes for clients of Etcd, with higher latency
Decreased reliability and availability of every system that depends on Etcd
Clients acting on stale data when their reads, writes, or watches fail
Additional load from client retries, which can degrade performance further

Diagnosis #

To diagnose the root cause of the issue, follow these steps:

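Read grpc_service and grpc_method from the alert labels to see which calls are failing, then break the failures down by grpc_code
Check the Etcd server logs for errors or warnings from the same time window
Verify that every Etcd member is healthy, that the cluster has a leader, and that all members are in sync (for example with etcdctl endpoint health and etcdctl endpoint status)
Check for network connectivity issues between Etcd members, and between clients and Etcd
Check whether requests are being rejected with a code such as ResourceExhausted, which typically indicates throttling or rate limiting
Check for known bugs in the running version of Etcd or related components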

Mitigation #

To mitigate the issue, follow these steps:

Consult the Etcd documentation for the specific error codes and log messages found during diagnosis
Verify that the Etcd cluster is sized to handle the current load
Reduce the load on the Etcd cluster, for example by load balancing client requests or caching reads on the client side
Consider increasing the resources (CPU, memory, disk I/O) allocated to the Etcd nodes
If the failures began after an upgrade and persist, consider rolling back to the previous version of Etcd or of the related components

Note: The runbook URL provided in the annotation points to a more comprehensive runbook that may include additional steps and details.