EtcdHighNumberOfFailedGrpcRequests #
More than 1% (warning) or 5% (critical) of gRPC requests to Etcd are failing
Alert Rule
alert: EtcdHighNumberOfFailedGrpcRequests
annotations:
  description: |-
    More than 1% GRPC request failure detected in Etcd
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedgrpcrequests/
  summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01
for: 2m
labels:
  severity: warning

alert: EtcdHighNumberOfFailedGrpcRequests
annotations:
  description: |-
    More than 5% GRPC request failure detected in Etcd
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighnumberoffailedgrpcrequests/
  summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05
for: 2m
labels:
  severity: critical
Meaning #
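The EtcdHighNumberOfFailedGrpcRequests alert fires when, for any combination of grpc_service and grpc_method, the share of gRPC requests that finish with a status code other than OK exceeds 1% (severity warning) or 5% (severity critical). The ratio is computed from grpc_server_handled_total rates over a 1-minute window and must stay above the threshold for 2 minutes. The alert indicates that Etcd is failing or rejecting client requests, which can affect the reliability and performance of any system that depends on it.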
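As a rough illustration of the arithmetic behind the rule, the following Python sketch computes the failure ratio for one grpc_service/grpc_method pair and maps it to a severity. The two thresholds come from the alert rules; the function names and the sample rates are hypothetical, and the sketch does not model the `for: 2m` hold period.

```python
from typing import Optional

WARNING_RATIO = 0.01   # warning rule: ratio > 0.01
CRITICAL_RATIO = 0.05  # critical rule: ratio > 0.05


def failure_ratio(failed_per_sec: float, total_per_sec: float) -> Optional[float]:
    """Return failed/total, or None when the method saw no traffic.

    In PromQL, 0/0 evaluates to NaN and NaN > 0.01 is false, so an idle
    method never fires the alert. None stands in for that case here.
    """
    if total_per_sec == 0:
        return None
    return failed_per_sec / total_per_sec


def classify(ratio: Optional[float]) -> str:
    """Map a failure ratio to the severity the rules would assign."""
    if ratio is None:
        return "ok"
    if ratio > CRITICAL_RATIO:
        return "critical"
    if ratio > WARNING_RATIO:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical rates: 2 failed requests/s out of 100 requests/s.
    print(classify(failure_ratio(2.0, 100.0)))
```

Both comparisons are strict, so a ratio of exactly 1% does not count as a warning in this sketch, which matches the `>` operator used in the rule expressions.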
Impact #
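If this alert is not addressed, it can lead to:
- Increased latency and errors for clients performing Etcd reads, writes, watches, and lease operations
- Reduced reliability and availability of the systems that depend on Etcd
- Writes that are not persisted and clients working from stale data while requests keep failing
- Additional load from client retries, which can degrade performance further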
Diagnosis #
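To diagnose the root cause of the issue, follow these steps:
- Use the grpc_service and grpc_method labels on the alert to identify which calls are failing, then break grpc_server_handled_total down by grpc_code to see how they fail
- Check the Etcd server logs for errors or warnings related to gRPC requests
- Verify that the Etcd cluster is properly configured, that a leader is elected, and that all members are reachable and in sync
- Check for network connectivity issues between Etcd nodes and between clients and Etcd
- Verify that gRPC requests are not being throttled or rate-limited (for example, responses with the ResourceExhausted code)
- Check for known bugs or issues in the running version of Etcd or related components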
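To see which calls fail and with which status code, the counters exposed on an Etcd member's metrics endpoint can be tallied by label. The Python sketch below is a minimal example under stated assumptions: it parses Prometheus text-format lines for grpc_server_handled_total and sums the non-OK counters per (grpc_service, grpc_method, grpc_code). The `SAMPLE` text and the `failed_by_code` helper are made up for illustration, and the values are cumulative counters since process start rather than the 1-minute rates used by the alert.

```python
import re
from collections import Counter

# Matches lines such as:
# grpc_server_handled_total{grpc_code="OK",grpc_method="Range",...} 123
SAMPLE_RE = re.compile(r'^grpc_server_handled_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape output, not taken from a real cluster.
SAMPLE = """\
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="OK",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 9800
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Range",grpc_service="etcdserverpb.KV",grpc_type="unary"} 150
grpc_server_handled_total{grpc_code="Canceled",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 40
"""


def failed_by_code(metrics_text: str) -> Counter:
    """Sum non-OK grpc_server_handled_total counters per (service, method, code)."""
    totals: Counter = Counter()
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if labels.get("grpc_code") == "OK":
            continue
        key = (
            labels.get("grpc_service"),
            labels.get("grpc_method"),
            labels.get("grpc_code"),
        )
        totals[key] += float(match.group("value"))
    return totals


if __name__ == "__main__":
    # Largest failure counts first.
    for (service, method, code), count in failed_by_code(SAMPLE).most_common():
        print(f"{service}/{method} {code}: {count:.0f}")
```

The breakdown helps separate likely server-side problems from client behaviour: a code such as Canceled often means the client closed the call, whereas Unavailable or ResourceExhausted more often point at the cluster itself.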
Mitigation #
To mitigate the issue, follow these steps:
- Check the Etcd documentation for guidance on troubleshooting gRPC request failures
- Verify that the Etcd cluster is sized to handle the current load
- Reduce the load on the Etcd cluster, for example through load balancing or client-side caching
- Consider increasing the resources (CPU, memory, etc.) allocated to the Etcd nodes
- If the failures started after an upgrade and persist, consider rolling back to a previous version of Etcd or related components, after confirming that the downgrade path is supported
Note: The runbook URL provided in the annotation points to a more comprehensive runbook that may include additional steps and details.