ThanosQueryGrpcClientErrorRate #
Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.
Alert Rule
alert: ThanosQueryGrpcClientErrorRate
annotations:
description: |-
Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-query/thanosquerygrpcclienterrorrate/
summary: Thanos Query Grpc Client Error Rate (instance {{ $labels.instance }})
expr: (sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m]))
/ sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))) *
100 > 5
for: 5m
labels:
severity: warning
Meaning #
The ThanosQueryGrpcClientErrorRate alert rule is triggered when the error rate of the Thanos Query gRPC client exceeds 5% over a 5-minute period. This indicates that the Thanos Query instance is experiencing issues when sending requests to the gRPC server.
Impact #
- Thanos Query instance is not functioning correctly, leading to incomplete or inaccurate query results.
- Potential impact on downstream systems that rely on Thanos Query, such as Grafana dashboards or alerting systems.
- Downtime or errors in the monitoring and alerting system.
Diagnosis #
- Check the Thanos Query instance logs for errors related to gRPC client connections.
- Verify the gRPC server is operational and responding to requests.
- Investigate recent changes to the Thanos Query configuration or environment that may be causing the issue.
- Check the network connectivity between the Thanos Query instance and the gRPC server.
Mitigation #
- Restart the Thanos Query instance to reset the gRPC client connections.
- Verify the gRPC server is configured correctly and reachable by the Thanos Query instance.
- Check for and resolve any network connectivity issues between the Thanos Query instance and the gRPC server.
- Implement additional logging and monitoring to detect future issues with the gRPC client connections.
- Consider increasing the gRPC client timeout or retries to improve resilience.
Note: For more detailed steps and troubleshooting guidance, refer to the runbook link provided in the alert annotation: https://github.com/srerun/prometheus-alerts/blob/main/content/runbooks/thanos-query/ThanosQueryGrpcClientErrorRate.md