ThanosRuleGrpcErrorRate #

Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.

Alert Rule

alert: ThanosRuleGrpcErrorRate
annotations:
  description: |-
    Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-ruler/thanosrulegrpcerrorrate/
  summary: Thanos Rule Grpc Error Rate (instance {{ $labels.instance }})
expr: (sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",
  job=~".*thanos-rule.*"}[5m]))/  sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m]))
  * 100 > 5)
for: 5m
labels:
  severity: warning


Meaning #

The ThanosRuleGrpcErrorRate alert fires when, for at least 5 minutes, more than 5% of the gRPC requests started on a Thanos Rule instance (measured as a 5-minute rate) finish with one of the status codes Unknown, ResourceExhausted, Internal, Unavailable, DataLoss, or DeadlineExceeded. It has severity warning and indicates that the instance is having trouble handling gRPC requests, which can lead to data loss, rule evaluation failures, or other problems.
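
To see how close an instance is to the threshold, the alert expression can be graphed without its final comparison. The query below is a sketch of that: it is the same expression as in the alert rule with only the `> 5` removed, reformatted for readability.

```promql
# gRPC error percentage per Thanos Rule instance over the last 5 minutes.
# Same selectors as the alert rule; the "> 5" threshold is dropped so the
# raw percentage is returned for every instance, not only the failing ones.
sum by (job, instance) (
  rate(grpc_server_handled_total{
    grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",
    job=~".*thanos-rule.*"
  }[5m])
)
/
sum by (job, instance) (
  rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m])
)
* 100
```

As a worked example with made-up numbers: an instance that starts 40 requests per second while 3 per second end in one of the listed codes has an error percentage of 3 / 40 * 100 = 7.5, which is above 5, so the alert fires once that condition has held for 5 minutes.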

Impact #

If the underlying gRPC failures are not resolved, they may lead to:

Data loss or inconsistencies in the Thanos database
Rule evaluation failures, causing alerts and notifications to fail
Increased latency or timeouts in the system
Increased load on the system, leading to performance degradation

Diagnosis #

To diagnose the issue, follow these steps:

Check the Thanos Rule instance logs for errors related to gRPC handling
Verify that the instance is not experiencing high CPU or memory usage
Check the network connectivity and configuration for issues
Review the rule evaluation metrics to identify any patterns or trends
Check for any recent changes or updates to the Thanos Rule configuration or code
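
The metric-based checks above can be sketched as the PromQL queries below, each run on its own. They assume the `job=~".*thanos-rule.*"` selector from the alert rule matches your deployment, that the gRPC server metrics carry the usual `grpc_service` and `grpc_method` labels, and that the instance exposes the standard process metrics and `prometheus_rule_evaluation_failures_total`. Adjust the names if your setup differs.

```promql
# 1. Which status codes and methods account for the errors?
sum by (instance, grpc_service, grpc_method, grpc_code) (
  rate(grpc_server_handled_total{
    grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded",
    job=~".*thanos-rule.*"
  }[5m])
)

# 2. CPU usage per instance, in CPU cores (CPU seconds consumed per second).
rate(process_cpu_seconds_total{job=~".*thanos-rule.*"}[5m])

# 3. Resident memory per instance, in bytes.
process_resident_memory_bytes{job=~".*thanos-rule.*"}

# 4. Rule evaluation failures per second, to correlate with the gRPC errors.
sum by (job, instance) (
  rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m])
)
```

A breakdown dominated by ResourceExhausted or DeadlineExceeded tends to point at load or resource limits, while Unavailable or Internal more often points at restarts, networking, or a faulty release. Treat this as a starting heuristic rather than a rule.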

Mitigation #

To mitigate the issue, follow these steps:

Restart the Thanos Rule instance to clear any temporary issues
Check and update the gRPC configuration and settings
Verify that the instance has sufficient resources (CPU, memory, etc.)
Implement circuit breakers or retries in the clients that call the Thanos Rule gRPC endpoint, so that temporary errors are absorbed
Review and optimize the rule evaluation configuration to reduce load and errors
Consider enabling gRPC tracing to gain more insight into the issue
If the issue persists, roll back recent changes or updates to the Thanos Rule configuration or code