ThanosReceiveHttpRequestErrorRateHigh

Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.

Alert Rule

alert: ThanosReceiveHttpRequestErrorRateHigh
annotations:
  description: |-
    Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-receiver/thanosreceivehttprequesterrorratehigh/
  summary: Thanos Receive Http Request Error Rate High (job {{ $labels.job }})
expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*",
  handler="receive"}[5m])) / sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*",
  handler="receive"}[5m]))) * 100 > 5
for: 5m
labels:
  severity: critical

Meaning

The ThanosReceiveHttpRequestErrorRateHigh alert fires when, for a given job, more than 5% of the requests to the Thanos Receive handler="receive" endpoint return a 5xx status code, measured as a 5-minute rate, and the condition persists for 5 minutes. It indicates that Thanos Receive is failing to process incoming remote-write requests, which can lead to data loss or gaps in the stored metrics.
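
As a rough illustration of the arithmetic behind the expression, the sketch below recomputes the percentage from two per-second rates. The function names and the sample numbers are hypothetical; only the 5% threshold comes from the rule.

```python
THRESHOLD_PERCENT = 5.0  # from the alert rule: "... * 100 > 5"


def error_percent(rate_5xx: float, rate_total: float) -> float:
    """Percentage of requests answered with a 5xx code.

    Both arguments are per-second rates, as returned by rate(...[5m]).
    """
    if rate_total == 0:
        # PromQL yields NaN for 0/0, and NaN > 5 is false, so nothing fires.
        return float("nan")
    return rate_5xx / rate_total * 100


def condition_met(rate_5xx: float, rate_total: float) -> bool:
    """True when the expression would return a sample for this job."""
    return error_percent(rate_5xx, rate_total) > THRESHOLD_PERCENT
```

For example, 3 failing requests per second out of 48 gives 6.25%, which meets the condition, while 1 out of 32 gives 3.125%, which does not. The alert itself only fires once the condition has held for the 5 minutes set by "for: 5m".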

Impact

Data loss: Senders typically retry failed writes, but samples are dropped once retries are exhausted or the sender's buffer falls too far behind, leaving gaps in the stored metrics.
Increased latency: Retried and failed requests delay ingestion, so queries and alerting rules may run against stale data.
System instability: A sustained error rate adds retry traffic on top of normal load, which can worsen the condition that caused the errors.

Diagnosis

To diagnose the issue, follow these steps:

Check Thanos Receive logs: Review the logs of the affected Thanos Receive instance(s) to identify the root cause of the errors. Look for patterns in the error messages or request payloads.
Verify request patterns: Analyze the request patterns and payloads to determine if there are any unusual or malformed requests that may be contributing to the errors.
Investigate Thanos Receive configuration: Review the Thanos Receive configuration, in particular the hashring definition and replication factor, and confirm that every endpoint it lists is reachable.
Check underlying infrastructure: Verify that the underlying infrastructure, including servers, networks, and storage, is functioning correctly and not experiencing issues.
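
Before reading logs, it can help to see which instance and which status code account for the errors. The sketch below is one possible way to do that through the Prometheus HTTP API. It reuses the alert's selectors but groups by instance and code. The server address and the helper names are assumptions, not part of the original runbook.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of a Prometheus server that scrapes Thanos Receive.
PROMETHEUS_URL = "http://localhost:9090"

# Same selectors as the alert, grouped by instance and status code so the
# failing replica and the dominant error code stand out.
BREAKDOWN_QUERY = (
    "sum by (instance, code) ("
    'rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m])'
    ")"
)


def fetch_instant_query(base_url: str, query: str) -> dict:
    """Run an instant query against the Prometheus HTTP API (/api/v1/query)."""
    params = urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=10) as resp:
        return json.load(resp)


def summarize(payload: dict) -> list[tuple[str, str, float]]:
    """Return (instance, code, requests_per_second) rows.

    Rows with a 5xx code come first, each group sorted by descending rate.
    """
    rows = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # Instant-vector samples have the form [timestamp, "value-as-string"].
        rate = float(sample["value"][1])
        rows.append((labels.get("instance", ""), labels.get("code", ""), rate))
    return sorted(rows, key=lambda row: (not row[1].startswith("5"), -row[2]))


def print_breakdown(base_url: str = PROMETHEUS_URL) -> None:
    """Fetch the breakdown and print one line per (instance, code) pair."""
    for instance, code, rate in summarize(fetch_instant_query(base_url, BREAKDOWN_QUERY)):
        print(f"{instance:30} {code:>4} {rate:10.3f} req/s")
```

Calling print_breakdown() lists the 5xx rows first. If a single instance dominates, that points at that replica or its node. Errors spread evenly across instances suggest a shared cause such as configuration or storage.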

Mitigation

To mitigate the issue, follow these steps:

Restart Thanos Receive: Restart the affected Thanos Receive instance(s) to reset the request handling and clear any temporary errors.
Adjust Thanos Receive configuration: Review and adjust the Thanos Receive configuration to ensure it is optimized for the current request volume and patterns.
Investigate and fix underlying issues: Address any underlying infrastructure issues, such as server or network problems, that may be contributing to the errors.
Implement request rate limiting: Consider implementing request rate limiting or throttling to prevent excessive requests and reduce the load on Thanos Receive.
Monitor and review: Keep watching the error rate after each change, and confirm that it stays below 5% and the alert does not recur before closing the incident.
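
One way to make the final check concrete is to require the error percentage to stay under the threshold for a run of consecutive samples. The sketch below assumes the percentages have already been collected, oldest first (for example from a range query of the alert expression without the "> 5" filter). The function name and the default of 10 samples are hypothetical.

```python
def has_recovered(
    percentages: list[float],
    threshold: float = 5.0,   # same 5% threshold as the alert rule
    min_samples: int = 10,    # hypothetical: how many recent samples must be healthy
) -> bool:
    """True when the newest `min_samples` values are all at or below `threshold`."""
    if len(percentages) < min_samples:
        return False  # not enough history to judge
    # NaN compares false against any number, so a NaN sample counts as not recovered.
    return all(p <= threshold for p in percentages[-min_samples:])
```

A single spike inside the window resets the verdict, which is intentionally conservative: a restart often clears errors briefly even when the underlying cause remains.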