ThanosReceiveHighHashringFileRefreshFailures #
Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.
Alert Rule
alert: ThanosReceiveHighHashringFileRefreshFailures
annotations:
  description: |-
    Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-receiver/thanosreceivehighhashringfilerefreshfailures/
  summary: Thanos Receive High Hashring File Refresh Failures (instance {{ $labels.instance }})
expr: (sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m]))
  / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m]))
  > 0)
for: 15m
labels:
  severity: warning
Meaning #
The ThanosReceiveHighHashringFileRefreshFailures alert fires when, for a given job, the ratio of failed hashring file refreshes to total refresh attempts (each measured as a 5-minute rate) stays above 0 for 15 minutes. In other words, at least some refresh attempts have been failing continuously for 15 minutes. Thanos Receive is then unable to reliably reload its hashring file, so it may keep operating on an outdated hashring configuration, which can lead to data inconsistencies and affect the overall reliability of the system.
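The arithmetic behind the alert expression can be sketched in a few lines of Python. This is only an illustration, assuming you have two samples of each counter taken at the start and end of the same window; the function names are hypothetical, and unlike PromQL's rate() the sketch does not compensate for counter resets. The "for: 15m" clause is also not modelled.

```python
from typing import Optional


def refresh_failure_ratio(
    errors_start: float,
    errors_end: float,
    refreshes_start: float,
    refreshes_end: float,
) -> Optional[float]:
    """Approximate errors / refreshes over one window from two counter samples.

    Both counters are sampled over the same window, so the ratio of the two
    increases equals the ratio of the two per-second rates.
    """
    error_delta = errors_end - errors_start
    refresh_delta = refreshes_end - refreshes_start
    if refresh_delta <= 0:
        # No refresh attempts in the window (or a counter reset, which this
        # sketch does not handle): there is no meaningful ratio.
        return None
    return error_delta / refresh_delta


def would_fire(ratio: Optional[float]) -> bool:
    """Mirror the '> 0' comparison in the alert expression."""
    return ratio is not None and ratio > 0


# Example: 3 of 6 refresh attempts failed within the window.
ratio = refresh_failure_ratio(0, 3, 10, 16)
print(ratio, would_fire(ratio))
```

Because the threshold is 0, a single failed refresh in every 5-minute window is enough to keep the condition true.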
Impact #
- Data inconsistencies and potential data loss if receivers route or replicate writes using an outdated hashring
- Reduced reliability of the system, leading to potential outages or performance degradation
- Increased latency and errors in downstream systems that rely on Thanos Receive
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Thanos Receive logs for errors related to hashring file refreshes
- Verify that the Thanos Receive instance has the necessary permissions and access to the hashring file
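- If the hashring file lives on a mounted volume or network storage (for example a Kubernetes ConfigMap mount), check that the mount is present and current, and check connectivity and latency to any remote storage behind it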
- Investigate if there are any recent changes to the Thanos Receive configuration or hashring file format
- Review the Thanos Receive metrics to identify any patterns or trends in the failure rates
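The file-level checks above can be partly automated. The following is a minimal sketch, assuming the hashring file is a JSON array of objects that each carry a non-empty "endpoints" list; the function name check_hashring_file is hypothetical, and the checks are deliberately looser than whatever validation Thanos itself performs.

```python
import json
from pathlib import Path
from typing import List


def check_hashring_file(path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    file = Path(path)
    if not file.is_file():
        return [f"{path}: not found or not a regular file"]
    try:
        raw = file.read_text(encoding="utf-8")
    except OSError as exc:
        # Typically a permissions problem or a broken mount.
        return [f"{path}: cannot read ({exc})"]
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(config, list) or not config:
        return [f"{path}: expected a non-empty JSON array of hashrings"]

    problems: List[str] = []
    for index, ring in enumerate(config):
        if not isinstance(ring, dict):
            problems.append(f"hashring #{index}: expected a JSON object")
            continue
        endpoints = ring.get("endpoints")
        # Assumption: every hashring needs at least one endpoint.
        if not isinstance(endpoints, list) or not endpoints:
            problems.append(f"hashring #{index}: 'endpoints' missing or empty")
    return problems
```

Run it inside the Thanos Receive container (or against a copy of the file) as the same user the process runs as, so that permission errors surface the same way they would for Thanos Receive.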
Mitigation #
To mitigate the issue, follow these steps:
- Restart the Thanos Receive instance to attempt to recover from the failed hashring file refreshes
- Verify that the hashring file is up-to-date and accessible by Thanos Receive
- Check and update the Thanos Receive configuration to ensure it is correct and valid
- Implement temporary workarounds, such as increasing the hashring file refresh interval (the --receive.hashrings-file-refresh-interval flag) or reducing the load on Thanos Receive
- Schedule a maintenance window to perform a thorough investigation and resolution of the underlying issue
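After mitigation, it helps to confirm that the failure ratio has actually dropped to zero. The sketch below assumes you have already fetched the JSON body of a Prometheus instant query (the /api/v1/query endpoint) for the ratio in the alert expression; fetching is left out, and the names RATIO_QUERY and failing_jobs are hypothetical.

```python
import math
from typing import Any, Dict

# The ratio from the alert rule, without the trailing "> 0" comparison.
RATIO_QUERY = (
    'sum by (job) (rate(thanos_receive_hashrings_file_errors_total'
    '{job=~".*thanos-receive.*"}[5m])) / sum by (job) '
    '(rate(thanos_receive_hashrings_file_refreshes_total'
    '{job=~".*thanos-receive.*"}[5m]))'
)


def failing_jobs(response: Dict[str, Any]) -> Dict[str, float]:
    """Map job name to failure ratio for every job whose ratio is above 0."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    failing: Dict[str, float] = {}
    for sample in response["data"]["result"]:
        job = sample["metric"].get("job", "")
        # Instant-vector samples are [timestamp, "value-as-string"].
        ratio = float(sample["value"][1])
        # NaN (0 errors / 0 refreshes) is not a failure.
        if not math.isnan(ratio) and ratio > 0:
            failing[job] = ratio
    return failing
```

An empty result from failing_jobs for at least 15 minutes suggests the alert condition has cleared.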