ThanosBucketReplicateRunLatency #
Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.
Alert Rule
alert: ThanosBucketReplicateRunLatency
annotations:
  description: |-
    Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-bucket-replicate/thanosbucketreplicaterunlatency/
  summary: Thanos Bucket Replicate Run Latency (instance {{ $labels.instance }})
expr: (histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m])))
  > 20 and sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))
  > 0)
for: 5m
labels:
  severity: critical
Meaning #
The ThanosBucketReplicateRunLatency alert fires when the 99th percentile duration of Thanos replication runs, estimated from the thanos_replicate_replication_run_duration_seconds histogram over a 5-minute rate window, stays above 20 seconds for 5 minutes while runs are still being recorded. It indicates that the Thanos replicate job is taking unusually long to complete its runs, so the replicated bucket can fall behind the primary bucket.
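To make the threshold concrete, the sketch below shows how a 99th percentile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach behind `histogram_quantile`. It is a simplified illustration, not the Prometheus implementation: it assumes `0 < q <= 1`, and the bucket boundaries and counts are hypothetical, not the real buckets of `thanos_replicate_replication_run_duration_seconds`.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    Simplified sketch of linear interpolation inside the bucket that
    contains the target rank. Assumes 0 < q <= 1.
    """
    buckets = sorted(buckets)
    # A usable histogram needs at least one finite bucket plus +Inf.
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts of replication runs per duration bucket (seconds).
example_buckets = [(5, 50), (10, 80), (20, 97), (40, 100), (math.inf, 100)]
p99 = estimate_quantile(0.99, example_buckets)
alert_condition = p99 > 20  # mirrors the "> 20" threshold in the rule
```

With these made-up counts, 3 of 100 runs took longer than 20 seconds, which is enough to push the estimated p99 into the 20 to 40 second bucket and satisfy the threshold. The estimate is only as precise as the bucket boundaries allow.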
Impact #
Slow replication runs are not failures in themselves, but sustained high latency can:
- Leave the replicated bucket lagging behind the primary bucket, so the two are temporarily inconsistent
- Lead to slower or incomplete query results for end-users who read from the replicated bucket
- Result in replicate operations that time out or fail, which increases the risk of data loss if the primary bucket becomes unavailable before the replica catches up
- Potentially delay or disrupt dependent systems that expect the replicated bucket to be up to date
Diagnosis #
To diagnose the root cause of this alert, follow these steps:
- Check the Thanos replicate job logs for errors or warnings that may indicate the cause of the high latency.
- Verify that the Thanos replicate job is running with the expected configuration and resources.
- Check the underlying storage system for signs of congestion or high latency.
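- Review thanos_replicate_replication_run_duration_seconds and related Thanos metrics to see when the latency started rising and whether it correlates with a deployment, a configuration change, or a change in data volume.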
- Check the network connectivity and bandwidth between the primary and replicated buckets.
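The metric review above can be started with a few PromQL queries against the Prometheus HTTP API. The following is a minimal sketch that only builds the query strings and instant-query URLs. The helper names and the base URL are hypothetical, and the job selector is copied from the alert rule. The quantile query groups by `le` as well as `job`, because `histogram_quantile` needs the bucket boundary label to compute a result.

```python
from urllib.parse import urlencode

# Metric name and job selector taken from the alert rule.
METRIC = "thanos_replicate_replication_run_duration_seconds"
JOB_SELECTOR = 'job=~".*thanos-bucket-replicate.*"'


def diagnostic_queries(window="5m"):
    """Return PromQL queries that help narrow down slow replication runs."""
    bucket = f"rate({METRIC}_bucket{{{JOB_SELECTOR}}}[{window}])"
    total = f"rate({METRIC}_sum{{{JOB_SELECTOR}}}[{window}])"
    count = f"rate({METRIC}_count{{{JOB_SELECTOR}}}[{window}])"
    return {
        # 99th percentile run duration per job.
        "p99_seconds": f"histogram_quantile(0.99, sum by (job, le) ({bucket}))",
        # Mean run duration per job, to compare against the p99.
        "mean_seconds": f"sum by (job) ({total}) / sum by (job) ({count})",
        # How often replication runs complete.
        "runs_per_second": f"sum by (job) ({count})",
    }


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with the real one.
    for name, promql in diagnostic_queries().items():
        print(name, query_url("http://prometheus.example:9090", promql))
```

A p99 far above the mean suggests a few slow outlier runs, while a mean that has risen as well points to a general slowdown such as storage or network congestion.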
Mitigation #
To mitigate the impact of this alert, follow these steps:
- Immediately investigate and resolve any underlying issues causing the high latency, such as storage congestion or network connectivity issues.
- Consider scaling up the resources allocated to the Thanos replicate job to improve its performance.
- Review the Thanos replicate configuration and tune it to reduce run latency.
- Implement additional monitoring and alerting to detect similar issues earlier.
- Consider implementing redundancy and failover mechanisms to ensure data availability and consistency in case of replicate job failures.