ThanosStoreBucketHighOperationFailures

ThanosStoreBucketHighOperationFailures #

Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.

Alert Rule
alert: ThanosStoreBucketHighOperationFailures
annotations:
  description: |-
    Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/thanos-store/thanosstorebuckethighoperationfailures/
  summary: Thanos Store Bucket High Operation Failures (instance {{ $labels.instance
    }})
expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m]))
  / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m]))
  * 100 > 5)
for: 15m
labels:
  severity: warning

Here is the runbook for the Prometheus alert rule:

Meaning #

The ThanosStoreBucketHighOperationFailures alert is triggered when the percentage of failed operations in a Thanos Store bucket exceeds 5% over a 5-minute window. This indicates that the bucket is experiencing an unusual number of failures, which may impact the availability and reliability of the system.

Impact #

  • High operation failures in a Thanos Store bucket can lead to data loss, corruption, or inconsistency.
  • This may cause issues with downstream systems that rely on the data stored in the bucket.
  • Prolonged failures can lead to a buildup of undelivered data, causing further problems when the bucket recovers.

Diagnosis #

  • Check the Thanos Store bucket logs for error messages indicating the cause of the failures.
  • Investigate the network connectivity and storage system health to rule out infrastructure-related issues.
  • Verify that the bucket is properly configured and that the storage capacity is sufficient.
  • Check the system load and resource utilization to identify any resource constraints.

Mitigation #

  • Investigate and address the root cause of the operation failures, such as network issues, storage system problems, or configuration errors.
  • Temporarily increase the storage capacity or add more resources to alleviate any resource constraints.
  • Implement retries or fall-back mechanisms to minimize the impact of failures on downstream systems.
  • Consider scaling up or out the Thanos Store cluster to improve its resilience and availability.

Note: This runbook is a general guideline and may need to be tailored to your specific environment and use case.