SidekiqSchedulingLatencyTooHigh

SidekiqSchedulingLatencyTooHigh #

Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.

Alert Rule
alert: SidekiqSchedulingLatencyTooHigh
annotations:
  description: |-
    Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/strech-sidekiq-exporter/sidekiqschedulinglatencytoohigh/
  summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})
expr: max(sidekiq_queue_latency) > 60
for: 0m
labels:
  severity: critical

Meaning #

The SidekiqSchedulingLatencyTooHigh alert is triggered when the maximum sidekiq queue latency exceeds 60 seconds. This indicates that Sidekiq jobs are taking more than 1 minute to be picked up, which can result in delays in background processing.

Impact #

  • Users may experience delays in background processing, leading to a degraded user experience.
  • Critical business processes may be affected, causing revenue loss or other operational issues.
  • The high latency can also lead to job processing failures, causing data inconsistencies and further cascading failures.

Diagnosis #

  • Check the Sidekiq queue latency metrics in Prometheus to identify the specific queue(s) experiencing high latency.
  • Investigate the root cause of the high latency, such as:
    • High CPU usage or memory pressure on the Sidekiq node.
    • Network connectivity issues or high latency between nodes.
    • Too many pending jobs in the queue, leading to congestion.
    • Misconfigured Sidekiq settings or worker pool size.
  • Review the Sidekiq logs for any errors or exceptions that may indicate the cause of the high latency.

Mitigation #

  • Immediately investigate and address the root cause of the high latency to minimize the impact on users and business processes.
  • Consider increasing the Sidekiq worker pool size to process jobs more efficiently.
  • Optimize Sidekiq settings, such as the concurrency or timeout values, to improve job processing performance.
  • Implement load balancing or queue sharding to distribute the job processing load and reduce latency.
  • Consider implementing a circuit breaker or other resilience mechanisms to prevent cascading failures.
  • Monitor the Sidekiq queue latency metrics closely to ensure the mitigation steps are effective and make adjustments as needed.