SidekiqSchedulingLatencyTooHigh

Sidekiq jobs are waiting more than 1 minute to be picked up. Users may see delays in background processing.

Alert Rule

alert: SidekiqSchedulingLatencyTooHigh
annotations:
  description: |-
    Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/strech-sidekiq-exporter/sidekiqschedulinglatencytoohigh/
  summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})
expr: max(sidekiq_queue_latency) > 60
for: 0m
labels:
  severity: critical

Meaning

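The SidekiqSchedulingLatencyTooHigh alert fires when the highest sidekiq_queue_latency value across all queues exceeds 60 seconds. Queue latency is how long the oldest job in a queue has been waiting, so the alert means that at least one queue holds a job that has waited more than 1 minute to be picked up, which delays background processing. Because the rule uses for: 0m, a single evaluation above the threshold is enough to fire it; there is no pending period.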
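The check the rule performs can be modelled in a few lines. This is a minimal Python sketch, assuming the exporter's sidekiq_queue_latency reports the age of the oldest enqueued job in each queue (as Sidekiq's own per-queue latency does) and representing each queue by the enqueue timestamps of its waiting jobs. queue_latency and alert_fires are hypothetical helpers, not part of Sidekiq or the exporter.

```python
import time
from typing import Dict, List, Optional


def queue_latency(enqueued_at: List[float], now: float) -> float:
    """Seconds the oldest job in one queue has been waiting (0.0 if the queue is empty)."""
    if not enqueued_at:
        return 0.0
    return now - min(enqueued_at)


def alert_fires(queues: Dict[str, List[float]], now: Optional[float] = None,
                threshold: float = 60.0) -> bool:
    """Model of max(sidekiq_queue_latency) > 60: the single worst queue decides."""
    if now is None:
        now = time.time()
    if not queues:
        # No series at all: max() returns an empty result, so nothing fires.
        return False
    worst = max(queue_latency(stamps, now) for stamps in queues.values())
    return worst > threshold
```

Only the worst queue matters: a healthy queue does not offset a stalled one, and the aggregated result no longer says which queue crossed the threshold.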

Impact

Users may experience delays in background processing, leading to a degraded user experience.
Critical business processes may be affected, causing revenue loss or other operational issues.
If latency stays high, it can also lead to job processing failures, which may leave data inconsistent and trigger further cascading failures.

Diagnosis

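Check the Sidekiq queue latency metrics in Prometheus to identify the specific queue(s) experiencing high latency. The alert expression aggregates with max() and no by clause, so the firing alert carries no queue or instance labels and {{ $labels.instance }} in the summary has no value; query sidekiq_queue_latency directly to see the per-queue values.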
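To find the offending queues from a script, the Prometheus HTTP API can be queried with GET /api/v1/query?query=sidekiq_queue_latency, which returns an instant vector as JSON. The sketch below assumes that response has already been fetched and decoded into a dict. slow_queues is a hypothetical helper; it returns each series' full label set rather than assuming a particular label name for the queue.

```python
from typing import Any, Dict, List, Tuple


def slow_queues(response: Dict[str, Any], threshold: float = 60.0
                ) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, latency_seconds) for each series above threshold, worst first.

    `response` is the decoded JSON body of an instant query such as
    GET /api/v1/query?query=sidekiq_queue_latency
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data.get("resultType") != "vector":
        raise ValueError(f"expected an instant vector, got {data.get('resultType')!r}")
    hits: List[Tuple[Dict[str, str], float]] = []
    for series in data["result"]:
        _timestamp, raw_value = series["value"]
        latency = float(raw_value)  # sample values arrive as strings, e.g. "95"
        if latency > threshold:
            hits.append((series["metric"], latency))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits
```

Sorting worst-first puts the queue most likely to explain the alert at the top of the output.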
Investigate common root causes of high latency:
- High CPU usage or memory pressure on the hosts running Sidekiq.
- Network connectivity problems or high latency between Sidekiq and Redis.
- More jobs being enqueued than the workers can process, so a backlog builds up in the queue.
- Misconfigured Sidekiq settings, such as a concurrency or number of Sidekiq processes that is too low for the workload.
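Whether a backlog comes from too little capacity can be checked with simple arithmetic: if jobs arrive faster than the workers can finish them, latency keeps growing no matter how long you wait. A rough Python sketch, assuming jobs of roughly uniform duration and that total concurrency is the number of Sidekiq processes multiplied by threads per process; enqueue_rate, avg_job_seconds and backlog are hypothetical inputs you would read from your own metrics.

```python
import math


def utilization(enqueue_rate: float, avg_job_seconds: float, concurrency: int) -> float:
    """Share of total worker threads needed just to keep up; >= 1.0 means the backlog grows."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return enqueue_rate * avg_job_seconds / concurrency


def seconds_to_drain(backlog: int, enqueue_rate: float, avg_job_seconds: float,
                     concurrency: int) -> float:
    """Time to clear the current backlog if rates stay constant (math.inf if it never clears)."""
    if avg_job_seconds <= 0:
        raise ValueError("avg_job_seconds must be positive")
    throughput = concurrency / avg_job_seconds  # jobs/second with every thread busy
    surplus = throughput - enqueue_rate         # jobs/second left over for the backlog
    if surplus <= 0:
        return math.inf
    return backlog / surplus
```

For example, 50 jobs/s at 0.5 s each needs 25 busy threads, so 20 threads (utilization 1.25) can never catch up, while 30 threads clear a 3000-job backlog in about 300 seconds.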
Review the Sidekiq logs for any errors or exceptions that may indicate the cause of the high latency.

Mitigation

Immediately investigate and address the root cause of the high latency to minimize the impact on users and business processes.
Consider increasing Sidekiq's capacity, either by raising concurrency (threads per process) or by running more Sidekiq processes, so that more jobs are processed in parallel.
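To size a capacity increase, you can work backwards from a recovery target: total throughput must cover new arrivals plus the backlog spread over the time you allow for recovery. A minimal sketch under the same uniform-duration assumption; required_concurrency, processes_needed and their parameters are hypothetical.

```python
import math


def required_concurrency(backlog: int, enqueue_rate: float, avg_job_seconds: float,
                         target_seconds: float) -> int:
    """Total threads needed to absorb new jobs and clear the backlog within target_seconds."""
    if target_seconds <= 0 or avg_job_seconds <= 0:
        raise ValueError("target_seconds and avg_job_seconds must be positive")
    needed_throughput = enqueue_rate + backlog / target_seconds  # jobs/second
    return math.ceil(needed_throughput * avg_job_seconds)


def processes_needed(total_threads: int, threads_per_process: int) -> int:
    """Sidekiq processes needed if each runs threads_per_process threads."""
    if threads_per_process <= 0:
        raise ValueError("threads_per_process must be positive")
    return math.ceil(total_threads / threads_per_process)
```

Treat the result as a lower bound. Extra threads help only when jobs spend their time waiting on I/O rather than using CPU, since MRI Ruby's global VM lock lets only one thread per process run Ruby code at a time, and added workers can shift the bottleneck to Redis or the database.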
Tune Sidekiq settings such as concurrency and queue priorities, and shorten the timeouts on slow external calls made by jobs, so that hung jobs do not hold worker threads.
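Spread the load by sharding work across multiple queues, or by moving slow or high-volume jobs to dedicated queues served by separate Sidekiq processes, so that one busy queue does not delay the others.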
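One way to shard is to route each job to one of N queues by hashing a stable key, such as a tenant or account ID, so that a burst from one key fills only one shard. A hypothetical Python sketch of that routing; the queue names, the shard_queue helper and the choice of CRC32 are assumptions, and each shard queue still needs Sidekiq processes configured to consume it.

```python
import zlib


def shard_queue(shard_key: str, base_name: str, shard_count: int) -> str:
    """Deterministically map a key (e.g. a tenant ID) to one of shard_count queues."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # CRC32 gives the same bucket for the same key on every host and every run.
    bucket = zlib.crc32(shard_key.encode("utf-8")) % shard_count
    return f"{base_name}_{bucket}"
```

CRC32 is used instead of Python's built-in hash() because string hashing is randomized per process, which would send the same key to different queues from different hosts.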
Consider implementing a circuit breaker or other resilience mechanisms so that jobs fail fast when a downstream dependency is unhealthy, instead of tying up workers and causing cascading failures.
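A circuit breaker stops calling a dependency after repeated failures and fails fast for a cooldown period, so worker threads are not held waiting on timeouts. The following is a language-neutral sketch in Python of the usual closed/open/half-open pattern; in a Sidekiq application the equivalent logic would wrap the downstream call inside the job. CircuitBreaker, CircuitOpenError, the thresholds and the injectable clock are illustrative assumptions, and this version is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Minimal closed/open/half-open breaker around calls to a flaky dependency."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cooldown over: allow a trial call
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency unavailable; failing fast")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        # Any success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, calls fail immediately with CircuitOpenError; letting that exception propagate hands the job back to Sidekiq's retry handling instead of holding a worker thread.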
Monitor the Sidekiq queue latency metrics closely to ensure the mitigation steps are effective and make adjustments as needed.