SidekiqSchedulingLatencyTooHigh #
Sidekiq jobs are taking more than 1 minute to be picked up, so users may be seeing delays in background processing.
Alert Rule
alert: SidekiqSchedulingLatencyTooHigh
annotations:
  description: |-
    Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/strech-sidekiq-exporter/sidekiqschedulinglatencytoohigh/
  summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})
expr: max(sidekiq_queue_latency) > 60
for: 0m
labels:
  severity: critical
Meaning #
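The SidekiqSchedulingLatencyTooHigh alert fires when the highest value of sidekiq_queue_latency across all queues exceeds 60 seconds. Queue latency is how long the oldest job in a queue has been waiting, so at least one queue holds a job that has waited more than 1 minute to be picked up. Because the rule uses for: 0m, it fires on the first evaluation that crosses the threshold, and because max() aggregates away the labels, the alert itself does not say which queue is affected.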
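As a rough illustration of the condition described above, the sketch below mimics the comparison the rule performs: it takes one latency sample per queue and checks whether the largest one exceeds 60 seconds. The queue names, sample values, and the function name `alert_fires` are made up for this example; Prometheus evaluates the real expression, not this code.

```python
THRESHOLD_SECONDS = 60  # mirrors the `> 60` in the alert expression


def alert_fires(latency_by_queue, threshold=THRESHOLD_SECONDS):
    """Mimic `max(sidekiq_queue_latency) > 60` over one sample per queue."""
    if not latency_by_queue:
        return False  # no series means the expression returns nothing
    return max(latency_by_queue.values()) > threshold


# Hypothetical samples: only "mailers" is slow, yet the alert still fires.
samples = {"default": 2.5, "mailers": 95.0, "low": 0.0}
print(alert_fires(samples))
```

A single slow queue is enough to trigger the alert, even when every other queue is healthy.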
Impact #
- Users may experience delays in background processing, leading to a degraded user experience.
- Critical business processes may be affected, causing revenue loss or other operational issues.
- Time-sensitive jobs may run too late to be useful or fail downstream, which can cause data inconsistencies and cascading failures.
Diagnosis #
- Query sidekiq_queue_latency in Prometheus without the max() aggregation to see per-queue values and identify the specific queue(s) experiencing high latency.
- Investigate the root cause of the high latency, such as:
  - High CPU usage or memory pressure on the Sidekiq node.
  - Network connectivity issues or high latency between nodes.
  - Too many pending jobs in the queue, leading to congestion.
  - Misconfigured Sidekiq settings or worker pool size.
- Review the Sidekiq logs for any errors or exceptions that may indicate the cause of the high latency.
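The first diagnosis step can be scripted. The sketch below assumes a Prometheus server reachable over its HTTP API and ranks queues by latency from an instant query on `sidekiq_queue_latency`. The queue label name (`name`), the helper names, and the sample payload are assumptions for illustration; check which label your exporter actually attaches to the metric.

```python
import json
import urllib.parse
import urllib.request


def rank_queues(api_response, queue_label="name"):
    """Return (queue, latency_seconds) pairs, slowest first.

    `queue_label` is an assumption; adjust it to your exporter's label.
    """
    ranked = []
    for series in api_response["data"]["result"]:
        queue = series["metric"].get(queue_label, "<unknown>")
        # Instant-vector samples arrive as [unix_timestamp, "value-as-string"].
        latency = float(series["value"][1])
        ranked.append((queue, latency))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def fetch_queue_latency(base_url, query="sidekiq_queue_latency"):
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Hand-written sample in the shape of an instant-query response.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"name": "default"}, "value": [1700000000, "3.2"]},
            {"metric": {"name": "mailers"}, "value": [1700000000, "95"]},
        ],
    },
}
for queue, latency in rank_queues(sample):
    print(f"{queue}: {latency:.1f}s")
```

Against a live server, `rank_queues(fetch_queue_latency("http://localhost:9090"))` would produce the same kind of listing, where the URL is a placeholder for your Prometheus address.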
Mitigation #
- Immediately investigate and address the root cause of the high latency to minimize the impact on users and business processes.
- Consider increasing the Sidekiq worker pool size (more processes or more threads) so that queued jobs are picked up sooner.
- Optimize Sidekiq settings, such as the concurrency or timeout values, to improve job processing performance.
- Implement load balancing or queue sharding to distribute the job processing load and reduce latency.
- Consider implementing a circuit breaker or other resilience mechanisms to prevent cascading failures.
- Monitor the Sidekiq queue latency metrics closely to ensure the mitigation steps are effective and make adjustments as needed.
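When deciding how far to grow the worker pool, a back-of-the-envelope estimate can help. The sketch below is a simplified model, not a Sidekiq API: it assumes jobs are independent, threads stay fully busy, and nothing else (Redis, the database, external services) is the bottleneck. The function name and all input numbers are hypothetical.

```python
import math


def threads_needed(arrival_rate, avg_job_seconds, backlog, drain_seconds):
    """Estimate threads required to keep up and clear `backlog` in time.

    arrival_rate: jobs enqueued per second
    avg_job_seconds: average execution time of one job
    backlog: jobs currently waiting in the queue
    drain_seconds: how quickly the backlog should be cleared
    """
    # Throughput needed = incoming jobs plus the backlog share per second.
    required_rate = arrival_rate + backlog / drain_seconds
    # Each thread completes about 1 / avg_job_seconds jobs per second.
    return math.ceil(required_rate * avg_job_seconds)


# Hypothetical numbers: 20 jobs/s arriving, 0.5 s per job,
# 3000 jobs waiting, backlog to be cleared within 5 minutes.
print(threads_needed(20, 0.5, 3000, 300))  # prints 15
```

Treat the result as a starting point and confirm it against the latency metrics after the change, since real jobs rarely have uniform durations.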