CortexFrontendQueriesStuck #

There are queued up queries in query-frontend.

Alert Rule

alert: CortexFrontendQueriesStuck
annotations:
  description: |-
    There are queued up queries in query-frontend.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/cortex-internal/cortexfrontendqueriesstuck/
  summary: Cortex frontend queries stuck (job {{ $labels.job }})
expr: sum by (job) (cortex_query_frontend_queue_length) > 0
for: 5m
labels:
  severity: critical

Meaning #

The “CortexFrontendQueriesStuck” alert fires when the sum of cortex_query_frontend_queue_length for a job stays above 0 for 5 minutes. Queries are then waiting in the Cortex query-frontend queue instead of being picked up by queriers, which usually means the queriers cannot keep up with the incoming query load. Queued queries return late or time out, so the read path slows down for everything that depends on it.
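To make the rule's semantics concrete, here is a minimal Python sketch of how the expr and the for: 5m clause combine. It is a simplified model, not Prometheus code: the function names and the sample tuples are made up for illustration, and it assumes one evaluation per listed timestamp.

```python
from collections import defaultdict


def sum_by_job(samples):
    """Mimic `sum by (job) (...)`: samples are (job, instance, value) tuples."""
    totals = defaultdict(float)
    for job, _instance, value in samples:
        totals[job] += value
    return dict(totals)


def alert_state(evaluations, for_seconds=300):
    """Return 'inactive', 'pending' or 'firing' as of the last evaluation.

    evaluations: time-ordered (timestamp_seconds, summed_queue_length) pairs
    for a single job. The alert becomes pending when the sum is > 0 and
    fires once that has held continuously for `for_seconds`.
    """
    active_since = None
    state = "inactive"
    for ts, total in evaluations:
        if total > 0:
            if active_since is None:
                active_since = ts  # condition just became true
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            active_since = None  # any empty-queue evaluation resets the timer
            state = "inactive"
    return state
```

A queue that drains even once within the 5 minute window resets the timer, so the alert only fires for a backlog that never fully clears.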

Impact #

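The impact of this alert is high for the read path. While queries sit in the queue:

Queries are delayed or time out, leading to poor user experience in dashboards and ad hoc queries
Clients that alert on query results through the query-frontend may see late or missing results
Stored data is not lost or altered, and ingestion runs on a separate path, but timed-out queries can leave dashboards with gaps or stale values
Queued and retried queries add load on the read path, which can lead to resource exhaustion on the query-frontend and queriers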

Diagnosis #

To diagnose the issue, follow these steps:

Check the Cortex query-frontend logs for errors, timeouts, or unusual behavior
Verify that the query-frontend is properly configured and running, and that the queriers are running and connected to it (for example, not crash-looping or being OOM-killed)
Check the resource utilization (CPU, memory, disk space) of the query-frontend and the queriers to ensure it is within acceptable limits
Investigate any recent changes to the system or configuration that may have caused the issue
Review cortex_query_frontend_queue_length over time to see when the queue started growing, and whether query volume or query latency rose at the same time
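The queue-metric check can be scripted against the Prometheus HTTP API. The sketch below runs the alert's own expression without the > 0 threshold, so jobs with an empty queue are visible too. The helper names and the base URL are assumptions for illustration; only the /api/v1/query endpoint and its response shape come from the Prometheus API.

```python
import json
import urllib.parse
import urllib.request

# The alert expression without "> 0", so empty queues are reported as well.
QUEUE_LENGTH_BY_JOB = "sum by (job) (cortex_query_frontend_queue_length)"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_vector(payload):
    """Turn an instant-vector API response into {job: value}."""
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error", "unknown error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got %s" % data["resultType"])
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    return {s["metric"].get("job", ""): float(s["value"][1]) for s in data["result"]}


def queue_length_by_job(base_url, timeout=10):
    """Fetch the current queue length per job (performs a network call)."""
    url = build_query_url(base_url, QUEUE_LENGTH_BY_JOB)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

For example, queue_length_by_job("http://localhost:9090") would return a dict of job name to queue length, assuming a Prometheus server that scrapes Cortex is reachable at that address.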

Mitigation #

To mitigate the issue, follow these steps:

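Restart the Cortex query-frontend service to clear any stuck queries, keeping in mind that queued queries fail for their clients and that a restart does not remove the cause
Investigate and address any underlying issues causing the query-frontend to become stuck (e.g. resource utilization, configuration errors, unhealthy queriers)
Consider scaling up the Cortex cluster, in particular the number or size of the queriers, to handle increased load
Use the query-frontend's query splitting and results caching to reduce the work sent to the queriers
Consider implementing query timeouts so slow queries are cancelled instead of holding capacity, and keep retries bounded so they do not add to the backlog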
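When scaling is the chosen mitigation, a rough starting point for the querier count can be estimated with Little's law (queries in flight = arrival rate × average duration). This is a back-of-the-envelope sketch, not a Cortex sizing formula: the function name, the headroom factor, and the example numbers are assumptions, and the per-querier concurrency should be taken from your own configuration (for example the querier's max-concurrent setting).

```python
import math


def queriers_needed(queries_per_second, avg_query_seconds,
                    max_concurrent_per_querier, headroom=1.5):
    """Estimate how many queriers are needed to keep the queue empty.

    Little's law gives the average number of queries in flight; the headroom
    factor (an assumed safety margin) leaves room for bursts.
    """
    if max_concurrent_per_querier <= 0:
        raise ValueError("max_concurrent_per_querier must be positive")
    in_flight = queries_per_second * avg_query_seconds
    return max(1, math.ceil(in_flight * headroom / max_concurrent_per_querier))
```

With 50 queries per second averaging 2 seconds each and 20 concurrent queries per querier, queriers_needed(50, 2.0, 20) gives 8. Because the query-frontend can split one user query into several sub-queries, the rate to plug in is the rate of queries reaching the queue, not the rate of user requests.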

Remember to follow established change management procedures when making any changes to the system.