CortexFrontendQueriesStuck #
There are queued up queries in query-frontend.
Alert Rule
alert: CortexFrontendQueriesStuck
annotations:
  description: |-
    There are queued up queries in query-frontend.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/cortex-internal/cortexfrontendqueriesstuck/
  summary: Cortex frontend queries stuck (instance {{ $labels.instance }})
expr: sum by (job) (cortex_query_frontend_queue_length) > 0
for: 5m
labels:
  severity: critical
Meaning #
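The “CortexFrontendQueriesStuck” alert fires when the total query-frontend queue length for a job (cortex_query_frontend_queue_length, summed by job) stays above 0 for 5 minutes. In other words, queries have been waiting in the Cortex query-frontend queue for the whole period and the queue never fully drained. This typically means queries are arriving faster than the queriers pick them up, so query execution is delayed and queries may time out.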
Impact #
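The impact of this alert is high. The query-frontend sits on the read path, so the effects are on queries rather than on ingestion or stored data:
- Delays in query execution, leading to poor user experience
- Query timeouts or errors for clients such as dashboards once the time spent waiting in the queue exceeds their timeout
- Incomplete or stale results shown to users while queries fail, even though the underlying data is intact
- Increased load on the system as queries pile up and clients retry, leading to potential resource exhaustion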
Diagnosis #
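To diagnose the issue, follow these steps:
- Check the Cortex query-frontend logs for errors, timeouts, or unusual behavior
- Verify that the query-frontend is running and correctly configured, and that queriers are running and connected to it; queriers pull queries from the frontend queue, so the queue only drains while queriers are connected and have free capacity
- Check resource utilization (CPU, memory, disk space) of the query-frontend and queriers to ensure it is within acceptable limits
- Investigate any recent changes to the system or configuration (deployments, scaling events, new dashboards or rules) that coincide with the start of the alert
- Review the query-frontend queue metrics to see how long the queue has been non-empty, whether it is growing, and which instances hold the queued queries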
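The queue-metric review in the last step can be sketched with a few ad-hoc PromQL queries. These are illustrative only: cortex_query_frontend_queue_length comes from the alert rule, while the time windows are arbitrary choices and the user label in the last query is an assumption that may not exist in every Cortex version. Note that the alert expression aggregates by job, so the instance label is dropped from the alert; the first query restores the per-instance view.

```promql
# Queue length per query-frontend instance (the alert only shows the per-job sum).
sum by (job, instance) (cortex_query_frontend_queue_length)

# Peak queue length per job over the last hour, sampled every minute.
max_over_time((sum by (job) (cortex_query_frontend_queue_length))[1h:1m])

# Trend over the last 15 minutes: a positive value means the queue is still growing.
deriv((sum by (job) (cortex_query_frontend_queue_length))[15m:1m])

# ASSUMPTION: the metric carries a per-tenant "user" label.
# If it does, this shows which tenants have the most queued queries.
topk(5, sum by (user) (cortex_query_frontend_queue_length))
```

A queue that is short but steady points toward too little querier capacity, while a queue that keeps growing suggests queriers that are disconnected, crashed, or blocked on a slow dependency.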
Mitigation #
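To mitigate the issue, follow these steps:
- Investigate and address the underlying cause of the backlog first (e.g. queriers down or disconnected, resource exhaustion, configuration errors)
- Consider scaling up the Cortex cluster, in particular the queriers, to handle increased query load
- Reduce query load where possible, for example by limiting very expensive or very frequent queries; the query-frontend already queues queries, so adding another queue in front of it does not help
- Make sure query timeouts are in place so that abandoned queries leave the queue, and use retries sparingly because each retry adds load to an already backed-up queue
- As a last resort, restart the Cortex query-frontend service to clear stuck queries; this drops the queries that are currently queued, and the backlog will return if the underlying cause is not fixed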
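After applying a mitigation, it helps to confirm that the queue has drained and stays drained rather than relying on a single empty reading. A minimal sketch, using the metric from the alert rule and an arbitrarily chosen 10-minute window:

```promql
# Returns one series per job only if the summed queue length
# was 0 at every 1-minute sample over the last 10 minutes.
max_over_time((sum by (job) (cortex_query_frontend_queue_length))[10m:1m]) == 0
```

If a job is missing from the result, its queue was non-empty at some point in the window and the alert is likely to fire again.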
Remember to follow established change management procedures when making any changes to the system.