NomadJobQueued #
Nomad job queued
Alert Rule
alert: NomadJobQueued
annotations:
  description: |-
    Nomad job queued
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/nomad-internal/nomadjobqueued/
  summary: Nomad job queued (instance {{ $labels.instance }})
expr: nomad_nomad_job_summary_queued > 0
for: 2m
labels:
  severity: warning
Meaning #
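The NomadJobQueued alert fires when nomad_nomad_job_summary_queued has stayed above 0 for 2 minutes, meaning a Nomad job has allocations that are queued and not yet running. This usually indicates that the job cannot start because of resource constraints, configuration issues, or other scheduling problems.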
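As a rough illustration of what the alert expression measures, the sketch below takes a job summary in the shape returned by Nomad's `GET /v1/job/<job_id>/summary` endpoint and lists the task groups whose `Queued` count is above zero. The sample payload, the job and task group names, and the helper name `queued_task_groups` are made up for illustration; this is not how Nomad or Prometheus compute the metric internally.

```python
import json

# Hypothetical sample in the shape of Nomad's job summary API response.
SAMPLE_SUMMARY = json.loads("""
{
  "JobID": "example",
  "Namespace": "default",
  "Summary": {
    "web":   {"Queued": 2, "Complete": 0, "Failed": 0, "Running": 1, "Starting": 0, "Lost": 0},
    "cache": {"Queued": 0, "Complete": 0, "Failed": 0, "Running": 1, "Starting": 0, "Lost": 0}
  }
}
""")


def queued_task_groups(summary):
    """Return {task_group: queued_count} for groups with queued allocations."""
    groups = summary.get("Summary") or {}
    return {
        name: counts.get("Queued", 0)
        for name, counts in groups.items()
        if counts.get("Queued", 0) > 0
    }


if __name__ == "__main__":
    # Print the task groups that would keep the alert condition true.
    print(queued_task_groups(SAMPLE_SUMMARY))
```

A non-empty result for a job corresponds to the condition the alert watches for; an empty dictionary means nothing is queued for that job.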
Impact #
While allocations remain queued, the job runs below its desired count or not at all, which can lead to:
- Delays in task execution
- Increased latency
- Potential data loss or inconsistencies
- Impact on dependent services or applications
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Nomad job configuration to ensure it is correct and up-to-date.
- Verify that the cluster has enough free resources (e.g., CPU, memory, network) for the job to run.
- Investigate any recent changes to the Nomad cluster or job configuration that may be causing the issue.
- Check the Nomad logs for any errors or warnings related to the job.
- Verify that the job is not stuck in a failed or paused state.
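The resource and configuration checks above can be partly automated by reading the job's evaluations. The sketch below assumes the evaluation list has the shape returned by Nomad's `GET /v1/job/<job_id>/evaluations` endpoint, where `FailedTGAllocs` maps each task group to placement metrics such as `DimensionExhausted` and `ConstraintFiltered`. The sample data and the helper name `placement_failures` are hypothetical, and real responses carry many more fields.

```python
# Hypothetical evaluation list in the shape of Nomad's job evaluations
# API response (fields trimmed for brevity).
SAMPLE_EVALS = [
    {
        "ID": "eval-1",
        "Status": "complete",
        "FailedTGAllocs": {
            "web": {
                "NodesEvaluated": 3,
                "NodesFiltered": 1,
                "NodesExhausted": 2,
                "ConstraintFiltered": {"${attr.kernel.name} = linux": 1},
                "DimensionExhausted": {"memory": 2},
            }
        },
    },
    # An evaluation that placed everything has no failed task groups.
    {"ID": "eval-2", "Status": "complete", "FailedTGAllocs": None},
]


def placement_failures(evals):
    """Return one readable line per task group that could not be placed."""
    lines = []
    for ev in evals:
        failed = ev.get("FailedTGAllocs") or {}
        for group, metric in failed.items():
            reasons = []
            # Resource dimensions (such as memory) that ran out on nodes.
            for dim, count in (metric.get("DimensionExhausted") or {}).items():
                reasons.append(f"{count} node(s) exhausted on {dim}")
            # Job constraints that ruled nodes out before resource checks.
            for constraint, count in (metric.get("ConstraintFiltered") or {}).items():
                reasons.append(f"{count} node(s) filtered by constraint {constraint!r}")
            if not reasons:
                reasons.append("no reason recorded")
            lines.append(f"{ev.get('ID')} / {group}: " + "; ".join(reasons))
    return lines


if __name__ == "__main__":
    for line in placement_failures(SAMPLE_EVALS):
        print(line)
```

Exhausted dimensions point toward a capacity problem, while constraint filtering points toward the job configuration, which helps decide which mitigation step to try first.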
Mitigation #
To mitigate the issue, follow these steps:
- Correct the Nomad job configuration if it is wrong or out of date.
- If the required resources are not available, allocate additional resources so the job can run.
- Roll back or correct any recent changes to the Nomad cluster or job configuration that may be causing the issue.
- Restart the Nomad job or update the job configuration to allow it to run successfully.
- Monitor the job to confirm that it is running and that no further issues occur.
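One way to confirm that the mitigation worked is to re-run the alert expression as an instant query against Prometheus and check that it returns no series. The sketch below parses a response in the format of the Prometheus HTTP API (`/api/v1/query`). The Prometheus address passed as `base_url` and the helper names `still_queued` and `fetch_alert_series` are assumptions for illustration only.

```python
import json
import urllib.parse
import urllib.request

# Expression copied from the alert rule.
ALERT_EXPR = "nomad_nomad_job_summary_queued > 0"


def still_queued(response):
    """Return (labels, value) pairs for series still matching the expression."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = response.get("data", {}).get("result", [])
    # Each instant-vector sample is {"metric": {...}, "value": [timestamp, "string"]}.
    return [(series["metric"], float(series["value"][1])) for series in result]


def fetch_alert_series(base_url, timeout=10):
    """Run ALERT_EXPR as an instant query; base_url is an assumed Prometheus address."""
    query = urllib.parse.urlencode({"query": ALERT_EXPR})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return still_queued(json.load(resp))
```

An empty list from `fetch_alert_series` suggests the alert condition no longer holds; any remaining entries show which labelled series are still queued.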
Note: This is a sample runbook and may need to be customized to fit the specific use case and environment.