NomadJobQueued #

Nomad job queued

Alert Rule

alert: NomadJobQueued
annotations:
  description: |-
    Nomad job queued
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/nomad-internal/nomadjobqueued/
  summary: Nomad job queued (instance {{ $labels.instance }})
expr: nomad_nomad_job_summary_queued > 0
for: 2m
labels:
  severity: warning

Here is a sample runbook for the Prometheus alert rule “NomadJobQueued”:

Meaning #

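The NomadJobQueued alert fires when nomad_nomad_job_summary_queued stays above 0 for 2 minutes, meaning a Nomad job has at least one allocation that is queued rather than running. This usually indicates that the allocation cannot be placed because of resource constraints, configuration issues, or other problems.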
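
To make the condition concrete, the sketch below picks out the task groups that currently have queued allocations. It assumes the JSON shape of Nomad's job summary endpoint (/v1/job/<job_id>/summary), where each task group carries a Queued count. The function name and the sample data are hypothetical and stand in for a live API call.

```python
from typing import Any, Dict


def queued_task_groups(summary: Dict[str, Any]) -> Dict[str, int]:
    """Return {task_group: queued_count} for groups with queued allocations."""
    # "Summary" maps task group names to per-status allocation counts
    # (assumed shape; it may be missing or null for an empty response).
    groups = summary.get("Summary") or {}
    return {
        name: counts.get("Queued", 0)
        for name, counts in groups.items()
        if counts.get("Queued", 0) > 0
    }


# Illustrative sample, not real cluster output.
SAMPLE_SUMMARY = {
    "JobID": "example",
    "Summary": {
        "web": {"Queued": 2, "Starting": 0, "Running": 1, "Failed": 0, "Complete": 0, "Lost": 0},
        "cache": {"Queued": 0, "Starting": 0, "Running": 1, "Failed": 0, "Complete": 0, "Lost": 0},
    },
}

if __name__ == "__main__":
    print(queued_task_groups(SAMPLE_SUMMARY))
```

Any task group returned here corresponds to a non-zero value of the alert's metric, so it is a reasonable starting point for the diagnosis.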

Impact #

While allocations remain queued, the Nomad job is running below its intended capacity or not running at all, which can lead to:

Delays in task execution
Increased latency
Potential data loss or inconsistencies
Impact on dependent services or applications

Diagnosis #

To diagnose the issue, follow these steps:

Check the Nomad job configuration to ensure it is correct and up-to-date.
Verify that the required resources (e.g. CPU, memory, network) are available for the job to run.
Investigate any recent changes to the Nomad cluster or job configuration that may be causing the issue.
Check the Nomad logs for any errors or warnings related to the job.
Verify that the job is not stuck in a failed or paused state.
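
One way to work through the steps above is to read the placement metrics on the job's most recent evaluation, which is also what the "Placement Failure" section of nomad job status reports. The sketch below is a minimal example under the assumption that the evaluation JSON (for example from /v1/job/<job_id>/evaluations) contains a FailedTGAllocs map with NodesEvaluated, NodesFiltered, NodesExhausted, ConstraintFiltered and DimensionExhausted fields. The helper name and the sample evaluation are illustrative, not real cluster output.

```python
from typing import Any, Dict, List


def explain_placement_failures(evaluation: Dict[str, Any]) -> List[str]:
    """Summarize why task groups in an evaluation could not be placed."""
    lines: List[str] = []
    failed = evaluation.get("FailedTGAllocs") or {}
    for group, metrics in sorted(failed.items()):
        lines.append(
            f"{group}: {metrics.get('NodesEvaluated', 0)} node(s) evaluated, "
            f"{metrics.get('NodesFiltered', 0)} filtered, "
            f"{metrics.get('NodesExhausted', 0)} exhausted"
        )
        # These maps may be null in the JSON, hence the `or {}` fallbacks.
        for dimension, count in sorted((metrics.get("DimensionExhausted") or {}).items()):
            lines.append(f"{group}: {dimension} exhausted on {count} node(s)")
        for constraint, count in sorted((metrics.get("ConstraintFiltered") or {}).items()):
            lines.append(f"{group}: constraint '{constraint}' filtered {count} node(s)")
    return lines


# Hypothetical evaluation used only to exercise the helper.
SAMPLE_EVAL = {
    "ID": "hypothetical-eval-id",
    "FailedTGAllocs": {
        "web": {
            "NodesEvaluated": 3,
            "NodesFiltered": 1,
            "NodesExhausted": 2,
            "ConstraintFiltered": {"${attr.kernel.name} = linux": 1},
            "DimensionExhausted": {"memory": 2},
        }
    },
}

if __name__ == "__main__":
    for line in explain_placement_failures(SAMPLE_EVAL):
        print(line)
```

Exhausted dimensions point toward a capacity problem, while filtered constraints point toward the job configuration, which helps decide which mitigation step applies.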

Mitigation #

To mitigate the issue, follow these steps:

Correct the Nomad job configuration if the diagnosis shows it is wrong or out of date.
If the cluster lacks the resources the job requires, add capacity or reduce the job's resource requests.
Revert or correct any recent changes to the Nomad cluster or job configuration that are causing the issue.
Restart the Nomad job or update the job configuration to allow it to run successfully.
Monitor the job to ensure it is running successfully and no further issues occur.
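
The final monitoring step can be sketched as a small polling loop. This is a minimal example, assuming a caller-supplied fetch_summary function (hypothetical) that returns the job summary JSON with per-task-group Queued counts. The timeout and interval defaults are arbitrary illustration values, not recommendations.

```python
import time
from typing import Any, Callable, Dict


def wait_until_drained(
    fetch_summary: Callable[[], Dict[str, Any]],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no allocations are queued; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        groups = fetch_summary().get("Summary") or {}
        total_queued = sum(counts.get("Queued", 0) for counts in groups.values())
        if total_queued == 0:
            return True  # nothing queued any more
        if time.monotonic() >= deadline:
            return False  # still queued when the timeout expired
        sleep(interval_s)


if __name__ == "__main__":
    # Canned responses stand in for real API calls (illustrative data only).
    responses = iter([
        {"Summary": {"web": {"Queued": 2}}},
        {"Summary": {"web": {"Queued": 0}}},
    ])
    print(wait_until_drained(lambda: next(responses), interval_s=0))
```

Passing the fetch function in keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.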

Note: This is a sample runbook and may need to be customized to fit the specific use case and environment.