NomadJobFailed #
Nomad job failed
Alert Rule
alert: NomadJobFailed
annotations:
  description: |-
    Nomad job failed
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/nomad-internal/nomadjobfailed/
  summary: Nomad job failed (instance {{ $labels.instance }})
expr: nomad_nomad_job_summary_failed > 0
for: 0m
labels:
  severity: warning
Meaning #
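This alert fires when nomad_nomad_job_summary_failed is greater than 0, meaning Nomad reports at least one failed allocation for a job. Because the rule uses for: 0m, it fires on the first evaluation that sees a nonzero value. Nomad is a workload orchestrator that schedules and runs tasks and services, so a failed job can affect the health and availability of anything that depends on it.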
Impact #
The impact depends on the job’s purpose and the services it provides. Possible consequences include:
- Disruption to critical services that the job runs
- Lost data or missed processing, for example a batch job that never completes
- Increased latency or error rates in services that depend on the job
- Reduced capacity if failed allocations are not replaced
Diagnosis #
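To diagnose the issue, follow these steps:
- Use the alert labels to identify the affected job, then inspect it in the Nomad UI or with the CLI (nomad job status <job>) to find the failed allocations and tasks.
- Inspect each failed allocation (nomad alloc status <alloc-id>) and its task logs (nomad alloc logs -stderr <alloc-id>) for the failure reason, such as a non-zero exit code, a driver error, or an out-of-memory kill.
- Review the job’s configuration (image or artifact, resources, constraints, environment) for recent changes that could explain the failure.
- Check the Nomad agent logs on the client node that ran the allocation for errors or warnings related to the job.
- Verify that the job’s dependencies, such as databases or file systems, are available and functional.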
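The first diagnosis step can also be scripted against the Nomad HTTP API. The following is a minimal sketch, not a definitive tool: it assumes the agent is reachable at the address in NOMAD_ADDR (falling back to http://127.0.0.1:4646), that an ACL token, if required, is in NOMAD_TOKEN, and that the job summary endpoint returns per-task-group counters under a "Summary" key. The function names are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import Dict, Optional


def fetch_job_summary(job_id: str, addr: Optional[str] = None) -> dict:
    """Fetch /v1/job/<job_id>/summary from the Nomad HTTP API."""
    # Assumption: NOMAD_ADDR / NOMAD_TOKEN are set as for the Nomad CLI.
    addr = addr or os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")
    url = f"{addr}/v1/job/{urllib.parse.quote(job_id, safe='')}/summary"
    req = urllib.request.Request(url)
    token = os.environ.get("NOMAD_TOKEN")
    if token:
        req.add_header("X-Nomad-Token", token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def failed_groups(summary: dict) -> Dict[str, int]:
    """Return {task_group: failed_count} for groups with failed allocations."""
    groups = summary.get("Summary") or {}
    return {
        name: counts.get("Failed", 0)
        for name, counts in groups.items()
        if counts.get("Failed", 0) > 0
    }
```

Calling failed_groups(fetch_job_summary("example")) for a hypothetical job named "example" returns only the task groups that currently report failed allocations, which narrows down where to look next.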
Mitigation #
To mitigate the issue, follow these steps:
- Identify and address the root cause of the failure, such as a misconfigured job or an unavailable dependency.
- Once the cause is fixed, restart or resubmit the failed job or task, if possible.
- Confirm that the corrected configuration and the job’s dependencies work as expected before relying on the job again.
- Consider implementing retries or fallbacks for failed jobs to minimize downtime and data loss.
- Monitor the job’s status and performance after recovery to confirm that it keeps running correctly, and adjust as needed.
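The retry-and-fallback suggestion above can be sketched at the application level as follows. This is an illustrative pattern under stated assumptions, not Nomad functionality: the helper name call_with_retries and its parameters are hypothetical, and the operation stands in for whatever dependency call the task makes. Nomad's job specification also offers restart and reschedule blocks that handle retries at the scheduler level.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    fallback: Optional[Callable[[], T]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation with exponential backoff; use fallback if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: fall back if possible, otherwise re-raise.
                if fallback is not None:
                    return fallback()
                raise
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the backoff can be exercised in tests without waiting. Retrying inside the task suits short, transient dependency outages; longer outages are usually better left to the scheduler's own restart and reschedule handling.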