KubernetesPodNotHealthy #
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.
Alert Rule
alert: KubernetesPodNotHealthy
annotations:
  description: |-
    Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/kubestate-exporter/kubernetespodnothealthy/
  summary: Kubernetes Pod not healthy ({{ $labels.namespace }}/{{ $labels.pod }})
expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
for: 15m
labels:
  severity: critical
Meaning #
The KubernetesPodNotHealthy alert is triggered when a Kubernetes pod has been in a non-running state (Pending, Unknown, or Failed) for more than 15 minutes. This alert indicates a potential issue with the pod or the underlying cluster resources that is preventing the pod from running successfully.
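To see which pods are currently in one of these phases, a kubectl field selector can approximate the alert expression. The sketch below is illustrative only: the function name `list_unhealthy_pods` is hypothetical, and it reports the current phase without the 15-minute condition that the alert applies.

```sh
# Sketch: list pods whose phase is Pending, Unknown, or Failed,
# i.e. every phase except Running and Succeeded.
# Usage: list_unhealthy_pods [namespace]   (hypothetical helper name)
list_unhealthy_pods() {
  if [ -n "${1:-}" ]; then
    # Limit the search to one namespace.
    kubectl get pods -n "$1" \
      --field-selector='status.phase!=Running,status.phase!=Succeeded'
  else
    # Search the whole cluster.
    kubectl get pods --all-namespaces \
      --field-selector='status.phase!=Running,status.phase!=Succeeded'
  fi
}
```

Calling `list_unhealthy_pods` with no argument searches every namespace. Passing the namespace from the alert's labels narrows the output to the affected workload.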
Impact #
A pod stuck in one of these phases is not doing its work, so the application or service it backs may be degraded or down. Depending on the workload, this can lead to:
- Downtime or unavailability of the application or service
- Loss of data or inconsistent data
- Increased latency or errors
- Negative impact on user experience or business operations
Diagnosis #
To diagnose the issue, follow these steps:
- Check the pod’s status using
kubectl describe pod <pod_name> -n <namespace>
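- Check the pod’s logs (a Pending pod that never started a container has none; add --previous to see the last run of a crashed container) using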
kubectl logs <pod_name> -n <namespace>
- Check the cluster’s resource usage and node status using
kubectl top nodes
and kubectl describe node <node_name>
- Check for any ongoing deployments or rollouts that may be affecting the pod
- Check the pod’s configuration and deployment YAML files for any errors or inconsistencies
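The command-based checks above can be collected into one helper so that the output for a single pod is gathered in a consistent order. This is a minimal sketch, not a definitive procedure: the function name `diagnose_pod` and the 100-line log tail are assumptions, and `kubectl top nodes` only works if a metrics API such as metrics-server is installed in the cluster.

```sh
# Sketch: gather diagnostic output for one pod.
# Usage: diagnose_pod <namespace> <pod_name>   (hypothetical helper name)
diagnose_pod() {
  local ns="$1"
  local pod="$2"
  local node

  # Phase, container states, and recent events such as scheduling or image pull failures.
  kubectl describe pod "$pod" -n "$ns"

  # Recent logs from all containers; a Pending pod often has none, so ignore errors.
  kubectl logs "$pod" -n "$ns" --all-containers --tail=100 || true

  # Node the pod was assigned to; empty if the scheduler has not placed it yet.
  node="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')"
  if [ -n "$node" ]; then
    kubectl describe node "$node"
  else
    echo "pod $ns/$pod is not scheduled to any node"
  fi

  # Cluster-wide node CPU and memory usage (assumes a metrics API is available).
  kubectl top nodes || true
}
```

An empty node name usually points at a scheduling problem (insufficient resources, taints, or unsatisfied affinity rules), in which case the events section of `kubectl describe pod` is the most informative part of the output.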
Mitigation #
To mitigate the issue, follow these steps:
- If the pod is managed by a Deployment, restart it (this replaces all of the Deployment’s pods) using
kubectl rollout restart deployment <deployment_name> -n <namespace>
- Check and resolve any underlying issues with the cluster resources or nodes
- Verify that the pod’s configuration and deployment YAML files are correct and up to date
- Check for any ongoing deployments or rollouts and pause or cancel them if necessary
- If the issue persists, consider escalating to a senior engineer or the DevOps team for further assistance
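The restart step can be paired with a wait so that a restart which does not recover the workload is noticed immediately. The following is a sketch under stated assumptions: the function name `restart_deployment` and the default 5-minute timeout are illustrative choices, and it only applies when the pod belongs to a Deployment.

```sh
# Sketch: restart a Deployment and wait for the rollout to finish.
# Usage: restart_deployment <namespace> <deployment_name> [timeout]   (hypothetical helper name)
restart_deployment() {
  local ns="$1"
  local deploy="$2"
  local timeout="${3:-5m}"   # assumed default; tune for slow-starting workloads

  # Trigger a rolling replacement of all pods in the Deployment.
  kubectl rollout restart deployment "$deploy" -n "$ns" || return 1

  # Block until the new pods are ready or the timeout expires.
  if ! kubectl rollout status deployment "$deploy" -n "$ns" --timeout="$timeout"; then
    echo "rollout of $ns/$deploy did not complete within $timeout" >&2
    # Show past revisions to help decide whether to roll back.
    kubectl rollout history deployment "$deploy" -n "$ns"
    return 1
  fi
}
```

If the rollout stalls, `kubectl rollout pause deployment <deployment_name> -n <namespace>` stops further changes and `kubectl rollout undo deployment <deployment_name> -n <namespace>` returns to the previous revision.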
Note: The runbook URL provided in the alert annotations points to a more detailed runbook that can be used for further guidance and troubleshooting.