KubernetesPodNotHealthy

Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.

Alert Rule

alert: KubernetesPodNotHealthy
annotations:
  description: |-
    Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/kubestate-exporter/kubernetespodnothealthy/
  summary: Kubernetes Pod not healthy ({{ $labels.namespace }}/{{ $labels.pod }})
expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})
  > 0
for: 15m
labels:
  severity: critical
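
To see which pods currently match the rule's expression, the same PromQL can be run ad hoc against the Prometheus HTTP API. The following is a minimal sketch, not part of the alert itself: it assumes Prometheus is reachable at PROM_URL (a placeholder that defaults to http://localhost:9090), and the function name query_unhealthy_pods is illustrative.

```shell
# Assumption: PROM_URL points at your Prometheus server; adjust as needed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run the alert expression as an instant query and print the raw JSON response.
# Note: this ignores the rule's "for: 15m" clause, so it also returns pods
# that have only just entered a non-running phase.
query_unhealthy_pods() {
  curl -sG "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0'
}
```

Calling query_unhealthy_pods prints one result entry per namespace/pod pair that the expression currently matches.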

Meaning

The KubernetesPodNotHealthy alert fires when a Kubernetes pod has been in a non-running phase (Pending, Unknown, or Failed) for more than 15 minutes. It indicates a problem with the pod itself or with the underlying cluster resources that is preventing the pod from running.
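
As a quick cross-check from the cluster side, the pods in each of those phases can be listed with kubectl field selectors. This is a sketch that assumes kubectl is configured for the affected cluster; the function name list_unhealthy_pods is illustrative.

```shell
# List pods in each phase the alert matches, across all namespaces.
list_unhealthy_pods() {
  for phase in Pending Unknown Failed; do
    kubectl get pods --all-namespaces --field-selector="status.phase=${phase}" -o wide
  done
}
```

The output shows pods regardless of how long they have been in that phase, so a pod listed here has not necessarily been unhealthy for the full 15 minutes.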

Impact

A pod stuck in a non-running phase may mean that the application or service it provides has failed. This can lead to:

Downtime or unavailability of the application or service
Loss of data or inconsistent data
Increased latency or errors
Negative impact on user experience or business operations

Diagnosis

To diagnose the issue, follow these steps:

Check the pod’s status using kubectl describe pod <pod_name> -n <namespace>
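Check the pod’s logs using kubectl logs <pod_name> -n <namespace> (add --previous to see logs from a container instance that has already exited; a Pending pod that never started may have no logs)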
Check the cluster’s resource usage and node status using kubectl top nodes and kubectl describe node <node_name>
Check for any ongoing deployments or rollouts that may be affecting the pod
Check the pod’s configuration and deployment YAML files for any errors or inconsistencies
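
The command-line steps above can be gathered into one helper. This is a sketch under stated assumptions: kubectl is configured for the affected cluster, metrics-server is installed (kubectl top depends on it), and the function name diagnose_pod is illustrative. The events query is an addition not listed in the steps above.

```shell
# Usage: diagnose_pod <namespace> <pod_name>
diagnose_pod() {
  ns="$1"
  pod="$2"

  # Status, conditions, and recent events for the pod.
  kubectl describe pod "$pod" -n "$ns"

  # Events that reference this pod, oldest first
  # (for example scheduling failures or image pull errors).
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=${pod}" --sort-by=.lastTimestamp

  # Logs from the current containers, then from a previous container instance.
  # Either call can fail for a pod that never started, so errors are tolerated.
  kubectl logs "$pod" -n "$ns" --all-containers --tail=100 || true
  kubectl logs "$pod" -n "$ns" --previous --tail=100 || true

  # Node status and resource usage.
  kubectl get nodes
  kubectl top nodes || true
}
```

Call it as diagnose_pod <namespace> <pod_name>, then follow up with kubectl describe node <node_name> for the node named in the describe output.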

Mitigation

To mitigate the issue, follow these steps:

If the pod is managed by a Deployment, restart it using kubectl rollout restart deployment <deployment_name> -n <namespace>
Check and resolve any underlying issues with the cluster resources or nodes
Verify that the pod’s configuration and deployment YAML files are correct and up-to-date
Check for any ongoing deployments or rollouts and pause or cancel them if necessary
If the issue persists, escalate to a senior engineer or the DevOps team for further assistance.
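
The restart and rollout steps can be sketched as small helpers. These assume the pod belongs to a Deployment and that kubectl is configured for the affected cluster; the function names and the 5 minute timeout are illustrative. Treating "cancel" as a rollback with kubectl rollout undo is an assumption about what is intended here.

```shell
# Usage: restart_deployment <namespace> <deployment_name>
# Triggers a rolling restart, then waits for it to finish or time out.
restart_deployment() {
  ns="$1"
  deploy="$2"
  kubectl rollout restart "deployment/${deploy}" -n "$ns" &&
    kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=5m
}

# Usage: pause_rollout <namespace> <deployment_name>
# Pauses a rollout that is still in progress.
pause_rollout() {
  kubectl rollout pause "deployment/$2" -n "$1"
}

# Usage: undo_rollout <namespace> <deployment_name>
# Rolls the Deployment back to its previous revision.
undo_rollout() {
  kubectl rollout undo "deployment/$2" -n "$1"
}
```

A restart only helps when the cause is transient; if the replacement pods also stay Pending or Failed, the underlying resource or configuration problem still needs to be fixed first.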

Note: The runbook URL provided in the alert annotations points to a more detailed runbook that can be used for further guidance and troubleshooting.