KubernetesNodeDiskPressure #

Node {{ $labels.node }} has DiskPressure condition

Alert Rule #
alert: KubernetesNodeDiskPressure
annotations:
  description: |-
    Node {{ $labels.node }} has DiskPressure condition
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/kubestate-exporter/kubernetesnodediskpressure/
  summary: Kubernetes disk pressure (node {{ $labels.node }})
expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
for: 2m
labels:
  severity: critical
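
To see which nodes currently report the condition this alert fires on, you can query node status directly; a minimal sketch using kubectl's JSONPath support:

# List every node together with its DiskPressure status (True/False/Unknown).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'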

Meaning #

The KubernetesNodeDiskPressure alert fires when the kubelet sets the DiskPressure condition on a node, meaning that available disk space or inodes on the node's filesystems (the node root filesystem, nodefs, or the container runtime's image filesystem, imagefs) have dropped below the kubelet's eviction thresholds. This alert is critical because the kubelet starts evicting pods to reclaim space and the node stops accepting new pods, which affects pod scheduling, deployments, and overall cluster stability.
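
The thresholds are part of the kubelet configuration; by default the hard eviction thresholds include nodefs.available < 10% and imagefs.available < 15%. A quick way to inspect them on the affected node; the config path is an assumption, since it varies by distribution:

# On the node (via SSH): show eviction thresholds if set in the kubelet config.
# /var/lib/kubelet/config.yaml is a common default but an assumption here.
grep -A5 -i 'eviction' /var/lib/kubelet/config.yaml

# Fallback: check whether thresholds were passed as kubelet command-line flags.
ps aux | grep '[k]ubelet' | tr ' ' '\n' | grep -i eviction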

Impact #

The impact of disk pressure on a Kubernetes node can be significant:

  • The kubelet evicts pods to reclaim disk space, so running workloads may be killed
  • The node is tainted with node.kubernetes.io/disk-pressure:NoSchedule, so new pods are not scheduled onto it
  • Deployments may fail or get stuck with pods in a Pending state
  • If the disk fills up completely, the node may become unresponsive or crash
  • Overall cluster performance and reliability may be affected
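
To confirm the scheduling impact, check whether the disk-pressure taint is present and whether pods have already been evicted:

# A node under disk pressure carries the NoSchedule taint shown below.
kubectl describe node <node_name> | grep -A2 'Taints'

# List recent evictions across the cluster; disk-pressure evictions typically
# mention ephemeral-storage or the DiskPressure condition in their messages.
kubectl get events --all-namespaces --field-selector reason=Evicted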

Diagnosis #

To diagnose the issue, follow these steps (a consolidated command sketch follows the list):

  1. Check the node’s conditions and recent events using kubectl describe node <node_name>, looking for DiskPressure under Conditions
  2. Check actual disk and inode usage on the node itself with df -h and df -i (via SSH or a node debug pod); note that kubectl top node reports only CPU and memory, not disk
  3. Check the node’s ephemeral-storage capacity and allocatable values using kubectl describe node <node_name> | grep -i -A6 capacity
  4. Identify the pods and containers consuming the most disk space, for example with kubectl exec -it <pod_name> -- df -h for a pod’s mounted filesystems
  5. Verify that no pods are stuck in a Pending state as a result, using kubectl get pods --all-namespaces --field-selector=status.phase=Pending
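
A consolidated sketch of the commands above; <node_name>, the busybox image, and the directory paths are assumptions to adapt to your cluster and container runtime:

# 1. Confirm the condition and look for eviction-related events.
kubectl describe node <node_name> | grep -A10 'Conditions:'

# 2. Open a shell on the node to inspect its filesystems directly
#    (assumes node debug pods are permitted in your cluster).
kubectl debug node/<node_name> -it --image=busybox -- chroot /host sh

# Inside the node shell: check space and inode usage, then find big consumers.
# The paths below are typical for containerd-based nodes and are assumptions.
df -h
df -i
du -sh /var/lib/kubelet /var/lib/containerd /var/log 2>/dev/null

# 3. Check ephemeral-storage capacity and allocatable.
kubectl describe node <node_name> | grep -i -A6 capacity

# 4. Look for pods left Pending by the disk-pressure taint.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending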

Mitigation #

To mitigate the issue, follow these steps (example commands are sketched after the list):

  1. Clean up the node’s filesystem: connect to the node via SSH or kubectl debug (not kubectl exec, which only attaches to containers) and remove unneeded files such as old logs, then prune unused container images
  2. Scale down or terminate resource-intensive pods: identify pods writing excessive data to ephemeral storage and scale them down or delete them if it is safe to do so
  3. Increase disk space: if possible, grow the node’s disk or add storage, or upgrade the node’s hardware
  4. Set ephemeral-storage requests and limits on pods: this lets the kubelet account for disk usage and evict the worst offenders before the whole node reaches DiskPressure
  5. Automate cleanup: make sure kubelet image garbage collection and container log rotation are configured so the node is cleaned up regularly
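
A minimal sketch of the cleanup and limit-setting steps; the deployment name my-app and the container name app are hypothetical placeholders:

# On the node: prune container images not used by any running container.
crictl rmi --prune

# Find oversized container logs (candidates for rotation or truncation).
find /var/log/pods -name '*.log' -size +500M

# Scale down a disk-hungry workload (my-app is a hypothetical deployment).
kubectl scale deployment my-app --replicas=1

# Add ephemeral-storage requests/limits so the kubelet can evict the worst
# offenders instead of letting the whole node hit DiskPressure.
kubectl patch deployment my-app --patch '
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            ephemeral-storage: "1Gi"
          limits:
            ephemeral-storage: "2Gi"
'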

Remember to monitor the node’s disk usage and adjust your mitigation strategy as needed.
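
If node-exporter is deployed, the node’s disk headroom can be watched directly in Prometheus; a sketch, assuming Prometheus is reachable at http://prometheus:9090 and jq is available (both assumptions):

# Percentage of filesystem space still available, per node and mountpoint.
# The Prometheus URL is an assumption; adjust it to your environment.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}' \
  | jq '.data.result[] | {instance: .metric.instance, mountpoint: .metric.mountpoint, percent_free: .value[1]}'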