KubernetesNodeDiskPressure #
Node {{ $labels.node }} has DiskPressure condition
Alert Rule #
```yaml
alert: KubernetesNodeDiskPressure
annotations:
  description: |-
    Node {{ $labels.node }} has DiskPressure condition
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/kubestate-exporter/kubernetesnodediskpressure/
  summary: Kubernetes disk pressure (node {{ $labels.node }})
expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
for: 2m
labels:
  severity: critical
```
Here is a runbook for the KubernetesNodeDiskPressure alert:
Meaning #
The KubernetesNodeDiskPressure alert fires when a node reports the DiskPressure condition, which the kubelet sets when available disk space or inodes on the node's filesystems fall below its configured eviction thresholds. This alert is critical because disk pressure interferes with pod scheduling and deployments, can trigger pod evictions, and threatens overall cluster stability.
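As a quick first check, the query below lists every node together with its DiskPressure status; this is a minimal sketch using standard kubectl JSONPath output.
```sh
# List each node with its DiskPressure condition status (True means the alert condition is active)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
```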
Impact #
The impact of disk pressure on a Kubernetes node can be significant:
- Pods may not be scheduled on the node due to lack of disk space
- Deployments may fail or be stuck in a pending state
- The node may become unresponsive or crash due to disk full errors
- Cluster performance and reliability may be affected
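Two of these effects can be verified directly: when the kubelet reports DiskPressure it also taints the node with node.kubernetes.io/disk-pressure, which keeps new pods off it, and it may start evicting pods. The commands below are a small sketch for confirming both; <node_name> is a placeholder.
```sh
# DiskPressure adds the node.kubernetes.io/disk-pressure taint, which blocks new pods from scheduling
kubectl describe node <node_name> | grep -i taints
# Pods evicted by the kubelet show up as Failed with reason Evicted
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
```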
Diagnosis #
To diagnose the issue, follow these steps:
- Check the node's reported conditions and disk status with `kubectl describe node <node_name>`; the Conditions section shows why the kubelet set DiskPressure. Note that `kubectl top node <node_name>` reports only CPU and memory, so actual filesystem usage has to be checked on the node itself (see the sketch after this list).
- Identify the pods and containers consuming the most disk space, for example with `kubectl exec -it <pod_name> -- df -h` inside a running container, or `kubectl describe pod <pod_name>` to review its ephemeral-storage requests and limits.
- Check the node's disk capacity and allocatable ephemeral storage with `kubectl describe node <node_name> | grep -i -A 8 capacity`.
- Verify that there are no stuck or pending deployments with `kubectl get deployments --all-namespaces`.
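Because node-level filesystem usage is not visible through kubectl, it helps to look at the node directly. The sketch below uses kubectl debug to open a shell on the node; the /host/var/lib/... paths assume a typical kubelet and containerd layout and may differ in your cluster.
```sh
# Open a debug shell on the node; the host filesystem is mounted under /host
kubectl debug node/<node_name> -it --image=busybox

# Inside the debug shell: check the filesystems the kubelet watches for disk pressure.
# Paths are assumptions based on a common kubelet/containerd layout.
df -h /host/var/lib/kubelet /host/var/lib/containerd
df -i /host/var/lib/kubelet   # inode exhaustion also triggers DiskPressure
```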
Mitigation #
To mitigate the issue, follow these steps:
- Identify and remove unnecessary files and directories: use `kubectl exec` to clean up inside containers, and connect to the node itself (via SSH or `kubectl debug node/<node_name>`) to remove unused images, dead containers, and old logs consuming disk space (see the cleanup sketch after this list). Note that `kubectl exec` targets containers, not nodes.
- Scale down or terminate resource-intensive pods: identify pods consuming excessive disk space and scale them down or delete them if possible.
- Increase disk space: if possible, add storage to the node or upgrade it to larger disks.
- Check and adjust pod resource requests and limits: review ephemeral-storage requests and limits to ensure they are reasonable and that no single pod can fill the node's disk (see the example manifest after this list).
- Consider implementing a cleanup process: schedule regular image and log cleanup, or tune the kubelet's image garbage-collection thresholds, so unused data is removed from the node automatically.
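A minimal node-level cleanup sketch, assuming a containerd-based node reached via SSH or kubectl debug; verify what each command will delete before running it in your environment.
```sh
# Remove container images that no pod currently uses (containerd runtime assumed)
crictl rmi --prune
# Trim the systemd journal to a bounded size; 200M is an arbitrary example value
journalctl --vacuum-size=200M
# On Docker-based nodes, the rough equivalent would be: docker system prune
```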
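For the requests-and-limits point above, the fragment below sketches what an ephemeral-storage limit looks like; the pod name, image, and sizes are placeholders. The kubelet evicts a pod whose container exceeds its ephemeral-storage limit, which keeps a single workload from filling the node's disk.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # placeholder name
spec:
  containers:
  - name: app
    image: nginx           # placeholder image
    resources:
      requests:
        ephemeral-storage: "1Gi"
      limits:
        ephemeral-storage: "2Gi"   # pod is evicted if it exceeds this
```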
After mitigation, keep monitoring the node's disk usage, confirm that the DiskPressure condition clears, and adjust your strategy as needed.