HostOutOfMemory #
Node memory is filling up (< 10% left)
Alert Rule
alert: HostOutOfMemory
annotations:
  description: |-
    Node memory is filling up (< 10% left)
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/hostoutofmemory/
  summary: Host out of memory (instance {{ $labels.instance }})
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance)
  group_left (nodename) node_uname_info{nodename=~".+"}
for: 2m
labels:
  severity: warning
Meaning #
The HostOutOfMemory alert fires when a node’s available memory (node_memory_MemAvailable_bytes) stays below 10% of its total memory (node_memory_MemTotal_bytes) for 2 minutes. At this level the node is running low on memory, which can degrade performance, slow down applications, and lead to crashes or errors.
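The threshold arithmetic can be reproduced on a single host as a rough cross-check. The following is a minimal sketch, assuming a Linux node where /proc/meminfo exposes MemTotal and MemAvailable; the function names and the sample values are hypothetical, and the sketch is a point-in-time check that does not model the 2-minute `for` window.

```python
def available_percent(meminfo_text: str) -> float:
    """Return MemAvailable as a percentage of MemTotal from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # the kernel reports these in kB
    return fields["MemAvailable"] / fields["MemTotal"] * 100


def breaches_threshold(meminfo_text: str, threshold_percent: float = 10.0) -> bool:
    """Mirror the alert comparison: available memory below the threshold."""
    return available_percent(meminfo_text) < threshold_percent


# Hypothetical sample values, for illustration only.
SAMPLE = """MemTotal:       16000000 kB
MemFree:          400000 kB
MemAvailable:    1200000 kB
"""

if __name__ == "__main__":
    print(f"{available_percent(SAMPLE):.1f}% available")
    print("below threshold:", breaches_threshold(SAMPLE))
```

On a real node, the text would come from reading /proc/meminfo instead of the sample string.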
Impact #
If left unaddressed, low memory can affect the node and every application running on it. Potential consequences include:
- Slow performance and response times
- Increased latency and timeouts
- Application crashes or errors
- Data loss or corruption
- Increased risk of security breaches
Diagnosis #
To diagnose the root cause of the HostOutOfMemory alert, follow these steps:
- Check the node’s memory usage patterns:
- Use Prometheus queries to inspect the node’s memory usage over time.
- Look for trends, spikes, or anomalies in memory consumption.
- Identify resource-intensive processes:
- Use tools like top, htop, or ps to identify processes consuming high amounts of memory.
- Check for any memory leaks or inefficient memory allocation.
- Review system configuration and resource allocation:
- Verify that the node has sufficient memory resources allocated.
- Check for any misconfigured resources, such as swap space or memory limits.
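The process-inspection step above can be sketched as a small script that ranks processes by resident memory. This is a minimal sketch, assuming a procps-style `ps` that accepts `-eo pid,rss,comm` and reports RSS in KiB; the function name `top_memory_processes` is hypothetical.

```python
import subprocess


def top_memory_processes(ps_output: str, limit: int = 5):
    """Parse `ps -eo pid,rss,comm` output and return the largest processes.

    Returns a list of (pid, rss_kib, command) tuples sorted by RSS, largest first.
    """
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and anything that does not look like "pid rss comm".
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    # Assumes `ps` is on PATH and supports these output columns.
    output = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, rss_kib, command in top_memory_processes(output):
        print(f"{pid:>8} {rss_kib / 1024:>10.1f} MiB  {command}")
```

Running it a few minutes apart and comparing the output gives a rough indication of whether a single process keeps growing, which is one sign of a leak.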
Mitigation #
To mitigate the HostOutOfMemory alert, follow these steps:
- Increase node memory capacity:
- Add more physical RAM to the node, if possible.
- Consider migrating to a more powerful instance type or virtual machine.
- Optimize resource-intensive processes:
- Identify and fix memory leaks or inefficient memory allocation in applications.
- Optimize configuration and tuning for resource-intensive processes.
- Implement memory-saving measures:
- Bound the size of in-memory caches and buffers so that they cannot grow without limit.
- Implement memory-saving features, such as compression or data deduplication.
- Monitor and prevent future occurrences:
- Set up regular memory usage checks and alerts.
- Implement automated scripts to detect and respond to memory usage anomalies.
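An automated check along these lines could poll Prometheus and list the instances that are currently below the threshold. The following is a minimal sketch, assuming a Prometheus server reachable at a base URL such as http://localhost:9090 (an assumption, not something this runbook specifies) and the standard instant-query endpoint /api/v1/query; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Same ratio the alert rule uses, without the comparison and the label join.
EXPR = "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100"


def build_query_url(base_url: str, expr: str = EXPR) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instances_below(response: dict, threshold_percent: float = 10.0):
    """Return (instance, available_percent) pairs below the threshold, lowest first."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    low = []
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        value = float(series["value"][1])  # sample values arrive as strings
        if value < threshold_percent:
            low.append((instance, value))
    return sorted(low, key=lambda item: item[1])


def check(base_url: str):
    """Query Prometheus and return the instances that are low on memory."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=10) as resp:
        return instances_below(json.load(resp))


if __name__ == "__main__":
    # The base URL is an assumption; point it at your own Prometheus server.
    for instance, percent in check("http://localhost:9090"):
        print(f"{instance}: {percent:.1f}% memory available")
```

A script like this complements the alert rule rather than replacing it: the rule handles paging, while the script can feed a scheduled report or trigger a follow-up action.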
Remember to update the runbook annotation in the Prometheus alert rule to point to this runbook, so that operators can easily access these instructions when the alert is triggered.