HostOutOfMemory

Node memory is filling up (< 10% left)

Alert Rule

alert: HostOutOfMemory
annotations:
  description: |-
    Node memory is filling up (< 10% left)
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/hostoutofmemory/
  summary: Host out of memory (instance {{ $labels.instance }})
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance)
  group_left (nodename) node_uname_info{nodename=~".+"}
for: 2m
labels:
  severity: warning
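
To make the rule's mechanics concrete, here is a simplified Python model of what the expression and `for: 2m` do. It is a sketch under stated assumptions, not Prometheus code: each metric is a plain dict keyed by `instance`, label matching is reduced to that one label, and the names `evaluate_expr` and `update_alerts` are hypothetical. It illustrates three points: the `< 10` comparison filters series instead of returning true/false, the `* on(instance) group_left (nodename)` join keeps the percentage as the value (node_uname_info always has the value 1) while attaching the `nodename` label, and an alert stays pending until the condition has held for 2 minutes.

```python
"""Simplified model of how Prometheus evaluates the HostOutOfMemory rule.

Hypothetical illustration, not Prometheus source: each metric is a dict keyed
by the `instance` label, and only the labels the rule touches are modelled.
"""

THRESHOLD_PCT = 10.0  # `< 10` in the expression
FOR_SECONDS = 120     # `for: 2m`


def evaluate_expr(avail, total, uname_info):
    """Model of:
    (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10)
      * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}

    avail, total: {instance: bytes}
    uname_info:   {instance: nodename}  (node_uname_info always has value 1)
    Returns {instance: (value, labels)} for every series the expression keeps.
    """
    result = {}
    for instance, avail_bytes in avail.items():
        if instance not in total:
            continue  # arithmetic between vectors keeps only matching series
        pct = avail_bytes / total[instance] * 100
        if not pct < THRESHOLD_PCT:
            continue  # a filtering comparison (no `bool`) drops the series
        nodename = uname_info.get(instance, "")
        if not nodename:
            continue  # =~".+" needs a non-empty nodename, and on(instance) needs a match
        # Multiplying by node_uname_info (value 1) keeps the percentage as the value;
        # group_left (nodename) copies the nodename label onto the result.
        result[instance] = (pct * 1, {"instance": instance, "nodename": nodename})
    return result


def update_alerts(active_since, result, now):
    """Advance pending/firing state after one evaluation at time `now` (seconds).

    active_since: {instance: first evaluation time the expression matched},
    mutated in place. Returns {instance: "pending" | "firing"}.
    """
    for instance in list(active_since):
        if instance not in result:
            del active_since[instance]  # condition cleared, so the alert resolves
    states = {}
    for instance in result:
        start = active_since.setdefault(instance, now)
        states[instance] = "firing" if now - start >= FOR_SECONDS else "pending"
    return states
```

One consequence of the join worth knowing: an instance with no matching node_uname_info series never appears in the result, so it cannot fire this alert even at 0% available memory.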


Meaning

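The HostOutOfMemory alert fires when a node’s available memory (node_memory_MemAvailable_bytes, the kernel’s estimate of memory that can be allocated without swapping) has stayed below 10% of its total memory (node_memory_MemTotal_bytes) for at least 2 minutes. At this level the node is running short of memory: expect swapping, slower applications, and, if pressure keeps rising, the kernel OOM killer terminating processes, which shows up as crashes or errors.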
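
You can check the same condition directly on an affected Linux host. node_exporter’s meminfo collector derives node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes from /proc/meminfo, so the sketch below parses that file and applies the rule’s 10% threshold. The helper names are hypothetical; the result should be close to the alert’s VALUE, allowing for the time since the last scrape.

```python
"""Sketch: evaluate the HostOutOfMemory condition on the host itself (Linux).

Hypothetical helper names; /proc/meminfo is the same source node_exporter
uses for node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes.
"""
from pathlib import Path

THRESHOLD_PCT = 10.0  # `< 10` in the alert expression


def parse_meminfo(text):
    """Parse lines like 'MemTotal:  16318416 kB' into {field: bytes}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        amount = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            amount *= 1024  # /proc/meminfo reports kB; convert to bytes
        values[key.strip()] = amount
    return values


def available_pct(meminfo):
    """MemAvailable as a percentage of MemTotal, like the alert expression."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"] * 100


def is_out_of_memory(meminfo, threshold_pct=THRESHOLD_PCT):
    """True when the alert's threshold condition holds for this snapshot."""
    return available_pct(meminfo) < threshold_pct


def read_host_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    return parse_meminfo(Path(path).read_text())
```

On a host: `info = read_host_meminfo(); print(f"{available_pct(info):.1f}%", is_out_of_memory(info))`. Note that this checks a single snapshot; the alert additionally requires the condition to persist for 2 minutes.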

Impact

If left unaddressed, memory pressure can degrade the node and every application running on it. Potential consequences include:

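- Slow performance and degraded response times
- Increased latency and timeouts
- Application crashes or errors, including processes terminated by the kernel OOM killer
- Data loss or corruption, for example when a process is killed mid-write
- Increased risk of security breaches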

Diagnosis

To diagnose the root cause of the HostOutOfMemory alert, follow these steps:

Check the node’s memory usage patterns:
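- Use Prometheus queries to inspect the node’s memory usage over time, for example by graphing node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 for the affected instance.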
- Look for trends, spikes, or anomalies in memory consumption.
Identify resource-intensive processes:
- Use tools such as top, htop, or ps to find the processes with the largest resident memory (RSS).
- Check for memory leaks, such as a process whose memory keeps growing without leveling off, and for inefficient memory allocation.
Review system configuration and resource allocation:
- Verify that the node has sufficient memory resources allocated.
- Check for any misconfigured resources, such as swap space or memory limits.
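
For the trend and spike checks in the first step, the Prometheus HTTP API’s /api/v1/query_range endpoint returns the percentage the alert uses as a time series. The Python sketch below fetches it and summarises it with a least-squares slope and the largest drop between consecutive samples. PROMETHEUS_URL, the example instance, and all function names are assumptions for illustration; adjust them to your environment.

```python
"""Sketch: fetch an instance's memory history from Prometheus and summarise the trend.

Assumptions: PROMETHEUS_URL and the instance value are placeholders; data comes
from the /api/v1/query_range endpoint as a matrix of [timestamp, "value"] pairs.
"""
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # placeholder, point this at your server


def fetch_available_pct(instance, start, end, step="60s", base_url=PROMETHEUS_URL):
    """Return the raw query_range response for the alert's percentage expression."""
    sel = f'{{instance="{instance}"}}'
    query = f"node_memory_MemAvailable_bytes{sel} / node_memory_MemTotal_bytes{sel} * 100"
    params = urllib.parse.urlencode({"query": query, "start": start, "end": end, "step": step})
    with urllib.request.urlopen(f"{base_url}/api/v1/query_range?{params}", timeout=10) as resp:
        return json.load(resp)


def parse_matrix(body):
    """Turn a matrix response into {instance: [(unix_ts, value), ...]}."""
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    return {
        series["metric"].get("instance", ""): [(float(t), float(v)) for t, v in series["values"]]
        for series in body["data"]["result"]
    }


def slope_per_hour(points):
    """Least-squares slope in percentage points per hour (negative = shrinking)."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den * 3600 if den else 0.0


def largest_drop(points):
    """Biggest fall in available % between two consecutive samples."""
    return max((a[1] - b[1] for a, b in zip(points, points[1:])), default=0.0)
```

For example, with `end = time.time()`, `parse_matrix(fetch_available_pct("web-1:9100", end - 6 * 3600, end))` covers the last six hours for a hypothetical instance. A steadily negative slope suggests a leak or a growing workload; a single large drop suggests a burst, such as a batch job. Prometheus can also extrapolate on its own: `predict_linear(node_memory_MemAvailable_bytes[1h], 4 * 3600) < 0` matches when the last hour’s trend would exhaust available memory within four hours.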

Mitigation

To mitigate the HostOutOfMemory alert, follow these steps:

Increase node memory capacity:
- Add more physical RAM to the node, if possible.
- Consider moving to an instance type or virtual machine size with more memory.
Optimize resource-intensive processes:
- Identify and fix memory leaks or inefficient memory allocation in applications.
- Tune the memory-related settings of the heaviest services, such as cache sizes, heap limits, or worker counts.
Implement memory-saving measures:
- Bound the size of in-process caches and buffers, or stream data instead of loading it all into memory.
- Implement memory-saving features, such as compression or data deduplication.
Monitor and prevent future occurrences:
- Set up regular memory usage checks and alerts.
- Implement automated scripts to detect and respond to memory usage anomalies.
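
As one possible shape for the automated scripts mentioned above, the following Python sketch polls /proc/meminfo and, the first time available memory drops below the alert’s 10% threshold, prints the largest processes by resident memory from /proc/<pid>/status. It assumes a Linux host; the helper names and the 15% re-arm level are hypothetical choices, not part of the alert rule.

```python
"""Sketch of an automated low-memory responder for a single Linux host.

Hypothetical helper names; the threshold mirrors the alert (10%), and the 15%
re-arm level is an assumption so one incident yields one snapshot.
"""
import time
from pathlib import Path

THRESHOLD_PCT = 10.0  # same threshold as the HostOutOfMemory rule
REARM_PCT = 15.0      # assumption: hysteresis band before the trigger re-arms


def read_available_pct(path="/proc/meminfo"):
    """MemAvailable as a percentage of MemTotal (both reported in kB)."""
    lines = Path(path).read_text().splitlines()
    fields = dict(line.split(":", 1) for line in lines if ":" in line)
    return int(fields["MemAvailable"].split()[0]) / int(fields["MemTotal"].split()[0]) * 100


def parse_status(text):
    """Return (name, rss_bytes) from /proc/<pid>/status; kernel threads have no VmRSS."""
    name, rss = "", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss = int(line.split()[1]) * 1024  # value is reported in kB
    return name, rss


def top_processes(limit=5, proc_root="/proc"):
    """Largest processes by resident memory as (rss_bytes, pid, name)."""
    procs = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            name, rss = parse_status((entry / "status").read_text())
        except OSError:
            continue  # the process exited or its status file is unreadable
        procs.append((rss, int(entry.name), name))
    return sorted(procs, reverse=True)[:limit]


class LowMemoryTrigger:
    """observe() returns True once per incident, when available % first drops below the threshold."""

    def __init__(self, threshold=THRESHOLD_PCT, rearm=REARM_PCT):
        self.threshold, self.rearm, self.armed = threshold, rearm, True

    def observe(self, available_pct):
        if self.armed and available_pct < self.threshold:
            self.armed = False
            return True
        if not self.armed and available_pct > self.rearm:
            self.armed = True  # memory recovered; allow the next incident to trigger
        return False


def watch(interval_s=30):
    """Poll forever; print a process snapshot each time memory runs low."""
    trigger = LowMemoryTrigger()
    while True:
        pct = read_available_pct()
        if trigger.observe(pct):
            print(f"MemAvailable at {pct:.1f}%; largest processes by RSS:")
            for rss, pid, name in top_processes():
                print(f"  pid={pid:<8} rss={rss / 2**20:8.1f} MiB  {name}")
        time.sleep(interval_s)
```

Run `watch()` under a process supervisor, or replace the `print` calls with your own logging. The re-arm band is a design choice: without it, a host hovering around 10% would produce a new snapshot on every poll.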
