HostNodeOvertemperatureAlarm
Physical node temperature alarm triggered
Alert Rule
alert: HostNodeOvertemperatureAlarm
annotations:
  description: |-
    Physical node temperature alarm triggered
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/hostnodeovertemperaturealarm/
  summary: Host node overtemperature alarm (instance {{ $labels.instance }})
expr: ((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1))
  * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
for: 0m
labels:
  severity: critical
Here is a runbook for the HostNodeOvertemperatureAlarm alert rule:
Meaning
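The HostNodeOvertemperatureAlarm alert fires when a hardware temperature sensor on a node reports an alarm, that is, when node_hwmon_temp_crit_alarm_celsius or node_hwmon_temp_alarm equals 1. These flags come from the kernel’s hardware monitoring (hwmon) interface and indicate that a sensor has crossed its configured limit, so the alert signals a threshold breach rather than a specific temperature value. Because the rule uses for: 0m, it fires on the first evaluation that sees the flag, and the join with node_uname_info adds the nodename label to the result. The severity is critical because high temperatures can damage the node’s hardware and lead to system failures or data loss.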
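As a reading aid, the rule's expression can be modelled in a few lines of Python. This is a simplified sketch, not how Prometheus actually evaluates queries: series are plain (labels, value) pairs, the function name firing_series is made up for illustration, and it assumes at most one node_uname_info series per instance.

```python
from typing import Dict, List, Tuple

# A series is modelled as (labels without the metric name, sample value).
Series = Tuple[Dict[str, str], float]


def firing_series(crit_alarm: List[Series],
                  temp_alarm: List[Series],
                  uname_info: List[Series]) -> List[Series]:
    """Rough model of the alert expression; hypothetical helper, not a Prometheus API."""
    # "== 1" is a filter: only series whose alarm flag is set survive.
    crit = [(labels, value) for labels, value in crit_alarm if value == 1]
    temp = [(labels, value) for labels, value in temp_alarm if value == 1]

    # "or" keeps every left-hand series, plus right-hand series whose full
    # label set is not already present on the left.
    seen = {frozenset(labels.items()) for labels, _ in crit}
    combined = crit + [(labels, value) for labels, value in temp
                       if frozenset(labels.items()) not in seen]

    # node_uname_info{nodename=~".+"} keeps only series with a non-empty nodename.
    # Assumption: one such series per instance.
    info_by_instance = {labels["instance"]: (labels["nodename"], value)
                        for labels, value in uname_info
                        if labels.get("nodename")}

    # "* on(instance) group_left(nodename)": match on instance only, keep the
    # left-hand labels, copy nodename from the right, multiply the values.
    result = []
    for labels, value in combined:
        match = info_by_instance.get(labels.get("instance"))
        if match is None:
            continue  # no node_uname_info for this instance: series is dropped
        nodename, info_value = match
        result.append(({**labels, "nodename": nodename}, value * info_value))
    return result
```

One consequence worth noting: an instance that has no node_uname_info series produces no alert from this rule, even if its alarm flag is set.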
Impact
- The node may experience reduced performance or shut down unexpectedly, leading to service disruptions and potential data loss.
- Prolonged high temperatures can cause permanent damage to the node’s hardware, requiring costly repairs or replacement.
- If the node is part of a critical service, the downtime can have significant business and financial impacts.
Diagnosis
To diagnose the issue, follow these steps:
- Check the node’s temperature readings in Prometheus to determine the current temperature and trend.
- Verify that the temperature sensor is functioning correctly and not reporting false values.
- Investigate recent changes to the node’s environment, such as altered airflow, cooling system changes, or new heat sources nearby.
- Check the node’s system logs for any error messages related to temperature or cooling systems.
- Perform a visual inspection of the node to ensure that it is properly ventilated and that all fans are functioning correctly.
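The first diagnosis step can be scripted against the Prometheus HTTP API. The sketch below is one possible approach using only the standard library; the base URL and instance value in the usage comment are placeholders, and it assumes node_exporter exposes sensor readings as node_hwmon_temp_celsius, which should be confirmed in your own setup.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_vector(payload: dict) -> list:
    """Turn an instant-query response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for item in payload["data"]["result"]:
        _timestamp, raw_value = item["value"]  # sample values arrive as strings
        rows.append((item["metric"], float(raw_value)))
    return rows


def current_temperatures(base_url: str, instance: str) -> list:
    """Fetch current sensor readings for one instance (metric name is an assumption)."""
    promql = f'node_hwmon_temp_celsius{{instance="{instance}"}}'
    url = instant_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_vector(json.load(response))


# Example usage (placeholder URL and instance):
# for labels, celsius in current_temperatures("http://prometheus.example:9090", "node1:9100"):
#     print(labels.get("chip"), labels.get("sensor"), celsius)
```

Comparing the readings across sensors on the same node is a quick way to spot a single sensor that reports implausible values.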
Mitigation
To mitigate the issue, follow these steps:
- Immediately shut down any non-essential workloads on the node to reduce heat generation.
- Verify that the node’s fans and cooling system are working, and adjust fan speed or cooling settings as needed.
- Implement temporary cooling measures, such as directing cool air towards the node or using temporary cooling devices.
- Schedule a maintenance window to inspect and clean the node’s air vents and fans.
- Consider relocating the node to a cooler location or providing additional cooling solutions, such as air conditioning or liquid cooling systems.
- Monitor the node’s temperature readings closely to confirm that the issue is resolved and to catch any recurrence early.
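For the final monitoring step, it can help to define "resolved" explicitly instead of judging by a single reading. The sketch below shows one possible check: the temperature must stay below a limit for a full hold period. The function name, the limit, and the hold period are illustrative choices, not values taken from the alert rule.

```python
from typing import List, Tuple


def has_recovered(samples: List[Tuple[float, float]],
                  limit_celsius: float,
                  hold_seconds: float) -> bool:
    """Return True if every reading in the trailing window is below the limit.

    samples are (unix_timestamp, temperature_celsius) pairs in any order.
    Returns False when the samples do not cover the whole hold period.
    """
    if not samples:
        return False
    ordered = sorted(samples)
    window_start = ordered[-1][0] - hold_seconds
    if ordered[0][0] > window_start:
        return False  # not enough history to cover the hold period
    return all(temp < limit_celsius
               for timestamp, temp in ordered
               if timestamp >= window_start)
```

The samples could come from a Prometheus range query over the node's temperature metric; values in range-query responses arrive as strings and need converting with float() first.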