HostPhysicalComponentTooHot #
Physical hardware component too hot
Alert Rule
alert: HostPhysicalComponentTooHot
annotations:
  description: |-
    Physical hardware component too hot
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/hostphysicalcomponenttoohot/
  summary: Host physical component too hot (instance {{ $labels.instance }})
expr: ((node_hwmon_temp_celsius * ignoring(label) group_left(instance, job, node,
  sensor) node_hwmon_sensor_label{label!="tctl"} > 75)) * on(instance) group_left
  (nodename) node_uname_info{nodename=~".+"}
for: 5m
labels:
  severity: warning
Here is a runbook for the HostPhysicalComponentTooHot alert rule:
Meaning #
The HostPhysicalComponentTooHot alert fires when a hardware temperature sensor on a host (for example on a CPU, GPU, or drive) reads above 75°C continuously for 5 minutes. Sensors whose label is tctl are excluded. Prometheus evaluates the rule on hwmon sensor data exported by Node Exporter.
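Read step by step, the rule's expression can be sketched as below. This is the same query as in the alert rule, only reflowed and commented. The comments describe standard PromQL vector-matching behavior, and the remark about what tctl usually represents is an assumption rather than something the rule states.

```promql
(
  # Temperature readings, joined to the sensor-label series for the same
  # sensor. node_hwmon_sensor_label has the value 1, so the product is
  # still the temperature in degrees Celsius. Sensors labelled "tctl"
  # (assumed here to be a CPU control temperature, not a true die reading)
  # are filtered out, and sensors with no label series at all are dropped
  # by the join.
  node_hwmon_temp_celsius
    * ignoring(label) group_left(instance, job, node, sensor)
  node_hwmon_sensor_label{label!="tctl"}
  # Keep only sensors currently above the threshold.
  > 75
)
# Attach the host's nodename to each remaining series.
* on(instance) group_left(nodename)
node_uname_info{nodename=~".+"}
```

One consequence of the join is that a sensor without a matching node_hwmon_sensor_label series never triggers this alert, however hot it gets.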
Impact #
An overheating component may throttle itself to stay within its thermal limits, and sustained heat raises the risk of hardware failure. If left unchecked, this can result in:
- Slow system response times
- Increased risk of hardware failure, leading to downtime and data loss
- Increased power consumption, leading to increased costs and environmental impact
Diagnosis #
To diagnose the issue, follow these steps:
- Identify the specific host and component that is overheating using the instance, nodename, chip, and sensor labels in the alert.
- Review the node_hwmon_temp_celsius metric to determine the current temperature of the component.
- Check the system logs for any error messages related to the overheating component.
- Verify that the component is properly seated and that there are no blockages to airflow around the component.
- Check the system’s cooling system to ensure it is functioning properly.
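The metric checks in the steps above can be sketched as a few PromQL queries, each run separately in the Prometheus expression browser. The instance value host01:9100 is a placeholder, not taken from the alert. The second query assumes the sensor driver exposes a critical threshold as node_hwmon_temp_crit_celsius, which not every sensor does.

```promql
# 1. Current temperature of every labelled sensor on the affected host,
#    with the human-readable sensor label attached.
node_hwmon_temp_celsius{instance="host01:9100"}
  * ignoring(label) group_left(label)
node_hwmon_sensor_label{instance="host01:9100"}

# 2. Temperature as a fraction of the hardware-reported critical
#    threshold (returns nothing for sensors that expose no threshold).
node_hwmon_temp_celsius{instance="host01:9100"}
  / node_hwmon_temp_crit_celsius{instance="host01:9100"}

# 3. Peak temperature per sensor over the last hour, to tell a short
#    spike from a sustained problem.
max_over_time(node_hwmon_temp_celsius{instance="host01:9100"}[1h])
```

Comparing the reading with the sensor's own critical threshold helps judge urgency, since the fixed 75°C limit in the rule is close to normal operating temperature for some components and already dangerous for others.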
Mitigation #
To mitigate the issue, follow these steps:
- Investigate the cause of the overheating immediately and, in the meantime, reduce load on the affected host (for example by moving or pausing heavy workloads) to keep the temperature from rising further.
- Implement temporary cooling measures, such as fans or air conditioning, to reduce the temperature of the component.
- Schedule maintenance to clean dust from the component and ensure proper airflow.
- Consider replacing the overheating component if it is faulty or worn out.
- Verify that the system’s cooling system is functioning properly and make any necessary adjustments.
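To confirm that a mitigation step is working, the temperature trend can be checked with queries along these lines. The instance and sensor values are placeholders. The second query is a simplified form of the alert condition that leaves out the tctl exclusion and the nodename join.

```promql
# Change in temperature over the last 15 minutes for one sensor.
# A negative value means the component is cooling down.
delta(node_hwmon_temp_celsius{instance="host01:9100", sensor="temp1"}[15m])

# Sensors on the host still above the alert threshold.
# An empty result means every sensor is back below 75°C.
node_hwmon_temp_celsius{instance="host01:9100"} > 75
```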
Remember to always follow proper safety procedures when working with electrical components, and consult with a qualified technician if you are unsure about any of the mitigation steps.