HostPhysicalComponentTooHot

Physical hardware component too hot

Alert Rule
alert: HostPhysicalComponentTooHot
annotations:
  description: |-
    Physical hardware component too hot
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/hostphysicalcomponenttoohot/
  summary: Host physical component too hot (instance {{ $labels.instance }})
expr: ((node_hwmon_temp_celsius * ignoring(label) group_left(instance, job, node,
  sensor) node_hwmon_sensor_label{label!="tctl"} > 75)) * on(instance) group_left
  (nodename) node_uname_info{nodename=~".+"}
for: 5m
labels:
  severity: warning

Meaning

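The HostPhysicalComponentTooHot alert fires when a hardware temperature sensor on a host (for example on a CPU, GPU, or hard drive) reports more than 75°C continuously for 5 minutes. Prometheus evaluates the rule against the node_hwmon_temp_celsius metric, which Node Exporter collects from the host's hardware sensors. The expression joins each reading with node_hwmon_sensor_label, so only sensors that expose a label are evaluated, and sensors labeled tctl are excluded. A second join with node_uname_info adds the host's nodename to the alert labels.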
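As a rough illustration of what the rule expression selects, the sketch below mimics its joins in plain Python. The function name, the dictionary shapes, and the sample values are made up for this example; Prometheus itself evaluates the PromQL, and the for: 5m hold period is not modeled here.

```python
THRESHOLD_CELSIUS = 75.0    # from the rule: "> 75"
EXCLUDED_LABELS = {"tctl"}  # from the rule: label!="tctl"


def hot_components(temps, sensor_labels, nodenames, threshold=THRESHOLD_CELSIUS):
    """Return the readings the alert expression would select (hypothetical helper).

    temps:         {(instance, chip, sensor): celsius}  ~ node_hwmon_temp_celsius
    sensor_labels: {(instance, chip, sensor): label}    ~ node_hwmon_sensor_label
    nodenames:     {instance: nodename}                 ~ node_uname_info
    """
    selected = []
    for key in sorted(temps):
        instance, chip, sensor = key
        label = sensor_labels.get(key)
        # First join: a reading with no label series, or with an excluded
        # label, has no match and is dropped.
        if label is None or label in EXCLUDED_LABELS:
            continue
        # Second join: the instance needs a non-empty nodename.
        nodename = nodenames.get(instance)
        if not nodename:
            continue
        if temps[key] > threshold:
            selected.append({
                "instance": instance,
                "nodename": nodename,
                "chip": chip,
                "sensor": sensor,
                "label": label,
                "celsius": temps[key],
            })
    return selected


# Illustrative sample data (not real measurements).
sample_temps = {
    ("web-1:9100", "chip_a", "temp1"): 82.0,  # labeled, hot
    ("web-1:9100", "chip_a", "temp2"): 90.0,  # labeled tctl, excluded
    ("web-1:9100", "chip_b", "temp1"): 95.0,  # no label series, dropped
    ("web-1:9100", "chip_c", "temp1"): 60.0,  # labeled, below threshold
}
sample_labels = {
    ("web-1:9100", "chip_a", "temp1"): "core_0",
    ("web-1:9100", "chip_a", "temp2"): "tctl",
    ("web-1:9100", "chip_c", "temp1"): "composite",
}
sample_nodenames = {"web-1:9100": "web-1"}
sample_result = hot_components(sample_temps, sample_labels, sample_nodenames)
```

With this sample data only the core_0 reading is selected: the tctl sensor is excluded, the unlabeled sensor finds no match in the join, and the 60°C reading is below the threshold.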

Impact

An overheating component can throttle its own performance, draw more power, and eventually fail. If left unchecked, this can result in:

  • Slower system response as the component throttles to shed heat
  • Higher risk of hardware failure, leading to downtime and data loss
  • Higher power consumption, which raises costs and environmental impact

Diagnosis

To diagnose the issue, follow these steps:

  1. Identify the affected host from the instance (or nodename) label in the alert, and the overheating component from the chip and sensor labels.
  2. Review the node_hwmon_temp_celsius metric for that host to see the component's current temperature and how it has changed over time.
  3. Check the system logs for error messages related to the overheating component.
  4. Verify that the component is properly seated and that nothing blocks the airflow around it.
  5. Check that the fans and the rest of the cooling system are working properly.
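Steps 1 and 2 can be scripted against the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus is reachable at a base URL you supply; the function names and the example host names are hypothetical, and the instance value is assumed to contain no double quotes.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(prometheus_url, instance):
    """Build an instant-query URL for all hwmon temperatures of one instance."""
    promql = f'node_hwmon_temp_celsius{{instance="{instance}"}}'
    query_string = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_temperatures(payload):
    """Turn an instant-vector response into (chip, sensor, celsius), hottest first."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for series in payload["data"]["result"]:
        metric = series["metric"]
        _timestamp, value = series["value"]  # the sample value arrives as a string
        rows.append((metric.get("chip", ""), metric.get("sensor", ""), float(value)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def fetch_temperatures(prometheus_url, instance, timeout=10):
    """Query Prometheus over HTTP and return the parsed readings."""
    url = build_query_url(prometheus_url, instance)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_temperatures(json.load(response))
```

For example, fetch_temperatures("http://prometheus.example:9090", "web-1:9100") would list that host's sensors with the hottest first, so the component named in the alert can be compared with its neighbors.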

Mitigation

To mitigate the issue, follow these steps:

  1. Investigate the cause of the overheating right away and correct it before the component is damaged.
  2. Apply temporary cooling measures, such as additional fans or air conditioning, to bring the component's temperature down.
  3. Schedule maintenance to clean dust from the component and restore proper airflow.
  4. Consider replacing the overheating component if it is faulty or worn out.
  5. After making changes, verify that the cooling system is working properly, adjust it if necessary, and confirm that the temperature stays below 75°C.
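To check temperatures and fan speeds directly on a Linux host after a change, the kernel's hwmon sysfs files can be read without going through Prometheus. This is a minimal sketch, assuming the standard /sys/class/hwmon layout (tempN_input in millidegrees Celsius, fanN_input in RPM); the function name and the output shape are hypothetical.

```python
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Return temperature and fan readings found under the hwmon sysfs tree."""
    readings = []
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.is_file() else chip_dir.name
        for input_file in sorted(chip_dir.glob("*_input")):
            sensor = input_file.name[: -len("_input")]
            if not sensor.startswith(("temp", "fan")):
                continue  # skip voltages, currents, and other sensor types
            try:
                raw = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor is unreadable or returned a non-numeric value
            label_file = chip_dir / f"{sensor}_label"
            label = label_file.read_text().strip() if label_file.is_file() else ""
            if sensor.startswith("temp"):
                value, unit = raw / 1000.0, "celsius"  # millidegrees to degrees
            else:
                value, unit = float(raw), "rpm"
            readings.append({"chip": chip, "sensor": sensor, "label": label,
                             "value": value, "unit": unit})
    return readings
```

Calling read_hwmon() on the affected host and comparing the celsius values against 75 gives a quick local check; a fan that reports 0 rpm, or is missing from the output, is worth inspecting physically.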

Always follow proper safety procedures when working with electrical components, and consult a qualified technician if you are unsure about any mitigation step.