SmartDeviceTemperatureCritical #
Device temperature critical (instance {{ $labels.instance }})
Alert Rule
alert: SmartDeviceTemperatureCritical
annotations:
  description: |-
    Device temperature critical (instance {{ $labels.instance }})
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/smartctl-exporter/smartdevicetemperaturecritical/
  summary: Smart device temperature critical (instance {{ $labels.instance }})
expr: smartctl_device_temperature > 80
for: 2m
labels:
  severity: critical
Meaning #
The SmartDeviceTemperatureCritical alert fires when the temperature of a storage device, as reported by the smartctl exporter from the device's S.M.A.R.T. data, stays above 80 degrees Celsius for 2 minutes. Sustained operation above this threshold can lead to device failure, data loss, or physical damage.
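Because the rule carries `for: 2m`, a single reading above 80 does not fire the alert; the expression has to hold across consecutive rule evaluations for at least two minutes. The following is a minimal Python sketch of that behaviour, assuming one sample per rule evaluation. The function name and the sample data are illustrative and not part of Prometheus.

```python
def first_firing_time(samples, threshold=80.0, hold_seconds=120):
    """Return the timestamp at which the alert would start firing, or None.

    samples: list of (unix_seconds, celsius) pairs, one per rule evaluation.
    The alert is pending while the value stays above the threshold and fires
    once it has been pending for at least hold_seconds (the rule's `for`).
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # expression just became true
            if ts - pending_since >= hold_seconds:
                return ts
        else:
            pending_since = None  # any dip below the threshold resets the timer
    return None


# Sustained overheating: pending from t=60, firing at t=180.
print(first_firing_time([(0, 79), (60, 81), (120, 82), (180, 83)]))  # 180

# A brief dip at t=60 resets the pending state, so nothing fires here.
print(first_firing_time([(0, 81), (60, 79), (120, 81), (180, 81)]))  # None
```

A short spike that clears before the two minutes elapse therefore never reaches the critical state, which is worth remembering when the alert appears to miss brief excursions.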
Impact #
The impact of this alert includes:
- Potential device failure or shutdown, leading to data unavailability and system downtime
- Increased risk of data loss or corruption due to overheating
- Possible physical damage to the device or surrounding components
- Decreased reliability and lifespan of the device
Diagnosis #
To diagnose the issue, follow these steps:
- Check the device’s temperature reading using the smartctl exporter
- Verify that the temperature reading is accurate and not a sensor error
- Investigate possible causes of the high temperature, such as:
  - High ambient temperature
  - Poor cooling or ventilation
  - Overloaded or malfunctioning device
  - Faulty temperature sensor
- Review system logs for any related errors or warnings
- Consult device documentation and manufacturer guidelines for recommended operating temperatures
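One way to cross-check the exported value is to read the temperature straight from the drive with `smartctl` and compare it with what Prometheus shows. The sketch below assumes smartmontools 7.0 or later (for `--json`) and that the report contains a top-level `temperature` object, which is the usual layout but may vary by drive and firmware. The helper names are hypothetical.

```python
import json
import subprocess


def parse_temperatures(smartctl_json: str) -> dict:
    """Return the numeric entries of the top-level "temperature" object.

    Assumes the JSON layout produced by `smartctl --json`; returns an empty
    dict when the drive reports no temperature at all.
    """
    report = json.loads(smartctl_json)
    temps = report.get("temperature", {})
    return {key: val for key, val in temps.items() if isinstance(val, (int, float))}


def read_temperatures(device: str) -> dict:
    """Run smartctl against a device (usually needs root) and parse the result."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not
    # necessarily a failure; the output is parsed regardless.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_temperatures(result.stdout)


# Illustrative payload, not output captured from a real drive.
sample = '{"temperature": {"current": 83}}'
print(parse_temperatures(sample))  # {'current': 83}
```

On a host with the drive attached, `read_temperatures("/dev/sda")` (device path is an example) would return the live values. If the exporter version in use attaches a `temperature_type` label, it is also worth confirming that the series that triggered the alert is the current temperature rather than a limit or trip value, since the rule expression does not filter on that label.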
Mitigation #
To mitigate the issue, take the following steps:
- Immediately shut down the device to prevent further damage
- Verify that the device is properly cooled and ventilated
- Check for any blockages or obstructions in the device’s airflow
- Consider relocating the device to a cooler environment
- Monitor the device’s temperature closely and take corrective action if it continues to rise
- Consider replacing the device if it is faulty or malfunctioning
- Review and update device configuration and settings to prevent similar issues in the future
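To judge whether the temperature is still climbing after corrective action, a trend over the last few readings is more informative than a single value. The following is a minimal sketch that fits a least-squares line through recent samples. The function names and the 0.5 °C per minute cutoff are assumptions chosen for illustration, not recommended values. Prometheus provides the `deriv()` function for the same kind of estimate directly in a query.

```python
def slope_per_minute(samples):
    """Least-squares slope of (unix_seconds, celsius) samples in °C per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        return 0.0  # all samples share one timestamp; no trend can be computed
    return num / den * 60.0


def still_rising(samples, min_slope=0.5):
    """True if the fitted trend is at least min_slope °C per minute (assumed cutoff)."""
    return slope_per_minute(samples) >= min_slope


# Illustrative readings taken one minute apart.
heating = [(0, 80.0), (60, 81.0), (120, 82.0)]
cooling = [(0, 82.0), (60, 81.0), (120, 80.0)]
print(still_rising(heating), still_rising(cooling))  # True False
```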