SmartDeviceTemperatureWarning #
Device temperature warning (instance {{ $labels.instance }})
Alert Rule
alert: SmartDeviceTemperatureWarning
annotations:
  description: |-
    Device temperature warning (instance {{ $labels.instance }})
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/smartctl-exporter/smartdevicetemperaturewarning/
  summary: Smart device temperature warning (instance {{ $labels.instance }})
expr: smartctl_device_temperature > 60
for: 2m
labels:
  severity: warning
Here is a sample runbook for the Prometheus alert rule SmartDeviceTemperatureWarning:
Meaning #
The SmartDeviceTemperatureWarning alert fires when the smartctl_device_temperature metric for a SMART-monitored storage device stays above 60 degrees Celsius for 2 minutes. It warns of potential overheating that could lead to device failure or data loss.
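To make the interaction between the threshold and the `for: 2m` clause concrete, here is a minimal Python sketch of the evaluation logic. It is a simplified model rather than Prometheus's actual implementation, and the function name `first_firing_time` and the sample layout are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

THRESHOLD_C = 60.0    # from the rule: smartctl_device_temperature > 60
HOLD_SECONDS = 120.0  # from the rule: for: 2m


def first_firing_time(samples: Iterable[Tuple[float, float]]) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    `samples` holds (evaluation_timestamp_seconds, temperature_celsius) pairs
    in ascending time order, one pair per rule evaluation (hypothetical layout).
    """
    pending_since: Optional[float] = None
    for timestamp, temperature in samples:
        if temperature > THRESHOLD_C:
            # The condition holds: start the pending window if it is not open yet.
            if pending_since is None:
                pending_since = timestamp
            # Fire once the condition has held for the whole hold duration.
            if timestamp - pending_since >= HOLD_SECONDS:
                return timestamp
        else:
            # Any evaluation at or below the threshold resets the pending window.
            pending_since = None
    return None
```

In this model a short spike above 60 degrees Celsius that drops back before two minutes have passed never fires, which is why a firing alert indicates sustained heat rather than a momentary reading.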
Impact #
If left unchecked, an overheating storage device can lead to:
- Device failure or shutdown, resulting in data loss or unavailability
- Permanent damage to the device, requiring costly repairs or replacement
- Compromised data integrity or security
- Disruption to critical business operations or services
Diagnosis #
To diagnose the issue, follow these steps:
- Check the device’s temperature reading in Prometheus using the smartctl_device_temperature metric.
- Verify that the device is not in a high-temperature environment or experiencing unusual workload patterns.
- Review device logs for any error messages or warnings related to temperature or overheating.
- Check the device’s cooling system or fan functionality to ensure proper operation.
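The first diagnosis step can be sketched in Python against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The server address `http://localhost:9090`, the helper names `build_query_url` and `hot_series`, and the `device` label used in the usage note are assumptions; adjust them to match your environment and the labels your exporter actually emits.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumption: replace with your server


def build_query_url(instance: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for one instance's temperature series."""
    promql = f'smartctl_device_temperature{{instance="{instance}"}}'
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def hot_series(response_body: str, threshold: float = 60.0) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, temperature) pairs above the threshold, hottest first."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    hot = []
    for series in payload["data"]["result"]:
        # An instant vector sample is [unix_timestamp, "value-as-string"].
        _, raw_value = series["value"]
        temperature = float(raw_value)
        if temperature > threshold:
            hot.append((series["metric"], temperature))
    return sorted(hot, key=lambda item: item[1], reverse=True)
```

One way to use it is to fetch the URL with `urllib.request.urlopen(build_query_url("host1:9633")).read().decode()` and pass the body to `hot_series`; each returned label set then identifies a series to inspect further, for example by its `device` label if your exporter provides one.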
Mitigation #
To mitigate the issue, follow these steps:
- Immediately shut down the device to prevent further overheating and potential damage.
- Investigate and address any underlying causes of the overheating issue, such as:
  - High ambient temperature
  - Inadequate cooling or airflow
  - Overworked or malfunctioning device components
- Consider relocating the device to a cooler environment or providing additional cooling mechanisms.
- Monitor the device’s temperature closely and implement preventative measures to avoid future overheating issues.
Note: This runbook is a sample and may require modification to fit your specific use case or environment.