SmartNvmeWearoutIndicator #
NVMe device is wearing out (instance {{ $labels.instance }})
Alert Rule
alert: SmartNvmeWearoutIndicator
annotations:
  description: |-
    NVMe device is wearing out (instance {{ $labels.instance }})
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/smartctl-exporter/smartnvmewearoutindicator/
  summary: Smart NVME Wearout Indicator (instance {{ $labels.instance }})
expr: smartctl_device_available_spare{device=~"nvme.*"} < smartctl_device_available_spare_threshold{device=~"nvme.*"}
for: 15m
labels:
  severity: critical
Meaning #
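The SmartNvmeWearoutIndicator alert fires when the available spare capacity reported by an NVMe device (`smartctl_device_available_spare`) stays below the device's own threshold (`smartctl_device_available_spare_threshold`) for 15 minutes. This indicates that the device is running out of spare blocks and should be replaced to prevent data loss or corruption.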
Impact #
- Data loss or corruption due to worn-out NVMe device
- Potential downtime or system unavailability
- Reduced performance and increased latency
Diagnosis #
- Check the affected instance and NVMe device using the `instance` and `device` labels.
- Verify the current available spare blocks value using the `smartctl_device_available_spare` metric.
- Check the device’s wear level and health status using other smartctl metrics, such as `smartctl_device_wear_level` and `smartctl_device_health_status`.
- Consult the device’s documentation and manufacturer’s guidelines for replacement or maintenance.
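To make the second check easier to read at a glance, the gap between the spare value and the threshold can be precomputed. The following is a minimal sketch of a recording rule, not part of the original runbook: the group name and the record name `nvme:available_spare_margin` are hypothetical, and it assumes both metrics carry identical label sets so that they match one-to-one.

```yaml
# Hypothetical rule file; group and record names are illustrative only.
groups:
  - name: smartctl-nvme-diagnostics
    rules:
      # Headroom between the reported spare and the device's threshold.
      # A negative value corresponds to a device matching the alert expression.
      - record: nvme:available_spare_margin
        expr: >-
          smartctl_device_available_spare{device=~"nvme.*"} -
          smartctl_device_available_spare_threshold{device=~"nvme.*"}
```

With such a rule loaded, a query like `sort(nvme:available_spare_margin)` would list devices from least to most headroom.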
Mitigation #
- Perform a thorough backup of critical data first to ensure business continuity.
- Replace the worn-out NVMe device as soon as possible to prevent data loss or corruption.
- Consider migrating to a more reliable or redundant storage solution.
- Monitor the device’s health status and wear level regularly to prevent future wear-out issues.
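- The `smartctl_device_available_spare_threshold` value is reported by the device itself, so it is not a setting you can change in Prometheus. If the alert needs to fire earlier or later, adjust the alert expression instead, based on the device’s manufacturer’s guidelines and your organization’s requirements.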
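If earlier notice is wanted, one option is a second, lower-severity rule that compares against the threshold plus a margin. This is a sketch under stated assumptions: the group name, the alert name, and the 10-point margin are all hypothetical and not taken from the original rule set or from any vendor guidance, and it assumes the spare and threshold metrics are on the same scale.

```yaml
groups:
  - name: smartctl-nvme-early-warning              # hypothetical group name
    rules:
      - alert: SmartNvmeSpareApproachingThreshold   # hypothetical alert name
        # Matches once the spare drops below (threshold + 10).
        # The 10-point margin is an illustrative assumption.
        expr: >-
          smartctl_device_available_spare{device=~"nvme.*"} <
          smartctl_device_available_spare_threshold{device=~"nvme.*"} + 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: NVMe available spare is approaching its threshold (instance {{ $labels.instance }})
```

Because `+` binds tighter than `<` in PromQL, the margin is added to the threshold before the comparison. This rule also matches devices that already trigger the critical alert, so routing or inhibition may need to account for the overlap.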
Note: For more detailed steps and specific instructions, refer to the linked runbook: https://github.com/srerun/prometheus-alerts/blob/main/content/runbooks/smartctl-exporter/SmartNvmeWearoutIndicator.md