SmartNvmeWearoutIndicator

NVMe device is wearing out (instance {{ $labels.instance }})

Alert Rule

alert: SmartNvmeWearoutIndicator
annotations:
  description: |-
    NVMe device is wearing out (instance {{ $labels.instance }})
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/smartctl-exporter/smartnvmewearoutindicator/
  summary: Smart NVME Wearout Indicator (instance {{ $labels.instance }})
expr: smartctl_device_available_spare{device=~"nvme.*"} < smartctl_device_available_spare_threshold{device=~"nvme.*"}
for: 15m
labels:
  severity: critical

Meaning

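The SmartNvmeWearoutIndicator alert fires when an NVMe device's available spare capacity (smartctl_device_available_spare) stays below the threshold the drive reports for itself (smartctl_device_available_spare_threshold) for 15 minutes. Both values are percentages of the drive's reserve blocks. A drive in this state has used up most of its reserve and should be replaced before further wear causes data loss or corruption.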
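
To make the timing of the rule concrete, here is a simplified Python model of how the comparison and the 15-minute "for" clause interact. It is an illustrative sketch, not the Prometheus evaluation code. The function names and the (seconds, spare, threshold) sample format are assumptions introduced for this example.

```python
from datetime import timedelta

# Mirrors `for: 15m` in the alert rule.
FOR_DURATION = timedelta(minutes=15)


def condition_holds(spare: float, threshold: float) -> bool:
    """True when the drive reports less spare capacity than its own threshold."""
    return spare < threshold


def would_fire(samples, for_duration: timedelta = FOR_DURATION) -> bool:
    """Simplified model of the rule's `for` clause.

    `samples` is a chronologically ordered list of
    (timestamp_seconds, spare, threshold) tuples, one per rule evaluation
    (hypothetical format). Returns True if the condition has held at every
    evaluation for at least `for_duration` up to the last sample.
    """
    pending_since = None
    for ts, spare, threshold in samples:
        if condition_holds(spare, threshold):
            if pending_since is None:
                pending_since = ts  # condition just became true
        else:
            pending_since = None  # any recovery resets the timer
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= for_duration.total_seconds()
```

In this model, a brief recovery above the threshold resets the timer, so a device whose reading flaps around the threshold does not fire the alert until it stays below for the full 15 minutes.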

Impact

Data loss or corruption due to worn-out NVMe device
Potential downtime or system unavailability
Reduced performance and increased latency

Diagnosis

Check the affected instance and NVMe device using the instance and device labels.
Compare the current smartctl_device_available_spare value with smartctl_device_available_spare_threshold for the same device.
Check the device’s wear level and health status using other smartctl metrics, such as smartctl_device_percentage_used and smartctl_device_smart_status (available metric names depend on the smartctl_exporter version in use).
Consult the device’s documentation and manufacturer’s guidelines for replacement or maintenance.
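
The metric checks above can be sketched in Python against the Prometheus HTTP API instant-query endpoint (/api/v1/query). This is a minimal sketch under stated assumptions: the server address http://localhost:9090 is a placeholder, and the helper names (instant_query, by_device, spare_margins, report) are hypothetical and not part of any library.

```python
import json
import urllib.parse
import urllib.request

# Assumption: replace with the address of your Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"

SPARE_QUERY = 'smartctl_device_available_spare{device=~"nvme.*"}'
THRESHOLD_QUERY = 'smartctl_device_available_spare_threshold{device=~"nvme.*"}'


def instant_query(expr: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run an instant query and return the decoded JSON response."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def by_device(response: dict) -> dict:
    """Map (instance, device) -> value from an instant-vector response."""
    values = {}
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        key = (labels.get("instance", ""), labels.get("device", ""))
        values[key] = float(sample["value"][1])  # value is [timestamp, "string"]
    return values


def spare_margins(spare: dict, threshold: dict) -> dict:
    """Return spare minus threshold per device; negative means below threshold."""
    return {key: spare[key] - threshold[key] for key in spare if key in threshold}


def report(base_url: str = PROMETHEUS_URL) -> None:
    """Print every NVMe device with its margin, most worn first."""
    spare = by_device(instant_query(SPARE_QUERY, base_url))
    threshold = by_device(instant_query(THRESHOLD_QUERY, base_url))
    margins = spare_margins(spare, threshold)
    for (instance, device), margin in sorted(margins.items(), key=lambda kv: kv[1]):
        print(f"{instance} {device}: spare - threshold = {margin:+.0f}")
```

Calling report() lists each NVMe device with the difference between its spare value and its threshold. Devices with a negative margin are the ones that satisfy the alert expression.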

Mitigation

Back up critical data from the affected device first, while it is still readable.
Replace the worn-out NVMe device as soon as possible to prevent data loss or corruption.
Consider migrating to a more reliable or redundant storage solution.
Monitor the health status and wear level of the remaining and replacement devices regularly so that future wear-out is caught early.
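Note that smartctl_device_available_spare_threshold is reported by the drive itself and is normally not configurable. If you need earlier or later notification, adjust the alert expression instead, for example by comparing smartctl_device_available_spare against a fixed value based on the device’s manufacturer’s guidelines and your organization’s requirements.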
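
As part of regular monitoring, it can help to estimate how long a drive has before its spare capacity reaches the threshold. The Python sketch below projects that from two readings by straight-line extrapolation. Treat it as a rough planning aid only: real wear is not guaranteed to be linear, and the function name and its arguments are assumptions introduced for this example.

```python
from typing import Optional


def days_until_threshold(
    t0_days: float,
    spare0: float,
    t1_days: float,
    spare1: float,
    threshold: float,
) -> Optional[float]:
    """Estimate days from the second reading until spare reaches threshold.

    Takes two (time in days, available spare %) readings in chronological
    order. Returns 0.0 if the second reading is already at or below the
    threshold, and None if spare is not decreasing (no projection possible).
    """
    if t1_days <= t0_days:
        raise ValueError("readings must be in chronological order")
    if spare1 <= threshold:
        return 0.0
    rate = (spare1 - spare0) / (t1_days - t0_days)  # percentage points per day
    if rate >= 0:
        return None
    return (threshold - spare1) / rate
```

For example, readings of 100 and 95 taken 10 days apart give a rate of -0.5 percentage points per day, so days_until_threshold(0, 100, 10, 95, 10) returns 170.0.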

Note: For more detailed steps and specific instructions, refer to the linked runbook: https://github.com/srerun/prometheus-alerts/blob/main/content/runbooks/smartctl-exporter/SmartNvmeWearoutIndicator.md