HostRaidDiskFailure
At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap
Alert Rule
alert: HostRaidDiskFailure
annotations:
  description: |-
    At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/hostraiddiskfailure/
  summary: Host RAID disk failure (instance {{ $labels.instance }})
expr: (node_md_disks{state="failed"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
for: 2m
labels:
  severity: warning
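Meaning
The HostRaidDiskFailure alert fires when node_md_disks reports at least one failed device in a Linux software RAID (md) array on a host for 2 minutes. The rule carries severity warning, but it deserves prompt attention: a degraded array has reduced or no redundancy, so a further disk failure can cause data loss or unavailability.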
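The alert condition can also be checked on the host itself, which helps separate a real failure from a scrape or relabeling problem. The following is a minimal sketch, not part of the original rule: the function name `failed_md_metrics` is hypothetical, and the usage comment assumes node_exporter is reachable on its default port 9100.

```shell
#!/bin/sh
# Print node_md_disks samples whose failed-disk count is above zero,
# mirroring node_md_disks{state="failed"} > 0 from the alert expression.
# Reads Prometheus text-format metrics on stdin.
failed_md_metrics() {
    awk 'index($0, "node_md_disks{") == 1 &&
         index($0, "state=\"failed\"") > 0 &&
         $NF > 0 { print }'
}

# On the affected host (assumes the default node_exporter port):
#   curl -s http://localhost:9100/metrics | failed_md_metrics
```

An empty result while the alert is still firing suggests looking at the scrape target or label set rather than at the disks.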
Impact
The impact depends on the RAID level and on how many member disks have failed. It can include:
- Data loss or corruption once the array has no redundancy left
- Reduced I/O performance while the array runs degraded or rebuilds
- Increased risk of system crashes or failures
- Further disk failures in the same array, since the remaining disks carry extra load during a rebuild
- Downtime and loss of productivity for users or services that depend on the host
Diagnosis
To diagnose the issue, follow these steps:
- Check the Prometheus alert details for the specific host and RAID array affected.
- Log in to the host and check the system logs for errors related to the RAID array.
- Use the `mdadm` command to check the status of the RAID array and identify the failed disk.
- Verify that the failed disk is not causing any other system issues.
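The steps above can be sketched as a small helper. The kernel marks faulty members with `(F)` in `/proc/mdstat`, so a filter over that file is enough to list them. The function name `failed_md_members` and the array name `/dev/md0` in the comments are placeholders, not names taken from the alert.

```shell
#!/bin/sh
# List failed members, marked "(F)" in /proc/mdstat, as "<array> <member>".
# Reads mdstat-formatted text on stdin so it can be tried on a saved copy.
failed_md_members() {
    awk '$2 == ":" && $1 ~ /^md/ {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # drop the "[N](F)" suffix
                print $1, dev
            }
        }
    }'
}

# Typical use on the affected host (/dev/md0 is a placeholder):
#   failed_md_members < /proc/mdstat
#   mdadm --detail /dev/md0
```

`mdadm --detail` then shows the array state and the per-device status, which can be compared against kernel log messages for the same device.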
Mitigation
To mitigate the issue, follow these steps:
- Identify the failed disk and replace it as soon as possible.
- Use the `mdadm` command to remove the failed disk from the RAID array.
- Add a new disk to the RAID array and allow it to resync.
- Monitor the RAID array for any further issues or errors.
- Consider running a filesystem check on the affected host to ensure data integrity.
- Update the monitoring system to reflect the changes made to the RAID array.
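As a rough sketch of the removal and re-add steps, the helper below only prints the `mdadm` commands so they can be reviewed before anyone runs them. The function name `print_md_replace_plan` and the device names in the usage comment are hypothetical; substitute the array and members reported for the affected host.

```shell
#!/bin/sh
# Print (do not run) the mdadm steps for swapping one member of an array.
# Arguments: array, failed member, replacement member.
print_md_replace_plan() {
    array=$1
    failed=$2
    new=$3
    if [ -z "$array" ] || [ -z "$failed" ] || [ -z "$new" ]; then
        echo "usage: print_md_replace_plan ARRAY FAILED_DEV NEW_DEV" >&2
        return 1
    fi
    # Mark the member as faulty so that it can be removed.
    echo "mdadm --manage $array --fail $failed"
    # Detach the faulty member from the array.
    echo "mdadm --manage $array --remove $failed"
    # Add the replacement; the kernel then starts the resync.
    echo "mdadm --manage $array --add $new"
    # Follow resync progress.
    echo "cat /proc/mdstat"
}

# Example with placeholder names:
#   print_md_replace_plan /dev/md0 /dev/sdb1 /dev/sdc1
```

Printing rather than executing is a deliberate choice here: a wrong device name in `--fail` or `--remove` can degrade a healthy array. If the members are partitions, the replacement disk needs a matching partition before the `--add` step.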
Note: Always follow proper safety protocols when replacing hardware components, and ensure that you have the necessary expertise and resources to perform the replacement.