HostRaidArrayGotInactive


RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

Alert Rule
alert: HostRaidArrayGotInactive
annotations:
  description: |-
    RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/hostraidarraygotinactive/
  summary: Host RAID array got inactive (instance {{ $labels.instance }})
expr: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
for: 0m
labels:
  severity: critical
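
To check the rule's condition by hand, the same selector can be sent to the Prometheus HTTP API as an instant query. The following is a minimal sketch, not part of the rule itself: the base URL `http://localhost:9090` is an assumption, the helper names are hypothetical, and the `nodename` join from the rule is left out because it only adds a label.

```python
import json
import urllib.parse
import urllib.request

# Same condition as the alert expression, without the node_uname_info join.
QUERY = 'node_md_state{state="inactive"} > 0'


def build_query_url(base_url, query=QUERY):
    """Return the instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def inactive_arrays(response):
    """Extract sorted (instance, device) pairs from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("query failed: %s" % response.get("error", "unknown error"))
    pairs = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        pairs.append((labels.get("instance", "?"), labels.get("device", "?")))
    return sorted(pairs)


def fetch_inactive_arrays(base_url):
    """Run the query against a live Prometheus server (needs network access)."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=10) as resp:
        return inactive_arrays(json.load(resp))


if __name__ == "__main__":
    # Assumed server address; adjust it for your environment.
    print(build_query_url("http://localhost:9090"))
```

An empty result means no array is currently reported as inactive.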


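Meaning

The HostRaidArrayGotInactive alert fires when node_exporter reports a Linux software RAID (md) array in the inactive state, that is, when node_md_state{state="inactive"} is greater than 0 for the device named in the alert. Because the rule uses for: 0m, it fires on the first evaluation that sees the inactive state. An inactive array is not running, so its block device cannot serve reads or writes until the array is started again. This typically happens because one or more member disks have failed or are missing and no spare is available to take over.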

Impact

The impact is high. While the array is inactive, file systems and applications that depend on it cannot access their data, which can cause significant disruption on the host. If disks have actually failed, there is also a risk of data loss or corruption. The array stays unavailable until the failed disk(s) are replaced or recovered and the array is started again.

Diagnosis

To diagnose the issue, follow these steps:

  1. Use the instance and device labels on the alert to identify the affected host and RAID array.
  2. Log in to the affected host and check the array status, for example with cat /proc/mdstat and mdadm --detail on the array device, or with a similar tool.
  3. Identify the failed or missing disk(s) and determine the cause of the failure (for example, a hardware fault or a software issue).
  4. Check the system logs (for example, dmesg or journalctl -k) for error messages related to the RAID array or its member disks.
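
Steps 2 and 3 can be partly scripted by reading /proc/mdstat. The sketch below parses an illustrative sample (the device names and block counts are made up) and reports each array's state and which members carry the (F) failed or (S) spare flag. The `MdArray` class and the function names are hypothetical, and the parser only covers the common line layout, so treat it as a starting point rather than a complete mdstat parser.

```python
import re
from dataclasses import dataclass, field

# Illustrative /proc/mdstat content; on a real host, read the file instead.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      104320 blocks [2/1] [_U]

md1 : inactive sdc1[0](S) sdd1[1](S)
      1953260976 blocks super 1.2

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\S*)\s*:\s*(active|inactive)\s*(.*)$")
MEMBER_RE = re.compile(r"^(\S+?)\[(\d+)\]((?:\([A-Z]\))*)$")


@dataclass
class MdArray:
    name: str
    state: str
    members: dict = field(default_factory=dict)  # member name -> flags, e.g. "F"


def parse_mdstat(text):
    """Return one MdArray per 'mdX : ...' line found in the text."""
    arrays = []
    for line in text.splitlines():
        match = ARRAY_RE.match(line)
        if not match:
            continue
        array = MdArray(name=match.group(1), state=match.group(2))
        for token in match.group(3).split():
            member = MEMBER_RE.match(token)
            if member:  # tokens without "[n]" (such as "raid1") are skipped
                flags = member.group(3).replace("(", "").replace(")", "")
                array.members[member.group(1)] = flags
        arrays.append(array)
    return arrays


def failed_members(array):
    """Names of members flagged (F) in the array line."""
    return [name for name, flags in array.members.items() if "F" in flags]


if __name__ == "__main__":
    for arr in parse_mdstat(SAMPLE):
        print(arr.name, arr.state, "failed:", failed_members(arr))
```

On a host, the same function can be fed the real file contents, for example `parse_mdstat(open("/proc/mdstat").read())`.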

Mitigation

To mitigate the issue, follow these steps:

  1. Replace the failed disk(s) with new ones so that the RAID array has enough members to run with redundancy.
  2. Restart the array if it is still inactive, then add the new disk(s) so that it rebuilds, using the mdadm command or a similar tool.
  3. Monitor the rebuild (for example, in /proc/mdstat) until it completes, and confirm that the array is active and that reads and writes succeed.
  4. Consider adding monitoring and alerting for individual disk failures so that a failing disk is caught before the array goes inactive.
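
The steps above can be sketched as a dry-run helper that only prints the mdadm commands it would suggest and extracts rebuild progress from /proc/mdstat text. This is a sketch under assumptions, not a tested recovery procedure: the function names and device paths are hypothetical, the right sequence depends on the RAID level and on why the array went inactive, and every command should be reviewed before it is run by hand.

```python
import re
import shlex


def reassembly_plan(array, members):
    """Commands to stop an inactive array and try to assemble it again."""
    return [
        ["mdadm", "--stop", array],
        ["mdadm", "--assemble", array, *members],
    ]


def replacement_plan(array, failed, replacement):
    """Commands to swap one failed member of an array that is running."""
    return [
        ["mdadm", "--manage", array, "--fail", failed],      # mark the member faulty
        ["mdadm", "--manage", array, "--remove", failed],    # detach it from the array
        ["mdadm", "--manage", array, "--add", replacement],  # add the new disk
    ]


PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*(\d+(?:\.\d+)?)%")


def rebuild_progress(mdstat_text):
    """Return (operation, percent) from a /proc/mdstat progress line, or None."""
    match = PROGRESS_RE.search(mdstat_text)
    return (match.group(1), float(match.group(2))) if match else None


if __name__ == "__main__":
    # Hypothetical device names; nothing is executed, commands are only printed.
    for cmd in reassembly_plan("/dev/md1", ["/dev/sdc1", "/dev/sdd1"]):
        print(shlex.join(cmd))
    for cmd in replacement_plan("/dev/md1", "/dev/sdc1", "/dev/sde1"):
        print(shlex.join(cmd))
```

Keeping the plan as data rather than executing it makes it easy to paste into a change ticket or review with a colleague before touching the array.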

Also investigate the root cause of the disk failure so that it does not recur.