ZfsOfflinePool #

A ZFS zpool is in an unexpected state: {{ $labels.state }}.

Alert Rule

alert: ZfsOfflinePool
annotations:
  description: |-
    A ZFS zpool is in an unexpected state: {{ $labels.state }}.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/zfsofflinepool/
  summary: ZFS offline pool (instance {{ $labels.instance }})
expr: node_zfs_zpool_state{state!="online"} > 0
for: 1m
labels:
  severity: critical


Meaning #

The ZfsOfflinePool alert fires when node exporter has reported a ZFS zpool in any state other than online (such as degraded, faulted, or offline) for at least one minute. It is rated critical because a pool outside the online state can indicate a storage failure that leads to data loss or unavailability.
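
To make the rule's condition concrete, here is a minimal Python sketch that applies the same filter (state label not equal to "online", value greater than 0) to node exporter text output. The sample series, the pool names "tank" and "backup", and the zpool label name are assumptions for illustration; check them against what your exporter actually exposes. The sketch does not model the rule's for: 1m hold period.

```python
import re

# Hypothetical sample of node exporter output; pool names are made up.
SAMPLE_METRICS = """\
node_zfs_zpool_state{state="degraded",zpool="tank"} 1
node_zfs_zpool_state{state="online",zpool="tank"} 0
node_zfs_zpool_state{state="degraded",zpool="backup"} 0
node_zfs_zpool_state{state="online",zpool="backup"} 1
"""

# Matches one series line: metric name, label set in braces, numeric value.
SERIES_RE = re.compile(
    r'^node_zfs_zpool_state\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# Matches a single key="value" label pair.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def firing_pools(metrics_text):
    """Return (zpool, state) pairs where state != "online" and value > 0."""
    hits = []
    for line in metrics_text.splitlines():
        match = SERIES_RE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        value = float(match.group("value"))
        if labels.get("state") != "online" and value > 0:
            hits.append((labels.get("zpool"), labels.get("state")))
    return hits


print(firing_pools(SAMPLE_METRICS))  # [('tank', 'degraded')]
```

In this sample only "tank" matches, because it is the only pool whose non-online state series has the value 1.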

Impact #

A pool that is not online has a high impact, as it can cause:

Data loss or corruption
System downtime or unavailability
Performance degradation
Potential for cascading failures in dependent systems

Diagnosis #

To diagnose the issue, follow these steps:

Check the pool health with the zpool status command; zpool status -x limits the output to pools that have problems.
Verify the zpool configuration and ensure it’s correct.
Check the system logs for any errors or warnings related to ZFS or the zpool.
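If the pool is still imported and readable, run zpool scrub to check for data corruption or inconsistencies. A scrub is not an option for a pool that is faulted or unavailable.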
Review the node exporter metrics to identify any trends or patterns leading up to the alert.
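
The first diagnostic step can be scripted. The sketch below parses the tab-separated output of zpool list -H -o name,health and keeps only the pools that are not ONLINE. The function names and the sample listing are hypothetical; read_zpool_listing assumes the zpool utility is installed and is not called in the demonstration.

```python
import subprocess


def parse_zpool_health(listing):
    """Parse `zpool list -H -o name,health` output into {pool: health}."""
    pools = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        # -H prints one pool per line with fields separated by a single tab.
        name, health = line.split("\t", 1)
        pools[name] = health.strip()
    return pools


def unhealthy_pools(listing):
    """Return only the pools whose health is anything other than ONLINE."""
    return {
        name: health
        for name, health in parse_zpool_health(listing).items()
        if health != "ONLINE"
    }


def read_zpool_listing():
    """Run zpool on the local host; requires the ZFS utilities."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Hypothetical listing standing in for real command output.
SAMPLE_LISTING = "tank\tDEGRADED\nbackup\tONLINE\n"
print(unhealthy_pools(SAMPLE_LISTING))  # {'tank': 'DEGRADED'}
```

On a real host, unhealthy_pools(read_zpool_listing()) would give the pools to inspect further with zpool status.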

Mitigation #

To mitigate the issue, follow these steps:

Immediately investigate and resolve any underlying system issues causing the zpool to be offline.
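If a faulty disk is the cause, replace it with zpool replace. Resilvering starts automatically after the replacement, and zpool status shows its progress.
If a configuration issue is the cause, correct the configuration, then bring the affected device back online with zpool online (which acts on a device within a pool, not on the pool as a whole) and reset the recorded error counts with zpool clear.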
Monitor the zpool status and node exporter metrics closely to ensure the issue is resolved and the system is stable.
Consider implementing additional monitoring and alerting for ZFS zpool health to detect potential issues before they become critical.
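
The monitoring step after a fix can be sketched as a bounded polling loop. In the example below, wait_until_online and its parameters are hypothetical names; get_health is any callable that returns the pool's current health string, which in practice might wrap zpool list -H -o health for one pool. The demonstration uses canned states instead of a real pool.

```python
import time


def wait_until_online(get_health, attempts=10, delay_seconds=30.0,
                      sleep=time.sleep):
    """Poll get_health() until it returns "ONLINE" or attempts run out.

    Returns (recovered, seen), where seen lists every observed state.
    """
    seen = []
    for attempt in range(attempts):
        health = get_health()
        seen.append(health)
        if health == "ONLINE":
            return True, seen
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next check
    return False, seen


# Canned states standing in for a pool that recovers on the third check.
states = iter(["DEGRADED", "DEGRADED", "ONLINE"])
recovered, history = wait_until_online(
    lambda: next(states),
    sleep=lambda _seconds: None,  # skip real waiting in the demonstration
)
print(recovered, history)  # True ['DEGRADED', 'DEGRADED', 'ONLINE']
```

Bounding the number of attempts keeps the check from hanging forever when the pool never recovers; the caller can then escalate instead of waiting.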