ZfsOfflinePool
A ZFS zpool is in an unexpected state: {{ $labels.state }}.
Alert Rule
alert: ZfsOfflinePool
annotations:
  description: |-
    A ZFS zpool is in an unexpected state: {{ $labels.state }}.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/node-exporter/zfsofflinepool/
  summary: ZFS offline pool (instance {{ $labels.instance }})
expr: node_zfs_zpool_state{state!="online"} > 0
for: 1m
labels:
  severity: critical
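To make the expression concrete: this sketch assumes the node exporter's Linux ZFS collector convention of exposing `node_zfs_zpool_state` as one series per pool and per possible `state` label, set to `1` for the pool's current state and `0` for all others. Under that assumption, `state!="online"` combined with `> 0` keeps only pools whose current state is something other than online. The `Sample` type, `pool_samples` helper, and the sample pools below are hypothetical and only mimic the filter in plain Python; they are not how Prometheus evaluates PromQL.

```python
from dataclasses import dataclass

# State label values assumed to match those exposed by the node exporter's Linux ZFS collector.
ZPOOL_STATES = ("online", "degraded", "faulted", "offline", "removed", "unavail", "suspended")


@dataclass(frozen=True)
class Sample:
    zpool: str
    state: str
    value: float


def pool_samples(zpool: str, current_state: str) -> list[Sample]:
    """Hypothetical helper: one sample per state label, 1.0 for the current state and 0.0 otherwise."""
    return [Sample(zpool, state, 1.0 if state == current_state else 0.0) for state in ZPOOL_STATES]


def firing_series(samples: list[Sample]) -> list[Sample]:
    """Keep the samples that `node_zfs_zpool_state{state!="online"} > 0` would return."""
    return [s for s in samples if s.state != "online" and s.value > 0]


if __name__ == "__main__":
    samples = pool_samples("tank", "online") + pool_samples("backup", "degraded")
    for s in firing_series(samples):
        print(f"{s.zpool}: {s.state}")  # prints "backup: degraded"
```

The `for: 1m` clause adds one more condition that the sketch does not model: the expression must keep returning the series for one minute before the alert moves from pending to firing.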
Here is a sample runbook for the Prometheus alert rule “ZfsOfflinePool”:
Meaning
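The ZfsOfflinePool alert fires when a ZFS zpool has been in any state other than online (for example DEGRADED, FAULTED, OFFLINE, UNAVAIL, REMOVED, or SUSPENDED) for at least one minute. Despite its name, the alert is not limited to pools that are strictly OFFLINE. It is rated critical because it can indicate a storage system failure, which can lead to data loss or unavailability.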
Impact
The potential impact is high. A pool outside the online state can lead to:
- Data loss or corruption
- System downtime or unavailability
- Performance degradation
- Potential for cascading failures in dependent systems
Diagnosis
To diagnose the issue, follow these steps:
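- Check the ZFS zpool status using the `zpool status` command (`zpool status -x` shows only pools that have problems).
- Verify the zpool configuration (the vdev tree in the `zpool status` output) and note every device whose state is not ONLINE.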
- Check the system logs (for example `dmesg`, `journalctl`, or `zpool events -v`) for any errors or warnings related to ZFS or the zpool.
- Once the pool is accessible, run `zpool scrub <pool>` to check for data corruption or inconsistencies.
- Review the node exporter metrics to identify any trends or patterns leading up to the alert.
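Most of the checks above come down to reading `zpool status`. As a rough aid, the sketch below parses its human-readable output and lists every pool, vdev, or device whose STATE column is not ONLINE (idle hot spares, shown as AVAIL, are ignored). The parser is a simplified assumption about the usual output layout, not a stable interface, and the `SAMPLE` text is hand-written for illustration; `parse_zpool_status` and `read_zpool_status` are hypothetical helper names.

```python
import re
import shutil
import subprocess

# STATE values treated as healthy; idle hot spares report AVAIL.
HEALTHY_STATES = {"ONLINE", "AVAIL"}

# Illustrative output used when the zpool CLI is not installed (hand-written, not from a real host).
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        DEGRADED     0     0     0
\t  mirror-0  DEGRADED     0     0     0
\t    sda     ONLINE       0     0     0
\t    sdb     UNAVAIL      0     0     0  cannot open

errors: No known data errors
"""


def parse_zpool_status(text: str) -> dict[str, dict]:
    """Map each pool to its overall state and the config rows whose STATE is not healthy.

    Simplified parser for human-readable `zpool status` output; it assumes the
    usual layout and is not a stable interface.
    """
    pools: dict[str, dict] = {}
    current = None
    in_config = False
    for line in text.splitlines():
        stripped = line.strip()
        if m := re.match(r"pool:\s*(\S+)", stripped):
            current = m.group(1)
            pools[current] = {"state": None, "unhealthy": []}
            in_config = False
        elif current is None:
            continue
        elif m := re.match(r"state:\s*(\S+)", stripped):
            pools[current]["state"] = m.group(1)
        elif stripped.startswith("NAME") and "STATE" in stripped:
            in_config = True  # header of the vdev/device table
        elif in_config:
            if not stripped:
                in_config = False  # a blank line ends the table
                continue
            fields = stripped.split()
            # Section labels such as "logs" or "spares" have no STATE column.
            if len(fields) >= 2 and fields[1] not in HEALTHY_STATES:
                pools[current]["unhealthy"].append((fields[0], fields[1]))
    return pools


def read_zpool_status() -> str:
    """Return `zpool status` output from this host, or SAMPLE if zpool is unavailable."""
    if shutil.which("zpool") is None:
        return SAMPLE
    return subprocess.run(["zpool", "status"], capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    for pool, info in parse_zpool_status(read_zpool_status()).items():
        print(pool, info["state"], info["unhealthy"])
```

On an affected host this lists the pool, any intermediate vdevs such as `mirror-0`, and the failing device, which helps decide which mitigation step applies.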
Mitigation
To mitigate the issue, follow these steps:
- Immediately investigate and resolve the underlying issue that took the zpool out of the online state.
- If the pool left the online state because of a faulty disk, replace the disk with `zpool replace <pool> <old-device> <new-device>`. The replacement starts a resilver automatically; on OpenZFS 0.8 and later, `zpool resilver <pool>` restarts a resilver if needed.
- If a device went offline because of a configuration or connectivity issue, correct the cause and bring the device back online with `zpool online <pool> <device>`.
- Monitor the zpool status and node exporter metrics closely to ensure the issue is resolved and the system is stable.
- Consider implementing additional monitoring and alerting for ZFS zpool health to detect potential issues before they become critical.
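To confirm that the mitigation worked, the alert's own expression can be re-run through the Prometheus HTTP API (`GET /api/v1/query`); an empty result means no pool is currently reporting a non-online state. The sketch below uses only the Python standard library. The server address `http://localhost:9090` and the `zpool` label name on `node_zfs_zpool_state` are assumptions to adjust for your setup, and `parse_vector` and `unhealthy_pools` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Same expression as the ZfsOfflinePool rule.
EXPR = 'node_zfs_zpool_state{state!="online"} > 0'


def parse_vector(body: str) -> list[tuple[str, str, str]]:
    """Extract (instance, zpool, state) tuples from an instant-vector query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        (r["metric"].get("instance", ""), r["metric"].get("zpool", ""), r["metric"].get("state", ""))
        for r in payload["data"]["result"]
    ]


def unhealthy_pools(prometheus_url: str, timeout: float = 10.0) -> list[tuple[str, str, str]]:
    """Run EXPR against a Prometheus server; an empty list means no pool is outside the online state."""
    url = f"{prometheus_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": EXPR})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical server address; replace with your own Prometheus URL.
    for instance, zpool, state in unhealthy_pools("http://localhost:9090"):
        print(f"{instance} {zpool}: {state}")
```

An empty result only reflects the latest scrape, so re-running the check after a few scrape intervals guards against a pool that flaps between states.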