MinioNodeDiskOffline #
Minio cluster node disk is offline
Alert Rule
alert: MinioNodeDiskOffline
annotations:
  description: |-
    Minio cluster node disk is offline
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/minio-internal/minionodediskoffline/
  summary: Minio node disk offline (instance {{ $labels.instance }})
expr: minio_cluster_nodes_offline_total > 0
for: 0m
labels:
  severity: critical
Meaning #
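The MinioNodeDiskOffline alert fires when minio_cluster_nodes_offline_total is greater than 0, that is, when Minio reports at least one cluster node, and the disks it serves, as offline. Because the rule sets for: 0m, it fires on the first evaluation that sees a nonzero value. The alert is critical because an offline node or disk reduces redundancy, which can make data unavailable and, if further failures follow, cause data loss.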
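The condition in the rule's expr can be sketched outside Prometheus as a small check over scraped metrics text. This is only an illustration: the function names, the sample payload, and its `server` label are assumptions made for the example, and the parser handles just the simple `name{labels} value` line shape, not the full exposition format.

```python
METRIC = "minio_cluster_nodes_offline_total"

# Illustrative sample only; real scrape output will differ.
SAMPLE = """\
# TYPE minio_cluster_nodes_offline_total gauge
minio_cluster_nodes_offline_total{server="minio-1:9000"} 1
minio_cluster_nodes_online_total{server="minio-1:9000"} 3
"""


def offline_values(metrics_text, metric=METRIC):
    """Return the value of every series of `metric` found in the text."""
    values = []
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(metric):
            continue
        rest = line[len(metric):]
        if rest.startswith("{"):
            # Drop the label set (assumes no "}" inside label values).
            rest = rest.partition("}")[2]
        elif not rest.startswith(" "):
            # A different metric that merely shares the prefix.
            continue
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


def would_fire(metrics_text):
    """Mirror `expr: minio_cluster_nodes_offline_total > 0` for any series."""
    return any(value > 0 for value in offline_values(metrics_text))
```

Calling `would_fire(SAMPLE)` on the sample above reports that the condition holds, because one series carries a value greater than 0.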
Impact #
The impact is high. An offline disk can:
- Make data unavailable to users if the cluster loses read or write quorum
- Lead to data loss if more disks fail before the offline disk is restored
- Degrade the overall performance and reliability of the Minio cluster
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Minio cluster node status using the Minio dashboard or CLI
- Identify the specific disk that is offline using the LABELS output in the alert
- Check the disk’s health status using disk utility commands (e.g. smartctl, fsck)
- Check the system logs for any error messages related to the disk or Minio node
- Verify that the disk is properly connected and configured
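The disk health and log checks above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the device path passed in is whatever the affected disk is on your host, and the log keywords are a small illustrative set rather than a complete list. `smartctl -H` usually needs root, and its exit status is a bitmask, so the sketch reads the text output instead of relying on the return code.

```python
import re
import subprocess

# Illustrative keywords; extend for the filesystems and drivers you run.
ERROR_PATTERNS = ("i/o error", "medium error", "fs error")


def run_smartctl(device):
    """Run `smartctl -H <device>` and return its text output."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
    )
    return result.stdout


def smart_health(output):
    """Classify `smartctl -H` output as 'ok', 'failing', or 'unknown'."""
    # ATA/SATA devices report an overall self-assessment result.
    match = re.search(
        r"SMART overall-health self-assessment test result:\s*(\S+)", output
    )
    if match:
        return "ok" if match.group(1) == "PASSED" else "failing"
    # SCSI/SAS devices report a health status line instead.
    match = re.search(r"SMART Health Status:\s*(\S+)", output)
    if match:
        return "ok" if match.group(1) == "OK" else "failing"
    return "unknown"


def disk_error_lines(lines, device):
    """Return log lines that mention `device` and a known error keyword."""
    dev = device.lower()
    hits = []
    for line in lines:
        low = line.lower()
        if dev in low and any(pattern in low for pattern in ERROR_PATTERNS):
            hits.append(line)
    return hits
```

For example, `smart_health(run_smartctl("/dev/sdb"))` would classify one disk, and `disk_error_lines(open("/var/log/syslog"), "sdb")` would filter a log file, assuming that device name and log path exist on the node.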
Mitigation #
To mitigate the issue, follow these steps:
- Immediately investigate why the disk went offline and take corrective action (e.g. replace the disk, check cables)
- Bring the offline disk back online as soon as possible
- Verify that the Minio cluster node is healthy and data is available
- Consider adding redundancy to the Minio cluster so that a single offline disk has less impact
- Record the root cause and the fix in the runbook and documentation so that future incidents are resolved faster
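The verification step above can be sketched as a small polling loop against Minio's cluster health endpoint, `/minio/health/cluster`, which answers HTTP 200 when the cluster is healthy. The base URL, retry counts, and function names here are assumptions for illustration; adjust them to your deployment.

```python
import time
import urllib.error
import urllib.request


def cluster_healthy(base_url, timeout=5.0):
    """Return True if GET <base_url>/minio/health/cluster answers HTTP 200."""
    url = base_url.rstrip("/") + "/minio/health/cluster"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx answers, refused connections, and timeouts all count
        # as "not healthy yet".
        return False


def wait_until_healthy(check, attempts=10, delay=5.0, sleep=time.sleep):
    """Call `check` until it returns True.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. `sleep` is injectable so the loop can be tested
    without real waiting.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

A typical call, assuming a hypothetical node address, would be `wait_until_healthy(lambda: cluster_healthy("http://minio-1:9000"))`. A return value of None means the cluster did not report healthy within the allotted attempts and the disk needs further investigation.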
Note: The mitigation steps may vary depending on the specific environment and setup. This runbook is meant to provide general guidance and may need to be tailored to the specific use case.