CephOsdLowSpace #
A Ceph Object Storage Daemon (OSD) is running out of space. Please add more disks.
Alert Rule
alert: CephOsdLowSpace
annotations:
  description: |-
    Ceph Object Storage Daemon is going out of space. Please add more disks.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephosdlowspace/
  summary: Ceph OSD low space (instance {{ $labels.instance }})
expr: ceph_osd_utilization > 90
for: 2m
labels:
  severity: warning
Meaning #
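The CephOsdLowSpace alert fires when the utilization of a Ceph Object Storage Daemon (OSD) stays above 90% for at least 2 minutes. This means the OSD is running low on free space. With Ceph's default ratios (nearfull at 85%, backfillfull at 90%, full at 95%, any of which may be overridden in a given cluster), an OSD at this level has already passed the nearfull warning, and once it reaches the full ratio Ceph stops accepting client writes in order to protect existing data.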
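As a rough illustration of how the `expr` and `for` fields combine, the sketch below mimics the pending-then-firing behaviour for a series of utilization samples. The function name `first_firing_time` and the sample layout (timestamp in seconds, utilization in percent) are assumptions made for this example, not part of Prometheus itself.

```python
def first_firing_time(samples, threshold=90.0, hold_seconds=120):
    """Return the timestamp at which the alert would start firing, or None.

    samples: iterable of (timestamp_seconds, utilization_percent) pairs
    in chronological order, one per rule evaluation (hypothetical layout).
    """
    pending_since = None  # when the condition first became true
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            # The alert fires once the condition has held for the whole window.
            if ts - pending_since >= hold_seconds:
                return ts
        else:
            # Any evaluation below the threshold resets the pending state.
            pending_since = None
    return None
```

A short dip below 90% resets the timer, which is why a briefly fluctuating OSD may never reach the firing state even though it is close to the threshold.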
Impact #
If left unaddressed, a Ceph OSD with low available space can cause:
- Reduced write performance and increased latency
- Increased risk of data loss or corruption
- Potential for the OSD to become unavailable, leading to a reduction in overall cluster capacity
- Increased risk of cascading failures in the cluster
Diagnosis #
To diagnose the issue, follow these steps:
- Check the Ceph cluster’s overall health using the ceph -s command
- Identify the specific OSD that triggered the alert using the instance label in the alert notification
- Check the OSD’s utilization and available space using the ceph osd df command
- Verify that the OSD is not experiencing any other issues, such as high latency or errors
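The utilization check can be scripted. The sketch below assumes the JSON produced by `ceph osd df --format json` contains a `nodes` list whose entries carry `name`, `utilization` (percent), and `kb_avail` fields; verify these field names against your Ceph release before relying on them. The helper name `osds_over_threshold` is made up for this example.

```python
import json


def osds_over_threshold(osd_df_json, threshold=90.0):
    """List (name, utilization, kb_avail) for OSDs above the threshold.

    osd_df_json: JSON text, assumed to be the output of
    `ceph osd df --format json` (field names are an assumption).
    """
    report = json.loads(osd_df_json)
    hot = []
    for node in report.get("nodes", []):
        util = node.get("utilization")
        # Skip entries without a utilization figure rather than guessing.
        if util is not None and util > threshold:
            hot.append((node.get("name"), util, node.get("kb_avail")))
    # Fullest OSDs first, so the most urgent ones are at the top.
    return sorted(hot, key=lambda item: item[1], reverse=True)
```

One way to use it is to capture the command output to a file or variable and pass the text to `osds_over_threshold`, then compare the returned names with the OSD named in the alert.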
Mitigation #
To mitigate the issue, follow these steps:
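- Add capacity by deploying additional disks as new OSDs; an OSD is normally backed by a single device, so the cluster gains space through more OSDs rather than by enlarging an existing one
- Consider rebalancing the data in the cluster to reduce the load on the affected OSD, for example by lowering its reweight value with ceph osd reweight or by enabling the balancer module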
- Monitor the OSD’s utilization and available space to ensure the issue is resolved
- Consider adjusting the alert threshold or configuration to prevent false positives or unnecessary alerts
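When planning how much disk to add, a back-of-the-envelope calculation can help. The sketch below estimates the extra raw capacity needed to bring average utilization down to a target level. It assumes data rebalances evenly across all OSDs after the expansion, which real clusters only approximate, and the 75% default target is an arbitrary illustrative value, not a Ceph recommendation.

```python
def extra_capacity_needed(used, total, target_percent=75.0):
    """Extra raw capacity required so that used / new_total <= target.

    used and total may be in any unit (bytes, TiB, ...) as long as both
    use the same one; the result is returned in that unit.
    Assumes perfectly even data distribution after the expansion.
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    required_total = used / (target_percent / 100.0)
    # No extra capacity is needed if the cluster is already below target.
    return max(0.0, required_total - total)
```

For example, a pool of OSDs holding 9 TiB on 10 TiB of raw capacity would need about 2 TiB more to sit at 75% average utilization under these assumptions.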