CephOsdLowSpace #

Ceph Object Storage Daemon is running out of space. Please add more disks.

Alert Rule

alert: CephOsdLowSpace
annotations:
  description: |-
    Ceph Object Storage Daemon is going out of space. Please add more disks.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephosdlowspace/
  summary: Ceph OSD low space (instance {{ $labels.instance }})
expr: ceph_osd_utilization > 90
for: 2m
labels:
  severity: warning


Meaning #

The CephOsdLowSpace alert fires when the utilization of a Ceph Object Storage Daemon (OSD) stays above 90% for 2 minutes. The OSD is running low on free space, which degrades performance. If it keeps filling and reaches the cluster's full ratio (95% by default), Ceph stops accepting writes to protect existing data.
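To make the interaction between the threshold and the `for: 2m` clause concrete, here is a minimal Python sketch. It is a simplified model of how a pending alert turns into a firing one, not Prometheus's actual implementation. The function name `first_firing_time` and the one-sample-per-evaluation input format are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

THRESHOLD_PERCENT = 90.0  # from the rule: ceph_osd_utilization > 90
FOR_SECONDS = 120         # from the rule: for: 2m


def first_firing_time(samples: Iterable[Tuple[int, float]]) -> Optional[int]:
    """Return the timestamp at which the alert would start firing, or None.

    samples: (timestamp_seconds, utilization_percent) pairs, one per rule
    evaluation, in chronological order (hypothetical input format).
    """
    pending_since: Optional[int] = None
    for timestamp, utilization in samples:
        if utilization > THRESHOLD_PERCENT:
            # The expression is true: start (or continue) the pending period.
            if pending_since is None:
                pending_since = timestamp
            if timestamp - pending_since >= FOR_SECONDS:
                return timestamp
        else:
            # Any evaluation at or below the threshold resets the timer.
            pending_since = None
    return None
```

In this model, a brief dip to 90% or below restarts the 2-minute window, so short spikes in utilization do not fire the alert.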

Impact #

If left unaddressed, a Ceph OSD with low available space can cause:

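Reduced write performance and increased latency
Increased risk of data loss, because a nearly full OSD may refuse backfill and recovery, delaying the restoration of redundancy after a failure
The OSD becoming unavailable for writes once it reaches the full ratio, which reduces usable cluster capacity
Cascading failures, as data rebalanced away from a full or failed OSD fills the remaining OSDs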

Diagnosis #

To diagnose the issue, follow these steps:

Check the Ceph cluster’s overall health using the ceph -s command
Identify the specific OSD that triggered the alert using the instance label in the alert notification
Check the OSD’s utilization and available space (the %USE and AVAIL columns) using the ceph osd df command
Verify that the OSD is not experiencing any other issues, such as high latency or errors, for example with ceph osd perf and ceph health detail
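The utilization check above can be scripted. The following Python sketch filters the JSON output of ceph osd df for OSDs above the alert threshold. It assumes the output contains a "nodes" list whose entries carry "name", "utilization" (percent), and "kb_avail" (KiB) fields. Verify those field names against your Ceph release before relying on it. The helper names `osds_over_threshold` and `fetch_osd_df` are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict, List


def osds_over_threshold(osd_df: Dict[str, Any],
                        threshold_percent: float = 90.0) -> List[Dict[str, Any]]:
    """Return OSDs whose utilization exceeds the threshold, fullest first."""
    rows = []
    for node in osd_df.get("nodes", []):
        utilization = float(node.get("utilization", 0.0))
        if utilization > threshold_percent:
            rows.append({
                "name": node.get("name"),
                "utilization": utilization,
                # kb_avail is assumed to be in KiB; convert to GiB.
                "avail_gib": node.get("kb_avail", 0) / 1024 ** 2,
            })
    return sorted(rows, key=lambda row: row["utilization"], reverse=True)


def fetch_osd_df() -> Dict[str, Any]:
    """Run `ceph osd df --format json` and parse its output.

    Requires the ceph CLI and a keyring with read access on this host.
    """
    result = subprocess.run(
        ["ceph", "osd", "df", "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

On a host with cluster access, `osds_over_threshold(fetch_osd_df())` would list the OSDs that currently match the alert condition.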

Mitigation #

To mitigate the issue, follow these steps:

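Add more OSDs (disks) to the cluster so that data is redistributed and the affected OSD’s utilization drops
Consider rebalancing data across the cluster, for example with the balancer module (ceph balancer status) or ceph osd reweight-by-utilization, to move data off the affected OSD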
Monitor the OSD’s utilization and available space to ensure the issue is resolved
Consider adjusting the alert threshold or configuration to prevent false positives or unnecessary alerts
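When planning how much capacity to add, a rough estimate helps. The Python sketch below computes the extra raw capacity needed to bring cluster-wide utilization down to a target percentage. It assumes data ends up spread evenly across OSDs, which CRUSH does not guarantee, so treat the result as a lower bound. The function name `extra_capacity_needed` is hypothetical.

```python
def extra_capacity_needed(used_bytes: float, total_bytes: float,
                          target_percent: float) -> float:
    """Return the additional raw capacity (in bytes) needed so that
    used_bytes / (total_bytes + extra) does not exceed target_percent.

    Assumes perfectly even data distribution (a simplification).
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    required_total = used_bytes / (target_percent / 100.0)
    # No extra capacity is needed if the cluster is already below the target.
    return max(0.0, required_total - total_bytes)
```

For example, a cluster with 90 TiB used out of 100 TiB needs about 20 TiB more raw capacity to reach 75% utilization, since 90 / 0.75 = 120.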