CephHighOsdLatency #
Ceph Object Storage Daemon latency is high. Please check that it is not stuck in a bad state.
Alert Rule
alert: CephHighOsdLatency
annotations:
  description: |-
    Ceph Object Storage Daemon latency is high. Please check if it doesn't stuck in weird state.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephhighosdlatency/
  summary: Ceph high OSD latency (instance {{ $labels.instance }})
expr: ceph_osd_perf_apply_latency_seconds > 5
for: 1m
labels:
  severity: warning
Meaning #
The CephHighOsdLatency alert fires when the apply latency reported by any Ceph Object Storage Daemon (OSD) stays above 5 seconds for at least 1 minute. The expression is evaluated per OSD series, not as a cluster-wide average, so a single slow OSD is enough to trigger it. Sustained apply latency at this level means writes are slow to be applied on that OSD, which degrades write performance and can affect the overall health of the Ceph cluster.
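As a rough illustration of how the threshold and the for: 1m clause in the rule interact, the sketch below replays latency samples for a single OSD series and reports when the alert would move from pending to firing. The function name first_firing_time and the sample format are hypothetical, and it assumes one sample per rule evaluation. It is a simplified model, not Prometheus's actual implementation.

```python
from typing import Iterable, Optional, Tuple

THRESHOLD_SECONDS = 5.0  # from the rule: ceph_osd_perf_apply_latency_seconds > 5
FOR_SECONDS = 60.0       # from the rule: for: 1m


def first_firing_time(samples: Iterable[Tuple[float, float]],
                      threshold: float = THRESHOLD_SECONDS,
                      hold: float = FOR_SECONDS) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    samples: (timestamp_seconds, latency_seconds) pairs for ONE OSD series,
    in chronological order, one pair per rule evaluation (an assumption).
    """
    pending_since: Optional[float] = None
    for ts, latency in samples:
        if latency > threshold:
            if pending_since is None:
                pending_since = ts   # condition just became true: pending
            if ts - pending_since >= hold:
                return ts            # condition held long enough: firing
        else:
            pending_since = None     # any dip below the threshold resets the timer
    return None
```

In this model, a brief spike above 5 seconds that recovers before a full minute has passed never fires, which is why the alert points to sustained latency and not momentary jitter.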
Impact #
High OSD latency can have several negative consequences:
- Slower write performance, leading to increased latency for applications writing data to the Ceph cluster
 - Increased risk of data loss or corruption due to delayed writes
 - Potential for OSDs to become stuck or unresponsive, leading to further performance degradation and instability in the cluster
 
Diagnosis #
To diagnose the root cause of high OSD latency, follow these steps:
- Check the Ceph cluster’s overall health using the ceph -s command or a monitoring dashboard.
 - Investigate the OSD’s performance metrics, such as ceph_osd_perf_apply_latency_seconds, to identify which OSDs are experiencing high latency.
 - Check the OSD’s debug logs for any errors or warnings related to high latency.
 - Verify that the OSDs have sufficient resources (e.g., CPU, memory, disk space) to operate effectively.
 - Check for any network connectivity issues or congestion that may be contributing to high latency.
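To narrow down which OSDs are slow from the command line, one option is to parse the JSON output of ceph osd perf. The sketch below is a minimal example under stated assumptions: the helper names slow_osds and fetch_osd_perf are hypothetical, the JSON layout (an osd_perf_infos list, nested under osdstats in newer releases) is assumed and may differ in your Ceph version, and the CLI reports milliseconds, so the 5-second threshold becomes 5000 ms.

```python
import json
import subprocess
from typing import Dict, List


def slow_osds(perf_json: str, threshold_ms: float = 5000.0) -> List[Dict[str, float]]:
    """Return OSDs whose apply latency exceeds threshold_ms, slowest first."""
    data = json.loads(perf_json)
    # Assumption: newer releases nest the list under "osdstats"; older ones do not.
    infos = data.get("osdstats", data).get("osd_perf_infos", [])
    slow = [
        {"id": info["id"], "apply_latency_ms": info["perf_stats"]["apply_latency_ms"]}
        for info in infos
        if info["perf_stats"]["apply_latency_ms"] > threshold_ms
    ]
    return sorted(slow, key=lambda osd: osd["apply_latency_ms"], reverse=True)


def fetch_osd_perf() -> str:
    """Run `ceph osd perf -f json`; needs the ceph CLI and a client keyring."""
    result = subprocess.run(
        ["ceph", "osd", "perf", "-f", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Example usage on a host with cluster access:
#   for osd in slow_osds(fetch_osd_perf()):
#       print(osd["id"], osd["apply_latency_ms"])
```

Keeping the parsing separate from the CLI call means the filter can be checked against saved output without touching a live cluster.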
 
Mitigation #
To mitigate high OSD latency, follow these steps:
- Check for any recent changes to the Ceph cluster’s configuration or deployment that may be contributing to high latency.
 - Increase the OSD’s resources (e.g., CPU, memory, disk space) if necessary to improve performance.
 - Implement load balancing or distribute writes across multiple OSDs to reduce the load on individual OSDs.
 - Adjust the Ceph cluster’s configuration to optimize performance, such as adjusting the osd_target_transaction_size or osd_client_message_size settings.
 - Consider upgrading the Ceph cluster to a newer version, which may include performance improvements or bug fixes related to OSD latency.
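Before changing any tuning option, it is worth recording its current value so the change can be reverted. The sketch below reads values with ceph config get, which assumes a release that has the centralized configuration database. The helper names run_ceph and current_settings and the injectable run parameter are hypothetical, and the option name is taken from the text above, so confirm it exists in your release before relying on it.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def run_ceph(argv: List[str]) -> str:
    """Run a ceph CLI command and return stdout (needs ceph and a keyring)."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def current_settings(options: Iterable[str], who: str = "osd",
                     run: Callable[[List[str]], str] = run_ceph) -> Dict[str, str]:
    """Map each option to the value reported by `ceph config get <who> <option>`."""
    return {opt: run(["ceph", "config", "get", who, opt]).strip() for opt in options}


# Example usage on a host with cluster access:
#   print(current_settings(["osd_target_transaction_size"], who="osd.3"))
```

Passing the command runner as a parameter is a design choice that lets the lookup be exercised with a stub, without a cluster.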