CephHighOsdLatency #
Ceph Object Storage Daemon (OSD) latency is high. Check whether an OSD is stuck in an abnormal state.
Alert Rule
alert: CephHighOsdLatency
annotations:
  description: |-
    Ceph Object Storage Daemon latency is high. Please check if it doesn't stuck in weird state.
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/ceph-internal/cephhighosdlatency/
  summary: Ceph high OSD latency (instance {{ $labels.instance }})
expr: ceph_osd_perf_apply_latency_seconds > 5
for: 1m
labels:
  severity: warning
Meaning #
The CephHighOsdLatency alert fires when ceph_osd_perf_apply_latency_seconds, the apply latency reported for a Ceph Object Storage Daemon (OSD), stays above 5 seconds for at least 1 minute. The rule compares each series against the threshold directly; it does not average across OSDs or over a time window. Latency at this level indicates that the affected OSDs are slow to apply writes, which degrades write performance and can affect the overall health of the Ceph cluster.
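To see which series currently satisfy the rule's expression, the same PromQL can be sent to the Prometheus HTTP API as an instant query. The following is a minimal sketch, assuming Prometheus is reachable at the address held in PROM_URL; both that variable name and its default are placeholders, not part of the original rule.

```shell
# Assumption: PROM_URL points at your Prometheus server; the default is a placeholder.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run the alert expression as an instant query and print the raw JSON response.
# Each element of data.result is one series that is above the threshold right now.
query_high_osd_latency() {
  curl -sG "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=ceph_osd_perf_apply_latency_seconds > 5'
}
```

Calling query_high_osd_latency returns an empty result list when no series exceeds 5 seconds. The labels on each returned series identify the affected daemon; their exact names depend on the exporter in use.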
Impact #
High OSD latency can have several negative consequences:
- Slower write performance, leading to increased latency for applications writing data to the Ceph cluster
- Increased risk of data loss or corruption due to delayed writes
- OSDs may become stuck in an abnormal state, leading to further performance degradation and instability in the cluster
Diagnosis #
To diagnose the root cause of high OSD latency, follow these steps:
- Check the Ceph cluster’s overall health using the ceph -s command or a monitoring dashboard.
- Investigate the OSD’s performance metrics, such as ceph_osd_perf_apply_latency_seconds, to identify which OSDs are experiencing high latency.
- Check the OSD’s debug logs for any errors or warnings related to high latency.
- Verify that the OSDs have sufficient resources (e.g., CPU, memory, disk space) to operate effectively.
- Check for any network connectivity issues or congestion that may be contributing to high latency.
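The first steps of the checklist above can be sketched as a pair of shell helpers. This is an illustration rather than a definitive procedure: the function names are hypothetical, and slow_osds assumes the usual three-column plain-text layout of ceph osd perf (osd, commit latency in ms, apply latency in ms).

```shell
# Print OSDs whose apply latency, as reported by `ceph osd perf`, exceeds a
# threshold in milliseconds (default 5000 ms, matching the alert's 5 s).
# Assumes columns: osd, commit_latency(ms), apply_latency(ms), with one header line.
slow_osds() {
  threshold_ms="${1:-5000}"
  ceph osd perf | awk -v t="$threshold_ms" \
    'NR > 1 && $3 + 0 > t + 0 { printf "osd.%s apply_latency=%sms\n", $1, $3 }'
}

# Collect a first-pass snapshot of cluster state.
diagnose_osd_latency() {
  ceph -s              # overall cluster health
  ceph health detail   # expanded health warnings
  slow_osds "$@"       # OSDs above the latency threshold
  ceph osd df tree     # per-OSD utilization within the CRUSH tree
}

# Dump recently completed slow operations for one OSD (argument: numeric OSD id).
# Run this on the host where that OSD lives: `ceph daemon` uses the local admin socket.
osd_recent_ops() {
  ceph daemon "osd.$1" dump_historic_ops
}
```

For example, slow_osds 1000 lists every OSD whose apply latency is above one second, which helps spot daemons that are degrading before they cross the alert threshold.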
Mitigation #
To mitigate high OSD latency, follow these steps:
- Check for any recent changes to the Ceph cluster’s configuration or deployment that may be contributing to high latency.
- Increase the OSD’s resources (e.g., CPU, memory, disk space) if necessary to improve performance.
- Spread the write load more evenly across OSDs, for example by rebalancing data away from overloaded OSDs, so that no single OSD becomes a bottleneck.
- Adjust the Ceph cluster’s configuration to optimize performance, such as adjusting the osd_target_transaction_size or osd_client_message_size settings.
- Consider upgrading the Ceph cluster to a newer version, which may include performance improvements or bug fixes related to OSD latency.
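The configuration and rebalancing steps above might look like the following sketch. It assumes a Ceph release with the centralized configuration database (ceph config get and ceph config set); the helper names are hypothetical, and no specific values are recommended here.

```shell
# Show the current value of an OSD option from the central config database.
show_osd_option() {
  ceph config get osd "$1"
}

# Change an OSD option for all OSDs. The value is deliberately a required
# argument: choose it from testing on your own cluster, not from this sketch.
set_osd_option() {
  [ "$#" -eq 2 ] || { echo "usage: set_osd_option NAME VALUE" >&2; return 1; }
  ceph config set osd "$1" "$2"
}

# Lower the override weight (0.0 to 1.0) of an overloaded OSD so that CRUSH maps
# less data to it. Note that this triggers data movement.
reweight_osd() {
  [ "$#" -eq 2 ] || { echo "usage: reweight_osd OSD_ID WEIGHT" >&2; return 1; }
  ceph osd reweight "$1" "$2"
}
```

A cautious workflow is to record the existing value first, for example with show_osd_option osd_target_transaction_size, change one option at a time, and watch the apply latency metric before making further changes.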