EtcdHighFsyncDurations #
Etcd WAL fsync duration increasing, 99th percentile is over 0.5s
Alert Rule
alert: EtcdHighFsyncDurations
annotations:
  description: |-
    Etcd WAL fsync duration increasing, 99th percentile is over 0.5s
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/etcd-internal/etcdhighfsyncdurations/
  summary: Etcd high fsync durations (instance {{ $labels.instance }})
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5
for: 2m
labels:
  severity: warning
Meaning #
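The EtcdHighFsyncDurations alert fires when the 99th percentile of etcd’s WAL fsync duration, computed from a 1-minute rate of the etcd_disk_wal_fsync_duration_seconds histogram, stays above 0.5 seconds for 2 minutes. etcd must persist each write-ahead log (WAL) entry to disk before acknowledging it, so this indicates that disk writes are slow enough to affect the performance and reliability of the cluster. For comparison, the etcd documentation suggests that the 99th percentile should stay below 10 ms for storage to be considered fast enough.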
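To make the alert expression more concrete, the following is a minimal sketch of how a quantile can be estimated from cumulative histogram buckets using linear interpolation within the matching bucket, which is the general approach histogram_quantile takes. The function name, the bucket boundaries, and the per-second rates are hypothetical values chosen for illustration, not data from a real cluster.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (inf, total).
    Counts may be raw counts or per-second rates; only their ratios matter.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    # The first bucket is assumed to start at 0.
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical per-second rates for each "le" bucket over the last minute.
fsync_buckets = [
    (0.256, 90.0),
    (0.512, 98.0),
    (1.024, 100.0),
    (math.inf, 100.0),
]

p99 = histogram_quantile(0.99, fsync_buckets)
print(f"p99 = {p99:.3f}s, alert condition met: {p99 > 0.5}")
```

With these made-up numbers the estimate is roughly 0.768 s, because the 99th-percentile rank lands halfway through the 0.512 s to 1.024 s bucket. The estimate is only as precise as the bucket boundaries, so a value just above 0.5 s may reflect the bucket layout as much as the actual latency.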
Impact #
The impact of high fsync durations can be significant, leading to:
- Slow etcd write performance
- Increased latency for etcd operations
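- Followers falling behind the leader and, if heartbeats are delayed long enough, leader elections that surface as request timeouts and errors
- Reduced availability and slower recovery if a member fails while the cluster is already struggling to persist writes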
Diagnosis #
To diagnose the root cause of high fsync durations, follow these steps:
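- Check disk usage and free space on the volume holding the etcd data directory to ensure that the disk is not full or nearly full.
- Verify that the disk is healthy (e.g., check kernel logs and SMART data for disk errors, and check for firmware issues).
- Investigate system resource utilization (CPU, memory, I/O), paying particular attention to disk write latency and to other processes competing for the same disk.
- Check etcd logs for errors or warnings related to disk I/O or fsync operations, such as "slow fdatasync" messages.
- Verify that the etcd data directory (and the WAL directory, if set separately with --wal-dir) is on storage suited to etcd, ideally a dedicated low-latency disk such as a local SSD.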
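As a rough complement to the checks above, the sketch below measures fsync latency directly on a directory by writing small records and syncing after each one. It is an approximation under stated assumptions: the function names, the 2048-byte payload, and the sample count are hypothetical choices, and os.fsync is used for portability even though etcd itself reports "fdatasync" in its logs. A dedicated benchmark tool such as fio gives more reliable numbers.

```python
import math
import os
import tempfile
import time


def probe_fsync_latency(directory, samples=200, payload_size=2048):
    """Append small records to a temp file in `directory`, fsync after each
    write, and return the fsync latencies in seconds."""
    payload = os.urandom(payload_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # block until the data reaches stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        # Always clean up the probe file, even if a write fails.
        os.close(fd)
        os.unlink(path)
    return latencies


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]
```

For example, percentile(probe_fsync_latency("/var/lib/etcd"), 0.99) would give an approximate 99th percentile for that volume, assuming /var/lib/etcd is where the data directory lives on the host in question. Run it during a quiet period, since the probe adds write load to a disk that is already suspected of being slow.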
Mitigation #
To mitigate the effects of high fsync durations, follow these steps:
- Immediately investigate and address any underlying disk or system resource issues.
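- Do not try to reduce how often etcd syncs: etcd has no write buffer setting for this, because every WAL entry must be persisted before it is acknowledged. If etcd shares the disk with other processes, consider giving it higher disk I/O priority instead (e.g., with ionice on Linux).
- Tune etcd’s configuration for the underlying storage system (e.g., raise --heartbeat-interval and --election-timeout so that brief disk stalls are less likely to trigger leader elections).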
- Consider migrating etcd to a faster storage system or optimizing the existing storage configuration.
- Implement additional monitoring and alerting to detect potential issues before they become critical.