PulsarHighWriteLatency #

Messages cannot be written in a timely fashion

Alert Rule

alert: PulsarHighWriteLatency
annotations:
  description: |-
    Messages cannot be written in a timely fashion
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/pulsar-internal/pulsarhighwritelatency/
  summary: Pulsar high write latency (topic {{ $labels.topic }})
expr: sum(pulsar_storage_write_latency_overflow > 0) by (topic)
for: 1h
labels:
  severity: critical

Meaning #

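The PulsarHighWriteLatency alert fires when pulsar_storage_write_latency_overflow stays above 0 for at least one topic for 1 hour. The overflow bucket of Pulsar's storage write latency histogram counts entries that took longer than 1 second to write, so a sustained non-zero value means messages on that topic are not being persisted in a timely fashion. Left unresolved, this can lead to performance issues, message loss, or system unavailability.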
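
To make the rule's expression concrete, the following Python sketch mimics what sum(pulsar_storage_write_latency_overflow > 0) by (topic) computes, using hand-made samples. The topic names, sample values, and the helper overflow_by_topic are hypothetical and exist only for illustration; in practice Prometheus evaluates the expression, and the rule fires only after a result has been present for 1h.

```python
from collections import defaultdict

# Hypothetical samples of pulsar_storage_write_latency_overflow as (topic, value).
# The same topic label can appear on more than one series.
SAMPLES = [
    ("persistent://public/default/orders", 3.0),
    ("persistent://public/default/orders", 0.0),
    ("persistent://public/default/audit", 0.0),
    ("persistent://public/default/clicks", 1.5),
    ("persistent://public/default/clicks", 2.0),
]


def overflow_by_topic(samples):
    """Mimic `sum(metric > 0) by (topic)`: drop samples <= 0, then sum per topic."""
    totals = defaultdict(float)
    for topic, value in samples:
        if value > 0:  # the `> 0` comparison filters the series out entirely
            totals[topic] += value
    return dict(totals)


if __name__ == "__main__":
    for topic, total in sorted(overflow_by_topic(SAMPLES).items()):
        print(f"{topic}: {total}")
```

A topic whose samples are all zero produces no result at all, which is why the alert is raised per topic and only for topics that actually have slow writes.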

Impact #

The impact can be severe because the underlying condition affects the ability to write data to Pulsar. It can lead to:

Performance degradation: High write latency can cause slower data processing, leading to delays in downstream applications.
Data loss: Messages that time out before they are persisted may be lost if producers do not retry, resulting in incomplete or inaccurate data.
System unavailability: In extreme cases, high write latency can cause Pulsar to become unavailable, leading to complete system downtime.

Diagnosis #

To diagnose the root cause of this alert, follow these steps:

Check the Pulsar cluster configuration to ensure that it is properly sized and configured for the current workload.
Investigate disk usage and I/O performance to identify any bottlenecks or issues with storage.
Review Pulsar broker logs for errors or exceptions related to write operations.
Verify that there are no network connectivity issues between Pulsar brokers and bookies.
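Check the persistence policy (ensemble size, write quorum, ack quorum) that applies to the affected topic, since each write must wait for the configured number of bookie acknowledgements. Note that the overflow bucket boundary is fixed by Pulsar's metrics and is not a per-topic setting.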
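
Before working through the steps above, it usually helps to know which topics are affected. The following is a minimal sketch that runs the alert's own expression through the Prometheus HTTP API instant-query endpoint (/api/v1/query) and lists topics by value. The Prometheus address and the helper names build_query_url, affected_topics, and fetch_affected_topics are assumptions for illustration, not part of the original runbook.

```python
import json
import urllib.parse
import urllib.request

# PromQL copied from the alert rule so the diagnosis matches what fired.
ALERT_EXPR = "sum(pulsar_storage_write_latency_overflow > 0) by (topic)"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def affected_topics(payload):
    """Return {topic: value} from an instant-vector response, highest value first."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = {
        r["metric"].get("topic", "<no topic label>"): float(r["value"][1])
        for r in payload["data"]["result"]
    }
    return dict(sorted(rows.items(), key=lambda kv: kv[1], reverse=True))


def fetch_affected_topics(base_url):
    """Run the alert expression against a live Prometheus (needs network access)."""
    url = build_query_url(base_url, ALERT_EXPR)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return affected_topics(json.load(resp))
```

For example, fetch_affected_topics("http://localhost:9090") would return the affected topics for a Prometheus server at that (hypothetical) address. Starting with the topic that has the highest value narrows down which brokers and bookies to inspect first.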

Mitigation #

To mitigate this alert, follow these steps:

Scale up the Pulsar cluster to increase capacity and reduce write latency.
Optimize disk usage and I/O performance by increasing storage capacity, improving disk configuration, or using faster storage.
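Rebalance write traffic across brokers, for example by unloading or splitting hot namespace bundles so that Pulsar's load balancer can reassign them.
Enable producer-side batching and bounded send queues so that message bursts are absorbed by producers and reach storage as fewer, larger writes.
Consider upgrading to a newer Pulsar version to take advantage of performance improvements and bug fixes.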
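
As one way to apply the buffering step above, the sketch below enables batching on a producer created with the Pulsar Python client (the pulsar-client package). The helper names and the specific values are assumptions chosen for illustration and are not tuned recommendations; batching trades a small publish delay for fewer entries written to storage, which may or may not be enough to relieve write pressure on its own.

```python
def batching_producer_options(max_delay_ms=10, max_batch_messages=1000, max_pending=1000):
    """Keyword arguments for pulsar.Client.create_producer that enable batching.

    The defaults here are illustrative assumptions, not recommended settings.
    """
    return {
        "batching_enabled": True,                       # group messages into batches
        "batching_max_publish_delay_ms": max_delay_ms,  # flush a batch after this delay
        "batching_max_messages": max_batch_messages,    # or once it holds this many messages
        "max_pending_messages": max_pending,            # bound the in-memory send queue
        "block_if_queue_full": True,                    # apply backpressure when the queue is full
    }


def open_batching_producer(service_url, topic, **overrides):
    """Create a client and a batching producer.

    Requires the pulsar-client package and a reachable broker.
    """
    import pulsar  # imported lazily so the options helper works without the package

    client = pulsar.Client(service_url)
    producer = client.create_producer(topic, **batching_producer_options(**overrides))
    return client, producer
```

With block_if_queue_full set, a producer that outpaces the cluster blocks once its queue is full, so backpressure reaches the application rather than sends failing with a queue-full error during a burst.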

Remember to investigate and address the root cause of the issue, rather than just treating the symptoms. Once the issue is resolved, the alert should clear automatically.