PulsarTopicVeryLargeBacklogStorageSize #
The topic backlog storage size is over 20 GB
Alert Rule #
```yaml
alert: PulsarTopicVeryLargeBacklogStorageSize
annotations:
  description: |-
    The topic backlog storage size is over 20 GB
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/pulsar-internal/pulsartopicverylargebacklogstoragesize/
  summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }})
expr: sum(pulsar_storage_size > 20*1024*1024*1024) by (topic)
for: 1h
labels:
  severity: critical
```
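The alert expression is based on the `pulsar_storage_size` metric; the same metric can be queried ad hoc to see which topics are near or over the threshold. A minimal sketch using the Prometheus HTTP API (the `PROM_URL` value is an assumption; point it at your own Prometheus server):

```shell
# Assumption: Prometheus is reachable at this address in your environment.
PROM_URL="http://prometheus:9090"

# List the five topics with the largest backlog storage size.
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=topk(5, sum(pulsar_storage_size) by (topic))'
```

The same PromQL can be pasted into Grafana Explore or the Prometheus UI instead of using `curl`.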
Meaning #
The PulsarTopicVeryLargeBacklogStorageSize alert fires when the storage size of a Pulsar topic's backlog exceeds 20 GB for more than an hour. This usually indicates that messages are being produced faster than they are consumed, or that retention settings are keeping data longer than intended.
Impact #
The impact of this alert is high, as a large backlog storage size can lead to:
- Increased storage costs
- Performance degradation of the Pulsar cluster
- Potential message loss or duplication
- Increased latency for message processing
Diagnosis #
To diagnose the root cause of this alert, follow these steps:
- Identify the affected topic(s) from the `topic` label on the alert.
- Check the Pulsar cluster’s storage usage and availability.
- Investigate the message processing rate and latency for the affected topic(s).
- Verify that there are no issues with the message producer or consumer applications.
- Check the Pulsar cluster’s configuration and retention settings.
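The per-topic checks above can be done with the `pulsar-admin` CLI. A sketch, assuming the hypothetical topic `persistent://tenant/namespace/my-topic` taken from the alert's `topic` label:

```shell
# Backlog size, per-subscription backlog, and producer/consumer rates.
pulsar-admin topics stats persistent://tenant/namespace/my-topic

# Storage-level detail: ledgers, entry counts, and sizes.
pulsar-admin topics stats-internal persistent://tenant/namespace/my-topic

# List subscriptions; look for ones whose consumers have stopped acknowledging.
pulsar-admin topics subscriptions persistent://tenant/namespace/my-topic
```

In the `stats` output, a large `msgBacklog` on a single subscription typically points at a stalled or slow consumer rather than a retention problem.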
Mitigation #
To mitigate this alert, follow these steps:
- Identify and resolve any issues with message processing or retention in the Pulsar cluster.
- Consider increasing the storage capacity of the Pulsar cluster.
- Adjust the retention settings or message TTL (time to live) for the affected topic(s).
- Implement a more efficient message processing strategy, such as adding consumers to a Shared or Key_Shared subscription to increase parallelism.
- Monitor the Pulsar cluster’s storage usage and message processing rate to ensure the issue does not recur.
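The retention and TTL adjustments above can be applied with `pulsar-admin`. A sketch with example values (the `tenant/namespace` name, subscription name, and the specific limits are assumptions; tune them to your workload):

```shell
# Cap retained data at the namespace level: 10 GB or 3 days, whichever is hit first.
pulsar-admin namespaces set-retention tenant/namespace --size 10G --time 3d

# Expire unacknowledged messages after 1 hour via message TTL.
pulsar-admin namespaces set-message-ttl tenant/namespace --messageTTL 3600

# Last resort: discard the backlog of a stuck subscription (this loses messages!).
pulsar-admin topics skip-all --subscription my-sub persistent://tenant/namespace/my-topic
```

Prefer fixing the consumer and letting it drain the backlog; `skip-all` should only be used when the backlogged data is confirmed to be disposable.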