PulsarSubscriptionVeryHighNumberOfBacklogEntries #

The number of subscription backlog entries is over 100k

Alert Rule

alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries
annotations:
  description: |-
    The number of subscription backlog entries is over 100k
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/pulsar-internal/pulsarsubscriptionveryhighnumberofbacklogentries/
  summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance
    }})
expr: sum(pulsar_subscription_back_log) by (subscription) > 100000
for: 1h
labels:
  severity: critical


Meaning #

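This alert fires when the backlog of a Pulsar subscription, summed across every series that shares the same subscription label, stays above 100,000 entries for one hour. A backlog that large for that long means the consumers are not keeping up with incoming messages, so unacknowledged messages keep accumulating. Because the expression aggregates by subscription only, the firing alert carries just the subscription label, and {{ $labels.instance }} in the summary renders empty.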
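To make the aggregation concrete, the following minimal sketch reproduces what `sum(pulsar_subscription_back_log) by (subscription) > 100000` computes over a handful of samples. The topic names, subscription names, and values are hypothetical, and the helper functions are illustrative only; Prometheus itself evaluates the real rule.

```python
from collections import defaultdict

THRESHOLD = 100_000  # matches the `> 100000` in the alert expression


def backlog_by_subscription(samples):
    """Sum backlog per subscription, like `sum(...) by (subscription)`."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[labels["subscription"]] += value
    return dict(totals)


def over_threshold(totals, threshold=THRESHOLD):
    """Keep only the subscriptions whose summed backlog exceeds the threshold."""
    return {sub: total for sub, total in totals.items() if total > threshold}


# Hypothetical samples: (labels, value) pairs for pulsar_subscription_back_log.
samples = [
    ({"topic": "persistent://acme/orders/created", "subscription": "billing"}, 70_000),
    ({"topic": "persistent://acme/orders/updated", "subscription": "billing"}, 45_000),
    ({"topic": "persistent://acme/orders/created", "subscription": "audit"}, 1_200),
]

totals = backlog_by_subscription(samples)
print(over_threshold(totals))  # {'billing': 115000}
```

In this example no single topic is over the threshold, yet the alert condition holds for `billing` because the same subscription name appears on two topics and the values are added together.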

Impact #

A high backlog of unprocessed messages can cause:

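Delayed message processing, so downstream data goes stale; messages can also be lost if a backlog quota or message TTL expires them before they are consumed
Increased storage usage on the bookies, where the backlog is persisted, and added load on the Pulsar brokers, potentially leading to performance issues or even crashes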
Decreased overall system performance and reliability

Diagnosis #

To diagnose the issue, follow these steps:

Check the broker logs and the consumer application logs to identify the root cause of the backlog buildup.
Verify that the subscription is properly configured and that its consumers are connected and acknowledging messages.
Check the message rate and size to identify if there are any unusual patterns or spikes.
Verify that the Pulsar brokers have sufficient resources (e.g., memory, CPU) to handle the message load.
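The checks above can be partly automated. The sketch below ranks the subscriptions of one topic by backlog and attaches a rough triage hint. It assumes a stats document shaped like the JSON printed by `pulsar-admin topics stats`, using the fields `msgRateIn`, `subscriptions`, `msgBacklog`, `msgRateOut`, and `consumers`; confirm the field names against your Pulsar version. The sample document and the triage rules are hypothetical.

```python
import json


def summarize_subscriptions(topic_stats):
    """Return one row per subscription, largest backlog first."""
    rows = []
    for name, sub in topic_stats.get("subscriptions", {}).items():
        rows.append({
            "subscription": name,
            "backlog": sub.get("msgBacklog", 0),
            "consumers": len(sub.get("consumers", [])),
            "rate_out": sub.get("msgRateOut", 0.0),
        })
    return sorted(rows, key=lambda r: r["backlog"], reverse=True)


def likely_cause(row, rate_in):
    """Rough triage hint; these rules are illustrative, not exhaustive."""
    if row["consumers"] == 0:
        return "no consumers connected"
    if row["rate_out"] < rate_in:
        return "consumers slower than producers"
    return "draining; backlog should shrink"


# Hypothetical stats document, trimmed to the fields used above.
stats = json.loads("""
{
  "msgRateIn": 500.0,
  "subscriptions": {
    "billing": {"msgBacklog": 115000, "msgRateOut": 120.0,
                "consumers": [{"consumerName": "c1"}]},
    "audit": {"msgBacklog": 250000, "msgRateOut": 0.0, "consumers": []}
  }
}
""")

for row in summarize_subscriptions(stats):
    print(row["subscription"], row["backlog"], likely_cause(row, stats["msgRateIn"]))
# audit 250000 no consumers connected
# billing 115000 consumers slower than producers
```

A subscription with no connected consumers usually points to a deployment or connectivity problem, while a low dispatch rate with consumers attached points to slow processing on the consumer side.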

Mitigation #

To mitigate the issue, follow these steps:

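Increase the number of consumers for the subscription to help process the backlog. This only helps for Shared or Key_Shared subscriptions; an Exclusive subscription allows a single consumer, and a Failover subscription has only one active consumer at a time.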
Check for any message processing bottlenecks and optimize the processing pipeline as needed.
Consider increasing the resources (e.g., memory, CPU) available to the Pulsar brokers to handle the message load.
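Configure backlog quotas or a message TTL on the namespace to bound backlog growth in the future. Retention policies alone do not help here, because in Pulsar they apply to messages that have already been acknowledged.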
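When deciding how many consumers to add, a back-of-the-envelope estimate is often enough. The sketch below assumes that consume throughput scales linearly with the number of consumers, which is an idealization, and that the incoming rate stays constant while the backlog drains. All function names and numbers are hypothetical.

```python
import math


def consumers_needed(backlog, rate_in, per_consumer_rate, drain_seconds):
    """Consumers required to clear `backlog` within `drain_seconds`.

    The total consume rate must cover the incoming rate plus
    backlog / drain_seconds.
    """
    if per_consumer_rate <= 0 or drain_seconds <= 0:
        raise ValueError("per_consumer_rate and drain_seconds must be positive")
    required_rate = rate_in + backlog / drain_seconds
    return math.ceil(required_rate / per_consumer_rate)


def drain_time_seconds(backlog, rate_in, consumers, per_consumer_rate):
    """Seconds until the backlog is empty, or None if it never drains."""
    surplus = consumers * per_consumer_rate - rate_in
    if surplus <= 0:
        return None
    return backlog / surplus


# Hypothetical numbers: 115,000 entries behind, 500 msg/s arriving,
# each consumer handling about 120 msg/s, target drain time 30 minutes.
print(consumers_needed(115_000, 500.0, 120.0, 1800))    # 5
print(drain_time_seconds(115_000, 500.0, 5, 120.0))     # 1150.0
print(drain_time_seconds(115_000, 500.0, 4, 120.0))     # None
```

With four consumers the combined rate (480 msg/s) is below the incoming rate, so the backlog keeps growing; the fifth consumer is what turns the trend around in this example.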

Note: This is just a sample runbook, and you may need to customize it to fit your specific use case and environment.