PulsarReadOnlyBookies
Observing Readonly Bookies
Alert Rule
alert: PulsarReadOnlyBookies
annotations:
  description: |-
    Observing Readonly Bookies
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/pulsar-internal/pulsarreadonlybookies/
  summary: Pulsar read only bookies (instance {{ $labels.instance }})
expr: count(bookie_SERVER_STATUS{} == 0) by (pod)
for: 5m
labels:
  severity: critical
Here is a runbook for the PulsarReadOnlyBookies alert:
Meaning
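This alert is triggered when one or more Pulsar bookies have reported a read-only state (bookie_SERVER_STATUS equal to 0) for at least 5 minutes. Bookies are responsible for storing and serving data in a Pulsar cluster. A bookie in read-only mode can still serve reads of the data it already holds, but it cannot accept new writes. New writes must then go to the remaining writable bookies, and if too few of them are left to satisfy the configured ensemble size, writes can fail or stall.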
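To make the rule's expression concrete, the following is a minimal Python sketch of what `count(bookie_SERVER_STATUS{} == 0) by (pod)` computes, applied to a sample in the Prometheus text exposition format. The sample data and the function name `read_only_counts` are illustrative assumptions. In a real deployment the `pod` label is usually attached by Prometheus relabeling rather than present in the bookie's raw metrics output.

```python
import re
from collections import Counter

# Matches one bookie_SERVER_STATUS sample line, with or without labels,
# e.g.  bookie_SERVER_STATUS{pod="bookie-0"} 0
SAMPLE = re.compile(
    r'^bookie_SERVER_STATUS(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def read_only_counts(metrics_text: str) -> dict:
    """Count series whose value is 0, grouped by the pod label.

    This mirrors count(bookie_SERVER_STATUS{} == 0) by (pod).
    """
    counts = Counter()
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue  # comment, blank line, or a different metric
        if float(match.group("value")) != 0:
            continue  # the rule keeps only series equal to 0
        labels = dict(LABEL.findall(match.group("labels") or ""))
        counts[labels.get("pod", "")] += 1
    return dict(counts)


# Hypothetical sample: one writable bookie and two read-only bookies.
sample = """
# TYPE bookie_SERVER_STATUS gauge
bookie_SERVER_STATUS{pod="bookie-0"} 1
bookie_SERVER_STATUS{pod="bookie-1"} 0
bookie_SERVER_STATUS{pod="bookie-2"} 0
"""

if __name__ == "__main__":
    print(read_only_counts(sample))
```

Each pod that appears in the result corresponds to one alert instance once the condition has held for the rule's `for: 5m` window.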
Impact
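The impact of this alert is critical, as it can cause:
- Failed or stalled writes when too few writable bookies remain, which can lead to data loss or inconsistencies on the producer side
- Reduced cluster availability and performance, because the remaining writable bookies absorb all new writes
- Reduced redundancy for new data while the affected bookies cannot accept writes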
Diagnosis
To diagnose the issue, follow these steps:
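- Check the logs of the affected bookie(s), and of the wider Pulsar cluster, for errors or warnings, in particular messages about full disks or failed disk writes
- Verify the bookie’s configuration and the disk usage of its journal and ledger directories to ensure that it is not running out of disk space; a bookie typically switches to read-only mode once disk usage exceeds its configured threshold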
- Check the network connectivity between the affected bookie and other nodes in the cluster
- Verify that the bookie is not experiencing high CPU or memory usage
- Check the Pulsar cluster’s metrics to see if there are any other indicators of issues, such as high latency or error rates
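The disk-space check in the steps above can be sketched in Python as follows. This is a rough illustration rather than the bookie's own logic: the 0.95 default is assumed to match BookKeeper's documented default for `diskUsageThreshold` and should be compared against the value in your bookie configuration, the function names are hypothetical, and the directory passed to `check_dir` is a placeholder for your actual ledger or journal directory. The ratio is computed from `shutil.disk_usage`, which may differ slightly from how the bookie measures usage.

```python
import shutil

# Assumption: mirrors BookKeeper's documented default diskUsageThreshold.
# Check the value actually set in your bookie configuration.
ASSUMED_DISK_USAGE_THRESHOLD = 0.95


def usage_ratio(used_bytes: int, total_bytes: int) -> float:
    """Return the fraction of the filesystem that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def over_threshold(used_bytes: int, total_bytes: int,
                   threshold: float = ASSUMED_DISK_USAGE_THRESHOLD) -> bool:
    """True when usage is strictly above the threshold."""
    return usage_ratio(used_bytes, total_bytes) > threshold


def check_dir(path: str,
              threshold: float = ASSUMED_DISK_USAGE_THRESHOLD) -> bool:
    """Check the filesystem that holds `path` against the threshold."""
    usage = shutil.disk_usage(path)
    return over_threshold(usage.used, usage.total, threshold)


if __name__ == "__main__":
    # "." is a placeholder; point this at the bookie's ledger directory.
    print("over threshold:", check_dir("."))
```

Run it on the bookie host or inside the bookie pod, once per ledger and journal directory, since each may sit on a different volume.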
Mitigation
To mitigate the issue, follow these steps:
- Identify the affected bookie(s) and investigate the cause of the read-only state
- If the issue is caused by disk space constraints, free up disk space or add storage capacity
- If the issue is caused by a configuration error, correct the configuration and restart the affected bookie
- If the issue is caused by network connectivity issues, resolve the network issues and restart the affected bookie
- If the issue is caused by high resource usage, investigate and resolve the underlying cause, and restart the affected bookie
- Once the issue is resolved, verify that the bookie is no longer in read-only mode (the alert expression returns no series for its pod) and that the cluster is functioning normally
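The final verification step can be sketched as a query against the Prometheus HTTP API using the alert's own expression. This is a minimal sketch under assumptions: the base URL passed to `fetch` is a placeholder for your Prometheus server, the function names are hypothetical, and no authentication is handled. An empty result means no bookie currently reports a status of 0.

```python
import json
import urllib.parse
import urllib.request

# The expression from the alert rule.
QUERY = "count(bookie_SERVER_STATUS{} == 0) by (pod)"


def read_only_pods(response: dict) -> list:
    """Extract pod names from an instant-query response.

    Expects the JSON shape returned by /api/v1/query for a vector result.
    """
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    result = response["data"]["result"]
    return sorted(item["metric"].get("pod", "") for item in result)


def fetch(base_url: str, query: str = QUERY) -> dict:
    """Run an instant query. base_url is a placeholder for your server."""
    params = urllib.parse.urlencode({"query": query})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical address; replace with your Prometheus endpoint.
    pods = read_only_pods(fetch("http://localhost:9090"))
    print("read-only bookie pods:", pods or "none")
```

Because the rule uses `for: 5m`, the alert itself may take a short while to resolve after the query result becomes empty.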