PulsarHighNumberOfFunctionErrors #
Observing more than 10 Function errors per minute
Alert Rule
alert: PulsarHighNumberOfFunctionErrors
annotations:
description: |-
Observing more than 10 Function errors per minute
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/pulsar-internal/pulsarhighnumberoffunctionerrors/
summary: Pulsar high number of function errors (instance {{ $labels.instance }})
expr: sum((rate(pulsar_function_user_exceptions_total{}[1m]) + rate(pulsar_function_system_exceptions_total{}[1m]))
> 10) by (name)
for: 1m
labels:
severity: critical
Here is a sample runbook for the Prometheus alert rule:
Meaning #
The PulsarHighNumberOfFunctionErrors
alert is triggered when the rate of function errors in Pulsar exceeds 10 errors per minute. This indicates a significant issue with the Pulsar functions, potentially causing failures, data loss, or performance degradation.
Impact #
- Function execution failures may lead to data loss or inconsistencies.
- Increased error rates can cause performance degradation, leading to slower processing times or even complete system halts.
- High error rates can also indicate potential security vulnerabilities or configuration issues.
Diagnosis #
To diagnose the issue:
- Check the Pulsar function logs for error messages and exceptions to identify the root cause of the errors.
- Verify the function configuration and ensure that it is correct and up-to-date.
- Check the system resources (e.g., CPU, memory, and disk space) to ensure they are not overwhelmed.
- Review the Pulsar cluster’s overall health and performance metrics to identify any underlying issues.
Mitigation #
To mitigate the issue:
- Immediate Action: Pause or roll back the affected Pulsar functions to prevent further errors and data loss.
- Short-term Fix: Investigate and resolve the root cause of the errors, which may involve:
- Updating function code or configuration.
- Adjusting system resources or scaling the Pulsar cluster.
- Implementing retry mechanisms or circuit breakers to handle transient errors.
- Long-term Solution: Implement proactive measures to prevent similar issues in the future, such as:
- Enhancing function monitoring and logging.
- Implementing automated testing and validation for functions.
- Conducting regular Pulsar cluster maintenance and upgrades.
- Post-Incident Activities:
- Perform a thorough post-mortem analysis to identify areas for improvement.
- Update the runbook to reflect new knowledge and best practices.
- Schedule a review of the incident with the teams involved to discuss lessons learned and prevent similar incidents in the future.