KafkaConsumersGroup #
Kafka consumers group
Alert Rule
alert: KafkaConsumersGroup
annotations:
  description: |-
    Kafka consumers group
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/danielqsj-kafka-exporter/kafkaconsumersgroup/
  summary: Kafka consumers group (instance {{ $labels.instance }})
expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50
for: 1m
labels:
  severity: critical
Meaning #
This alert fires when the total lag of a Kafka consumer group stays above 50 messages for at least 1 minute. The lag is computed by summing the kafka_consumergroup_lag metric by consumergroup, so it covers every topic and partition the group consumes. Sustained lag means the consumers in the group are not keeping up with the producers. This increases processing latency and, if messages reach the topic's retention limit before they are consumed, can lead to message loss.
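To make the aggregation concrete, here is a minimal Python sketch of what the expression computes: per-partition lag samples are summed per consumergroup and each total is compared against the threshold of 50. The group, topic, and partition names are made up for illustration, and the `for: 1m` hold period is not modeled.

```python
from collections import defaultdict

LAG_THRESHOLD = 50  # same threshold as the alert expression


def lag_by_group(samples):
    """Sum per-series lag into one total per consumer group.

    Each sample is (labels, value), mirroring one kafka_consumergroup_lag series.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["consumergroup"]] += value
    return dict(totals)


def firing_groups(samples, threshold=LAG_THRESHOLD):
    """Return the groups whose summed lag is strictly above the threshold."""
    totals = lag_by_group(samples)
    return sorted(group for group, lag in totals.items() if lag > threshold)


# Hypothetical sample data: these names and values are illustrative only.
samples = [
    ({"consumergroup": "billing", "topic": "invoices", "partition": "0"}, 30),
    ({"consumergroup": "billing", "topic": "invoices", "partition": "1"}, 25),
    ({"consumergroup": "search", "topic": "docs", "partition": "0"}, 50),
]

print(firing_groups(samples))  # ['billing']
```

Note that the comparison is strict: a group sitting at exactly 50 does not match, and no single partition needs to exceed 50 for the group total to do so.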
Impact #
If left unchecked, this issue can lead to:
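- Message loss if unconsumed messages expire under the topic's retention policy, or duplicate processing if overloaded consumers are rebalanced before committing their offsets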
- Increased latency in processing messages
- Unprocessed messages accumulating in the Kafka topic
- Downstream systems experiencing errors or delays
Diagnosis #
To diagnose the issue, follow these steps:
- Check the kafka_consumergroup_lag metric to identify which consumergroup is lagging, and on which topics and partitions the lag is concentrated.
- Investigate the consumer instance(s) in the group to determine if they are experiencing high CPU usage, memory issues, or network connectivity problems.
- Verify that the consumer configuration is correct, including topics, partitions, and offset management.
- Review the Kafka broker logs for any errors or issues that may be contributing to the lag.
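The first diagnosis step can be sketched with the Prometheus HTTP API. The example below is a rough illustration, not a definitive tool: it assumes Prometheus is reachable at `http://localhost:9090` and that the exporter attaches a `topic` label to kafka_consumergroup_lag. The payload at the bottom is hypothetical sample data in the shape of an instant-query response.

```python
import json
import urllib.parse
import urllib.request

# Assumption: adjust this address for your environment.
PROMETHEUS_URL = "http://localhost:9090"
# Break the lag down by topic as well as by group to see where it concentrates.
QUERY = "sum(kafka_consumergroup_lag) by (consumergroup, topic)"


def parse_lag(payload):
    """Turn an instant-query response into (group, topic, lag) rows, worst first."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for series in payload["data"]["result"]:
        metric = series["metric"]
        _timestamp, value = series["value"]  # the sample value arrives as a string
        rows.append(
            (metric.get("consumergroup", ""), metric.get("topic", ""), float(value))
        )
    return sorted(rows, key=lambda row: row[2], reverse=True)


def fetch_lag(base_url=PROMETHEUS_URL, query=QUERY, timeout=10):
    """Run the query against Prometheus and return the parsed rows."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_lag(json.load(response))


# Hypothetical response used to exercise the parser without a live server.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"consumergroup": "search", "topic": "docs"},
             "value": [1700000000, "12"]},
            {"metric": {"consumergroup": "billing", "topic": "invoices"},
             "value": [1700000000, "55"]},
        ],
    },
}

for group, topic, lag in parse_lag(sample_payload):
    print(f"{group}\t{topic}\t{lag:.0f}")
```

Against a live server, `fetch_lag()` returns the same kind of rows. The entries at the top of the list are the topics to investigate first.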
Mitigation #
To mitigate the issue, follow these steps:
- Check and adjust the consumer instance resources (e.g., increase CPU or memory) to ensure they can keep up with the message volume.
- Verify that the consumer configuration is tuned for throughput, including batch size (max.poll.records) and fetch size (fetch.min.bytes, max.partition.fetch.bytes).
- Implement backpressure mechanisms, such as pause and resume, to control the message flow and prevent overload.
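- Consider redistributing the Kafka topic partitions so the load is spread more evenly across consumers. Adding consumers only helps up to the number of partitions, because each partition is read by at most one consumer in the group.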
- If necessary, adjust the alert threshold or time window to better suit the specific use case and environment.
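The pause and resume idea from the list above can be sketched as a small watermark controller. This is an illustrative design, not a specific client's API: `BackpressureGate` and its watermarks are hypothetical names, and the returned actions would need to be wired to whatever pause and resume calls your Kafka client provides.

```python
class BackpressureGate:
    """Decide when to pause and resume fetching based on local queue depth."""

    def __init__(self, high_watermark, low_watermark):
        if not 0 <= low_watermark < high_watermark:
            raise ValueError("need 0 <= low_watermark < high_watermark")
        self.high = high_watermark
        self.low = low_watermark
        self.paused = False

    def update(self, queue_depth):
        """Return 'pause', 'resume', or None for the current queue depth."""
        if not self.paused and queue_depth >= self.high:
            self.paused = True
            return "pause"  # stop fetching until the backlog drains
        if self.paused and queue_depth <= self.low:
            self.paused = False
            return "resume"  # backlog drained, safe to fetch again
        return None  # no state change


# Hypothetical watermarks: tune them to the worker's real capacity.
gate = BackpressureGate(high_watermark=1000, low_watermark=200)
actions = [gate.update(depth) for depth in (100, 1000, 1500, 600, 200, 50)]
print(actions)  # [None, 'pause', None, None, 'resume', None]
```

Using two watermarks rather than one avoids flapping between pause and resume when the queue hovers near a single limit. Keep in mind that pausing protects the consumer from overload but does not reduce lag by itself, so it complements rather than replaces added capacity.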
Remember to monitor the situation and adjust the mitigation strategies as needed to prevent further occurrences of this alert.