NatsHighNumberOfConnections #
NATS server has more than 1000 active connections
Alert Rule
alert: NatsHighNumberOfConnections
annotations:
  description: |-
    NATS server has more than 1000 active connections
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/nats-exporter/natshighnumberofconnections/
  summary: Nats high number of connections (instance {{ $labels.instance }})
expr: gnatsd_connz_num_connections > 1000
for: 5m
labels:
  severity: warning
Meaning #
The NatsHighNumberOfConnections alert fires with severity warning when the gnatsd_connz_num_connections metric for a NATS server stays above 1000 for 5 minutes. It indicates that the server is carrying an unusually large number of client connections, which can lead to higher latency, resource pressure, and, in the worst case, crashes.
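The sustained-threshold condition can be sketched as follows. This is a minimal illustration, not how Prometheus evaluates rules internally: the function name `trailing_seconds_above` is hypothetical, and the input is assumed to be one series from a Prometheus range query for gnatsd_connz_num_connections, that is, a list of `[unix_timestamp, "value"]` pairs.

```python
def trailing_seconds_above(values, threshold=1000.0):
    """Length in seconds of the most recent unbroken run of samples above threshold.

    `values` is a list of [unix_timestamp, "value"] pairs, the shape of one
    series in a Prometheus range-query result (assumed input, see lead-in).
    """
    run_start = None
    for ts, raw in values:
        if float(raw) > threshold:
            # Remember where the current run above the threshold began.
            if run_start is None:
                run_start = ts
        else:
            # Any sample at or below the threshold resets the run.
            run_start = None
    if run_start is None:
        return 0.0
    return float(values[-1][0] - run_start)
```

A result of roughly 300 seconds or more corresponds to the `for: 5m` clause in the rule. The match is only approximate, since Prometheus evaluates the rule on its own evaluation interval.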
Impact #
- Performance degradation: A high number of connections can slow the NATS server down, increasing latency and reducing throughput.
- Resource exhaustion: Each connection consumes memory and a file descriptor, and active connections add CPU and network load, so a large connection count can exhaust system resources.
- Increased error rates: High connection counts can lead to more connection timeouts, failed connection attempts, and lost messages.
- Potential crashes: If left unchecked, a high number of connections can cause the NATS server to crash, leading to service disruption and downtime.
Diagnosis #
To diagnose the root cause of the high number of connections, follow these steps:
- Check the NATS server logs for any errors or warnings related to connection handling.
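- Verify that the NATS server is configured to handle a high volume of connections, for example that its max_connections setting and the operating system's file descriptor limit leave enough headroom.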
- Investigate recent changes to the system, such as new applications or services that may be contributing to the increased connection count.
- Use Prometheus metrics such as gnatsd_connz_num_connections to analyze how the connection count has changed over time and identify anomalies, such as a sudden jump or a steady climb.
- Verify that the NATS server has sufficient system resources (CPU, memory, and network bandwidth) to handle the increased load.
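To find out which clients hold the connections, the steps above can be complemented by the server's HTTP monitoring endpoint. The following is a minimal sketch, assuming monitoring is enabled and reachable at `http://localhost:8222` (an assumed address; adjust to your deployment). The helper names `fetch_connz` and `top_clients` are hypothetical. It reads one page of `/connz` and groups connections by client name and IP.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_connz(base_url="http://localhost:8222", limit=1024):
    """Fetch one page of connection details from the NATS monitoring endpoint.

    base_url is an assumption; use the address of your server's monitoring port.
    """
    with urlopen(f"{base_url}/connz?limit={limit}", timeout=5) as resp:
        return json.load(resp)


def top_clients(connz, n=5):
    """Count connections per (client name, IP) and return the n largest groups."""
    counts = Counter(
        (conn.get("name") or "<unnamed>", conn.get("ip", "<unknown>"))
        for conn in connz.get("connections", [])
    )
    return counts.most_common(n)
```

Calling `top_clients(fetch_connz())` returns the groups with the most connections. A single application or host that dominates the list is a likely candidate for a connection leak or a missing connection reuse. One page may not cover every connection on a busy server, so larger counts may need paging with the endpoint's `offset` parameter.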
Mitigation #
To mitigate the high number of connections, follow these steps:
- Increase the NATS server’s system resources (CPU, memory, and network bandwidth) to handle the increased load.
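- Implement connection limits or throttling to prevent excessive connections, for example by setting max_connections in the NATS server configuration to a value the host can sustain.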
- Optimize NATS server configuration to improve performance and reduce latency.
- Identify and troubleshoot any application or service that is causing the high connection count.
- Consider load balancing or clustering NATS servers to distribute the load and improve overall system resilience.
- Monitor the NATS server’s performance and connection count regularly to prevent similar issues in the future.
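For the ongoing monitoring step, one option is to track how close the server is to its configured connection limit. This is a minimal sketch, assuming you have already fetched and parsed the JSON from the server's `/varz` monitoring endpoint, which reports `connections` and `max_connections`. The function name `connection_headroom` is hypothetical.

```python
def connection_headroom(varz):
    """Return the unused fraction of max_connections from a parsed /varz payload."""
    limit = varz["max_connections"]
    current = varz["connections"]
    # 1.0 means no connections in use, 0.0 means the limit has been reached.
    return 1 - current / limit
```

A headroom that keeps shrinking suggests either raising the limit along with the host's resources, or spreading clients across more servers.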