This runbook covers the alerts defined in the Prometheus alert rules file for NATS. Each alert has sections for meaning, impact, diagnosis, and mitigation.
Runbook for NATS Alerts
1. NatsHighConnectionCount
- Meaning: The number of NATS connections exceeds 100 for more than 3 minutes.
- Impact: Could indicate resource exhaustion or potential misuse of the system.
- Diagnosis:
- Check the number of connections using `gnatsd_varz_connections`.
- Identify which clients are connected and their connection rates.
- Mitigation:
- Ensure legitimate usage patterns.
- Increase server capacity or adjust connection limits if required.
- Investigate and terminate any unauthorized connections.
- Runbook Link: NatsHighConnectionCount
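
The connection check in the diagnosis step can be sketched as the PromQL queries below. The threshold of 100 is taken from the alert description; the assumption that each NATS server appears as its own series (for example via the standard `instance` label) is not confirmed by this runbook and depends on your scrape configuration.

```promql
# Servers currently above the alert threshold of 100 connections.
gnatsd_varz_connections > 100

# All servers ranked by connection count, highest first.
sort_desc(gnatsd_varz_connections)
```

Run each query on its own in the Prometheus expression browser.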
2. NatsHighPendingBytes
- Meaning: Pending bytes in NATS connections exceed 100,000 for more than 3 minutes.
- Impact: Potential delays in message delivery, leading to application performance issues.
- Diagnosis:
- Inspect `gnatsd_connz_pending_bytes` for the affected instance.
- Identify publishers with high message rates or slow consumers.
- Mitigation:
- Tune client configurations to optimize publishing and consumption rates.
- Scale consumers or increase processing capacity.
- Runbook Link: NatsHighPendingBytes
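
To narrow down where the backlog sits, queries like the following may be a useful starting point. They are a sketch only: which labels distinguish the series (server, connection, or both) depends on how the exporter is configured, which this runbook does not state.

```promql
# The five series with the most pending bytes right now.
topk(5, gnatsd_connz_pending_bytes)

# Peak pending bytes per series over the last 15 minutes,
# useful for spotting short bursts between scrapes of the dashboard.
max_over_time(gnatsd_connz_pending_bytes[15m])
```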
3. NatsHighSubscriptionsCount
- Meaning: The number of subscriptions exceeds 50 for more than 3 minutes.
- Impact: May indicate suboptimal client behavior or over-subscription to channels.
- Diagnosis:
- Examine `gnatsd_connz_subscriptions` for the instance.
- Check for duplicate or unnecessary subscriptions.
- Mitigation:
- Optimize application subscription logic.
- Reduce redundant or overlapping subscriptions.
- Runbook Link: NatsHighSubscriptionsCount
4. NatsHighRoutesCount
- Meaning: The number of routes in the cluster exceeds 10 for more than 3 minutes.
- Impact: Could lead to increased resource usage and complexity in routing messages.
- Diagnosis:
- Check `gnatsd_varz_routes` for active routes.
- Review the cluster topology for misconfigurations or unexpected peers.
- Mitigation:
- Simplify the cluster design by reducing unnecessary routes.
- Review and correct any misconfigured peer connections.
- Runbook Link: NatsHighRoutesCount
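
When reviewing routes, it can help to distinguish a stable but large route count from one that keeps changing. The queries below are a sketch of that check; the one-hour window is an arbitrary choice for illustration, not a value from the alert rules.

```promql
# Servers above the alert threshold of 10 routes.
gnatsd_varz_routes > 10

# How many times the route count changed in the last hour.
# A high value hints at peers repeatedly connecting and disconnecting.
changes(gnatsd_varz_routes[1h])
```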
5. NatsHighMemoryUsage
- Meaning: Memory usage of the NATS server exceeds 200 MB for 5 minutes.
- Impact: May result in degraded server performance or crashes.
- Diagnosis:
- Analyze memory usage with `gnatsd_varz_mem`.
- Check for memory-intensive workloads or leaks.
- Mitigation:
- Optimize message size and retention policies.
- Restart the server if necessary to clear memory leaks.
- Runbook Link: NatsHighMemoryUsage
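
One way to tell a leak from a temporary spike is to look at the trend rather than the current value. The sketch below assumes `gnatsd_varz_mem` is reported in bytes, which this runbook does not state explicitly; the 30-minute window is likewise an illustrative choice.

```promql
# Memory per server in megabytes (assumes the metric is in bytes).
gnatsd_varz_mem / 1e6

# Per-second growth estimated over the last 30 minutes.
# A slope that stays positive across several windows may point to a leak.
deriv(gnatsd_varz_mem[30m])
```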
6. NatsSlowConsumers
- Meaning: Slow consumers are detected for more than 3 minutes.
- Impact: May delay or drop messages, affecting downstream applications.
- Diagnosis:
- Identify slow consumers using `gnatsd_varz_slow_consumers`.
- Investigate network or processing bottlenecks.
- Mitigation:
- Increase consumer processing capacity.
- Optimize message handling to reduce delays.
- Runbook Link: NatsSlowConsumers
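
If `gnatsd_varz_slow_consumers` behaves as a cumulative count (an assumption to verify against your exporter), looking at how much it grew recently is usually more informative than its absolute value. A minimal sketch:

```promql
# Slow consumer events added over the last 10 minutes, per series.
increase(gnatsd_varz_slow_consumers[10m])

# Series with the largest pending backlog, to correlate with the above.
topk(3, gnatsd_connz_pending_bytes)
```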
7. NatsHighCpuUsage
- Meaning: CPU usage exceeds 80% for 5 minutes.
- Impact: Could lead to degraded performance or timeouts.
- Diagnosis:
- Monitor `rate(gnatsd_varz_cpu[5m])` for sustained high usage.
- Identify and optimize CPU-intensive workloads.
- Mitigation:
- Distribute workloads across multiple nodes.
- Upgrade server hardware if necessary.
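
If `gnatsd_varz_cpu` is exposed as a gauge holding the current CPU percentage (an assumption worth checking, since the runbook itself uses `rate(...)`), averaging it over the alert window may be easier to interpret. The following is a sketch under that assumption, not the expression used by the alert rule.

```promql
# Current CPU reading per server.
gnatsd_varz_cpu

# Servers whose average reading over the last 5 minutes exceeds 80.
avg_over_time(gnatsd_varz_cpu[5m]) > 80
```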
8. NatsHighJetstreamStoreUsage
- Meaning: JetStream store usage exceeds 80% capacity for 5 minutes.
- Impact: Risk of message loss due to storage exhaustion.
- Diagnosis:
- Review `gnatsd_varz_jetstream_stats_storage` and `gnatsd_varz_jetstream_config_max_storage`.
- Check for high storage usage patterns.
- Mitigation:
- Increase storage capacity.
- Implement retention policies to manage usage.
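
The 80% condition can be reproduced by dividing the two metrics named in the diagnosis. This is a sketch that assumes both series carry identical label sets so that PromQL can match them one to one; if the configured limit is reported as 0, the ratio will come out as `+Inf` or `NaN` and should be ignored.

```promql
# Fraction of the configured JetStream storage limit in use (0 to 1).
gnatsd_varz_jetstream_stats_storage
  / gnatsd_varz_jetstream_config_max_storage

# Only the servers above the 80% threshold.
(gnatsd_varz_jetstream_stats_storage
  / gnatsd_varz_jetstream_config_max_storage) > 0.8
```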
9. NatsHighJetstreamMemoryUsage
- Meaning: JetStream memory usage exceeds 80% of the configured limit for 5 minutes.
- Impact: May result in message processing slowdowns or failures.
- Diagnosis:
- Check `gnatsd_varz_jetstream_stats_memory` and `gnatsd_varz_jetstream_config_max_memory`.
- Look for memory spikes due to high message throughput or retention policies.
- Mitigation:
- Increase JetStream memory allocation.
- Optimize message sizes and retention policies.
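
To see how close JetStream is to its memory limit and whether usage is spiky, queries along these lines may help. They assume the two metrics share the same labels; the one-hour window is an illustrative choice.

```promql
# JetStream memory in use as a percentage of the configured limit.
100 * gnatsd_varz_jetstream_stats_memory
  / gnatsd_varz_jetstream_config_max_memory

# Highest JetStream memory usage seen in the last hour, to reveal spikes.
max_over_time(gnatsd_varz_jetstream_stats_memory[1h])
```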
10. NatsHighNumberOfSubscriptions
- Meaning: The number of subscriptions exceeds 1,000 for more than 5 minutes.
- Impact: May lead to increased resource usage and delays in message delivery.
- Diagnosis:
- Monitor `gnatsd_connz_subscriptions` for the instance.
- Check if any clients are creating excessive subscriptions.
- Mitigation:
- Limit the number of subscriptions per client.
- Optimize subscription patterns to prevent duplication.
11. NatsTooManyErrors
- Meaning: API errors in JetStream increase for more than 5 minutes.
- Impact: Indicates potential instability or misconfiguration in JetStream.
- Diagnosis:
- Inspect `increase(gnatsd_varz_jetstream_stats_api_errors[5m])`.
- Review logs for error details and patterns.
- Mitigation:
- Address configuration issues.
- Resolve application-level errors causing API failures.
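
Comparing the short-term error increase with a longer window can show whether the errors are a burst or a steady background level. The sketch below reuses the expression from the diagnosis step; the one-hour window is an assumption chosen for illustration.

```promql
# JetStream API errors added in the last 5 minutes (the alert's own window).
increase(gnatsd_varz_jetstream_stats_api_errors[5m])

# Errors added over the last hour, for comparison with the value above.
increase(gnatsd_varz_jetstream_stats_api_errors[1h])
```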
12. NatsJetstreamConsumersExceeded
- Meaning: The number of JetStream consumers exceeds 100 for more than 5 minutes.
- Impact: Could lead to excessive resource usage or message processing bottlenecks.
- Diagnosis:
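- Examine `sum(gnatsd_varz_jetstream_stats_accounts)`, the expression the alert evaluates. The metric name suggests it counts JetStream accounts rather than consumers, so confirm what the alert is actually measuring.
- Identify which consumers are contributing to the high count.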
- Mitigation:
- Optimize consumer creation logic in applications.
- Distribute workload among fewer, more efficient consumers.
13. NatsFrequentAuthenticationTimeouts
- Meaning: More than 5 authentication timeouts occur within 5 minutes.
- Impact: Indicates issues with authentication, potentially blocking connections.
- Diagnosis:
- Review `increase(gnatsd_varz_auth_timeout[5m])`.
- Check authentication server or configuration logs for anomalies.
- Mitigation:
- Address authentication server issues.
- Adjust authentication timeout settings.
14. NatsMaxPayloadSizeExceeded
- Meaning: Payload size exceeds the configured maximum of 1 MB for 5 minutes.
- Impact: May lead to message rejection or delivery failures.
- Diagnosis:
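- Monitor `max(gnatsd_varz_max_payload)`. Note that this metric reflects the server's `max_payload` setting rather than the size of individual messages.
- Identify clients or applications sending oversized messages.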
- Mitigation:
- Reconfigure clients to respect payload size limits.
- Increase maximum payload size if feasible.
15. NatsLeafNodeConnectionIssue
- Meaning: No leaf node connections have been established within 5 minutes.
- Impact: Indicates potential issues with leaf node connectivity, leading to message routing problems.
- Diagnosis:
- Inspect `increase(gnatsd_varz_leafnodes[5m])`.
- Verify network connectivity and leaf node configurations.
- Mitigation:
- Resolve network or configuration issues preventing leaf node connections.
- Restart affected nodes if necessary.
16. NatsMaxPingOperationsExceeded
- Meaning: Ping operations exceed 50 for more than 5 minutes.
- Impact: Indicates potential instability in connection health checks.
- Diagnosis:
- Review `gnatsd_varz_ping_max` for the instance.
- Check logs for ping operation patterns.
- Mitigation:
- Optimize client configurations to reduce ping frequency.
- Ensure stable network conditions.
17. NatsWriteDeadlineExceeded
- Meaning: Write deadlines are exceeded, indicating potential message delivery issues.
- Impact: Could lead to dropped or delayed messages.
- Diagnosis:
- Monitor `gnatsd_varz_write_deadline`.
- Investigate client write speeds and network conditions.
- Mitigation:
- Increase write deadline limits.
- Optimize client-side configurations and network conditions.