nats-exporter

Below is a structured runbook for the alerts defined in the provided Prometheus alert rules file. Each alert includes sections for meaning, impact, diagnosis, and mitigation.


Runbook for NATS Alerts #

Below is the updated runbook with links to the relevant runbook pages for each alert:


1. NatsHighConnectionCount #

  • Meaning: The number of NATS connections exceeds 100 for more than 3 minutes.
  • Impact: Could indicate resource exhaustion or potential misuse of the system.
  • Diagnosis:
    • Check the number of connections using gnatsd_varz_connections.
    • Identify which clients are connected and their connection rates.
  • Mitigation:
    • Ensure legitimate usage patterns.
    • Increase server capacity or adjust connection limits if required.
    • Investigate and terminate any unauthorized connections.
  • Runbook Link: NatsHighConnectionCount

2. NatsHighPendingBytes #

  • Meaning: Pending bytes in NATS connections exceed 100,000 for more than 3 minutes.
  • Impact: Potential delays in message delivery, leading to application performance issues.
  • Diagnosis:
    • Inspect gnatsd_connz_pending_bytes for the affected instance.
    • Identify publishers with high message rates or slow consumers.
  • Mitigation:
    • Tune client configurations to optimize publishing and consumption rates.
    • Scale consumers or increase processing capacity.
  • Runbook Link: NatsHighPendingBytes

3. NatsHighSubscriptionsCount #

  • Meaning: The number of subscriptions exceeds 50 for more than 3 minutes.
  • Impact: May indicate suboptimal client behavior or over-subscription to channels.
  • Diagnosis:
    • Examine gnatsd_connz_subscriptions for the instance.
    • Check for duplicate or unnecessary subscriptions.
  • Mitigation:
    • Optimize application subscription logic.
    • Reduce redundant or overlapping subscriptions.
  • Runbook Link: NatsHighSubscriptionsCount

4. NatsHighRoutesCount #

  • Meaning: The number of routes in the cluster exceeds 10 for more than 3 minutes.
  • Impact: Could lead to increased resource usage and complexity in routing messages.
  • Diagnosis:
    • Check gnatsd_varz_routes for active routes.
    • Review the cluster topology for misconfigurations or unexpected peers.
  • Mitigation:
    • Simplify the cluster design by reducing unnecessary routes.
    • Review and correct any misconfigured peer connections.
  • Runbook Link: NatsHighRoutesCount

5. NatsHighMemoryUsage #

  • Meaning: Memory usage of the NATS server exceeds 200 MB for 5 minutes.
  • Impact: May result in degraded server performance or crashes.
  • Diagnosis:
    • Analyze memory usage with gnatsd_varz_mem.
    • Check for memory-intensive workloads or leaks.
  • Mitigation:
    • Optimize message size and retention policies.
    • Restart the server if necessary to clear memory leaks.
  • Runbook Link: NatsHighMemoryUsage

6. NatsSlowConsumers #

  • Meaning: Slow consumers are detected for more than 3 minutes.
  • Impact: May delay or drop messages, affecting downstream applications.
  • Diagnosis:
    • Identify slow consumers using gnatsd_varz_slow_consumers.
    • Investigate network or processing bottlenecks.
  • Mitigation:
    • Increase consumer processing capacity.
    • Optimize message handling to reduce delays.
  • Runbook Link: NatsSlowConsumers

7. NatsHighCpuUsage #

  • Meaning: CPU usage exceeds 80% for 5 minutes.
  • Impact: Could lead to degraded performance or timeouts.
  • Diagnosis:
    • Monitor rate(gnatsd_varz_cpu[5m]) for sustained high usage.
    • Identify and optimize CPU-intensive workloads.
  • Mitigation:
    • Distribute workloads across multiple nodes.
    • Upgrade server hardware if necessary.

8. NatsHighJetstreamStoreUsage #

  • Meaning: JetStream store usage exceeds 80% capacity for 5 minutes.
  • Impact: Risk of message loss due to storage exhaustion.
  • Diagnosis:
    • Review gnatsd_varz_jetstream_stats_storage and gnatsd_varz_jetstream_config_max_storage.
    • Check for high storage usage patterns.
  • Mitigation:
    • Increase storage capacity.
    • Implement retention policies to manage usage.

9. NatsHighJetstreamMemoryUsage #

  • Meaning: JetStream memory usage exceeds 80% of the configured limit for 5 minutes.
  • Impact: May result in message processing slowdowns or failures.
  • Diagnosis:
    • Check gnatsd_varz_jetstream_stats_memory and gnatsd_varz_jetstream_config_max_memory.
    • Look for memory spikes due to high message throughput or retention policies.
  • Mitigation:
    • Increase JetStream memory allocation.
    • Optimize message sizes and retention policies.

10. NatsHighNumberOfSubscriptions #

  • Meaning: The number of subscriptions exceeds 1,000 for more than 5 minutes.
  • Impact: May lead to increased resource usage and delays in message delivery.
  • Diagnosis:
    • Monitor gnatsd_connz_subscriptions for the instance.
    • Check if any clients are creating excessive subscriptions.
  • Mitigation:
    • Limit the number of subscriptions per client.
    • Optimize subscription patterns to prevent duplication.

11. NatsTooManyErrors #

  • Meaning: API errors in JetStream increase for more than 5 minutes.
  • Impact: Indicates potential instability or misconfiguration in JetStream.
  • Diagnosis:
    • Inspect increase(gnatsd_varz_jetstream_stats_api_errors[5m]).
    • Review logs for error details and patterns.
  • Mitigation:
    • Address configuration issues.
    • Resolve application-level errors causing API failures.

12. NatsJetstreamConsumersExceeded #

  • Meaning: The number of JetStream consumers exceeds 100 for more than 5 minutes.
  • Impact: Could lead to excessive resource usage or message processing bottlenecks.
  • Diagnosis:
    • Examine sum(gnatsd_varz_jetstream_stats_accounts) for consumer counts.
    • Identify which consumers are contributing to the high count.
  • Mitigation:
    • Optimize consumer creation logic in applications.
    • Distribute workload among fewer, more efficient consumers.

13. NatsFrequentAuthenticationTimeouts #

  • Meaning: More than 5 authentication timeouts occur within 5 minutes.
  • Impact: Indicates issues with authentication, potentially blocking connections.
  • Diagnosis:
    • Review increase(gnatsd_varz_auth_timeout[5m]).
    • Check authentication server or configuration logs for anomalies.
  • Mitigation:
    • Address authentication server issues.
    • Adjust authentication timeout settings.

14. NatsMaxPayloadSizeExceeded #

  • Meaning: Payload size exceeds the configured maximum of 1MB for 5 minutes.
  • Impact: May lead to message rejection or delivery failures.
  • Diagnosis:
    • Monitor max(gnatsd_varz_max_payload).
    • Identify clients or applications sending oversized messages.
  • Mitigation:
    • Reconfigure clients to respect payload size limits.
    • Increase maximum payload size if feasible.

15. NatsLeafNodeConnectionIssue #

  • Meaning: No leaf node connections have been established within 5 minutes.
  • Impact: Indicates potential issues with leaf node connectivity, leading to message routing problems.
  • Diagnosis:
    • Inspect increase(gnatsd_varz_leafnodes[5m]).
    • Verify network connectivity and leaf node configurations.
  • Mitigation:
    • Resolve network or configuration issues preventing leaf node connections.
    • Restart affected nodes if necessary.

16. NatsMaxPingOperationsExceeded #

  • Meaning: Ping operations exceed 50 for more than 5 minutes.
  • Impact: Indicates potential instability in connection health checks.
  • Diagnosis:
    • Review gnatsd_varz_ping_max for the instance.
    • Check logs for ping operation patterns.
  • Mitigation:
    • Optimize client configurations to reduce ping frequency.
    • Ensure stable network conditions.

17. NatsWriteDeadlineExceeded #

  • Meaning: Write deadlines are exceeded, indicating potential message delivery issues.
  • Impact: Could lead to dropped or delayed messages.
  • Diagnosis:
    • Monitor gnatsd_varz_write_deadline.
    • Investigate client write speeds and network conditions.
  • Mitigation:
    • Increase write deadline limits.
    • Optimize client-side configurations and network conditions.