NatsHighPendingBytes

High number of NATS pending bytes

Alert Rule

alert: NatsHighPendingBytes
annotations:
  description: |-
    High number of NATS pending bytes ({{ $value }}) for {{ $labels.instance }}
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/nats-exporter/natshighpendingbytes/
  summary: Nats high pending bytes (instance {{ $labels.instance }})
expr: gnatsd_connz_pending_bytes > 100000
for: 3m
labels:
  severity: warning

alert: NatsHighPendingBytes
annotations:
  description: |-
    NATS server has more than 100,000 pending bytes
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/nats-exporter/natshighpendingbytes/
  summary: Nats high pending bytes (instance {{ $labels.instance }})
expr: gnatsd_connz_pending_bytes > 100000
for: 5m
labels:
  severity: warning

Meaning

This alert fires when gnatsd_connz_pending_bytes stays above 100,000 bytes for 3 minutes (the second rule variant listed above uses 5 minutes). Pending bytes are data the server has buffered for delivery that clients have not yet received, so a sustained high value points to a bottleneck: messages are arriving faster than consumers are reading them.
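The interaction between the threshold and the for: window can be sketched in a few lines. This is a simplified model of how a Prometheus-style rule moves from inactive to pending to firing, not Prometheus's actual implementation. The 30-second evaluation interval and the function name alert_state are assumptions made for illustration.

```python
from datetime import timedelta

THRESHOLD_BYTES = 100_000              # from the rule's expr
FOR_DURATION = timedelta(minutes=3)    # from the rule's "for: 3m"


def alert_state(samples, interval=timedelta(seconds=30)):
    """Return 'inactive', 'pending' or 'firing' after the last sample.

    samples: pending-byte readings, one per rule evaluation, oldest first.
    interval: assumed evaluation interval (not stated in the runbook).
    """
    active_for = None  # how long the condition has held without a break
    for value in samples:
        if value > THRESHOLD_BYTES:
            # The first breach starts the clock; later breaches extend it.
            active_for = timedelta(0) if active_for is None else active_for + interval
        else:
            # A single reading at or below the threshold resets the clock.
            active_for = None
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= FOR_DURATION else "pending"
```

Under these assumptions, a short spike that falls back below the threshold before the window elapses never fires the alert, which is why a firing alert indicates a sustained backlog rather than a burst.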

Impact

High pending bytes may cause delays in message delivery, impacting downstream systems or applications relying on NATS for real-time data.
Prolonged high pending bytes increase memory usage on the NATS server. If a connection's backlog keeps growing, the server may close that connection as a slow consumer, and messages buffered for that client are lost.
A high value may indicate a slow or unresponsive client, insufficient server resources, or network congestion.

Diagnosis

Check the Alert Details:
- Review the alert annotations to identify the affected instance and the current value of pending bytes.
- Example:
  - Instance: {{ $labels.instance }}
  - Pending Bytes: {{ $value }}
Inspect Client Connections:
- Query the NATS monitoring endpoint (for example /connz, which reports pending_bytes per connection) or use your monitoring tools to view detailed connection statistics.
- Identify clients with high pending bytes and verify if they are slow or unresponsive.
Review Resource Utilization:
- Check the NATS server’s CPU, memory, and network usage to ensure sufficient resources are available.
Check Application Logs:
- Look for errors or warnings in the logs of applications connected to NATS.
- Identify any issues with message processing or acknowledgment.
Network Analysis:
- Analyze network performance between the NATS server and clients to identify potential bottlenecks.
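The client-connection check above can be scripted against the server's HTTP monitoring endpoint. The sketch below is a minimal example, assuming monitoring is enabled and reachable. The base URL (localhost on port 8222) and the function names are placeholders, not values taken from this runbook. It relies on the /connz endpoint's sort=pending option and on the pending_bytes field of each connection entry.

```python
import json
from urllib.request import urlopen

THRESHOLD_BYTES = 100_000  # same threshold as the alert expr


def slow_connections(connz, threshold=THRESHOLD_BYTES):
    """Return (cid, name, ip, pending_bytes) tuples above threshold, largest first.

    connz: the decoded JSON document returned by the /connz endpoint.
    """
    rows = [
        (c.get("cid"), c.get("name", ""), c.get("ip", ""), c.get("pending_bytes", 0))
        for c in connz.get("connections", [])
    ]
    over = [row for row in rows if row[3] > threshold]
    return sorted(over, key=lambda row: row[3], reverse=True)


def fetch_connz(base_url="http://localhost:8222", limit=20):
    """Fetch connection stats sorted by pending data.

    base_url is a placeholder; point it at your server's monitoring address.
    """
    url = f"{base_url}/connz?sort=pending&limit={limit}"
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

Calling slow_connections(fetch_connz()) against the affected instance returns the connections that exceed the alert threshold. The name and ip fields then help map each one back to the owning application.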

Mitigation

Address Client Issues:
- Restart or fix any unresponsive or slow clients causing the high pending bytes.
- Optimize client applications to process messages more efficiently.
Scale NATS Infrastructure:
- Add more NATS server instances or increase resources (CPU, memory) for existing servers.
- Consider implementing clustering for better load distribution.
Optimize Message Flow:
- Reduce the volume of messages published to NATS if unnecessary or redundant data is being sent.
- Adjust NATS server and client configuration settings, such as buffer sizes, to better handle the message load.
Network Improvements:
- Investigate and resolve any network-related issues, such as high latency or packet loss, between the NATS server and clients.
Clear Message Backlog:
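- If safe, discard the backlog for non-critical subjects. Pending bytes are buffered per connection, so in practice this usually means disconnecting the slow client so the server drops the data queued for it, rather than purging a topic or queue.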
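For the client-side optimization mentioned above, one common pattern is to keep the receive callback cheap and hand the slow work to a small pool of workers. The sketch below uses only Python's asyncio and no NATS client library. The names handle, on_message, and process_all are hypothetical, and the worker count and queue size are arbitrary starting points rather than recommendations.

```python
import asyncio


async def handle(payload: bytes) -> int:
    """Stand-in for the real, slow processing step (hypothetical)."""
    await asyncio.sleep(0.01)
    return len(payload)


async def process_all(payloads, workers=4, max_queued=100):
    """Process payloads with a bounded queue and a fixed number of workers."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_queued)
    results = []

    async def on_message(payload: bytes) -> None:
        # Receive path: enqueue only, so the reader is not held up by slow
        # work. put() waits when the queue is full, which bounds memory use.
        await queue.put(payload)

    async def worker() -> None:
        while True:
            payload = await queue.get()
            try:
                results.append(await handle(payload))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for payload in payloads:
        await on_message(payload)
    await queue.join()                 # wait until every payload is handled
    for task in tasks:
        task.cancel()                  # workers loop forever; stop them
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

Running several workers in parallel means messages may finish out of order, so this only suits workloads that do not depend on strict per-subject ordering.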

Preventative Measures

Implement proactive monitoring for NATS metrics such as pending bytes, message rates, and resource utilization.
Establish alert thresholds and response plans tailored to your system’s load patterns.
Regularly test and validate client and server configurations for optimal performance.
Ensure adequate resource allocation for NATS servers and clients based on anticipated workloads.
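One way to tailor the threshold to observed load is to derive it from historical pending-byte samples. The sketch below is an illustration under stated assumptions, not a recommended policy. It takes a nearest-rank percentile of past readings, adds headroom, and never goes below the rule's current 100,000-byte threshold. The function name and the default percentile and headroom values are hypothetical.

```python
import math


def suggest_threshold(samples, percentile=99.0, headroom=1.5, floor=100_000):
    """Suggest an alert threshold (bytes) from historical pending-byte samples.

    samples: past readings of the pending-bytes metric, in any order.
    percentile: which percentile of normal load to treat as the baseline.
    headroom: multiplier applied on top of the baseline.
    floor: lower bound; defaults to the rule's current threshold.
    """
    if not samples:
        return floor
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` percent of the samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    baseline = ordered[rank - 1]
    return max(floor, int(baseline * headroom))
```

For example, 100 samples evenly spaced from 1,000 to 100,000 bytes have a 99th-percentile baseline of 99,000, so the suggested threshold with 1.5x headroom is 148,500 bytes.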