NatsHighCpuUsage
NATS server is using more than 80% CPU for the last 5 minutes
Alert Rule
alert: NatsHighCpuUsage
annotations:
  description: |-
    NATS server is using more than 80% CPU for the last 5 minutes
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/nats-exporter/natshighcpuusage/
  summary: Nats high CPU usage (instance {{ $labels.instance }})
expr: rate(gnatsd_varz_cpu[5m]) > 0.8
for: 5m
labels:
  severity: warning
Meaning
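This alert fires when the rule expression, rate(gnatsd_varz_cpu[5m]) > 0.8, has been true continuously for 5 minutes (the rule's "for: 5m" clause). The rule treats this as the NATS server using more than 80% CPU. Sustained high CPU means the server is under heavy load, which can degrade performance, slow message processing, and in severe cases crash the server.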
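The way a "for" clause delays firing can be sketched in a few lines of Python. This is a simplified model, not Prometheus's implementation: the function name `alert_state`, the sample format, and the default values are assumptions made for illustration, and the samples stand in for the already-evaluated value of the rule expression at each evaluation time.

```python
from typing import Iterable, Tuple


def alert_state(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 0.8,
    for_seconds: float = 300.0,
) -> str:
    """Return 'inactive', 'pending', or 'firing' after the last sample.

    samples: time-ordered (timestamp_seconds, expression_value) pairs.
    The alert becomes pending when the value first exceeds the threshold
    and fires once it has stayed above it for at least for_seconds.
    Any sample at or below the threshold resets the alert.
    """
    active_since = None  # timestamp at which the condition became true
    state = "inactive"
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held = ts - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            active_since = None
            state = "inactive"
    return state
```

Under this model, a single dip below the threshold restarts the 5-minute clock, so short CPU spikes leave the alert pending without ever firing.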
Impact
High CPU usage on the NATS server can have several impacts on the system:
- Slower message processing and increased latency
- Decreased throughput and system performance
- Increased risk of server crashes and downtime
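- Potential message loss: core NATS delivers messages at most once, so subscribers that fall behind or are disconnected by an overloaded server can miss messages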
Diagnosis
To diagnose the root cause of high CPU usage on the NATS server, follow these steps:
- Check the NATS server logs for any errors or warnings that may indicate the cause of high CPU usage.
- Verify that the server's memory usage is stable; memory that grows steadily over time may indicate a leak or a related problem.
- Check the system metrics for any signs of resource starvation (e.g., low memory, high disk usage).
- Investigate any recent changes to the NATS configuration, code, or dependencies that may be contributing to the high CPU usage.
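- Review the NATS server's monitoring statistics (e.g., the /varz, /connz, and /subsz endpoints on the monitoring port) to identify bottlenecks such as slow consumers or unusually busy connections.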
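The statistics review above can be sketched as a small Python helper that reads the server's HTTP monitoring endpoints. It assumes monitoring is enabled and reachable at `http://localhost:8222` (adjust `MONITOR_URL` for your deployment); the function names `fetch_json`, `summarize_varz`, and `busiest_connections` are hypothetical helpers, not part of any NATS tooling.

```python
import json
from urllib.request import urlopen

# Assumption: the NATS monitoring port is enabled and reachable here.
MONITOR_URL = "http://localhost:8222"


def fetch_json(path, base_url=MONITOR_URL, timeout=5.0):
    """GET a monitoring endpoint such as '/varz' and decode its JSON body."""
    with urlopen(f"{base_url}{path}", timeout=timeout) as resp:
        return json.load(resp)


def summarize_varz(varz):
    """Pick the /varz fields most relevant to a CPU investigation.

    Fields missing from the payload are reported as None.
    """
    keys = ("cpu", "mem", "connections", "subscriptions",
            "slow_consumers", "in_msgs", "out_msgs")
    return {key: varz.get(key) for key in keys}


def busiest_connections(connz, top=5):
    """Rank connections from a /connz payload by total messages handled.

    Returns (cid, name, total_msgs) tuples, busiest first.
    """
    def total(conn):
        return conn.get("in_msgs", 0) + conn.get("out_msgs", 0)

    ranked = sorted(connz.get("connections", []), key=total, reverse=True)
    return [(c.get("cid"), c.get("name", ""), total(c)) for c in ranked[:top]]
```

Against a live server, `summarize_varz(fetch_json("/varz"))` gives a quick snapshot, and `busiest_connections(fetch_json("/connz"))` points at the clients generating the most traffic. A non-zero `slow_consumers` count or one connection dominating the message totals is a reasonable place to start looking.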
Mitigation
To mitigate high CPU usage on the NATS server, follow these steps:
- Identify and resolve any underlying issues causing high CPU usage (e.g., fix bugs, optimize code).
- Scale up the NATS server instance to increase available resources (e.g., CPU, memory).
- Implement load balancing or clustering to distribute the load across multiple NATS servers.
- Tune the NATS server configuration to bound the work each server accepts (e.g., limits such as max_connections and max_payload).
- Consider upgrading the NATS server to a newer version that may have performance improvements.
Remember to investigate and resolve the underlying cause of high CPU usage to prevent recurring issues.