NetdataHighCpuUsage

Netdata high CPU usage (> 80%)

Alert Rule

```yaml
alert: NetdataHighCpuUsage
annotations:
  description: |-
    Netdata high CPU usage (> 80%)
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/netdata-internal/netdatahighcpuusage/
  summary: Netdata high cpu usage (instance {{ $labels.instance }})
expr: rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m]) > 80
for: 5m
labels:
  severity: warning
```
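
To see how close each series is to the threshold, you can run the rule's expression against the Prometheus HTTP API (`/api/v1/query`). Drop the `> 80` comparison first so that every series is returned. The sketch below is a minimal example that uses only the standard library. It assumes Prometheus is reachable at `http://localhost:9090`, which is a placeholder address. The helper names `parse_instant_vector`, `query` and `report` are illustrative and are not part of any library.

```python
import json
import urllib.parse
import urllib.request

# Placeholder address (assumption): point this at your Prometheus server.
PROMETHEUS_URL = "http://localhost:9090"
# The alert expression without the "> 80" comparison, so every series is returned.
EXPR = 'rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m])'
THRESHOLD = 80.0


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Turn a /api/v1/query JSON response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    # Each result's "value" is [unix_timestamp, "value_as_string"].
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


def query(expr: str) -> list[tuple[dict, float]]:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_vector(json.load(resp))


def report(threshold: float = THRESHOLD) -> None:
    """Print each series' current value, highest first."""
    for labels, value in sorted(query(EXPR), key=lambda lv: lv[1], reverse=True):
        state = "above" if value > threshold else "below"
        print(f"{labels.get('instance', '?')} {labels.get('chart', '')}: "
              f"{value:.2f} ({state} {threshold})")
```

Call `report()` from a Python shell on a host that can reach Prometheus. The `instance` label is added by Prometheus at scrape time. The `chart` label is printed only if the exporter provides it.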


Meaning

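The NetdataHighCpuUsage alert fires when the expression `rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m]) > 80` returns a result on every evaluation for 5 minutes (`for: 5m`). The `[1m]` range only sets how much data `rate()` considers at each evaluation. The metric is a CPU utilization percentage that Netdata exports for the host it runs on, so sustained high CPU can degrade the Netdata instance's performance and its ability to collect and process metrics. Note that the expression selects the `idle` dimension and applies `rate()` to a percentage gauge, whereas Prometheus recommends `rate()` only for counters. Confirm that the expression measures what you intend before relying on the 80% threshold.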
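
The `for: 5m` clause means a single high evaluation is not enough. The alert stays pending until the expression has returned a result on every evaluation for 5 minutes, and any evaluation without a result resets that timer. The simplified model below illustrates this timing. It ignores evaluation jitter and Prometheus restarts. `first_firing_time` is a hypothetical helper, and the 60-second evaluation interval in the usage line is an assumption.

```python
FOR_SECONDS = 5 * 60  # the rule's "for: 5m"
THRESHOLD = 80.0      # the rule's "> 80"


def first_firing_time(evaluations, threshold=THRESHOLD, for_seconds=FOR_SECONDS):
    """Return the timestamp at which the alert would start firing, or None.

    evaluations: (unix_seconds, value) pairs in time order, one per rule
    evaluation; value is None when the expression returned no sample.
    """
    pending_since = None
    for ts, value in evaluations:
        if value is not None and value > threshold:
            if pending_since is None:
                pending_since = ts  # alert enters the "pending" state
            if ts - pending_since >= for_seconds:
                return ts  # condition held for the whole "for" window
        else:
            pending_since = None  # a miss resets the pending timer
    return None


if __name__ == "__main__":
    # Evaluations every 60 s, all above the threshold.
    evals = [(t, 90.0) for t in range(0, 361, 60)]
    print(first_firing_time(evals))  # → 300
```

Under this model, a value that dips below 80 for one evaluation restarts the 5-minute wait. This is why short CPU spikes do not fire the alert.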

Impact

High CPU usage on the Netdata instance can lead to:

Delays or loss of metric data
Increased latency in data processing and visualization
Potential crashes or instability of the Netdata instance
Impaired ability to monitor and troubleshoot system performance

Diagnosis

To diagnose the root cause of high CPU usage on the Netdata instance:

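Determine whether Netdata itself or another workload is using the CPU, for example with `top` or the Applications charts in the Netdata dashboard. Netdata runs many collectors as separate plugin processes (such as `apps.plugin` and `go.d.plugin`), so include those when attributing usage to Netdata.
Check the Netdata logs for errors or warnings. On systemd hosts, use `journalctl -u netdata`; on installations that log to files, check `/var/log/netdata/`.
Verify that the host's CPU and memory are sufficient for its workload and for the Netdata instance.
Investigate recent changes to the system configuration, Netdata plugins, or metric collection settings that coincide with the rise in CPU usage.
Use Netdata's built-in dashboards and charts to identify specific metrics or plugins that are consuming excessive CPU.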
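
To tell whether Netdata or another workload is using the CPU, the sketch below samples `/proc/<pid>/stat` twice and ranks processes by the CPU time they used in between. It assumes a Linux host. `parse_stat`, `snapshot`, `top_cpu` and `print_top_cpu` are illustrative names. Percentages are relative to one core, so a multi-threaded process can exceed 100%.

```python
import os
import time


def parse_stat(line: str) -> tuple[str, int]:
    """Return (command name, utime + stime in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is in parentheses and may contain spaces or ')',
    # so take everything up to the LAST ')'.
    start, end = line.index("("), line.rindex(")")
    comm = line[start + 1:end]
    fields = line[end + 2:].split()  # fields[0] is the state (field 3 in proc(5))
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)
    return comm, utime + stime


def snapshot() -> dict[int, tuple[str, int]]:
    """Map pid -> (comm, cumulative CPU ticks) for all readable processes."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                ticks[int(entry)] = parse_stat(f.read())
        except OSError:
            continue  # process exited or is not readable
    return ticks


def top_cpu(interval: float = 2.0, limit: int = 10) -> list[tuple[float, int, str]]:
    """Return (percent of one core, pid, comm) for the busiest processes."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    usage = []
    for pid, (comm, t1) in after.items():
        if pid in before:  # skip processes that started during the interval
            t0 = before[pid][1]
            usage.append((100.0 * (t1 - t0) / hz / interval, pid, comm))
    return sorted(usage, reverse=True)[:limit]


def print_top_cpu() -> None:
    """Print the busiest processes, highest CPU first."""
    for pct, pid, comm in top_cpu():
        print(f"{pct:6.1f}%  {pid:>7}  {comm}")
```

Run `print_top_cpu()` on the affected host and look for `netdata` and its plugin processes near the top. If other processes dominate, the alert most likely reflects host load rather than Netdata's own usage.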

Mitigation

To mitigate high CPU usage on the Netdata instance:

Provision sufficient resources: Ensure that the host has enough CPU and memory for its workload and for the Netdata instance.
Adjust plugin settings: Review plugin settings to reduce CPU usage. For example, increase the collection interval (`update every`) globally or for expensive plugins, and disable unnecessary plugins in `netdata.conf`.
Reduce query load: If many dashboards, API clients, or frequent Prometheus scrapes query the instance, consider caching results or querying less often to reduce the load on the Netdata instance.
Upgrade Netdata: Ensure that the Netdata instance is running the latest version, as newer versions may include performance optimizations.
Consider load balancing: If the Netdata instance is handling a high volume of metrics, consider spreading collection and processing across multiple instances (for example, with Netdata's parent/child streaming) to reduce the CPU load on any single instance.