HostCpuStealNoisyNeighbor #

CPU steal is above 10%. A noisy neighbor is degrading VM performance, or a spot instance may be out of credit.

Alert Rule

alert: HostCpuStealNoisyNeighbor
annotations:
  description: |-
    CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/netdata-internal/hostcpustealnoisyneighbor/
  summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
expr: rate(netdata_cpu_cpu_percentage_average{dimension="steal"}[1m]) > 10
for: 5m
labels:
  severity: warning

Meaning #

The HostCpuStealNoisyNeighbor alert fires when the rule's expression, computed over a 1-minute window of the Netdata CPU steal metric, stays above 10 for at least 5 minutes. CPU steal is the time a virtual machine (VM) has work ready to run but must wait because the hypervisor is giving the physical CPU to other guests. Sustained steal shows up inside the VM as slow or stalled applications.
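To make the metric concrete: on Linux guests, steal time appears as the eighth counter on the aggregate `cpu` line of `/proc/stat`. The following is a minimal sketch, not Netdata's actual collector, that computes a steal percentage from two samples of that line. The function names and the 1-second sampling interval are illustrative assumptions.

```python
import time

# Counter order on the 'cpu' line, per proc(5):
# user nice system idle iowait irq softirq steal guest guest_nice
STEAL_INDEX = 7


def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line."""
    fields = line.split()
    if len(fields) < 9 or fields[0] != "cpu":
        raise ValueError("expected an aggregate 'cpu' line with a steal field")
    # guest and guest_nice are already counted in user and nice, so they are dropped.
    return [int(v) for v in fields[1:9]]


def steal_percent(before, after):
    """Percentage of CPU time stolen between two samples of the 'cpu' line."""
    prev = parse_cpu_line(before)
    curr = parse_cpu_line(after)
    deltas = [c - p for p, c in zip(prev, curr)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[STEAL_INDEX] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as f:
        return f.readline()


def sample_local_steal(interval=1.0):
    """Sample /proc/stat twice, `interval` seconds apart, and return steal %."""
    before = read_cpu_line()
    time.sleep(interval)
    return steal_percent(before, read_cpu_line())
```

Calling `sample_local_steal()` on the affected VM gives a quick local cross-check of the value reported by the alert.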

Impact #

Sustained CPU steal can significantly degrade the performance of applications running on the affected host. When noisy neighbors, such as other VMs on the same physical host, consume excessive CPU resources, the affected VM can experience:

Increased latency
Decreased throughput
Unresponsive applications
Potential crashes or failures of critical services

Diagnosis #

To diagnose the root cause of this issue, follow these steps:

Check the Netdata dashboard for the affected host to identify the current CPU steal percentage and trend.
Inspect the alert's labels (the LABELS line in the alert description, in particular instance) to determine which VM or instance is affected.
Review system logs and monitoring data to identify any other performance issues or anomalies on the host.
If you have access to the hypervisor, check its logs for resource allocation or scheduling issues. On a public cloud, where those logs are not available, check the provider's instance metrics instead.
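The first two steps can also be scripted against the Prometheus HTTP API. The sketch below is illustrative only: it queries the same metric selector as the alert rule (without `rate()` or the threshold, so the raw reported value is visible) and ranks instances by that value. The base URL passed to `fetch_steal` is a placeholder you must supply, and the presence of an `instance` label depends on your scrape configuration.

```python
import json
import urllib.parse
import urllib.request

# Selector taken from the alert rule; rate() and the "> 10" threshold are omitted.
STEAL_QUERY = 'netdata_cpu_cpu_percentage_average{dimension="steal"}'


def build_query_url(base_url, query):
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def rank_by_steal(payload):
    """Return (instance, value) pairs from an instant-vector response, highest first."""
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed: %r" % payload.get("error"))
    rows = []
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "<unknown>")
        _timestamp, value = sample["value"]  # value is returned as a string
        rows.append((instance, float(value)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_steal(base_url, timeout=10):
    """Query Prometheus at base_url (assumed reachable) and rank instances."""
    url = build_query_url(base_url, STEAL_QUERY)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return rank_by_steal(json.load(resp))
```

If several instances report high steal at the same time, contention on a shared physical host is a plausible explanation; a single affected instance points more toward that VM's own placement or sizing.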

Mitigation #

To mitigate the impact of this issue, follow these steps:

Isolate the noisy neighbor: Identify the specific VM or instance causing the CPU steal and consider migrating it to a different host or adjusting its resource allocation.
Adjust instance sizing: Review the instance sizes and resource allocations to ensure they are adequate for the workload.
Optimize hypervisor settings: Check the hypervisor settings to ensure they are optimized for the current workload and resource utilization.
Monitor and adjust: Continuously monitor the CPU steal percentage and adjust the instance sizing, resource allocation, and hypervisor settings as needed to maintain optimal performance.

Remember to consult the Netdata documentation for more detailed guidance on diagnosing and mitigating this issue.