VaultClusterHealth

Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf "%.2f"}}%

Alert Rule

alert: VaultClusterHealth
annotations:
  description: |-
    Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf "%.2f"}}%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/hashicorp-vault-internal/vaultclusterhealth/
  summary: Vault cluster health (instance {{ $labels.instance }})
expr: sum(vault_core_active) / count(vault_core_active) <= 0.5
for: 0m
labels:
  severity: critical

Meaning

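The VaultClusterHealth alert indicates that the Vault cluster is not in a healthy state. The expression divides the sum of vault_core_active across all instances by the number of instances reporting that metric, which gives the fraction of Vault cores (nodes) that report themselves as active. The alert triggers when that fraction is 0.5 (50%) or lower. Because for is set to 0m, it fires on the first evaluation that meets the condition.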
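
The arithmetic behind the expression can be sketched in Python as follows. This is only an illustration, not part of the alert rule: the function names and the sample lists are hypothetical, and it assumes each instance reports vault_core_active as 1 when active and 0 otherwise.

```python
from typing import Sequence

# Mirrors the `<= 0.5` comparison in the alert expression.
THRESHOLD = 0.5


def active_ratio(samples: Sequence[float]) -> float:
    """Return sum(samples) / count(samples), like the PromQL expression.

    `samples` holds one vault_core_active value per instance (assumed 0 or 1).
    """
    if not samples:
        # PromQL would return no series here and the alert would not fire;
        # this sketch raises instead so the empty case is not silently ignored.
        raise ValueError("no vault_core_active samples")
    return sum(samples) / len(samples)


def would_fire(samples: Sequence[float], threshold: float = THRESHOLD) -> bool:
    """True when the ratio is at or below the threshold."""
    return active_ratio(samples) <= threshold


# Hypothetical three-instance samples.
print(would_fire([1, 0, 0]))  # ratio 1/3
print(would_fire([1, 1, 0]))  # ratio 2/3
```

With these inputs, a three-instance set in which only one instance reports 1 yields a ratio of about 0.33 and meets the condition, while two of three reporting 1 does not. A ratio of exactly 0.5 also meets the condition, since the comparison is inclusive.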

Impact

An unhealthy Vault cluster can affect both the security and the reliability of the systems that depend on it. It can lead to:

- Unavailability of secrets and sensitive data
- Increased risk of data breaches
- Disruption to dependent applications and services
- Potential loss of business-critical data

Diagnosis

To diagnose the issue, follow these steps:

- Check the Vault cluster status using the Vault CLI (for example, vault status on each node) or the web UI
- Verify that the Vault cores are properly configured and running
- Check the system logs for errors or warnings related to Vault
- Investigate recent changes or deployments that may have caused the issue
- Review the Vault cluster topology and confirm that it is correctly configured
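
The status check in the steps above can also be scripted against Vault's HTTP health endpoint, /v1/sys/health. The following is a minimal sketch, assuming the endpoint's default status codes have not been overridden with query parameters and that the node's TLS certificate is trusted by the machine running the script. The function names are hypothetical.

```python
import urllib.error
import urllib.request

# Default status codes documented for Vault's /v1/sys/health endpoint.
HEALTH_CODES = {
    200: "initialized, unsealed, active",
    429: "unsealed, standby",
    472: "disaster recovery secondary, active",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}


def describe_health(status_code: int) -> str:
    """Translate a sys/health HTTP status code into a readable node state."""
    return HEALTH_CODES.get(status_code, f"unexpected status {status_code}")


def probe(base_url: str, timeout: float = 5.0) -> str:
    """Query one node's health endpoint and describe its state.

    `base_url` is the node address, for example a hypothetical
    "https://vault-0.example.internal:8200".
    """
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return describe_health(resp.status)
    except urllib.error.HTTPError as err:
        # Non-2xx codes (429, 472, 473, 501, 503) arrive as HTTPError.
        return describe_health(err.code)
    except (urllib.error.URLError, OSError) as err:
        return f"unreachable: {err}"
```

Calling probe once per node address gives a quick per-node summary. A node that is sealed or unreachable is a natural place to start looking in the logs.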

Mitigation

To mitigate the issue, follow these steps:

- Identify the root cause of the unhealthy cluster state
- Take corrective action to restore the cluster to a healthy state
  - This may involve restarting Vault cores, fixing configuration issues, or rolling back recent changes
- Verify that the cluster is healthy by checking its status and monitoring metrics
- Review the Vault cluster configuration and topology thoroughly to prevent similar issues in the future
- Consider adding monitoring and alerting that detects potential issues earlier
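
The verification step can be expressed as a small check over per-node state. This is a sketch under stated assumptions rather than a definitive health test: NodeState and cluster_problems are hypothetical names, and it assumes that a healthy cluster has no sealed nodes and exactly one active node. That criterion is an illustration and is not the same as the ratio used by the alert expression.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class NodeState:
    """Observed state of one Vault node (hypothetical structure)."""
    name: str
    sealed: bool
    active: bool


def cluster_problems(nodes: Iterable[NodeState]) -> List[str]:
    """Return a list of detected problems; an empty list means none found."""
    nodes = list(nodes)
    if not nodes:
        return ["no nodes reported"]

    problems: List[str] = []

    sealed = [n.name for n in nodes if n.sealed]
    if sealed:
        problems.append("sealed nodes: " + ", ".join(sealed))

    # Assumption: exactly one unsealed node should be active.
    active = [n.name for n in nodes if n.active and not n.sealed]
    if len(active) != 1:
        problems.append(f"expected exactly 1 active node, found {len(active)}")

    return problems


# Hypothetical example: one sealed standby next to a healthy active node.
example = [NodeState("vault-0", False, True), NodeState("vault-1", True, False)]
print(cluster_problems(example))
```

Returning a list of problems rather than a single boolean keeps the output useful after a restart or rollback, since it names the nodes that still need attention.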