VaultClusterHealth #
Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf "%.2f"}}%
Alert Rule
alert: VaultClusterHealth
annotations:
  description: |-
    Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf "%.2f"}}%
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/hashicorp-vault-internal/vaultclusterhealth/
  summary: Vault cluster health (instance {{ $labels.instance }})
expr: sum(vault_core_active) / count(vault_core_active) <= 0.5
for: 0m
labels:
  severity: critical
Meaning #
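The VaultClusterHealth alert indicates that the Vault cluster is not in a healthy state. The rule divides the sum of the vault_core_active gauge (1 on a node that reports itself active, 0 otherwise) by the number of instances exposing that gauge, which gives the fraction of Vault nodes reporting as active. The alert fires when that fraction is 50% or lower (the expression uses <= 0.5). Because for is set to 0m, it fires on the first rule evaluation that meets the condition.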
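As a worked illustration, the arithmetic behind the alert expression can be reproduced in a few lines of Python. This is only a sketch: the function name `active_ratio_alert` and the sample gauge values are hypothetical, and in practice Prometheus evaluates the rule itself.

```python
def active_ratio_alert(gauges, threshold=0.5):
    """Mirror sum(vault_core_active) / count(vault_core_active) <= threshold.

    `gauges` holds one vault_core_active sample per scraped instance.
    Returns a (ratio, firing) tuple.
    """
    if not gauges:
        # With no series at all the PromQL expression yields an empty
        # result, so the rule would not fire.
        return None, False
    ratio = sum(gauges) / len(gauges)
    return ratio, ratio <= threshold


# Hypothetical samples: four instances, two reporting active.
print(active_ratio_alert([1, 1, 0, 0]))  # ratio is exactly 0.5, so the alert fires
# Hypothetical samples: four instances, three reporting active.
print(active_ratio_alert([1, 1, 1, 0]))  # ratio is 0.75, so the alert does not fire
```

The first case shows why the comparison operator matters: a ratio of exactly 0.5 already triggers the alert.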
Impact #
An unhealthy Vault cluster can seriously affect the security and reliability of every system that depends on it. It can lead to:
- Unavailability of secrets and sensitive data
- Increased risk of data breaches
- Disruption to dependent applications and services
- Potential loss of business-critical data
Diagnosis #
To diagnose the issue, follow these steps:
- Check the cluster status with the Vault CLI (for example, by running vault status against each node) or the Web UI
- Verify that every Vault node is running, unsealed, and correctly configured
- Check the system logs on each node for Vault errors or warnings
- Investigate any recent changes or deployments that may have caused the issue
- Review the cluster topology and confirm that every expected node is present and correctly configured
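The status check in the first step can also be scripted against Vault's `/v1/sys/health` HTTP endpoint, which reports node state through its response code. The sketch below assumes the endpoint's default status codes and that each node is reachable without extra TLS or authentication setup. The helper names `classify_health` and `check_node` are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Default response codes of Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "not-initialized",
    503: "sealed",
}


def classify_health(status_code):
    """Map a /v1/sys/health status code to a short state name."""
    return HEALTH_STATES.get(status_code, "unknown")


def check_node(base_url, timeout=5.0):
    """Return (state, body) for one Vault node, e.g. base_url='https://vault-0:8200'."""
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Standby, sealed, and similar states use non-2xx codes, which
        # urllib raises as HTTPError; the response body is still readable.
        code, raw = err.code, err.read()
    except OSError as err:
        # Covers URLError, connection refusals, and timeouts.
        return "unreachable", {"error": str(err)}
    try:
        body = json.loads(raw) if raw else {}
    except ValueError:
        body = {}
    return classify_health(code), body
```

Calling `check_node` once per node address (the addresses are deployment-specific and not given here) gives a quick view of which nodes are active, standby, sealed, or unreachable.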
Mitigation #
To mitigate the issue, follow these steps:
- Identify the root cause of the unhealthy cluster state
- Take corrective action to restore the cluster to a healthy state; this may involve restarting Vault nodes, fixing configuration issues, or rolling back recent changes
- Verify that the cluster is healthy again by checking its status and confirming that the alert expression has risen back above 0.5
- Perform a thorough review of the Vault cluster configuration and topology to prevent similar issues in the future
- Consider implementing additional monitoring and alerting to detect potential issues earlier
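The verification step above can be sketched as a small check against the Prometheus HTTP API, which serves instant queries at `/api/v1/query`. The Prometheus base URL is deployment-specific, and the helper names `parse_ratio`, `fetch_ratio`, and `is_recovered` are hypothetical. Treat this as a minimal sketch, not a definitive implementation.

```python
import json
import urllib.parse
import urllib.request

# Left-hand side of the alert expression, without the threshold comparison.
ALERT_EXPR = "sum(vault_core_active) / count(vault_core_active)"


def parse_ratio(response):
    """Extract the ratio from a Prometheus instant-query response dict."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % response.get("error"))
    result = response["data"]["result"]
    if not result:
        # No vault_core_active series were found.
        return None
    # Each instant-vector sample looks like {"metric": {...}, "value": [ts, "0.75"]}.
    return float(result[0]["value"][1])


def fetch_ratio(prometheus_url, timeout=5.0):
    """Query Prometheus (base URL supplied by the caller) for the current ratio."""
    query = urllib.parse.urlencode({"query": ALERT_EXPR})
    url = prometheus_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_ratio(json.load(resp))


def is_recovered(ratio, threshold=0.5):
    """The alert clears only when the ratio is strictly above the threshold."""
    return ratio is not None and ratio > threshold
```

A missing result is treated as not recovered here, on the assumption that absent metrics after an incident deserve a closer look, not a pass.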