ConsulServiceHealthcheckFailed #
Service: {{ $labels.service_name }}
Healthcheck: {{ $labels.service_id }}
Alert Rule
alert: ConsulServiceHealthcheckFailed
annotations:
description: |-
Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`
VALUE = {{ $value }}
LABELS = {{ $labels }}
runbook: https://srerun.github.io/prometheus-alerts/runbooks/consul-exporter/consulservicehealthcheckfailed/
summary: Consul service healthcheck failed (instance {{ $labels.instance }})
expr: consul_catalog_service_node_healthy == 0
for: 1m
labels:
severity: critical
Here is a runbook for the ConsulServiceHealthcheckFailed alert:
Meaning #
The ConsulServiceHealthcheckFailed alert indicates that a Consul service health check has failed. This means that the service is not responding to health checks, which can indicate a problem with the service or the underlying infrastructure.
Impact #
The impact of this alert can be severe, as it may indicate that a critical service is not functioning properly. This can lead to issues with application availability, data loss, or other downstream effects. The failed health check may also indicate a larger issue with the Consul cluster or the underlying infrastructure.
Diagnosis #
To diagnose the root cause of the ConsulServiceHealthcheckFailed alert, follow these steps:
- Check the Consul UI to verify the status of the service and the health check.
- Check the service logs to determine if there are any error messages or exceptions that may indicate the cause of the failure.
- Verify that the service is properly configured and that the health check is correctly set up.
- Check the underlying infrastructure, such as the network, CPU, and memory usage, to ensure that it is functioning properly.
- Review the Consul cluster status to ensure that it is healthy and functioning as expected.
Mitigation #
To mitigate the ConsulServiceHealthcheckFailed alert, follow these steps:
- Restart the service to see if it self-heals.
- Check the service configuration and verify that it is correct.
- Update the service or Consul configuration to fix any issues identified during diagnosis.
- Check the underlying infrastructure and perform any necessary maintenance or repairs.
- Consider increasing the monitoring and logging for the service to provide more visibility into its operation.
- If the issue persists, consider escalating to a higher-level support team or seeking guidance from a subject matter expert.
By following these steps, you should be able to diagnose and mitigate the root cause of the ConsulServiceHealthcheckFailed alert, and restore the service to a healthy state.