ConsulServiceHealthcheckFailed #

Service: {{ $labels.service_name }} Healthcheck: {{ $labels.service_id }}

Alert Rule

alert: ConsulServiceHealthcheckFailed
annotations:
  description: |-
    Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/consul-exporter/consulservicehealthcheckfailed/
  summary: Consul service healthcheck failed (instance {{ $labels.instance }})
expr: consul_catalog_service_node_healthy == 0
for: 1m
labels:
  severity: critical

Here is a runbook for the ConsulServiceHealthcheckFailed alert:

Meaning #

The ConsulServiceHealthcheckFailed alert indicates that a Consul service health check has failed. This means that the service is not responding to health checks, which can indicate a problem with the service or the underlying infrastructure.

Impact #

The impact of this alert can be severe, as it may indicate that a critical service is not functioning properly. This can lead to issues with application availability, data loss, or other downstream effects. The failed health check may also indicate a larger issue with the Consul cluster or the underlying infrastructure.

Diagnosis #

To diagnose the root cause of the ConsulServiceHealthcheckFailed alert, follow these steps:

Check the Consul UI to verify the status of the service and the health check.
Check the service logs to determine if there are any error messages or exceptions that may indicate the cause of the failure.
Verify that the service is properly configured and that the health check is correctly set up.
Check the underlying infrastructure, such as the network, CPU, and memory usage, to ensure that it is functioning properly.
Review the Consul cluster status to ensure that it is healthy and functioning as expected.

Mitigation #

To mitigate the ConsulServiceHealthcheckFailed alert, follow these steps:

Restart the service to see if it self-heals.
Check the service configuration and verify that it is correct.
Update the service or Consul configuration to fix any issues identified during diagnosis.
Check the underlying infrastructure and perform any necessary maintenance or repairs.
Consider increasing the monitoring and logging for the service to provide more visibility into its operation.
If the issue persists, consider escalating to a higher-level support team or seeking guidance from a subject matter expert.

By following these steps, you should be able to diagnose and mitigate the root cause of the ConsulServiceHealthcheckFailed alert, and restore the service to a healthy state.