ClickhouseNoLiveReplicas #

There are too few live replicas available, risking data loss and service disruption.

Alert Rule

alert: ClickhouseNoLiveReplicas
annotations:
  description: |-
    There are too few live replicas available, risking data loss and service disruption.
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/clickhouse-internal/clickhousenolivereplicas/
  summary: ClickHouse No Live Replicas (instance {{ $labels.instance }})
expr: ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1
for: 0m
labels:
  severity: critical

Meaning #

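The ClickhouseNoLiveReplicas alert fires when the ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS metric equals 1, which indicates that too few live replicas are available in the ClickHouse cluster and that there is a high risk of data loss and service disruption. Because the rule sets for: 0m, the alert fires as soon as the condition is observed. Its severity is critical, so it requires immediate attention.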
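
To see what the alert is reporting, the same error can be looked up on the server itself. The sketch below is a minimal example under stated assumptions: the ClickHouse HTTP interface is reachable at http://localhost:8123/ without authentication, and the system.errors table is available (it exists in recent ClickHouse versions). The helper names run_query, parse_json_each_row, and total_errors are illustrative and not part of any ClickHouse API.

```python
import json
import urllib.parse
import urllib.request

# Look up the error by name; the name matches the suffix of the metric
# used in the alert expression.
ERROR_QUERY = (
    "SELECT name, value, last_error_time, last_error_message "
    "FROM system.errors "
    "WHERE name = 'TOO_FEW_LIVE_REPLICAS' "
    "FORMAT JSONEachRow"
)


def run_query(query, base_url="http://localhost:8123/"):
    """Send a read-only query to the ClickHouse HTTP interface.

    base_url is an assumption; adjust host, port, and credentials as needed.
    """
    url = base_url + "?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def parse_json_each_row(body):
    """Parse JSONEachRow output: one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def total_errors(rows):
    """Sum the error counts across the returned rows.

    64-bit integers may be emitted as JSON strings, so normalise with int().
    """
    return sum(int(row["value"]) for row in rows)
```

For example, total_errors(parse_json_each_row(run_query(ERROR_QUERY))) returns 0 when no matching row exists. The last_error_time and last_error_message columns of each row show when the error was last raised and with what message.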

Impact #

Too few live replicas can have a significant impact:

Data loss: ClickHouse may be unable to maintain data durability and consistency, so data can be lost.
Service disruption: Queries can fail, which reduces the availability of applications that depend on the cluster.

Diagnosis #

To diagnose the issue, follow these steps:

Check the ClickHouse cluster status: Verify the number of live replicas and look for error messages.
Investigate node status: Check each node in the cluster for failures or other problems that could reduce the number of live replicas.
Review recent changes: Look for recent changes to the ClickHouse configuration, node deployments, or application code that may have caused the issue.
Check system resources: Verify that CPU, memory, and disk space are sufficient to support the ClickHouse cluster.
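
The first two steps can be sketched as a query against the system.replicas table, which reports per-table replica counts and state. This is a minimal example, not a definitive check: it assumes the HTTP interface is reachable at http://localhost:8123/ without authentication, and the function names and the rule "flag a table when fewer replicas are active than configured" are illustrative choices.

```python
import json
import urllib.parse
import urllib.request

# Columns selected here are part of system.replicas.
REPLICA_QUERY = (
    "SELECT database, table, is_readonly, is_session_expired, "
    "total_replicas, active_replicas "
    "FROM system.replicas "
    "FORMAT JSONEachRow"
)


def fetch_replica_rows(base_url="http://localhost:8123/"):
    """Fetch replica status rows over the ClickHouse HTTP interface.

    base_url is an assumption; adjust host, port, and credentials as needed.
    """
    url = base_url + "?" + urllib.parse.urlencode({"query": REPLICA_QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def degraded_tables(rows):
    """Return (table, reasons) pairs for replicated tables that look unhealthy."""
    flagged = []
    for row in rows:
        reasons = []
        active = int(row["active_replicas"])
        total = int(row["total_replicas"])
        if active < total:
            reasons.append(f"{active}/{total} replicas active")
        if int(row["is_readonly"]):
            reasons.append("replica is read-only")
        if int(row["is_session_expired"]):
            reasons.append("keeper session expired")
        if reasons:
            flagged.append((f'{row["database"]}.{row["table"]}', reasons))
    return flagged
```

Calling degraded_tables(fetch_replica_rows()) on each node gives a quick list of tables to investigate. An empty list on every node suggests the replicas have recovered.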

Mitigation #

To mitigate the issue, follow these steps:

Investigate and resolve underlying issues: Address any underlying issues found during diagnosis, such as node failures or configuration errors.
Increase replica count: If failed nodes cannot be recovered quickly, temporarily add replicas so that enough live copies of the data exist.
Perform rolling restarts: Restart ClickHouse nodes one at a time, confirming that each node is healthy and live before moving on to the next.
Monitor cluster status: Closely monitor the ClickHouse cluster status to ensure that the issue has been resolved and live replicas are available.
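
The rolling restart and monitoring steps above can be sketched as a loop that restarts one node at a time and waits for it to report healthy before continuing. This is a minimal sketch under stated assumptions: restart and is_healthy are hypothetical callables that you supply (for example, a wrapper around your service manager or orchestrator, and a poll of the ClickHouse HTTP /ping endpoint), and the timeout values are placeholders.

```python
import time


def rolling_restart(nodes, restart, is_healthy, timeout_s=300.0, poll_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Restart nodes one at a time, waiting for each to become healthy.

    restart(node) and is_healthy(node) are hypothetical, caller-supplied
    callables. The loop stops at the first node that does not recover within
    timeout_s, so no further nodes are taken down after a failed restart.
    Returns the list of nodes that were restarted successfully.
    """
    done = []
    for node in nodes:
        restart(node)
        deadline = clock() + timeout_s
        while not is_healthy(node):
            if clock() >= deadline:
                raise TimeoutError(
                    f"{node} did not become healthy; restarted so far: {done}"
                )
            sleep(poll_s)
        done.append(node)
    return done
```

Stopping at the first failure is a deliberate choice here: continuing to restart nodes while one is still down would further reduce the number of live replicas.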

By following these steps, you should be able to diagnose and mitigate the ClickhouseNoLiveReplicas issue, restoring data durability and consistency to the ClickHouse cluster.