PatroniHasNoLeader #
A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}
Alert Rule
alert: PatroniHasNoLeader
annotations:
  description: |-
    A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/embedded-exporter-patroni/patronihasnoleader/
  summary: Patroni has no Leader (instance {{ $labels.instance }})
expr: (max by (scope) (patroni_master) < 1) and (max by (scope) (patroni_standby_leader) < 1)
for: 0m
labels:
  severity: critical
Meaning #
The PatroniHasNoLeader alert fires when no member of a Patroni cluster reports itself as leader: neither a primary (patroni_master) nor, in a standby cluster, a standby leader (patroni_standby_leader). Because the rule uses for: 0m, it fires as soon as the condition is observed. Without a leader, no node in the cluster accepts writes. This is a critical situation that requires immediate attention to restore write availability and limit the risk of data loss.
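As a rough illustration of what the alert expression checks, the sketch below re-implements its logic in plain Python. The `clusters_without_leader` function and the `(metric, scope, value)` tuples are hypothetical stand-ins for scraped samples, not part of Patroni or Prometheus. Only the metric names and the `scope` label come from the rule itself.

```python
from collections import defaultdict

METRICS = ("patroni_master", "patroni_standby_leader")

def clusters_without_leader(samples):
    """Return the scopes for which no member reports a leader role.

    `samples` is an iterable of (metric_name, scope, value) tuples, a
    hypothetical in-memory stand-in for scraped Patroni metrics.
    """
    # Highest value seen per scope for each of the two metrics in the rule,
    # mirroring `max by (scope) (...)`.
    best = defaultdict(lambda: {name: None for name in METRICS})
    for name, scope, value in samples:
        if name in METRICS:
            current = best[scope][name]
            best[scope][name] = value if current is None else max(current, value)

    # PromQL `and` keeps only scopes present on both sides, so a scope must
    # have both series, and both maxima must be below 1.
    return sorted(
        scope
        for scope, maxima in best.items()
        if all(v is not None and v < 1 for v in maxima.values())
    )

samples = [
    ("patroni_master", "pg-main", 0),
    ("patroni_master", "pg-main", 0),
    ("patroni_standby_leader", "pg-main", 0),
    ("patroni_master", "pg-other", 1),
    ("patroni_standby_leader", "pg-other", 0),
]
print(clusters_without_leader(samples))  # ['pg-main']
```

One consequence worth noting: if a cluster stops exporting one of the two metrics entirely, the expression returns nothing for that scope, so missing scrapes are better caught by a separate alert.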
Impact #
The lack of a leader node in a Patroni cluster can have severe consequences, including:
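- Write unavailability, because no primary node accepts writes, and possible data loss or inconsistencies if a lagging replica is later promoted
- Operations that depend on a leader, such as a planned switchover, cannot proceed until a leader is elected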
- Increased risk of cluster instability and further failures
- Potential revenue loss and business impact due to service unavailability
Diagnosis #
To diagnose the root cause of the PatroniHasNoLeader alert, follow these steps:
- Check the Patroni cluster status and inspect the current node roles, for example with patronictl list or the patroni_master and patroni_standby_leader metrics used in the alert expression.
- Verify that all nodes in the cluster are correctly configured and running.
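- Review the Patroni and PostgreSQL logs for any errors or warnings related to node failures, network connectivity issues, or loss of contact with the distributed configuration store (DCS) that Patroni uses for leader election.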
- Check for any recent changes or maintenance activities that may have caused the leader node to fail or step down.
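The role check in the steps above can be sketched against Patroni's REST API, which serves a JSON cluster view at the /cluster endpoint. This is a minimal sketch, not a definitive tool: the helper names are hypothetical, the base URL (for example `http://node1:8008`) depends on your deployment, and the `name`, `role`, and `state` fields and the role values `leader` and `standby_leader` are assumptions about the response shape that should be verified against your Patroni version.

```python
import json
import urllib.request

# Role values assumed to mark a leader in the /cluster response.
LEADER_ROLES = {"leader", "standby_leader"}

def summarize_members(cluster_doc):
    """Split members of a /cluster response into leaders and non-leaders."""
    leaders, others = [], []
    for member in cluster_doc.get("members", []):
        entry = (member.get("name"), member.get("role"), member.get("state"))
        (leaders if member.get("role") in LEADER_ROLES else others).append(entry)
    return leaders, others

def fetch_cluster(base_url, timeout=5):
    """GET <base_url>/cluster from one Patroni node and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/cluster", timeout=timeout) as resp:
        return json.load(resp)

# Example with a hand-written document, so no network access is needed.
example = {"members": [
    {"name": "node1", "role": "replica", "state": "running"},
    {"name": "node2", "role": "replica", "state": "stopped"},
]}
leaders, others = summarize_members(example)
print("leaders:", leaders)
print("others:", others)
```

In a live incident, `summarize_members(fetch_cluster(base_url))` would be run against each reachable node. An empty leader list on every node is consistent with the alert, while disagreement between nodes may point to a network partition or DCS problem.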
Mitigation #
To mitigate the PatroniHasNoLeader alert, follow these steps:
- Immediately investigate and resolve the root cause of the leader node failure. This may involve restarting the failed node, replacing a faulty node, or resolving network connectivity issues.
- Promote a standby node to primary using the patronictl command-line tool (for example, patronictl failover) to ensure the cluster has a functional leader node.
- Verify the cluster status and node roles, for example with patronictl list or the patroni_master and patroni_standby_leader metrics, to ensure the cluster is healthy and functional.
- Perform a thorough analysis of the incident to identify the root cause and implement measures to prevent similar incidents in the future.
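The promotion and verification steps above might look like the following when scripted. Patroni's CLI is invoked as `patronictl`. The sketch only builds the command lines and prints them for review rather than running them, since a failover should be a deliberate operator action. The config path, cluster name, and candidate node are hypothetical placeholders, and the `patronictl_commands` helper is illustrative only.

```python
import shlex

def patronictl_commands(config_path, cluster_name, candidate):
    """Build (but do not run) the patronictl calls for a manual failover."""
    base = ["patronictl", "-c", config_path]
    return [
        # 1. Inspect current members and roles before changing anything.
        base + ["list", cluster_name],
        # 2. Ask Patroni to promote the chosen candidate; without --force,
        #    patronictl prompts for confirmation.
        base + ["failover", cluster_name, "--candidate", candidate],
        # 3. Confirm that a leader is present afterwards.
        base + ["list", cluster_name],
    ]

# Placeholder values; substitute the real config file, scope, and node name.
for cmd in patronictl_commands("/etc/patroni/patroni.yml", "pg-main", "node2"):
    print(shlex.join(cmd))
```

Choosing the candidate with the least replication lag reduces the amount of data that could be lost by the promotion.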