CassandraNodeIsUnavailable #
Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}
Alert Rule
```yaml
alert: CassandraNodeIsUnavailable
annotations:
  description: |-
    Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}
      VALUE = {{ $value }}
      LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/instaclustr-cassandra-exporter/cassandranodeisunavailable/
  summary: Cassandra Node is unavailable (instance {{ $labels.instance }})
expr: sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1
for: 0m
labels:
  severity: critical
```
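To see exactly which endpoints the alert is reporting on, you can run the underlying query against the Prometheus HTTP API. The sketch below assumes a Prometheus server at the hypothetical address http://prometheus.example.internal:9090 and that jq is installed; adjust both to your environment.

```bash
# Hypothetical Prometheus address; point this at your own server.
PROM_URL="http://prometheus.example.internal:9090"

# The same metric the alert uses, without the aggregation: any series that
# matches here is an endpoint the exporter currently reports as inactive.
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=cassandra_endpoint_active == 0' \
  | jq -r '.data.result[].metric | "\(.cassandra_cluster) \(.exported_endpoint)"'
```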
Meaning #
The CassandraNodeIsUnavailable alert fires when the Instaclustr Cassandra exporter reports an endpoint as inactive, i.e. the sum of cassandra_endpoint_active for a given cluster, instance, and endpoint drops below 1. With for set to 0m, it fires as soon as the condition is observed. A node that is down reduces replica availability and can degrade read and write performance across the cluster, which is why the alert is rated critical.
Impact #
The unavailability of a Cassandra node can lead to:
- Failed or timed-out reads and writes when the required consistency level (for example QUORUM) can no longer be met
- Increased latency as coordinators reroute requests and hinted handoffs accumulate
- Reduced cluster capacity and availability, particularly if additional nodes fail
- Data inconsistency between replicas until the node returns and is repaired
- Impact on business operations and revenue that depend on the cluster
Diagnosis #
To diagnose the issue, follow these steps (a command sketch follows this list):
- Check the node’s status with nodetool status (a down node is shown as DN) or in your monitoring, e.g. Prometheus.
- Verify whether the node is failing to respond to requests or showing high latency.
- Review the Cassandra system log for error messages or exceptions around the time the node became unavailable.
- Check system resources (CPU, memory, disk space, file descriptors) for exhaustion.
- Verify that the node is properly configured and that there are no network connectivity issues, such as a blocked inter-node port.
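A minimal sketch of these checks, assuming a typical package installation with logs under /var/log/cassandra, data under /var/lib/cassandra, and the default inter-node port 7000; adjust paths, ports, and hostnames to your deployment.

```bash
# 1. Node and cluster status: an unavailable node shows as DN (Down/Normal).
nodetool status
nodetool describecluster

# 2. Recent errors and exceptions in the system log.
grep -iE 'error|exception' /var/log/cassandra/system.log | tail -n 50

# 3. Resource exhaustion checks: disk, memory, open file limit.
df -h /var/lib/cassandra
free -h
ulimit -n

# 4. Network reachability of the inter-node (storage) port from a peer node.
nc -zv <node-ip> 7000
```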
Mitigation #
To mitigate the issue, follow these steps (a command sketch follows this list):
- Restart the Cassandra node: if the service has stopped or is unresponsive, restart it and confirm the node rejoins the ring.
- Investigate and resolve underlying issues: identify and fix the root cause of the unavailability, such as resource exhaustion, network connectivity problems, or configuration errors.
- Fail over to another node: if the node cannot be recovered promptly, replace it or rely on the remaining replicas to keep the cluster available and minimize data loss.
- Restore from backup: if data loss has occurred, restore from a backup to re-establish data integrity and consistency.
- Perform a rolling restart: if the problem affects multiple nodes, restart the cluster one node at a time, waiting for each node to report healthy before moving on.
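A minimal sketch of the restart and rolling-restart steps, assuming a systemd-managed service named cassandra; service names, init systems, and repair policy vary by deployment, so follow your own procedures.

```bash
# Restart the Cassandra service on the affected node.
sudo systemctl restart cassandra

# Watch the node rejoin the ring: it should move from DN back to UN.
watch -n 5 nodetool status

# If the node was down longer than the hint window (default 3 hours), run a
# repair of its primary ranges so replicas converge again.
nodetool repair -pr

# Rolling restart: on each node in turn, drain before restarting, then wait
# for UN status before moving to the next node.
nodetool drain && sudo systemctl restart cassandra
```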
Remember to follow your organization’s specific procedures and guidelines for incident response and resolution.