CassandraNodeIsUnavailable #

Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}

Alert Rule
alert: CassandraNodeIsUnavailable
annotations:
  description: |-
    Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/instaclustr-cassandra-exporter/cassandranodeisunavailable/
  summary: Cassandra Node is unavailable (instance {{ $labels.instance }})
expr: sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint)
  < 1
for: 0m
labels:
  severity: critical

Meaning #

The CassandraNodeIsUnavailable alert fires when sum(cassandra_endpoint_active), grouped by cluster, instance, and endpoint, drops below 1, i.e. the Instaclustr Cassandra exporter no longer reports the node's endpoint as active. An unavailable node cannot serve its share of reads and writes, so this alert is critical: it indicates a potential problem with data storage and retrieval and can degrade the performance and availability of the whole cluster.
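
To see which endpoints are currently reported as inactive, you can run the alert expression ad hoc against Prometheus. A minimal sketch, assuming the Prometheus server is reachable at localhost:9090 (adjust the URL for your environment):

# Same expression as the alert rule; lists endpoints currently reported inactive
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1'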

Impact #

The unavailability of a Cassandra node can lead to:

  • Data loss, corruption, or inconsistencies if replicas diverge while the node is down
  • Increased latency or timeouts for read and write operations
  • Reduced system performance and availability
  • Impact on business operations and revenue

Diagnosis #

To diagnose the issue, follow these steps (a command sketch follows the list):

  1. Check the node's status with nodetool status or in your monitoring (Prometheus/Grafana); nodes marked DN are down.
  2. Verify whether the node is responding to requests at all, or only with high latency.
  3. Review the Cassandra logs (typically system.log and debug.log) for errors or exceptions around the time the node became unavailable.
  4. Check system resources (CPU, memory, disk space) to make sure they are not exhausted.
  5. Verify that the node is properly configured and that there are no network connectivity issues between it and the rest of the cluster.
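
The checks above can be run from a shell. A minimal sketch, assuming shell access to the affected host, default Cassandra ports, and logs under /var/log/cassandra (paths, ports, and addresses vary by installation; <node-ip> is a placeholder):

# 1. Cluster view: nodes marked "DN" are down
nodetool status

# 2. Is the node answering CQL at all? (default native port 9042)
cqlsh <node-ip> 9042 -e "SELECT release_version FROM system.local;"

# 3. Recent errors and exceptions in the Cassandra log
grep -iE 'error|exception' /var/log/cassandra/system.log | tail -n 50

# 4. Resource checks: disk, memory, load
df -h
free -m
uptime

# 5. Network reachability of the inter-node and native ports from a peer
nc -zv <node-ip> 7000    # storage (inter-node) port
nc -zv <node-ip> 9042    # CQL native port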

Mitigation #

To mitigate the issue, follow these steps (example commands follow the list):

  1. Restart the Cassandra node: If the node is not responding, try restarting it to see if it recovers; drain it first if it is still partially responsive.
  2. Investigate and resolve underlying issues: Identify and fix whatever caused the unavailability, such as resource exhaustion, network connectivity problems, or configuration errors, so the node does not fail again after restart.
  3. Fail over to the remaining replicas: If the node cannot be recovered quickly, let the other replicas in the cluster serve its data, and plan to replace or remove the dead node.
  4. Restore from backup: If data loss has occurred, restore from a backup to ensure data integrity and consistency.
  5. Perform a rolling restart: If the problem affects several nodes (for example after a configuration or version change), restart the cluster one node at a time, confirming each node is healthy before moving on.
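
A sketch of the most common of these actions, assuming a systemd-managed Cassandra service and shell access; the service name, host ID, and hostnames below are placeholders, and the exact procedure should follow your deployment's standards:

# Controlled restart: drain first (flush memtables, stop accepting writes),
# then restart the service and confirm the node rejoins as "UN"
nodetool drain            # skip if the node is completely unresponsive
sudo systemctl restart cassandra
nodetool status

# If the node will not come back, remove it from the ring
# (use the Host ID that "nodetool status" shows for the dead node)
nodetool removenode <host-id>

# Rolling restart: one node at a time, waiting for "UN" before moving on
for host in node1 node2 node3; do
  ssh "$host" 'nodetool drain && sudo systemctl restart cassandra'
  sleep 60
  nodetool status
done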

Remember to follow your organization’s specific procedures and guidelines for incident response and resolution.