StoreConnectionIsTooSlow

Store connection is too slow to {{$labels.pool}} pool, {{$labels.shard}} shard in Graph node {{$labels.instance}}

Alert Rule
```yaml
- alert: StoreConnectionIsTooSlow
  annotations:
    description: |-
      Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`
        VALUE = {{ $value }}
        LABELS = {{ $labels }}
    runbook: https://srerun.github.io/prometheus-alerts/runbooks/graph-node-internal/storeconnectionistooslow/
    summary: Store connection is too slow (instance {{ $labels.instance }})
  expr: store_connection_wait_time_ms > 10
  for: 0m
  labels:
    severity: warning
- alert: StoreConnectionIsTooSlow
  annotations:
    description: |-
      Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`
        VALUE = {{ $value }}
        LABELS = {{ $labels }}
    runbook: https://srerun.github.io/prometheus-alerts/runbooks/graph-node-internal/storeconnectionistooslow/
    summary: Store connection is too slow (instance {{ $labels.instance }})
  expr: store_connection_wait_time_ms > 20
  for: 0m
  labels:
    severity: critical
```

Here is a runbook for the Prometheus alert rule “StoreConnectionIsTooSlow”:

Meaning

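The “StoreConnectionIsTooSlow” alert is triggered when the store_connection_wait_time_ms metric, the wait time for a store connection in a given pool and shard, exceeds 10 milliseconds (severity warning) or 20 milliseconds (severity critical). This indicates that the Graph node is slow to obtain connections to the store, which can degrade its performance. Both rules have a for duration of 0m, so a single evaluation above the threshold fires the alert, and above 20 milliseconds the warning and critical alerts fire at the same time.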
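To make the two severity tiers concrete, here is a minimal Python sketch that applies the same strict ">" comparisons as the two rules to a few sample values. `RULES` and `firing_severities` are hypothetical names used only for illustration; the sketch ignores Prometheus evaluation intervals, staleness, and label matching.

```python
# Thresholds copied from the two StoreConnectionIsTooSlow rules (milliseconds).
# Hypothetical illustration: real evaluation happens inside Prometheus.
RULES = [
    ("warning", 10.0),
    ("critical", 20.0),
]


def firing_severities(wait_time_ms: float) -> list[str]:
    """Return the severities whose rule would fire for one sample.

    Each rule is an independent strict '>' comparison with for: 0m,
    so one evaluation above the threshold is enough to fire it.
    """
    return [severity for severity, threshold in RULES if wait_time_ms > threshold]


for sample in (8.0, 10.0, 15.0, 25.0):
    print(sample, firing_severities(sample))
```

Because the rules are independent, a value of exactly 10 ms fires nothing, 15 ms fires only the warning, and 25 ms fires both the warning and the critical alert unless an Alertmanager inhibition suppresses one of them.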

Impact

A slow store connection can cause:

  • Delays in data processing and querying
  • Increased latency in the Graph node
  • Possible timeouts or failed operations if connections cannot be obtained in time
  • Impact on overall system performance and availability

Diagnosis

To diagnose the issue, follow these steps:

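  1. Use the instance, pool, and shard labels on the alert to identify the affected Graph node and connection pool.
  2. Check the current value of the store_connection_wait_time_ms metric for that pool and shard, compare it with the 10 ms (warning) and 20 ms (critical) thresholds, and note whether it is a short spike or a sustained increase.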
  3. Investigate recent changes in the store configuration, network connectivity, or resource utilization that may be contributing to the slow connection.
  4. Review system logs for errors or warnings related to the store connection.
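Steps 1 and 2 can be scripted against the standard Prometheus HTTP API instant-query endpoint (/api/v1/query). This is a minimal sketch, assuming Prometheus scrapes the Graph node metrics and is reachable at the hypothetical `PROMETHEUS_URL`; the helper names `query_prometheus` and `slow_pools` are illustrative, not part of any Graph node tooling.

```python
import json
import urllib.parse
import urllib.request

# Assumption: adjust to wherever your Prometheus server is reachable.
PROMETHEUS_URL = "http://localhost:9090"


def query_prometheus(base_url: str, promql: str) -> dict:
    """Run an instant query via the Prometheus HTTP API (/api/v1/query)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def slow_pools(response: dict) -> list[tuple[str, str, str, float]]:
    """Extract (instance, pool, shard, wait_ms) rows, slowest first."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    rows = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        # Instant-vector samples are [unix_timestamp, "value-as-string"].
        wait_ms = float(series["value"][1])
        rows.append((
            labels.get("instance", "?"),
            labels.get("pool", "?"),
            labels.get("shard", "?"),
            wait_ms,
        ))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

For example, slow_pools(query_prometheus(PROMETHEUS_URL, "store_connection_wait_time_ms > 10")) lists every pool and shard currently above the warning threshold, slowest first. Parsing is kept separate from the HTTP call so it can be checked against a saved query response.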

Mitigation

To mitigate the issue, follow these steps:

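  1. Review the store connection configuration, such as the connection pool size for the affected pool and shard, and adjust it if the pool is undersized for the current load.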
  2. Verify that the network connectivity between the Graph node and the store is stable and not congested.
  3. Consider increasing resources (e.g., CPU, memory) for the Graph node or store to improve performance.
  4. If the issue persists, consider restarting the affected Graph node or store component.
  5. If none of the above steps resolve the issue, escalate to a senior engineer or expert for further assistance.
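For mitigation step 2, one quick check is to time raw TCP handshakes from the Graph node host to the database. This Python sketch assumes the store is directly reachable over TCP; the host and port are placeholders (5432 is only PostgreSQL's default port), and `tcp_connect_latency_ms` and `summarize` are hypothetical helpers. It measures network round trips only, not the connection pool wait that the alert reports, so it can only rule the network in or out as a contributor.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host: str, port: int, attempts: int = 5,
                           timeout: float = 2.0) -> list[float]:
    """Time `attempts` TCP handshakes to host:port, in milliseconds.

    Does not authenticate to PostgreSQL or touch Graph node's connection
    pool; socket.create_connection raises OSError on timeout or refusal.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection established; close immediately
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: list[float]) -> dict:
    """Return min / median / max of the latency samples."""
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

For example, run summarize(tcp_connect_latency_ms("db.example.internal", 5432)) from the Graph node host, where the hostname is a placeholder. If the handshake alone takes a large share of the 10 ms warning threshold, the network is a likely contributor; if it is consistently fast, focus on pool sizing and database load instead.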