ProviderFailedBecauseNet_versionTimeout

net_version timeout for Provider {{$labels.provider}} in Graph node {{$labels.instance}}

Alert Rule
alert: ProviderFailedBecauseNet_versionTimeout
annotations:
  description: |-
    net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/graph-node-internal/providerfailedbecausenet_versiontimeout/
  summary: Provider failed because net_version timeout (instance {{ $labels.instance }})
expr: eth_rpc_status == 3
for: 0m
labels:
  severity: critical

Meaning

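The ProviderFailedBecauseNet_versionTimeout alert fires when the eth_rpc_status metric equals 3 for a provider, which indicates that the provider failed because its net_version call timed out. Because the rule uses for: 0m, the alert fires on the first evaluation in which the condition holds, with no pending period. Its severity is critical, so it requires immediate attention.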
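
The alert condition can also be checked by hand by running the same expression against the Prometheus HTTP API. The following is a minimal Python sketch rather than part of the rule itself: the function names and the example Prometheus address are hypothetical, and it assumes the standard /api/v1/query instant-query endpoint is reachable from where the script runs.

```python
import json
import urllib.parse
import urllib.request

# Same expression as the alert rule.
QUERY = "eth_rpc_status == 3"


def build_query_url(prometheus_url, query=QUERY):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + params


def affected_providers(payload):
    """Return (instance, provider) pairs from a decoded instant-query response."""
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error", "unknown error")))
    pairs = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # Fall back to "?" if a label is missing from the series.
        pairs.append((labels.get("instance", "?"), labels.get("provider", "?")))
    return pairs


def fetch_affected(prometheus_url, timeout=10.0):
    """Query Prometheus and list the providers currently matching the alert."""
    with urllib.request.urlopen(build_query_url(prometheus_url), timeout=timeout) as resp:
        return affected_providers(json.load(resp))
```

For example, fetch_affected("http://localhost:9090") (a placeholder address) returns one pair per provider that currently matches the expression. An empty list means no provider reports the value 3 at the moment of the query.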

Impact

A provider in this state cannot reliably serve requests from Graph node, which can lead to:

  • Disruption of critical services dependent on the provider
  • Loss of data or transactions
  • Inconsistent network state
  • Inability to retrieve or update information from the provider

Diagnosis

To diagnose the issue, follow these steps:

  1. Query the eth_rpc_status metric for the provider and instance named in the alert labels to confirm that it still returns 3.
  2. Search the Graph node and provider logs for errors related to net_version timeouts.
  3. Verify that the provider is configured correctly and is reachable from the Graph node host, and that there are no issues with the underlying network or infrastructure.
  4. Check the Grafana dashboard for any other related alerts or issues.
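
The checks above can be supplemented by sending a net_version JSON-RPC request to the provider directly and timing the response. This is a rough sketch under stated assumptions: the provider is assumed to expose JSON-RPC over HTTP, the function names and the URL in the usage note are hypothetical, and the 5-second timeout is an arbitrary illustration, not the limit that Graph node itself applies.

```python
import json
import socket
import time
import urllib.request

# Standard JSON-RPC 2.0 request for the net_version method.
NET_VERSION_REQUEST = {"jsonrpc": "2.0", "method": "net_version", "params": [], "id": 1}

# socket.timeout is an alias of TimeoutError on newer Python versions.
TIMEOUT_ERRORS = (socket.timeout, TimeoutError)


def interpret_response(body):
    """Classify a decoded JSON-RPC response as ("ok", network_id) or ("error", detail)."""
    if isinstance(body, dict) and "result" in body:
        return ("ok", str(body["result"]))
    if isinstance(body, dict) and "error" in body:
        return ("error", json.dumps(body["error"]))
    return ("error", "unexpected response: " + json.dumps(body))


def probe_net_version(rpc_url, timeout=5.0, opener=urllib.request.urlopen):
    """Call net_version once and return (status, detail, elapsed_seconds).

    status is "ok", "timeout", or "error". `opener` is injectable for testing.
    """
    request = urllib.request.Request(
        rpc_url,
        data=json.dumps(NET_VERSION_REQUEST).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    started = time.monotonic()
    try:
        with opener(request, timeout=timeout) as response:
            status, detail = interpret_response(json.load(response))
    except OSError as exc:
        # URLError (an OSError) may wrap the underlying timeout in `reason`.
        reason = getattr(exc, "reason", None)
        timed_out = isinstance(exc, TIMEOUT_ERRORS) or isinstance(reason, TIMEOUT_ERRORS)
        status, detail = ("timeout" if timed_out else "error"), str(exc)
    except ValueError as exc:
        # The response body was not valid JSON.
        status, detail = "error", "invalid JSON: " + str(exc)
    return (status, detail, time.monotonic() - started)
```

Running probe_net_version("http://provider.example:8545") (a placeholder URL) from the Graph node host helps separate a slow or unreachable provider from a problem inside Graph node. A "timeout" status from the same host reproduces the condition behind the alert, while an "ok" status with a long elapsed time suggests the provider is responding but close to the limit.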

Mitigation

To mitigate the issue, follow these steps:

  1. Restart the provider service to attempt to recover from the timeout.
  2. Check the provider configuration and adjust any settings that may be contributing to the timeout.
  3. Investigate and resolve any underlying network or infrastructure issues.
  4. If the issue persists, consider increasing the timeout value or implementing retry logic to improve the provider’s resilience.
  5. Notify the relevant teams and stakeholders of the issue and ensure that it is being actively worked on.
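
The retry logic mentioned in step 4 could take the form of retries with exponential backoff around whichever call is timing out. The sketch below is a generic illustration, not Graph node's own behavior: the function name, the attempt count, and the delays are hypothetical defaults, and where such logic would live (for example, in a client or a proxy in front of the provider) depends on the deployment.

```python
import time


def call_with_retries(call, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run `call()` until it succeeds or `attempts` timeouts have occurred.

    Only TimeoutError is retried; any other exception propagates immediately.
    `sleep` is injectable so the backoff can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last timeout
            # Delay doubles each round (base, 2*base, 4*base, ...), capped at max_delay.
            sleep(min(base_delay * (2 ** attempt), max_delay))
```

Capping the delay keeps a long outage from stretching the wait between attempts indefinitely, and limiting retries to timeouts avoids masking configuration errors that a retry cannot fix.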

The alert’s annotations and labels identify the affected provider and instance; use them to scope each of the steps above.