PostgresqlReplicationLagSizeTooLarge #

“Replication lag size on server {{$labels.instance}} ({{$labels.application_name}}) is currently {{ $value | humanize1024}}B behind the leader in cluster {{$labels.cluster_name}}”

Alert Rule

alert: PostgresqlReplicationLagSizeTooLarge
annotations:
  description: |-
    "Replication lag size on server {{$labels.instance}} ({{$labels.application_name}}) is currently {{ $value | humanize1024}}B behind the leader in cluster {{$labels.cluster_name}}"
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/postgres-exporter/postgresqlreplicationlagsizetoolarge/
  summary: Postgresql is more than 1G behind (instance {{ $labels.instance }})
expr: pg_replication_status_lag_size > 1e+09
for: 5m
labels:
  severity: critical

Meaning #

The PostgresqlReplicationLagSizeTooLarge alert fires when pg_replication_status_lag_size for a PostgreSQL standby stays above 1e+09 bytes (about 1 GB) for 5 minutes. The standby has fallen that far behind the primary, so reads served from it return stale data, and a failover to it before it catches up would lose the changes it has not yet received.
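
To make the threshold concrete, here is a minimal sketch of how a lag size in bytes can be derived from two WAL positions (LSNs) and compared with the alert expression. It assumes the exporter computes pg_replication_status_lag_size as a WAL position difference, which this page does not confirm; the function names are hypothetical.

```python
# Sketch only: assumes the metric is a WAL position difference in bytes.
THRESHOLD_BYTES = 1e9  # from the alert expr: pg_replication_status_lag_size > 1e+09


def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby has not yet caught up on."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)


def breaches_threshold(lag: float) -> bool:
    """Mirror the alert comparison (the 'for: 5m' clause is not modelled here)."""
    return lag > THRESHOLD_BYTES


if __name__ == "__main__":
    lag = lag_bytes("16/B374D848", "16/70000000")
    print(lag, breaches_threshold(lag))
```

Note that 1e+09 bytes is about 953.7 MiB, so a value right at the threshold should render through humanize1024 as roughly 953.7MiB rather than 1GiB.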

Impact #

The alert is rated critical because a lagging standby can result in:

Stale or inconsistent data on the standby compared with the primary
Data loss if the primary fails and the standby is promoted before it has caught up
Downtime or stale reads for applications relying on the standby server
Increased risk of data corruption

Diagnosis #

To diagnose the issue, follow these steps:

Check the PostgreSQL logs for any errors or warnings related to replication
Verify the network connectivity between the primary and standby servers
Check the disk space and I/O performance of the standby server
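Review the PostgreSQL configuration files to ensure that the replication settings are correct (for example primary_conninfo on the standby and max_wal_senders on the primary)
Graph the pg_replication_status_lag_size metric over time to see when the lag started and whether it is growing, stable, or shrinking, then correlate that with the findings from the previous steps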
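
The checks above can be narrowed down by looking at where the WAL is stuck. The sketch below splits the total lag into stages using the sent_lsn, write_lsn, flush_lsn and replay_lsn columns of pg_stat_replication. The row layout, the current_lsn alias and the function names are assumptions made for illustration; fetching the rows (for example with a database driver) is left out.

```python
# Sketch only. Rows are assumed to come from a query like this, run on the primary:
#   SELECT application_name, pg_current_wal_lsn() AS current_lsn,
#          sent_lsn, write_lsn, flush_lsn, replay_lsn
#   FROM pg_stat_replication;


def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_breakdown(row: dict) -> dict:
    """Split the total lag into the stages the WAL passes through."""
    current, sent, write, flush, replay = (
        lsn_to_bytes(row[key])
        for key in ("current_lsn", "sent_lsn", "write_lsn", "flush_lsn", "replay_lsn")
    )
    return {
        "not_yet_sent": current - sent,      # backlog on the primary or the network
        "not_yet_written": sent - write,     # in flight or waiting on the standby receiver
        "not_yet_flushed": write - flush,    # standby disk write or fsync pressure
        "not_yet_replayed": flush - replay,  # replay is slow or blocked on the standby
    }


def dominant_stage(row: dict) -> str:
    """Name the stage that accounts for the largest share of the lag."""
    breakdown = lag_breakdown(row)
    return max(breakdown, key=breakdown.get)
```

As a rough guide, a large not_yet_sent value points at the network or the primary, while a large not_yet_flushed or not_yet_replayed value points at disk I/O or replay on the standby.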

Mitigation #

To mitigate the issue, follow these steps:

Check the PostgreSQL replication settings and adjust them as necessary
Verify that the standby server has sufficient disk space and I/O performance
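If the standby is disconnected or its replication connection is stuck, restart the PostgreSQL service on the standby server to re-establish replication
Consider increasing the replication timeouts (for example wal_sender_timeout or wal_receiver_timeout) or adjusting the replication strategy to reduce the lag size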
Implement a backup and restore process to ensure data consistency in case of a failover
Follow the runbook provided in the annotations for detailed instructions on resolving the issue: https://srerun.github.io/prometheus-alerts/runbooks/postgres-exporter/postgresqlreplicationlagsizetoolarge/
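
After applying a mitigation, it helps to know whether the standby is actually catching up. The following is a minimal sketch that estimates the time to catch up from two samples of pg_replication_status_lag_size. It assumes the lag shrinks at a constant rate, which is only an approximation, and the function name is hypothetical.

```python
from typing import Optional


def catch_up_eta_seconds(lag_then: float, lag_now: float, interval_s: float) -> Optional[float]:
    """Estimate seconds until the lag reaches zero, or None if it is not shrinking."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    shrink_rate = (lag_then - lag_now) / interval_s  # bytes per second
    if shrink_rate <= 0:
        return None  # lag is flat or growing: the mitigation has not helped yet
    return lag_now / shrink_rate
```

A result of None means the lag is flat or still growing, which suggests returning to the diagnosis steps rather than waiting.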