PostgresqlReplicationLagSizeTooLarge #
“Replication lag size on server {{$labels.instance}} ({{$labels.application_name}}) is currently {{ $value | humanize1024}}B behind the leader in cluster {{$labels.cluster_name}}”
Alert Rule
alert: PostgresqlReplicationLagSizeTooLarge
annotations:
  description: |-
    "Replication lag size on server {{$labels.instance}} ({{$labels.application_name}}) is currently {{ $value | humanize1024}}B behind the leader in cluster {{$labels.cluster_name}}"
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/postgres-exporter/postgresqlreplicationlagsizetoolarge/
  summary: Postgresql is more than 1G behind (instance {{ $labels.instance }})
expr: pg_replication_status_lag_size > 1e+09
for: 5m
labels:
  severity: critical
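The threshold in the expression is a decimal value (1e+09 bytes), while the description renders the value with `humanize1024`, which uses binary prefixes. The Python sketch below approximates that formatting to show roughly what the annotation reads at the threshold. It is an illustrative re-implementation based on how the Prometheus template function is understood to behave, not the Prometheus code itself.

```python
import math

# Binary prefixes assumed to match the ones Prometheus' humanize1024 uses.
_PREFIXES = ["ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi"]


def humanize1024(value: float) -> str:
    """Approximate the humanize1024 template function (illustrative only)."""
    if abs(value) <= 1 or math.isnan(value) or math.isinf(value):
        return "%.4g" % value
    prefix = ""
    for candidate in _PREFIXES:
        if abs(value) < 1024:
            break
        prefix = candidate
        value /= 1024
    return "%.4g%s" % (value, prefix)


THRESHOLD_BYTES = 1e9  # taken from the alert expression

if __name__ == "__main__":
    # The annotation appends a literal "B" after the humanized value.
    print(humanize1024(THRESHOLD_BYTES) + "B")
```

Under these assumptions, a value exactly at the threshold is rendered as 953.7MiB, so an alert description showing slightly less than "1Gi" is still above the 1e+09 byte limit.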
Meaning #
The PostgresqlReplicationLagSizeTooLarge alert fires when pg_replication_status_lag_size for a PostgreSQL standby stays above 1e+09 bytes (about 1 GB) for 5 minutes. This means the standby is receiving or applying changes more slowly than the primary generates them. Reads from the standby return stale data, and a failover to it could lose the changes it has not yet received.
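Lag size of this kind is usually the byte distance between two WAL positions (LSNs), which PostgreSQL itself can compute with `pg_wal_lsn_diff`. The Python sketch below shows the same arithmetic, assuming the metric is derived from such a difference (the exporter query that defines `pg_replication_status_lag_size` is not shown in this runbook). The function names and LSN values are made up for illustration.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '2/40000000' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and the
    low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby is behind the primary."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)


def exceeds_threshold(primary_lsn: str, standby_lsn: str,
                      threshold: float = 1e9) -> bool:
    """True if the lag is above the alert threshold (1e+09 bytes by default)."""
    return lag_bytes(primary_lsn, standby_lsn) > threshold


if __name__ == "__main__":
    # Hypothetical positions: the standby is 0x40000000 bytes (1 GiB) behind.
    print(lag_bytes("2/40000000", "2/00000000"))
    print(exceeds_threshold("2/40000000", "2/00000000"))
```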
Impact #
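The alert is rated critical because a standby that is far behind the primary can cause:
- Stale reads: queries against the standby return data that lags the primary
- Data loss on failover: changes not yet replicated are lost if the standby is promoted
- Downtime or degraded service for applications that rely on the standby
- A standby that can no longer catch up and must be rebuilt, if the primary removes WAL the standby still needs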
Diagnosis #
To diagnose the issue, follow these steps:
- Check the PostgreSQL logs for any errors or warnings related to replication
- Verify the network connectivity between the primary and standby servers
- Check the disk space and I/O performance of the standby server
- Review the PostgreSQL configuration files to ensure that the replication settings are correct
- Use the pg_replication_status_lag_size metric to track whether the lag is growing, steady, or shrinking while you investigate
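One way to narrow down the checks above is to compare the WAL positions that the `pg_stat_replication` view on the primary reports for each standby. The sketch below is an illustration in Python: the SQL uses standard PostgreSQL 10+ columns and functions, while `classify_lag` and its hint texts are hypothetical helpers, not part of any tool. Given the per-stage byte gaps from one result row, it points at the stage where most of the lag sits.

```python
from typing import Dict, Optional, Tuple

# Run on the primary (PostgreSQL 10 or later); one row per connected standby.
STAGE_LAG_SQL = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_gap,
       pg_wal_lsn_diff(sent_lsn, flush_lsn)            AS flush_gap,
       pg_wal_lsn_diff(flush_lsn, replay_lsn)          AS replay_gap
FROM pg_stat_replication;
"""

# Hypothetical hints that map each stage to the diagnosis step to look at first.
HINTS = {
    "send_gap": "WAL generated but not yet sent: check the network and WAL sender",
    "flush_gap": "WAL sent but not yet flushed on the standby: check network "
                 "latency and standby disk I/O",
    "replay_gap": "WAL flushed but not yet replayed: check replay conflicts "
                  "and standby load",
}


def classify_lag(row: Dict[str, Optional[float]]) -> Tuple[str, str]:
    """Return (stage, hint) for the stage with the largest byte gap.

    NULL gaps (None) are treated as zero.
    """
    stage = max(HINTS, key=lambda name: row[name] or 0)
    return stage, HINTS[stage]


if __name__ == "__main__":
    # Example row with made-up numbers: most of the lag is in replay.
    example = {"send_gap": 0, "flush_gap": 4096, "replay_gap": 2_000_000_000}
    print(classify_lag(example))
```

Splitting the lag by stage is a design choice to map the numbers back to the checklist: a large send gap points toward network or primary-side issues, while a large replay gap points toward the standby itself.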
Mitigation #
To mitigate the issue, follow these steps:
- Check the PostgreSQL replication settings and adjust them as necessary
- Verify that the standby server has sufficient disk space and I/O performance
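- If replication has stopped, restart the PostgreSQL service on the standby server to re-establish it
- Consider increasing the replication timeouts (for example wal_sender_timeout on the primary or wal_receiver_timeout on the standby) or adjusting the replication strategy to reduce the lag size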
- Implement a backup and restore process to ensure data consistency in case of a failover
- Follow the runbook provided in the annotations for detailed instructions on resolving the issue: https://srerun.github.io/prometheus-alerts/runbooks/postgres-exporter/postgresqlreplicationlagsizetoolarge/
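After applying a mitigation, it helps to confirm that the lag is actually shrinking rather than waiting for the alert to clear. A minimal Python sketch, assuming you have two readings of `pg_replication_status_lag_size` taken a known number of seconds apart; the function name and sample values are illustrative, and the estimate assumes the drain rate stays constant.

```python
from typing import Optional


def seconds_to_catch_up(lag_before: float, lag_after: float,
                        interval_s: float) -> Optional[float]:
    """Estimate seconds until the lag reaches zero.

    Returns None if the lag is not shrinking between the two readings.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    drain_rate = (lag_before - lag_after) / interval_s  # bytes recovered per second
    if drain_rate <= 0:
        return None
    return lag_after / drain_rate


if __name__ == "__main__":
    # Made-up readings: 3.0 GB of lag, then 2.4 GB one minute later.
    eta = seconds_to_catch_up(3.0e9, 2.4e9, 60)
    print("not catching up" if eta is None else f"about {eta:.0f} s to catch up")
```

A result of None means the standby is holding steady or falling further behind, which suggests the mitigation has not addressed the cause yet.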