MongodbReplicationLag

MongoDB replication lag is more than 10 seconds.

Alert Rule

alert: MongodbReplicationLag
annotations:
  description: |-
    Mongodb replication lag is more than 10s
      VALUE = {{ $value }}
      LABELS = {{ $labels }}    
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/percona-mongodb-exporter/mongodbreplicationlag/
  summary: MongoDB replication lag (instance {{ $labels.instance }})
expr: (mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right
  mongodb_rs_members_optimeDate{member_state="SECONDARY"}) / 1000 > 10
for: 0m
labels:
  severity: critical


Meaning

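The MongodbReplicationLag alert fires when, within one replica set, a secondary's last applied operation time (mongodb_rs_members_optimeDate) trails the primary's by more than 10 seconds. The expression divides the difference by 1000, which implies the metric is reported in milliseconds. Because the rule uses for: 0m, it fires on the first evaluation that crosses the threshold, so brief spikes also trigger it. A firing alert means at least one secondary is not keeping up with the primary, and data written to the primary is taking too long to reach that secondary.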
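The arithmetic behind the alert expression can be sketched in Python. This is only an illustration of the calculation, not how Prometheus evaluates the rule: the Sample container, the member names, and the timestamps are invented for the example, and it assumes one PRIMARY sample per replica set and metric values in epoch milliseconds.

```python
from dataclasses import dataclass

THRESHOLD_SECONDS = 10  # same threshold as the alert expression


@dataclass(frozen=True)
class Sample:
    """One mongodb_rs_members_optimeDate sample (hypothetical container)."""
    rs_set: str        # value of the `set` label
    name: str          # member name; the label name is assumed for illustration
    member_state: str  # "PRIMARY" or "SECONDARY"
    optime_ms: float   # metric value, assumed to be epoch milliseconds


def replication_lag_seconds(samples):
    """Return {(set, secondary): lag in seconds}, matching members on `set`."""
    primaries = {s.rs_set: s.optime_ms for s in samples if s.member_state == "PRIMARY"}
    lags = {}
    for s in samples:
        if s.member_state != "SECONDARY" or s.rs_set not in primaries:
            continue  # without a primary in the same set there is nothing to compare
        lags[(s.rs_set, s.name)] = (primaries[s.rs_set] - s.optime_ms) / 1000
    return lags


def firing(samples, threshold=THRESHOLD_SECONDS):
    """Secondaries whose lag exceeds the threshold."""
    return {k: v for k, v in replication_lag_seconds(samples).items() if v > threshold}


# Invented sample values: mongo-1 is 1s behind, mongo-2 is 25s behind.
samples = [
    Sample("rs0", "mongo-0:27017", "PRIMARY", 1_700_000_030_000),
    Sample("rs0", "mongo-1:27017", "SECONDARY", 1_700_000_029_000),
    Sample("rs0", "mongo-2:27017", "SECONDARY", 1_700_000_005_000),
]
print(firing(samples))  # prints {('rs0', 'mongo-2:27017'): 25.0}
```

Only mongo-2 exceeds the 10 second threshold, so it is the only member that would appear in the alert's result.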

Impact

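If left unaddressed, replication lag can lead to:

Stale reads from secondaries, because they have not yet applied recent writes from the primary
Increased risk of data loss if the primary fails, because writes that have not replicated can be rolled back after a failover
Slower writes for clients that wait for acknowledgement from secondaries (for example with a majority write concern)
Application errors and downtime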

Diagnosis

To diagnose the issue, follow these steps:

Check the MongoDB logs on the primary and on the lagging secondaries for replication errors or warnings
Verify that the replica set is properly configured and that all members are up and healthy, for example with rs.status()
Check network connectivity and latency between the primary and the secondary members
Investigate any recent changes to the MongoDB configuration or application workload, such as bulk writes or index builds
Use the mongodb_rs_members_optimeDate metric to identify which replica set (the set label) and which secondaries are lagging; rs.printSecondaryReplicationInfo() in mongosh reports the lag per member as well
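To cross-check the metric against the replica set itself, the per-member lag can be derived from the replSetGetStatus reply (the document that rs.status() prints). The following is a minimal sketch that works on a dict of that shape; the sample status below is hand-written, and its hostnames and timestamps are invented.

```python
from datetime import datetime, timezone


def member_lags(status):
    """Map each SECONDARY's name to its lag in seconds behind the PRIMARY.

    `status` is a dict shaped like the replSetGetStatus reply: a "members"
    list whose entries carry "name", "stateStr" and "optimeDate".
    """
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        raise ValueError("no PRIMARY in replica set status")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hand-written sample data; hostnames and timestamps are invented.
status = {
    "set": "rs0",
    "members": [
        {"name": "mongo-0:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 45, tzinfo=timezone.utc)},
        {"name": "mongo-1:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 43, tzinfo=timezone.utc)},
        {"name": "mongo-2:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)},
    ],
}

# List the worst-lagging members first.
for name, lag in sorted(member_lags(status).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {lag:.0f}s behind")
```

Against a live deployment, a dict of this shape can be obtained with PyMongo via client.admin.command("replSetGetStatus"), assuming the connecting user is allowed to run that command.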

Mitigation

To mitigate the issue, follow these steps:

Identify and address any underlying network or infrastructure issues causing the lag
Optimize the MongoDB configuration for replication (e.g. make sure the oplog is large enough, via replication.oplogSizeMB or the replSetResizeOplog command, that a lagging secondary does not fall off the end of it)
Increase the resources (e.g. CPU, RAM, disk I/O) of the secondary members so they can keep up with the primary
If secondaries serve reads, consider adding secondary members to spread the read load and leave each member more capacity for applying the oplog
If a secondary has fallen too far behind to catch up from the oplog, resync it by performing a fresh initial sync
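Whether a lagging secondary can still catch up on its own depends on how its lag compares with the primary's oplog window (the "log length start to end" value that rs.printReplicationInfo() reports). The helper below is a rough rule of thumb, not MongoDB behavior: the function name, the returned labels, and the 50% safety margin are all assumptions made for illustration.

```python
def catch_up_action(lag_seconds, oplog_window_seconds, safety_factor=0.5):
    """Suggest a next step for a lagging secondary (hypothetical helper).

    lag_seconds: how far the secondary trails the primary.
    oplog_window_seconds: time span currently covered by the primary's oplog.
    safety_factor: assumed fraction of the window at which to start worrying.
    """
    if oplog_window_seconds <= 0:
        raise ValueError("oplog window must be positive")
    if lag_seconds >= oplog_window_seconds:
        # The entries the secondary needs have probably been overwritten.
        return "resync"
    if lag_seconds >= safety_factor * oplog_window_seconds:
        # Still recoverable, but close to falling off the oplog.
        return "at-risk"
    return "wait"


# Example with a one-hour oplog window (invented numbers).
for lag in (30, 2000, 4000):
    print(lag, catch_up_action(lag, oplog_window_seconds=3600))
```

A result of "at-risk" is a cue to reduce write load or enlarge the oplog before the member becomes stale, while "resync" corresponds to the manual resync step in the list above.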

Remember to investigate and address the root cause of the issue to prevent future occurrences of this alert.