ApcUpsLessThan15MinutesOfBatteryTimeRemaining #
Battery is almost empty (< 15 Minutes remaining)
Alert Rule
alert: ApcUpsLessThan15MinutesOfBatteryTimeRemaining
annotations:
  description: |-
    Battery is almost empty (< 15 Minutes remaining)
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  runbook: https://srerun.github.io/prometheus-alerts/runbooks/apcupsd_exporter/apcupslessthan15minutesofbatterytimeremaining/
  summary: APC UPS Less than 15 Minutes of battery time remaining (instance {{ $labels.instance }})
expr: apcupsd_battery_time_left_seconds < 900
for: 0m
labels:
  severity: critical
Meaning #
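This alert fires when the estimated battery runtime reported by the APC UPS (apcupsd_battery_time_left_seconds) drops below 900 seconds, i.e. 15 minutes. Because the rule uses for: 0m, it fires on the first evaluation that sees a value below the threshold. It is rated critical because the systems behind the UPS lose power once the battery is exhausted, which can cause unplanned downtime or data loss.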
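To make the threshold concrete, the following is a minimal Python sketch of the same comparison the alert expression performs, applied to metrics text scraped from the exporter. The function name firing_series, the sample text, and the ups label are hypothetical and used only for illustration; real label sets depend on the exporter version, and the parser assumes samples carry no trailing timestamp.

```python
ALERT_THRESHOLD_SECONDS = 15 * 60  # 900, the value used in the alert expr
METRIC = "apcupsd_battery_time_left_seconds"


def firing_series(metrics_text, threshold=ALERT_THRESHOLD_SECONDS):
    """Return (series, value) pairs whose time-left value is below threshold."""
    firing = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and unrelated metrics.
        if not line or line.startswith("#") or not line.startswith(METRIC):
            continue
        # Reject longer metric names that merely share the prefix.
        if line[len(METRIC)] not in "{ ":
            continue
        # Assumption: "series value" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        if float(value) < threshold:
            firing.append((series, float(value)))
    return firing


# Illustrative sample only, not real exporter output.
SAMPLE = """\
# TYPE apcupsd_battery_time_left_seconds gauge
apcupsd_battery_time_left_seconds{ups="rack-a"} 840
apcupsd_battery_time_left_seconds{ups="rack-b"} 2700
"""

if __name__ == "__main__":
    for series, value in firing_series(SAMPLE):
        print(f"{series} -> {value / 60:.0f} minutes left")
```

Note that the comparison is strict: a reading of exactly 900 seconds does not fire, which matches the < operator in the rule.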
Impact #
- Unplanned downtime or data loss if the battery runs out before input power is restored
- Potential loss of business-critical services or operations
- Increased risk of equipment damage or data corruption
Diagnosis #
To diagnose the issue, follow these steps:
- Check the UPS status, estimated runtime, and battery charge on the apcupsd exporter dashboard or with the apcupsd CLI (apcaccess status).
- Verify whether the UPS is receiving power from the grid or running on battery, and that it is configured correctly.
- Check for any signs of UPS malfunction or failure, such as alarms or error messages.
- Review system logs for any errors or warnings related to the UPS or power supply.
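The first two checks can be sketched in Python by parsing the "KEY : value" lines that apcaccess status prints. This is a minimal sketch under stated assumptions, not a definitive tool: the helper names (parse_apcaccess, summarize, read_live) and the sample text are hypothetical, and the set of fields actually present varies by UPS model.

```python
import subprocess


def parse_apcaccess(text):
    """Turn 'KEY : value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields):
    """Pull out the fields most relevant to this alert."""
    def number(key):
        # Values look like '13.2 Minutes'; keep only the leading number.
        try:
            return float(fields.get(key, "").split()[0])
        except (IndexError, ValueError):
            return None

    return {
        "status": fields.get("STATUS"),
        "on_battery": "ONBATT" in fields.get("STATUS", ""),
        "time_left_min": number("TIMELEFT"),
        "charge_pct": number("BCHARGE"),
        "load_pct": number("LOADPCT"),
        "line_volts": number("LINEV"),
    }


def read_live():
    """Query the local apcupsd daemon; requires apcaccess on PATH."""
    out = subprocess.run(
        ["apcaccess", "status"], capture_output=True, text=True, check=True
    )
    return summarize(parse_apcaccess(out.stdout))


# Illustrative sample only, not captured from a real UPS.
SAMPLE = """\
STATUS   : ONBATT
LINEV    : 0.0 Volts
LOADPCT  : 38.0 Percent
BCHARGE  : 41.0 Percent
TIMELEFT : 13.2 Minutes
"""

if __name__ == "__main__":
    print(summarize(parse_apcaccess(SAMPLE)))
```

If on_battery is true, the low runtime most likely reflects a loss of input power. If it is false while time_left_min is still low, the battery itself or the attached load deserves a closer look.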
Mitigation #
To mitigate the issue, follow these steps:
- Switch to a backup power source: If one is available, such as a generator or secondary UPS, move the load to it immediately to prevent downtime or data loss.
- Notify system administrators and stakeholders: Inform the relevant teams of the issue and of any expected downtime.
- Replace or repair the UPS: If the UPS or its battery is found to be faulty, arrange for replacement or repair as soon as possible so the system keeps a reliable power supply.
- Perform a thorough system check: Once the UPS is replaced or repaired, perform a thorough system check to ensure that all systems are functioning correctly and data integrity is maintained.
Remember to also investigate the root cause of the issue to prevent future occurrences.