Outage

Postmortem

Incident Report: Outage March 27 & 28

Date/Time of Incidents:
  • March 27, 2025, 6:16 AM: The primary database node stopped responding to the application but continued to respond to the database cluster, which prevented automatic failover.

  • March 28, 2025, 8:19 AM: The primary database node stopped responding to the application but continued to respond to the database cluster, which prevented automatic failover.
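
The failure pattern in both incidents (node unreachable from the application, yet still visible to the cluster) can be illustrated with a small probe. This is only a sketch under stated assumptions, not the monitoring used in production: the function names tcp_probe and classify_primary and the state labels are hypothetical, and how the cluster-side health signal is obtained is left out.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, host unreachable, or timed out
        return False


def classify_primary(app_path_ok: bool, cluster_path_ok: bool) -> str:
    """Combine two independent probe results into a coarse node state (hypothetical labels)."""
    if app_path_ok and cluster_path_ok:
        return "healthy"
    if cluster_path_ok:
        # The cluster still sees the node, but the application cannot reach it.
        # This is the pattern described in this report: failover is not triggered.
        return "partial-failure"
    if app_path_ok:
        return "cluster-link-lost"
    return "down"
```

For example, classify_primary(tcp_probe("db-primary.example.internal", 1433), cluster_path_ok=True) returns "partial-failure" when the application-side probe fails. The hostname is a placeholder; 1433 is the default SQL Server port. Probing from the application's network path, separately from the cluster's own health checks, is what makes this state visible.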

Identification:
  • March 28, 2025, 3:30 PM: The root cause was identified: the differential database backup triggered a partial network failure on the host server.

Resolution:
  • March 28, 2025, 4:00 PM: Differential backup processes were disabled. Weekly full backups and continuous log backups continue as outlined in our Disaster Recovery process: https://www.monnit.com/policy/data-retention/
  • The backup process alone does not freeze the server; our testing also identified two contributing factors. First, host server memory locks up under certain stress tests. Second, the virtual machine network drivers can trigger a VM hang, as observed in the logs.

  • April 2, 2025, 6:00 PM: Maintenance to replace host memory and add network interfaces to the host servers. We will verify that the memory stress tests run successfully after the memory is replaced. The new network interfaces will allow us to create new virtual hosts for SQL Server while the existing cluster keeps running, minimizing downtime during the transition.

  • Mid-April 2025: Maintenance to migrate database processing to a new database cluster. The new cluster will run on new VM instances with updated network interfaces and will be upgraded from SQL Server 2017 to SQL Server 2022 in anticipation of multi-datacenter support.

  • May 2025: Maintenance to transition processing to a new datacenter, providing an additional layer of failover protection.

— END OF REPORT —

Resolved

Services restored, root cause investigation underway.

Investigating

We are investigating the cause of the system outage.

5 Affected Services: