March 12, 2025, 12:53 PM: The active database node experienced a loss of network connectivity due to broken ARP tables.
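For context on what broken ARP tables can look like in practice, a minimal diagnostic sketch follows. It assumes a Linux host with the ip utility and an interface named eth0; neither of those details is confirmed by this report, and this is illustrative rather than the procedure the technicians used.

    # Illustrative sketch only: assumes a Linux host with the `ip` utility
    # and an interface named eth0 (an assumption, not the actual host config).
    import subprocess

    def unhealthy_neighbours(interface="eth0"):
        """Return neighbour-table entries in FAILED or INCOMPLETE state for one interface."""
        result = subprocess.run(
            ["ip", "neigh", "show", "dev", interface],
            capture_output=True, text=True, check=True,
        )
        return [line for line in result.stdout.splitlines()
                if "FAILED" in line or "INCOMPLETE" in line]

    if __name__ == "__main__":
        bad = unhealthy_neighbours()
        print(f"{len(bad)} unhealthy neighbour entries")
        for entry in bad:
            print("  " + entry)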
March 12, 2025, 6:47 PM: While technicians were rebuilding the host configurations, all processing servers were migrated to other hosts. The primary load balancer tied to the affected host was unavailable, and the secondary load balancer, which was managing traffic at the time, stopped communicating with the servers.
March 12, 2025, 7:48 PM: Once the incorrect settings were discovered, all hosts were rebooted. Services were restored with updated network configurations, and virtual hosts were rebuilt, with virtual network adapters replaced as needed. The load balancer was migrated to a new host to manage network traffic.
March 12, 2025, 9:02 PM: After service restoration, a large number of gateways rejoined the system with their full backlogs of data, overrunning the TCP queues. Only TCP-based gateways were affected, because queue processing took longer than the gateways' wait timeout, causing each exchange to be registered as a failed communication sequence.
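To illustrate the timing interaction described above, here is a minimal sketch; every figure in it (timeout, per-message service time, queue depths) is an assumed placeholder rather than a value measured on the affected system.

    # Illustrative timing model only; all figures below are assumed placeholders,
    # not values measured on the affected system.
    GATEWAY_TIMEOUT_S = 5.0        # assumed time a gateway waits for an acknowledgement
    PER_MESSAGE_SERVICE_S = 0.02   # assumed time to process one queued message
    NORMAL_QUEUE_DEPTH = 100       # assumed steady-state TCP queue depth
    BACKLOG_QUEUE_DEPTH = 50_000   # assumed depth after gateways replayed their backlogs

    def drain_time(depth):
        """Time needed to work through a queue of the given depth."""
        return depth * PER_MESSAGE_SERVICE_S

    for label, depth in (("steady state", NORMAL_QUEUE_DEPTH),
                         ("post-restoration backlog", BACKLOG_QUEUE_DEPTH)):
        t = drain_time(depth)
        verdict = ("acknowledged within the wait window" if t <= GATEWAY_TIMEOUT_S
                   else "gateway gives up and records a failed communication sequence")
        print(f"{label}: queue drains in {t:.1f}s vs {GATEWAY_TIMEOUT_S:.0f}s timeout -> {verdict}")

With these assumed numbers, the backlog drains far more slowly than the gateways are willing to wait, which matches the observation that only the TCP-based gateways registered failed communication sequences.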
March 13, 2025, 1:15 AM: Unrelated to the previous networking issues, a hardware error occurred in the network storage array cache controller.
March 13, 2025, 2:59 AM: Services were restored after the network storage array failed over to its paired array, data was recovered, and all machines were restarted with their storage available.
— END OF REPORT —
This latest outage was found to have been separately triggered by a hardware failure in the network storage array cache controller.
Storage was migrated to the paired array and service was restored. Technicians have repaired the failed array, and it is back in service as the secondary.
Technicians are investigating an outage.
Services have been restored, but we are still working to restore full redundancy.
The networking rebuild following this morning's disruption is nearly finished, but we are currently seeing another issue.