March 12, 2025, 12:53 PM: The active database node experienced a loss of network connectivity due to broken ARP tables.
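For context on what broken ARP tables can look like in practice, a minimal diagnostic sketch follows. It assumes a Linux host with the ip utility and an interface named eth0; neither of those details is confirmed by this report, and this is illustrative rather than the procedure the technicians used.

    # Illustrative sketch only: assumes a Linux host with the `ip` utility
    # and an interface named eth0 (an assumption, not the actual host config).
    import subprocess

    def unhealthy_neighbours(interface="eth0"):
        """Return neighbour-table entries in FAILED or INCOMPLETE state for one interface."""
        result = subprocess.run(
            ["ip", "neigh", "show", "dev", interface],
            capture_output=True, text=True, check=True,
        )
        return [line for line in result.stdout.splitlines()
                if "FAILED" in line or "INCOMPLETE" in line]

    if __name__ == "__main__":
        bad = unhealthy_neighbours()
        print(f"{len(bad)} unhealthy neighbour entries")
        for entry in bad:
            print("  " + entry)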
March 12, 2025, 6:47 PM: While technicians were rebuilding the host configurations, all processing servers were migrated to other hosts. The primary load balancer tied to the affected host was unavailable, and the secondary load balancer, which was managing traffic at the time, stopped communicating with the servers.
March 12, 2025, 7:48 PM: Once the incorrect settings were discovered, all hosts were rebooted. Services were restored with updated network configurations, and virtual hosts were rebuilt, with virtual network adapters replaced as needed. The load balancer was migrated to a new host to manage network traffic.
March 12, 2025, 9:02 PM: After service restoration, a large number of gateways rejoined the system with their full backlogs of data, overrunning the TCP queues. Only TCP-based gateways were affected, because queue processing took longer than the gateways' wait timeout, causing each exchange to be registered as a failed communication sequence.
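To illustrate the timing interaction described above, here is a minimal sketch; every figure in it (timeout, per-message service time, queue depths) is an assumed placeholder rather than a value measured on the affected system.

    # Illustrative timing model only; all figures below are assumed placeholders,
    # not values measured on the affected system.
    GATEWAY_TIMEOUT_S = 5.0        # assumed time a gateway waits for an acknowledgement
    PER_MESSAGE_SERVICE_S = 0.02   # assumed time to process one queued message
    NORMAL_QUEUE_DEPTH = 100       # assumed steady-state TCP queue depth
    BACKLOG_QUEUE_DEPTH = 50_000   # assumed depth after gateways replayed their backlogs

    def drain_time(depth):
        """Time needed to work through a queue of the given depth."""
        return depth * PER_MESSAGE_SERVICE_S

    for label, depth in (("steady state", NORMAL_QUEUE_DEPTH),
                         ("post-restoration backlog", BACKLOG_QUEUE_DEPTH)):
        t = drain_time(depth)
        verdict = ("acknowledged within the wait window" if t <= GATEWAY_TIMEOUT_S
                   else "gateway gives up and records a failed communication sequence")
        print(f"{label}: queue drains in {t:.1f}s vs {GATEWAY_TIMEOUT_S:.0f}s timeout -> {verdict}")

With these assumed numbers, the backlog drains far more slowly than the gateways are willing to wait, which matches the observation that only the TCP-based gateways registered failed communication sequences.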
March 13, 2025, 1:15 AM: Unrelated to the previous networking issues, a hardware error occurred in the network storage array cache controller.
March 13, 2025, 2:59 AM: Services were restored after the network storage array failed over to its paired array, data was recovered, and all machines were restarted with their storage available.
— END OF REPORT —
This latest outage was found to have been separately triggered by a hardware failure in the network storage array cache controller.
Storage was migrated to the paired array and service was restored. Technicians have repaired the failed array, and it is back in service as the secondary.
Technicians are investigating an outage.
Services have been restored, but we are still working to restore full redundancy.
The networking rebuild following this morning's disruption is nearly finished, but we are currently seeing another issue.