Delayed Data Availability

Resolved
Updated

Incident Report: Windows Update - Sensor Data Write Failure
Date/Time of Incident:

  • 12/11/2024, 12:54 AM: Windows Updates ran automatically, causing the primary database node to reboot. As expected, database processing shifted to the secondary node. However, the secondary node was not saving to the sensor data table.
  • 12/11/2024, 6:12 AM: The issue was identified by system administrators.
  • 12/11/2024, 6:15 AM: Administrators convened and determined the cause of the processing failure. They reverted processing back to the primary node.

Resolution:

  • 12/11/2024, 6:15 AM: Processing was restored to the primary node, and sensor data resumed saving to the database.
  • 12/11/2024, 6:15 AM – 7:55 AM: Logs from the affected timeframe were identified and prepared for reprocessing.
  • 12/11/2024, 7:55 AM – 10:10 AM: The affected logs were reprocessed, fully restoring sensor data into the database.
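
The identification and reprocessing steps above can be illustrated with a minimal sketch. It assumes each logged message is a dictionary with a unique `message_id` and a `received_at` timestamp, and that the target store can be modelled as a mapping keyed by message ID. These field names and both helper functions are hypothetical and do not describe the actual tooling used during the incident. The window bounds come from the timeline above: 12:54 AM (failover) to 6:15 AM (processing restored).

```python
from datetime import datetime

# Outage window from the incident timeline (local time; no timezone was stated).
WINDOW_START = datetime(2024, 12, 11, 0, 54)
WINDOW_END = datetime(2024, 12, 11, 6, 15)


def select_for_replay(entries, start=WINDOW_START, end=WINDOW_END):
    """Return entries whose timestamp falls in [start, end), oldest first."""
    in_window = [e for e in entries if start <= e["received_at"] < end]
    return sorted(in_window, key=lambda e: e["received_at"])


def replay(entries, store):
    """Write entries into `store` keyed by message ID.

    Entries whose ID is already present are skipped, so running the
    replay twice does not create duplicate rows.
    """
    written = 0
    for entry in entries:
        if entry["message_id"] not in store:
            store[entry["message_id"]] = entry
            written += 1
    return written
```

Keying the writes by message ID makes the replay idempotent: a reprocessing job that is interrupted can simply be restarted from the beginning of the window.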

Actions Taken:

  1. Disable Automatic Updates (12/11/2024):
    • Windows Updates were disabled, and Group Policies are being reviewed to prevent automatic updates in the future.
  2. Enhanced Monitoring (12/11/2024):
    • New automated alerts were added to monitor data volume in specific tables, along with system responsiveness.
    • Enhanced logging was applied to inactive nodes to match the active node monitoring of database transactions.
  3. Data Restoration (12/11/2024):
    • All affected data was successfully restored to the database.
  4. Database Configuration Review (12/11/2024, 12/18/2024):
    • A review of database settings, table permissions, and network configurations for all nodes will occur. Adjustments and validations are planned during the scheduled maintenance window on Wednesday 12/18.
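
As an illustration of the data-volume alerts described under Enhanced Monitoring, the sketch below shows two simple checks: one flags a table that has received no writes for longer than an allowed gap, and one flags a row count that has fallen well below a baseline. The function names, the 10-minute gap, and the 50% ratio are assumptions chosen for the example, not the thresholds actually deployed.

```python
from datetime import datetime, timedelta


def is_stalled(last_write_at, now, max_gap=timedelta(minutes=10)):
    """True when the newest row is older than `max_gap` relative to `now`."""
    return now - last_write_at > max_gap


def volume_dropped(recent_count, baseline_count, min_ratio=0.5):
    """True when the recent row count is below `min_ratio` of the baseline.

    A zero baseline gives no basis for comparison, so it never alerts.
    """
    if baseline_count == 0:
        return False
    return recent_count < baseline_count * min_ratio
```

A check like `is_stalled` run against the sensor data table on every node, active or not, would have flagged the missing writes shortly after the 12:54 AM failover rather than at 6:12 AM.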

Post-mortem

The primary database server ran a system update, and processing failed over to the secondary node. All systems continued to process normally except for the storage of messages. We are investigating the reasons for this. The affected data is still available in the queue and can be replayed into the system. System administrators will be working to restore it.

Resolved
Investigating

This issue was opened retrospectively.

3 affected services: