Given how many systems were affected and the speed of recovery, the first thing that comes to mind is some horrible router upgrade fault or BGP catastrophe. That’s capable of affecting reachability across the entire Google cloud, and it’s the sort of thing that is difficult to fix initially but relatively easy to mitigate once you have a handle on it. The problem is that after a major routability issue like that, many systems that weren’t directly affected can still fail, simply because they couldn’t talk to what they needed for an extended period of time, and it can take additional time for Google engineers to restore those systems to normal operation.
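To make that concrete, here’s a minimal sketch (purely illustrative, not how any actual Google service is built) of why a system that was never directly broken can still end up needing a manual restart: a service with a bounded retry budget gives up once its backend has been unreachable for long enough. The backend_reachable() probe, MAX_RETRIES, and BACKOFF_SECONDS are all made-up names for the example.

```python
import time

# Assumed, illustrative values; kept small so the sketch runs quickly.
MAX_RETRIES = 4
BACKOFF_SECONDS = 0.5


def backend_reachable() -> bool:
    """Placeholder for a real reachability check (e.g. an RPC health probe).

    During a routing outage, every probe fails even though the service
    itself is healthy.
    """
    return False


def start_service() -> str:
    for attempt in range(MAX_RETRIES):
        if backend_reachable():
            return "serving"
        # Exponential backoff between probes while the dependency is unreachable.
        time.sleep(BACKOFF_SECONDS * (2 ** attempt))
    # Retry budget exhausted: the service fails closed and stays down until an
    # operator brings it back, even though nothing in it was ever "broken".
    return "needs manual restart"


if __name__ == "__main__":
    print(start_service())
```

Scale that pattern across many dependent services, each with its own retry policy and restart procedure, and recovery times naturally spread out even after the underlying network problem is fixed.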
So different systems that rely upon Google cloud could take differing times to become operational again. Hopefully MCOC’s dependencies are easier to restore.
Comments
This is not that kind of outage. This is a “something went horribly wrong across the board” kind of disruption. There’s nowhere to fail over to.
PROBLEM IS IDENTIFIED AND A FIX IS BEING WORKED ON - WON'T BE READY FOR ALL YET THOUGH
Hours expected