Microsoft provided valuable insight into the behind-the-scenes situation for a global network incident that happened about a year ago. In this article, I try to summarize my understanding of the situation – cause, effect, and my conclusion. In short: the cloud is just someone else’s computer, not a magical place where everything is always perfect. Real people manage real software and hardware. And no matter how skilled the people, how hardened the processes, and how optimized the hardware and software, things will go wrong. The blast radius will hopefully be very small, but sometimes it is large – as was the case for this incident.
This particular incident had more impact than we would normally expect. Failure of a server, a rack, a zone or even a region is expected. Failure of the fabric that ties the regions together, and the regions to the internet, did not seem as likely – to me at least. But even these failures can of course occur, and should be catered for.
As it happened, for this particular incident, zone-redundant and cross-region designs and fail-over strategies did not prevent impact. Services were inaccessible and little could be done but wait for Microsoft staff to resolve the problem. And while they worked on the resolution, we got very little insight into the problem and the path to recovery. It is somebody else’s computer, used by thousands of different parties, and Microsoft cannot (or does not) communicate with all these parties individually. We – and our customers – have to sit and wait until services resume operations.
January 2023 – Network connectivity lost
Wednesday 25th January 2023 saw a major outage on Azure. While most services were running without a problem, WAN issues led to loss of connectivity across Azure regions and between the internet and Azure. This problem made headlines, such as this one on CNN, on Reuters, on TechHQ and dozens of other news sites. Microsoft has communicated about this issue using the Tracking ID VSG1-B90.
What the world experienced:
- 07:08 UTC – first detection of loss of connectivity: network connectivity between Azure Regions and between the public internet and any services running in Azure was impacted (delayed or even completely blocked)
- 12:35 UTC – full recovery of all network routers and fully restored connectivity
What happened behind the scenes (this is my very short summary of the explanation provided by Microsoft in a very revealing video on YouTube regarding this issue):
- an engineer added a new router in Madrid (around 07:08 UTC); the engineer used a script they had customized. The script instructed the new router – and subsequently all routers of a specific type in Azure’s global WAN – to reset themselves (forget everything about their place in the network, their neighbours and the shortest network paths to specific addresses)
- all traffic over the WAN – between regions and between Azure and the outside world – was impacted, more or less all over the world
- routers started to rebuild their knowledge of the world; this took them approximately 1 hour and 40 minutes. After that period, they were functioning as before
- the engineer was not aware of the global negative impact of their change and added a second new router in Madrid at around 07:40 UTC – using the same customized script and causing the same global ripple effect: routers reset themselves, were more or less non-functional for a time and subsequently started to self-heal
- at around 09:25 UTC, the routers were fully recovered and functioning as before
- further impact was experienced even after the routers had recovered. Automated health systems that monitor the network and ensure packets travel across the optimal path had been paused during the outage, and it took some time to restart them and have them fully resume their activities; especially in the India region and around Chicago there was some network packet loss for a time
- at 12:35 UTC, the (WAN) network had fully recovered; services depending on the WAN took some additional time to fully recover after the network had regained a healthy state. It is difficult to establish at which point in time all effects had vanished and everything was back to normal
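To illustrate the kind of state each router had to rebuild: in a link-state network, every router computes the shortest path to every destination from its map of the topology. The sketch below is a toy Dijkstra computation over hypothetical link costs – not Azure’s actual protocol stack – showing the sort of table that was wiped and had to be recomputed network-wide:

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra's algorithm: compute the cost of the best path from
    `source` to every other node. A router in a link-state network
    rebuilds a table like this after losing its view of the topology."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was found already
        for neighbour, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return dist

# Toy topology: made-up link costs between four hypothetical WAN routers.
wan = {
    "madrid":  {"paris": 1, "london": 3},
    "paris":   {"madrid": 1, "london": 1, "dublin": 4},
    "london":  {"madrid": 3, "paris": 1, "dublin": 1},
    "dublin":  {"paris": 4, "london": 1},
}
print(shortest_paths(wan, "madrid"))
# → {'madrid': 0, 'paris': 1, 'london': 2, 'dublin': 3}
```

On Azure’s scale – thousands of devices exchanging routing information while under load – this recomputation across the whole WAN is what took the better part of two hours, not the algorithm on a single box.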
Microsoft stressed that they never expected something like this to happen. It clearly did – and they are now taking steps to prevent it from ever happening again. They also explain in the video that there is a strict way of working within the Azure organization. Changes are performed based on MoPs (Methods of Procedure) – validated, verified, battle-tested ways of working. Any change to or deviation from a MoP has to be peer reviewed, signed off by senior technical management, tested in a simulation environment and rolled out to production using automated mechanisms (presumably with fail-safes, smoke detection and rollback facilities).
Despite this apparently ironclad approach to making changes to the network infrastructure, things went quite horribly wrong. The engineer had made changes to the script, and these changes did not go through the four-step process described by Dave. Although the changes had not been approved, they could still be executed by the engineer in the production environment. I am surprised that the engineer was not stopped from running a script that did not have the right approval status. Because the manual customization had not been tested in anger, its unexpected side effect – bringing down the WAN – was not known ahead of time. It affected only one of the three types of routers used in the Azure network, and that behavior was neither expected nor discovered ahead of time.
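One way such a stop could work – and this is purely my own sketch, not how Azure’s tooling actually behaves – is to gate execution on the exact, approved content of the script, so that any local customization invalidates the approval:

```python
import hashlib

def digest(script_text: str) -> str:
    """Fingerprint the exact script content."""
    return hashlib.sha256(script_text.encode()).hexdigest()

# Hypothetical registry of scripts that completed review and sign-off.
approved_script = "add-router --site madrid --role wan-edge"
APPROVED_DIGESTS = {digest(approved_script)}

def run_change(script_text: str, execute) -> bool:
    """Execute a change script only if it is byte-for-byte identical to an
    approved version; any customization changes the digest and trips the gate."""
    if digest(script_text) not in APPROVED_DIGESTS:
        print("BLOCKED: script does not match any approved version")
        return False
    execute(script_text)
    return True

run_change(approved_script, print)                        # runs: approved as-is
run_change(approved_script + " --skip-checks", print)     # blocked: customized
```

The script name, flags and registry here are all made up; the point is only that an unapproved modification can be detected mechanically, before it touches production.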
Perhaps the most surprising element of this whole story is that, despite the network having major issues – issues that were clearly detected by Dave’s teams – the engineer went ahead and made the exact same disastrous change a second time, more than 30 minutes after making the first one. How could an engineer working on the WAN not be aware of the issues with that same WAN – issues they themselves had caused half an hour before? And why were not all activities on the WAN halted as soon as the network deterioration was noticed? It was a bad day for Microsoft and its customers. Certainly for Dave and his team(s).
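A change freeze of that kind could be as simple as a pre-change health gate – again a hypothetical sketch, with made-up metric names and thresholds – that refuses any new WAN change while the network itself is reporting trouble:

```python
def change_allowed(packet_loss_by_region: dict, loss_threshold: float = 0.01) -> bool:
    """Hypothetical pre-change gate: permit a WAN change only when no region
    reports packet loss above the threshold (i.e. no active incident)."""
    return all(loss <= loss_threshold for loss in packet_loss_by_region.values())

# Situation around 07:40 UTC: connectivity was already badly degraded,
# so the second router addition should have been refused automatically.
metrics = {"europe": 0.42, "us": 0.37, "asia": 0.55}  # fictional loss ratios
print(change_allowed(metrics))  # → False: freeze changes, investigate first
```

Such a gate does not need to diagnose anything; it only needs to notice that now is a bad time to change the network that is currently on fire.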
Kudos for his frank explanation in this video. And I suppose the network engineer must have felt pretty awful afterwards. The fragility of the cloud – the fact that one person’s actions can have such far-reaching impact – is a sobering realization to me. These things are not supposed to happen. Yet they do, and I can only assume they will continue to happen. We have to be prepared. We have to have fallback plans that ensure the systems and services our business requires to be available can continue to run even when Azure is globally on its knees. Because that will happen.
The cloud is someone else’s computer – not an abstract, infallible mechanism. The mistakes you know you could make in your own IT environment can also be made by the cloud provider. People are responsible for the Azure systems. People who are experienced and smart and dedicated – and who make mistakes and do not always have full insight into how systems behave and how changes can have widespread impact. It is understandable that this can and will happen, and we should realize that it will happen again. Therefore we have to prepare for it. Outages like this one will occur from time to time. We need to design our strategy, our cloud environment, our fallback plans and our failover strategies based on that assumption.
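In practice, the failover part of that strategy can start out very basic: keep an ordered list of places a critical service can run – primary region, paired region, and a last-resort location outside the affected cloud – and probe them in order. A minimal sketch, where the endpoint names and the health probe are placeholders I made up:

```python
from typing import Callable, Iterable, Optional

def first_healthy(endpoints: Iterable[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, or None if all fail."""
    for name in endpoints:
        try:
            if probe(name):
                return name
        except Exception:
            continue  # a probe error counts as unhealthy
    return None

# Hypothetical preference order: primary Azure region, its pair, on-prem fallback.
endpoints = ["westeurope", "northeurope", "on-prem"]
health = {"westeurope": False, "northeurope": False, "on-prem": True}  # global Azure outage
print(first_healthy(endpoints, lambda name: health[name]))  # → on-prem
```

The interesting design decision is the last entry in the list: an incident like this one takes out the cross-region fabric itself, so a paired Azure region alone is not enough – the final fallback for truly critical services has to live outside the blast radius.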