On Easter Friday, Amazon's AWS services were hit by a major service disruption in an Availability Zone in North Virginia. It rendered the affected services unusable for almost three days and degraded performance for several more. The status history highlights the impact
of the event. Major web services like Foursquare, HootSuite, Quora, Reddit and many others were affected. Now that the incident has been dealt with and the fault analyzed, there is hope that both Amazon and its customers have learned their lessons from it.
At the end of last week, Amazon released a rather technical account of what had happened and its consequences. In this article we will simplify some of the technical details, so if you want the full description, head over to Amazon's AWS fallout post mortem
. Basically, during a routine network upgrade, parts of the network were wrongly connected, leaving several servers unreachable. Since the underlying architecture is highly interconnected, the affected machines aggressively tried to establish connections to new machines, a load the wrongly connected network was not designed to handle. This triggered a domino effect that further degraded the performance of certain components in the North Virginia availability zone. In short, a human error caused a cascade of unwanted consequences.
Amazon was very upfront in its description, which can be applauded. The company also acknowledged that its infrastructure needs improvements to reduce the chance of further incidents like this one. This will happen on multiple levels. For one, bugs that aggravated the consequences of the outage will be fixed. These fixes are currently in testing and should be deployed in the coming weeks; customers shouldn't notice them at all. Furthermore, Amazon will change the architecture of parts of its cloud infrastructure to limit the impact of similar failures. The company also learned that with more spare capacity installed, the outage wouldn't have been nearly as severe, so it now plans to add storage capacity to be prepared for future worst-case events. Last but not least, Amazon promised to improve communication with customers in the future. As compensation, Amazon offers a ten-day service credit to customers running services in the affected availability zone, even if they weren't affected.
It should be noted that Amazon already offered both the infrastructure and the software capabilities to design cloud applications so that this outage would not have affected them at all, or only slightly. The problem, apart from higher cost, is that it is quite difficult to properly design software that leverages such Multi Availability Zone infrastructure, as Amazon calls it. Amazon now plans to make these technologies easier to use and wants to provide further guidance on how to utilize them. Starting on Monday, May 2nd, Amazon hosts a series of free webinars
to this end. Some of these fault tolerance improvements will even be handled automatically in the near future, provided the customer opts in.
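The core idea behind a Multi Availability Zone design is that no single zone is a hard dependency: when one zone becomes unreachable, traffic is routed to replicas in another. As a minimal illustrative sketch (this is not Amazon's actual API — the replica model, zone names and health flags here are hypothetical), zone-aware failover logic could look like this:

```python
# Hypothetical sketch of zone-aware failover, not an AWS API.
# A client prefers a replica in its own zone, but fails over to
# any healthy replica in another zone when the preferred zone is down.

class Replica:
    def __init__(self, zone, healthy=True):
        self.zone = zone
        self.healthy = healthy

def pick_replica(replicas, preferred_zone):
    """Return a healthy replica, preferring the given zone."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy replicas in any zone")
    for r in healthy:
        if r.zone == preferred_zone:
            return r
    return healthy[0]  # cross-zone failover

replicas = [
    Replica("us-east-1a", healthy=False),  # the disrupted zone
    Replica("us-east-1b"),
    Replica("us-east-1c"),
]
print(pick_replica(replicas, "us-east-1a").zone)  # → us-east-1b
```

The difficulty Amazon alludes to is that real applications must also replicate their state across zones, not just their compute, which is where most of the design effort goes.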
Those now loudly advocating against cloud computing should consider one thing: incidents like this can happen with private infrastructure as well. The main difference is that the public won't hear as much about it as it does with a multi-tenant cloud provider. Those complaining about data loss in particular should consider that only data that couldn't be processed due to the service disruption was lost. Any data already persisted in EC2 or RDS was backed up and safe. Protecting against this kind of loss would require a vastly different infrastructure that backs up in-flight data separately. Most of the time it boils down to a simple cost issue: more protection costs additional money, though in some cases the investment is worth considering.
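The distinction between persisted and in-flight data comes down to when a write is acknowledged. One common pattern for protecting in-flight data — sketched here in a toy form, with an in-memory list standing in for real replicated, durable storage — is to persist a record to a durable log before acknowledging it, so a crash of the processing node leaves the record replayable instead of lost:

```python
# Toy sketch of write-ahead durability; "durable_log" is a stand-in
# for real replicated storage, not an actual AWS service.

durable_log = []   # persisted before acknowledgement
processed = []     # results of later processing

def submit_write(record):
    durable_log.append(record)   # persist first ...
    return "ack"                 # ... then acknowledge

def process_pending():
    while durable_log:
        processed.append(durable_log.pop(0))

submit_write({"id": 1, "value": "order"})
# A crash at this point would leave the record in durable_log,
# where it can be replayed after recovery instead of being lost.
process_pending()
print(processed)  # [{'id': 1, 'value': 'order'}]
```

This is exactly the kind of extra protection the article mentions: it costs additional storage and latency on every write, which is why many systems only buy it for data they truly cannot lose.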
Summing up, lessons on both sides have (hopefully) been learned. On the one hand, Amazon as a cloud provider discovered weaknesses in its infrastructure as well as shortcomings in its software, and can now correct the problems or add safeguards. On the other hand, customers of cloud services learned that even if the Service Level Agreement
(SLA) promises 99.95% availability, that doesn't mean they are safe from failures. For the worst case, you'd better have a disaster recovery plan ready, especially if the services you host are mission critical.
© 2009 - 2013 Bright Side Of News*, All rights reserved.