Millions of Netizens were forced to go outside and get some fresh air for an hour Sunday when Amazon Web Services (AWS) experienced a brief outage, taking down sites such as Instagram and Vine, among others. The downtime affected only the Northern Virginia US-EAST data center, meaning that any Cloud-based company that actually followed Amazon’s own recommendations (as well as the recommendations of any Cloud consultant worth his or her salt, including yours truly) was unaffected.
Why? Because the Cloud is not built to avoid failure. It’s built to work around failure and recover from it automatically. If you followed the recommendations by distributing your instances geographically and implementing a Cloud architecture that delivers at least basic availability, then your service would have remained standing.
Netflix, for one, kept perking along, because Netflix follows this recommendation. In fact, Netflix tests their deployment on a regular basis via their Simian Army – a collection of processes and applications that routinely wreak havoc on their production environment in order to test whether they’ve done things properly.
In this instance, the simian in question is the Chaos Gorilla – an application that takes down an entire AWS Availability Zone supporting the Netflix deployment. What Netflix runs on purpose, Amazon deployed accidentally – or at least, we can presume it was accidental. But maybe they should have taken down a data center on purpose, essentially running their own Chaos Gorilla. How else will AWS customers know they’ve properly architected their Cloud-based apps?
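To make the idea concrete, here is a toy sketch of a Chaos Gorilla-style test. It is not Netflix's tool and it calls no AWS API; it only simulates the logic in memory. The names `Instance`, `deploy`, `chaos_gorilla`, and `is_available` are hypothetical, and the zone labels are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One simulated server; `zone` is an illustrative zone label."""
    zone: str
    healthy: bool = True


def deploy(zones, per_zone):
    """Create `per_zone` simulated instances in each listed zone."""
    return [Instance(zone) for zone in zones for _ in range(per_zone)]


def chaos_gorilla(instances, zone):
    """Mark every healthy instance in `zone` as down; return how many fell."""
    taken_down = 0
    for instance in instances:
        if instance.zone == zone and instance.healthy:
            instance.healthy = False
            taken_down += 1
    return taken_down


def is_available(instances, min_healthy=1):
    """The service counts as up if at least `min_healthy` instances survive."""
    return sum(instance.healthy for instance in instances) >= min_healthy


if __name__ == "__main__":
    # Six instances spread over three zones versus six packed into one zone.
    spread = deploy(["us-east-1a", "us-east-1b", "us-east-1c"], per_zone=2)
    packed = deploy(["us-east-1a"], per_zone=6)

    chaos_gorilla(spread, "us-east-1a")
    chaos_gorilla(packed, "us-east-1a")

    print("spread across zones still up:", is_available(spread))
    print("single zone still up:", is_available(packed))
```

The same six instances survive or vanish depending only on where they were placed, which is the whole argument for distributing them. A real test would also have to confirm that traffic actually reroutes to the surviving instances, something this sketch leaves out.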