If your head wasn't in the clouds last week, you may have missed this story: the Microsoft Azure cloud went down, ostensibly as a result of a hacker attack, but in reality as the side effect of an expired SSL certificate. And if that weren't bad enough, the certificate snafu kept the service down for a full 12 hours.
Bonehead oversight, right? Everybody makes mistakes. Stuff happens. Fair enough. Should you be worried? Here's some food for thought.
- If Microsoft can't get the simple stuff right, can you trust them with the difficult stuff?
- In particular, if Microsoft's own governance infrastructure can't enforce a policy as simple as "renew SSL certificates before they expire," then how well does Microsoft handle governance overall? (Checking for looming expiration is trivially automatable; see the sketch after this list.)
- When Amazon has outages, they provide detailed explanations of what happened, and what they're doing to keep the same thing from happening again. I've poked around a bit, but I can't find such an explanation from Microsoft. We're waiting, Microsoft!
- What were the techs at Microsoft doing for those 12 hours? Did it take them that long to figure out what the problem was? Or did they figure it out, but it took that long to fix the problem? Obtaining and installing an updated SSL certificate shouldn't take more than half an hour.
- What was Microsoft's backup plan? The SSL certificate was a single point of failure. How many other single points of failure does Azure have?
- Not using Azure? That doesn't mean you're immune. Who's to say the same kind of bonehead mistake couldn't happen at your Cloud provider?
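To put the renewal point in perspective, here is a minimal sketch in Python of the kind of automated check that would have caught the problem weeks in advance. The host, the 30-day threshold, and the idea of running this daily against every public endpoint are my assumptions for illustration, not anything Microsoft has published:

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical policy: renew at least 30 days before expiry

def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect to host, fetch its TLS certificate, and return days until it expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # ssl.cert_time_to_seconds parses the certificate's 'notAfter' field.
    # Note: an already-expired cert makes the handshake itself fail,
    # which is also a useful alarm.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    # Placeholder host; a real monitor would walk every public endpoint.
    for host in ["www.example.com"]:
        remaining = days_until_expiry(host)
        status = "OK" if remaining > WARN_DAYS else "RENEW NOW"
        print(f"{host}: certificate expires in {remaining} days [{status}]")
```

That's a one-function job. Whatever governance tooling Azure runs, it apparently wasn't doing even this much.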
The bottom line: Cloud Computing is still a work in progress. It may be amusing to pick on Microsoft, especially when they deserve it. But we could equally well heap our scorn on Cloud Computing in general. Expect and plan for failure -- a best practice for all of IT, but especially in the Cloud.
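What does "plan for failure" look like in practice? On the consumer side, at minimum, it means never hard-wiring a single endpoint. The sketch below shows one common pattern: try a primary, fall back to a secondary, and back off between rounds. The URLs and limits are hypothetical placeholders, not any provider's actual API:

```python
import time
import urllib.request
from urllib.error import URLError

# Hypothetical endpoints: a primary and a standby in another region.
ENDPOINTS = [
    "https://primary.example.com/status",
    "https://secondary.example.com/status",
]

def fetch_with_failover(urls, retries: int = 3, backoff: float = 1.0) -> bytes:
    """Try each endpoint in turn, retrying all of them with exponential backoff."""
    last_error = None
    for attempt in range(retries):
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.read()
            except URLError as err:  # covers TLS failures, timeouts, DNS errors
                last_error = err
        time.sleep(backoff * (2 ** attempt))  # wait before the next round
    raise RuntimeError(f"all endpoints failed: {last_error}")
```

None of this would have kept Azure's certificate from expiring, but it would have kept an application limping along while Microsoft scrambled.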