Learnings from a Cloud Failure

PCQ Bureau

Last month, just as we were going to press, Amazon's EC2 cloud suffered a major outage at its Northern Virginia data center in the US. As a result, many companies, including well-known names such as Reddit, Foursquare, HootSuite, and Quora, faced either downtime or severe latency and connectivity issues. Apparently, the problem began with a failure in the AWS Elastic Block Store (EBS) system, which triggered excessive re-mirroring of EBS volumes. The large number of copies this created exhausted the available capacity and caused the outage. AWS also reportedly did not support automatic failover replication between regions, which would have saved companies using the service from this downtime.

Lots of fingers were pointed at Amazon, and lots of doubts were raised in people's minds about whether to move to the cloud or not. But was it really Amazon's fault? Or did the companies whose sites and apps went down fail to do their homework on ensuring high availability?

Actually, it's both, but we're not here on a fault-finding expedition. Instead, it's time to step back, analyze the situation, and learn from it.

First things first. It's not as if, after this incident, all cloud computing service providers will shut shop and go home, nor will companies stop moving to the cloud. The cloud is too big and important for such incidents to bring it down. If anything, they will further strengthen it.

Two, do your homework and build fault tolerance and availability into your apps so that a catastrophe such as this one doesn't take you down. As I said earlier, the blame for the impact doesn't lie entirely with Amazon. While lots of websites did go down, there were also some, such as Netflix, that emerged unscathed from the failure. They had built their applications for redundancy and availability.
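
To make the idea of building for redundancy concrete, here is a minimal sketch in Python of one common pattern: try a primary region first and fall back to a secondary one when it fails. It is only an illustration, not a description of how Netflix or any other company actually handled the outage. The function names, the exception class, and the region list are all assumptions made for this example, and demo_request is a hypothetical stand-in for a real service call.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when every configured region has failed (hypothetical helper)."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each region in order and return the first successful result."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return request(region)
        except Exception as exc:  # a real app would catch narrower error types
            errors[region] = exc  # remember why this region failed, then move on
    raise AllRegionsFailed(errors)


def demo_request(region: str) -> str:
    # Hypothetical stand-in for a real call; simulates an outage in one region.
    if region == "us-east-1":
        raise ConnectionError("region unavailable")
    return f"served from {region}"


# The region names here are illustrative only.
print(call_with_failover(["us-east-1", "us-west-1"], demo_request))
```

A sketch like this only helps if the data the application needs is also present in the second region, which is exactly the kind of homework the outage exposed.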

Three, the advantages of moving to the cloud, such as cost savings and scalability, are too many to ignore. Would you have been able to build your own IT infrastructure that's as scalable, at the same cost you're paying the cloud service provider? Probably not, else you wouldn't be choosing a cloud service provider in the first place.

Four, read the fine print in your cloud service provider's offerings more carefully. Find out what measures they're taking to ensure the availability of your apps. Are they providing you with the relevant management tools to do the job, or with sufficient redundant infrastructure to ensure availability? What level of control do you have over how your applications are hosted and how they work on the cloud? If these things are not made clear, you'll end up playing 'blame games' with your cloud service provider whenever a failure occurs.

Lastly, you can't completely rely on technology to be always available. It's not as if you haven't faced crashes in your own IT infrastructure, and it's not as if service providers will promise 100% freedom from crashes. They will happen, so you just have to be prepared for them and move on.
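
A little arithmetic shows why anything short of 100% matters. The Python sketch below converts an availability percentage into expected downtime per year, and estimates the combined availability of several redundant copies. The figures are illustrative and are not any provider's actual commitment. The second function also assumes the copies fail independently of each other, which is optimistic: copies sitting in the same data center can all go down together.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for an availability between 0 and 1."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, replicas: int) -> float:
    """Availability of `replicas` copies when any one healthy copy is enough.

    Assumes the copies fail independently, which real outages often violate.
    """
    return 1.0 - (1.0 - availability) ** replicas


# Illustrative figures only.
for a in (0.99, 0.999, 0.9999):
    print(f"{a:.2%} available -> about {downtime_hours_per_year(a):.2f} hours down per year")

print(f"two independent 99% copies -> {combined_availability(0.99, 2):.2%} available")
```

Even a service that is up 99.9% of the time can be down for several hours a year, so the plan for what happens during those hours has to be yours.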
