Monday, September 2, 2013

Cloud infrastructure risks

More and more online businesses are turning to cloud infrastructure to take advantage of fast scalability and cost effectiveness. However, it is not all a "cloud 9" experience. Last week we had another reminder of how perilous the cloud computing concept really is. This time it was Amazon Web Services (AWS), one of the world's largest cloud providers, which went down for 59 minutes due to issues with one of its US datacentres. The outage affected not only Amazon's own Amazon.com e-commerce service (which reportedly lost $1,100 in net sales for every second of the outage) but also a number of high-profile online businesses, including Vine, Instagram, Airbnb and Flipboard, to name just a few.
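To put the reported figure in perspective, here is a rough back-of-the-envelope calculation. It assumes the reported $1,100 per second applied evenly across the whole 59 minutes, which is a simplification, and the names used are purely illustrative.

```python
# Rough estimate of net sales lost during the outage.
# Assumption: the reported per-second figure holds constant for the full outage.
LOSS_PER_SECOND_USD = 1_100   # reported figure
OUTAGE_MINUTES = 59           # reported outage length


def estimated_loss(loss_per_second: float, outage_minutes: float) -> float:
    """Return the estimated loss in USD for an outage of the given length."""
    return loss_per_second * outage_minutes * 60


if __name__ == "__main__":
    print(f"${estimated_loss(LOSS_PER_SECOND_USD, OUTAGE_MINUTES):,.0f}")
```

Under that assumption the total comes to roughly $3.9 million for less than an hour of downtime.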

It was not an isolated case for Amazon, and other cloud infrastructure providers are not faring much better: Google's five-minute outage a week earlier temporarily reduced traffic on the Internet by almost 40%. We are all annoyed by website outages, whether as end users or as owners of those websites, but the reality is that it is simply impossible to guarantee 100% online availability because so many parties are involved in the "delivery chain".

A failure can occur in any part of that chain: hard disks can fail, updates of site code can go wrong, entire server farms can become unresponsive, power can fail, an ISP can get disconnected, a DNS registry may get corrupted, bandwidth may get clogged at the user end, and now you also have to factor in that cloud infrastructure may fail as well (affecting either the entire site hosted on that cloud or only some services provided from it, e.g. map tiles for an online mapping app). The reason for failure could be people, software or hardware, or a combination of all three...
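The effect of a long delivery chain can be sketched with a little arithmetic. If every link has to work for the site to be reachable, and failures are assumed to be independent, the overall availability is the product of the individual availabilities. The figures below are hypothetical and chosen only for illustration; they are not measurements of any real provider.

```python
from math import prod

# Hypothetical availability of each link in the delivery chain (illustrative only).
chain = {
    "cloud hosting": 0.9995,
    "power": 0.9999,
    "ISP": 0.999,
    "DNS": 0.9999,
    "site code deployments": 0.999,
}

HOURS_PER_YEAR = 24 * 365


def chain_availability(availabilities):
    """Availability of a serial chain, assuming failures are independent."""
    return prod(availabilities)


if __name__ == "__main__":
    overall = chain_availability(chain.values())
    downtime_hours = (1 - overall) * HOURS_PER_YEAR
    print(f"overall availability: {overall:.4%}")
    print(f"expected downtime: {downtime_hours:.1f} hours per year")
```

Whatever numbers you plug in, the overall figure is always lower than that of the weakest link, which is why adding more parties to the chain makes a 100% guarantee ever less realistic.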

There are ways of minimising each of the above risks but, as an operator of a website, your options for eliminating them entirely are rather limited. Dealing with cloud infrastructure risk is no different. You either accept that outages will happen (and rely on the service provider to fix the problem in a reasonable time) or you deploy your site across two cloud infrastructures, which significantly increases the complexity of your website structure and comes at a significant cost of its own.
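The second option can be sketched in a few lines, at least in its simplest form: try the primary provider and fall back to the next one if it fails. This is only a minimal illustration, not a production design. The two provider functions are hypothetical stand-ins for real requests, and the primary is hard-coded to fail so the fallback path is visible.

```python
from typing import Callable, Sequence


def fetch_with_failover(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful response."""
    errors = []
    for fetch in providers:
        try:
            return fetch()
        except Exception as exc:  # real code would catch narrower network errors
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


# Hypothetical stand-ins for requests to two separate cloud providers.
def primary_cloud() -> str:
    raise ConnectionError("primary cloud is down")


def secondary_cloud() -> str:
    return "map tile from secondary cloud"


if __name__ == "__main__":
    print(fetch_with_failover([primary_cloud, secondary_cloud]))
```

The sketch leaves out the expensive parts: keeping both deployments in sync, health checks, and switching traffic back once the primary recovers. That is where most of the added complexity and cost comes from.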

The Internet is full of dependencies, so the most pragmatic approach is to aim at delivering a minimum acceptable service to your clients. An outage of one to three hours may be well within their tolerance if you clearly articulate such a possibility in your service delivery policy. Delivering anything more would just be a waste of resources for no real benefit to you or your users.