A Black Swan in the Cloud

Originally posted on April 28 2011

 I know I’m a bit late but, as usual, I prefer to think before writing.


I’m pretty sure you all know that Amazon’s Elastic Compute Cloud (EC2) service suffered a rather severe outage on April 21st. A number of popular sites were affected.
This event, obviously, raised a few eyebrows about the whole subject of cloud and SaaS.

I have always been lukewarm on both, but I still think that both are here to stay, and the Amazon failure does not change the outlook much.
I’ve also seen interesting articles arguing that the cloud should be managed like any other resource, and I generally agree. Let me just add a few points.

The impact of a failure can be mitigated by building a secure, fail-safe architecture on top of the cloud, using multiple datacenters or multiple providers. This is a viable option, but it adds a whole new dimension to the software. It creates complexity and costs that were supposed to be offloaded to the cloud/SaaS provider, thus partially undermining the overall economic advantage.

There is no SLA that can fully protect you in case of failure. Every system is doomed to fail; the real question is when. The damage a failure could do to your business is hard to quantify. A mission-critical system that is down for a long time (how long is “long” depends on the business) can literally kill your reputation and damage your bottom line for years to come. I haven’t yet seen an SLA that could cover such damage.
Even when adequate refunds are built into the SLA, are you sure the provider could actually pay them after a bad failure?

When your mission-critical cloud or SaaS system is down, you can’t do anything but wait for someone else to fix your issue. If you had your own physical machines in a datacenter, you could at least closely follow the people in charge of fixing the issue, get status reports and get ready to go back to work. In case of a partial recovery, you could make sure that what matters most to you is restored first. In case of a long outage, you could restore a database backup on a spare machine and pull the data you need to keep working manually. If you have physical access to your machines, you can always do something more than wait.

What I mean is that SaaS and cloud have clear advantages over on-premises systems when everything goes well or when there are only minor problems. When a major issue arises, the risks are much higher. Major issues are very rare, granted, but so are black swans.




  • May 27 2011, 5:54 AM

    DavidWLocke (Twitter) responded:

    A lot of web developers don't care whether there are outages or not. They don't go to the trouble of preventing outages. They won't do ITIL. Similarly, many non-web dev IT shops won't do the geo-mirroring that would let them always be up. Instead, they talk about recovery and business continuity. This latter issue does reach beyond SLAs. Nobody is slicing through the pancake stack that we call the cloud to realize Codd's write-once edict, which would mean that all the companies doing business with me would share my one and only address record. The real, cost-minimized cloud is a long way off, beyond server virtualization.