Google, LinkedIn, and Microsoft prove no cloud is too big to fail

Crashes are inevitable in the cloud. The trick to a successful cloud strategy is to design for the impending failure.

Nothing is certain in this world except death and taxes -- and lots of cloud outages.

The week started badly for Google Drive users as the cloud-based service for storing documents, videos, and Google Docs was down for several hours Monday. Users vented their frustration on social networks, with one user tweeting, "Google Drive now back up like a limp horse struggling to move." Google moved quickly to acknowledge the issue on its Apps Status Dashboard but did not say what caused the problem or how many users were affected.

On Wednesday, it was LinkedIn's turn to roll over and play dead. Users hoping to reach out and connect via the social media network were greeted with a 503 "Service Unavailable" error message. The website suffered intermittent service disruptions throughout the morning: It was knocked out for about 45 minutes, came back up for 30 minutes, then went down a second time. To date, there's been no word from LinkedIn on why the outage occurred.

Last week Microsoft's Hotmail and Outlook.com vanished from the cloud, leaving users of the online suite of mail, calendar, and storage services unable to access their accounts. It was 16 hours before service was fully restored. A routine server firmware update gone wrong was the culprit, and Microsoft apologized to users on its Outlook blog.

If the events of the past fortnight teach us anything, it's that failure is inevitable when it comes to the cloud. Blaming the cloud, however, misses the point. After an outage on Amazon Web Services last year, InfoWorld's Matt Prigge wrote:

[The AWS outage] is akin to a very serious, yet recoverable failure in a core infrastructure component in an on-premise data center -- and I've seen it happen more times than I can count. If you operate a mission-critical infrastructure that can't tolerate downtime, you probably have measures in place to protect your operations from extended outages were they to occur -- a backup data center in another building or site, for example. If you haven't invested in building that kind of redundancy, your organization has essentially decided the risk of downtime isn't worth the time and money it would take to avoid that threat. The exact same is true of the public cloud.

More and more businesses are moving to the cloud -- the total public cloud services market was $91 billion in 2011 and is projected to grow to $207 billion in 2016, according to Gartner. Since no cloud-based service offers a 100 percent uptime guarantee, the trick to a successful cloud strategy, experts say, is to design for failure.

Part of that preparation is knowing what an outage will cost your business and using that cost to create effective SLAs or to determine how much to invest in cloud services. Amazon.com, for example, had a 49-minute outage on Jan. 31 that cost more than $4 million in lost sales. InfoWorld's David Linthicum writes:

Keep in mind you'll have to deal with outages no matter if your servers are on a public cloud provider or in a local data center. ...[Y]ou should approach the use of cloud-based platforms -- or any new platforms -- with a clear understanding as to how much outages will cost the business. Use this figure to determine the cost of risk, as well as the proper amount to spend on failover services, whether in the cloud or on premises.
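
Linthicum's advice can be turned into rough arithmetic. The Python sketch below is illustrative only: It treats the Amazon.com figures cited above as a lower bound on per-minute cost, and the 99.95 percent availability target and the function names are assumptions made for the example, not terms from any provider's SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def cost_per_minute(total_loss: float, outage_minutes: float) -> float:
    """Average revenue lost per minute of downtime."""
    return total_loss / outage_minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime that a given availability level still permits each year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def expected_annual_cost(per_minute: float, availability: float) -> float:
    """Expected yearly outage cost if downtime matches the availability level."""
    return per_minute * downtime_minutes_per_year(availability)


# Figures from the article: "more than $4 million" lost over 49 minutes.
amazon_per_minute = cost_per_minute(4_000_000, 49)

# Hypothetical availability target, chosen only for illustration.
ASSUMED_AVAILABILITY = 0.9995
annual_risk = expected_annual_cost(amazon_per_minute, ASSUMED_AVAILABILITY)

print(f"Cost per minute of downtime: ${amazon_per_minute:,.0f}")
print(f"Downtime allowed per year: {downtime_minutes_per_year(ASSUMED_AVAILABILITY):.1f} minutes")
print(f"Expected annual outage cost: ${annual_risk:,.0f}")
```

By this estimate, the Amazon.com outage cost at least roughly $81,600 a minute. Setting the annual figure against the price of failover capacity gives a first approximation of how much redundancy is worth buying.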

In an article for Network World, Apurva Dave, vice president of products and marketing at Riverbed Stingray, writes that "[o]rganizations that take failure into account build a robust and dynamic infrastructure that can withstand any cloud failure." He goes on to outline three ways to avoid the impact of a public cloud provider's next outage:

Balance across availability zones (AZs). Large public cloud providers' data centers are built across AZs and regions. If your application instances run in separate AZs, users can be redirected in real time to another zone when one goes down.
Balance across multiple cloud providers. Instead of using only Amazon Web Services, combine AWS with Joyent, Azure, Rackspace, or another provider, and divert traffic to an available cloud in the event of a failure.
Add another cloud to the mix. A private cloud serves as a safety net in the event of a public cloud outage.
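
All three approaches rest on the same mechanism: Keep more than one place to send traffic, and switch when the preferred one stops answering. The Python sketch below shows that selection step in isolation. The endpoint names and the `is_healthy` callback are hypothetical stand-ins; a production setup would more likely rely on a load balancer or DNS failover than on application code.

```python
from typing import Callable, Iterable, Optional


def pick_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, in priority order.

    A health check that raises is treated the same as one that returns False.
    Returns None when every endpoint is down.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            continue  # An unreachable endpoint counts as unhealthy.
    return None


# Hypothetical priority list: two zones at one provider, a second
# provider, then a private cloud as the last resort.
endpoints = ["provider-a-zone-1", "provider-a-zone-2", "provider-b", "private-cloud"]

# Simulate a provider-wide outage that takes out both zones at provider A.
down = {"provider-a-zone-1", "provider-a-zone-2"}
chosen = pick_endpoint(endpoints, lambda name: name not in down)

print(chosen)  # provider-b
```

Ordering the list from cheapest to most expensive failover target keeps traffic on the primary zone until it actually fails.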

Cloud outages will happen. Is your business prepared to fail?

This article, "Google, LinkedIn, and Microsoft prove no cloud is too big to fail," was originally published at InfoWorld.com.

Caroline Craig is East Coast site editor for InfoWorld.