Are Cloud Failures a Possibility?

Once in the Cloud, Your Responsibilities are not Abdicated

On February 28 and 29, 2012, had you peered into a crystal ball, I am sure you would have seen the CIOs, CTOs, and IT professionals of organizations that use Windows Azure “laying hands” on a network connection while saying a prayer that went something like this:

“Heavenly Father, we humbly beseech your divine intervention, from your cloud to ours. Lay your healing hands on our application and restore our cloud…”

“Let us pray: Our Application, who resides in Azure, offline be thy shame. Online become, or my job be done, in the cloud as it is on-premise. Give us our compute, and the cloud reboot, and forgive us our complacency as we forgive Microsoft for 24 hours latency. And lead us not into leap year calculations, but deliver us from application unavailability. For the cloud must be online, all of the time. With application availability there, Microsoft and Azure will share, the power and glory for ever and ever. Amen.”


I am sure a lot of CIOs were sweating bullets during this outage. After all, when the CEO comes into your office and wants to know why your systems are offline, blaming Microsoft, Google, or Amazon isn’t an option. To be fair, Microsoft is not the first cloud provider to suffer an outage. While this definitely should not happen, the reality is that outages and glitches do happen. That does not nullify the advantages of the cloud. The crux of this issue was a control program that had not taken the leap year into account. Bottom line: human error.
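To make the leap year point concrete, here is a minimal sketch of this class of bug in Python. It is a hypothetical illustration only: the function names are mine, and nothing here is taken from Microsoft's actual control program. It simply shows how "same date, next year" arithmetic breaks on February 29, and one possible way to guard against it.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: assumes the same month and day
    # always exist in the following year.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: if the date does not exist next year
    # (February 29 in a non-leap year), fall back to February 28.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_one_year_later(leap_day)
except ValueError as err:
    # February 29, 2013 does not exist, so the naive version fails.
    print("naive version failed:", err)

print("safe version:", safe_one_year_later(leap_day))
```

The naive helper works on 1,460 days out of every 1,461, which is exactly why this kind of error can sit unnoticed until the one day it matters.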
 

Driving the Conversation

As a CIO or IT professional, you need to drive the conversation with your CEO around the cloud. I am sure you have already positioned the benefits of the cloud, but you also need to make sure your CEO does not get the impression that the cloud is some kind of silver bullet to kill all your IT daemons. You should NOT give the impression that the cloud is perfect. While the cloud may be much more reliable than your server sitting in a closet or at a corporate datacenter, it CAN FAIL! If you position the cloud realistically with your CEO, then when the cloud “goes down”, your CEO will not FREAK OUT!

You need to plan for failure and make sure you and your CEO understand what the escalation path and strategy will be if the cloud goes down. Your plan should not include “laying hands and praying on your internet cable”.  Communication is the key.
 

Communicate, Communicate, Communicate

Communication, I believe, is the biggest lesson Microsoft can learn from this outage. As a Microsoft Partner, we found out on the night of the 28th that Windows Azure was down. We found this out because of the consistent monitoring that we do for our clients; however, it was not until the next morning that we realized how widespread the outage was. The biggest complaint I have seen in comments on the web was the lack of detailed information Microsoft was putting out. Microsoft did push out regular updates, but the information was sparse, with little to no detail about what exactly they were doing and when customers might expect things back online. “Shortly” is not a good timeline. In a crisis, information is often sparse and everyone is afraid to put out wrong information. This is especially true when you are a large, visible company like Microsoft and everybody is ready to throw you under the bus. In the absence of communication, people think the worst, and tempers get hot. This is especially true when money and jobs are on the line.

My take may not be conventional public relations wisdom, but it is important to always deliberately communicate meaningful information. Take the arguments and issues off the table by delivering meaningful information during a crisis. This is probably not just a good axiom for a crisis, but an everyday practice in general. President Grover Cleveland is probably best known for being elected to two nonconsecutive terms. During his first presidential campaign in 1884, a Buffalo newspaper released a story titled “A Terrible Tale” that divulged the fact that Cleveland had fathered a child out of wedlock ten years prior. Not a big deal by today’s standards, but back in 1884 this was a scandal of great magnitude. One of Cleveland’s supporters sent him a telegram asking what to do. He replied, “Tell the Truth”. This resonated with people, and Grover Cleveland went on to be elected President of the United States in 1884 and again in 1892.

Tell the Truth! If you don’t know, say so! I think it is far better to say that you do not know and to regularly communicate what actions you are taking than to leave people to their imaginations. Likewise with your CEO, customers, and vendors: you need to show your steps and give them updates as you get them.

 

On-Premise or Off-Premise, It’s Your Architecture

I am sure in the coming days the cynics will dog pile on Microsoft and bash Windows Azure. However, at the core, our application and our infrastructure, whether on-premise or in the cloud, are ultimately our responsibility. Simply moving to the cloud does not absolve us of that responsibility. The biggest benefits of the cloud are, at their core, elastic scalability, no commitment, and paying only for what you use. While reliability and higher availability are benefits, you are ultimately responsible for your cloud strategy and architecture as an organization.

Applications need to be designed to run in the cloud. If you will be running a critical application at a cloud provider, you need to plan for an outage, design for redundancy, and expect to pay for that redundancy. I have spoken with organizations that actually believe that because they put their application on a single instance of Windows Azure they have high availability. Let’s be real here. The Microsoft SLA guarantee requires a minimum of two Azure Instances. To quote the Windows Azure Web Site: “...when you deploy two or more role instances in different fault and upgrade domains your Internet facing roles will have external connectivity at least 99.95% of the time.” That being said, if you want high availability and backup reliability you need to implement dual cloud instances across multiple geographic locations.
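To put the quoted 99.95% figure in perspective, the back-of-the-envelope arithmetic below converts an availability percentage into allowed downtime and estimates what a second location could add. It is a rough sketch, not an SLA calculation: the function names are mine, and the combined figure assumes the two locations fail independently, which is an idealization. A software bug shared by every location would break that assumption.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    # Hours of downtime that still meet the given availability level.
    return (1.0 - availability) * period_hours


def combined_availability(availability: float, locations: int) -> float:
    # Probability that at least one location is up, ASSUMING failures
    # are independent (an idealized, best-case assumption).
    return 1.0 - (1.0 - availability) ** locations


sla = 0.9995  # the 99.95% figure quoted above

print(f"Downtime allowed per year: {allowed_downtime_hours(sla):.2f} hours")
print(f"Two independent locations: {combined_availability(sla, 2):.6%}")
```

At 99.95%, the provider can be unreachable for roughly 4.4 hours a year and still be within the guarantee, which is why critical applications should plan for the outage rather than assume it away.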
 

Lessons Learned

Those who do not learn from history are doomed to repeat it! As IT professionals, we need to prepare our CEOs so we are not put on the hot seat when the inevitable failure happens. We also need a plan for how to communicate, both inside and outside our organization, what our strategy and plan of action are when a failure occurs.

Biggest lessons taken from Microsoft's cloud outage:

  • Nobody’s cloud is infallible.
  • Glitches and failures will happen, whether in the cloud or on-premise.
  • Plan for Failure.
  • Communicate, Communicate, Communicate.
  • Moving to the cloud does not absolve you of responsibility for your organization’s cloud strategy and topology.
  • Design for backup reliability by implementing dual cloud instances across multiple geographic locations.
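As one way to picture the last lesson, the sketch below picks the first healthy deployment from a list of locations. Everything in it is an assumption made for illustration: the endpoint URLs are placeholders, `fake_probe` stands in for a real health check, and real failover usually happens at the DNS or load balancer level rather than in application code.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    # Return the first endpoint whose health check passes, or None
    # if every location is down.
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated as a failed health check.
            continue
    return None


# Hypothetical deployments of the same application in two locations.
ENDPOINTS = ["https://app-us.example.com", "https://app-eu.example.com"]


def fake_probe(endpoint: str) -> bool:
    # Stand-in for a real health check; pretends the US location is down.
    return "app-us" not in endpoint


print(pick_healthy_endpoint(ENDPOINTS, fake_probe))
```

When the function returns None, every location is down, and that is the moment your escalation and communication plan takes over.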

Understand the strengths and weaknesses of a cloud solution. While it was highly publicized and definitely had an economic impact on those affected, this incident in no way diminishes the benefits or viability of the Microsoft Cloud. Windows Azure and Microsoft’s Cloud remain among the best, most advanced, most reliable, and most fault-tolerant cloud solutions on the planet. Despite a cloudy day, there are still blue skies with Windows Azure!
 
*Note: No beheadings or killings were perpetrated as a result of this religious analogy.

Author

Vincent W. Mayfield, Chief Executive Officer