Statistically, most data center outages are attributed to human error, according to data collected by the Uptime Institute. Some of this year’s most high-profile outages, however, were caused by failure of the automated failover mechanisms, either on the power infrastructure or the network side.
Regardless of the root cause, data center outages are expensive. The Ponemon Institute calculated that an outage can cost a company about US$1.02m. The institute surveyed downtime costs at 41 US data centers and released the results in May.
Here are some of the loudest downtime incidents from the last year:
Global outages for RIM
In October, Research In Motion (RIM), maker of the BlackBerry smartphones, experienced an outage of its infrastructure. While it first affected customers in Europe, the Middle East and Africa, the effects spread to the Americas the following day.
The company traced the issue to failure of a core switch at one of its data centers. “Although the system is designed to failover to a back-up switch, the failover did not function as previously tested,” RIM representatives said. “As a result, a large backlog of data was generated and we are now working to clear that backlog and restore normal service as quickly as possible.”
Issues persisted over three days, causing intermittent service delays for many customers. To compensate them, RIM offered free downloads of some BlackBerry App World premium applications, for which customers usually have to pay.
Nature sinks Google’s cloud
Google’s App Engine Datastore services went down in August. The company traced the cause to a thunderstorm that interrupted utility power to a Google data center in the American Midwest. In this case, the automatic-failover mechanism for switching to generator power failed to do its job.
Google’s Ikai Lan wrote in an email to App Engine customers: “Power distribution equipment in the data center failed in the wake of the loss of utility power, which powered off a subset of the machines in the data center.”
The outage caused loss of a portion of the compute and storage capacity supported by the data center, leading to high latency, server errors and even total downtime for App Engine master-slave Datastore applications.
Google did not specify why the data center’s electrical systems failed to switch to generators when it lost utility power, or how long the facility remained without power.
The App Engine team performed an emergency failover at the application level, migrating affected applications to a back-up data center. As a result, some applications appeared to “jump backwards in time” as they came back up. This happens because data written to the primary data center during the period immediately preceding the outage does not get migrated.
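The “jump backwards in time” effect can be illustrated with a minimal sketch. Google did not describe its replication mechanism in detail, so this assumes a simple asynchronous scheme in which the primary acknowledges a write before copying it to the back-up. All class and function names here are hypothetical and are not part of any App Engine API.

```python
from dataclasses import dataclass, field


@dataclass
class Datastore:
    """Toy key-value store standing in for one data center (hypothetical)."""
    data: dict = field(default_factory=dict)


@dataclass
class AsyncReplicatedStore:
    """Assumed model: the primary acknowledges writes at once and the
    back-up only receives them when replicate() runs later."""
    primary: Datastore = field(default_factory=Datastore)
    backup: Datastore = field(default_factory=Datastore)
    pending: list = field(default_factory=list)  # writes not yet copied

    def write(self, key, value):
        # The write lands on the primary and is queued for the back-up.
        self.primary.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Copy every queued write to the back-up, then empty the queue.
        for key, value in self.pending:
            self.backup.data[key] = value
        self.pending.clear()

    def fail_over(self):
        # The primary is gone: queued writes never reached the back-up.
        lost = list(self.pending)
        self.pending.clear()
        # Promote the back-up and start with a fresh, empty back-up.
        self.primary, self.backup = self.backup, Datastore()
        return lost

    def read(self, key):
        return self.primary.data.get(key)


def demo():
    store = AsyncReplicatedStore()
    store.write("counter", 1)
    store.replicate()            # the back-up now holds counter=1
    store.write("counter", 2)    # acknowledged, but not yet replicated
    before = store.read("counter")
    lost = store.fail_over()     # power fails before the next replicate()
    after = store.read("counter")
    return before, after, lost
```

In this sketch, an application that read the value 2 just before the outage sees 1 after the failover, because the last write existed only on the primary.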
Amazon’s Dublin fiasco
The utility supplying power to an Amazon data center in Dublin first blamed stormy weather for a power outage at the facility, saying a lightning strike had taken out a 10MW transformer, but later retracted that diagnosis. According to Amazon, the facility failed to switch to back-up generators after it lost utility power. The Amazon Web Services (AWS) team said it believed the data center’s programmable logic controllers (PLCs), which synchronize electrical phases between generators, were to blame.
A PLC at the facility detected a ground fault and failed to complete its task. As a result, most of the data center’s back-up generators were disabled and the facility lost power.
The outage affected Amazon’s Infrastructure-as-a-Service businesses, Elastic Compute Cloud (cloud servers) and Elastic Block Store (cloud storage), and its cloud database service called Relational Database Service. Cloud instances of these three services hosted in Dublin felt most of the effect.
Amazon said nearly all EC2 instances and about 60% of EBS volumes in the affected availability zone went down. Networking gear connecting the zone to the Internet and to other availability zones in the region also went down, causing connectivity problems that resulted in API errors for customers. The AWS team said it would make a number of changes to the data center to prevent such issues from recurring. The changes included adding redundancy and more isolation for the PLCs to insulate them from failures.
Telecity suffers in Docklands
A power outage at Telecity’s Meridian Gate data center in London’s Docklands in July caused disruptions to companies colocating there and to their customers. The provider traced the outage to a “fault on a breaker in the power distribution system”, according to a note it sent to one of the affected customers, network connectivity provider C4L. The network provider said all its customers were kicked offline and its 10G ring was broken by the outage. Power to the facility was restored within 20 minutes.
In an emailed statement, Telecity said: “We have resolved a power outage that affected our Meridian Gate data center earlier today. Our engineers responded quickly and restored power to the facility in around 20 minutes. We kept all affected customers informed throughout the process and have apologized for the disruption.”
In addition to C4L, affected customers included online marketing service Initial Rewards, email marketing company Easy Inbox, and online multimedia agency BlueLevel, among many others.