When The Power Goes Out at Google

Wednesday, March 24th, 2010

By: Rich Miller

What happens when the power goes out at a Google data center? We found out on Feb. 24, when a power outage at a Google facility caused more than two hours of downtime for Google App Engine, the company’s cloud computing platform for developers. Last week the company released a detailed incident report on the outage, which underscored the critical importance of good documentation, even in huge data center networks with failover capacity.

Most of Google’s recent high-profile outages have been caused by routing or network capacity problems, including outages in May and September of last year (see How Google Routes Around Outages for more). But not so with the Feb. 24 event.

“The underlying cause of the outage was a power failure in our primary datacenter,” Google reported. “While the Google App Engine infrastructure is designed to quickly recover from these sort of failures, this type of rare problem, combined with internal procedural issues extended the time required to restore the service.”

Power Down for 30 Minutes

Data center power outages typically fall into two categories: those in which the entire data center loses power for an extended period, and those in which power is restored relatively quickly but hardware within the data center has trouble restarting properly. The Google App Engine downtime appears to fall into the latter category. Power to the primary data center was restored within half an hour, but a key group of servers failed to restart properly. This somewhat unusual recovery pattern presented the first challenge.