Gmail blackout caused by engineering error during routine upgrades to web servers

News by Dan Raywood

Google's 150 million Gmail users were unable to access their accounts for two hours after an error during 'routine upgrades' to the company's web servers.

Google initially claimed that the issue affected only 'a small subset of users' and that 'service had already been restored for some users' less than 20 minutes later.

David Besbris, engineering director at Google, claimed that the issue was fixed and that Google was 'still investigating the root cause of this outage, and we'll share more information soon'. He also said that minor issues were not normally discussed on the official blog but 'because this is impacting so many of you, we wanted to let you know we're currently looking into the issue and hope to have more info to share here shortly'.

Ben Treynor, VP engineering and site reliability specialist at Google, further explained that Google had already thoroughly investigated what happened, and was currently compiling a list of issues it intends to fix or improve as a result of the investigation.

He said: “Gmail's web interface had a widespread outage earlier today, lasting about 100 minutes. We know how many people rely on Gmail for personal and professional communications, and we take it very seriously when there's a problem with the service. Thus, right up front, I'd like to apologise to all of you: today's outage was a big deal, and we're treating it as such.”

He added: “Yesterday morning (Pacific Time) a small fraction of Gmail's servers were taken offline to perform routine upgrades, but we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers, servers which direct web queries to the appropriate Gmail server for response.

“At about 12:30pm PT a few of the request routers became overloaded and in effect told the rest of the system 'stop sending us traffic, we're too slow!' This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.

“As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.”

After the Gmail engineering team was alerted to the failures and established that the core problem was insufficient available capacity, the team brought additional request routers online, distributed the traffic across the request routers, and the Gmail web interface came back online.
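The cascade Treynor describes, and the effect of adding capacity, can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of Google's actual systems: the function name `surviving_routers`, the capacity figures and the rule that load is split evenly across the routers still in service are all invented for illustration.

```python
def surviving_routers(capacities, total_load):
    """Count the routers still serving once the load has settled.

    Hypothetical model: total_load is split evenly across the routers
    that are still in service. Any router whose capacity is below its
    share drops out, and the load is then re-split among the rest.
    """
    healthy = list(capacities)
    while healthy:
        share = total_load / len(healthy)
        still_ok = [c for c in healthy if c >= share]
        if len(still_ok) == len(healthy):
            break  # nobody dropped out this round, so the system is stable
        healthy = still_ok  # the dropouts push their load onto the rest
    return len(healthy)


# Invented capacities (requests per second) for ten request routers.
routers = [90, 95, 100, 100, 105, 110, 110, 120, 120, 130]

print(surviving_routers(routers, 850))   # 10: every router copes
print(surviving_routers(routers, 1000))  # 0: two drop out, then nearly all, then all

# Bringing five more routers online gives the same load enough headroom.
print(surviving_routers(routers + [120] * 5, 1000))  # 15
```

In the second case only two routers are overloaded at first, but each dropout raises the share carried by the others, so a small shortfall ends up taking out the whole pool.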

Treynor said that Google has turned its full attention to helping ensure this kind of event does not happen again by increasing request router capacity well beyond peak demand to provide headroom. The company has also concluded that the request routers do not have sufficient failure isolation.

