Facebook suffered a major outage last night which left some of its 500 million users unable to access the site.
Its director of software engineering, Robert Johnson, said that Facebook was down or unreachable for approximately two and a half hours, calling it "the worst outage we've had in over four years".
He said that the main problem was "an unfortunate handling of an error condition" by an automated system intended to check the cache for invalid configuration values and replace them with updated values from the persistent store.
He said: “This works well for a transient problem with the cache, but it doesn't work when the persistent store is invalid. Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
“To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, those failures triggered even more requests to the databases themselves. We had entered a feedback loop that didn't allow the databases to recover.”
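The dynamic Johnson describes can be sketched with a toy simulation: every query the overloaded cluster fails to serve deletes a cache key, which immediately generates a retry on the next lookup. The capacity and traffic figures below are purely illustrative assumptions, not Facebook's real numbers, and the model greatly simplifies the actual system.

```python
# Toy model of the feedback loop described above.
# CAPACITY and the traffic figures are hypothetical, not Facebook's.

CAPACITY = 100_000  # queries per second the database cluster can serve


def simulate(initial_queries, new_traffic, seconds):
    """Return the pending query load after `seconds`, assuming every
    unserved query deletes its cache key and is therefore retried."""
    pending = initial_queries
    for _ in range(seconds):
        served = min(pending, CAPACITY)
        failed = pending - served       # errors -> cache keys deleted
        pending = failed + new_traffic  # retries plus fresh lookups
    return pending


# With normal traffic still flowing, the backlog never drains:
print(simulate(initial_queries=500_000, new_traffic=100_000, seconds=60))  # -> 500000

# Stopping all traffic to the cluster (what Facebook did) lets it recover:
print(simulate(initial_queries=500_000, new_traffic=0, seconds=60))  # -> 0
```

The second run shows why the painful remedy worked: once no new lookups arrive, each second the cluster serves part of the backlog without generating fresh retries, and the load drains to zero.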
He commented that the way to stop the feedback cycle was quite painful: all traffic to the database cluster had to be stopped, which effectively meant turning the site off. Once the databases had recovered and the root cause had been fixed, users were slowly allowed back onto the site.
“This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes,” he said.
“We apologise again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.”
Craig Labovitz, chief scientist at Arbor Networks, said that he used ATLAS data to graph Facebook's traffic over a 24-hour period that included the outage: traffic plummeted around 6:30pm GMT and returned shortly after 9pm GMT. He said: “From a quick glance at the data, the outage appears to be global (impacting all of the 80 ISPs).”
Panda Security, commenting on the DDoS attacks carried out by the Anonymous group at the start of the week, said that some claims had been made that Anonymous was somehow involved in taking down Facebook. Panda said: “They have expressed that they are not responsible for the attack, as it has nothing to do with their mission.”
Amusingly, Facebook was forced to use its Twitter feed to acknowledge the problem and say that it was working on a fix.