Twitter has admitted that its service has been disrupted too many times recently.
Twitter spokesman Matt Graves said the bulk of the company's engineering effort is currently focused on long-term solutions to make Twitter more reliable and stable, and that it has now moved resources from other projects onto the problem as users continue to be frustrated by an unavailable service.
However, the micro-blogging website has continued to struggle with the demands of its users. Twitter engineer Jean-Paul Cozzatti explained that a database request locked up the service last Monday, causing problems on both Twitter.com and its API.
While the service was tested to the limit during the World Cup, Cozzatti said more than 50 optimisations and improvements have been made to the platform. These include doubling the capacity of the internal network; improving monitoring of that network; rebalancing its traffic to redistribute the load; doubling the throughput to the database that stores tweets; improving the way Twitter uses memcache, which speeds up the service while reducing internal network traffic; and improving page caching of the front and profile pages, cutting page load times by 80 per cent for some of the most popular pages.
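The page caching Cozzatti describes is typically the cache-aside pattern: serve a rendered page from the cache when possible, and only fall back to the expensive database-backed render on a miss. The sketch below is illustrative only, with a plain in-memory dict standing in for memcache; the names and TTL are hypothetical, not Twitter's code.

```python
import time

class PageCache:
    """Minimal cache-aside sketch: a dict with per-entry TTLs stands in for memcache."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expiry_timestamp, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]
        return None  # missing or expired

    def set(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)

def render_profile(username, cache, render_fn):
    """Serve from cache when possible; otherwise render (hitting the database) and cache it."""
    html = cache.get(f"profile:{username}")
    if html is None:
        html = render_fn(username)            # the expensive path the cache avoids
        cache.set(f"profile:{username}", html)
    return html
```

Repeated requests for a popular profile then cost one render plus cheap cache hits, which is how a cache cuts both page load time and internal network traffic to the database tier.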
Despite the improvements, “there are still times when we run into problems unrelated to Twitter's capacity”, according to Cozzatti.
He said: “On Monday, our users database, where we store millions of user records, got hung up running a long-running query; as a result, most of the table became locked. The locked users table manifested itself in many ways: users were unable to sign up, sign in, update their profile or background images and responses from the API were malformed, rendering the response unusable to many of the API clients.”
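The failure mode Cozzatti describes, where one long-running query holds a table lock and everything queued behind it appears down, can be illustrated in miniature. This is a toy simulation, not Twitter's stack: a `threading.Lock` stands in for the users table's write lock, and the reader fails fast with a timeout instead of hanging.

```python
import threading
import time

table_lock = threading.Lock()   # stands in for the locked users table

def long_running_query():
    with table_lock:            # the errant writer holds the lock far too long
        time.sleep(0.5)

def read_user(timeout):
    """Attempt a read, but fail fast rather than queue behind the stuck writer."""
    if table_lock.acquire(timeout=timeout):
        table_lock.release()
        return "row"
    return None                 # surface an error: sign-ins, sign-ups etc. all fail

writer = threading.Thread(target=long_running_query)
writer.start()
time.sleep(0.05)                       # let the writer grab the lock first
result = read_user(timeout=0.1)        # None: the table is effectively unavailable
writer.join()
```

While the writer holds the lock every read times out, which is why a single query could break sign-up, sign-in, profile updates and the API all at once.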
He said this affected most of the Twitter ecosystem, and a force-restart of the database server in recovery mode took more than 12 hours. “During the recovery, the users table and related tables remained unavailable. Unfortunately, even after the recovery process completed, the table remained in an unusable state,” he said.
“Finally, we replaced the partially-locked user database with a copy that was fully available (in the parlance of database admins everywhere, we promoted a slave to master), fixing the database and all of the related issues.
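Promoting a slave to master involves choosing which replica to promote; picking the one that has applied the most of the master's log minimises lost writes. The sketch below shows only that selection step, with hypothetical replica names and positions; in MySQL the real numbers would come from each replica's `SHOW SLAVE STATUS` output, and the full failover involves repointing clients and the remaining replicas as well.

```python
def promote_replica(replicas):
    """Pick the replica that has applied the most of the master's log.

    `replicas` maps replica name -> replication position
    (higher = more caught up with the failed master).
    """
    if not replicas:
        raise ValueError("no replicas available to promote")
    return max(replicas, key=replicas.get)

# Hypothetical positions for illustration only.
replicas = {"db-replica-1": 10452, "db-replica-2": 10431}
new_master = promote_replica(replicas)   # the most caught-up replica
```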
“We have taken steps to ensure we can more quickly detect and respond to similar issues in the future. For example, we are prepared to more quickly promote a slave database to a master database, and we put additional monitoring in place to catch errant queries like the one that caused Monday's incidents.”
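The monitoring Cozzatti mentions amounts to a watchdog over the database's process list: flag (and potentially kill) any query running longer than a threshold before it can lock a table for hours. A minimal sketch, using made-up rows in place of a real process list such as MySQL's `SHOW PROCESSLIST`:

```python
def find_errant_queries(process_list, max_seconds=30):
    """Flag queries that have been running longer than the threshold.

    `process_list` mimics rows from a database's process/activity view:
    (query_id, sql_text, elapsed_seconds). A real watchdog would poll
    this periodically and alert on, or kill, anything it returns.
    """
    return [row for row in process_list if row[2] > max_seconds]

# Illustrative rows only.
processes = [
    (1, "SELECT * FROM users WHERE ...", 3),
    (2, "UPDATE users SET ...", 1800),   # the kind of query that locked the table
]
errant = find_errant_queries(processes)
```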
Cozzatti also confirmed that Twitter will move into its own data centre later this year to help make the service more reliable. The facility in Salt Lake City, Utah, will give Twitter full control over network and systems configuration, with a much larger footprint in a building designed specifically around its unique power and cooling needs.
He said: “Twitter will be able to define and manage to a finer grained service level agreement on the service as we are managing and monitoring at all layers. The data centre will house a mixed-vendor environment for servers running open source OS and applications.
“Importantly, having our own data centre will give us the flexibility to more quickly make adjustments as our infrastructure needs change. This first Twitter managed data centre is being designed with a multi-homed network solution for greater reliability and capacity. We will continue to work with NTT America to operate our current footprint, and plan to bring additional Twitter managed data centres online over the next 24 months.”