Online encyclopedia Wikipedia was hit by an outage on Wednesday evening due to an overheating problem in its European data centre.
Mark Bergsma, operations engineer for the Wikimedia Foundation, claimed that the global outage was caused by many of its servers shutting down to protect themselves. This resulted in all user traffic being moved to its Florida cluster, for which it has a standard quick failover procedure in place that changes the DNS entries.
He said: “However, shortly after we did this failover switch, it turned out that this failover mechanism was now broken, causing the DNS resolution of Wikimedia sites to stop working globally. This problem was quickly resolved, but unfortunately it may take up to an hour before access is restored for everyone, due to caching effects.”
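The “caching effects” Bergsma mentions can be illustrated with a minimal sketch. It is not Wikimedia's actual setup: the host name `wiki.example`, the two addresses and the one-hour time-to-live (TTL) are assumptions chosen for illustration, as is the simplification that a resolver caches a stale answer rather than a failed lookup. The point is only that a resolver keeps serving whatever it cached until the TTL runs out, so a fix made at the source does not reach every user at once.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    address: str
    expires_at: float  # simulated time, in seconds, at which the entry goes stale


class CachingResolver:
    """Toy resolver that remembers each answer for a fixed TTL (hypothetical)."""

    def __init__(self, authoritative, ttl_seconds):
        self.authoritative = authoritative  # dict: name -> address, stands in for the DNS source
        self.ttl = ttl_seconds
        self.cache = {}

    def resolve(self, name, now):
        record = self.cache.get(name)
        if record is not None and now < record.expires_at:
            return record.address  # still fresh: answer from the cache
        address = self.authoritative[name]  # expired or missing: ask the source again
        self.cache[name] = CachedRecord(address, now + self.ttl)
        return address


# Assumed values: an hour-long TTL and addresses from documentation-only ranges.
authoritative = {"wiki.example": "192.0.2.1"}
resolver = CachingResolver(authoritative, ttl_seconds=3600)

before_fix = resolver.resolve("wiki.example", now=0)      # caches the old answer
authoritative["wiki.example"] = "198.51.100.1"            # the source is corrected
during_ttl = resolver.resolve("wiki.example", now=1800)   # cache still serves the old answer
after_ttl = resolver.resolve("wiki.example", now=3600)    # TTL expired, new answer fetched
```

In this sketch a user behind the resolver sees the old answer for the full hour after the correction, which is consistent with the delay Bergsma describes, although the real TTL values involved are not given in his statement.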
Keith Tilley, managing director UK and executive vice president for Europe at SunGard Availability Services, claimed that the outage highlights the importance of ensuring resilience throughout an organisation.
He said: “In today’s ‘always-on’ culture where information is the lifeblood of an organisation, customers will not hang around if faced with systems and information that are not available; they will simply go to a competitor. For this reason, business leaders need to consider whether they have the capabilities in-house to meet such high levels of customer expectation or whether they should look elsewhere for the skills.
“Ensuring the resilience of existing IT infrastructure can be a laborious, painstaking and costly task for some IT departments, whose time could be better spent on more strategic priorities.
“Organisations need to take time to analyse operational risks, identify those areas where external skills and infrastructure are required or would be more effective and find the right outsourcing partner to mitigate these risks. In doing so, they can ensure information availability and keep their IT teams focused on strategic projects which will naturally grow the business.”