Last Monday I got caught up in the evacuation of Clapham Junction, Europe's busiest railway station, with some 2,000 trains a day, and a peak of 180 per hour, passing through.
Everyone, including people like me who were just changing trains at Clapham Junction, was forced off the platforms and out of the station as it implemented its ‘disaster recovery plan’ – or was it demonstrating its lack of a plan?
Certainly it was a drastic solution to overcrowding on the platforms, which posed a risk to passengers; ironically, the situation had arisen from a union dispute over passenger safety being jeopardised by reduced staffing in the face of increasing passenger numbers.
Effectively the station decided to go for a complete ‘system reboot’, back to its last known good state, i.e. when it had no passengers. Unfortunately it didn't seem that the evacuation plan had been practised, and confusion reigned. Staff were unsure where to direct people and how ‘strongly’ to enforce ejection, while more passengers kept disembarking and adding to the throng. Eventually trains ceased stopping at the station, but it took an age for every last straggler lingering in the waiting rooms to leave, and it was hours before normal service resumed. In the meantime passengers were entirely uninformed about what was happening.
The system was overwhelmed by an unexpected event. Real-time ‘transactions’ and ongoing business were halted at the station, and the problem was one of resilience – how to keep the core functions of the network operating while minimising and containing problems so that they did not affect the entire system. On that higher-priority scale – of minimising network impact – the plan worked. But for those within the ‘contained incident’ the impact could have been greatly reduced by practising the plan.
In the virtual world, we have the option of switching to alternatives that do not need to be geographically co-located – and maybe even ‘virtual’ cloud-based services. We also have the ability to ‘scan’ our traffic before it arrives, sending it via an intermediary and shunting it off to a siding if we don't like the look of it. And we can ban or redirect all traffic coming from the same undesirable source.
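For the technically minded, those three options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Intermediary`, `Backend` and `Request` names, the length-based `looks_suspicious` check and the example addresses are all invented for this sketch and do not describe any particular product.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Request:
    source: str   # where the traffic comes from, e.g. a client address
    payload: str


@dataclass
class Backend:
    name: str
    healthy: bool = True


@dataclass
class Intermediary:
    """Hypothetical go-between that screens traffic before it reaches a service."""
    backends: List[Backend]
    blocked_sources: Set[str] = field(default_factory=set)
    siding: List[Request] = field(default_factory=list)

    def looks_suspicious(self, request: Request) -> bool:
        # Placeholder rule (an assumption): real scanning would be far richer.
        return len(request.payload) > 1000

    def route(self, request: Request) -> str:
        # 1. Ban all traffic from a source we have decided is undesirable.
        if request.source in self.blocked_sources:
            return "rejected"
        # 2. Shunt anything we don't like the look of into the siding.
        if self.looks_suspicious(request):
            self.siding.append(request)
            return "quarantined"
        # 3. Send the rest to the first healthy alternative, wherever it lives.
        for backend in self.backends:
            if backend.healthy:
                return backend.name
        return "unavailable"


proxy = Intermediary(
    backends=[Backend("primary", healthy=False), Backend("cloud-standby")],
    blocked_sources={"203.0.113.7"},
)
print(proxy.route(Request("198.51.100.2", "hello")))     # cloud-standby
print(proxy.route(Request("203.0.113.7", "hello")))      # rejected
print(proxy.route(Request("198.51.100.2", "x" * 2000)))  # quarantined
```

The order of the checks is a design choice: known-bad sources are turned away first so that no effort is spent scanning them, and failover only happens for traffic that has passed both checks.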
But it's not the specific options that we choose that are the most important aspects of the analogy. It's that we need to be prepared if we want to minimise the impact of disruption to our services. Most of us are like Clapham Junction – we'll get through an unexpected event, but at a high cost to our service, our customers and our reputation.
However, a better recovery plan – one that we actually practised and refined to iron out problems – could make recovery a more seamless and less traumatic operation, and reduce losses, whether financial or reputational.
The old adage, a stitch in time saves nine, applies equally to railways and networks – expending time and effort on preparing your response now will save money and reputation later.