Prepare for ‘partial failures’ of IT infrastructure like Visa outage

Visa’s letter to the Treasury Select Committee, documenting details behind the recent outage which left millions of people unable to complete card transactions, reinforces a critical challenge that organisations face when exposed to a ‘partial failure’ of IT infrastructure. This is according to Peter Groucutt, managing director of Databarracks.

This week, Visa revealed that a ‘rare defect’ in a switch caused a partial failure in its primary UK data centre. The issue delayed its secondary data centre from taking over responsibility for handling all card transactions and was the root cause of millions of failed card transactions over 10 hours on Friday 1st June 2018.

In the wake of the outage, the Committee contacted the payments firm, seeking clarification of the cause and assurances about what action Visa is taking to prevent a repeat. Groucutt says a number of lessons can be learned from the findings:

“Businesses are often better prepared for a complete outage than for ‘partial failures’. When a system fails completely, the process to fail over is more clearly defined, whether it is a manual action or an automatic process. Partial failures, however, make that change-over difficult. Once the problem has been identified, you have to decide either to switch fully to the secondary system or to fix the problem on the primary. The point at which to fail over is specific to each organisation and to the issue it is dealing with.

“A switch issue, for instance, will require a different response to a natural disaster. An organisation with good Incident and Crisis Management processes will have these in place – decisions will already have been made and documented, so in the event of an incident, the business knows exactly what to do.

“In practice, a business might decide that it can’t tolerate an outage of longer than four hours. If it takes two hours to become fully operational at a second site, that leaves a window of just two hours to fix the issue before committing to fail-over.”
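The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration: the function names and the incident start time are hypothetical and do not come from Visa's letter or from Databarracks, while the four-hour tolerance and two-hour fail-over time are the figures from the example itself.

```python
from datetime import datetime, timedelta


def failover_decision_deadline(incident_start, max_tolerable_outage, failover_duration):
    """Return the latest time at which fail-over must begin to stay within tolerance."""
    # The window left for fixing the primary is the tolerance minus the fail-over time.
    fix_window = max_tolerable_outage - failover_duration
    if fix_window < timedelta(0):
        raise ValueError("fail-over takes longer than the tolerable outage")
    return incident_start + fix_window


def should_fail_over(now, incident_start, max_tolerable_outage, failover_duration):
    """True once the window for fixing the primary system has run out."""
    deadline = failover_decision_deadline(
        incident_start, max_tolerable_outage, failover_duration
    )
    return now >= deadline


# Hypothetical incident start time, chosen only for illustration.
incident_start = datetime(2024, 1, 1, 9, 0)
tolerance = timedelta(hours=4)       # maximum tolerable outage from the example
failover_time = timedelta(hours=2)   # time to be fully operational at the second site

deadline = failover_decision_deadline(incident_start, tolerance, failover_time)
print("Commit to fail-over by:", deadline)
```

Writing the rule down in this form, before an incident, is one way of documenting the decision in advance so that the deadline is not argued over while the outage is in progress.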

Groucutt continues: “We would expect Visa to have a very mature incident management process in place and, based on the reports, that was absolutely the case. Partial failures can be very difficult to plan for and manage, but the issue was identified and response protocols were initiated.”

Groucutt concludes: “The lesson Visa can take from the incident is that it wasn’t prepared for this particular partial failure and should address this by building new processes to allow the backup switch to take over. We can all do the same.

“It is a good idea to include issues like this in your testing. It’s not just switches – we’ve seen exactly this issue with UPS systems and generators too. An organisation will have a testing schedule for each of these technologies, so it’s important to include the impact of partial failures in those tests. A business should think about how quickly it can identify what the issue is and, importantly, the actions that then need to be taken either to fix the problem and recover or, alternatively, to manually take the system offline and fail over to a secondary site.”
