Horizon IT scandal: repeating patterns

So much has been written and said about the Horizon scandal over the past 10 years that it’s hard to pick out what actually happened, and how it got so bad.

When things go really badly wrong, there’s never just one cause. Many things have to have failed. In the case of Horizon, this list is huge: the tech, the supplier, the culture, the investigations, the lawyers, the system of private prosecutions, the management, the politicians, the governance and scrutiny, the recovery. There are things wrong at pretty much every level.

But many of the things that went wrong are not unique to Horizon. We see them every day.

This is a repeating pattern of failure.

A pattern that pervades the tech industry.

I’m going to break this down into a few areas with examples. But this is by no means comprehensive.

Bad technology

There is no getting away from the fact that the tech was bad - Horizon was riddled with bugs.

If you read the witness statement from David McDonnell, an engineering manager on one part of the system, you’ll get a sense of why it was so bad: poor software development practices, bad release management, no testing, and no support from leadership to do it right.

But Horizon is far from unique here. We see these patterns cropping up over and over again.

In 2016, the Canadian Federal Government rolled out the Phoenix system for paying government employees. By 2018, it had caused pay problems for close to 80% of federal public servants through underpayments, overpayments and non-payments. It’s estimated to have caused CAD$2.2bn in unexpected costs and is still paying some public servants incorrectly eight years after it was introduced.

It’s not possible to build a completely bug-free system, but there are techniques and methodologies that reduce the likelihood of bugs and the impact of those that do occur: things like test-driven development, continuous integration and automated testing.
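To make that a bit more concrete, here is a minimal sketch of what an automated test looks like. The `closing_balance` function and its rule are made up for illustration; this is not how Horizon calculated anything. The point is only that the expected behaviour is written down as code and checked on every change.

```python
from decimal import Decimal
import unittest


def closing_balance(opening, receipts, payments):
    """Hypothetical rule: opening balance plus receipts minus payments."""
    return opening + sum(receipts, Decimal("0")) - sum(payments, Decimal("0"))


class ClosingBalanceTest(unittest.TestCase):
    """Each test states one expectation the code must keep meeting."""

    def test_receipts_and_payments_are_applied(self):
        result = closing_balance(
            Decimal("100.00"), [Decimal("20.00")], [Decimal("5.50")]
        )
        self.assertEqual(result, Decimal("114.50"))

    def test_no_transactions_leaves_balance_unchanged(self):
        result = closing_balance(Decimal("100.00"), [], [])
        self.assertEqual(result, Decimal("100.00"))
```

In test-driven development the tests are written first and the function second. Continuous integration then runs them automatically (for example with Python's built-in unittest runner) every time someone changes the code, so a regression is caught before release rather than in a branch.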

Arguably, if Horizon had been built using these techniques, this issue would have been less likely to occur.

But we have hundreds, if not thousands, of systems in government and the private sector across the world built in the same way. Any one of them could lead to another Horizon, or Phoenix.

Bad design

If you read any of the background of the problems, you’ll see regular references to error messages not being clear, and the system doing unexpected things.

“One, named the “Dalmellington Bug”, after the village in Scotland where a post office operator first fell prey to it, would see the screen freeze as the user was attempting to confirm receipt of cash. Each time the user pressed “enter” on the frozen screen, it would silently update the record.”

This is a system that was never designed with its users in mind. But again, Horizon isn’t unique here. We see this pattern cropping up over and over again.

For example, in September, we heard that hospitals in Newcastle failed to send 24,000 letters to patients because they’d been placed into a folder that staff didn’t know existed. The same issue was reported in Nottingham where 400,000 letters were not sent. Recently, the health service watchdog has said IT failures are causing patient deaths.

In 2019, the driver of an LNER train crashed into another train, causing damage and a derailment (but thankfully no injuries). In the subsequent accident investigation report, it was highlighted that the driver was distracted because he was struggling to set up the train management system. He’d pressed the wrong button earlier in the journey.

These problems are often put down to “user error”. But really it’s just bad design.

Well-designed systems minimise the opportunity for a user to do the wrong thing.
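As an illustration, think back to the frozen screen described above, where every press of “enter” silently updated the record. One common way to guard against that kind of thing is to make the confirmation idempotent, so that repeating it has no further effect. The sketch below is a simplified assumption of how that could look; `CashLedger`, `confirm_receipt` and the receipt ID are hypothetical names, and this is not a description of how Horizon worked or how the bug was fixed.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class CashLedger:
    """Hypothetical ledger that records each cash receipt at most once."""

    balance: Decimal = Decimal("0")
    _confirmed: set = field(default_factory=set)

    def confirm_receipt(self, receipt_id: str, amount: Decimal) -> bool:
        """Apply the receipt once; return False for a repeated confirmation."""
        if receipt_id in self._confirmed:
            # Repeated key press: change nothing and tell the caller,
            # so the screen can show a clear message instead of failing silently.
            return False
        self._confirmed.add(receipt_id)
        self.balance += amount
        return True


ledger = CashLedger()
first = ledger.confirm_receipt("pouch-001", Decimal("8000"))
second = ledger.confirm_receipt("pouch-001", Decimal("8000"))  # duplicate press
print(first, second, ledger.balance)
```

The second call is rejected and the balance only moves once. The design choice is that the system, not the user, carries the burden of noticing a duplicate.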

We have techniques for delivering well-designed systems, such as user-centred design, user research and usability testing.

Arguably, if Horizon had been built using these techniques, this issue would have been less likely to occur.

Bad technology understanding

On Thursday, the Inquiry heard from the Post Office investigator Stephen Bradshaw. We were told that he signed a statement in 2012 saying he had “absolute confidence” in the integrity of the Horizon system, yet he told the Inquiry that he was not “technically minded”.

This caused a bit of a stir in my social circle, but is far from unusual.

Many people involved in commissioning, managing and operating technology do not have the technical knowledge to understand how systems work. I’ve seen senior individuals talk about their lack of technology understanding as if it’s a badge of honour.

Not having enough technical understanding or curiosity makes it incredibly difficult to identify where problems are occurring, to challenge delivery, or hold organisations to account.

If the Post Office staff managing the Fujitsu relationship had had a better understanding of how the system worked, it would have been obvious that the system was poorly built. If the Post Office investigators had had a better understanding of the system, they’d have known what faults were likely to occur. If the Post Office leadership had had a better understanding of the system, they’d have known they were on risky ground from day one. If the officials in government had had technology understanding, they’d have seen this issue coming.

This isn’t unique to Horizon.

In 2018, TSB Bank migrated data to a new IT platform. The migration caused significant disruption to services, affecting all branches and a large proportion of its 5.2 million customers. TSB had to pay £32.7m in compensation to customers. In a review of the incident, the Financial Conduct Authority (FCA) highlighted that it was critical for TSB to understand how the banking platform was being built, but it didn’t. The FCA and the Prudential Regulation Authority (PRA) together fined TSB £48.65m.

In 2013/14 the corruption of patient data at the Princess of Wales hospital prompted a criminal investigation, resulting in approximately 70 nurses being disciplined, with some charged with wilful neglect. The hospital investigators, police and prosecutors were unaware of the cyber security incident that corrupted the data and assumed the nurses had fabricated records. There was an assumption that the system was infallible. Only after a technical expert witness started asking the right questions in court did the issues become obvious. Sound familiar?

In modern governance, we don’t let people without a financial background manage finances, we don’t let people without a legal background manage legal issues. But it’s absolutely ok for someone without any tech experience to buy, manage and deliver tech systems that are absolutely core to the running of our organisations.

Bad culture

The final area I want to talk about is culture.

In David McDonnell’s witness statement, which I linked to above, he talks about the concerns he raised about the issues with the system. When he caused a fuss, he was replaced.

In 2003, Jason Coyne, an IT expert, was instructed to examine the Horizon system. He said he notified the Post Office that the data was “unreliable”, but he was ignored, sacked and then discredited.

In Stephen Bradshaw’s appearance at the Inquiry this week, we heard that he was told by his peers that there was a growing body of cases relating to bugs in Horizon, but because it didn’t come from above he ignored it.

This week we’ve heard a recording of the Director of Comms at the Post Office continuing to call the sub-postmasters “criminals”.

These all show that there was no safe space at the Post Office / Fujitsu for people to raise concerns, or challenge what was happening.

We’ve seen from other cases that this sort of culture ruins lives. For example, in the recent Lucy Letby case, we’ve heard that people tried to raise concerns but they were ignored and forced to apologise.

In some safety-critical industries, there is a deliberately designed no-blame culture. People are encouraged to raise issues. When something goes wrong, an independent investigation establishes the facts. Recommendations are widely shared. Everyone gets better. Safer.

In the wake of the Alaska and Japan airplane incidents over the last few weeks, the New York Times published this great article on airplane safety and how decades of “no-blame” investigations have made errors less likely.

In the UK the Air, Rail and Marine Accident Investigation Branches take a ‘no blame’ approach to identifying the causes of incidents, with the sole aim of stopping incidents happening again.

Even though technology is now a fundamental part of our lives, the industry rarely approaches failures with a ‘no blame’ culture. Some organisations do, and there are techniques like blameless retrospectives and incident postmortems which help. But it’s far from common.

This has to change.

The systemic issue

As I’ve said, Horizon isn’t unique. It’s not simply a case of a bad IT project, a bad supplier, a bad organisation.

These issues are systemic. They’re repeated over and over and over again.

I’ve highlighted a few cases in this post, including Canada’s Phoenix system, hospitals in Newcastle and Nottingham failing to send letters, the TSB migration, and the corrupted data at the Princess of Wales Hospital.

But there are many more.

These things shouldn’t go wrong. But they do. All the time.

Tech is fundamental to the running of our world. It’s ingrained in everything we do. And it’s getting more and more complex all the time.

We’re on the verge of a political change in the UK. Now is the perfect moment to rethink how we build, design, commission, manage and treat technology. I look forward to seeing how all the political parties approach this in the run-up to the election.

Unless we change, we’re doomed to repeat Horizon again.
