Friday, June 2, 2017

Doofus Of The Day #960

Today's award goes to the contractor allegedly responsible for completely disrupting British Airways' flights for several days last week.

The crisis was caused by power being suddenly lost to BA's two main data centres, with the problem worsened by an uncontrolled reboot of the system which shut down the entire IT system. All information about flights, baggage and passengers was lost and travellers were left stranded over the bank holiday weekend with at least 700 flights cancelled at Heathrow and Gatwick.

Bill Francis, Head of Group IT at BA's owner International Airlines Group (IAG), sent an email to staff, seen by the Press Association, which confirmed that the shut down had not been caused by IT failure or software issues.

His email revealed that an investigation so far had found that an Uninterruptible Power Supply to a core data centre at Heathrow was over-ridden on Saturday morning.

He said: "This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries. This in turn meant that the controlled contingency migration to other facilities could not be applied.

"After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system, and significantly exacerbated the problem."

There's more at the link.

I daresay the maintenance worker who flicked the wrong switch (and/or the company that employed him) won't be able to afford to pay the several hundred million pounds that the disruption is going to cost BA, after all compensation claims are paid and its aircraft and crews are finally back at work.  That's a hellish expensive mistake!



Old NFO said...

Huh, overriding a UPS is not a trivial issue at that level. That's starting to sound more like sabotage. I've been involved with data center UPS systems, and that story just doesn't make sense.

Rick T said...

One of our hospital customers was replacing the facility-grade UPS for their server room and the project was almost complete. When the contractor tried to go into bypass mode for final testing, the server room went black. Yes, the wiring wasn't done correctly, nor was it verified during the installation.

Factor in a panicking tech who tries to turn things back on after the "Oh S**T" and you get damage during the power surge when EVERYTHING tries to restart at the same time.

The big question is why they didn't execute a Disaster Recovery failover to the 2nd data center to resume operations. The most likely answer is that they haven't been able to complete a successful DR test in years, but it wasn't important enough to get the cash budgeted to fix the problems.

MadMcAl said...

The interesting point is that TWO main data centers were affected.

If there is a single point of failure below catastrophic (EMP, Nuke or Godzilla) that can shut down not one but two main data centers, then we have something akin to a vent going directly to the main reactor. The only thing missing is the X-Wing launching the proton torpedoes.

And they did not execute a Disaster Recovery because the shutdown and resulting disorganized restart physically damaged the servers. They had to replace several of them, followed by an ordered start-up. Then everything worked fine.

Unknown said...

It's BA's fault: they didn't design a resilient system.

I've worked in similar environments, and the one constant is that someday, your datacenter is going to lose power, and you'd better have a plan for when that happens.

I've seen UPS systems fail. I've seen generators shut down because the person filling the tank put the nozzle in the wrong hole and air got into the fuel line. I've seen someone unplug one item and drop the plug, only to have it hit a power switch on its way to the floor.

Things happen. If shutting down one UPS caused this sort of chaos, the system was badly designed (and never tested).

David Lang

RBM said...

I kinda call BS here. Shouldn't the data centers have real-time failover? There is software for enterprise companies that mirrors every transaction to a duplicate data center off-site. I've seen it in practice. BA was cheap and paid for it.

Unknown said...

Doing real-time failover between datacenters is HARD. It's not just 'buy this software and install it', and there are very real performance costs to worry about (including speed-of-light limitations on how fast the signals can get from datacenter to datacenter).
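To put a rough number on the speed-of-light point, here is a minimal back-of-the-envelope sketch. It assumes signals travel through optical fibre at roughly 200,000 km/s (about two thirds of the speed of light in a vacuum) and ignores routing, switching and disk latency, so real figures will be worse. The function names and the example distances are illustrative only and have nothing to do with BA's actual sites.

```python
# Assumption: light in optical fibre covers roughly 200,000 km per second.
FIBRE_SPEED_KM_PER_S = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time between two sites, in milliseconds."""
    return 2.0 * distance_km / FIBRE_SPEED_KM_PER_S * 1000.0


def max_serial_sync_commits_per_s(distance_km: float) -> float:
    """Upper bound on commits per second for ONE serial stream of
    transactions, if each commit must wait for the remote site to confirm."""
    return 1000.0 / min_round_trip_ms(distance_km)


if __name__ == "__main__":
    # Hypothetical distances, chosen only to show the trend.
    for km in (10, 100, 1000):
        print(km, "km:",
              round(min_round_trip_ms(km), 3), "ms round trip,",
              round(max_serial_sync_commits_per_s(km)), "serial commits/s at best")
```

The further apart the sites are (which is what you want for disaster protection), the longer every synchronously replicated transaction has to wait.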

The vast majority of enterprises don't replicate every transaction individually; they rely on periodic copies of larger chunks of data (for example, hourly jobs that replicate everything that took place in that hour).

And like all backups, such systems need to be tested periodically (and I've known companies that have been around for many years, one nearing two decades, that have never had a truly successful test, whatever they claim to the auditors).

It is the airline's fault, not the contractor's, but don't fall into the trap of assuming that this was a trivial thing for them to have done right.

David Lang

Rick T said...

Proper Business Continuity (BC) requires a local data center (which they had) but also 100% duplicate hardware maintained as a functional mirror, so the outside world never knows which set of computers is running at any time. Properly done, the 2nd site should have taken over production immediately, allowing proper time and analysis rather than just cranking the main circuit breaker closed at the failed site without allowing for surge loads and damage.

As David said, this is HARD and a very expensive insurance policy. BA rolled the dice and came up short. There will be a lot of hard questions at the next board meeting.

Unknown said...

Very few companies have business continuity plans that make a datacenter failover transparent. That is incredibly expensive to do.

Like all risk management, it's a matter of cost/benefit analysis. It costs X to be able to be up and running again in 48 hours, 2X to be up and running in 24 hours, 4X to be up and running in 12 hours.
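The rule of thumb above can be written down as a tiny sketch. It assumes the three data points in the comment (X for 48 hours, 2X for 24, 4X for 12) extend into a simple "halve the recovery time, double the cost" rule; that extrapolation and the function name are illustrative assumptions, not real pricing.

```python
def relative_recovery_cost(target_hours: float, baseline_hours: float = 48.0) -> float:
    """Cost as a multiple of X, where X buys recovery within baseline_hours.

    Assumption: cost doubles every time the recovery target is halved,
    which is the same as cost being inversely proportional to the target.
    """
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return baseline_hours / target_hours


if __name__ == "__main__":
    for hours in (48, 24, 12, 1):
        print(hours, "hours ->", relative_recovery_cost(hours), "X")
```

Under that assumption, being back up within one hour would cost 48X, which gives a feel for why near-instant failover is so rarely funded.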

Most companies don't have duplicate hardware; they have some sort of degraded capability that they can bring up. With "modern" "cloud computing", you are lucky if a company even mirrors its data to another datacenter, and in a disaster they plan to recreate their entire setup from scratch in a new datacenter on virtual machines.

In theory, this is trivial to do. In practice it is much harder, and the vast majority of companies that try it fail when they actually need to use it and end up taking FAR longer to get back up and running.

We've even seen Amazon have problems when there is a datacenter outage, and they are in the top 1% of capabilities in this area.