Tuesday, October 6, 2020

COVID-19, bureaucrats, and old technology - a recipe for a screw-up


The BBC reports that British bureaucrats "lost" data on thousands of COVID-19 cases due to using the wrong file format.

The badly thought-out use of Microsoft's Excel software was the reason nearly 16,000 coronavirus cases went unreported in England.

And it appears that Public Health England (PHE) was to blame, rather than a third-party contractor.

The issue was caused by the way the agency brought together logs produced by commercial firms paid to analyse swab tests of the public, to discover who has the virus.

They filed their results in the form of text-based lists - known as CSV files - without issue.

PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system and made available to the NHS Test and Trace team, as well as other government computer dashboards.

The problem was that PHE's own developers picked an old file format for those templates - known as XLS.

As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of.

And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases.
When that total was reached, further cases were simply left off.

For a bit of context, Excel's XLS file format dates back to 1987. It was superseded by XLSX in 2007. Had this been used, it would have handled 16 times the number of cases.

At the very least, that would have prevented the error from happening until testing levels were significantly higher than they are today.
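
To make the failure mode concrete, here is a minimal sketch in Python of how appending CSV results into a sheet capped at the old XLS limit of 65,536 rows silently drops everything past the cap, and the sort of check that would have caught it. The file names and layout are hypothetical - this illustrates the row limit, not PHE's actual pipeline.

```python
import csv
import glob

XLS_MAX_ROWS = 65_536       # legacy .xls worksheet limit (the 1987-era format)
XLSX_MAX_ROWS = 1_048_576   # modern .xlsx worksheet limit - 16 times as many rows

def gather_rows(csv_glob):
    """Concatenate rows from all of the labs' CSV files (hypothetical file names)."""
    rows = []
    for path in sorted(glob.glob(csv_glob)):
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    return rows

rows = gather_rows("lab_results_*.csv")

# What the pipeline effectively did: keep the first 65,536 rows and silently
# drop the rest on the floor.
template_rows = rows[:XLS_MAX_ROWS]
print(f"kept {len(template_rows)} rows, silently lost {len(rows) - len(template_rows)}")

# What any ingestion step should do instead: refuse to lose data quietly.
if len(rows) > XLS_MAX_ROWS:
    raise RuntimeError(
        f"input has {len(rows)} rows but an .xls sheet holds only {XLS_MAX_ROWS}; "
        "use .xlsx or split the data instead of truncating"
    )
```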

But one expert suggested that even a high-school computing student would know that better alternatives exist.

"Excel was always meant for people mucking around with a bunch of data for their small company to see what it looked like," commented Prof Jon Crowcroft from the University of Cambridge.

"And then when you need to do something more serious, you build something bespoke that works - there's dozens of other things you could do.

"But you wouldn't use XLS. Nobody would start with that."

There's more at the link.

Tell me again how Big Brother is on our side, and looking out for us, and has our best interests in mind?

I wonder how many similar errors were made by government bureaucracies all over the world in the panicked rush to deal with the COVID-19 pandemic?  Were some of the thousands of deaths of elderly residents in New York nursing homes caused by bureaucrats who crunched the wrong numbers, and concluded that nursing homes were the safest place for the elderly to be isolated from outside risks of infection - while another set of bureaucrats crunched a different set of numbers, and decided that the best place for elderly people already infected with, and (hopefully) recovering from, COVID-19 was inside those same nursing homes?  We all know the result - but why did it happen?  How much of it was due to bureaucratic miscalculation?

Bureaucrats.  Data.  Some assembly required.  Some errors expected.

Peter

9 comments:

Ray - SoCal said...

Assuming no bias / ideology drives decisions.

Numbers can be manipulated / cherry picked to support the position one wants. Especially forecasts.

A lot of that seems to be happening with Covid 19 to support draconian responses that were actually ineffective.

Sam L. said...

Some firings needed.

Eric Wilner said...

Inappropriate use of spreadsheets isn't exactly a government-bureaucrat problem; it turns up everywhere, including high-tech companies.
The spreadsheet program (Excel, LibreOffice Calc, whatever) is handy, so it gets used for mocking up some data management when there's no time to do it right.
Then, well, there's never time nor budget for doing it right, because after all we've already got the data and formulæ in a spreadsheet!
Thus we end up with things in spreadsheets that ought properly to be in SQL databeese with central servers, access control, and consistency control - and the spreadsheets get passed around and manually edited by multiple people until there can never again be a current and correct version.
... And don't get me started on what PowerPoint has done to the whole concept of project reports....

Judy said...

Anybody who has used Excel in a commercial/group setting will tell you how God awful it is for accuracy. It's great for small temporary data manipulation done by ONE person. The margin of error goes up exponentially as more people have access for data input.

Eck! said...

Over the years I've seen Excel used as a database, an editor, and even an email formatter. Those were some of the obvious cases.

They should be using a database, not some desktop spreadsheet tool.

Excel can then interface to the database and handle the data for presentation or other uses.

Dumb bunnies.

Eck!

Sherm said...

One of the biggest dangers of bureaucracy, particularly government bureaucracy, is that there is no penalty for being wrong.

C. S. P. Schofield said...

The great weakness of 1984 as world building is the assumption that a Socialist State can maintain a network of ever-present surveillance AND MAKE IT WORK. Socialist States are run by bureaucrats, not technicians. They routinely place technicians in the classic "They pretend to pay us, and we pretend to work" situation (Yes, I know the original quote has that reversed). Result? The Bureaucrats depend on a technology they do not understand and cannot assess.

Would such a world still be an absolutely miserable place to live? Certainly. But Big Brother would have serious astigmatism.

Will said...

I suspect that people play with spreadsheets because they are easy and CHEAP. Building a database is expensive. Had a friend twenty years ago who built them using Oracle. His charge was at least $100/hr+OT. No idea what the cost would be today.

One of his last jobs was building one for CA's CAL-TRANS, so they could track road repairs. 3/4's of the way done, he went to work the day after the elections and found the door locked and a sign indicating that his contract was terminated. Change of regime, and they tossed lots of things so they could play patronage games. Lots of money wasted, I'm sure.

Tirno said...

I dealt with this specific problem back in 1998.

I was finishing my degree in Atmospheric Science (i.e. meteorology), and the second-year AtmoSci course included a long lab class. We were spending 12 class hours per week in that lab, which is a lot considering that we were still only getting (and, to be fair, paying for) the maximum allowed 5 credits for the class.

One of the labs involved receiving a provided CSV file of sensor data, taken every 10 seconds, of numerous types: humidity, CO2 levels, dust levels, etc., sampled continuously over two weeks after we set up the instruments, with only occasional drop-outs of individual instruments. The assignment was to analyze the data to derive averages, means, and standard deviations for all the various sensors, after applying initial corrections to the raw sensor inputs based on calibrations derived in class.

Let's do the math: 14 days * 24 hours * 60 minutes * 6 times per minute equals: 120,960 rows.

I didn't have a problem. I was a native UN*X nerd, and I was writing my stats analysis tools in Perl using streaming stats techniques I had adapted from the FORTRAN examples in my textbooks. So I'd written a simple set of tools to take the raw input file, run it through a calibration program to adjust the raw sensor values to true values, pick my desired output columns using awk, and stream the data through a stats script (automatically excluding any blank lines due to sensor outage). From that I derived the values the assignment required and produced my deliverable. Written in HTML. By hand. Printed out at school using a web browser on a Sun SPARC station because if you're going for nerd-cred, get all the nerd-cred.
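
For the curious, here's a rough modern sketch of that kind of one-pass streaming stats tool - in Python rather than the original Perl, using Welford's online algorithm as one common approach (not necessarily the exact FORTRAN-derived technique), with a made-up column index:

```python
import csv
import math
import sys

def stream_stats(reader, column):
    """One-pass mean/std over a single numeric column (Welford's algorithm),
    skipping blank values such as rows where a sensor dropped out."""
    n, mean, m2 = 0, 0.0, 0.0
    for row in reader:
        field = row[column].strip() if column < len(row) else ""
        if not field:
            continue  # sensor outage / blank line
        x = float(field)
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, std

# Usage, piping calibrated CSV data in and picking a column much as awk would
# (the "calibrate" step here is hypothetical):
#   calibrate < raw_sensors.csv | python stream_stats.py 3
if __name__ == "__main__":
    n, mean, std = stream_stats(csv.reader(sys.stdin), int(sys.argv[1]))
    print(f"n={n} mean={mean:.4f} std={std:.4f}")
```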

My classmates, who were all trying to get this done via Excel, were in deep shit. Eventually, one asked me, "How are you dealing with all the data? Excel keeps crashing." I told them I wasn't using Excel; I was using my own tools. My classmate explained the problem to me, and I thought about it and said, "How about I make you a file that is a one-minute average of all the sensor readings, post-calibration, so you can fit just 20,000 rows into Excel?" I swear, they almost collapsed in shock. So, between classes, using the Sun SPARC workstation in the lecture room, I used one of my Perl scripts as a template to write, on the spot, a script to ingest the data downstream of the calibration program, accumulate the data for each minute, dump an averaged output row whenever the minute field incremented, and voilà - a data file they could actually use.
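
And a rough sketch of that kind of one-minute-average downsampler, again in Python with a hypothetical column layout (minute field in column 1, calibrated sensor readings after it):

```python
import csv
import sys

def minute_averages(reader, minute_col=1, first_data_col=2):
    """Emit one averaged row per minute of calibrated sensor readings.
    Column positions are hypothetical - adjust to the real file layout."""
    current_minute, sums, count = None, None, 0
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines from sensor outages
        minute = row[minute_col]
        values = [float(v) for v in row[first_data_col:]]
        if minute != current_minute:
            if count:  # dump the previous minute's averages
                yield [current_minute] + [s / count for s in sums]
            current_minute, sums, count = minute, [0.0] * len(values), 0
        sums = [s + v for s, v in zip(sums, values)]
        count += 1
    if count:
        yield [current_minute] + [s / count for s in sums]

if __name__ == "__main__":
    writer = csv.writer(sys.stdout)
    for out_row in minute_averages(csv.reader(sys.stdin)):
        writer.writerow(out_row)
```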

A few weeks later, I got my graded report back and I had a B. So I went to the junior professor running that particular assignment and asked what was wrong with my report.
"Nothing, you had slightly different answers than everyone else in the class", she said.
"That's because I was using the raw data file," I replied.
"You... what?" With the look of a scientist proverbially intending to downshift from fifth gear to fourth gear to explain to this undergrad how the world works, but accidentally shifting into reverse and the engine leaping through the hood of the car in a tangled fiery mess.

So I walked this junior professor over to the Sun SPARC station in the corner of the lecture room, showed her the original data file, the calibration script, the one-minute-average script, and the resulting averaged file with a timestamp of three weeks ago, then ran that file through awk and my stats tool and said, "I bet my classmates got THIS_VALUE for the average dust concentration metric, whereas I got *clickety-clickety-click* OTHER_VALUE using the raw data. Incidentally, this does mean that every one of them used my calibrated values, so they didn't have to do that part of the assignment themselves."

And so my B was turned into an A+.

I won't say I was all that great an AtmoSci student, but don't F with me when it comes to crunching the data.