The need for software testing: Neil Ferguson's unstable epidemiologic model
Link information
- Title
- Neil Ferguson's Imperial model could be the most devastating software mistake of all time
- Authors
- David Richards and Konstantin Boudnik
- Published
- May 16 2020
- Word count
- 319 words
I want to be crystal clear that I do not support the criticisms in this piece as a reason to oppose stay-at-home orders and the subsequent economic impact. I believe that regardless of the model/code used to make those decisions, they were ultimately the right choice.
Concerning the article, which is more of an opinion piece: I shared the Outline link since the original is paywalled. I work as a software tester. It is troubling that the code underlying this COVID model allegedly produces different results depending on how it is run and on what platform. I expect some small numerical noise in simulations, especially when running distributed across multiple cores, but if the results vary significantly (as alleged in this piece), that calls the reliability of the model into question.
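To illustrate the kind of benign numerical noise I mean, here's a toy sketch (nothing to do with the model's actual code; the data is made up): floating-point addition isn't associative, so merely combining the same terms in a different order, which is effectively what a multi-core run does, can shift the last digits of a result.

```fortran
! Toy illustration (not from the Imperial model): floating-point addition
! is not associative, so summing the same numbers in a different order,
! as a parallel run effectively does, can change the last few digits.
program summation_order
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000
  real(dp) :: x(n), forward, backward
  integer :: i

  ! Terms of very different magnitudes make the effect easier to see.
  do i = 1, n
     x(i) = 1.0_dp / real(i, dp)**2
  end do

  forward  = 0.0_dp
  backward = 0.0_dp
  do i = 1, n                      ! largest terms first
     forward = forward + x(i)
  end do
  do i = n, 1, -1                  ! same terms, smallest first
     backward = backward + x(i)
  end do

  print *, 'forward sum:  ', forward
  print *, 'backward sum: ', backward
  print *, 'difference:   ', forward - backward
end program summation_order
```

That sort of last-digit wobble is what I'd consider acceptable; wildly different epidemic curves from the same inputs would be something else entirely.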
I thought these alleged software flaws were interesting and wanted to share them.
Speaking as a computer scientist, and without having bothered to look at the original code, some of the criticisms seem hollow to me:
I've done some research, and the authors might be referring to accidentally declaring things as single precision, or accidentally casting an intermediate result to single precision. That's not great, but there's no mention of whether that actually happens in the code. Sure, Fortran isn't great, but from what I've heard, Fortran underpins a lot of academic code outside of CS.
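For anyone wondering what "accidentally typing things as single precision" can look like, here's a made-up sketch (I have no idea whether the real code does this; the variable names are invented):

```fortran
! Made-up sketch of an accidental single-precision intermediate, not code
! from the Imperial model. A default REAL carries roughly 7 significant
! digits, so routing a value through one silently discards half the
! precision of a double.
program precision_loss
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: careful, sloppy
  real     :: tmp                  ! oops: default (single) precision

  careful = 1.0_dp / 3.0_dp        ! kept in double, ~16 digits
  tmp     = 1.0_dp / 3.0_dp        ! silently narrowed to ~7 digits
  sloppy  = tmp * 3.0_dp           ! the damage is already done

  print *, 'double all the way: ', careful * 3.0_dp
  print *, 'via single temp:    ', sloppy
  print *, 'error introduced:   ', abs(careful * 3.0_dp - sloppy)
end program precision_loss
```

Compilers can usually be told to warn about this kind of implicit narrowing (gfortran's -Wconversion, for instance), but such warnings aren't on by default.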
Yeah, the guy who made the software is not a software engineer; he's being paid to do research in epidemiology. This is just how a lot of research code is: one guy knows everything about a piece of code and does all the work on it. Is it good software engineering? No, but he's not being paid to do that, and sadly no one else is paid to do it for him either.
Yeah, there's no point preparing for the advent of testing if you have no one to write any tests. Still, that's not an actual criticism of the code or of the conclusions derived from it.
Frankly, that isn't all that concerning. If one computer always rounds down and another rounds up, that's not a problem when you're working with 16-digit numbers. Again, the piece fails to show that the conclusions are affected.
As a computer scientist... get off your high horse (tempering my choice of words here...). This is not a CS problem. Not everything is. You know another piece of code that's nondeterministic, horribly designed, unmaintainable, and has lives at stake? Every law ever. Go play with lawyers.
Can rigorous software engineering improve the model? Probably. We could probably buff out the non-determinism, make it easier to adapt to future threats, make it more performant, yada yada yada. But the article completely fails to demonstrate that the outcome would be any different. Methinks this is an attack on the lockdown by proxy: if the lockdown is a consequence of a prediction based on a model written with dodgy code, then the lockdown was wrong...?
Fuck no, the lockdown was necessary if you don't want to be worse off than Italy. What could have prevented this is "testing, testing, testing". Not unit tests on your code, but early COVID-19 testing once the threat was identified. Germany did that, and the (imo) most trustworthy German virologist attributes our success entirely to knowing early what was going on. Our testing let us know the virus was here and how fast it was growing, while other countries waited a lot longer.
By all means, fund epidemiology and virology better. Give Neil Ferguson's lab a trained software engineer. Fund the NHS adequately to enable them to quickly develop and scale up the required tests. But don't open back up without knowing WTF is going on.
Ninja edit: Not throwing shade at OP here; 'twas a good read, and you rightly distanced yourself from the epidemiological part of the argument.
That's a good catch. Well, Mr. Richards, you know what else runs on Fortran? The Large Hadron Collider. LIGO. The International Atomic Energy Agency. I could probably fill encyclopedias with my list of complaints about Fortran, but it is a solid language and the glue that holds up a lot of academic research. There are mountains of tried-and-true Monte Carlo simulations written in Fortran. Experienced researchers are well aware of its pitfalls and know how to use it well. This is the software equivalent of an ad hominem: if I can't find an actual flaw, I'll cast aspersions on the language used.
I'm 100% on board with your criticisms. I agree that the article is essentially a flawed attack on the lockdowns, which were a necessary step in limiting spread. I appreciate your long response and perspective here.
I've come to understand the same thing. Like any language, Fortran can be written well.
I think a much better outcome of this article would be to advocate for precisely what you said: better funding and possibly supplying someone to write better code.
Nevertheless, I am inclined to wonder whether the high death rates predicted (for the US) led to a sort of self-fulfilling prophecy. We are nearing the 100k deaths mark. There is a litany of other issues, namely testing and unclear leadership directives, that got us here. However, the 100k death toll was stated early on as a "very good job" in limiting the impact, so I wonder whether that partially charged public sentiment with a "nothing we can do" mentality. What if 10k deaths had been the "very good job"? Would the response have been more aggressive, or would it have been worse? It's obviously a speculative question with no concrete answer.
FORTRAN with implicit none is generally fine: it disables the implicit creation of new variables, so I, J, K aren't silently treated as integer counters and every variable has to be declared explicitly. In implicit mode there are many ways to shoot yourself in the foot (and unfortunately, lots of legacy code is written like this), but you can do that in modern languages too. There's nothing inherently wrong with FORTRAN, but I'll take C or C++ any day.
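To make that concrete, here's a contrived example of the foot-gun it guards against (nothing to do with the model's actual code; the variable names are invented):

```fortran
! Contrived sketch of the classic implicit-typing foot-gun, not code from
! the model. With IMPLICIT NONE every variable must be declared, so a typo
! is a compile-time error instead of a silent new variable.
program typo_demo
  implicit none
  real :: growth_rate
  growth_rate = 0.3

  ! Uncomment the next line: with IMPLICIT NONE it fails to compile because
  ! GROWHT_RATE was never declared. Without IMPLICIT NONE it would quietly
  ! spring into existence as a separate, uninitialised variable and the
  ! program would run to completion with a meaningless result.
  ! print *, 'rate: ', growht_rate

  print *, 'rate: ', growth_rate
end program typo_demo
```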
Edit: honestly the worst thing about FORTRAN is that the language spec is behind a paywall, so you don’t have an equivalent to things like: https://en.cppreference.com/w/
That's a roundabout way of saying: if I've interpreted what implicit none does incorrectly, it's because the documentation I've found is not great :P
You could easily argue the other way: that too-low predictions would result in dangerous complacency. But I think focusing on public reaction to predictions in this way misses the point of doing science. It's important to separate figuring out what's going on from communicating those results.
It's particularly important to communicate about uncertainty. The total death toll is the area under a curve and it's very difficult to predict the peak or back side of the curve in advance. Due to parameter sensitivity, it's based on little more than guesses. The main message was accurate: if we don't stop it, this is going to be very bad.
In such circumstances, giving out any single number is misleading - it should be a wide range.
I'm not sure what machines this code was being written on, but I want to note that a lot of academic code is written in C or Fortran simply because compiler support for other languages is pretty much non-existent on high-performance machines. And, of course, Fortran is blazing fast, and the recent updates to the language make it way more usable than, say, Fortran 77.
I actually find this whole thing kind of humorous, as I never realized that others thought academics were writing deployment-ready software packages for their research. Most academic code is never touched by anyone other than the person who wrote it, and coding practices like the ones we see here are pretty much the norm. I don't think it's a huge issue overall, since the code in this case is a means to an end, not an end in itself.
What? I mean, at the very least you'll have C/C++/Fortran, CUDA, Python, OpenMP/OpenACC, various MPI stacks, runtime/transport layers (e.g., Charm++), etc. Not sure what other compiler support you're really looking for there, as that covers 95% or so of HPC codes. You could always build your own Rust compiler or something if you needed to.
Fortran is "blazing fast" in the sense that compilation times are fast... but in that case, so is C. The code itself can been fast or slow depending on how optimized it is. It's not inherently faster or slower than properly written C / C++ or even something like Python using Cython or numpy to offload the computation-y bits.
No, you're definitely right on both points. I used C and Fortran as an example there, but I mainly meant that 'newer' languages like Rust, Go, etc that get pointed to as "replacements" for things like Fortran or C/C++ don't have as much support. You're definitely correct that all of those things have HPC support on many systems.
I personally don't think using Fortran is somehow inherently an issue, especially considering the language has been updated multiple times in the past years. This isn't as relevant for the linked article, but I have seen criticisms passed around about Ferguson's model being based on a 'decades old' language as though that was itself an issue. I guess I was more responding to an argument not made here.
Most of the things you mention make shooting yourself in the foot or writing bad code just as easy as Fortran does (at least as far as my experience will carry you).
Python, C, Fortran and the shader-like language I learned for HPC (was it OpenMP or OpenCL? Don't remember) are all various kinds of dangerous if not used properly.
So from that angle, @gpl is kind of right: Academics write brittle code because the HPC ecosystem kind of sets you up for it. Keep in mind I'm not saying it's impossible to write good code in any of the above, just that you can find yourself shifting to less-than-ideal practices or can mess up and not even notice.
Can you tell I like my compiler to be the dominant part in our relationship? :D
You make a great point. I wonder why more academic researchers haven't shifted to Matlab. It's not outrageously expensive and looks to have parallel/cluster computing packages to handle GPU acceleration and MPI. I think there's an argument that this research is pretty important and repeatability is definitely desirable, but in a lot of cases, code is just for the person writing it.
Matlab is $860 a year (or $2150 for a perpetual license).
Python is free, and kicks the pants off of Matlab w.r.t. GPU acceleration, extensibility, and available packages (e.g., TensorFlow and friends) any day of the week.
Really the only reason I ever see to use Matlab is if you're deep into Simulink or their control-process design stuff.
I would imagine their license cost is tenable for a research institution. But yes, Python is becoming a strong competitor with the numerical packages it has available. I can't speak to its distributed computing or whether it's comparable in compute speed.
I think robotics is a good application of Matlab. It's obviously excellent for linear algebra, so I know a lot of academic heat transfer codes are written in Matlab in order to do essentially custom finite element work.
I think if that's a prevailing view, it needs communicating that corona is probably a lot less severe than previously thought. The original Wuhan reports indicated a 3% fatality rate, which, combined with an expected ~60% infection rate in an improperly mitigated scenario, gives you 2% of the population dead. For the US, that's 6 million. That's just completely staggering. With a bit of optimism, you can convince yourself that the infection fatality rate is currently as low as 0.3% (it certainly isn't 1% at this point). That gives you 600,000 dead. In other words, if the government fails to contain this, we expect at least that many, more if hospital overload is increasing fatality rates. If we scale down how we judge the government reaction accordingly, that 10k death toll came and went for the US. Compare, too, that Germany exceeded that target death toll (adjusted for population) by a factor of ~2, and we've generally done a good job; so by that metric, the 100k/10k goal is quite optimistic.
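Spelling that back-of-envelope arithmetic out (assuming a US population of roughly 330 million and the ~60% attack rate above):

```fortran
! The back-of-envelope numbers from the paragraph above, made explicit.
! Assumptions: US population of about 330 million, ~60% of the population
! eventually infected in a poorly mitigated scenario.
program ballpark_deaths
  implicit none
  real :: population, attack_rate
  population  = 330.0e6
  attack_rate = 0.6

  print *, 'deaths at 3% IFR:   ', population * attack_rate * 0.03    ! ~6 million
  print *, 'deaths at 0.3% IFR: ', population * attack_rate * 0.003   ! ~600,000
end program ballpark_deaths
```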
In any case, I don't think the US government did a good job at all. How do we communicate that? Ahh, that seems to be the question we have to ask at every corner here, isn't it? Nothing about this pandemic is simple enough for how our news works these days.
We're free to share outline links in comments, but AFAIK the policy on Tildes (likely due to Canadian copyright issues) is to always link the primary source so long as that site is still functioning, regardless of whether it's paywalled. See: @Deimos' comment here. So I have changed the link to the original source.
p.s. The outline link in question for those who need it: https://outline.com/cnvAS8
Got it, thanks!
For anyone stopping by, feel free to label Offtopic, but because of you cfab I just discovered
Thanks!
*I assumed the gender because of the use of 'cockle' in the 1st intro thread. I'm hoping this doesn't backfire.
Heh... "warms the cockles of my heart" is an idiomatic expression essentially just meaning "makes me deeply happy". Cockles refers to the cochleae (ventricles/chambers) of the heart. It has nothing to do with penises or gender. :P
But don't worry about that backfiring, at least with me, since I genuinely don't care which gendered terms people refer to me as. Though for the record I am a cis male...ish (it's complicated).
I genuinely haven't heard that one before. Good to know!
Haha. After that smiley I presumed it would actually backfire. Then, for a second I really thought it was all said and settled with cis male. Aaand then the plane touched down in a weird middle place of a situation. What a rollercoaster! Thanks for that.
Gender can be weird like that sometimes. ;)
Thank you for sharing this article. It has all the substance lacking from the Telegraph piece.
I can't bypass the paywall, so... by what factor was it off? And... was Italy just a glitch in the Matrix?
https://outline.com/cnvAS8
The difference in results is not clearly defined in this piece; it's just described as varying "wildly", which is why I refer to the flaws as "alleged". I think that reduces the credibility of the argument, since the actual magnitude matters for accuracy here.
It really doesn't matter if there's numerical noise. There's so much uncertainty in this that the noise hardly matters by comparison. But it would be troubling if two researchers running the same model configuration couldn't get similar results.