Recent revelations that the peer review system in climatology might have been compromised by the biases of corrupt reviewers obscure a much bigger problem.
Most climatology papers submitted for peer review rely on large, complex, custom-written computer programs to produce their findings. The code for these programs is never provided to peer reviewers, and even if it were, the climatologists doing the reviewing lack the time, resources and expertise to verify that the software works as its creators claim.
Even if the peer reviewers in climatology are as honest and objective as humanly possible, they cannot honestly say that they have actually performed a peer review to the standards of fields like chemistry or physics, which use well-understood scientific hardware. (Other fields that rely heavily on custom-written software have the same problem.)
Too often these days, when people want to use a scientific study to bolster a political position, they utter the phrase “It was peer reviewed” like a magical spell to ward off any criticism of the paper’s findings.
Worse, the concept of “peer review” is increasingly being treated in the popular discourse as synonymous with “the findings were reproduced and proven beyond a shadow of a doubt.”
This is never what peer review was intended to accomplish. Peer review functions largely to catch trivial mistakes and to filter out the loons. It does not confirm or refute a paper’s findings. Indeed, many scientific frauds have passed easily through peer review because the scammers knew what information the reviewers needed to see.
Peer review is the process by which scientists knowledgeable in a paper’s field look over the paper and some of its supporting data and information to make sure the experimenters have made no obvious errors. The most common reason a paper fails peer review is that a reviewer believes the experimenters did not properly configure their instrumentation, did not follow the proper procedures or did not sufficiently document that they did so.
Effective peer review requires that the reviewers have a deep familiarity with the instruments, protocols and procedures used by the experimenters. A chemist familiar with gas chromatographs can tell from a description whether the instrument was properly calibrated, configured and used in a particular circumstance. A particle physicist who never uses gas chromatographs, on the other hand, could not verify that one was used properly.
Today, each instance of custom-written scientific software is like an unknown, novel piece of scientific hardware. Each piece of software might as well be an “amazing wozzlescope” for all the experience anyone outside its creators has with its accuracy and precision. No one can even tell whether it has subtly malfunctioned. As a result, the peer review of scientific software carries nothing like the level of external, objective scrutiny that the peer review of scientific hardware does.
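To make this concrete, consider a minimal, hypothetical sketch (a toy of my own devising, not code from any real climate program) of a subtle malfunction. The textbook one-pass variance formula is algebraically correct, but it loses precision when the data share a large common offset, as temperature records stored in kelvin do, and the loss is catastrophic at the single precision many scientific codes use for speed:

    import numpy as np

    def variance_naive(xs):
        # One-pass "sum of squares" formula: algebraically correct,
        # numerically fragile when values share a large common offset.
        n = xs.size
        return (np.sum(xs * xs) - np.sum(xs) ** 2 / n) / (n - 1)

    def variance_stable(xs):
        # Two-pass formula: subtract the mean first, then square.
        d = xs - np.mean(xs)
        return np.sum(d * d) / (xs.size - 1)

    # Five simulated station readings in kelvin: tiny anomalies on a
    # roughly 288 K base, stored in single precision.
    temps = np.array([288.16, 288.17, 288.18, 288.19, 288.20], dtype=np.float32)

    print(variance_naive(temps))   # wildly wrong; cancellation destroys the signal
    print(variance_stable(temps))  # about 2.5e-04, the correct answer

Both routines look reasonable on a line-by-line read. Only a comparison against a known good answer, or a reviewer who already knows this class of numerical bug, will catch the difference.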
How did we let this problem develop? I think it was simply a matter of creeping normalcy. The importance of scientific software grew so slowly that we never developed the habit of questioning it.
Thirty years ago, most scientific software was no more complicated than the average household spreadsheet is today. Software was mostly just a numerical summation tool that merely accelerated the processing of data. If a scientist had a computer, great, but if not, it didn’t change the actual conclusions of their experiments. Because software played such a trivial role, peer reviewers, journal editorial boards and other scientists paid little attention to the software that experimenters used to produce their results.
Unfortunately, that attitude has persisted even as software has grown from a minor accessory into the tool that actually performs the experiment. Today many papers are nothing but reports of what a unique piece of software spit out after processing this or that great glob of data. These software programs are so huge, so complex and so unique that no one who wasn’t directly involved in their creation could hope to understand them without months of study.
Just about everyone in business has had the experience of puzzling out a complex spreadsheet created by someone who left the company unexpectedly, trying to figure out what the %$#@! its creator was trying to do. Although we seldom think of them this way, each individual spreadsheet is in reality a custom piece of computer software written in the language of the spreadsheet program. (Technically, all spreadsheets are scripts.) Everybody knows that you can’t trust the output of a spreadsheet just because the person who made it tells you, “It’s done in Excel.” To trust the output, you either have to compare it against known good data or you have to examine the spreadsheet itself to find any places where it might go wrong.
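As a minimal sketch of the “compare it against known good data” approach (the routine and the numbers here are made up for illustration), a golden-value test pins a program’s output to an independently hand-checked result:

    def monthly_mean(readings):
        # The routine under test: average one month's daily readings.
        return sum(readings) / len(readings)

    def test_monthly_mean_against_hand_checked_value():
        # Toy data whose answer was worked out by hand: (2 + 4 + 6) / 3 = 4.
        january = [2.0, 4.0, 6.0]
        assert abs(monthly_mean(january) - 4.0) < 1e-9

    if __name__ == "__main__":
        test_monthly_mean_against_hand_checked_value()
        print("golden-value check passed")

Trivial as it is, a suite of such checks is exactly what most inherited spreadsheets, and most custom scientific software, never get.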
Custom-written scientific programs are much, much larger and much, much more complex than any spreadsheet. It would take a huge amount of time for a peer reviewer to go through the code line by line to see whether the software has any faults. Peer reviewers normally work in isolation and for only a token payment; they don’t have the time or resources to actually vet a complex piece of software. Further, there is no guarantee that a peer reviewer in a particular field is competent to judge software at all. That is like assuming that a biologist who understands everything about the Humboldt squid can also rebuild any automotive transmission.
The practical inability of peer reviewers to verify scientific software matters little in reality, because scientific institutions never even established a standard requiring experimenters to make the code for their software available to reviewers in the first place!
This raises a troubling question: when scientists tell the public that a study which used a large, custom-written piece of software has been “peer reviewed,” does that mean the study faced the same level of peer scrutiny as a study that used more traditional hardware instruments and procedures?
Scientists have let a massive flaw slowly creep into the scientific review system as they have ignored the gradually increasing significance and complexity of computer software. Standards created to deal with relatively simple and standardized scientific hardware no longer work to double-check much more complex and nonstandard scientific software.
Eric S. Raymond, the well-known open-source advocate and writer, has called for open source science, and I think that is the way we should go. In the past, it cost too much to print out all of a study’s data and records on paper and ship that paper all over the world. With the internet, we have no such limitations. Upon publication, every scientific study should put online all of its raw data, all of its protocols, all of its procedures, all of its records and the code for all of its custom-written software. There is no practical reason anymore why only a summary of a scientist’s work should be made public.
Scientific software has grown too large and complex to be maintained and verified by a handful of individuals. Only by marshaling a scientific “Army of Davids” can we hope to verify the accuracy and precision of the software we are increasingly using to make major public decisions.
In the short term, we need to aggressively challenge those who assert that studies that use complex custom software have been “peer reviewed” in any meaningful way. In the long term, we have a lot of scientific work to do over again.
See these two directly related posts as well:
Scientists Are Not Software Engineers — In which we learn that the critical software upon which we will reengineer the world is written and maintained by amateurs.
Scientific Peer Review is Lightweight — In which we learn not only that scientific peer review isn’t the inquisitorial process most lay people think it is, but that it isn’t even about science.