I am embarrassed to say that when reading the infamous Lancet study for my previous post, I was so struck by the idiocy of using cluster sampling and self-reporting in a population (the Sunni) with a strong motive to exaggerate that I just flat ignored the actual resulting statistics. Since I knew the methodology was crap, I knew the numbers were crap, and I didn’t look any further.
Commenter JohnChris and Fred Kaplan over at Slate (via Instapundit) both pointed out that the confidence interval on the study’s results, even with Fallujah excluded, is so broad as to be utterly useless.
Kaplan nails it, so I will excerpt a bit:
“Readers who are accustomed to perusing statistical documents know what the set of numbers in the parentheses means. For the other 99.9 percent of you, I’ll spell it out in plain English—which, disturbingly, the study never does. It means that the authors are 95 percent confident that the war-caused deaths totaled some number between 8,000 and 194,000. (The number cited in plain language—98,000—is roughly at the halfway point in this absurdly vast range.)
This isn’t an estimate. It’s a dart board.
Imagine reading a poll reporting that George W. Bush will win somewhere between 4 percent and 96 percent of the votes in this Tuesday’s election. You would say that this is a useless poll and that something must have gone terribly wrong with the sampling. The same is true of the Lancet article: It’s a useless study; something went terribly wrong with the sampling.”
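To see just how vague that range is, here is a quick back-of-the-envelope calculation in Python. It treats the published 95% bounds as a symmetric normal interval, which is a simplification of the paper’s actual model, and recovers the standard error those bounds imply:

```python
# Back-of-the-envelope check on the interval Kaplan describes: recover
# the implied standard error from the published 95% bounds (8,000 and
# 194,000), assuming a symmetric normal approximation. This is a
# simplification of the paper's actual model, used only to show scale.
low, high = 8_000, 194_000
midpoint = (low + high) / 2             # ~101,000; the paper reports 98,000
implied_se = (high - low) / (2 * 1.96)  # half-width divided by z = 1.96
relative_width = (high - low) / midpoint
print(f"implied standard error ~ {implied_se:,.0f}")
print(f"interval width ~ {relative_width:.0%} of the midpoint")
```

An interval whose width is nearly twice its own midpoint tells you almost nothing beyond “the number is probably positive.”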
Of course, we know what went wrong with the sampling. The study’s basic design was flawed for examining a phenomenon known a priori to be highly asymmetrical.
This raises the obvious question: How did such a seriously flawed study get published in a prestigious medical journal (the Lancet is the British equivalent of the New England Journal of Medicine)? The only possible explanation is political bias on the part of the authors, the peer reviewers, and the publisher.
Evidence for this comes from poster AMac, who observed:
“As an author of papers published in peer-reviewed journals, I was struck by the extraordinarily compressed time-line of this publication. Readers outside the biomedical fields might consider what the peer-review process involved:
1. Data were collected in September 2004, and the authors had completed compilation, statistical analysis, drafting of text, artwork, and proofreading in order to submit their work in the form of a for-publication draft manuscript (MS) to the Lancet Editor.
2. The Editor read the MS, chose peer-reviewers, had the reviewers comment on the MS, evaluated these comments, passed his/her favorable judgement on the MS to the authors, with any suggestions for necessary or advisable revisions.
3. The authors revised the MS and resubmitted it.
4. The Editor and perhaps the peer-reviewers reviewed and approved the revised text and figures. The MS files were sent to the Lancet’s copy editors for proofreading and digital typesetting. Author queries were generated and sent to the lead author, and the responses incorporated into the typeset version. Finally, the complete manuscript, ready for printing, was published on the Lancet’s website.
Four to eight weeks is an unusually short time for a high-impact journal such as the Lancet to bring such an article into print. I would doubt that Lancet, JAMA, Nature, BMJ, Science, or similar high-prestige journals have ever compressed their review and publication schedule in such a drastic manner.”
I’m not sure whether the article is in the current or upcoming hardcopy edition of the Lancet, but even so, publishing a study whose data collection was completed well under 60 days earlier smacks of a rush job.
What we have here is the scientific equivalent of medical malpractice. We have a group of researchers who claim to have followed standard research practices, only they didn’t. They claim to have found statistically valid results, only they didn’t. Then we have a scientific journal that claims to have followed standard practices of peer review before publication, only it is very clear they did not. I think that the funders of the study would have grounds to sue were this any other profession.
In its own way, this incident is just as important as the CBS Memogate scandal. Memogate revealed that in order to advance its political agenda, a major media source ignored basic common practice in vetting the documents at the heart of the story. In this case, we see scientists funded by one of the premier medical research institutions in the U.S. (Johns Hopkins) and one of the world’s best-regarded medical journals ignoring basic standards of practice in order to produce a result beneficial to their political biases. The failure of the institutions in both cases is glaring and requires the same sort of rigorous public review to ask what went wrong.
For example, by tradition, peer reviewers remain anonymous so that they can give their opinions without fear of professional bad feelings or institutional retribution. However, in this case I think it’s fair that the reviewers be asked to publicly justify their opinion. In fact, I think the failure is so egregious that we need to ask whether the Lancet actually submitted the paper to peer review at all, and if so, did they listen to any qualms the reviewers had? (Recall how CBS ignored its own document experts?)
Scientists carry great cachet in Western political and social debate precisely because they have traditionally been viewed as outside of politics. People believe that scientists and scientific institutions provide the best possible information to the public and politicians who then make decisions based on that information. Regrettably, it appears that an increasing number of scientists have fallen prey to the belief that one is first and foremost a political ideologue and only secondarily a member of a profession like scientist, judge or teacher. They believe that their primary obligation is to use whatever authority they have obtained by virtue of their professional standing to advance their political agenda.
This philosophy gave rise to the concept of judicial activism, in which judges ruled based on their belief in the best policy, not on their consensus interpretation of the law. This has led to the intense politicization of the judiciary and a near complete collapse in the public’s respect for their rulings.
If the same thing happens with our scientific institutions we are screwed. We need to ask serious questions right now.
20 thoughts on “Scientific Malpractice”
i’m 100% confident the number of dead in Iraq is between 1 and 24 million. so i guess that makes it 12 million…
Well, this is certainly not the first case of scientists exploiting their prestige for political purposes. A lot of the scientific community rallied behind the Great Socialist Experiment. The UCS is not a scientific organization, but it does contain a number of scientists, and its pro-disarmament posturing in the 80s was rather pathetic. More recently a court of law forced the Danish Committee on Scientific Dishonesty to recant its persecution of Bjorn Lomborg. (See, for example, Stichting Han and Martin Gerup.) The environmental movement is littered with models and scenarios whose only purpose is to make a political point.
Even the UN is beginning to realize that many of the statistical studies on Aids in Africa were perhaps not impartial.
Yep, we choose our heroes, and Scientific Materialism seems to be the poison of choice for these past hundred years. Like the journalists who are annoyed that they cannot set the political agenda, there are those who see the effects of scientists and scientism on philosophy and society and cannot resist the shock effect of extreme scenarios.
People with power will misuse it.
Matya no baka
I’m Just seeing this post now, right after I posted a long comment about confidence intervals on your blog entry from October 29th. Hope you have a chance to go back and check it out, thanks!
Thanks for your cogent input, but Kaplan et al.’s point still stands. The broader the confidence interval, the sloppier the data.
As another poster pointed out, the CI in the paper tells us that if we ran 20 identical surveys, statistically about 19 of the resulting confidence intervals would be expected to contain the true casualty figure while 1 would miss it; here that interval runs all the way from 8,000 to 194,000.
All you really need to know about the paper, though, is that their “conservative” estimate of 51,000 violent deaths is statistically extrapolated from 21 unconfirmed reports of actual violent deaths. The only way this has a chance of being true is if the violence is symmetrically distributed across the whole of the 24.4 million Iraqi population, but we know that this is absolutely not the case. Most Iraqis have seen no violence from the Coalition, and large stretches of the country, especially in the eastern regions, have seen no serious fighting at all.
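To make that leverage concrete, here is a rough sketch of the extrapolation in Python. The sample size used here is a hypothetical round number, not the paper’s exact figure; the point is simply how much weight each individual report carries when scaled to the national population:

```python
# Illustrative arithmetic only. The sample size (10,000 people) is a
# hypothetical round number, not the paper's exact figure; it is chosen
# to show the leverage each of the 21 reports has under a simple
# rate-times-population extrapolation.
reported_violent_deaths = 21
sample_size = 10_000            # hypothetical
population = 24_400_000
national_estimate = reported_violent_deaths / sample_size * population
leverage_per_report = national_estimate / reported_violent_deaths
print(f"national estimate ~ {national_estimate:,.0f}")
print(f"each single report adds ~ {leverage_per_report:,.0f} to the total")
```

Under these illustrative numbers, each unverified report of a death translates into roughly two and a half thousand deaths in the national estimate.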
The researchers tossed out their Fallujah findings because they were so obviously not representative, but 7 of their 33 clusters were in Baghdad, so if just one of those clusters had fallen in Sadr City or the like they would have had another outlier.
Violence isn’t random, especially in warfare. You can’t study it with statistical tools. The fact that one particular neighborhood or region gets hammered tells you nothing about the likelihood that a similar but distant region will see violence.
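A toy simulation makes the asymmetry point concrete. All numbers below are invented for illustration: violence is concentrated in a small fraction of “hot” neighborhoods, and the cluster-sample estimate swings wildly depending on whether a hot spot happens to land among the 33 sampled clusters:

```python
import random

# Toy simulation of the asymmetry argument. All numbers are invented:
# 1,000 neighborhoods, 2% of them "hot" (200 deaths each), the rest
# quiet (1 death each). We repeatedly draw 33 clusters, as the study
# did, and see how much the extrapolated total swings depending on
# whether a hot spot lands in the sample.
random.seed(1)
n_neighborhoods = 1_000
hot = set(random.sample(range(n_neighborhoods), 20))
deaths = [200 if i in hot else 1 for i in range(n_neighborhoods)]
true_total = sum(deaths)

estimates = []
for _ in range(1_000):
    sampled = random.sample(range(n_neighborhoods), 33)
    est = sum(deaths[i] for i in sampled) / 33 * n_neighborhoods
    estimates.append(est)

print(f"true total: {true_total}")
print(f"estimates range from {min(estimates):,.0f} to {max(estimates):,.0f}")
```

Roughly half the draws miss the hot spots entirely and come in far under the truth, while draws that catch one or two hot clusters overshoot it severalfold; this is exactly the outlier problem the Fallujah cluster posed.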
One thing a peer-reviewer can ask is, “Is there an independent way for the authors to support the conclusions that they have reached on the basis of their interpretation of their data?” In many good papers, Figure 1 will use one method to show a certain result. Then Figure 2 will use a different method to convince the reader of the same thing.
The more widely accepted a method or result is, the less necessary multiple approaches are. But the more ‘wild and woolly’ a claim is, the more valuable internally consistent verification becomes.
Plainly, Roberts et al. have no such mechanism for their extrapolation of circa two dozen recorded violent deaths to a nationwide invasion death toll. But they could have included one. They could have used their sample to estimate the toll of violent episodes in Iraq’s recent history that have some reliable range of casualty figures associated with them. Three such are:
–Military casualties of the Iran-Iraq war of the 1980s;
–The Ba’athist executions in suppressing the rebellions of the early 1990s;
–The elevated infant mortalities due to Ba’athist abuse of the Oil-for-Food program of the late 1990s.
IIRC, the first two have casualty figures in at least the high hundreds of thousands, and the third led to tens of thousands of excess infant deaths. These magnitudes are thus comparable to or greater than the civilian deaths caused by the invasion.
What confidence intervals would Roberts’ methods have returned for these episodes? Would they have been wide or narrow? Consistent with published casualty estimates, or grossly inflated? This information would have been very helpful in evaluating the application of the methods they use in the case under study here.
I can think of some possible practical reasons why Roberts et al. did not take this approach. I can also think of other, politically-motivated reasons for rejecting this more-careful approach. Were I a reviewer of this manuscript, it is one of the points I would certainly have required the authors to address.
Thanks for checking out my comment. I’m really new to the commenting on political blogs thing (mostly I just hang out at my friends’ personal blogs.) Who knew this could be so consuming!
It is true that cluster sampling works best when each cluster is a microcosm of the whole population you’re trying to capture (see http://www.brainyencyclopedia.com/encyclopedia/c/cl/cluster_sampling_1.html for a mini primer), but often this is not the reality. They used cluster sampling because conducting a door-to-door simple random sample just wasn’t feasible. Iraq is a big place and not so easy to get around, as we all know. What they did do was try to minimize the effects of outlier clusters (by reporting rates excluding Fallujah) and be very clear about where their clusters were and what they found. On page 7 they mention how their cluster in Sadr City oddly reported no deaths. The hope is that these issues will balance out somewhat. But no one, not even the authors, was claiming it’s perfect.
I agree with you that violence isn’t random. But I’d like to point out that their use of statistical tools was to try to show what role chance may have played. The authors are very clear that systematic error may be present.
Shannon, I’m not clear what point you’re trying to make here.
Are you claiming that the authors didn’t calculate the standard errors correctly, or that they didn’t include a design effects term? Because that would be untrue.
Are you, on the other hand, claiming that it would be perfectly normal to get a sample of this kind by chance if the Iraqi death rate had not increased substantially? Because that also would be untrue.
You might also note that one of the clusters *did* land in Sadr City. That cluster happened to have zero deaths in it. The authors did not strike it out as an outlier.
And furthermore, your claim that the Lancet report “claims to have followed standard research practices but didn’t” is a lie. They claimed to have carried out a cluster survey and did.
My post above (10/31/04 8:33pm) contains a significant error of fact. Roberts et al. did address infant mortality in their Lancet paper. My printer wasn’t working; I wrote, incorrectly, from memory. In fact, infant mortality is briefly described:
–pg. 3, bottom of col. 1
–bottom of Table 2
–pg. 4, bottom of col. 2
–pg. 6, middle of col. 1
The infant mortality figure that Roberts et al. produce is within a factor of three or less of that which can be supposed to be correct from other published sources. Shannon Love discusses this finding in a post subsequent to this one.
You wrote “Are you, on the other hand, claiming that it would be perfectly normal to get a sample of this kind by chance if the Iraqi death rate had not increased substantially? Because that also would be untrue.”
This is actually incorrect. Cluster sampling is notorious for producing samples that are not representative of the population as a whole when the clusters themselves are heterogeneous.
Skip, you have simply asserted by fiat that the Iraqi population is extremely heterogeneous. The data doesn’t support this claim. Once the Fallujah outlier is discarded, the rest of the mortality rates by cluster all fit neatly on the same scale. Ex-Fallujah, the standard deviation of the estimated postwar mortality rates is only about 25% of the mean. And every single governorate except one saw a rising death rate. None of this fits into the picture you’re drawing of a heavily heterogeneous sample.
“And every single governorate except one saw a rising death rate.”
Not strictly true. Due to their adjustment of the clusters, 6 of the governorates were not sampled at all.
By the way, if you want to understand why statistics won’t rescue you from poor clustering, read The Madness of Methods.
Of course you are right; however I maintain my point – the clusters as sampled do not look like a particularly heterogeneous sample.
By the way, am I to take your silence on the subject as an admission that “claims to have followed standard research practices but didn’t” was an untrue claim on your part?
>>”Skip, you have simply asserted by fiat that the Iraqi population is extremely heterogeneous. The data doesn’t support this claim. Once the Fallujah outlier is discarded, the rest of the mortality rates by cluster all fit neatly on the same scale. Ex-Fallujah, the standard deviation of the estimated postwar mortality rates is only about 25% of the mean. And every single governorate except one saw a rising death rate. None of this fits into the picture you’re drawing of a heavily heterogeneous sample.”
Just repeating my point from another thread. A standard deviation of 25% of the mean across clusters is actually very bad news. Ideally, the standard deviation across clusters should be zero, with all variation occurring within clusters. Each cluster should be as close as possible to a smaller version of the entire population.
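A small simulation, with purely illustrative parameters, shows why that between-cluster spread matters: when every cluster has the same mean, the cluster-sample estimate barely moves from trial to trial, but a spread of 25% of the mean across clusters produces substantial sampling variability in the estimate:

```python
import random
import statistics

# Sketch of why between-cluster spread matters, with purely
# illustrative parameters. We compare the sampling variability of a
# 33-cluster estimate when every cluster has the same mean rate (10)
# against clusters whose means vary with SD = 2.5, i.e. 25% of the mean.
random.seed(2)

def simulate(cluster_sd, n_clusters=200, n_sampled=33, trials=2_000):
    means = [max(0.0, random.gauss(10, cluster_sd)) for _ in range(n_clusters)]
    estimates = []
    for _ in range(trials):
        picked = random.sample(means, n_sampled)
        estimates.append(statistics.mean(picked))
    return statistics.stdev(estimates)

homogeneous = simulate(cluster_sd=0.0)    # every cluster a microcosm
heterogeneous = simulate(cluster_sd=2.5)  # 25%-of-mean spread
print(f"estimate SD with identical clusters: {homogeneous:.3f}")
print(f"estimate SD with 25%-of-mean spread: {heterogeneous:.3f}")
```

The first case is the ideal described above: with no between-cluster variation, any 33 clusters give essentially the same answer; the second shows how between-cluster spread feeds directly into the uncertainty of the final estimate.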
Ideally, the standard deviation across clusters should be zero, with all variation occurring within clusters.
Which is where we came in; any departure from the Platonic Form of the Epidemiological Study can be tricked up into sinister conspiracies to bias results by a suitably unprincipled “sound science” hack.
Skip, this is what the “design effects” included in the survey’s confidence intervals are there for. Do you have any specific reason why this methodology is not appropriate to this dataset, or are you just trying to claim that the researchers didn’t address cluster sampling issues when they did?
Furthermore, the effect that we are looking at here is a cohort study. While the cross-sectional variance is affected by the clustering, it is hard to see how the in-group variation of each individual cluster could be. For any of the clusters, it is very hard to argue that the post-invasion death rate is a draw from a process with the same expectation as the pre-invasion death rate.
All replies are in the original thread on the subject.
Comments are closed.