Lancet Letters Part II

Courtesy of Amac come letters just published in the Lancet concerning the Iraqi Mortality Survey.

Stephen Apfelroth, Department of Pathology, Albert Einstein College of Medicine writes:


In their Article on mortality before and after the 2003 invasion of Iraq (Nov 20, p 1857) [1], Les Roberts and colleagues use several questionable sampling techniques that should have been more thoroughly examined before publication.

Although sampling of 988 households randomly selected from a list of all households in a country would be routinely acceptable for a survey, this was far from the method actually used – a point basically lost in the news releases such a report inevitably engenders. The survey actually only included 33 randomised selections, with 30 households interviewed surrounding each selected cluster point. Again, this technique would be adequate for rough estimates of variables expected to be fairly homogeneous within a geographic region, such as political opinion or even natural mortality, but it is wholly inadequate for variables (such as violent death) that can be expected to show extreme local variation within each geographic region. In such a situation, multiple random sample points are required within each geographic region, not one per 739 000 individuals.

So cluster sampling is inadequate for sampling heterogeneous phenomena! Wish I had thought to point that out. Oh, wait, I did, five months ago.

“In my opinion, such a flaw by itself is fatal, and should have precluded publication in a peer-reviewed journal.”

Glad I’m not the only nut job out here.

The rest of the letter concerns the details of sampling each individual cluster. I think he makes some minor points about the sampling technique itself, but I think those are largely peripheral issues unlikely to affect the major outlines of the study. I don’t think that Apfelroth has realized that the inclusion or exclusion of the Falluja cluster swamps every other effect in the study save the cluster methodology itself.

Les Roberts replies:

“The ability of a 33-neighbourhood sample to portray adequately the mortality experience of an entire country has seemed problematic to many critics of our study in Iraq. However, most mortality surveys in war zones only contain 30 clusters, as do the recommended approaches of United Nations agencies and the US Agency for International Development. [1]”

Except no one has ever attempted to conduct such a study under wartime conditions. Moreover, except for the Kosovo attempt, nobody has tried to use cluster sampling to measure deaths by violence. The results of the Kosovo study have not been replicated by other means, so we have no way of assessing whether it succeeded or not. The general consensus right now seems to be that it did not (see below).

“Ample evidence suggests that 30 locations are reasonably adequate for measuring the level of malnutrition or immunisation coverage in an entire country. [2]”

If violence had the same distribution as malnutrition or immunization, we would have no disagreement. Malnutrition and immunization don’t clump together like incidences of violence do. I think this actually reveals Roberts’s main weakness as a researcher in this case. I don’t think he actually knows anything about warfare. He really seems to believe that the patterns produced by the conscious acts of violence will be close to the patterns produced by malnutrition, disease or accident.

“Unfortunately, as Stephen Apfelroth rightly points out, our study and a similar one in Kosovo [3] suggest that in settings where most deaths are from bombing-type events, the standard 30-cluster approach might not produce a high level of precision in the death toll.”

Thank you, although he is avoiding Apfelroth’s main criticism, i.e., that cluster sampling fails with highly heterogeneous (unevenly distributed) phenomena. He implies that the problem is restricted purely to bombing. In fact, cluster sampling will fail for any phenomenon with a highly heterogeneous distribution. Choosing to study the incidence of violence, which he knew beforehand to be highly geographically concentrated, using cluster sampling was bad, bad design. It was so bad that it might invalidate the entire study.
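
What is at stake in that claim can be made concrete with a toy simulation. The sketch below is my own illustration with entirely made-up numbers, not the study’s data or method: it compares simple random sampling against 33-cluster sampling of a geographically clumped phenomenon.

```python
# Toy comparison (made-up data): simple random sampling vs. 33-cluster
# sampling of a clumped phenomenon. Standard library only.
import random
import statistics

random.seed(1)

N_BLOCKS, PER_BLOCK = 1000, 30          # 30,000 hypothetical households
deaths = [[0] * PER_BLOCK for _ in range(N_BLOCKS)]
for b in random.sample(range(N_BLOCKS), 20):     # 20 "hot" blocks hold the violence
    deaths[b] = [random.randint(1, 3) for _ in range(PER_BLOCK)]

flat = [d for block in deaths for d in block]
true_total = sum(flat)

def srs_estimate(n=990):
    """Scale up a simple random sample of n individual households."""
    return sum(random.sample(flat, n)) / n * len(flat)

def cluster_estimate(k=33):
    """Scale up k whole blocks of 30 contiguous households (990 households)."""
    picked = random.sample(deaths, k)
    return sum(map(sum, picked)) / (k * PER_BLOCK) * len(flat)

srs = [srs_estimate() for _ in range(2000)]
clus = [cluster_estimate() for _ in range(2000)]
print("true total:", true_total)
print("SRS:     mean %7.0f  sd %6.0f" % (statistics.mean(srs), statistics.stdev(srs)))
print("cluster: mean %7.0f  sd %6.0f" % (statistics.mean(clus), statistics.stdev(clus)))
```

On numbers like these both designs are unbiased on average, but the cluster estimates swing far more widely from run to run, which is the disagreement in miniature: whether that extra noise merely widens the confidence interval or invalidates the design.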

“But the key public-health findings of this study are robust despite this imprecision.”

How can the key findings, all of which refer to deaths from violence, be considered robust if the sampling method is highly imprecise? Precision is important not only in determining the scale of the problem but in determining the causes of death and the profiles of the victims.

“These findings include: a higher death rate after the invasion;”

Which came as an utter shock to everyone, I am sure. I am going out on a limb here to say that when armies fight, the death rate always goes up.

“a 58-fold increase in death from violence, making it the main cause of death; and most violent deaths being caused by air-strikes from Coalition Forces.”

Only true if the Falluja outlier cluster is included. It’s nice to see Roberts up to his old tricks. Check out the next line.

“Whether the true death toll is 90 000 or 150 000…”

Or maybe 300,000 or higher! Golly, Roberts, why did you settle on those low numbers? Your actual study shows the death toll is much, much higher. Why downplay it? (I also like the way he makes 90,000 the implied floor for the estimate rather than a value near the mainline.) Got to love that precision. It doesn’t matter if the actual number is 90,000 or 150,000. What’s 60,000 dead people, give or take? What is important is who to blame.

“these three findings give ample guidance towards understanding what must happen to reduce civilian deaths.”

Putting aside the fact that the three findings are wholly dependent on the results from a single cluster and that the combined findings from the other 32 clusters are radically different, what exactly in the study could actually be used to reduce civilian deaths? The study just assigns blame for the deaths and then stops. How does that help in any practical way?

Let me recap this paragraph:

Roberts admits that cluster sampling does not measure violence well (at least in regard to airstrikes). Nevertheless, he then asserts that his results are “robust” even though they are grounded in an imprecise sampling methodology. He continues his tactic of switching back and forth between findings that include the Falluja cluster and those that do not, without any indication that he is doing so.

“As we noted in the paper, by storing the randomly picked point in a global positioning system (GPS) unit and visiting the nearest 30 households as defined by the GPS, there was little subjectivity in the choice of households.”

I think Roberts is on firm ground in this case. Unlike Apfelroth, I don’t have a problem with this part of the methodology. I do think Roberts should have pointed out that all his main conclusions are in fact based on one single cluster in Falluja, defined using a different method and collected under highly adverse conditions [p6 pg6-7].

“We also stated in the paper that people had to have been sleeping under the same roof with a family for 2 months before their death to be considered a household death. This strict definition of household member may have prevented the recording of some deaths, particularly among former military members who did not live with any household in the weeks before their death, but it ensured that the type of overestimation that concerns Apfelroth did not occur.”

This two-month window means that men in the military and those living transiently, like insurgents and jihadists, would never be counted in the survey [p6 pg2]. It would skew the age and gender of the victims of violence away from adult males and toward women, children and the elderly. It might also skew the cause of death away from small arms and toward airstrikes. Basically, this attribute of the study means it was designed to measure the deaths of individuals that occurred in or near their own homes. It is therefore hardly surprising that the study shows a large percentage of non-combatant deaths.

“Before publication, the article was critically reviewed by many leading authorities in statistics and public health and their suggestions were incorporated into the paper.”

Trust us, we have many unnamed authority figures backing us up. Honest, some even have lab coats and clipboards!

“The death toll estimated by our study is indeed imprecise…”

Nice of them to admit it. Funny, is it not, that this imprecision doesn’t actually affect the overall quality of the study? If this were a study of disease or drug safety, the question of whether 30,000, 90,000, 150,000 or 300,000+ people died would be considered rather relevant. (And just to repeat myself, the precision also affects the measurements of who died and how.)

“and those interested in international law and historical records should not be content with our study”

Neither should those interested in scientific integrity.

“In the interim, we feel this study, as well as the only other published sample survey we know of on the subject [5], point to violence from the Coalition Forces as the main cause of death and remind us that the number of Iraqi deaths is certainly many times higher than reported by passive surveillance methods or in press accounts.”

“We declare that we have no conflict of interest.”

Beyond the political, that is.

To sum up: Apfelroth asserts that cluster sampling was a poor methodology for studying a heterogeneous (unevenly clumped) phenomenon like military violence (making the same point I made in my original critique of the study). It was such a poor choice as to fatally undermine the entire study. Roberts never refutes this point but merely pays it lip service, and plows on, making statements based wholly on the Falluja cluster data.

I am more convinced than ever that Roberts is not being honest.

59 thoughts on “Lancet Letters Part II”

  1. Shannon,
    would it be relatively easy to simulate the study’s cluster sampling method using fictitious data in an Excel spreadsheet? For a non-statistician like me it would be illuminating to run the simulation on (1) smooth data and (2) heterogeneous data, multiple times, to get a feel for how inaccurate the sampling method can be.
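
    For what it’s worth, here is roughly that simulation in Python rather than Excel (all data fictitious; a sketch of the idea, not of the study’s actual procedure). The smooth and clumped worlds contain the same total number of deaths; only the geography differs:

    ```python
    # Run the same 33-cluster draw many times against a smooth death
    # distribution and a clumped one with an equal overall death count.
    import random
    import statistics

    random.seed(2)
    BLOCKS, PER = 1000, 30   # 30,000 hypothetical households in blocks of 30

    smooth = [[1 if random.random() < 0.03 else 0 for _ in range(PER)]
              for _ in range(BLOCKS)]
    clumped = [[0] * PER for _ in range(BLOCKS)]
    for b in random.sample(range(BLOCKS), 30):   # all deaths packed into 30 blocks
        clumped[b] = [1] * PER

    def estimate(world, k=33):
        """Scale a k-cluster sample up to the whole population."""
        picked = random.sample(world, k)
        return sum(map(sum, picked)) / (k * PER) * (BLOCKS * PER)

    for name, world in (("smooth ", smooth), ("clumped", clumped)):
        runs = [estimate(world) for _ in range(1000)]
        print(name, "true =", sum(map(sum, world)),
              " estimate mean = %.0f, sd = %.0f"
              % (statistics.mean(runs), statistics.stdev(runs)))
    ```

    Repeated runs against the smooth world cluster tightly around the true count; against the clumped world the same 33-cluster draw bounces over a far wider range.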

  2. I noticed the semantic shift between “violent death” in the Apfelroth letter and “immunization” in the authors’ response, too. The reply finesses the issue to the point of being unresponsive to Apfelroth’s argument.

    “I am going out on a limb here to say that when armies fight, the death rate always goes up.”

    Shannon, you keep this shit up, you’re gonna get known as an expert. Then you can be in headlines:

    “Experts: rain events correlate with runoff.”
    “Accident rates increase during storms, say experts.”
    “New study reveals academics are liberals, say sociologists.”

    It would be so much easier for me to preserve and manifest a due reverence for academics, if everything they write didn’t evoke a single quotation: “It needs no ghost come from the grave, my lord, to tell us this.”

  4. Straight question, requesting straight answer, Shannon:

    Is it more likely that cluster sampling would underestimate a heterogeneous death rate, or overestimate it?

    An answer in the form of “Underestimate” or “Overestimate” would be appreciated.

  5. And while my most important question is the one above, so I don’t want to muddy the waters, it wouldn’t be a Shannon Love post if it didn’t have at least one major error requiring correction:

    Except no one has ever attempted to conduct such a study under wartime conditions. Moreover, except for the Kosovo attempt nobody has tried to use cluster sampling to measure deaths by violence

    Les Roberts helpfully provides a list of citations to his letter, including a paper he cowrote called “Mortality in eastern Democratic Republic of Congo: results from eleven mortality surveys”. Shannon might not know this, but the eastern Democratic Republic of Congo experienced wartime conditions during the period when Roberts was sampling there.

    There is a clue as to the number of surveys carried out in DRC, given in the title. All of them used cluster sampling. Therefore it is not true to say that “except for the Kosovo attempt nobody has tried to use cluster sampling to measure deaths by violence”. Correction, please.

  6. (in which context, the claim that “I think this actually reveals Roberts’s main weakness as a researcher in this case. I don’t think he actually knows anything about warfare. He really seems to believe that the patterns produced by the conscious acts of violence will be close to the patterns produced by malnutrition, disease or accident. ” is pretty laughable)

  7. Oh what the hell, might as well go through the card … I’ve been boning up on cluster sampling recently, so why shouldn’t you lot suffer too?

    While Shannon asserts that “Malnutrition and immunization don’t clump together like incidences of violence do.”, there is very little actual evidence from the data that this is the case, once the Fallujah cluster is thrown out. We can take the reported design effect as a rough proxy for the degree of heterogeneity in the data. In the prewar death rates, this design effect was 0.81, which would not be consistent with a lot of between-cluster variance. In the post war death rates, the design effect was 29, which is obviously huge. However, it appears that the bulk of this between-cluster variation was accounted for by the single Fallujah cluster; discarding that, the design effect was 2.0, which is (a quick google search suggests) entirely in line with the design effects found in immunisation cluster surveys.

    Of course, the design effect isn’t really a measure of heterogeneity (although it is heavily dependent on between-cluster variance). So it probably makes more sense to do this the educated layman’s way, by looking at the bar charts on page 3. You can see that the death rate falls in Sulaymaniya governorate, but rises in every other governorate surveyed. Does this really look like heterogeneous data?

    So, if one looks at the actual data rather than making dogmatic assertions about what one believes about violence, immunisation, malnutrition and disease (btw, as far as I can tell, malnutrition does not cluster, but epidemic disease often does), then one ends up concluding that the “heterogeneity” which Shannon is trying to make so much of consists of the fact that, as well as the baseline increase in death rates, there were areas in Iraq like Fallujah (Ramadi, Najaf, Samarra) which saw a lot more violence and a greater increase in the death rate. This justifies the description of the 100k excess deaths figure as “conservative”.
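
    For the curious, the design effect dsquared describes can be computed from cluster-level counts in a few lines. This is my own rough moment-estimator sketch, not the study’s EpiInfo calculation, and the counts below are fictitious, shaped like the ex-Fallujah violent-death profile discussed later in this thread:

    ```python
    # Rough design effect for a proportion estimated from equal-sized clusters:
    # variance of the cluster-based estimate divided by the variance a simple
    # random sample of the same total size would have had.
    import statistics

    def design_effect(counts, cluster_size):
        k = len(counts)
        n = k * cluster_size
        p = sum(counts) / n
        rates = [c / cluster_size for c in counts]     # per-cluster rates
        var_cluster = statistics.variance(rates) / k   # variance of their mean
        var_srs = p * (1 - p) / n                      # binomial variance under SRS
        return var_cluster / var_srs

    counts = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1] + [0] * 21  # fictitious: 21 deaths, 32 clusters
    print("deff ~ %.1f" % design_effect(counts, 30))
    ```

    On these made-up counts it comes out around 1.9, in the same neighbourhood as the 2.0 quoted above for the ex-Fallujah data.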

  8. dsquared notes that Mr. Love said:

    I think this actually reveals Roberts’s main weakness as a researcher in this case. I don’t think he actually knows anything about warfare.

    Mr. Love’s ignorant comments would only be “pretty laughable”, as dsquared describes them, if they were not so despicable.

    Not only was Dr. Roberts in the Congo during the civil war performing research for the IRC, but he was also in Rwanda working for WHO in 1994. (I’ll leave it as a homework assignment for Mr. Love to determine what was happening in Rwanda in 1994.)

    Testifying before the US House Subcommittee on International Operations and Human Rights in 2001, Dr. Roberts noted in his dry manner that he has “worked in only seven wars”. [emphasis added]

    The National Press Club lists several places where Dr. Roberts has been involved in relief efforts:

    Malawi and Zimbabwe, 1992–93
    Bosnia, 1993
    Armenia, 1993
    Rwanda & Goma, Zaire, 1994
    Tajikistan, 1997
    Northern Uganda, 1999
    DRC, 1999–2002
    Sierra Leone, 2001
    Burundi, 1999–2002

    For Mr. Love to question Dr. Roberts’s credentials and experience in this matter is beyond the pale, though, given his track record, not surprising.

    And one more thing: Mr. Love mocks Dr. Roberts for not revealing the names of the reviewers of his Lancet paper. If Mr. Love had any experience reading peer-reviewed publications — much less successfully published or been selected as a reviewer himself — he would know that most journals use *blind* peer review (i.e., the author doesn’t know the identity of the reviewers).

  9. Oh! Here’s a few more:

    Kosovo and Rwanda/Zaire (note: it also lists a study in Cambodia, but AFAICT that one did not use cluster sampling)

    Sierra Leone (Medecins Sans Frontieres, surveys mental trauma rather than death rates, but is still a survey of the effects of violence in a war zone)

    War-related sexual violence in Sierra Leone (uses a combination of systematic random sampling and cluster sampling)

    Bosnia

    WHO Guidance for Surveillance of Injuries due to Landmines

    Colombia

    I have not, obviously, read all of these and do not endorse any of them. A number of them appear to have been carried out by the same firm (Greenberg Research, operating under a contract from the ICRC), and the precise cluster sampling methodology is unlikely to be exactly the same as that used in the Lancet survey in any of them. However, I think we can now put to bed the assertion that this was only the second time that cluster sampling had ever been used either “in a war zone” or “to detect the effects of violence”.

  10. Dsquared — why on earth would you toss out an outlier if you’re measuring spread in a population? Particularly in order to argue that the population is so homogenous you don’t need more clusters?

  11. Dsquared — why on earth would you toss out an outlier if you’re measuring spread in a population? Particularly in order to argue that the population is so homogenous you don’t need more clusters?

    If you believed that your sample was coming from two populations rather than one (a baseline “most of Iraq” population plus an “extremely high violence areas” population), then you could justify doing this, at the cost of having to admit that your results would only describe the comparatively low-violence population and that you *do* need more clusters to get a decent estimate of what really happened in Fallujah (Najaf, Ramadi, Samarra, etc.). It would bias your estimates of the true excess deaths figure downward, as the study notes.

    I certainly wouldn’t want to stake anything important on the design effect numbers as measures of homogeneity, because that’s a hopelessly unrigorous argument (and the fact I used numbers doesn’t make it any more rigorous). I think my stronger argument is the visual one; you can see from the graphic that all the death rates except Fallujah can be plotted on the same vertical scale, so it’s not obvious that heterogeneity (which would of course tend to bias the estimate DOWN, a fact which Shannon always seems to forget to mention) is a big problem here.

  12. dsquared,

    “Is it more likely that cluster sampling would underestimate a heterogeneous death rate, or overestimate it?”

    Let me rephrase the question for greater accuracy.

    “Absent all other considerations, would any particular study using cluster sampling be more likely to underestimate a heterogeneous death rate, or overestimate it?”

    Underestimate. If all you knew about a study is that (1) it is a cluster-study and (2) the phenomenon has a highly heterogeneous distribution, then you can assume, based on mathematics alone, that any particular study most likely returned a severe underestimate.

    But here’s the rub. Cluster-sampling under those conditions will return either an overestimation or an underestimation. It will be mathematically impossible for it to return a number like that produced by a random sample (even by sheer chance). After some degree of heterogeneity, you will always get a number back that is so far off, either way, that you will have to pronounce any study based on cluster-sampling fatally flawed from the design stage.

    So, looking at the mainline estimate of 300,000 excess deaths returned by the Lancet study, do we assume that the number represents a potentially fatal overestimation or do we assume it represents a potentially fatal underestimation? It has to be one or the other.

    Mathematically, we should assume that the number represents a significant underestimation and that the actual death toll is most likely hundreds of thousands higher. We must also assume that (1) the majority of victims are women and children, that (2) they died predominantly from Coalition helicopter strikes, that (3) the deaths were geographically concentrated, that (4) fewer than 1 in 10 of the civilian deaths were recorded by any media or authority, and that (5) no incidents have been recorded wherein several hundred women and children were killed in a single incident.

    Since this result is strongly at variance with both on-the-ground observation and historical results from other conflicts, we can make a common sense assessment that this particular instance of the study returned a severely high overestimate. Given that, we can conclude that the study design was fatally flawed and that none of its results are trustworthy.

    But I think the design of the study presents an even greater flaw than the obvious one. Let me issue a challenge in return:

    Given the actual conditions on the ground and the known history of Iraq over the period covered by the study, was a cluster study more likely to overestimate or underestimate deaths caused by Coalition action?
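
    The direction question can also be checked by brute force (again with invented numbers): with a clumped phenomenon, the distribution of cluster estimates is strongly right-skewed, so the typical run undershoots the truth while the occasional run overshoots badly; something like both halves of this dispute in one picture:

    ```python
    # Toy check of the under/over question: 10 very violent blocks out of 1,000.
    import random
    import statistics

    random.seed(3)
    BLOCKS, PER = 1000, 30
    world = [[0] * PER for _ in range(BLOCKS)]
    for b in random.sample(range(BLOCKS), 10):
        world[b] = [random.randint(2, 5) for _ in range(PER)]
    true_total = sum(map(sum, world))

    def estimate(k=33):
        picked = random.sample(world, k)
        return sum(map(sum, picked)) / (k * PER) * (BLOCKS * PER)

    runs = sorted(estimate() for _ in range(5000))
    under = sum(r < true_total for r in runs) / len(runs)
    print("true total:          ", true_total)
    print("median estimate:      %.0f" % statistics.median(runs))
    print("mean of estimates:    %.0f" % statistics.mean(runs))
    print("runs that undershoot: %.0f%%" % (100 * under))
    ```

    On this toy world the median run returns zero (a severe undershoot), yet the long-run average of the estimates still sits near the true total: most draws miss the hot spots entirely, and the few that hit one overshoot.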

  13. but the eastern Democratic Republic of Congo experienced wartime conditions during the period when Roberts was sampling there.

    And only a small fraction of the deaths recorded in this survey were violent deaths. I would expect epidemiological survey methods to be far more relevant to Congo than to Iraq.

    Cluster-sampling under those conditions will return either an overestimation or an underestimation. It will be mathematically impossible for it to return a number like that produced by a random sample (even by sheer chance).

    Excuse me if I’m being stupid, but I don’t see why this is the case. The simple cluster sampling example given by Brad DeLong gives a correct measure 37% of the time.

    It will be mathematically impossible for it to return a number like that produced by a random sample (even by sheer chance). After some degree of heterogeneity, you will always get a number back that is so far off, either way, that you will have to pronounce any study based on cluster-sampling fatally flawed from the design stage

    Not true (the fact that you have claimed “even by sheer chance” ought to have given you a clue that this was wrong)

    Brad DeLong gives a good pictorial explanation. Note that the peak of the likelihood function still corresponds to the true value. Btw, you still haven’t made that correction I asked for.

    So, looking at the mainline estimate of 300,000 excess deaths returned by the Lancet study, do we assume that the number represents a potentially fatal overestimation or do we assume it represents a potentially fatal underestimation? It has to be one or the other.

    No it doesn’t; see above.

    Mathematically, we should assume that the number represents a significant underestimation and that the actual death toll is most likely hundreds of thousands higher.

    This is obviously not true. You seem to be pretending that the only information the survey would ever give would be the mean, which is literally logically impossible. A glance at the data collected (like the one that the team actually took) would reveal that there were 32 clusters dispersed roughly as one would expect, and one very large positive outlier. Therefore you would conclude that the 300,000 estimate might be biased upward by this outlier and throw it out. Having done so, you would then be entitled to assume that the remaining clusters did not form an overestimate because a) this would be a priori unlikely and even more so after throwing out the largest observation and b) the remaining data were not heterogeneous.

    we can make a common sense assessment that this particular instance of the study returned a severely high overestimate. Given that, we can conclude that the study design was fatally flawed and that none of its results are trustworthy.

    This is clearly not true, and certainly not implied by the mathematics. Sampling theory does not tell you that the presence of one outlier means that you have to throw away the whole dataset. Sampling theory actually predicts that there is a risk of very large oversamples, as Brad explains in the linked post above. If you have reason to believe that you have a bad cluster in the survey, then you throw it out and use the remaining ones.

    Given the actual conditions on the ground and the known history of Iraq over the period covered by the study, was a cluster study more likely to overestimate or underestimate deaths caused by Coalition action?

    Still underestimate. There was some reason to believe that the deaths would be concentrated in places like Fallujah, Najaf, Samarra, Ramadi etc and thus undersampled. This is why the team have consistently suggested that the 100k figure is conservative.

    Is this the defence of the “cluster sampling critique” that I’ve been waiting for for so long, by the way?

  16. dsquared,

    It is very simple to construct a cluster sample that returns one of two values but no values in between.

    Imagine a town of 100 households divided into ten streets of ten houses each (a 10×10 grid). Suppose the counted incidence of the phenomenon you wish to measure is 1 house in 10. Suppose you decide to study the phenomenon by defining each street as a cluster and then interviewing every house on that street.

    If the actual distribution of the phenomenon is perfectly symmetrical with regard to the clusters, with exactly one instance per street, then the cluster sample will return an incidence of 1/10th every time.

    However, if the distribution is perfectly asymmetrical with respect to the clusters, with all ten instances occurring on one street and the other nine streets having zero occurrences, then the cluster sample will always return one of two values, 100% or zero, and never anything in between.

    You can get the same problem with rare instances. Suppose only 1 house out of the entire 100 exhibited an instance. In this case, the cluster would always return an incidence of 1/10th or zero. (Clusters see rarity as heterogeneity.)

    Clusters create a kind of quantum effect in sampling, making certain outcomes impossible; the sketch at the end of this comment enumerates the grid example. Granted, in most real-world circumstances the chance that cluster sampling would reveal the same value as a random sample of a highly heterogeneous distribution would be described as “vanishingly small” instead of “impossible”, but mathematically, “impossible” is possible with cluster sampling.

    What you have repeatedly failed to realize about the Falluja cluster is that it is an outlier because it is a group of 30 physically contiguous houses. The same number of houses sampled randomly would not have produced such an extreme value. The chance that the study would produce such a cluster was easily predictable, but no criteria were established beforehand to define when a cluster was or was not to be considered an outlier. The exclusion of Falluja is therefore arbitrary, done on no other basis than “it looked funny.” That is bad design.

    Having tossed out the Falluja cluster as an outlier, you then want to treat all the remaining clusters as underestimates even though one or more of them are probably outliers as well.

    Why not exclude the cluster with the next highest death rate as well? The study is designed to be valid if up to 3 clusters are removed. The next two highest death rates from violence are 4 and 3 deaths respectively, so the clusters must run something like 4, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Why shouldn’t we consider the cluster with 4 deaths an outlier? After all, it is four times as large as 70% of the remaining clusters.

    As I have said before, I have no problem with excluding the Falluja cluster if it is excluded cleanly and honestly. As you can read in Roberts’s remarks above, he is not doing so. He continues to pick and choose unethically.
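
    The grid example enumerates neatly, for anyone who wants to see it run (a direct transcription of the example above, nothing more):

    ```python
    # The 10x10 town: one street = one cluster of ten houses, ten "hits" in town.
    symmetric  = [[1] + [0] * 9 for _ in range(10)]         # one hit per street
    asymmetric = [[1] * 10] + [[0] * 10 for _ in range(9)]  # all ten hits on one street

    for name, town in (("symmetric ", symmetric), ("asymmetric", asymmetric)):
        # Every possible one-street sample and the incidence it would report:
        outcomes = sorted({sum(street) / 10 for street in town})
        print(name, "-> possible estimated incidences:", outcomes)
    # symmetric  -> [0.1]       (always exactly the true 10%)
    # asymmetric -> [0.0, 1.0]  (never the true 10%)
    ```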

  17. It is very simple to construct a cluster sample that returns one of two values but no values in between.

    That is not what you said, and your cooked-up example isn’t as relevant as you think it is. As far as I can tell (I’m having trouble following your example, so tell me if I’ve got it wrong here), the problem here is not so much “cluster sampling” as the fact that your sample size is one cluster. I think that the problem here is that you’ve been working with small spreadsheet examples and thus come up with results that are only true for very small populations. If we multiply your example by 30 (3000 houses in 300 streets, 300 houses in 30 streets have green doors, sample 30 streets), then you have every chance of getting 3 streets with green doors in your sample and thus making the right point estimate; I haven’t done the calculation but I would suspect that it’s the most likely single answer. As the population scales, the quantising effect gets smaller until, for populations large enough to be of interest, it disappears. You still have the problem of a big standard error if you don’t have many clusters (you wouldn’t want to do a Presidential opinion poll with only 30 clusters), but that’s a different problem, and after all, the study did report a wide standard error.

    This looks like an honest mistake (I think I’ve made a similar error of generalisation from a small model myself in the past – find it on the internet if you can!), and I now withdraw and apologise for some of the harsher things I’ve said about you because I can see you’ve been thinking about it. But there is a reason why 30 clusters is usually taken as the default size for a cluster survey no matter what the size of the population; it’s at about that level that diminishing returns start to set in. It is the small population (not so much the small sample size relative to the population) that creates this quantising effect, not cluster sampling.

    no criteria were established beforehand to define when a cluster was or was not to be considered an outlier

    No formal criterion, but the Falluja cluster would have failed more or less any test that one could think of to test whether a particular observation was a member of the same population as another sample. There is no need to establish criteria for “wild observations” beforehand; have another look at Kruskal’s paper.

    Why not exclude the cluster with the next highest death rate as well? The study is designed to be valid if up to 3 clusters are removed. The next two highest death rates from violence are 4 and 3 deaths respectively, so the clusters must run something like 4, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Why shouldn’t we consider the cluster with 4 deaths an outlier? After all, it is four times as large as 70% of the remaining clusters.

    Because it’s much less of an outlier. I just put together an Excel spreadsheet with a 4, two 3s, three 2s, five 1s and 21 zeroes, to give us that kind of profile of 21 deaths over 32 clusters. The mean is 0.65625 and the sd is 1.0957, meaning that the 4 observation is just about 3.05 standard deviations away from the mean. That’s not really all that far.
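
    The spreadsheet arithmetic checks out; in Python, for anyone following along:

    ```python
    import statistics

    clusters = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1] + [0] * 21  # 21 deaths over 32 clusters
    m = statistics.mean(clusters)    # 0.65625
    s = statistics.stdev(clusters)   # 1.0957 (sample standard deviation)
    print("mean %.5f  sd %.4f  z-score of the 4: %.2f" % (m, s, (4 - m) / s))  # z = 3.05
    ```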

  18. dsquared,

    Thanks for the kind words. I was struggling to come up with examples that people with little statistical background could follow. I used the simplified 10×10 grid because I thought people could visualize it more easily. I also wanted something that didn’t require any understanding of deviance.

    The examples, though extreme, do demonstrate fundamental attributes of cluster-sampling. The basic problem is that as heterogeneity increases, so does the variance in results. The results begin to pop back and forth between extreme values as heterogeneity increases, until eventually cluster-sampling just produces nonsense for a given distribution.

    Criticism of using cluster-sampling to study violence goes back several years. It wasn’t something that I invented.

    The Falluja cluster is an outlier not because Falluja was unusually violent but because it was sampled as a cluster. Violence is heterogeneous, especially organized military violence. The odds that a particular house will suffer a military attack go up enormously if the house next door is also attacked. By sampling contiguous houses, the study virtually guaranteed it would hit a run of houses all of which experienced violence in the war. The rules of cluster-sampling would then amplify the effect to produce an unrealistic number of deaths. A result like the Falluja cluster was a predictable aspect of the study’s methodology. In other words, if we reproduced the same study several times, many if not most individual studies would have one or more clusters like Falluja. Therefore, I think it indicates poor design that the researchers did not anticipate criteria for excluding outlier clusters.

    (Notice I don’t really have a problem with the natural mortality numbers produced by cluster-sampling, because I presume that natural mortality, even if aggravated by the war, will have a fairly homogeneous distribution. I think those statistics have other problems, but not with the sampling method.)

    I would be willing to tolerate this study if (1) the consequences of excluding the Falluja cluster were clearly explained and the true results of the study were publicized, and (2) the inherent low quality of the results, as represented by the extremely wide confidence interval, were clearly explained.

    I don’t see any of this happening. People are neither being told what the study actually revealed nor of its inherent weaknesses. They treat it like revealed truth instead of a scientific study awaiting reproduction.

    Thanks for the kind words

    That’s OK, although I still must insist that you withdraw the accusation that I have slandered US troops.

    The Falluja cluster is an outlier not because Falluja was unusually violent but because it was sampled as a cluster.

    No, it was an outlier because Fallujah was unusually violent. Fallujah was unusually violent, the survey team knew that it was unusually violent, but had to sample it when its number came up because there are extremely violent areas in Iraq and not sampling any of them would be wrong.

    Violence is heterogeneous, especially organized military violence. The odds that a particular house will suffer a military attack go up enormously if the house next door is also attacked. By sampling contiguous houses, the study virtually guaranteed it would hit a run of houses all of which experienced violence in the war

    Look, theories don’t trump facts; facts trump theories. Ex Fallujah, the maximum number of violent deaths in a cluster was 4. If all of Iraq was like the not-Fallujah clusters, it is not guaranteed at all that any such big outlier would be found. Ex Fallujah, the data is not that heterogeneous, and the big outlier happened in an area which we know from non-sample information to have been extremely violent. There is really no reason to believe that Fallujah looks very violent other than that it was very violent.

    In other words, if we reproduced the same study several times, many if not most individual studies would have one or more clusters like Falluja.

    Only because it is difficult to construct a sample of Iraq that doesn’t include at least one of the high-violence areas. If you sampled only the city of Basra, you would be very unlikely to get a cluster like Fallujah.

    Therefore, I think it indicates poor design that the researchers did not anticipate criteria for excluding outlier clusters

    You keep talking about “indicates poor design”, but it really doesn’t look like you know what you’re talking about. If an observation is 50 standard deviations from the mean, you treat it as an outlier. You don’t necessarily “exclude” it, but you deal with it in a way which recognises that it’s 50 sd from the mean. Nobody has to “anticipate criteria” for this; it’s completely standard practice.

    And furthermore, they did anticipate the problem you identify, in so far as it is a problem with the sampling process rather than a genuine issue in the underlying data. You write:

    The odds that a particular house will suffer a military attack go up enormously if the house next door is also attacked

    This is what “design effects” are there to measure. The quoted confidence intervals are inflated to take account of this. As it happens, for the ex-Fallujah data, the design effect was 2.0, suggesting that the problem you identify existed to some extent, but not so seriously as to be out of the normal run of things for which cluster sampling is used. You really are still short of any explanation of why you think it is that there is a problem with this survey which would make you think that the quoted standard errors are not reliable.

    I would be willing to tolerate this study if (1) the consequences of excluding the Falluja cluster were clearly explained and the true results of the study were publicized, and (2) the inherent low quality of the results, as represented by the extremely wide confidence interval, were clearly explained.

    Both these things were done in the study. Anyway, you now appear to be criticising presentation rather than science; I don’t like it when you do this because it makes it impossible to be sure whether or not you are making an argument against the central claim of the study: that the result of the invasion was a significant (statistically and practically) increase in the death rate. I’d also appreciate it if you stopped pretending that this increased death rate had anything to do with “armies fighting”; the chart on page 5 makes it clear that the increase in the death rate is due to a gradual worsening over the last 18 months, not to a spike in March 2003.

    People are neither being told what the study actually revealed nor of its inherent weaknesses. They treat it like revealed truth instead of a scientific study awaiting reproduction.

    That’s not true. People don’t treat it like revealed truth, and if you haven’t been having much luck in making your points, it’s mainly because of your own actions.

  20. That’s not a bad rule of thumb, but it can’t be turned into a hard and fast rule; if you have a dataset like the one we’re talking about above, then rejecting the “4” cluster for being 3.05 sd from the mean would be equivalent to setting the bar at 3, which would be a very tough threshold indeed.

  21. So, I’m not used to reading this kind of sociological statistics. But I’ve got a couple questions dsquared:

    First, what kind of distribution are they assuming the Extra Deaths would take? It’s not Gaussian, since the CI doesn’t frame the mean evenly. Why would they use a different distribution for this? Perhaps it’s the subtraction of two Poisson distributions?

    Second, are these Std Deviations normal? I mean really. We’ve got sigma equal to half of the mean. It seems silly when the numbers are that big. Is this normal for sociological studies? Is the data considered reliable like this? I mean, in physics, they’d just shrug and say they don’t know much. The average seems meaningless at that point.

    Third, there’s a 2% chance that we REDUCED the number of violent deaths!

    Anyhow, I’ve got no disagreement with the science. Though you do have to admit the inclusion/exclusion of Falluja data gets muddled and confusing. I’m also really surprised that these surveys (any sociological survey, I guess) have such huge margins of error.

  22. Perhaps it’s the subtraction of two poisson distributions?

    Man, I just don’t know what I’m talking about.

  23. First, what kind of distribution are they assuming the Extra Deaths would take?

    They use bootstrapped standard errors; it’s a non-parametric approach rather than an assumed distribution.

    Is the data considered reliable like this?

    I really don’t know why people ask this. The data is the data. Having a lower standard deviation wouldn’t necessarily make it more reliable. The confidence interval specified reflects the design effect and the sample size, not anything about the data.
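
    A minimal sketch of what “bootstrapped” means here, with invented cluster counts (this is the general technique, not the study’s actual EpiInfo procedure): resample whole clusters with replacement, recompute the statistic each time, and read the interval off the resulting pile of estimates.

    ```python
    # Cluster bootstrap sketch: the resampling unit is the cluster, not the household.
    import random

    random.seed(4)
    post = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1] + [0] * 21  # fictitious post-war deaths per cluster
    pre  = [1] * 10 + [0] * 22                            # fictitious pre-war deaths per cluster

    def bootstrap_interval(n_boot=10000):
        diffs = []
        for _ in range(n_boot):
            re_post = [random.choice(post) for _ in post]
            re_pre  = [random.choice(pre) for _ in pre]
            diffs.append(sum(re_post) - sum(re_pre))
        diffs.sort()
        return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

    lo, hi = bootstrap_interval()
    print("95%% interval for excess deaths in the sample: %d to %d" % (lo, hi))
    ```

    Because the interval comes from the empirical distribution of resampled estimates, it carries no Gaussian assumption and need not sit symmetrically around the point estimate, which is why a CI can fail to frame the mean evenly.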

    I don’t have time to read the study, or the knowledge to fully absorb all of the ins and outs of statistical intricacies. But I have had one intro stats course, a year ago. So feel free to kindly pat me on my head and send me on my way, but is it known how much of the Iraqi population lived near actual Coalition military operations, particularly bombing sites, during this period, and the ratio of cluster sites that were near sites of Coalition action?

    I mean, if by chance the selected clusters were over-representative, or under-representative, of sites of Coalition action during the period the study looked at, relative to the actual percentage for the whole Iraqi population, shouldn’t the results be scaled by the percentage of over/under-representation?

    Not having examined the data myself: did the study look at this?

    I guess what I’m asking is not whether the clusters were representative enough of the Iraqi population, but whether the clusters were representative enough of Coalition action.

    Thanks for humoring a statistics-noob. I hope I wasn’t totally inarticulate.

    Regards,
    Eric Anondson

  25. Hey dsquared:

    I’m still waiting for you to show us all how the study came up with the 98,000 figure using the raw data from the 32 clusters. If you know so much about this you ought to at least be able to demonstrate that.

    —Ray D.

    Eric: Basically, nobody knows that information, or at least not in that degree of map-grid detail. I would suspect that even the coalition commanders wouldn’t have it all together in one place.

    Ray D: Well, since you haven’t asked nicely, provided any contribution to the debate yourself or given me any reason to believe that you’re actually interested in the answer, I suspect you’re gonna be waiting a while longer. Try holding your breath until you go blue; that sometimes works when my 3-year-old son does it.

    Correction: I haven’t contributed anything to the debate that you’ve been able to refute. I guess you’ve already forgotten all of the comments you had no answer for in the last comments section.

    By the way, I have a 5-year-old and a 3-year-old. And both of them could have copied the explanation from the study as to how it came up with the answer from the standpoint of methodology (as you did on Lambert’s site). Despite that, no one, including Mr. Roberts, has provided a clear explanation for how they reached the 98,000 figure.

    I think your sudden turn from respectfulness to ad hominem attacks is a clear sign that you are losing the argument.

    Again, to repeat: the entire 98,000 figure relies on 46 pre-war deaths over a 14.6-month period vs 89 “post-invasion” deaths over a 17.8-month period in the 32 clusters sampled, excluding Falluja. A change of even 10 deaths in either period would have changed the results by around 25,000 to 30,000. Please see the link above for my other arguments.

    I’m still waiting for an answer dsquared. But I won’t hold my breath as you suggested.

    One final note: I love how you and Tim dismiss arguments by saying “this has been dealt with in past debates.” Dealt with according to whom? And it is interesting how every expert who criticizes the Lancet is suddenly a crank and a quack to you two. I guess if you can’t beat ’em, smear ’em.

    —Ray D.
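
    The sensitivity Ray D. describes is easy to reproduce on the back of an envelope. Everything below is an approximation for illustration: the resident count and the national population are round numbers I have assumed, and the published study worked from exact person-months and bootstrapped its interval.

    ```python
    # Back-of-envelope extrapolation (all inputs approximate, see note above).
    PEOPLE = 7800            # assumed residents across the 32 ex-Falluja clusters
    IRAQ_POP = 24.4e6        # assumed population of Iraq
    PRE_DEATHS, PRE_YEARS = 46, 14.6 / 12
    POST_YEARS = 17.8 / 12

    def excess(post_deaths):
        pre_rate = PRE_DEATHS / (PEOPLE * PRE_YEARS)      # deaths per person-year
        post_rate = post_deaths / (PEOPLE * POST_YEARS)
        return (post_rate - pre_rate) * IRAQ_POP * POST_YEARS

    print("excess deaths with 89 post-war deaths: %.0f" % excess(89))  # ~100,000
    print("excess deaths with 10 more:            %.0f" % excess(99))  # ~30,000 higher
    ```

    On these round inputs the central figure lands near 100,000 and shifts by roughly 30,000 for ten deaths either way, which is the fragility being pointed at.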

  28. if you have a dataset like the one we’re talking about above, then rejecting the “4” cluster for being 3.05 sd from the mean would be equivalent to setting the bar at 3, which would be a very tough threshold indeed.

    It would be a tough threshold for that type of distribution (highly skewed). For other, non-skewed distributions it would be an easy threshold. What is the general rule, for a study such as this with a high concentration of skewed results, for defining outliers? Were any such criteria defined in advance of the survey?

    They use bootstrapped standard errors; it’s a non-parametric approach rather than an assumed distribution.

    Since the standard errors are bootstrapped, how can anyone refer to intervals within the 95% range with any precision? Surely there is some kind of probability density function associated with the results of the bootstrap. What is its exact shape?

    It is quite amusing to me to see Ray D. complain about the tone of the debate when Shannon Love has called dsquared a traitor, a liar and a Fascist supporter over at:

    http://www.windsofchange.net/archives/006569.php

    Have you even bothered to read the original article in question? The methodology they used to come up with the numbers is well explained in the Lancet.

  30. Jim Ausman,

    Nice try at sliming Shannon by asserting that he slimed someone else. I notice that you don’t cite what Shannon actually said.

    The only accusation of being a liar that I found in the thread you cite is one by dsquared, who accused Shannon of being a “dirty liar.” But I guess that’s OK.

    Then there’s this graceful comment. You’re a classy guy, aren’t you?

  31. Jonathan, you don’t seem to have looked very hard. Here is a link to the comment where Love accused dsquared of making “casual accusations of mass murder and war crimes against U.S. service members based on no better evidence.”

    And there is still Love’s unretracted accusation of treason directed at the researchers and the Lancet.

  32. Tim,

    I saw the remark that you cite. I do not see where Shannon called dsquared “a traitor, a liar and Fascist supporter”. I do see Shannon’s rather civil clarification of his position in response to dsquared’s tantrum. You seem to have as low a threshold for deciding that an honest mistake that is later corrected or a difference of opinion is a personal attack as dsquared has for making personal attacks.

    I don’t think Shannon called the investigators or the Lancet traitors. And anyway, the nature of their motives is not established fact. All you are saying is that your opinion differs from Shannon’s.

  33. Does Mr. Lambert consider “making casual accusations of mass murder and war crimes against U.S. service members based on no better evidence” to be treasonous? Consider the implications of yes or no.

    What would it take for an academic to be treasonous today? What would it take for a media professional to be treasonous today?

    Because everyone needs to work with the same definition here if we’re going to avoid flipping out over illusory innuendo.

    Shannon Love did not retract his accusation that dsquared made “casual accusations of mass murder and war crimes against U.S. service members based on no better evidence.” All he did was offer some tortured reasoning to support his claim by misrepresenting what the last paragraph of the report said and then arguing that dsquared must agree with it because he had defended other parts of the study. If his characterization of dsquared’s views was an honest mistake, why didn’t he correct it?

    Here is where Love accused the authors and the Lancet of treason:

    When you realize that without the Falluja data the study tells a very different story than the one widely reported and that the Falluja data could only have been collected with active collusion of the Baathist and the Jihadist who ruled Falluja at the time, the publication of this study assumes a very sinister cast. Either through intention or willful disregard, the researchers and publisher acted as a propaganda tool for the Fascist elements in Iraq. Given the degree to which they carefully spun their results, I conclude the effect was intended.

    Collusion with our enemies to make propaganda is treason — that’s what Lord Haw-Haw was executed for.

    Isn’t this like saying that Dan Rather’s interview with Saddam Hussein was treasonous because he colluded with the Ba’athists so as not to embarrass the man during the interview, just so he could get the interview on the eve of the invasion Mr. Rather opposed?

    Or that CNN’s secret agreement with the regime to obtain access to the country during sanctions-era Iraq, under which CNN frequently published puff pieces on the wonderfulness of Saddam (like birthday parades) and his enlightened rule and buried the slightest negative news, was treason?

    Because we’re not living in Lord Haw-Haw’s era. Love it or don’t, actionable treason has been redefined since the Vietnam War.

    Are you of the opinion that it was possible for foreigners to gather data (by themselves or through surrogates) in insurgent-occupied Falluja without coordinating in the slightest with the jihadists running the place? That it was done under their very noses? Because from reports it seems that even the native residents had difficulty getting through one day without interference from the jihadist occupiers.

  36. Tim Lambert wrote:

    Shannon Love did not retract his accusation that dsquared made “casual accusations of mass murder and war crimes against U.S. service members based on no better evidence.” All he did was offer some tortured reasoning to support his claim by misrepresenting what the last paragraph of the report said and then arguing that dsquared must agree with it because he had defended other parts of the study. If his characterization of dsquared’s views was an honest mistake, why didn’t he correct it?

    In the comment that Tim cites for “tortured reasoning,” Shannon wrote:

    Well, the Lancet study accuses the US of a studied indifference to civilian casualties (see the concluding paragraph), which is a war crime in anybody’s book. You are a passionate defender of the study, so I rather concluded that you shared Roberts’ outlook.

    This strikes me as a straightforward explanation rather than a “tortured” one.

    The passage that follows, not the passage above, is what I meant by “an honest mistake that is later corrected”:

    More specifically however, I had in mind some rather intemperate remarks I recall you having posted on Crooked Timber. Perhaps I misremembered and if so I apologize in advance.

    Which, again, strikes me as straightforward and well intentioned.

    On Tim’s other point, the assertion that Shannon accused the investigators of treason is unsupported by Tim’s evidence. Tim also ignores what Shannon himself said in an earlier comment.

    Readers can make up their own minds as to whose accusations are more credible.

  37. It’s appalling the way regulars on here just make up stuff to bolster their weak arguments.

    Mr. Anondson says:

    Love it or don’t, actionable treason has been redefined since the Vietnam War.

    In the United States, the crime of “treason” is defined in Article 3 of the US Constitution as “levying War against [the United States], or in adhering to their Enemies, giving them Aid and Comfort.”

    This has never been amended, and certainly not since Viet Nam.

    Furthermore, treason requires *intent*, which is present in neither of Mr. Anondson’s two strained analogies, but which Mr. Love specifically ascribes to Roberts et al.

    However, this is all just diddling over minutiae. If one truly believes that Roberts et al. colluded with the enemy to publish false data injurious to the United States, that person should have no problem labelling the authors as “traitors”, especially during the current jingoistic climate when pro-war folks have no compunction about labelling as a “traitor” anyone who disagrees with them. Semantic quibbling is merely a way for the dishonorable neither to stand by their comments nor to apologize for them.

    You guys are talking past each other. The Constitutional definition of treason hasn’t changed but the functional definition has. John Kerry wasn’t prosecuted for giving aid and comfort to the enemy when he testified falsely to Congress that US troops routinely committed war crimes. Jesse Jackson hasn’t been prosecuted for various self-promoting adventures in which he undermined the US govt’s position toward hostile nations (e.g., Syria). Jimmy Carter hasn’t been prosecuted for similar activities, etc.

    Oh, and Disputo is appalled. Personally I am shocked, so I guess we’re even.

  39. Kerry’s testimony would not count as “aid and comfort”. You can speak against any war the United States is in. I can’t stand the guy, but testifying falsely may be perjury but it isn’t treason. I’m going to do a little research on this and see what has traditionally been considered to be “aid and comfort” sufficient to invoke treason under the Constitution. Note that the Founders were acutely aware of the abuse of the charge of treason by the English monarch and wanted to make it very hard to impose any punishment for treason.

  40. This is getting quite ridiculous. Are you folks incapable of *not* making up stuff when your arguments are backed into a corner?

    There has been no change in the “functional definition” (whatever that means) of treason. Treason has never applied to any cases like the examples Mr. Gewirtz proffers, even assuming that his distorted (to put it mildly) presentations of them are accurate. Treason, as I stated previously, requires *intent*.

    However, I thank Mr. Gewirtz for giving us some insight into the tin foil conspiracy theories that circulate among the far right, and for providing further confirmation that rightwingers consider policy disagreements “treason”, if not of the post-Viet Nam “functional” kind.

  41. Are you folks incapable of *not* making up stuff when your arguments are backed into a corner?

    I would ask if you are capable of not being rude but I think the answer is obvious.

    I accept correction about the definition of treason. However, I also don’t see where Shannon called the investigators traitors. What Shannon wrote was:

    I don’t think their actions rise to the level of treason but they had to know that the study represented a propaganda boon to the insurgents and jihadist whether it will eventually be verified or not.

    Compare Shannon’s words with Disputo’s:

    If one truly believes that Roberts et al. colluded with the enemy to publish false data injurious to the United States, that person should have no problem labelling the authors as “traitors”. . .

    Shannon made clear that he doesn’t believe that, so Disputo’s statement is a red herring.

    BTW, who is backed into a corner? The Lancet study is clearly flawed. Does anyone not think that its summary should be amended to reflect the most serious questions about the data and analysis? Does anyone not think that the study should be replicated using a larger sample?

  42. Treason, as I stated previously, requires *intent*.

    Intent to what? To intentionally act so as to malign Coalition action, regardless of whether the other side is pleased and able to increase recruiting? Or to intentionally work with the “other side” because they have the same rough outcome in mind? What intent would invoke action against seemingly (truly or mistakenly) treasonous behavior?

    Doesn’t seem to matter what you mean, since Shannon has already stated, “If Kerry’s actions weren’t treasonous, I don’t think we can accuse Roberts of it.” Seems that Shannon considers Roberts’s actions worth comparing to Kerry’s. He doesn’t accuse Roberts of treason; as I read it, Shannon declares that what Roberts did was comparable to, or fell under the level of, Kerry’s behavior in the 1970s. He didn’t say that Kerry should have been executed for what he did! He didn’t say it was treason.

    Are we, or are we not, in an era when the concept of the Constitution as a “living document” is largely held in high regard? Are you a strict constructionist, Disputo? Or is it just with regard to treason…

    Seems to me that to say that Shannon called Roberts treasonous, or even Kerry treasonous, is putting words in Shannon’s mouth that you probably wish he would say.

  43. Jonathan writes:

    Does anyone not think that the study should be replicated using a larger sample?

    Yes, the US and UK governments.

    As for the study being “flawed”. All studies are flawed to some extent. Are the flaws significant enough that we should reject the study’s findings? Clearly not.

  44. Ray, asking me to reproduce the EpiInfo calculations by hand isn’t a debating point at all; it’s just you being silly.

    Since the largest cluster outside Fallujah had only 4 violent deaths and the total number of violent deaths ex-Fallujah was 21, then “10 extra deaths” would be a pretty huge variance and extremely improbable. It is trivially true that if very improbable events happened, then the results of a statistical survey would be misleading but this is a very uninteresting truism indeed.

    I’ve addressed all of your points on that previous thread; as far as I can see, the two links you’ve provided above go to a discussion in which you made a couple of mistakes, I corrected them, you made them again and I didn’t feel like repeating myself (plus the “only 10 deaths” error, discussed above).

    I think your sudden turn from respectfulness to ad hominem attacks is a clear sign that you are losing the argument.

    No, it’s a clear sign that I’ve lost patience with you, because you started saying things like “if you know so much” and “I’m still waiting”. I profoundly wish I hadn’t bothered being polite to you in the first place.

    My three year old often tries to get away with saying “and that means I win!” too. I tend to ignore it, as there is no danger of neutral onlookers being fooled.

  45. I’m going to be setting a short test on stochastic calculus soon for anyone who wants to represent themselves as a “hard science” type in conversation with me. So better start mugging up on Feynman-Kac derivatives.

  46. Jonathan, with several of your Lancet posts bearing red and strikethrough text providing corrections to calculation errors and misreadings and thanking me, it’s a bit silly to call me a troll.

  47. “Troll” is the nicest term I can think of to characterize someone who uses the comments section of my own blog to speculate about suing me.

    If you wanted to use “troll” with something approximating its original meaning, you should apply it to the person who sprays out ridiculous charges of treason and fraud.

  49. Tim,

    Shannon responded to your accusation 7 days ago, yet you continue to repeat it.

    Here is part of what Shannon wrote then:

    Actually, I don’t think I used the words fraud or treason. Fraud would imply outright fabrication of some or all of the study which I don’t think occurred. They do repeatedly make statements unsupported by their data which might be construed as fraud I suppose.
    I would typify their actions as corruption. They seek to hijack the prestige and reputation for objectivity of a major scientific publication in order to advance their own political agenda.

    It appears that Tim reads selectively.

  50. Here, yet again, is Shannon Love’s unretracted accusation of treason:

    When you realize that without the Falluja data the study tells a very different story than the one widely reported and that the Falluja data could only have been collected with active collusion of the Baathist and the Jihadist who ruled Falluja at the time, the publication of this study assumes a very sinister cast. Either through intention or willful disregard, the researchers and publisher acted as a propaganda tool for the Fascist elements in Iraq. Given the degree to which they carefully spun their results, I conclude the effect was intended.

    Why do you keep denying this?

  51. Because it’s not true. The passage you cite doesn’t support it. Your accusation that Shannon accused the investigators of treason and fraud is based on your flawed interpretation, not on fact. Your repeated assertion of your interpretations as though they were facts does not make them facts.

    The more interesting questions concern why you and your colleagues keep insisting that we accept a flawed study, and why you keep trying to discredit people who are critical of that study.

  52. Jonathan, calling the study “flawed” doesn’t make it flawed; magic doesn’t work. If you want other people to believe you, you have to find some flaws. So far, you’re coming up short.

  53. Your repeated assertion of your interpretations as though they were facts does not make them facts.

    Yes it does.
