It occurs to me that many of those reading my criticism of the Lancet Iraqi-mortality study don’t know what cluster sampling is or what types of failure it is prone to.

Cluster sampling works like this:

Say you have 100 balls colored either black or white in an unknown ratio. Now, in a traditional random sample, you would put all the balls in one container and mix them well, blindly draw out a sample, usually around 10 balls, and then use statistics to tell you the likely ratio of all the balls.

In cluster sampling, you would divide the balls up evenly into ten different containers. Then you would select one container at random and dump out all of its balls and determine their color ratio. At first glance it works exactly like a random sample and in some cases it actually does.

Take these two extreme cases. In Case A, the balls are thoroughly mixed before being put into the individual containers. In this case, counting all the balls in one container is the same as drawing a blind sample of ten from one container holding all the balls. In Case B, you start filling up all the containers with one color until you run out, and then you switch to the other color. In that case, counting the balls in one container will give a wildly inaccurate answer EVERY time. For example, if you have a fifty-fifty ratio of black to white, you would have five containers with all black, and five with all white. Choosing any one container would give a ratio of either 100% black or 100% white.
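The two cases can be sketched in a few lines of Python. This is only an illustration under assumed numbers: 100 balls split fifty-fifty, ten containers of ten, and a fixed random seed. The helper names `fill_containers` and `black_fraction` are made up for this example.

```python
import random

def fill_containers(balls, n_containers):
    """Split a list of balls into equal-sized containers, keeping their order."""
    size = len(balls) // n_containers
    return [balls[i * size:(i + 1) * size] for i in range(n_containers)]

def black_fraction(container):
    """Fraction of black balls in a single container."""
    return sum(1 for ball in container if ball == "black") / len(container)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
balls = ["black"] * 50 + ["white"] * 50  # assumed fifty-fifty ratio

# Case A: mix the balls thoroughly, then fill the containers.
mixed = balls[:]
rng.shuffle(mixed)
case_a = fill_containers(mixed, 10)

# Case B: fill with one color until it runs out, then switch to the other.
case_b = fill_containers(balls, 10)

# The estimate you would get from each possible choice of one container.
case_a_estimates = [black_fraction(c) for c in case_a]
case_b_estimates = [black_fraction(c) for c in case_b]

print("Case A:", case_a_estimates)
print("Case B:", case_b_estimates)
```

In Case B every container reports either 0% or 100% black, so no single container can come close to the true fifty-fifty ratio. In Case A the individual containers scatter around the true ratio the way a blind draw of ten would.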

So we have a spectrum where, at one end, cluster sampling works just as well as random sampling, and at the other end it fails every time.

A more realistic model of how cluster sampling is actually used would run like this: You have one thousand colored balls in an unknown ratio. You divide them into 100 containers of ten balls each. Then you randomly select 10 containers and count all the balls in each container. Then you assume that each container is representative of 9 of the unexamined containers (and of itself, of course.) It’s easy to see that the accuracy of this method would decrease as the distribution of the balls in the containers grew less random (more like Case B above). Your ten sample containers would be more and more likely to be unrepresentative of the entire population of balls. If you repeated the experiment many times the results of the cluster experiment would converge on the results of a random sample, but if you only have one chance to count the balls you can never be sure how accurate it was.
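To get a feel for how much the arrangement of the balls matters, here is a rough simulation of that model. It is a sketch under assumptions of my choosing: a population of 500 black and 500 white balls, 2000 repetitions, and a fixed seed. The function names are hypothetical. It compares the spread of estimates from an ordinary random sample of 100 balls with the spread from counting 10 randomly chosen containers, once with mixed containers and once with containers filled one color at a time.

```python
import random
import statistics

def make_population(n_black, n_white, mixed, rng):
    """Return a list of balls (1 = black, 0 = white), optionally shuffled."""
    balls = [1] * n_black + [0] * n_white
    if mixed:
        rng.shuffle(balls)
    return balls

def cluster_estimate(balls, container_size, n_chosen, rng):
    """Pick whole containers at random and count every ball in them."""
    containers = [balls[i:i + container_size]
                  for i in range(0, len(balls), container_size)]
    chosen = rng.sample(containers, n_chosen)
    counted = [ball for container in chosen for ball in container]
    return sum(counted) / len(counted)

def random_estimate(balls, n_drawn, rng):
    """Draw individual balls at random, without replacement."""
    return sum(rng.sample(balls, n_drawn)) / n_drawn

def spread(estimator, trials=2000):
    """Standard deviation of the estimate over many repeated experiments."""
    return statistics.pstdev([estimator() for _ in range(trials)])

rng = random.Random(1)
mixed_balls = make_population(500, 500, mixed=True, rng=rng)
sorted_balls = make_population(500, 500, mixed=False, rng=rng)

srs_sd = spread(lambda: random_estimate(mixed_balls, 100, rng))
mixed_sd = spread(lambda: cluster_estimate(mixed_balls, 10, 10, rng))
sorted_sd = spread(lambda: cluster_estimate(sorted_balls, 10, 10, rng))

print(f"random sample of 100 balls:  sd = {srs_sd:.3f}")
print(f"10 containers, mixed fill:   sd = {mixed_sd:.3f}")
print(f"10 containers, sorted fill:  sd = {sorted_sd:.3f}")
```

With mixed containers the cluster estimate should scatter about as much as the plain random sample. With sorted containers the scatter should come out roughly three times larger, because the ten containers then carry no more information than ten individual balls.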

The critical thing to understand here is that statistics cannot tell you the likely deviation of a cluster sample unless the distribution is random! In Case B, for example, the result would have a deviation of zero even though the result was absolutely wrong!

In choosing whether to use cluster sampling in the design of a real-world study, the first question the researcher has to ask is, “How symmetrical is the distribution of the phenomenon that I want to measure?” If you know beforehand that the phenomenon is symmetrically distributed, you can use cluster sampling safely, whereas if you know or suspect the distribution is highly asymmetrical then you would not use cluster sampling.

In the real world, researchers are often forced to use methods that, ideally, they would not use due to constraints of time or resources. Thus researchers often use cluster sampling when ideally they would not. It is logistically easier to interview all of the members of one neighborhood, say, than to interview the same number of people scattered randomly across the city. The neighborhood interview is usually realistically safe to do, since for most phenomena that are studied we have a rough idea what the results should be. If we get extreme results we can assume that the cluster sampling failed and try again later. Through repetition we can eventually converge on the true value.

In the case of the Iraqi-mortality study published in The Lancet, the initial results produced an estimate of deaths in excess of 250,000, a figure so improbably high they had to toss out the Fallujah cluster entirely to get a “conservative” estimate of 100,000 deaths (51,000 combat-related), and even that result is at least double the value of every other estimate. The study’s measurement of pre-war infant mortality is also absurdly low (27/1000) compared to all other published sources. In short, the study has all the earmarks of a cluster-sample study that failed.

The study probably failed because the clusters were not sufficiently randomly selected. After assigning clusters randomly, the researchers had to combine and reassign clusters (Page 2, Paragraph 2). This reassignment had the effect of moving clusters from the Kurdish North and Shia South into the Sunni Center (Table 1 and Figure 1), where, coincidentally enough, most of the violence is. This magnified the effects of violence and disguised the improvements brought by clean water and medical care to the previously under-served Shia areas.

### 10 thoughts on “The Madness of Methods”

1. But surely this numerical example just goes to show that a clustered sample is far more likely to underestimate than to overestimate rare events like deaths?

2. Furthermore, you are wrong in your claim about what Table 1 shows. You appear to be correct about clusters moving out of Kurdish areas into Sunni ones; the movement from “North” into “Centre” is a result of the pairing of Dehuk and Ninawa governorates; this would tend to oversample the violent province of Ninawa relative to Dehuk.

However, against this, there is a net movement out of the region “Upper South” and into “Lower South” as the regions of Qadisiyah and Dhi Qar were paired. This adds one cluster to the Shiite province of Dhi Qar and means that Qadisiyah (which contains Samarra, which saw significant violence) is not sampled at all. Improvements in Shiite areas would be exaggerated, not minimised by this effect.

(and I maintain that your ball/urn story is entirely a priori, while inspection of the data does not give grounds for believing that the sample is heterogeneous)

3. And finally, I don’t understand why you say that the Lancet figure is “at least double every other estimate”, when there are no other estimates. The Iraq Body Count estimate is not an estimate of total casualties; it’s a lower bound on the number of civilian casualties of acts of violence. And to think that you are using the phrase “lying by omission” to describe the Lancet’s discussion of infant mortality rates! For shame!

4. And finally finally, this numerical example doesn’t work. If you have 1000 balls in containers of ten, then a lower bound on how unrepresentative your sample can be would be the case where the containers are either all black or all white. But this would be just the same as having 100 balls and drawing ten from them. And I don’t think anyone would seriously argue that drawing ten balls from a hundred isn’t a perfectly reasonable way to estimate the proportion of black and white balls, or that your measured standard errors would be utterly perverse.

The error here is the statement that “in case B the deviation would be zero”. This would only be the case if you pulled out ten cases, all containing only black balls. If the true proportion of black balls was, say 70%, then the chance of this happening would be no more than 1 in 50.

5. dsquared,

You keep assuming that a cluster sample is always equivalent to a random sample of the same size. That is true only if the distribution is random. You also assume that you can iterate the experiment to arrive at the correct value. In the case of the Iraq study, we can’t yet do so; we just have one snapshot.

Try this: I hand you a container with 5 purple balls and 5 yellow balls in it. I tell you that the container is 1 of 10 containers each containing 10 balls. That’s all you know. Now what can you tell me statistically about the distribution of the other balls?

If you assume that the distribution of balls is random you can make a good guess, but what if I tell you the distribution of balls in the containers is non-random?

The problem with cluster sampling is that you are grabbing balls not one at a time at random but in chunks of ten. Your deviation will always be higher because the theoretical minimum deviation is 10. Random samples (from a thousand) of a hundred balls could produce numbers like (1 white, 99 black), (43 white, 57 black), (72 white, 28 black) etc. Cluster samples from a non-random sample (each container all one color) would produce numbers like (0 white, 100 black), (30 white, 70 black), (50 white, 50 black), (100 white, 0 black) etc.

Because the composition of balls within each container (cluster) is non-random you MAGNIFY YOUR VARIANCE.

Think of it this way: clustering effectively reduces your sample size. Selecting 10 containers, each either all white or all black, out of 100 containers (1000 balls total) is the statistical equivalent of selecting 10 balls from a population of 100. It is obvious that you are far more likely to get an extreme value than if you chose 100 balls as a random sample.
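The "reduced sample size" idea can be put into numbers with the standard design-effect approximation from survey sampling, deff = 1 + (m - 1) * rho, where m is the cluster size and rho is the intracluster correlation. The sketch below is only an illustration of that textbook formula; it ignores finite-population corrections and is not taken from the Lancet paper. The function names are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Approximate variance inflation from clustering: 1 + (m - 1) * rho.

    cluster_size -- number of units counted per cluster (m)
    icc          -- intracluster correlation (rho), from 0 (fully mixed)
                    to 1 (every cluster internally identical)
    """
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Size of the simple random sample that carries the same information."""
    return n / design_effect(cluster_size, icc)

# 100 balls counted in containers of ten.
print(effective_sample_size(100, 10, 1.0))  # each container all one color
print(effective_sample_size(100, 10, 0.0))  # containers thoroughly mixed
```

With every container all one color (rho = 1) the 100 counted balls are worth 10 independent draws, which is the equivalence described above. With thoroughly mixed containers (rho = 0) they are worth the full 100.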

6. Think of it this way: clustering effectively reduces your sample size

Think of it this way; the words “design effect” reappear frequently in the paper, and they are there for a reason.

7. As I’ve said before, statistical estimators are not a magical black box that correct for bad samples. See the references on survey methodology I cite in the original thread on this subject if you want more details on why a cluster sample is likely to be misleading here.

Of course, I’m beginning to get the impression you don’t actually care what the right answer is …

8. As I’ve said before, statistical estimators are not a magical black box that correct for bad samples.

Now this is purest rubbish. I never said they were. However, the Iraqi clusters, excluding Fallujah, are not a bad sample, and constantly saying that they were doesn’t make them one. They were a small sample relative to the population, but as luck would have it (or rather, unfortunately) the effect they were measuring was a very significant one indeed. In all the governorates except one, the death rate post invasion was between 150% and three times the pre-invasion death rate.

This means that, when due allowance is made for the size of the sample, including reducing its effective size to take account of clustering, the effect is still there.

If the overall effect was only about a 25% increase in the death rate, then I would have agreed with you that this sample was not really big enough to say with confidence that the effect had been picked up. However, large effects, present in nearly every cluster, are really quite an improbable outcome under the null hypothesis that the underlying death rate was unchanged.

Here’s a reference for you on cluster sampling and design effects. Perhaps you’d now be able to explain to me precisely why you think that this methodology is not valid, or whether you believe (incorrectly) that it was not followed.

I think that what you’re trying to do by talking about a “bad sample” is to equivocate between a nonrandom sample (which of course could not be statistically corrected, but is not what the JHU team actually took) and a sample which is merely clustered (in which case the variance inflation method described produces valid estimates). Do be very sure that I’m not going to be fooled by this one.

9. Shannon,

Would you agree that in the case of a small sample, where the entities studied are relatively rare and skewed, the MOST LIKELY result is that those rare entities are underrepresented in the sample?

10. All replies are in the original thread on the subject.