It occurs to me that many of those reading my criticism of the Lancet Iraqi-mortality study don’t know what cluster sampling is or what types of failure it is prone to.
Cluster sampling works like this:
Say you have 100 balls colored either black or white in an unknown ratio. Now, in a traditional random sample, you would put all the balls in one container and mix them well, blindly draw out a sample, usually around 10 balls, and then use statistics to tell you the likely ratio of all the balls.
In cluster sampling, you would divide the balls up evenly into ten different containers. Then you would select one container at random and dump out all of its balls and determine their color ratio. At first glance it works exactly like a random sample and in some cases it actually does.
Take these two extreme cases. In Case A, the balls are thoroughly mixed before being put into the individual containers. In this case, counting all the balls in one container is the same as drawing a blind sample of ten from one container holding all the balls. In Case B, you start filling up all the containers with one color until you run out, and then you switch to the other color. In that case, counting the balls in one container will give a wildly inaccurate answer EVERY time. For example, if you have a fifty-fifty ratio of black to white, you would have five containers with all black, and five with all white. Choosing any one container would give a ratio of either 100% black or 100% white.
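The two extremes can be checked with a small simulation. This is only a sketch under stated assumptions: 100 balls at a fifty-fifty ratio, ten containers of ten, and function names (`fill_containers`, `one_container_estimate`) that are made up for illustration.

```python
import random

def fill_containers(balls, n_containers=10):
    """Split the list of balls, in order, into equal-sized containers."""
    size = len(balls) // n_containers
    return [balls[i * size:(i + 1) * size] for i in range(n_containers)]

def one_container_estimate(mixed, rng):
    """Pick one container at random and return its fraction of black balls."""
    balls = [1] * 50 + [0] * 50      # assumption: 1 = black, 0 = white, 50/50 ratio
    if mixed:
        rng.shuffle(balls)           # Case A: mix thoroughly before filling
    # Case B: leave the balls sorted, so each container holds a single color
    chosen = rng.choice(fill_containers(balls))
    return sum(chosen) / len(chosen)

rng = random.Random(0)               # fixed seed so runs are repeatable
case_a = [one_container_estimate(True, rng) for _ in range(1000)]
case_b = [one_container_estimate(False, rng) for _ in range(1000)]

print("Case A estimates seen:", sorted(set(case_a)))
print("Case B estimates seen:", sorted(set(case_b)))
```

In Case A the estimates scatter around 0.5, as a blind draw of ten would. In Case B the only values that can ever appear are 0.0 and 1.0.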
So we have a spectrum where, at one end, cluster sampling works just as well as random sampling, and at the other end it fails every time.
A more realistic model of how cluster sampling is actually used would run like this: You have one thousand colored balls in an unknown ratio. You divide them into 100 containers of ten balls each. Then you randomly select 10 containers and count all the balls in each container. Then you assume that each container is representative of 9 of the unexamined containers (and of itself, of course). It’s easy to see that the accuracy of this method would decrease as the distribution of the balls in the containers grew less random (more like Case B above). Your ten sample containers would be more and more likely to be unrepresentative of the entire population of balls. If you repeated the experiment many times, the average of the cluster results would converge on the result of a random sample, but if you only have one chance to count the balls you can never be sure how accurate it was.
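Here is a rough sketch of that larger experiment, repeated many times so the spread of the results becomes visible. The 30% black ratio, the trial count, and the names `cluster_estimate` and `run_trials` are assumptions chosen for illustration, not anything taken from the study.

```python
import random
import statistics

def cluster_estimate(balls, rng, n_containers=100, n_sampled=10):
    """Count every ball in n_sampled randomly chosen containers."""
    size = len(balls) // n_containers
    containers = [balls[i * size:(i + 1) * size] for i in range(n_containers)]
    counted = [b for c in rng.sample(containers, n_sampled) for b in c]
    return sum(counted) / len(counted)

def run_trials(mixed, rng, trials=1000):
    """Repeat the whole experiment and return (mean, spread) of the estimates."""
    estimates = []
    for _ in range(trials):
        balls = [1] * 300 + [0] * 700    # assumption: true ratio is 30% black
        if mixed:
            rng.shuffle(balls)           # well mixed, like Case A
        # otherwise the balls stay sorted by color, like Case B
        estimates.append(cluster_estimate(balls, rng))
    return statistics.mean(estimates), statistics.pstdev(estimates)

rng = random.Random(1)
mean_mixed, spread_mixed = run_trials(True, rng)
mean_sorted, spread_sorted = run_trials(False, rng)

print(f"mixed : mean={mean_mixed:.3f} spread={spread_mixed:.3f}")
print(f"sorted: mean={mean_sorted:.3f} spread={spread_sorted:.3f}")
```

Over many repetitions both averages land near the true 0.30, but the spread of the individual runs is several times wider when the containers are filled by color. A single run from the sorted setup can therefore be far off without giving any sign of it.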
The critical thing to understand here is that statistics cannot tell you the likely deviation of a cluster sample unless the distribution is random! In Case B, for example, every ball in the chosen container is the same color, so the sample would show a deviation of zero even though the result was absolutely wrong!
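To make the zero-deviation point concrete, here is a minimal sketch that applies the textbook standard-error formula for a proportion, sqrt(p(1-p)/n), to one all-black container from Case B. Using that particular formula as the measure of "deviation" is my assumption; it is valid only when the draws are independent, which is exactly what fails here.

```python
import math

def naive_standard_error(sample):
    """Proportion of black balls and its standard error, assuming independent draws."""
    p = sum(sample) / len(sample)
    return p, math.sqrt(p * (1 - p) / len(sample))

container = [1] * 10     # one all-black container from Case B (1 = black)
true_ratio = 0.5         # the real mix is fifty-fifty

p, se = naive_standard_error(container)
actual_error = abs(p - true_ratio)

print(p, se, actual_error)   # prints: 1.0 0.0 0.5
```

The formula reports no uncertainty at all, while the estimate is off by fifty percentage points.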
In choosing whether to use cluster sampling in the design of a real-world study, the first question the researcher has to ask is, “How symmetrical (that is, how evenly spread) is the distribution of the phenomenon that I want to measure?” If you know beforehand that the phenomenon is symmetrically distributed, you can use cluster sampling safely, whereas if you know or suspect the distribution is highly asymmetrical then you would not use cluster sampling.
In the real world, researchers are often forced by constraints of time or resources to use methods that, ideally, they would not use. Thus researchers often use cluster sampling when ideally they would not. It is logistically easier to interview all of the members of one neighborhood, say, than to interview the same number of people scattered randomly across the city. The neighborhood interview is usually safe enough in practice, since for most phenomena that are studied we have a rough idea what the results should be. If we get extreme results we can assume that the cluster sampling failed and try again later. Through repetition we can eventually converge on the true value.
In the case of the Iraqi-mortality study published in The Lancet, the initial results produced an estimate of deaths in excess of 250,000, a figure so improbably high that they had to toss out the Fallujah cluster entirely to get a “conservative” estimate of 100,000 deaths (51,000 combat-related), and even that result is at least double the value of every other estimate. The study’s measurement of pre-war infant mortality is also absurdly low (27/1000) compared to all other published sources. In short, the study has all the earmarks of a cluster-sample study that failed.
The study probably failed because the clusters were not sufficiently randomly selected. After assigning clusters randomly, the researchers had to combine and reassign clusters (Page 2, Paragraph 2). This reassignment had the effect of moving clusters from the Kurdish North and Shia South into the Sunni Center (Table 1 and Figure 1), where, coincidentally enough, most of the violence is. This magnified the effects of violence and disguised the improvements brought by clean water and medical care to the previously under-served Shia areas.