Even though the lead author of the Lancet Iraqi Mortality Survey (LIMS) (free, registration required), Les Roberts, has himself repeatedly described cluster sampling as an “imprecise” method of studying violence from airstrikes, many continue to argue that the quality of the LIMS falls just short of divinely revealed truth. Since both I and others regard cluster sampling as the primary methodological weakness of this study, I thought I would explain in detail the unsuitability of cluster sampling to measuring the incidence of military violence in Iraq.
This post is part one of two. In part one, I will examine the theory of cluster sampling and explain under what conditions it will or will not produce accurate results. In part two, I will apply the concepts explained in part one to the LIMS itself.
Those who want a more technical explanation of the weaknesses of cluster sampling should check out Stubbs and Zellan’s “Cluster Sampling: A False Economy.”
The extended post is rather long and contains graphics. If you have a slow connection be prepared for an unusually long load time.
2.0 What is Cluster Sampling?
Cluster sampling is a convenient method for taking samples in a statistical study. In a random-sample study, each individual sample is chosen separately and has no connection, physical or otherwise, to any other sample save by sheer chance. A cluster-sample study, by contrast, groups individual samples together, usually on the basis of physical proximity. Each grouping of samples is a “cluster.”
Clustering is done not for any methodological or mathematical reasons but for practical logistical ones. It is easier to reach individual samples if they are physically grouped together than if they are individually scattered across the entire population. For example, a nationwide household study conducted with random sampling would randomly select 900 individual households from all the households in the country. Except by chance, none of the households would be physically adjacent to each other. By contrast, a cluster-sample study of the same sample size would randomly select 30 points across the country and then sample each of the 30 houses physically adjacent to each point. Each individual household would always be physically proximate to 29 other samples. In the random-sample study, the researchers would have to travel separately to each of the 900 individual households. In the cluster-sample study, the researchers would only have to travel to 30 individual neighborhoods and then walk door-to-door to sample 30 adjacent households.
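The difference between the two designs can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are mine, not part of any real survey protocol: households are numbered 0 to 89,999 (a hypothetical country), neighbors have consecutive numbers, and a "neighborhood" is a fixed block of 30 consecutive households. The function names are made up for this example.

```python
import random

def simple_random_sample(n_households, sample_size, rng):
    """Pick every household separately, with no regard to location."""
    return rng.sample(range(n_households), sample_size)

def cluster_sample(n_households, n_clusters, cluster_size, rng):
    """Pick random blocks, then take every household in each chosen block."""
    n_blocks = n_households // cluster_size
    chosen_blocks = rng.sample(range(n_blocks), n_clusters)
    return [
        block * cluster_size + offset
        for block in chosen_blocks
        for offset in range(cluster_size)
    ]

def neighborhoods_visited(sample, cluster_size):
    """Count the distinct blocks a researcher would have to travel to."""
    return len({household // cluster_size for household in sample})

rng = random.Random(0)   # fixed seed so the run is repeatable
N_HOUSEHOLDS = 90_000    # hypothetical population size

srs = simple_random_sample(N_HOUSEHOLDS, 900, rng)
clustered = cluster_sample(N_HOUSEHOLDS, 30, 30, rng)

print("random sample :", len(srs), "households in",
      neighborhoods_visited(srs, 30), "neighborhoods")
print("cluster sample:", len(clustered), "households in",
      neighborhoods_visited(clustered, 30), "neighborhoods")
```

Both designs end up with 900 households, but the cluster design reaches them in exactly 30 trips, while the random design scatters them over many hundreds of neighborhoods.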
3.0 Limitations of Cluster Sampling
Because the individual samples in a cluster sample are not chosen in a wholly random manner, cluster sampling can produce inaccurate results for some types of phenomena. Reducing the randomness of the sampling reduces what statisticians call the “power” of a study. The power of a study is a statement of its ability to accurately measure the studied phenomenon. The power of a cluster-sample study is dependent on the degree of evenness or unevenness in the distribution of the studied phenomenon in the overall population. (The technical terms for even and uneven are homogeneous and heterogeneous respectively. I’ll stick with the more common terms.) If the distribution is highly even (homogeneous) then a cluster-sample study will be almost as accurate as a random-sample study of the same size (same total number of samples). If the distribution is highly uneven (heterogeneous) then a cluster-sample study will be much more inaccurate than a random-sample study of the same size.
Uneven distributions destroy the power of cluster-sample studies because they increase the likelihood that all the individual samples within a particular cluster will return similar values. In the real world, the samples usually return similar values because the studied phenomena affect entire areas. For example, in the household survey above, if you used cluster sampling to study housing prices, percentage of sewer hookups vs. septic tanks, or rainfall over the past year, every household within each cluster would return a similar value because all of those phenomena affect the entire area of the cluster. Sampling one household in each cluster gives you as much information as sampling all 30 households. Such a study would effectively have only as many samples as it had clusters. A random-sample study, by contrast, would effectively have as many samples as it had individual samples. The cluster-sample study would have only 30 effective samples versus 900 for the random-sample study.
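The "30 effective samples versus 900" arithmetic can be reproduced with the standard design-effect approximation usually credited to Kish: deff = 1 + (m - 1) * rho, where m is the cluster size and rho is the intraclass correlation (0 when households in a cluster are unrelated, 1 when they are all identical). The formula is not used in this post; I bring it in here only as a rough sketch, and the function names are my own.

```python
def design_effect(cluster_size, icc):
    """Kish approximation: how much clustering inflates the variance."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent samples the clustered study is worth."""
    total_samples = n_clusters * cluster_size
    return total_samples / design_effect(cluster_size, icc)

# Every household in a cluster returns the same value (icc = 1):
print(effective_sample_size(30, 30, 1.0))   # 30.0

# Households in a cluster are unrelated to each other (icc = 0):
print(effective_sample_size(30, 30, 0.0))   # 900.0
```

Real phenomena fall somewhere between those two extremes, which is why the degree of evenness matters so much in what follows.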
Key Concept: Cluster sampling is an inaccurate methodology for measuring phenomena that affect areas as opposed to individuals. Phenomena that affect entire areas can affect multiple samples within each cluster, thereby destroying the statistical power of the study.
So, when deciding whether cluster sampling will be a good tool for measuring a particular phenomenon we should ask two related questions:
1) Is the known or suspected distribution of the phenomenon even or uneven?
2) Does the phenomenon likely affect individual samples or does it affect entire areas?
4.0 Simple Cases
To understand how and when cluster sampling fails let us look at some super-simple examples.
First a few caveats. These examples don’t correspond to anything real-world. They are mathematically valid but you won’t find anything like them outside a basic textbook. They largely ignore the concept of deviation, a key concept in statistics but one that many people find confusing. So you won’t see 10% plus or minus 2% or the like. For clarity, the examples use only one cluster, which a real-world study would never do. Despite these limitations, these examples will accurately demonstrate the conditions under which cluster sampling is considered either an accurate method or an inaccurate one.
(Note: If you find one of the descriptions of a simple example inaccurate, try putting an adjective like “simplistically”, “crudely” or “ideally” in front of it and see if that fixes it.)
Some terms: Instance refers to an occurrence of the studied phenomenon. Incidence or incidence rate refers to the percentage of the population that exhibits the phenomenon. So if an example samples an instance once out of 10 samples it would report an incidence of 10%.
Imagine that you have a small town of 100 households divided up on 10 streets of 10 houses each, forming a 10×10 grid. Each left-to-right row is a street. Suppose you have some phenomenon that actually affects 1 in 10 houses for an incidence of 10%. (You don’t know that when you design, conduct and evaluate the study, but we will use the god’s eye view here for clarity.)
Key Concept: In cases A through D, a random sample will return an incidence of 10% regardless of how evenly or unevenly the incidents group together, but a cluster sample will return an incidence varying from 0% to 100% depending on the evenness of the distribution.
4.1 Case A: Random Sample of a Randomly Even Distribution
Fig 1 shows Case A, a purely random sample of 10 houses when the actual distribution is random and even.
The Counting View is the results researchers would get if they visited every house and checked for the phenomenon (statisticians call this a census). It’s the best picture of the phenomenon possible. The white squares represent houses without the phenomenon. Red diamonds represent houses with the phenomenon. The Sample View shows the view that emerges when one randomly samples 10 houses in the town. The black squares represent the unsampled and thus unknown houses. The Result View shows the general view of the phenomenon that the study returns.
The study selects 10 out of 100 houses entirely at random. Two adjacent houses will both be sampled only by rare chance. If repeated many times, the study’s results will average out to an incidence of 10%. As a bonus, the random sampling will also provide information about the actual distribution of the phenomenon.
Logistically, conducting this study would involve traveling all over town to visit each individual sampled house. In order to save shoe-leather and time, a researcher might decide to randomly select 1 street, assume that that 1 street is representative of the remaining 9, and then just walk down that street interviewing every house on that street. That street is a cluster. Whether this methodology would produce accurate results or not would depend wholly on the actual distribution of the phenomenon relative to the streets. If the phenomenon is scattered all over town, cluster-sampling will work. If the phenomenon occurs only on particular streets, it will fail.
4.2 Case B: Cluster Sample of a Randomly Even Distribution
The Sample View shows a representative cluster made up of all the houses on a single street.
The major difference between this cluster sample and the previous random sample is that, while the random sample had 10 random choices, 1 for each of the 10 houses, the cluster sample has only 1 random choice, the selection of 1 out of 10 streets. The cluster sample does not select each individual house in a purely random manner. Whether a house is sampled depends on whether it is on the same street as another house that is also sampled. Decreasing the randomness of the sampling immediately reduces the predictive power of the study.
In this case, it doesn’t matter much. The cluster-sample study will return results close to those of the random-sample study. 80% of the time the cluster will fall on a street like street 3, which has 1 instance, and will return an incidence of 10%. However, 10% of the time the cluster will fall on street 5, which has zero instances, and will return a 0% incidence, and 10% of the time it will fall on street 6, which has two instances, and so will return an incidence of 20%. If the study were repeated many times it would consistently return values close to what the random-sample study did. Sometimes it would be off, but the tradeoff in decreased accuracy versus time and resources saved would be acceptable.
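The 80/10/10 split above can be tabulated directly. The sketch below takes the street layout as described in the text (street 5 has no instances, street 6 has two, the other eight have one each) and assumes each street is equally likely to be picked as the cluster. The helper name is made up for this example.

```python
from collections import Counter
from fractions import Fraction

# Instances per street in Case B, streets 1 through 10.
STREET_COUNTS = [1, 1, 1, 1, 0, 2, 1, 1, 1, 1]
HOUSES_PER_STREET = 10

def cluster_result_distribution(street_counts, houses_per_street):
    """Map each incidence a one-street cluster can report to its probability."""
    tally = Counter(Fraction(count, houses_per_street) for count in street_counts)
    n_streets = len(street_counts)
    return {incidence: Fraction(k, n_streets) for incidence, k in tally.items()}

dist = cluster_result_distribution(STREET_COUNTS, HOUSES_PER_STREET)
for incidence in sorted(dist):
    print(f"reported incidence {float(incidence):.0%}: "
          f"probability {float(dist[incidence]):.0%}")

# Long-run average of the reported incidence.
average = sum(incidence * prob for incidence, prob in dist.items())
print(f"average reported incidence: {float(average):.0%}")
```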
4.3 Case C: Cluster Sample of a Perfectly Even Distribution
In Case C, each street has exactly one instance of the phenomenon. In this case, cluster sampling would actually produce an accurate result more consistently than would random sampling. No matter which street was selected it would sample the phenomenon 1 time in 10 and would always return an incidence of 10%. Random-sampling would return the same results as in Case A, because random-sampling is immune to the effects of distribution.
Such super-even distributions rarely occur in the real world, so cluster sampling is rarely more accurate than the equivalent random sample. Nevertheless, this example shows how important even distribution is to the accuracy of cluster sampling.
4.4 Case D: Cluster Sample of a Perfectly Uneven Distribution
As in all previous cases, a random sample returns the same results regardless of the distribution, but a cluster-sample study is rendered completely useless. 9 out of 10 times the cluster will fall on a street with zero instances (Fig 4-B) and return an incidence of 0% (Fig 4-C). 1 out of 10 times it will fall on street 5 (Fig 4-D) with 10 instances and return an incidence of 100% (Fig 4-E). The cluster-sample study will have the same effective power as a random sample that took only 1 individual sample.
Realistically, it is best to think of Cases C and D as the extreme ends of a spectrum of distributions running from perfectly even to perfectly uneven. Cluster sampling becomes increasingly inaccurate as the unevenness of the distribution increases. Where a real-world study falls on this spectrum is keenly important to evaluating both its design and its results.
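To put rough numbers on the two ends of that spectrum, here is a sketch that computes the average and the spread (standard deviation) of the incidence a study would report over many repetitions. For the cluster study it assumes a single one-street cluster with every street equally likely. For the random sample it uses the textbook variance of sampling 10 houses out of 100 without replacement (the hypergeometric distribution), which I add here for comparison. Function names are hypothetical.

```python
import math
import statistics

def cluster_mean_and_sd(street_counts, houses_per_street):
    """Average and spread of the incidence reported by a one-street cluster."""
    rates = [count / houses_per_street for count in street_counts]
    return statistics.mean(rates), statistics.pstdev(rates)

def random_sample_mean_and_sd(population, instances, sample_size):
    """Same two numbers for a simple random sample without replacement."""
    p = instances / population
    variance_of_count = (sample_size * p * (1 - p)
                         * (population - sample_size) / (population - 1))
    return p, math.sqrt(variance_of_count) / sample_size

case_c = [1] * 10                    # one instance on every street
case_d = [0] * 4 + [10] + [0] * 5    # all ten instances on street 5

c_mean, c_sd = cluster_mean_and_sd(case_c, 10)
d_mean, d_sd = cluster_mean_and_sd(case_d, 10)
r_mean, r_sd = random_sample_mean_and_sd(100, 10, 10)

print(f"Case C cluster: mean {c_mean:.2f}, spread {c_sd:.2f}")
print(f"Case D cluster: mean {d_mean:.2f}, spread {d_sd:.2f}")
print(f"Random sample : mean {r_mean:.2f}, spread {r_sd:.2f}")
```

All three average out to 10%. The random sample's spread (about 0.09) is the same whatever the street layout, while the cluster sample's spread swings from zero in Case C to 0.30 in Case D.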
4.5 Case E: Cluster Sample of a Rare Instance
One common misconception regarding cluster sampling is that the method under-samples rare instances. This is not true. Any given cluster-sample study is just as likely to sample any individual instance as is a random-sample study of the same size. However, when it fails to sample the rare instance, the cluster-sample study will severely underestimate its incidence, and when it does sample the instance, it will severely overestimate its incidence.
The Counting View shows a single instance of a phenomenon in a town of 100 houses. A random sample of 10 samples would have a crude 10% chance of sampling the phenomenon. Much of the time, a random-sample study would miss the phenomenon completely. A cluster sample of the same size would also have a 10% chance of selecting the street on which the single instance occurred. 90% of the time it would miss sampling the phenomenon entirely (Fig 5-B) and return an incidence of 0% (Fig 5-C). 10% of the time it would sample the event (Fig 5-D) and would return an incidence of 10%.
The extremely high or extremely low incidence arises because cluster-sampling methodology assumes that the incidence within the cluster is the incidence within the total population. If the incidence within the cluster is zero, cluster sampling returns an extremely low estimate of 0% incidence. If the incidence within the cluster is 10%, cluster sampling returns an overestimate of 10% incidence for the entire population. By contrast, a random-sample study will never return an overestimate because it can never sample the rare event more than once. (A real-world random sample would virtually never oversample as well.)
For cluster sampling, rarity is just a form of uneven distribution. It’s not the rarity per se that causes the inaccuracy, but rather the increased likelihood that the rare phenomenon will only be sampled in a minority of clusters. Cluster sampling amplifies deviation, making a rare phenomenon look either rarer or more common than it truly is. It is impossible to tell from any single study whether this has happened or to what degree. Only multiple studies or information from different methods can reveal the true picture.
4.6 Case F: Cluster Sample of a Common but Uneven Distribution
You can see the negative effects of an uneven distribution more clearly in Case F in Fig 6, which shows the cluster sampling of a phenomenon with a counted incidence of 50% but with a completely uneven distribution.
A cluster-sample study will either sample streets 6-10 (Fig 6-B) and report an incidence of 0% (Fig 6-C) or it will sample streets 1-5 (Fig 6-D) and report an incidence of 100% (Fig 6-E). A random-sample study will report an incidence of 50% almost all of the time.
5.0 Multiple Clusters
The above simple cases all used one cluster each, for clarity, so one might ask whether having multiple clusters in a single study changes the results significantly. It does somewhat, but the basic problem with uneven distributions persists. The more clusters you have and the fewer samples in each cluster, the more like a random sample the cluster sample becomes and the less sensitive it is to uneven distributions. (You could think of a random-sample study as a cluster study where each cluster holds only one sample.) Conversely, the greater the degree of clustering, the more sensitive the study is to the effects of uneven distributions.
The standard for household surveys is to use 30 clusters of 30 houses each (30 is a magic number in statistics). Using 60 clusters of 15 houses each would make a study less sensitive to distribution and thus potentially more accurate. Using 15 clusters of 60 houses each would make it more sensitive to distribution and thus potentially less accurate.
It is easy to see that if a single instance of a phenomenon can affect every sample within a cluster, then the statistical validity of the samples within the cluster is destroyed. The samples are no longer 30 individual samples of 30 instances of the phenomenon, but are instead collectively just one sample of a single incident.
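A small simulation shows how the number of clusters matters in that worst case. The sketch below assumes the extreme situation just described: each cluster is either wholly hit or wholly spared, independently of the others, with a hypothetical 10% of areas affected. Under that assumption the houses inside a cluster add no information, so only the cluster count drives the result. The names and the 10% figure are mine, chosen for illustration.

```python
import random
import statistics

def simulate_spread(n_clusters, true_rate, trials, rng):
    """Spread (standard deviation) of the estimated incidence across
    repeated studies when every cluster is all-or-nothing."""
    estimates = []
    for _ in range(trials):
        hit_clusters = sum(rng.random() < true_rate for _ in range(n_clusters))
        estimates.append(hit_clusters / n_clusters)
    return statistics.pstdev(estimates)

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_RATE = 0.10          # hypothetical share of areas affected

spreads = {}
# All three designs sample 900 houses in total.
for n_clusters, cluster_size in [(60, 15), (30, 30), (15, 60)]:
    spreads[n_clusters] = simulate_spread(n_clusters, TRUE_RATE, 5000, rng)
    print(f"{n_clusters} clusters of {cluster_size}: "
          f"spread about {spreads[n_clusters]:.3f}")
```

The spread shrinks as the number of clusters grows, even though the total number of houses sampled never changes.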
6.0 Correcting for Cluster Errors
Can statistical methods correct for the inaccuracies introduced by cluster sampling? In some cases, yes. If you are studying a phenomenon whose distribution is well known from previous experience, then you can use statistical methods to more accurately assess the study’s results. You can tweak the incidence rate based on the known distribution. If, however, you don’t have any external information, you can’t do that. If you are using cluster-sampling to study a phenomenon that had never been measured before, then you cannot use statistical tools to adjust the results with any expectation of accuracy.
Statistical sampling methods are just like any other kind of scientific instrument in that they must be calibrated against known results. If you are measuring a phenomenon for the first time, you will never know for sure if your results are accurate until they are reproduced, preferably using a different method altogether. Only when we have a body of experience using different methods to measure a particular phenomenon can we confidently say that a given study has returned accurate results. Until then, we’re just guessing.
7.0 Clustering Rules
We can formulate three rules for evaluating, in a particular circumstance, whether cluster sampling is likely to be an accurate methodology or whether a study that used cluster sampling produced an accurate measurement.
7.1 The Even Distribution Rule
The first rule applies primarily to the design stage of a study, when a sampling methodology is chosen.
Rule 1: The Even Distribution Rule. The more evenly distributed (more homogeneous) a phenomenon is across the entire studied population, the more accurate cluster sampling will be. Conversely, the more unevenly distributed (more heterogeneous) the phenomenon is across the entire studied population, the less accurate cluster sampling will be.
If the phenomenon is known or suspected to be evenly distributed then cluster sampling is probably an acceptable compromise, but if the phenomenon is known or suspected to be unevenly distributed then cluster sampling would be a poor choice for a methodology and would be more likely than not to produce highly inaccurate results.
A phenomenon that affects areas or contiguous groups of samples will be much more likely to have an uneven distribution than a phenomenon that affects only individual samples.
7.2 The Microcosm Rule and The Cluster Clone Rule
There are two rules which can be applied to data that a cluster-sampling study has collected, to try to assess the accuracy of the study. Both rules seek to measure the degree of variation in the distribution as uncovered by the study itself.
Ideally clusters should be “microcosms” of the whole population and be as internally heterogeneous (i.e. have the full range of variability within them) and be as externally homogeneous (i.e. be as similar to each other) as possible. The greater the extent to which this is fulfilled, the closer the design effects would be to 1.00 and hence the lower the price paid for reducing traveling cost and time.
Rule 2: The Microcosm Rule: The distribution of the phenomenon within any individual cluster should mirror the distribution of the phenomenon within the larger population. Each cluster should be a statistical scale-model of the entire population. This happens naturally when the distribution of the phenomenon is even in the total population.
Rule 3: The Cluster Clone Rule: Ideally, all of the clusters should return similar values. One could duplicate any particular cluster and substitute the duplicated value for the value of any other cluster without significantly altering the results. Each cluster is a duplicate or clone of each of the others. The more similar the results returned by each cluster, the more evenly distributed the phenomenon was in the total population.
You can see both of these rules working in the extreme cases C and D above. Imagine that we have a study of 10 towns of 100 houses each, where each town is just like the 10×10 town in our sample cases. We do a cluster-sample study with 10 clusters, one for each town. If the distribution is extremely even, as in case C, then each cluster will follow the Microcosm Rule and return an incidence identical to that of the total population. Each cluster will also be identical to each of the 9 other clusters. If the distribution is like that in case D, however, none of the clusters will look like a microcosm of the total population; clusters will return widely differing results, either an incidence of 0% or an incidence of 100%. Duplicating the value of one cluster and substituting it for another cluster would radically change the overall results.
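One crude way to check the Cluster Clone Rule against a study's own data is to compare how much the clusters actually differ from each other with how much they would differ if each one were just a small random sample. The ratio below is my own rough sketch in the spirit of the design effect mentioned above, not a standard named statistic. It uses the ten-town example, with the Case D data being one plausible outcome (the cluster lands on the affected street in one town out of ten).

```python
import statistics

def clumping_ratio(cluster_counts, cluster_size):
    """Observed variance of per-cluster incidence divided by the variance
    expected if each cluster were a simple random sample of that size.
    Near or below 1: clusters behave like random samples.
    Near cluster_size: each cluster acts like one single sample."""
    rates = [count / cluster_size for count in cluster_counts]
    p = statistics.mean(rates)
    expected = p * (1 - p) / cluster_size
    if expected == 0:
        raise ValueError("phenomenon absent or universal; ratio undefined")
    return statistics.pvariance(rates) / expected

case_c_clusters = [1] * 10         # every cluster reports 10%
case_d_clusters = [0] * 9 + [10]   # nine report 0%, one reports 100%

ratio_c = clumping_ratio(case_c_clusters, 10)
ratio_d = clumping_ratio(case_d_clusters, 10)
print(f"Case C ratio: {ratio_c:.1f}")
print(f"Case D ratio: {ratio_d:.1f}")
```

In Case C the clusters are perfect clones and the ratio is zero. In Case D the ratio comes out at about 10, the cluster size, which is another way of saying that each 10-house cluster is worth only one sample.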
I hope I have demonstrated that the principal weakness of cluster sampling is its sensitivity to the degree of evenness (homogeneity) in the distribution of the phenomenon in the total studied population. As the degree of unevenness (heterogeneity) increases, so too does the inaccuracy of cluster sampling.
Cluster sampling is a particularly poor choice for studying phenomena that affect entire areas, and especially when a single incident of a phenomenon might cover the entire cluster.
When evaluating any particular study that uses cluster sampling we can ask four questions to help us decide whether the study returned a usefully accurate result or not:
1) Was the distribution of the phenomenon studied more even (homogeneous) or uneven (heterogeneous)?
2) How close did the actual data within each cluster come to comprising a microcosm of the entire population? (See Rule 2 above.)
3) Did each cluster report similar results? (See Rule 3 above.)
4) Does cluster sampling have a proven track record of accurately measuring this phenomenon? Do we have any means independent of the study itself for assessing its accuracy?
I will apply these four questions to the LIMS study in part two.
[Note: To keep the discussion focused on the general theory of cluster-sampling, I reserve the right to delete any comments that go off-topic.]