1.0 Intro
Even though the lead author of the Lancet Iraqi Mortality Survey (LIMS) [free req reg], Les Roberts, has himself repeatedly described cluster sampling as an “imprecise” method of studying violence from airstrikes, many continue to argue that the quality of the LIMS falls just short of divinely revealed truth. Since I and others regard cluster sampling as the primary methodological weakness of this study, I thought I would explain in detail why cluster sampling is unsuited to measuring the incidence of military violence in Iraq.
This post is part one of two. In part one, I will examine the theory of cluster sampling and explain under what conditions it will or will not produce accurate results. In part two, I will apply the concepts explained in part one to the LIMS itself.
Those who want a more technical explanation of the weaknesses of cluster sampling should check out Stubbs and Zellan’s “Cluster Sampling: A False Economy”.
The extended post is rather long and contains graphics. If you have a slow connection be prepared for an unusually long load time.
2.0 What is Cluster Sampling?
Cluster sampling is a convenient method for taking samples in a statistical study. In a random-sample study, each individual sample is chosen separately and has no connection, physical or otherwise, to any other sample save by sheer chance. A cluster-sample study, by contrast, groups individual samples together, usually on the basis of physical proximity. Each grouping of samples is a “cluster.”
Clustering is done not for any methodological or mathematical reasons but for practical logistical ones. It is easier to reach individual samples if they are physically grouped together than if they are individually scattered across the entire population. For example, a nationwide household study conducted with random sampling would randomly select 900 individual households from all the households in the country. Except by chance, none of the households would be physically adjacent to each other. By contrast, a cluster-sample study of the same sample size would randomly select 30 points across the country and then sample each of the 30 houses physically adjacent to each point. Each individual household would always be physically proximate to 29 other samples. In the random-sample study, the researchers would have to travel separately to each of the 900 individual households. In the cluster-sample study, the researchers would only have to travel to 30 individual neighborhoods and then walk door-to-door to sample 30 adjacent households.
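To make the logistics concrete, here is a minimal Python sketch of the two selection schemes. The household count and the simplification that “adjacent” means consecutive IDs are my own illustrative assumptions; only the 900/30 figures come from the example above.

    import random

    N_HOUSEHOLDS = 90_000  # hypothetical national household list

    # Random sampling: 900 independent draws from the whole country.
    random_sample = random.sample(range(N_HOUSEHOLDS), 900)

    # Cluster sampling: 30 random starting points, each bundled with its
    # 29 "neighbors" (adjacency crudely modeled as consecutive IDs;
    # overlapping clusters are possible in this toy version).
    starts = random.sample(range(N_HOUSEHOLDS - 30), 30)
    cluster_sample = [s + i for s in starts for i in range(30)]

    print(len(random_sample), len(cluster_sample))  # 900 households either way

The sample sizes are identical; only the number of independent random choices differs: 900 in the first scheme, 30 in the second.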
3.0 Limitations of Cluster Sampling
Because the individual samples in a cluster sample are not chosen in a wholly random manner, cluster sampling can produce inaccurate results for some types of phenomena. Reducing the randomness of the sampling reduces what statisticians call the “power” of a study. The power of a study is a statement of its ability to accurately measure the studied phenomenon. The power of a cluster-sample study is dependent on the degree of evenness or unevenness in the distribution of the studied phenomenon in the overall population. (The technical terms for even and uneven are homogeneous and heterogeneous respectively. I’ll stick with the more common terms.) If the distribution is highly even (homogeneous) then a cluster-sample study will be almost as accurate as a random-sample study of the same size (same total number of samples). If the distribution is highly uneven (heterogeneous) then a cluster-sample study will be much more inaccurate than a random-sample study of the same size.
Uneven distributions destroy the power of cluster-sample studies because they increase the likelihood that all the individual samples within a particular cluster will return similar values. In the real world, the samples usually return similar values because the studied phenomena affect entire areas. For example, in the household survey above, if you used cluster sampling to study housing prices, percentage of sewer hookups vs. septic tanks, or rainfall over the past year, every household within each cluster would return a similar value because all of those phenomena affect the entire area of the cluster. Sampling one household in each cluster gives you as much information as sampling all 30 households. Such a study would effectively have only as many samples as it had clusters. A random-sample study, by contrast, would effectively have as many samples as it had individual samples. The cluster-sample study would have only 30 effective samples versus 900 for the random-sample study.
Key Concept: Cluster sampling is an inaccurate methodology for measuring phenomena that affect areas as opposed to individuals. Phenomena that affect entire areas can affect multiple samples within each cluster, thereby destroying the statistical power of the study.
So, when deciding whether cluster sampling will be a good tool for measuring a particular phenomenon we should ask two related questions:
1) Is the known or suspected distribution of the phenomenon even or uneven?
2) Does the phenomenon likely affect individual samples or does it affect entire areas?
4.0 Simple Cases
To understand how and when cluster sampling fails let us look at some super-simple examples.
First, a few caveats. These examples don’t correspond to anything in the real world. They are mathematically valid, but you won’t find anything like them outside a basic textbook. They largely ignore the concept of deviation, a key concept in statistics but one that many people find confusing, so you won’t see “10% plus or minus 2%” or the like. For clarity, the examples use only one cluster, which a real-world study would never do. Despite these limitations, these examples accurately demonstrate the conditions under which cluster sampling is considered either an accurate method or an inaccurate one.
(Note: If you find one of the descriptions of a simple example inaccurate, try putting an adjective like “simplistically”, “crudely” or “ideally” in front of it and see if that fixes it.)
Some terms: Instance refers to an occurrence of the studied phenomenon. Incidence or incident rate refers to the percentage of the population that exhibits the phenomenon. So if an example samples an instance once out of 10 samples it would report an incidence of 10%.
Imagine that you have a small town of 100 households divided into 10 streets of 10 houses each, forming a 10×10 grid. Each left-to-right row is a street. Suppose you have some phenomenon that actually affects 1 in 10 houses, for an incidence of 10%. (You don’t know that when you design, conduct and evaluate the study, but we will use the god’s-eye view here for clarity.)
Key Concept: In cases A through D, a random sample will return an incidence of 10% regardless of how evenly or unevenly the incidents group together, but a cluster sample will return an incidence varying from 0% to 100% depending on the evenness of the distribution.
4.1 Case A: Random Sample of a Randomly Even Distribution
Fig 1 shows Case A, a purely random sample of 10 houses when the actual distribution is random and even.
The Counting View is the results researchers would get if they visited every house and checked for the phenomenon (statisticians call this a census). It’s the best picture of the phenomenon possible. The white squares represent houses without the phenomenon. Red diamonds represent houses with the phenomenon. The Sample View shows the view that emerges when one randomly samples 10 houses in the town. The black squares represent the unsampled and thus unknown houses. The Result View shows the general view of the phenomenon that the study returns.
The study selects 10 out of 100 houses entirely at random. Two adjacent houses will both be sampled only by rare chance. If repeated many times, the study will produce an incidence of 10% the vast majority of the time. As a bonus, the random sampling will also provide information about the actual distribution of the phenomenon.
Logistically, conducting this study would involve traveling all over town to visit each individual sampled house. In order to save shoe-leather and time, a researcher might decide to randomly select 1 street, assume that that 1 street is representative of the remaining 9, and then just walk down that street interviewing every house on that street. That street is a cluster. Whether this methodology would produce accurate results or not would depend wholly on the actual distribution of the phenomenon relative to the streets. If the phenomenon is scattered all over town, cluster-sampling will work. If the phenomenon occurs only on particular streets, it will fail.
4.2 Case B: Cluster Sample of a Randomly Even Distribution
Fig 2 shows Case B, which has the same distribution as Case A in Fig 1.
The Sample View shows a representative cluster composed of all the houses on a single street.
The major difference between this cluster sample and the previous random sample is that, while the random sample had 10 random choices, 1 for each of the 10 houses, the cluster sample has only 1 random choice, the selection of 1 out of 10 streets. The cluster sample does not select each individual house in a purely random manner. Whether a house is sampled depends on whether it is on the same street as another house that is also sampled. Decreasing the randomness of the sampling immediately reduces the predictive power of the study.
In this case, it doesn’t matter much. The cluster-sample study will return results close to those of the random-sample study. 80% of the time the cluster will fall on a street like street 3, which has 1 instance, and will return an incidence of 10%. However, 10% of the time the cluster will fall on street 5, which has zero instances, and will return a 0% incidence; and 10% of the time it will fall on street 6, which has two instances, and so will return an incidence of 20%. If the study were repeated many times it would consistently return values close to those of the random-sample study. Sometimes it would be way off, but the tradeoff of decreased accuracy for time and resources saved would be acceptable.
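(Averaged over many repetitions, that works out to 0.8 × 10% + 0.1 × 0% + 0.1 × 20% = 10%: the cluster estimate is still centered on the true incidence, it just varies more from run to run than the random-sample estimate does.)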
4.3 Case C: Cluster Sample of a Perfectly Even Distribution
Fig 3 shows Case C, a cluster-sample of a non-random evenly distributed phenomenon.
In Case C, each street has exactly one instance of the phenomenon. In this case, cluster sampling would actually produce an accurate result more consistently than would random sampling. No matter which street was selected it would sample the phenomenon 1 time in 10 and would always return an incidence of 10%. Random-sampling would return the same results as in Case A, because random-sampling is immune to the effects of distribution.
Such super-even distributions rarely occur in the real world, so cluster sampling is rarely more accurate than the equivalent random sample. Nevertheless, this example shows how important even distribution is to the accuracy of cluster sampling.
4.4 Case D: Cluster Sample of a Perfectly Uneven Distribution
Fig 4 shows case D, a cluster sample of a perfectly uneven distribution. The incidence is still 10% but all ten individual instances occur on one street.
As in all previous cases, a random sample returns the same results regardless of the distribution, but a cluster-sample study is rendered completely useless. 9 out of 10 times the cluster will fall on a street with zero instances (Fig 4-B) and return an incidence of 0% (Fig 4-C). 1 out of 10 times it will fall on street 5 (Fig 4-D), with 10 instances, and return an incidence of 100% (Fig 4-E). The cluster-sample study will have the same effective power as a random sample that took only 1 individual sample.
Realistically, it is best to think of Cases C and D as the extreme ends of a spectrum of distributions running from perfectly even to perfectly uneven. Cluster sampling becomes increasingly inaccurate as the unevenness of the distribution increases. Where a real-world study falls on this spectrum is keenly important to evaluating both its design and its results.
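For readers who prefer to check such claims by brute force, here is a toy Python simulation of the 10×10 town under the Case C and Case D distributions. The house-numbering layout is my own assumption; the numbers are not data from any real study.

    import random

    def estimate(town, method):
        """Sample 10 of the 100 houses and return the estimated incidence."""
        if method == "random":
            picked = random.sample(range(100), 10)
        else:  # "cluster": one whole street of 10 houses
            street = random.randrange(10)
            picked = range(street * 10, street * 10 + 10)
        return sum(town[i] for i in picked) / 10

    # Case C: perfectly even -- one affected house on each street.
    even_town = [1 if i % 10 == 0 else 0 for i in range(100)]
    # Case D: perfectly uneven -- all ten affected houses on street 5.
    uneven_town = [1 if 50 <= i < 60 else 0 for i in range(100)]

    for town, name in ((even_town, "even"), (uneven_town, "uneven")):
        for method in ("random", "cluster"):
            runs = [estimate(town, method) for _ in range(10_000)]
            print(name, method, min(runs), max(runs), sum(runs) / len(runs))

On the even town, the cluster method returns 10% on every single run; on the uneven town it returns only 0% or 100%, exactly as described above, while the random method stays centered near 10% in both cases.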
4.5 Case E: Cluster Sample of a Rare Instance
One common misconception regarding cluster sampling is that the method under-samples rare instances. This is not true. Any given cluster-sample study is just as likely to sample any individual instance as is a random-sample study of the same size. However, when it fails to sample the rare instance, the cluster-sample study will severely underestimate its incidence, and when it does sample it, it will severely overestimate its incidence.
Case E in Fig 5 shows how this works.
The Counting View shows a single instance of a phenomenon in a population of 100 houses. A random sample of 10 houses would have roughly a 10% chance of sampling the phenomenon. Much of the time, a random-sample study would miss the phenomenon completely. A cluster sample of the same size would also have a 10% chance of selecting the street on which the single instance occurred. 90% of the time it would miss sampling the phenomenon entirely (Fig 5-B) and return an incidence of 0% (Fig 5-C). 10% of the time it would sample the event (Fig 5-D) and would return an incidence of 10%.
The extremely high or extremely low incidence arises because cluster-sampling methodology assumes that the incidence within the cluster is the incidence within the total population. If the incidence within the cluster is zero, cluster sampling returns an extremely low estimate of 0% incidence. If the incidence within the cluster is 10%, cluster sampling returns an overestimate of 10% incidence for the entire population. By contrast, a random-sample study will never return an overestimate because it can never sample the rare event more than once. (A real-world random sample would also virtually never oversample.)
For cluster sampling, rarity is just a form of uneven distribution. It’s not the rarity per se that causes the inaccuracy, but rather the increased likelihood that the rare phenomenon will only be sampled in a minority of clusters. Cluster sampling amplifies deviation, making a rare phenomenon look either rarer or more common than it truly is. It is impossible to tell from any single study whether this has happened or to what degree. Only multiple studies or information from different methods can reveal the true picture.
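A quick arithmetic check of Case E makes the “amplified deviation” point precise: the cluster study returns 0% with probability 0.9 and 10% with probability 0.1, so its long-run average is 0.9 × 0% + 0.1 × 10% = 1%, exactly the true incidence of 1 in 100. The method is not biased against the rare event; it is simply wildly variable, returning far too little or far too much on any single run.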
4.6 Case F: Cluster Sample of a Common but Uneven Distribution
You can see the negative effects of an uneven distribution more clearly in Case F in Fig 6, which shows the cluster sampling of a phenomenon with a counted incidence of 50% but with a completely uneven distribution.
A cluster-sample study will either sample streets 6-10 (Fig 6-B) and report an incidence of 0% (Fig 6-C) or it will sample streets 1-5 (Fig 6-D) and report an incidence of 100% (Fig 6-E). A random-sample study will report an incidence of 50% almost all of the time.
5.0 Multiple Clusters
The above simple cases all used one cluster each, for clarity, so one might ask whether having multiple clusters in a single study changes the results significantly. It does somewhat, but the basic problem with uneven distributions persists. The more clusters you have and the fewer samples in each cluster, the more like a random sample the cluster sample becomes and the less sensitive it is to uneven distributions. (You could think of a random-sample study as a cluster study where each cluster holds only one sample.) Conversely, the greater the degree of clustering, the more sensitive the study is to the effects of uneven distributions.
The standard for household surveys is to use 30 clusters of 30 houses each (30 is a magic number in statistics). Using 60 clusters of 15 houses each would make a study less sensitive to distribution and thus potentially more accurate. Using 15 clusters of 60 houses each would make it more sensitive to distribution and thus potentially less accurate.
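A toy Python simulation suggests how that tradeoff plays out; the patchy 9,000-house town and its street-level incidence are invented purely for illustration.

    import random
    import statistics

    # Hypothetical town: 9,000 houses on 300 streets of 30; the phenomenon
    # is patchy, affecting every house on 30 randomly chosen streets (10%).
    hot_streets = set(random.sample(range(300), 30))
    town = [1 if s in hot_streets else 0 for s in range(300) for _ in range(30)]

    def cluster_estimate(n_clusters, houses_per_cluster):
        """Sample runs of adjacent houses and return the estimated incidence."""
        hits = 0
        for _ in range(n_clusters):
            start = random.randrange(len(town) - houses_per_cluster)
            hits += sum(town[start:start + houses_per_cluster])
        return hits / (n_clusters * houses_per_cluster)

    for n, m in ((30, 30), (60, 15), (15, 60)):
        runs = [cluster_estimate(n, m) for _ in range(2_000)]
        print(f"{n} clusters of {m}: sd = {statistics.stdev(runs):.3f}")

With the same 900 total samples, the 60-cluster design shows the smallest spread around the true 10% and the 15-cluster design the largest, matching the intuition above.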
It is easy to see that if a single instance of a phenomenon can affect every sample within a cluster, then the statistical validity of the samples within the cluster is destroyed. The samples are no longer 30 individual samples of 30 instances of the phenomenon, but are instead collectively just one sample of a single incident.
6.0 Correcting for Cluster Errors
Can statistical methods correct for the inaccuracies introduced by cluster sampling? In some cases, yes. If you are studying a phenomenon whose distribution is well known from previous experience, then you can use statistical methods to assess the study’s results more accurately: you can adjust the incidence rate based on the known distribution. If, however, you don’t have any external information, you can’t do that. If you are using cluster-sampling to study a phenomenon that has never been measured before, then you cannot use statistical tools to adjust the results with any expectation of accuracy.
Statistical sampling methods are just like any other kind of scientific instrument in that they must be calibrated against known results. If you are measuring a phenomenon for the first time, you will never know for sure if your results are accurate until they are reproduced, preferably using a different method altogether. Only when we have a body of experience using different methods to measure a particular phenomenon can we confidently say that a given study has returned accurate results. Until then, we’re just guessing.
7.0 Clustering Rules
We can formulate three rules for evaluating, in a particular circumstance, whether cluster sampling is likely to be an accurate methodology or whether a study that used cluster sampling produced an accurate measurement.
7.1 The Even Distribution Rule
The first rule applies primarily to the design stage of a study, when a sampling methodology is chosen.
Rule 1: The Even Distribution Rule. The more evenly distributed (more homogeneous) a phenomenon is across the entire studied population, the more accurate cluster sampling will be. Conversely, the more unevenly distributed (more heterogeneous) the phenomenon is across the entire studied population, the less accurate cluster sampling will be.
If the phenomenon is known or suspected to be evenly distributed then cluster sampling is probably an acceptable compromise, but if the phenomenon is known or suspected to be unevenly distributed then cluster sampling would be a poor choice for a methodology and would be more likely than not to produce highly inaccurate results.
A phenomenon that affects areas or contiguous groups of samples will be much more likely to have an uneven distribution than a phenomenon that affects only individual samples.
7.2 The Microcosm Rule and the Cluster Clone Rule
There are two rules which can be applied to data that a cluster-sampling study has collected, to try to assess the accuracy of the study. Both rules seek to measure the degree of variation in the distribution as uncovered by the study itself.
Ideally, clusters should be “microcosms” of the whole population: as internally heterogeneous as possible (i.e., containing the full range of variability within them) and as externally homogeneous as possible (i.e., as similar to each other as they can be). The greater the extent to which this is fulfilled, the closer the design effects will be to 1.00, and hence the lower the price paid for reducing traveling cost and time.
Rule 2: The Microcosm Rule: The distribution of the phenomenon within any individual cluster should mirror the distribution of the phenomenon within the larger population. Each cluster should be a statistical scale-model of the entire population. This happens naturally when the distribution of the phenomenon is even in the total population.
Rule 3: The Cluster Clone Rule: Ideally, all of the clusters should return similar values. One could duplicate any particular cluster and substitute the duplicated value for the value of any other cluster without significantly altering the results. Each cluster is a duplicate or clone of each of the others. The more similar the results returned by each cluster, the more evenly distributed the phenomenon was in the total population.
You can see both of these rules working in the extreme Cases C and D above. Imagine that we have a study of 10 towns of 100 houses each, where each town is just like the 10×10 town in our sample cases. We do a cluster-sample study with 10 clusters, one for each town. If the distribution is extremely even, as in Case C, then each cluster will follow the Microcosm Rule and return an incidence identical to that of the total population. Each cluster will also be identical to each of the 9 other clusters. If the distribution is like that in Case D, however, none of the clusters will look like a microcosm of the total population; clusters will return widely differing results, either an incidence of 0% or an incidence of 100%. Duplicating the value of one cluster and substituting it for another cluster would radically change the overall results.
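The Cluster Clone Rule in particular reduces to a few lines of arithmetic. The cluster-level numbers below are just the hypothetical Case C and Case D results, not data from any real study.

    import statistics

    # Incidence reported by each of the 10 clusters in the two extreme cases.
    case_c = [0.10] * 10        # Case C: every cluster reports 10%
    case_d = [0.0] * 9 + [1.0]  # Case D: nine report 0%, one reports 100%

    for name, clusters in (("C", case_c), ("D", case_d)):
        mean = statistics.mean(clusters)
        spread = statistics.pstdev(clusters)  # between-cluster spread
        print(f"Case {name}: mean = {mean:.2f}, spread = {spread:.2f}")

Both cases report the same overall mean of 0.10, but Case D’s large between-cluster spread is the signature of an uneven distribution: the clusters are nothing like clones of one another.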
8.0 Conclusion
I hope I have demonstrated that the principal weakness of cluster sampling is its sensitivity to the degree of evenness (homogeneity) in the distribution of the phenomenon in the total studied population. As the degree of unevenness (heterogeneity) increases, so too does the inaccuracy of cluster sampling.
Cluster sampling is a particularly poor choice for studying phenomena that affect entire areas, and especially when a single incident of a phenomenon might cover the entire cluster.
When evaluating any particular study that uses cluster sampling we can ask four questions to help us decide whether the study returned a usefully accurate result or not:
1) Was the distribution of the phenomenon studied more even (homogeneous) or uneven (heterogeneous)?
2) How close did the actual data within each cluster come to comprising a microcosm of the entire population? (See Rule 2 above.)
3) Did each cluster report similar results? (See Rule 3 above.)
4) Does cluster sampling have a proven track record of accurately measuring this phenomenon? Do we have any means independent of the study itself for assessing its accuracy?
I will apply these four questions to the LIMS study in part II.
[Note: To keep the discussion focused on the general theory of cluster-sampling, I reserve the right to delete any comments that go off-topic.]
Shannon, I’ve tried to explain this before, but this time you’ve made the central misunderstanding cleanly in one paragraph so maybe we can get it cleared up:
One common misconception regarding cluster sampling is that the method under-samples rare instances. This is not true. Any given cluster-sample study is just as likely to sample any individual instance as is a random-sample study of the same size. However, when it fails to sample the rare instance, the cluster-sample study will severely underestimate its incidence, and when it does sample it, it will severely overestimate its incidence.
So we are agreed that:
1. If the sample does not sample the rare instance[1], it will underestimate the effect
2. If the sample does sample the rare instance, then it will overestimate the effect.
[1] Or, in a survey that uses multiple clusters, samples it an insufficient number of times.
We also know that
3. Because this is a *rare* effect, any given sample is more likely not to sample it than to sample it.
Therefore, a clustered sample with any number N of points is more likely to underestimate a clustered, rare effect than a random sample with the same number of points. Brad DeLong has a very nice chart demonstrating this. The mistake you’re making is assuming that a sample is equally likely to hit or miss an affected member of the population; if that were the case, then the “rare” effect wouldn’t be rare!
I also think it might be a mistake to try to reason exclusively from limiting cases rather than constructing a few more realistic cases. Look at your final paragraph. What you’re describing here is a situation in which you’re taking a sample of 100 houses from 1000, but gaining only the power of a test sampling 10 houses from 1000. In this case the design effect would be 10.0 (as opposed to 2.0 in the Iraq data without Fallujah and 29.3(!) with Fallujah). Clearly, 10 houses from 1000 is too small a sample, but this design effect is so big because there is zero within-cluster variance and substantial between-cluster variance. The Iraq data didn’t show such big effects.
If we look at this example, we can see how the “undersampling” effect exists. If you sample 10 clusters of 10 houses from 1000 in a population like your Case D, then by my (almost certainly erroneous, and assuming you are sampling with replacement, which you wouldn’t be) calculation, you have the following chances:
0.9^10 = 34.9% chance of sampling zero (undersample);
((0.9^9)*0.1) = 3.87% chance that any one cluster lands on the red dots and none of the others do;
10 ways that this can happen, so a 38.7% chance of sampling exactly one cluster;
and therefore a 73.6% chance that you will either undersample or sample correctly[2]. So the odds of an overestimate, in this limiting worst case, with 10 clusters, are about three to one against.
[2] I damn well know I’ve made some sort of calculation error here, because I always do, but I can’t be bothered working out what it is.
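(For what it’s worth, the with-replacement arithmetic above can be checked in a few lines of Python, and it comes out essentially as stated:)

    from math import comb

    p = 0.1  # chance that any one cluster lands on the red dots
    none = (1 - p) ** 10                          # no cluster hits: ~34.9%
    exactly_one = comb(10, 1) * p * (1 - p) ** 9  # one cluster hits: ~38.7%
    print(none, exactly_one, none + exactly_one)  # ~0.349, ~0.387, ~0.736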
To help you out with your next post, my answers to your questions concerning the Iraq survey are:
1. Excluding the Fallujah cluster, the dataset was acceptably homogeneous. It had a design effect of 2.0, and a very similar pattern of increasing mortality was seen in all governorates except those that had formed part of the Kurdish semi-free zone before the war. Including the Fallujah data, the design effect was much larger, which is why the main analysis was performed with this cluster removed.
2. Again, pretty well, apart from the Fallujah cluster. The map on page 5 of the study shows this quite well.
3. Yes, apart from Fallujah and the Kurdish clusters. In both cases, there were very good reasons to expect, ahead of time, that Fallujah and the Kurdish sector would report different results from the rest of Iraq. The survey team could have been forgiven for simply not sampling Fallujah and saving themselves a lot of grief, but they were apparently too honest.
4. The track record is short, but by no means nonexistent. In the particular case in which an attempt was made to check a clustered versus a random sample (Les Roberts’ work in the Democratic Republic of Congo), the results matched pretty well, despite the fact that violent deaths in the DRC were just as heterogeneous and somewhat rarer than total excess deaths in Iraq.
You also need to be asking a fifth question:
5. In the case of any particular study, is there any reason to believe that the calculated and quoted design effects and standard errors do not correctly summarise the results?
Interesting piece on the shortcomings of cluster randomization and bootstrapping from a skewed distribution here.
dsquared,
Thank you for your cogent comments; however, I think you are missing some important subtleties.
First, the problem with cluster-sampling isn’t that it under- or over-samples. A cluster study is just as likely to sample any given instance as a random-sample study. However, after the sampling is done and it’s time to crunch the numbers, the rules of cluster-sampling will almost always produce a much more extreme underestimation or overestimation than the random-sample study would.
The main point is that the method is inaccurate for uneven distributions. Whether it is more likely to be inaccurate on the low side or the high side isn’t particularly relevant when designing a study.
The fact that cluster-sampling will force the overestimation of rare phenomena it does sample is very important to a multivariate study like the LIMS. Phenomena that pile up in a minority of clusters will produce overestimations. I’ll write more about that later.
I am glad we appear to have the basic ground rules laid down though.
Instapundit points to Ali’s Free Iraqi – I was not living before the 9th of April and now I am, so let me speak! site. His anniversary post is irrelevant to Shannon’s dissection of clusters & statistics. It is not, I think, irrelevant to Shannon’s larger argument:
Two years since I finally became
The man in me, and the kid in me.
And “they” want to take this away?
“They” would have to kill them both first
The man and the kid
And turn the clock back around
And still “they” can’t change me back
[Update From Shannon: I really should bounce this comment for being off topic but I really like the poem]
I still think you need to take on board question 5; why do you think that this uncertainty is greater than that captured by the design effects and quoted standard errors?
dsquared,
“why do you think that this uncertainty is greater than that captured by the design effects and quoted standard errors?”
You cannot calculate design effect or standard error without having independent information about the actual underlying distribution. (See section 6.0 above.)
For example, if you measure the height of adults in the general population your standard error will be badly off if you assume a normal distribution. Adult height has a bimodal distribution with one peak for women and one peak for men. (update: originally mistakenly wrote binomial instead of bimodal.)
From Stubbs and Zellan:
“The extent to which the statistical reliability is made better or worse than that of a similar-sized SRS is known as the ‘Design Effect’ (DE), and can be defined as the ratio of the actual sample size to the effective sample size.”
So, Design Effect = Actual Samples / Effective Samples.
Consider a cluster-sample of two phenomena: the percentage of lottery players who have won a lottery, and rainfall in the last 30 days. You sample 30 clusters of 30 houses each.
In the case of the lottery, if your next-door neighbor wins the lottery, that in no way influences the odds that you have also won the lottery (assuming you both play). One could assume that the distribution of people who have won the lottery is even, so in a cluster-sample study each individual household in the cluster would function as an individual sample. So, the design effect would be:
DE = 900 actual samples / 900 effective samples = 1.0
With rainfall, however, if your next-door neighbor gets 6 inches of rain, you probably did too, and so did the other 28 adjacent houses. Each cluster would effectively be just one sample. The design effect would be:
DE = 900 actual samples / 30 effective samples = 30.0
You have to know what the underlying distribution is in order to calculate the Design Effect. For example, if, unknown to the researchers, it was illegal in half the areas sampled to win the lottery, then the calculation of the design effect would be way off. 15 of the clusters would return the same value for each sample (zero), making each of those clusters effectively a single sample (15 effective samples, plus 15 × 30 = 450 from the legal areas), so the actual design effect would be:
DE = 900 actual samples / 465 effective samples ≈ 1.94
We use statistics to measure so many well-understood phenomena that we forget that someone in the past had to thrash around, experimenting by trial and error, to find the actual underlying distributions.
Without such experimental knowledge of the actual distribution, all our calculations are just guesswork based on assumptions about the underlying distributions.
In my previous post the sentence:
“Adult height has a binomial distribution with one peak for women and one peak for men.”
should have read:
“Adult height has a bimodal distribution with one peak for women and one peak for men.”
Quite right: one can assume an underlying distribution where, say, Fallujah is indeed representative of “high violence” or “high bombing” areas, or one where it’s clearly unrepresentative. Likewise for the two bombing incidents outside of Fallujah, which are responsible for all the coalition-caused bombing deaths of women and children outside of Fallujah (and a significant proportion of the claimed increase in infant mortality!).
Hats off all around for this issues-focused and enlightening discussion!
You can not calculate design effect or standard error without having independent information about the actual underlying distribution
I think you’re getting mixed up here; of course you can calculate the standard error and design effect, since these are just summary statistics of the data. What you need a distribution for is to convert those measures into confidence intervals. This is what modelling is all about; the use of the full information in your dataset to estimate parameters that make sense. I’m glad that you picked the example of height distributions because it shows what I mean; although we happen to know that men and women have different average heights, it turns out that the difference between the two means is small compared to the variance, so assuming a unimodal (indeed a normal) distribution of heights does not lead you into any errors that would make a practical difference; to any sensible number of decimal places, the distribution might as well be unimodal.
It is possible to estimate the design effect from your dataset (the explanation of this in the “False Economy” paper is not very good, but it’s there at the bottom of the page); it depends on the intracluster correlation, a measure of the relationship between the within-cluster and the between-cluster variance (link, also). In an ideal world, you’d do a separate pilot study to estimate the ICC, but there is nothing particularly wrong with taking the ICC from your sample, calculating the ICC observed and using it as a diagnostic; if you get a measured design effect of 29.3 (as with the Iraq data cum-Fallujah) then there is something clearly wrong and you need to see whether this is the result of a “wild observation” or whether it’s a general property of the dataset.
If, on the other hand, the measured effect is 2.0 (as in the Iraq data once the Fallujah cluster is removed), then this is telling us that the within-cluster variance is large compared to the between-cluster variance (we can tell this from a glance at the dataset, btw; all the cluster averages plot on more or less the same axes, but within the clusters there is substantial variance between the majority of households who had no deaths and the small number that had some). This tells us that the clustering is unlikely to have reduced the effective sample size by too great an amount, and that it is most likely safe to make the usual distributional assumptions when calculating confidence intervals.
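To make that concrete, here is a rough method-of-moments sketch of reading a design effect off a clustered dataset via the intracluster correlation, DE = 1 + (m − 1) × ICC. It assumes equal-sized clusters and is my own simplification, not the estimator actually used in the LIMS or in the “False Economy” paper.

    import statistics

    def design_effect(clusters):
        """Crude DE = 1 + (m - 1) * ICC estimate from equal-sized clusters."""
        m = len(clusters[0])                # samples per cluster
        flat = [x for c in clusters for x in c]
        total = statistics.pvariance(flat)  # between- plus within-cluster variance
        between = statistics.pvariance([statistics.mean(c) for c in clusters])
        sigma_b2 = max(0.0, (m * between - total) / (m - 1))
        icc = sigma_b2 / total if total else 0.0
        return 1 + (m - 1) * icc

    # Lottery-style data (wins scattered thinly): DE near 1.
    lottery = [[0] * 29 + [1] for _ in range(30)]
    # Rainfall-style data (clusters all-wet or all-dry): DE near 30.
    rainfall = [[1] * 30] * 15 + [[0] * 30] * 15
    print(design_effect(lottery), design_effect(rainfall))  # -> 1.0, 30.0

A measured DE near 1 says the clustering cost you little; a DE near the cluster size says each cluster was effectively a single observation.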
Just to put this into concrete terms, look again at your lottery example:
For example, if, unknown to the researchers it was illegal in half the areas sampled to win the lottery then the calculation of the design effect would be way off.
You would be absolutely right on this if the researchers had just calculated the design effect a priori and never checked whether their data bore out the assumption. However (assuming that this was a lottery with a lot of prizes so there is no problem with the overall sample size!), if they did check, they would find that this issue showed up in the data. They would, in particular find that the between-cluster variance was high relative to the within-cluster variance (because the clusters where the lottery was illegal would have zero within-cluster variance but would contribute substantially to the between-cluster variance). In other words, if they checked the data, they would find that their original assumption of a 1.0 design effect was incorrect.
If they then decided to throw out the clusters in which winning the lottery was illegal (assuming, slightly ridiculously but for the sake of analogy to Fallujah, that these were such big contributors to the between-cluster variance that they could be identified), then it would actually be legitimate to deal with the remaining data as if it had a design effect of 1.0 (or more accurately, the dataset would in this case have an empirical DE close to 1.0). Again, it’s very difficult to make categorical statements about “this cannot be done” or “this is not legitimate”; different datasets allow for different degrees of precision in the statements you can make about them and my assessment is that the Iraq dataset supported the statements made about it.
I suspect that this explanation is not the model of lucid clarity that I had in mind, but I think it covers the important points. I really think that you’re trying to reinvent the wheel here and falling into a lot of the pitfalls set by sampling theory to trap the innocent. Sampling theory is hellish stuff (think of the Birthday Paradox; the whole subject is like that), which is why anyone who wants to get anything done in applied work tries to leave as much of it as possible to the software designers.
By the way, just to show I’m not religious about the subject, here’s a couple of links to a case in which the clustering effect made a difference (although on the specific issue of state-level clustering, the National Academy of Sciences report concluded that Lott’s original approach was right)
dsquared,
Don’t take this the wrong way, but you think more like an engineer than a scientist. You spend so much time working with long-proven methods that you don’t think a lot about how those methods got accepted as tried and true in the first place.
Let me state this explicitly: the mathematical validity of all statistical methods relies on assumptions about the underlying distribution.
All formal proofs in statistics begin with a statement of the true distribution, along the lines of “Assume N occurrences of S in P…”. In basic textbooks, the examples all start out with “assuming a standard deck of cards”, “assuming a fair coin or fair dice”, “assuming a thoroughly mixed urn of X black balls and Y white balls”, etc. These assumptions must be made because without them the mathematics is meaningless.
For example, magician’s card tricks and cheating at dice both rely on changing the underlying distribution without the knowledge of others.
In one variant of the “pick a card” trick, the magician fans out a standard deck of cards and shows it to the audience, which makes the audience assume that the odds of selecting any one card from the deck are 1 in 52. However, the magician then substitutes a second deck composed entirely of one card, from which the pigeon chooses. When the magician then correctly identifies the card chosen, it looks like he has “magically” beaten massive odds, because the audience’s calculation of the odds is based on an incorrect assumption about the distribution of the cards in the deck.
Cheating with loaded dice works the same way. Other players place bets based on the known distribution of fair dice, but the cheater makes his bets based on his unique knowledge of the distribution of the loaded dice.
Assumptions about distributions are everything in statistics.
Engineers, financial analysts, marketing researchers and the like don’t spend a lot of time thinking about underlying distributions in their day-to-day work because they spend most of their time applying pre-existing tools to general classes of problems. Scientists, on the other hand, are often tasked with figuring out what the distribution is in the first place.
An engineer would use the design effect as you describe, to assess the likely quality of the data obtained after a study has been done. To do so he would make standard implicit assumptions about the distribution. A scientist, however, would use the design effect to test assumptions about the underlying distribution by calculating how different distributions would produce different effects before the study was conducted.
All your statements depend on implicit assumptions about the underlying distribution. The data in isolation tells us nothing. Suppose I hand you a bag of 30 colored balls and tell you I sampled them from a population of unknown size and distribution. You have no other information. What can you statistically tell me about the size and distribution of balls in the population? Not much. You know it had at least 30 balls in whatever ratio of colors the balls in the bag were in, but that’s really it. You could make no other statements without making assumptions about the size and distribution of the population.
If a study method has no proven track record in measuring a particular phenomenon, then we can only evaluate its results by making assumptions about the underlying distribution. We might say that the results of a particular study resemble the results returned by studies of other phenomena, but that is hardly firm proof that the method worked.
Shannon, this ain’t right. The mathematics that underlies cluster sampling is the same mathematics that underlies sampling. If you look at a rigorous theoretical textbook (and “looking at” one is about all I am capable of at that level), then in all the key results you won’t see a specific distribution mentioned. What you’ll see is a whole load of results which are valid for any distribution so long as it meets a few very general conditions. I’m talking about conditions like “having a mean and variance”, not anything restrictive on the shape of the underlying density function.
And in social and epidemiological statistics in cross-section, we can be pretty confident that these conditions are satisfied. (If we were talking about 50-year GDP forecasts, I would have a lot of sympathy with your point and have written on this subject myself; I don’t believe there is good evidence that the conditions for valid statistical inference are met in that case. But this is different.) We know that the pre- and post-war death rates can’t be below zero or above 100%. We know that each sample we take will have a mean and a variance, and we have reason to believe, simply by looking around, that in most cases we are sampling the same sort of thing.
The most important results of the survey don’t make any specific assumptions about the underlying distribution; they just assume that it is a distribution to which the central limit theorem applies and one where the sampling distribution converges in the normal manner rather than a Polya urn or something weird like that. Note that both your “cheating” examples are exactly the sort of thing that would turn up in a repeated sample.
Again, look at your own example:
Suppose I hand you a bag of 30 colored balls and tell you I sampled them from a population of unknown size and distribution. You have no other information. What can you statistically tell me about the size and distribution of balls in the population? Not much. You know it had at least 30 balls in whatever ratio of colors the balls in the bag were in, but that’s really it. You could make no other statements without making assumptions about the size and distribution of the population.
This just isn’t true. I couldn’t estimate the size of the population, but that’s not relevant to this case; we know the population of Iraq. And a sample of 30 balls from a population is a perfectly valid way to estimate the proportion of different colours in most cases. I would have to ask a few questions about how you went about obtaining that sample, roughly equivalent to the “Methodology” section of a social statistics paper. If, however, you said that you’d picked five from the top, five from the bottom, five from the back, five from the front, five from the left and five from the right (and if there didn’t appear to be systematic differences between these six groups, which I might treat as six “samples” of the homogeneity of the underlying population), then I would say that it was most likely that this was a representative sample and in most cases I would be right.
So this looks like you’re heading in the direction of a “mere random sample” critique, which would be disappointing. Anything you say along these lines about cluster sampling would generalise to a critique of sampling in general, because it’s the same issue. And we know that sampling in general works.
Just to make this clear and to summarise in a couple of sentences: the general statement that a sample is likely to represent the population is not dependent on any specific distributional assumptions. It depends only on the assumption that the distribution is a member of a very general class of distributions, of which class almost all social phenomena have been members in the past.
(And to reiterate; the cluster sampling methodology for estimating deaths by violence has been crosschecked, in Les Roberts’ work in the Democratic Republic of Congo, and it gave similar results to other methodologies).
dsquared,
I think you are confusing two different concepts of “distribution”.
The first concept of distribution is the actual physical pattern in which the studied phenomenon is arrayed. For example, if I have the classic urn of colored balls, the physical balls within the urn have a specific physical distribution both in space and in color. Let’s call this the physical-distribution.
The second concept of distribution is the variance between measurements that are taken of the physical phenomenon. Let’s call this the measurement-distribution.
All statements about statistical validity rely on assumptions about the physical-distribution. If a method has been experimentally verified to accurately measure a phenomenon, then we know that the measurement-distribution will be representative of the physical-distribution to some acceptable degree. But without experimental knowledge of the physical-distribution the measurement-distribution tells us nothing.
If I just hand you a bag of 30 colored balls, you can make no mathematically valid statements about the total population. The 30 balls might be the entire population or they might be a subset. The population might be randomly mixed or it might be highly ordered. You can’t tell. This kind of problem confronts scientists all the time. (Codebreakers also face a similar problem.)
On the other hand, if I merely provide the information that the physical-distribution of balls in the parent population is evenly and randomly mixed then suddenly the 30 balls in the sack have enormous statistical significance and you could make very firm statements about the composition of the parent population.
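(For instance, under that assumption, if 12 of the 30 balls were red, the standard error of the 40% estimate would be roughly √(0.4 × 0.6 / 30) ≈ 0.09, or about 9 percentage points: a genuinely firm statement.)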
Highly uneven physical-distributions break cluster-sampling before an evaluation of the measurement-distribution even gets made. It takes a much larger number of samples clustered together to get the same statistical power as a much smaller random-sample. Assumptions about the underlying physical-distribution are absolutely critical to judging the likely accuracy of a cluster-sample study, especially if the methodology has no track record for a particular phenomenon.
Let’s call this the physical-distribution.
No. Let’s refer to “the population distribution” and “the sampling distribution”, rather than inventing our own private language, please?
But without experimental knowledge of the physical-distribution the measurement-distribution tells us nothing
If this were true, then there could be no such thing as epidemiology, social science, auditing, agricultural science or even astronomy. Your claim that only experimental science is science is just wrong. The only assumption that is being made here is that the very general conditions have been satisfied under which the sampling distribution converges to the population distribution.
The 30 balls might be the entire population or they might be a subset. The population might be randomly mixed or it might be highly ordered. You can’t tell
But this is a disanalogy. In all the cases in which we are interested, you have significant non-sample information; you know that there was a war in Iraq, you know that people die in wars, you know that sampling death rates is a reasonably good way to measure death rates, etc. In the absence of a reason to believe that the sampling distribution will misrepresent the population distribution, then we would normally believe that sampling is valid, because we have the entire history of sampling telling us that by and large, sampling works.
Highly uneven physical-distributions break cluster-sampling before an evaluation of the measurement-distribution even gets made.
No. Populations with concentrated effects are difficult to estimate using cluster sampling. But not impossible. And, at the danger of becoming a bore, there is no evidence to support your (I assume) claim that the distribution of excess deaths in Iraq was “highly uneven” outside the Fallujah clusters. It was actually pretty even across the sample. We have 31 samples of the between-cluster variance (33 clusters, minus Fallujah, minus one to give a degree of freedom) and they are low.
It takes a much larger number of samples clustered together to get the same statistical power as a much smaller random-sample
“Much”, in this context is measured by the design effect, which was 2.0 in the ex-Fallujah sample. And once more, “the same statistical power”, in this context, can only refer to the standard errors of the estimates; you are still short a basis for asserting that the true standard errors of the estimates are different from the ones published.
Assumptions about the underlying physical-distribution are absolutely critical to judging the likely accuracy of a cluster-sample study
But only very, very weak assumptions like “the population distribution must have a mean and variance”; the standard assumptions of convergence that underlie all sampling. If the between-cluster variance is not there in 32 samples, then there is no reason for believing it is there in the population. You are massively underestimating the difficulty of getting a sample where 30 clusters out of 33 saw a rising death rate, if the underlying death rate did not rise. This would be a very unusual event for any population-distribution that I can think of. Remember that underlying heterogeneity cannot push the estimate down all that much, because we know that the population distribution is bounded; it is entirely possible for there to be heterogeneous clusters where the relative risk ratio is 2x, 3x or 12x, but since you can’t have a negative number of deaths, the negative outliers can’t be any lower than -1x.
especially if the methodology has no track record for a particular phenomenon
The methodology does have a track record and a very long one. You seem here to be appealing to a massive qualitative difference between excess deaths in a war zone and other phenomena that are measured accurately by cluster sampling; epidemics are often quite heterogeneous. There is no evidence why this might be the case; even more strongly, there is no reason at all to believe that the population distribution of deaths due to war would be so perverse as to regularly produce samples in which the death rate rose when it had actually fallen. In any case, you keep asserting that this methodology has “no track record”, but I don’t see any real literature search here. I’ve presented a number of references in which the cluster sampling methodology has provided valuable information and one in which it successfully cross-checked against a random sample.
By the way, there’s now a paper out in the New England Journal of Medicine, which surveyed US forces returning from Iraq and found that 13.5% of the survey respondents in a brigade of the 3rd Infantry Division reported themselves as “having been responsible for the death of a noncombatant”. This result would, by my calculations, be consistent with about 30k civilian excess deaths caused by coalition forces in the eight months after the invasion.
If the between-cluster variance is not there in 32 samples, then there is no reason for believing it is there in the population
Erratum: of course, the massive variance in the 33rd cluster would be such a reason; but it is also a reason for believing that we are dealing with two populations rather than one, and that the remaining 32 clusters describe a population with different physical and statistical properties from the population from which the Fallujah cluster was sampled.
Sir,
I am currently doing my undergraduate thesis, entitled “Mobile Phone Usage Among Youths in Davao City”.
I am supposed to use two-stage cluster sampling, but I am having difficulty getting the sample size. With 180 barangays/blocks across three (3) congressional districts, I am baffled as to how many places to choose at random and how to cluster them into groups.