Go Read Ginny’s Post First

I wanted to add a couple of things to what Ginny said about the SATs. The SATs have been a burr under my saddle since ETS “re-centered” the test in 1994. Mensa* no longer accepts the new SAT because the statistics are now so screwy – the scores no longer correlate with IQ. The scores had been dropping for a while prior to 1994, and the educational powers-that-be wanted to hide it, so they re-jiggered the baseline. In practice, this meant a 100 – 200 point increase in the composite score – I knew two kids who took both the 1993 and 1994 versions (within a few months of each other, and without extra test prep, so extra schooling was not a factor), and that was the difference in their scoring. N of 2, I know, but I’ll provide evidence further down that this is pretty typical.

Once Ginny turned my mind back to this problem this morning, I went a-hunting for the raw data – it’s a very dangerous practice, on a blog in general and on Chicago Boyz in particular, to go off on something without facts. So I found this little gem from ETS themselves. One of the first things I learned is that the incredible stability of the scores from about 1980 to 1995 was an anomaly. You might be surprised to hear that the scores also slipped drastically during the height of the GI Bill and the core Boomer years.

The original SATs were administered once per year, and normalized only to the group taking the test that year, but:

In 1938, an important change occurred. The SAT was administered twice that year, in April and in June. The practice of setting a new scale within each year continued to occur. Scores in April were given a mean of 500 and a standard deviation of 100, as were scores in June. This practice only made sense if the April and June groups were equivalent in SAT math (SAT M) and equivalent in SAT verbal (SAT V). They weren’t. The same practice was continued in 1939, as the 14th and 15th sets of SAT scales were established in April and May of that year.

By 1940, it was clear that setting scales anew with each administration was unfair to candidates who took the test with the more able cohort. So in 1940, the SAT V scores were scaled to {500/100} in April, and the June 1940 SAT V scores were linked to the April scale via common item equating.

It was not until 1941 – 1942 that the scales were set in stone, and from then on, each incoming set of scores was normalized against the distribution obtained from the 1941-42 cohorts – in essence, incoming freshmen were measured against the performance of the (eventual) graduating classes of 1945 and 1946. But the post-war boom brought something new: the GI Bill. A huge influx of new students came in, impacting scores:

From 1941 until 1951-52, the SAT V mean dropped from 501 to 476, and the SAT M mean dropped from 502 to 494, such that the SAT V and SAT M means differed by 18 points in 1951-52. Ten years later, in 1961-62, the SAT V mean had dropped an additional two points to 474, while the SAT M mean increased by one point to 495.

The reason for this was the demographics of those taking the test:

The report acknowledged that the educational arena had changed dramatically between 1941 and 1961, and that this change led to a major shift in the SAT test-taking population. The test-taking population was no longer mostly restricted to a selective self-selected group of students applying to Ivy League colleges and other prestigious Eastern colleges. World War II had changed the role of women. The GI Bill had expanded educational opportunity. College Board member colleges had gone from 44 to 350 between 1941 and 1961, nearly a nine-fold increase. Many of these new colleges came from the South and the West. Scholarship programs had also expanded opportunity. These increases in educational opportunity resulted in changed populations and presented scaling problems for the 1941-42 scales.

And then the Boomers hit. You just knew I would not be able to resist taking a shot at the Boomers, didn’t you? OK, I’ll take the high road and let someone else do it:

The real decline in SAT scores did not start until after the Wilks report was issued. Shortly after the Wilks report, from about 1963 until 1980, both SAT V and SAT M means dropped noticeably from about 475 for SAT V to around 425, and from about 500 to 470 for SAT M. Now the difference in SAT V and SAT M mean scores was close to 45 points. By 1990, the SAT M mean had increased to near 475, while the SAT V mean remained around 425, a 50-point difference.

It’s really hard to look at those two passages without coming to the conclusion that any adjustments made to the reference population would be dumbing down the test scores, but that is not entirely the case. At some point the SAT scoring process maps a raw score (say 84 out of 85 questions correct) onto that lovely 800 point scale. In the 1940s, that reference curve was a distribution whose mean was 500 and whose standard deviation was 100. Why even do the mapping at all? The test is not validated on raw scoring – slightly harder questions might be asked on one year’s test, slightly easier ones the next year. So, in order to keep a level playing field from year to year, ETS went for that screwy 800 point system we all know and love. But, even today, that means scores from one year are force-mapped onto a distribution curve from the reference year(s). Fine and dandy, except that ETS imposes a boundary condition – a perfect score must equal 800. Since the shape of the distribution changes from one year to the next, that resulted in gaps in the scaled scores at the top end of the scale:

New editions of the test, especially for SAT V, were not scaling out to 800. In other words, a perfect raw score would correspond to a 760 or 770 or 780. The score reporting policy was to award an 800 to a perfect raw score. Hence the top score would be an 800, but one omission out of 85 items might cost a student 30 to 40 points.

So, an 84 / 85 score might result in a 770 or 760, with no 780 or 790 score possible in a given year – which is not the impression one gets looking at the middle of the scoring range, where differences of 10 points are common. Since pretty much everyone scoring over 700 is in the top 1%, it’s hard to use percentiles to differentiate between those students, either. In the ultra-competitive Ivy League, differentiating between students in that top 1% is ultra-important, so ETS was forced to do something. They “re-centered” the scores based on data from 1990.
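To make the mechanics concrete, here is a minimal sketch of the kind of mapping described above. This is a toy linear equating with made-up form statistics (an 85-item form with raw mean 53 and SD 12) – not ETS’s actual equating procedure – but it shows how forcing a perfect raw score up to 800 opens exactly the kind of gap the quote describes:

```python
# Toy sketch of raw-to-scaled mapping (NOT the actual ETS equating method).
# Raw scores are placed on a reference curve with mean 500 and SD 100,
# rounded to the nearest 10, clamped to 200-800, and a perfect raw score
# is reported as 800 by policy.

def scale(raw, form_mean, form_sd, max_raw):
    """Map a raw score to the 200-800 scale via a hypothetical linear equating."""
    z = (raw - form_mean) / form_sd              # standing vs. the reference group
    scaled = round((500 + 100 * z) / 10) * 10    # mean 500, SD 100, steps of 10
    scaled = max(200, min(800, scaled))
    if raw == max_raw:                           # boundary condition: perfect = 800
        scaled = 800
    return scaled

# On this hypothetical "hard" form, a perfect score would naturally scale
# out to only ~770, but policy awards it an 800 -- so one omission costs
# 40 points, and 770/780/790 are unreachable that year.
for raw in (85, 84, 83):
    print(raw, "->", scale(raw, form_mean=53, form_sd=12, max_raw=85))
# 85 -> 800, 84 -> 760, 83 -> 750
```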

However, re-centering to string out the high scores has an (unintended? fortuitous? planned?) consequence in the middle scores – it gives them a good bump. It doesn’t do much for the lowest scores, but the average scores in the 1000 – 1200 range are impacted by a margin that looks really good to someone like me who came up under the old system. If you look at page 11 of that ETS report, you will see two graphs at the top for converting old scores to new. Both kids I knew who took the test scored in the 1100 range before and in the 1300 range after the 1994 adjustment. When I use those cruddy graphs to get new scores for the 500 – 600 range in both the V and M tests, I get approximately a 20 point increase for V (not the 30 – 60 I first eyeballed) and a 60 – 100 point increase for M (not 80 – 120) – right in line with my vast statistical sample of 2.

If the questions before and after re-centering are generally the same, and if (quite a big if) we, as a society, keep absolute rather than relative expectations of what it takes to be a college scholar, we should not have that much of a problem with declining scores as the number of kids taking the test increases. School systems have been pushing for years now to increase the percentage of students taking the SAT, watering down the applicant pool. But the average high school course content has been watered down, too, so it has been hard to use the SAT to separate the decline in standards from the effects of regression to the mean. I’m sure the educational establishment likes it that way, too.

Let’s be honest: both times scores have declined in the past, it has reflected an increase in the overall test-taking pool that meant lower abilities in the pool. One might counter that the types of questions asked in the past were so different from those asked today that the tests are not comparable, and I’ll admit that I have not seen a 1940s-era SAT. However, I’d be willing to bet that the difference in scores for an average student between any two years’ tests would be less than 30 points – the tests have not changed that much.

This latest decline reported by Ginny has not been accompanied by a significant increase in the test-taking pool, so one can now start to get a glimmer of the effects of the curricular degradation that has been taking place over the last two decades. If the educational theorists had had a positive impact, then Boomer, Gen-X, and Gen-Y students ought to be crushing the class of 1946, self-selected sample of tutored rich kids or not, when in fact they crushed us (on average). So, in the education world, something’s rotten in Denmark – in fact, something’s rotten in Denmark, Norway, all of Scandinavia, and the Low Countries.

[Update – I got the chance to actually print out the graphs and use a ruler to plot out scores, rather than eyeballing off the screen as I did this morning. I plotted out mid-range scores from 400 – 700 on both the V and M scales from page 11. First the V: a score of 400 went to about 410; after that, the bump was about the same – 20 points – all the way up to 700. The M scores are a bit wonkier: a 400 mapped to a 460 and a 700 to a 760, a 500 went up to 580, and 600 and 650 each gained about 100 points. In both cases, the curve starts to climb after 700, but that stretch is not actually shown on the graph – although the whole point of the re-centering was allegedly to separate the kids in the 700 – 800 block. All in all, a score should have been about 100 – 150 points higher for someone in the middle of the pack taking the second test versus the first – still in line with my huge statistical sample.]
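For anyone who wants to replay the conversion without a ruler, here is a rough old-to-new converter built from those hand-plotted readings. The `V_POINTS` and `M_POINTS` pairs are my eyeballed values from the page 11 graphs, not the official ETS conversion tables, and the straight-line interpolation between them is my own simplification:

```python
# Rough old-scale to recentered-scale converter, built from the graph
# readings in the update above (hand-plotted points, not official ETS
# conversion tables). Straight-line interpolation between plotted points.

# (old score, recentered score) pairs read off the page 11 graphs
V_POINTS = [(400, 410), (500, 520), (600, 620), (700, 720)]   # ~20 point bump
M_POINTS = [(400, 460), (500, 580), (600, 700), (650, 750), (700, 760)]

def convert(old, points):
    """Interpolate a recentered score from the plotted (old, new) pairs."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= old <= x1:
            return round(y0 + (y1 - y0) * (old - x0) / (x1 - x0))
    raise ValueError("score outside the plotted 400 - 700 range")

# A middle-of-the-pack old 550 V / 550 M comes out at roughly 570 V / 640 M:
# an old composite 1100 becomes about 1210 with no change in ability.
print(convert(550, V_POINTS), convert(550, M_POINTS))
```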

[Update2: Thanks to Mrs. Davis for catching the statistical typo.]

* I am in no way endorsing membership in that pompous organization.

4 thoughts on “Go Read Ginny’s Post First”

  1. “both times scores have declined in the past, it has reflected an increase in the overall test-taking pool that meant lower abilities in the pool.”

    This is the key statement.

    The pool for the CB report was 1.4 million seniors, which is about 35% of the total number of people in that age cohort (about 4 million). What no test given to that many students can do is accurately differentiate among students at the right end of the distribution. To do that would require a more difficult test and different norming procedure.

    The re-centering that you complained of had the effect of pushing that tail farther from the middle. It made the test more accurate in the middle, but sacrificed some accuracy at the tail. I would argue that is a big so what. The top 1% can fend for themselves. If they can’t go to the top 10 colleges*, they can get an education just as well, and just as good, at State.

    *Think about it. The top 10 schools have rather small classes, Penn is the largest at 2,500, but some are under 1,000 (e.g. Cal Tech at 235). So there may be 10,000 seats for the 40,000 people in the top 1%. It is probably worse than that. I have been told that about half of the seats at one top 10 school are filled by athletes, legacies, and development cases. So the number of seats available to smart kids is significantly smaller than that.

  2. No – if you read the College Board paper I cited, you see that the re-centering made the test more differentiating at the top end and really did nothing for the middle. If you follow that chart and pick some scores for re-scoring (I picked 1000, 1050, 1100, 1150, and 1200), what you see is that the absolute scores for the middle shifted, but the distances between scores in the middle stayed about the same. So no, the middle kids are not more accurately depicted in the new system; they just get higher scaled scores. From the report:

    For SAT V, the conversion from the original scale to the recentered scale affects all scores in roughly the same manner. Hence, score differences between students at different score levels are virtually unchanged by recentering. The only exceptions to this statement are reported scores at either extreme of the score scale. Scores truncated at 200 are separated. Scores that were stretched out in the 700s are brought in line with each other, which leads to more comparability for SAT forms at the upper end of the scale. With the exception of scores at either end of the score distribution, score differentials are unchanged (except for division by 10).

    On the SAT M, it is scores below 400 (which should not be getting into college anyway, but that’s another argument for another day…) on the re-centered scale that are most decompressed and more accurately represented.

    The eduwonks seized on that confusion for a while in the 1990s because geezers like me whose mental model had been calibrated in the “1250 = pretty good” days were thinking that kids were scoring better than they were.

    I agree that emphasizing the top end is probably not productive – there is no real difference between a 750 and an 800, and probably not between a 700 and an 800, either.

  3. a distribution whose center was at 500 and whose mean was 100.

    Any Boomer knows you meant a normal distribution centered at 500 with a standard deviation of 100.

  4. Thanks for the catch – I should have copied the line instead of retyping it. I’m not entirely sure it was a normal distribution onto which the scores were mapped, but the mean was 500 and the SD was 100, so a normal distribution is probably a good assumption.
