To continue our discussion of value-added modeling, I’d like to point readers to the recently released 2010 Brown Center Report on American Education. The report’s overall theme is that readers should keep in mind the strengths and limitations of any standardized assessment, domestic or international, when interpreting the resulting scores and rankings.
In this post I will focus on part 2 of this report, which describes two different regression models used to create a type of value-added measure of each state’s education system after controlling for each state’s demographic characteristics and prior academic achievement. In simpler terms, the model considers prior NAEP scores and demographic variables, predicts how well a state “should have” done, then looks at how well the state actually did and spits out a value-added number. These numbers were then normed so a state performing exactly as predicted would net a score of zero. States doing better than predicted would have a positive score and vice versa.
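To make the "predicted versus actual" idea concrete, here is a minimal sketch of a residual-based value-added score. It is not the report's model: it assumes a single predictor (a prior score) instead of the report's demographic controls, the scores are made up for four hypothetical states, and dividing by the spread of the residuals is my own assumption about how the numbers might be normed.

```python
from statistics import mean, pstdev

def value_added(prior, actual):
    """Standardized residuals from a one-predictor least-squares fit."""
    x_bar, y_bar = mean(prior), mean(actual)
    # Ordinary least squares slope and intercept for actual ~ prior.
    sxx = sum((x - x_bar) ** 2 for x in prior)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(prior, actual))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual = actual score minus the score the model predicted.
    residuals = [y - (intercept + slope * x) for x, y in zip(prior, actual)]
    # Least-squares residuals already average zero, so a state that performs
    # exactly as predicted scores 0. Scaling by the spread is an assumption.
    spread = pstdev(residuals)
    return [r / spread for r in residuals]

# Made-up scores for four hypothetical states (not NAEP data).
prior = [270.0, 275.0, 280.0, 285.0]
actual = [274.0, 276.0, 285.0, 287.0]
scores = value_added(prior, actual)

for p, a, s in zip(prior, actual, scores):
    print(f"prior={p:.0f} actual={a:.0f} value_added={s:+.2f}")
```

A positive score means the hypothetical state beat its prediction, and a negative score means it fell short, regardless of whether its raw score went up.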
The first model uses all available NAEP scores through 2009. For some states that is as many as 19 years of data; others have less, because state participation in NAEP was optional until 2003. The second model uses only scores from 2003-2009, when all states had to participate. The models are equally rigorous ways of looking at state achievement data, but they have slightly different emphases. To quote from the report:
The Model 1 analysis is superior in utilizing all state achievement data collected by NAEP. It analyzes trends over a longer period of time, up to nineteen years. But it may also produce biased estimates if states that willingly began participating in NAEP in the 1990s are different in some way than states that were compelled to join the assessment in 2003—and especially if that “some way” is systematically related to achievement….
Model 2 has the virtue of placing all states on equal footing, time-wise, by limiting the analysis to 2003-2009. But that six-year period may be atypical in NAEP’s history—No Child Left Behind dominated national policy discussions—and by discarding more than half of available NAEP data (all of the data collected before 2003), the model could produce misleading estimates of longer term correlations. (p. 16)
There are some similarities in the results from the two models. Seven states (Florida, Maryland, Massachusetts, Kentucky, New Jersey, Hawaii, and Pennsylvania) and the District of Columbia appear in the top ten of both models, while five states, among them Iowa, Nebraska, West Virginia, and Michigan, appear at the bottom of both models.
However, there are also wild swings in the ratings and rankings of many states. Five states (about 10% of the sample) rise or fall in the rankings by 25 or more places, out of only 51 places total. Fewer than half the states are placed into the same quintile in both models.
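The two comparisons used here, rank shifts and quintile agreement, are easy to compute once each model has produced a score per state. The sketch below uses five hypothetical states with made-up scores, and the helper names `ranks` and `quintile` are my own, not anything from the report.

```python
def ranks(scores):
    """Map each name to its rank, with 1 for the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def quintile(rank, n):
    """Quintile 1 holds the top fifth of ranks, quintile 5 the bottom fifth."""
    return (rank - 1) * 5 // n + 1

# Made-up value-added scores for five hypothetical states under two models.
model_1 = {"A": 1.2, "B": 0.4, "C": 0.1, "D": -0.3, "E": -1.4}
model_2 = {"A": 0.9, "B": -1.1, "C": 0.2, "D": 0.5, "E": -0.5}

r1, r2 = ranks(model_1), ranks(model_2)
n = len(r1)

# Positive shift = the state ranks higher under Model 2 than under Model 1.
shift = {s: r1[s] - r2[s] for s in r1}
same_quintile = [s for s in r1 if quintile(r1[s], n) == quintile(r2[s], n)]

print(shift)
print(same_quintile)
```

With 51 entries instead of five, the same `quintile` helper sorts ranks 1 through 51 into five nearly equal groups.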
Keep in mind that the two models are qualitatively similar, using the same demographic variables and the same outcome measure. The main difference is that Model 2 is fit to a subset of the data used for Model 1. A third model that included different measures and outcome variables could produce results that differ substantially from both of these models.
To further complicate things, a sharp drop in rankings from Model 1 to Model 2 can still reflect an absolute gain in student performance in that state, as can a negative value-added score. The Brown Report highlights New York as an example. Over the full period, New York (which was an early adopter of the NAEP) gained an average of 0.74 NAEP scale points per year, compared to 0.65 NAEP scale points per year for the rest of the states. Over that period, New York’s gains on NAEP are greater than what the model predicted, giving it a value-added score of 0.58. But between 2003 and 2009, New York gained only 0.38 NAEP scale points per year, while the gains of other states held roughly steady at 0.62. In this period, New York’s gains are less than what the model predicted, giving it a value-added score of -1.21. But in terms of absolute scores, New York finished better than it started, no matter which time frame you look at.
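The New York figures can be laid side by side in a few lines. This sketch uses only the per-year gains quoted above. The "gap" it computes is a raw difference from the other states, not the report's value-added score, which also adjusts for demographics and prior achievement.

```python
# Average NAEP scale-point gains per year, as quoted above.
gains = {
    "all years": {"new_york": 0.74, "other_states": 0.65},
    "2003-2009": {"new_york": 0.38, "other_states": 0.62},
}

summary = {}
for period, g in gains.items():
    summary[period] = {
        # Absolute improvement: did New York's scores go up at all?
        "improved": g["new_york"] > 0,
        # Relative standing: did New York gain faster than the other states?
        "gap_per_year": round(g["new_york"] - g["other_states"], 2),
    }

for period, row in summary.items():
    print(period, row)
```

In both periods New York improved in absolute terms; only the comparison with other states flips sign.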
I wanted to focus on these regression models from the Brown Report because they clearly illustrate some of the problems with using value-added models for high-stakes decisions like teacher contracts and pay. While the large disparities in rankings caused by relatively small differences in the models are interesting to researchers trying to understand the underpinnings of education, they are also exactly why value-added modeling is difficult to defend as a fair and reliable method of teacher evaluation.