If assessments are diagnoses, what are the prescriptions?

I happen to like statistics. I appreciate qualitative observations, too; data of all sorts can be deeply illuminating. But I also believe that the most important part of interpreting data is understanding what they do and don’t measure. And in terms of policy, it’s important to consider what one will do with the data once they are collected, organized, analyzed, and interpreted. What do the data tell us that we didn’t know before? Now that we have this knowledge, how will we apply it to achieve the desired change?

In an eloquent, impassioned open letter to President Obama, Education Secretary Arne Duncan, Bill Gates and other billionaires pouring investments into business-driven education reforms (revised version at Washington Post), elementary teacher and literacy coach Peggy Robertson argues that all these standardized tests don’t give her more information than what she already knew from observing her students directly. She also argues that the money that would go toward administering all these tests would be better spent on basic resources such as stocking school libraries with books for the students and reducing poverty.

She doesn’t go so far as to question the current most-talked-about proposals for using those test data: performance-based pay, tenure, and firing decisions. But I will. I can think of a much more immediate and important use for the streams of data many are proposing on educational outcomes and processes: Use them to improve teachers’ professional development, not just to evaluate, reward and punish them.

Simply put, teachers deserve formative assessment too.


Regression models and the 2010 Brown Center report

To continue our discussion of value-added modeling, I’d like to point readers to the recently released 2010 Brown Center Report on American Education. The overall theme of the report is to be cognizant of the strengths and limitations of any standardized assessment, domestic or international, when interpreting the resulting scores and rankings.

In this post I will focus on part 2 of this report, which describes two different regression models used to create a type of value-added measure of each state’s education system after controlling for each state’s demographic characteristics and prior academic achievement. In simpler terms, the model considers prior NAEP scores and demographic variables, predicts how well a state “should have” done, then looks at how well the state actually did and spits out a value-added number. These numbers were then normed so a state performing exactly as predicted would net a score of zero. States doing better than predicted would have a positive score and vice versa.
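For readers who like to see the mechanics, here is a minimal sketch of the "predict, then compare" idea. It is not the report's actual model: I am assuming a single predictor (a prior score) and ordinary least squares, where the report uses several demographic variables, and the numbers below are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def value_added(prior, actual):
    """Residuals: actual score minus the score the fitted line predicts.

    With an intercept in the model the residuals average to zero, so a
    state that does exactly as predicted gets a score of zero.
    """
    slope, intercept = fit_line(prior, actual)
    return [y - (intercept + slope * x) for x, y in zip(prior, actual)]

# Hypothetical prior and later scores for five imaginary states.
prior = [210.0, 220.0, 230.0, 240.0, 250.0]
actual = [216.0, 223.0, 236.0, 243.0, 257.0]

scores = value_added(prior, actual)
for p, a, s in zip(prior, actual, scores):
    print(f"prior={p:.0f} actual={a:.0f} value_added={s:+.2f}")
```

Note that every imaginary state here improved on its prior score, yet some still come out with a negative value-added number, because the score is relative to the prediction rather than to where the state started.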

The first model uses all available NAEP scores through 2009. That is as many as 19 years of data for some states, since state participation in NAEP was optional until 2003. The second model uses only scores from 2003-2009, when all states had to participate. The models are equally rigorous ways of looking at state achievement data, but they have slightly different emphases. To quote from the report:

The Model 1 analysis is superior in utilizing all state achievement data collected by NAEP. It analyzes trends over a longer period of time, up to nineteen years. But it may also produce biased estimates if states that willingly began participating in NAEP in the 1990s are different in some way than states that were compelled to join the assessment in 2003—and especially if that “some way” is systematically related to achievement….

Model 2 has the virtue of placing all states on equal footing, time-wise, by limiting the analysis to 2003-2009. But that six-year period may be atypical in NAEP’s history—No Child Left Behind dominated national policy discussions—and by discarding more than half of available NAEP data (all of the data collected before 2003), the model could produce misleading estimates of longer term correlations. (p. 16)

There are some similarities in the results from the two models. Seven states (Florida, Maryland, Massachusetts, Kentucky, New Jersey, Hawaii, and Pennsylvania) and the District of Columbia appear in the top ten of both models, while several states, among them Iowa, Nebraska, West Virginia, and Michigan, appear near the bottom of both models.

However, there are also wild swings in the ratings and rankings of many states. Five states (or 10% of the sample) rise or fall in the rankings by 25 or more places—and there are only 51 places total. Fewer than half the states are placed into the same quintile in both models.
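To make "swings in rankings" and "same quintile" concrete, here is a rough sketch of how one could compare two sets of value-added scores. The ten state labels and their scores are hypothetical, and the helper names (`ranks`, `quintile`, `compare`) are my own; none of this comes from the report.

```python
def ranks(scores):
    """Map each state to its rank, where 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def quintile(rank, n):
    """Quintile (1 to 5) of a rank among n places."""
    return (rank - 1) * 5 // n + 1

def compare(model_a, model_b, big_move):
    """Return the states that move at least big_move places between the
    two models, and the share of states in the same quintile in both."""
    ra, rb = ranks(model_a), ranks(model_b)
    n = len(ra)
    movers = [s for s in ra if abs(ra[s] - rb[s]) >= big_move]
    same_q = [s for s in ra if quintile(ra[s], n) == quintile(rb[s], n)]
    return movers, len(same_q) / n

# Hypothetical value-added scores for ten imaginary states.
model_a = {"A": 1.0, "B": 0.8, "C": 0.6, "D": 0.4, "E": 0.2,
           "F": -0.2, "G": -0.4, "H": -0.6, "I": -0.8, "J": -1.0}
model_b = {"A": 0.9, "B": -0.9, "C": 0.5, "D": 0.7, "E": 0.1,
           "F": -0.1, "G": -0.5, "H": -0.3, "I": 1.1, "J": -1.1}

movers, same_share = compare(model_a, model_b, big_move=5)
print("big movers:", sorted(movers))
print("share in same quintile:", same_share)
```

In this toy data most states barely move, but two of them trade the top of the list for the bottom, which is the kind of instability that matters if the rankings carry consequences.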

Keep in mind the two models are qualitatively similar, using the same demographic variables and the same outcome measure. The main difference is that Model 2 draws on only a subset of the data used in Model 1. A third model that included different demographic measures or a different outcome variable could produce results that differ critically from both of these models.

To further complicate things, a sharp drop in rankings from Model 1 to Model 2 can still reflect an absolute gain in student performance in that state, as can a negative value-added score. The Brown Center report highlights New York as an example. Over the full period of available data, New York (an early adopter of NAEP) gained an average of 0.74 NAEP scale points per year, compared to 0.65 points per year for the rest of the states. Those gains are greater than what the model predicted, giving New York a value-added score of 0.58. But between 2003 and 2009, New York gained only 0.38 NAEP scale points per year, while the gains of other states held roughly steady at 0.62. In this period, New York’s gains are less than what the model predicted, with a value-added score of -1.21. But in terms of absolute scores, New York finished better than it started, no matter which time frame you look at.

I wanted to focus on these regression models from the Brown Report because they clearly illustrate some of the problems with using value-added models for high-stakes decisions like teacher contracts and pay. While the large disparities in rankings caused by relatively small differences in the models are interesting to researchers trying to understand the underpinnings of education, they are also exactly why value-added modeling is difficult to defend as a fair and reliable method of teacher evaluation.
