Statistical issues with applying VAM

There’s a wonderful statistical discussion of Michael Winerip’s NYT article critiquing the use of value-added modeling in evaluating teachers, which I referenced in a previous post. I wanted to highlight some of the key statistical errors in that discussion, since I think these are important and understandable concepts for the general public to consider.

  • Margin of error: Ms. Isaacson’s 7th percentile score actually ranged from 0 to 52, yet the state is disregarding that uncertainty in making its employment recommendations. This is why I dislike the article’s headline, or more generally the saying, “Numbers don’t lie.” No, they don’t lie, but they do approximate, and can thus mislead, if those approximations aren’t adequately conveyed and recognized.
  • Reversion to the mean: (You may be more familiar with this concept as “regression to the mean,” but since it applies more broadly than linear regression, “reversion” is a more suitable term.) A single measurement can be influenced by many randomly varying factors, so one extreme value could reflect an unusual cluster of chance events. Measuring it again is likely to yield a value closer to the mean, simply because those chance events are unlikely to coincide again to produce another extreme value. Ms. Isaacson’s students could have been lucky in their high scores the previous year, causing their scores in the subsequent year to look low compared to predictions. (A brief simulation after this list illustrates this effect, along with the ceiling effect discussed below.)
  • Using only 4 discrete categories (or ranks) for grades:
    • The first problem with this is the imprecision that results. The model exaggerates the impact of between-grade transitions (e.g., improving from a 3 to a 4) but ignores within-grade changes (e.g., improving from a low 3 to a high 3).
    • The second problem is that this exacerbates the nonlinearity of the assessment (discussed next). When changes that produce grade transitions are more likely than changes that don’t produce grade transitions, having so few possible grade transitions further inflates their impact.
      Another instantiation of this problem is that the imprecision also exaggerates the ceiling effects mentioned below, in that benefits to students already earning the maximum score become invisible (as noted in a comment by journalist Steve Sailer):

      Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.
      But, if she is, the formula can’t measure it because 4 is the highest score you can get.

  • Nonlinearity: Not all grade transitions are equally likely, but the model treats them as such. Here are two major reasons why some transitions are more likely than others.
    • Measurement ceiling effects: Improving at the top range is more difficult and unlikely than improving in the middle range, as discussed in this comment:

      Going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

      However, the commenter then gives an example of a natural ceiling rather than a measurement ceiling. Natural ceilings (e.g., decreasing changes in weight loss, long jump, reaction time, etc. as the values become more extreme) do translate into nonlinearity, but due to physiological limitations rather than measurement ceilings. That said, the above quote still holds true because of the measurement ceiling, which masks the upper-bound variability among students who could have scored higher but inflates the relative lower-bound variability due to missing a question (whether from carelessness, a bad day, or bad luck in the question selection for the test). These students have more opportunities to be hurt by bad luck than helped by good luck because the test imposes a ceiling (doesn’t ask all the harder questions which they perhaps could have answered).

    • Unequal responses to feedback: The students and teachers all know that some grade transitions are more important than others. Just as students invest extra effort to turn an F into a D, so do teachers invest extra resources in moving students from below-basic to basic scores.
      More generally, a fundamental tenet of assessment is to inform the students in advance of the grading expectations. That means that there will always be nonlinearity, since now the students (and teachers) are “boundary-conscious” and behaving in ways to deliberately try to cross (or not cross) certain boundaries.
  • Definition of “value”: The value-added model described compares students’ current scores against predictions based on their prior-year scores. That implies that earning a 3 in 4th grade has no more value than earning a 3 in 3rd grade. As noted in this comment:

    There appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

    Perhaps students can earn the same (high or moderate) score year after year on badly designed tests simply through good test-taking strategies, but presumably the tests being used in these models are believed to measure actual learning. A teacher who helps “proficient” students earn “proficient” scores the next year is still teaching them something worthwhile, even if there’s room for more improvement.
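
To make the reversion-to-the-mean and ceiling points above concrete, here is a minimal simulation sketch in Python. All of the numbers and distributions are invented for illustration; this is not the state’s actual model. It gives each student a fixed underlying ability, generates two years of scores by adding fresh luck each year, and rounds and clips the results to a 1-4 scale.

```python
# Minimal illustration (invented numbers) of reversion to the mean and the
# ceiling imposed by a 1-4 score scale. Nothing about teaching changes between
# the two simulated years; only the luck does.
import numpy as np

rng = np.random.default_rng(0)
n_students = 100_000

true_ability = rng.normal(loc=2.5, scale=0.6, size=n_students)  # latent, never changes

def observed_score(ability, rng):
    """One year's test: ability plus luck, rounded and clipped to the 1-4 scale."""
    noisy = ability + rng.normal(scale=0.5, size=ability.shape)
    return np.clip(np.round(noisy), 1, 4)

year1 = observed_score(true_ability, rng)
year2 = observed_score(true_ability, rng)  # same ability, fresh luck

for s in (1, 2, 3, 4):
    group = year1 == s
    print(f"scored {s} in year 1: mean year-2 score = {year2[group].mean():.2f}")
```

Under these assumptions, students who scored 4 in year one tend to score lower the next year (they cannot go higher), and students who scored 1 tend to score higher, with no change in teaching at all. A naive year-over-year comparison would attribute both movements to the teacher.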

These criticisms can be addressed by several recommendations:

  1. Margin of error. Don’t base high-stakes decisions on highly uncertain metrics.
  2. Reversion to the mean. Use multiple measures. These could be estimates across multiple years (as in multiyear smoothing, as another commenter suggested), or values from multiple different assessments.
  3. Few grading categories. At the very least, use more scoring categories. Better yet, use the raw scores.
  4. Ceiling effect. Use tests with a higher ceiling. This could be an interesting application for using a form of dynamic assessment for measuring learning potential, although that might be tricky from a psychometric or educational measurement perspective.
  5. Nonlinearity of feedback. Draw from a broader pool of assessments that measure learning in a variety of ways, to discourage “gaming the system” on just one test (being overly sensitive to one set of arbitrary scoring boundaries).
  6. Definition of “value.” Change the baseline expectation (either in the model itself or in the interpretation of its results) to reflect the reality that earning the same score on a harder test actually does demonstrate learning.

Those are just the statistical issues. Don’t forget all the other problems we’ve mentioned, especially: the flaws in applying aggregate inferences to the individual; the imperfect link between student performance and teacher effectiveness; the lack of usable information provided to teachers; and the importance of attracting, training, and retaining good teachers.

Some history and context on VAM in teacher evaluation

In the Columbia Journalism Review’s Tested: Covering schools in the age of micro-measurement, LynNell Hancock provides a rich survey of the history and context of the current debate over value-added modeling in teacher evaluation, with a particular focus on LA and NY.

Here are some key points from the critique:

1. In spite of their complexity, value-added models are based on very limited sources of data: who taught the students, without regard to how or under what conditions, and standardized tests, which are very narrow and imperfect measures of learning:

No allowance is made for many “inside school” factors… Since the number is based on manipulating one-day snapshot tests—the value of which is a matter of debate—what does it really measure?

2. Value-added modeling is an imprecise method whose parameters and outcomes are highly dependent on the assumptions built into the model.

In February, two University of Colorado, Boulder researchers caused a dustup when they called the Times’s data “demonstrably inadequate.” After running the same data through their own methodology, controlling for added factors such as school demographics, the researchers found about half the reading teachers’ scores changed. On the extreme ends, about 8 percent were bumped from ineffective to effective, and 12 percent bumped the other way. To the researchers, the added factors were reasonable, and the fact that they changed the results so dramatically demonstrated the fragility of the value-added method.

3. Value-added modeling is inappropriate to use as grounds for firing teachers or calculating merit pay.

Nearly every economist who weighed in agreed that districts should not use these indicators to make high-stakes decisions, like whether to fire teachers or add bonuses to paychecks.

Further, it’s questionable how effective it is as a policy to focus simply on individual teacher quality, when poverty has a greater impact on a child’s learning:

The federal Coleman Report issued [in 1966] found that a child’s family economic status was the most telling predictor of school achievement. That stubborn fact remains discomfiting—but undisputed—among education researchers today.

These should all be familiar concerns by now. What this article adds is a much richer picture of the historical and political context for the many players in the debate. I’m deeply disturbed that NYS Supreme Court Judge Cynthia Kern ruled that “there is no requirement that data be reliable for it to be disclosed.” At least Trontz at the NY Times acknowledges the importance of publishing reliable information as opposed to spurious claims, except he seems to overlook all the arguments against the merits of the data:

If we find the data is so completely botched, or riddled with errors that it would be unfair to release it, then we would have to think very long and hard about releasing it.

That’s the whole point: applying value-added modeling to standardized test scores to fire or reward teachers is unreliable to the point of being unfair. Adding noise and confusion to the conversation isn’t “a net positive,” as Arthur Browne from The Daily News seems to believe; it degrades the discussion, at great harm to the individual teachers, their students, the institutions that house them, and the society that purports to sustain them and benefit from them.

Look for the story behind the numbers, not the numbers alone

This time I’ll let the journalists get away with their fondness for reporting the compelling individual story, since the single counterexample is the whole point here.

High-stakes testing was bad enough. But high-stakes evaluating and hiring? This is a great example of the dangers of applying quantitative metrics inappropriately. While value-added modeling may be able to capture properties of the aggregate, it makes occasional errors at the level of the individual. Just one error (whether the case is factual or exaggerated, it still illustrates the point) demonstrates the ethical and managerial problems of firing the wrong person based on aggregated data.

Nor do I understand the political eagerness to fire teachers so readily. I’m not convinced that teachers are such an abundant resource that we can afford to burn through them so callously. With teacher shortages in multiple areas and a national teacher attrition rate of 15-20%, we would do better to keep, train, and support the teachers we already have, rather than toss them out and discourage new recruits from joining an increasingly unfriendly profession.

While I agree that it’s important to judge teaching by its merits rather than just the years spent, we need to formulate those measurements carefully. Test scores alone give a misleading illusion of greater precision than they actually have.

Regression models and the 2010 Brown Center report

To continue our discussion of value-added modeling, I’d like to point readers to the recently-released 2010 Brown Center Report on American Education. The overall theme of the report is to be cognizant of the strengths and limitations of any standardized assessment—domestic or international—when interpreting the resulting scores and rankings.

In this post I will focus on part 2 of this report, which describes two different regression models used to create a type of value-added measure of each state’s education system after controlling for each state’s demographic characteristics and prior academic achievement. In simpler terms, the model considers prior NAEP scores and demographic variables, predicts how well a state “should have” done, then looks at how well the state actually did and spits out a value-added number. These numbers were then normed so a state performing exactly as predicted would net a score of zero. States doing better than predicted would have a positive score and vice versa.
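
For readers who think in code, here is a rough sketch of that residual logic, assuming a hypothetical table of state-level data. The file name, column names, and the use of ordinary least squares are my assumptions for illustration, not the report’s actual specification.

```python
# Hedged sketch of the "value added as residual" idea described above.
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns: state, current_naep, prior_naep, pct_free_lunch, pct_ell
states = pd.read_csv("state_naep.csv")

X = sm.add_constant(states[["prior_naep", "pct_free_lunch", "pct_ell"]])
y = states["current_naep"]

fit = sm.OLS(y, X).fit()                      # predict how well each state "should" do
states["value_added"] = y - fit.predict(X)    # actual minus predicted
# With an intercept in the model, these residuals already average to zero,
# matching the report's convention that "performing exactly as predicted" = 0.

print(states.sort_values("value_added", ascending=False)[["state", "value_added"]])
```

The point of the sketch is only that the “value added” number is a residual: whatever the model fails to predict gets attributed to the state (or, in teacher-level models, to the teacher), so the score inherits every limitation of the predictors and of the test itself.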

The first model uses all available NAEP scores through 2009. This is as much as 19 years of data for some states—state participation in NAEP was optional until 2003. The second model only uses scores from 2003-2009, when all states had to participate. The models are equally rigorous ways of looking at state achievement data, but they have slightly different emphases. To quote from the report:

The Model 1 analysis is superior in utilizing all state achievement data collected by NAEP. It analyzes trends over a longer period of time, up to nineteen years. But it may also produce biased estimates if states that willingly began participating in NAEP in the 1990s are different in some way than states that were compelled to join the assessment in 2003—and especially if that “some way” is systematically related to achievement….

Model 2 has the virtue of placing all states on equal footing, time-wise, by limiting the analysis to 2003-2009. But that six-year period may be atypical in NAEP’s history—No Child Left Behind dominated national policy discussions—and by discarding more than half of available NAEP data (all of the data collected before 2003), the model could produce misleading estimates of longer term correlations. (p. 16)

There are some similarities in the results from the two models. Seven states—Florida, Maryland, Massachusetts, Kentucky, New Jersey, Hawaii, and Pennsylvania—and the District of Columbia appear in the top ten of both models, while Iowa, Nebraska, West Virginia, and Michigan appear at the bottom of both models.

However, there are also wild swings in the ratings and rankings of many states. Five states (or 10% of the sample) rise or fall in the rankings by 25 or more places—and there are only 51 places total. Fewer than half the states are placed into the same quintile in both models.

Keep in mind that the two models are qualitatively similar, using the same demographic variables and the same outcome measure. The main difference is that Model 2 uses a subset of the data used in Model 1. A third model that included different measures and outcome variables could produce results that differ critically from both of these models.
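
To make the instability concrete, here is a hedged sketch that fabricates value-added scores from two correlated model variants for 51 jurisdictions and then computes the kinds of agreement statistics discussed above (rank swings, quintile placement, rank correlation). The data are random and purely illustrative.

```python
# Illustrative comparison of rankings produced by two similar models.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
base = rng.normal(size=51)                      # shared "signal" across models
df = pd.DataFrame({
    "state": [f"state_{i}" for i in range(51)],
    "va_model1": base + rng.normal(scale=0.5, size=51),
    "va_model2": base + rng.normal(scale=0.5, size=51),
})

df["rank1"] = df["va_model1"].rank(ascending=False).astype(int)
df["rank2"] = df["va_model2"].rank(ascending=False).astype(int)
df["swing"] = (df["rank1"] - df["rank2"]).abs()
df["quintile1"] = pd.qcut(df["va_model1"], 5, labels=False)
df["quintile2"] = pd.qcut(df["va_model2"], 5, labels=False)

print("largest rank swing:", df["swing"].max())
print("share placed in the same quintile by both models:",
      round((df["quintile1"] == df["quintile2"]).mean(), 2))
print("rank correlation:",
      round(df["va_model1"].corr(df["va_model2"], method="spearman"), 2))
```

In runs like this, even two models built on the same underlying signal can place a substantial share of states in different quintiles, which is the kind of disagreement the report describes.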

To further complicate things, a sharp drop in rankings from Model 1 to Model 2 can still reflect an absolute gain in student performance in that state, as can a negative value-added score. The Brown Report highlights New York as an example. Over all time, New York (which was an early adopter of the NAEP) gained an average of 0.74 NAEP scale points per year, compared to 0.65 NAEP scale points per year for the rest of the states. Over all time, New York’s gains on NAEP are greater than what the model predicted, with a value-added score of 0.58. But between 2003 and 2009, New York only gained 0.38 NAEP scale points per year, while the gains of other states held steady at 0.62. In this period, New York’s gains are less than what the model predicted, with a value-added score of -1.21. But in terms of absolute scores, New York finished better than it started, no matter which time frame you look at.

I wanted to focus on these regression models from the Brown Report because they clearly illustrate some of the problems with using value-added models for high-stakes decisions like teacher contracts and pay. While the large disparities in rankings caused by relatively small differences in the models are interesting to researchers trying to understand the underpinnings of education, they are also exactly why value-added modeling is difficult to defend as a fair and reliable method of teacher evaluation.

Some limitations of value-added modeling

Following this discussion on teacher evaluation led me to a fascinating analysis by Jim Manzi.

We’ve already discussed some concerns with using standardized test scores as the outcome measures in value-added modeling; Manzi points out other problems with the model and the inputs to the model.

  1. Teaching is complex.
  2. It’s difficult to make good predictions about achievement across different domains.
  3. It’s unrealistic to attribute success or failure only to a single teacher.
  4. The effects of teaching extend beyond one school year, and therefore measurements capture influences that go back beyond one year and one teacher.

I’m not particularly fond of the above list—while I agree with all the claims, they’re not explained very clearly, and they don’t capture the key issues below, which he discusses in more depth.

  1. Inferences about the aggregate are not inferences about an individual. More deeply, the model is valid at the aggregate level, “but any one data point cannot be validated.” This is a fundamental problem, true of stereotypes, of generalizations, and of averages. While they may enable you to make broad claims about a population of people, you can’t apply those claims to policies about a particular individual with enough confidence to justify high-stakes outcomes such as firing decisions. As Manzi summarizes it, an evaluation system works to help an organization achieve an outcome, not to be fair to the individuals within that organization.

     This is also related to problems with data mining—by throwing a bunch of data into a model and turning the crank, you can end up with all kinds of difficult-to-interpret correlations which are excellent predictors but which don’t make a whole lot of sense from a theoretical standpoint.

  2. Basing decisions on single instead of multiple measures is flawed. From a statistical modeling perspective, it’s easier to work with a single precise, quantitative measure than with multiple measures. But this inflates the influence of that one measure, which is often limited in time and scale. Figuring out how to combine multiple measures into a single metric requires subjective judgment (and thus organizational agreement), and, in Manzi’s words, “is very unlikely to work” with value-added modeling. (I do wish he’d expanded on this point further, though.)

  3. All assessments are proxies. If the proxy is given more value than the underlying phenomenon it’s supposed to measure, this can incentivize “teaching to the test”. With much at stake, some people will try to game the system. This may motivate those who construct and rely on the model to periodically change the metrics, but that introduces more instability in interpreting and calibrating the results across implementations.

In highlighting these weaknesses of value-added modeling, Manzi concludes by arguing that improving teacher evaluation requires a lot more careful interpretation of its results, within the context of better teacher management. I would very much welcome hearing more dialogue about what that management and leadership should look like, instead of so much hype about impressive but complex statistical tools expected to solve the whole problem on their own.

Using student evaluations to measure teaching effectiveness

I came across a fascinating discussion on the use of student evaluations to measure teaching effectiveness upon following this Observational Epidemiology blog post by Mark, a statistical consultant. The original paper by Scott Carrell and James West uses value-added modeling to estimate teachers’ contributions to students’ grades in introductory courses and in subsequent courses, then analyzes the relationship between those contributions and student evaluations. (An ungated version of the paper is also available.) Key conclusions are:

Student evaluations are positively correlated with contemporaneous professor value‐added and negatively correlated with follow‐on student achievement. That is, students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory course professor value‐added in follow‐on courses).

We find that less experienced and less qualified professors produce students who perform significantly better in the contemporaneous course being taught, whereas more experienced and highly qualified professors produce students who perform better in the follow‐on related curriculum.

Not having closely followed the research on this, I’ll simply note some key comments from other blogs.

Direct examination:

Several have posted links that suggest an endorsement of this paper’s conclusion, such as George Mason University professor of economics Tyler Cowen, Harvard professor of economics Greg Mankiw, and Northwestern professor of managerial economics Sandeep Baliga. Michael Bishop, a contributor to Permutations (“official blog of the Mathematical Sociology Section of the American Sociological Association”), provides some more detail in his analysis:

In my post on Babcock’s and Marks’ research, I touched on the possible unintended consequences of student evaluations of professors.  This paper gives new reasons for concern (not to mention much additional evidence, e.g. that physical attractiveness strongly boosts student evaluations).

That said, the scary thing is that even with random assignment, rich data, and careful analysis there are multiple, quite different, explanations.

The obvious first possibility is that inexperienced professors, (perhaps under pressure to get good teaching evaluations) focus strictly on teaching students what they need to know for good grades.  More experienced professors teach a broader curriculum, the benefits of which you might take on faith but needn’t because their students do better in the follow-up course!

After citing this alternative explanation from the authors:

Students of low value added professors in the introductory course may increase effort in follow-on courses to help “erase” their lower than expected grade in the introductory course.

Bishop also notes that motivating students to invest more effort in future courses would be a desirable effect of good professors as well. (But how to distinguish between “good” and “bad” methods for producing this motivation isn’t obvious.)

Cross-examination:

Others critique the article and defend the usefulness of student evaluations with observations that provoke further fascinating discussions.

Andrew Gelman, Columbia professor of statistics and political science, expresses skepticism about the claims:

Carrell and West estimate that the effects of instructors on performance in the follow-on class is as large as the effects on the class they’re teaching. This seems hard to believe, and it seems central enough to their story that I don’t know what to think about everything else in the paper.

At Education Sector, Forrest Hinton expresses strong reservations about the conclusions and the methods:

If you’re like me, you are utterly perplexed by a system that would mostly determine the quality of a Calculus I instructor by students’ performance in a Calculus II or aeronautical engineering course taught by a different instructor, while discounting students’ mastery of Calculus I concepts.

The trouble with complex value-added models, like the one used in this report, is that the number of people who have the technical skills necessary to participate in the debate and critique process is very limited—mostly to academics themselves, who have their own special interests.

Jeff Ely, Northwestern professor of economics, objects to the authors’ interpretation of their results:

I don’t see any way the authors have ruled out the following equally plausible explanation for the statistical findings.  First, students are targeting a GPA.  If I am an outstanding teacher and they do unusually well in my class they don’t need to spend as much effort in their next class as those who had lousy teachers, did poorly this time around, and have some catching up to do next time.  Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.

In agreement, Ed Dolan, an economist who was also for ten years “a teacher and administrator in a graduate business program that did not have tenure,” comments on Jeff Ely’s blog:

I reject the hypothesis that students give high evaluations to instructors who dumb down their courses, teach to the test, grade high, and joke a lot in class. On the contrary, they resent such teachers because they are not getting their money’s worth. I observed a positive correlation between overall evaluation scores and a key evaluation-form item that indicated that the course required more work than average. Informal conversations with students known to be serious tended to confirm the formal evaluation scores.

Re-direct:

Dean Eckles, a PhD candidate at Stanford’s CHIMe lab, offers this response to Andrew Gelman’s blog post (linked above):

Students like doing well on tests etc. This happens when the teacher is either easier (either through making evaluations easier or teaching more directly to the test) or more effective.

Conditioning on this outcome, is conditioning on a collider that introduces a negative dependence between teacher quality and other factors affecting student satisfaction (e.g., how easy they are).
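
Eckles’s point about conditioning on a collider is easy to miss, so here is a toy simulation of it. All distributions are invented: teacher quality and course easiness are independent in the population, both raise performance in the intro course, and only quality matters later.

```python
# Toy demonstration of collider bias: two independent causes become negatively
# correlated once we condition on their common effect (doing well in intro).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

quality = rng.normal(size=n)    # raises intro performance and deep learning
easiness = rng.normal(size=n)   # raises intro performance (and satisfaction) only

intro_performance = quality + easiness + rng.normal(scale=0.5, size=n)

print("corr(quality, easiness), all sections:",
      round(float(np.corrcoef(quality, easiness)[0, 1]), 3))   # ~0 by construction

# Condition on the collider: look only at sections where students did well.
well = intro_performance > 1.0
print("corr(quality, easiness), given students did well:",
      round(float(np.corrcoef(quality[well], easiness[well])[0, 1]), 3))  # negative
```

Within the selected group, easiness and quality trade off even though they were independent overall, so a section that students liked because it was easy needn’t have been taught well. That is the kind of induced dependence Eckles is pointing to, not a claim about what actually drove the Carrell and West results.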

From Jeff Ely’s blog, a comment by Brian Moore raises this critical question:

“Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.”

Do we know this for sure? Perhaps they know when they have an outstanding teacher, but by definition, those are relatively few.

Closing thoughts:

These discussions raise many key questions, namely:

  • how to measure good teaching;
  • tensions between short-term and long-term assessment and evaluation[1];
  • how well students’ grades measure learning, and how grades impact their perception of learning;
  • the relationship between learning, motivation, and affect (satisfaction);
  • but perhaps most deeply, the question of student metacognition.

The anecdotal comments others have provided about how students respond on evaluations are more fairly couched in terms of “some students.” Given the considerable variability among students, interpreting student evaluations needs to account for those individual differences in teasing out the actual teaching and learning that underlie self-reported perceptions. Buried within those evaluations may be a valuable signal masked by a lot of noise, or, more problematically, multiple signals that cancel and drown each other out.

[1] For example, see this review of research demonstrating that training which produces better short-term performance can produce worse long-term learning:
Schmidt, R.A., & Bjork, R.A. (1992). New conceptualizations of practice: Common principles in three paradigms suggest new concepts for training. Psychological Science, 3, 207-217.

The value-added wave is a tsunami

Edweek ran an article earlier this week in which economist Douglas N. Harris attempts to encourage economists and educators to get along.

He unfortunately lost me in the 3rd paragraph.

Drawing on student-level achievement data across years, linked to individual teachers, statistical techniques can be used to estimate how much each teacher contributed to student scores—the value-added measure of teacher performance. These measures in turn can be given to teachers and school leaders to inform professional development and curriculum decisions, or to make arguably higher-stakes decisions about performance pay, tenure, and dismissal.

Emphasis mine.

Economists and their education reform allies frequently make this claim, but it is not true, at least not yet. Value-added measures are based on standardized-test scores, and neither currently provides information an educator can actually use to make professional development or curriculum decisions. When the scores are released, administrators and teachers receive a composite score and a handful of subscores for each student. In math, these subscores might be for topics like “Number and Operation Sense” and “Geometry”.

It does not do an educator any good to know last year’s students struggled with a topic as broad as “Number and Operation Sense”. Which numbers? Integers? Decimals? Did the students have problems with basic place value? Which operations? The non-commutative ones? Or did they have specific problems with regrouping and carrying? In what way are the students struggling? What errors are they making? What misconceptions might these errors point to? None of this information is contained in a score report. So, as an educator faced with test scores low in “Number and Operation Sense” (and which might be low in other areas as well), where do you start? Do you throw out the entire curriculum? If not, how do you know which parts of it need to be re-examined?

People trained in education recognize a difference between formative assessment—information collected for the purpose of improving instruction and student learning—and summative assessment—information collected to determine whether a student or other entity has reached a desired endpoint. Standardized tests are summative assessments—bad scores on them are like knowing that your football team keeps losing its games. This information is not sufficient for helping the team improve.

Why do economists see the issue so differently?

An economist myself, let me try to explain. Economists tend to think like well-meaning business people. They focus more on bottom-line results than processes and pedagogy, care more about preparing students for the workplace than the ballot box or art museum, and worry more about U.S. economic competitiveness. Economists also focus on the role financial incentives play in organizations, more so than the other myriad factors affecting human behavior. From this perspective, if we can get rid of ineffective teachers and provide financial incentives for the remainder to improve, then students will have higher test scores, yielding more productive workers and a more competitive U.S. economy.

This logic makes educators and education scholars cringe: Do economists not see that drill-and-kill has replaced rich, inquiry-based learning? Do they really think test preparation is the solution to the nation’s economic prosperity? Economists do partly recognize these concerns, as the quotations from the recent reports suggest. But they also see the motivation and goals of human behavior somewhat differently from the way most educators do.

This false dichotomy makes me cringe. As a trained education research scientist who is no stranger to statistical models, I can say that value-added is not ready for prime time because its primary input—standardized test scores—is deeply flawed. In science and statistics, if you put garbage data into your model, you will get garbage conclusions out. It has nothing to do with valuing art over economic competitiveness, and everything to do with the integrity of the science.

The divide between economists and others might be more productive if any of the reports provided specific recommendations. For example, creating better student assessments and combining value-added with classroom assessments are musts.

Thank you. Here is where I start agreeing—if only that had been the central point of the article. I don’t dismiss value-added modeling as a technique, but I do not believe we have anything resembling good measures of teaching and learning.

We also have to avoid letting the tail wag the dog: Some states and districts are trying to expand testing to nontested grades and subjects, and to change test instruments so the scores more clearly reflect student growth for value-added calculations. This thinking is exactly backwards.

I agree completely, but that won’t stop states and districts from desperately trying to game the system. Since economists focus so much on financial incentives, this should be easy for them to understand: when the penalty for having low standardized test scores (or low value-added scores) is losing your funding, you will do whatever will get those scores up fastest. In most cases, that is changing the rules by which the scores are computed. Welcome to Campbell’s law.

In other words, test-based value-added doesn’t make sense

Education Week posts a defense of value-added metrics, attempting to address concerns about their use and reliability. By their admission, it all hinges on a central assumption:

If student test achievement is the desired outcome, value-added is superior to other existing methods of classifying teachers.

What if student test achievement isn’t the desired outcome?

But what do the data say?

Mark’s post “Perhaps this is the time for a counter-reformation” summarizes some choice tidbits on charter schools, test-based metrics & value-added modeling, and performance-based pay and firing, from a statistician’s perspective.

On charter schools:

The majority of the 5,000 or so charter schools nationwide appear to be no better, and in many cases worse, than local public schools when measured by achievement on standardized tests.

On value-added modeling:

A study [using VAM] found that students’ fifth grade teachers were good predictors of their fourth grade test scores… [which] can only mean that VAM results are based on factors other than teachers’ actual effectiveness.
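
The falsification logic behind that finding can be illustrated with a small, entirely hypothetical simulation: if students are sorted into next year’s classrooms partly on the basis of their prior scores, then fifth-grade teachers will appear to “affect” fourth-grade scores they could not possibly have caused.

```python
# Toy falsification check: under nonrandom (tracked) classroom assignment,
# next year's teachers "predict" last year's scores.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_students, n_teachers = 20_000, 100

ability = rng.normal(size=n_students)
grade4 = ability + rng.normal(scale=0.5, size=n_students)     # measured before 5th grade

# Hypothetical sorting: assignment to 5th-grade classrooms depends partly on prior scores.
sorting = grade4 + rng.normal(scale=1.0, size=n_students)
teacher = pd.qcut(sorting, n_teachers, labels=False)

# Average 4th-grade score by 5th-grade teacher: an "effect" the teacher cannot have caused.
placebo = pd.Series(grade4).groupby(teacher).mean()
print("spread of 5th-grade teachers' apparent 'effects' on 4th-grade scores:",
      round(float(placebo.std()), 2))   # far from zero under sorting; ~0 under random assignment
```

Under random assignment those classroom averages would all hover near zero; any spread here comes from sorting rather than teaching, which is one way a value-added estimate can reflect factors other than teachers’ actual effectiveness.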

On performance-based pay and firing:

There is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones.

[A study] conducted by the National Center on Performance Incentives at Vanderbilt… found no significant difference between the test results from classes led by teachers eligible for bonuses and those led by teachers who were ineligible.

In summary:

Just for the record, I believe that charter schools, increased use of metrics, merit pay and a streamlined process for dismissing bad teachers do have a place in education, but all of these things can do more harm than good if badly implemented and, given the current state of the reform movement, badly implemented is pretty much the upper bound.

I’m less pessimistic than Mark is about the quality of implementation of these initiatives, but I agree that how effectively well-intentioned reforms are implemented is always a crucial concern.

Doctor quality, meet teacher quality

The New York Times ran an article on doctor “pay for performance”:

Health care experts applauded these early [“pay for performance”] initiatives and the new focus on patient outcomes. But over time, many of the same experts began tempering their earlier enthusiasm. In opinion pieces published in medical journals, they have voiced concerns about pay-for-performance ranging from the onerous administrative burden of collecting such large amounts of patient data to the potential worsening of clinician morale and widening disparities in health care access.

But there has been no research to date that has directly studied the one concern driving all of these criticisms — that the premise of the evaluation criteria is flawed. Patient outcomes may not be as inextricably linked to doctors as many pay-for-performance programs presume.

Now a study published this month in The Journal of the American Medical Association has confirmed these experts’ suspicions. Researchers from the Massachusetts General Hospital in Boston and Harvard Medical School have found that whom doctors care for can have as much of an influence on pay-for-performance rankings as what those doctors do.

Replace every instance of “doctor” with “teacher” and “patient” with “student”, and you have a cogent explanation of why teacher merit pay based on students’ standardized test scores is a terrible idea.
