Statistical issues with applying VAM

There’s a wonderful statistical discussion of Michael Winerip’s NYT article critiquing the use of value-added modeling in evaluating teachers, which I referenced in a previous post. I wanted to highlight some of the key statistical errors in that discussion, since I think these are important and understandable concepts for the general public to consider.

  • Margin of error: Ms. Isaacson’s 7th percentile score actually ranged from 0 to 52, yet the state is disregarding that uncertainty in making its employment recommendations. This is why I dislike the article’s headline, or more generally the saying, “Numbers don’t lie.” No, they don’t lie, but they do approximate, and can thus mislead, if those approximations aren’t adequately conveyed and recognized.
  • Reversion to the mean: (You may be more familiar with this concept as “regression to the mean,” but since it applies more broadly than linear regression, “reversion” is a more suitable term.) A single measurement can be influenced by many randomly varying factors, so one extreme value could reflect an unusual cluster of chance events. Measuring it again is likely to yield a value closer to the mean, simply because those chance events are unlikely to coincide again to produce another extreme value. Ms. Isaacson’s students could have been lucky in their high scores the previous year, causing their scores in the subsequent year to look low compared to predictions.
  • Using only 4 discrete categories (or ranks) for grades:
    • The first problem with this is the imprecision that results. The model exaggerates the impact of between-grade transitions (e.g., improving from a 3 to a 4) but ignores within-grade changes (e.g., improving from a low 3 to a high 3).
    • The second problem is that this exacerbates the nonlinearity of the assessment (discussed next). When changes that produce grade transitions are more likely than changes that don’t produce grade transitions, having so few possible grade transitions further inflates their impact.
      Another instantiation of this problem is that the imprecision also exaggerates the ceiling effects mentioned below, in that benefits to students already earning the maximum score become invisible (as noted in a comment by journalist Steve Sailer

      Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.
      But, if she is, the formula can’t measure it because 4 is the highest score you can get.

  • Nonlinearity: Not all grade transitions are equally likely, but the model treats them as such. Here are two major reasons why some transitions are more likely than others.
    • Measurement ceiling effects: Improving at the top range is more difficult and unlikely than improving in the middle range, as discussed in this comment:

      Going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

      However, the commenter then gives an example of a natural ceiling rather than a measurement ceiling. Natural ceilings (e.g., decreasing changes in weight loss, long jump, reaction time, etc. as the values become more extreme) do translate into nonlinearity, but due to physiological limitations rather than measurement ceilings. That said, the above quote still holds true because of the measurement ceiling, which masks the upper-bound variability among students who could have scored higher but inflates the relative lower-bound variability due to missing a question (whether from carelessness, a bad day, or bad luck in the question selection for the test). These students have more opportunities to be hurt by bad luck than helped by good luck because the test imposes a ceiling (doesn’t ask all the harder questions which they perhaps could have answered).

    • Unequal responses to feedback: The students and teachers all know that some grade transitions are more important than others. Just as students invest extra effort to turn an F into a D, so do teachers invest extra resources in moving students from below-basic to basic scores.
      More generally, a fundamental tenet of assessment is to inform the students in advance of the grading expectations. That means that there will always be nonlinearity, since now the students (and teachers) are “boundary-conscious” and behaving in ways to deliberately try to cross (or not cross) certain boundaries.
  • Definition of “value”: The value-added model described compares students’ current scores against predictions based on their prior-year scores. That implies that earning a 3 in 4th grade has no more value than earning a 3 in 3rd grade. As noted in this comment:

    There appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

    Perhaps students can earn the same (high or moderate) score year after year on badly designed tests simply through good test-taking strategies, but presumably the tests being used in these models are believed to measure actual learning. A teacher who helps “proficient” students earn “proficient” scores the next year is still teaching them something worthwhile, even if there’s room for more improvement.

These criticisms can be addressed by several recommendations:

  1. Margin of error. Don’t base high-stakes decisions on highly uncertain metrics.
  2. Reversion to the mean. Use multiple measures. These could be estimates across multiple years (as in multiyear smoothing, as another commenter suggested), or values from multiple different assessments.
  3. Few grading categories. At the very least, use more scoring categories. Better yet, use the raw scores.
  4. Ceiling effect. Use tests with a higher ceiling. This could be an interesting application for using a form of dynamic assessment for measuring learning potential, although that might be tricky from a psychometric or educational measurement perspective.
  5. Nonlinearity of feedback. Draw from a broader pool of assessments that measure learning in a variety of ways, to discourage “gaming the system” on just one test (being overly sensitive to one set of arbitrary scoring boundaries).
  6. Definition of “value.” Change the baseline expectation (either in the model itself or in the interpretation of its results) to reflect the reality that earning the same score on a harder test actually does demonstrate learning.

Those are just the statistical issues. Don’t forget all the other problems we’ve mentioned, especially: the flaws in applying aggregate inferences to the individual; the imperfect link between student performance and teacher effectiveness; the lack of usable information provided to teachers; and the importance of attracting, training, and retaining good teachers.

Some history and context on VAM in teacher evaluation

In the Columbia Journalism Review’s Tested: Covering schools in the age of micro-measurement, LynNell Hancock provides a rich survey of the history and context of the current debate over value-added modeling in teacher evaluation, with a particular focus on LA and NY.

Here are some key points from the critique:

1. In spite of their complexity, value-added models are based on very limited sources of data: who taught the students, without regard to how or under what conditions, and standardized tests, which are a very narrow and imperfect measure of learning,

No allowance is made for many “inside school” factors… Since the number is based on manipulating one-day snapshot tests—the value of which is a matter of debate—what does it really measure?

2. Value-added modeling is an imprecise method whose parameters and outcomes are highly dependent on the assumptions built into the model.

In February, two University of Colorado, Boulder researchers caused a dustup when they called the Times’s data “demonstrably inadequate.” After running the same data through their own methodology, controlling for added factors such as school demographics, the researchers found about half the reading teachers’ scores changed. On the extreme ends, about 8 percent were bumped from ineffective to effective, and 12 percent bumped the other way. To the researchers, the added factors were reasonable, and the fact that they changed the results so dramatically demonstrated the fragility of the value-added method.

3. Value-added modeling is inappropriate to use as grounds for firing teachers or calculating merit pay.

Nearly every economist who weighed in agreed that districts should not use these indicators to make high-stakes decisions, like whether to fire teachers or add bonuses to paychecks.

Further, it’s questionable how effective it is as a policy to focus simply on individual teacher quality, when poverty has a greater impact on a child’s learning:

The federal Coleman Report issued [in 1966] found that a child’s family economic status was the most telling predictor of school achievement. That stubborn fact remains discomfiting—but undisputed—among education researchers today.

These should all be familiar concerns by now. What this article adds is a much richer picture of the historical and political context for the many players in the debate. I’m deeply disturbed that NYS Supreme Court Judge Cynthia Kern ruled that “there is no requirement that data be reliable for it to be disclosed.” At least Trontz at the NY Times acknowledges the importance of publishing reliable information as opposed to spurious claims, except he seems to overlook all the arguments against the merits of the data:

If we find the data is so completely botched, or riddled with errors that it would be unfair to release it, then we would have to think very long and hard about releasing it.

That’s the whole point: applying value-added modeling to standardized test scores to fire or reward teachers is unreliable to the point of being unfair. Adding noise and confusion to the conversation isn’t “a net positive,” as Arthur Browne from The Daily News seems to believe; it degrades the discussion, at great harm to the individual teachers, their students, the institutions that house them, and the society that purports to sustain them and benefit from them.

%d bloggers like this: