Not all uses of data are equal

Gil Press worries that “big data enthusiasts may encourage (probably unintentionally) a new misguided belief, that ‘putting data in front of the teacher’ is in and of itself a solution [to what ails education today].”

As an advocate for the better use of educational data and learning analytics to serve teachers, I worry about careless endorsements and applications of “big data” that overlook these concerns:

1. Available data are not always the most important data.
2. Data should motivate providing support, not merely accountability.
3. Teachers are neither scientists nor laypeople in their use of data. They rely on data constantly, but they need representations they can readily interpret and turn into action.

Assessment specialists have long noted the many uses of assessment data; all educational data should be weighed as carefully, and even more so when used at large scale, which magnifies the influence of errors.

Linda Darling-Hammond on TFA and teacher preparation

Linda Darling-Hammond’s article on teacher preparation in this week’s EdWeek should be required reading for anyone interested in education policy. The article was written in recognition of Teach for America’s 20th anniversary.

Yes, my vision is that in 10 years, the United States, like other high-achieving nations, will recruit top teaching candidates, prepare them well in state-of-the-art training programs (free of charge), and support them for career-long success in high-quality schools. Today, by contrast, teachers go into debt to enter a career that pays noticeably less than their alternatives—especially if they work in high-poverty schools—and reach the profession through a smorgasbord of training options, from excellent to awful, often followed by little mentoring or help. As a result, while some teachers are well prepared, many students in needy schools experience a revolving door of inexperienced and underprepared teachers.

Darling-Hammond is probably best known for her criticism of Teach for America’s crash-course, full-steam-ahead approach to teacher preparation. She goes on to criticize the cost/benefit ratio of the program.

Where some studies have shown better outcomes for TFA teachers—generally in high school, in mathematics, and in comparison with less prepared teachers in the same high-need schools—others have found that students of new TFA teachers do less well than those of fully prepared beginners, especially in elementary grades, in fields such as reading, and with Latino students and English-language learners.

The small number of TFA-ers who stay in teaching (fewer than 20 percent by year four, according to state and district data) do become as effective as other fully credentialed teachers and, often, more effective in teaching mathematics. However, this small yield comes at substantial cost to the public for recruitment, training, and replacement. A recent estimate places recurring costs at more than $70,000 per recruit, enough to have trained numerous effective career teachers.

She doesn’t provide a source for the last figure, unfortunately, and it’s not clear exactly what is included in the recurring cost per TFA recruit. Even setting that aside, TFA teachers are more expensive than traditionally certified teachers, are not obviously more effective, and leave the profession at higher rates.

Darling-Hammond is not anti-TFA. She simply believes that TFA (and the rest of public education) would better serve students by focusing more on quality teacher preparation: preparation that sets the stage for a lifelong career of successful teaching, not just a two-year commitment.

TFA teachers are committed, work hard, and want to do a good job. Many want to stay in the profession, but feel their lack of strong preparation makes it difficult to do so. For these reasons, alumni like Megan Hopkins have proposed that TFA evolve into a teacher-residency model that would offer recruits a full year of training under the wing of an expert urban teacher while completing tightly connected coursework for certification. Such teacher residencies, operating as partnerships with universities in cities like Chicago, Boston, and Denver, have produced strong urban teachers who stay in the profession at rates of more than 80 percent, as have many universities that have developed new models of recruitment and training.

On the occasion of its 20th anniversary, we should be building on what works for TFA and marrying it to what works for dozens of strong preparation programs to produce the highly qualified, effective teachers we need for every community in the 21st century.

Statistical issues with applying VAM

There’s a wonderful statistical discussion of Michael Winerip’s NYT article critiquing the use of value-added modeling in evaluating teachers, which I referenced in a previous post. I wanted to highlight some of the key statistical errors in that discussion, since I think these are important and understandable concepts for the general public to consider.

  • Margin of error: Ms. Isaacson’s 7th percentile score actually ranged from 0 to 52, yet the state is disregarding that uncertainty in making its employment recommendations. This is why I dislike the article’s headline, or more generally the saying, “Numbers don’t lie.” No, they don’t lie, but they do approximate, and can thus mislead, if those approximations aren’t adequately conveyed and recognized.
  • Reversion to the mean: (You may be more familiar with this concept as “regression to the mean,” but since it applies more broadly than linear regression, “reversion” is a more suitable term.) A single measurement can be influenced by many randomly varying factors, so one extreme value could reflect an unusual cluster of chance events. Measuring it again is likely to yield a value closer to the mean, simply because those chance events are unlikely to coincide again to produce another extreme value. Ms. Isaacson’s students could have been lucky in their high scores the previous year, causing their scores in the subsequent year to look low compared to predictions. (The simulation sketch after this list makes this, and the grading-scale issues below, concrete.)
  • Using only 4 discrete categories (or ranks) for grades:
    • The first problem with this is the imprecision that results. The model exaggerates the impact of between-grade transitions (e.g., improving from a 3 to a 4) but ignores within-grade changes (e.g., improving from a low 3 to a high 3).
    • The second problem is that this exacerbates the nonlinearity of the assessment (discussed next). When score changes that cross a grade boundary are more likely than changes that don’t, having so few possible grade transitions further inflates their impact.
      Another instantiation of this problem is that the imprecision also exaggerates the ceiling effects mentioned below, in that benefits to students already earning the maximum score become invisible (as noted in a comment by journalist Steve Sailer):

      Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.
      But, if she is, the formula can’t measure it because 4 is the highest score you can get.

  • Nonlinearity: Not all grade transitions are equally likely, but the model treats them as such. Here are two major reasons why some transitions are more likely than others.
    • Measurement ceiling effects: Improving at the top of the range is more difficult and less likely than improving in the middle of the range, as discussed in this comment:

      Going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

      However, the commenter then gives an example of a natural ceiling rather than a measurement ceiling. Natural ceilings (e.g., the diminishing gains in weight loss, the long jump, or reaction time as the values become more extreme) do translate into nonlinearity, but because of physiological limits rather than measurement ceilings. That said, the quote above still holds true because of the measurement ceiling, which masks the upper-bound variability among students who could have scored higher while inflating the relative lower-bound variability from missing a question (whether through carelessness, a bad day, or bad luck in the questions selected for the test). These students have more opportunities to be hurt by bad luck than helped by good luck, because the test imposes a ceiling and doesn’t ask the harder questions they perhaps could have answered.

    • Unequal responses to feedback: The students and teachers all know that some grade transitions are more important than others. Just as students invest extra effort to turn an F into a D, so do teachers invest extra resources in moving students from below-basic to basic scores.
      More generally, a fundamental tenet of assessment is to inform students of the grading expectations in advance. That means there will always be nonlinearity, since students (and teachers) become “boundary-conscious” and deliberately try to cross (or avoid crossing) certain boundaries.
  • Definition of “value”: The value-added model described compares students’ current scores against predictions based on their prior-year scores. That implies that earning a 3 in 4th grade has no more value than earning a 3 in 3rd grade. As noted in this comment:

    There appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

    Perhaps students can earn the same (high or moderate) score year after year on badly designed tests simply through good test-taking strategies, but presumably the tests being used in these models are believed to measure actual learning. A teacher who helps “proficient” students earn “proficient” scores the next year is still teaching them something worthwhile, even if there’s room for more improvement.
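
The reversion and grading-scale problems are easy to make concrete with a quick simulation. The sketch below uses invented numbers (a normal ability distribution, an arbitrary noise level, and made-up cut scores), not the state’s actual model, just to show the shape of the two effects:

```python
# A minimal sketch (invented numbers, not the state's model) of two issues above:
# reversion to the mean, and the information lost by a 4-category scale with a hard ceiling.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

ability = rng.normal(size=n)                      # each student's stable achievement level
year1 = ability + rng.normal(scale=0.7, size=n)   # observed score = ability + year-specific luck
year2 = ability + rng.normal(scale=0.7, size=n)   # teaching is identical both years, by construction

# Reversion to the mean: students who looked unusually strong in year 1 look "worse" in
# year 2, even though nothing about their ability or teaching changed.
lucky = year1 > np.quantile(year1, 0.9)
print(year1[lucky].mean(), year2[lucky].mean())   # the year-2 mean falls back toward the overall mean

# Coarse grades with a ceiling: give every student a genuine gain, then bin into grades 1-4.
cuts = [-1.0, 0.0, 1.0]                           # hypothetical cut scores between grades
grade_before = np.digitize(year1, cuts) + 1
grade_after = np.digitize(year1 + 0.4, cuts) + 1  # every student truly improves by 0.4
print((grade_after > grade_before).mean())        # only a minority of the real gains register
at_ceiling = grade_before == 4
print((grade_after[at_ceiling] > grade_before[at_ceiling]).mean())  # always 0.0: no room to show gains
```

With these invented numbers, the top decile’s average falls noticeably in year 2 despite identical teaching, roughly two-thirds of a uniform real improvement never shows up in the 4-point grades, and none of it shows up for students already at the top grade.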

These criticisms can be addressed by several recommendations:

  1. Margin of error. Don’t base high-stakes decisions on highly uncertain metrics.
  2. Reversion to the mean. Use multiple measures. These could be estimates across multiple years (the multiyear smoothing another commenter suggested) or values from multiple different assessments; the sketch after this list shows how much averaging across years narrows the spread.
  3. Few grading categories. At the very least, use more scoring categories. Better yet, use the raw scores.
  4. Ceiling effect. Use tests with a higher ceiling. This could be an interesting application for using a form of dynamic assessment for measuring learning potential, although that might be tricky from a psychometric or educational measurement perspective.
  5. Nonlinearity of feedback. Draw from a broader pool of assessments that measure learning in a variety of ways, to discourage “gaming the system” on just one test (being overly sensitive to one set of arbitrary scoring boundaries).
  6. Definition of “value.” Change the baseline expectation (either in the model itself or in the interpretation of its results) to reflect the reality that earning the same score on a harder test actually does demonstrate learning.
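
To see why recommendation 2 helps, here is a minimal sketch with an invented effect size and noise level (not real data): averaging independent yearly estimates shrinks the spread around a teacher’s true effect by roughly the square root of the number of years.

```python
# A rough illustration of recommendation 2, with invented numbers rather than real data:
# averaging a teacher's noisy yearly value-added estimates narrows the uncertainty band.
import numpy as np

rng = np.random.default_rng(2)
true_effect = 0.10                         # the teacher's (unknown) true value-added
noise_sd = 0.30                            # assumed year-to-year noise in a single estimate

one_year = true_effect + rng.normal(scale=noise_sd, size=10_000)
three_year = true_effect + rng.normal(scale=noise_sd, size=(10_000, 3)).mean(axis=1)

print(one_year.std())      # ~0.30: a single-year estimate is mostly noise around a small effect
print(three_year.std())    # ~0.17: a three-year average cuts the spread by about 1/sqrt(3)
```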

Those are just the statistical issues. Don’t forget all the other problems we’ve mentioned, especially: the flaws in applying aggregate inferences to the individual; the imperfect link between student performance and teacher effectiveness; the lack of usable information provided to teachers; and the importance of attracting, training, and retaining good teachers.

Using student evaluations to measure teaching effectiveness

I came across a fascinating discussion on the use of student evaluations to measure teaching effectiveness via this Observational Epidemiology blog post by Mark, a statistical consultant. The original paper by Scott Carrell and James West uses value-added modeling to estimate teachers’ contributions to students’ grades in introductory courses and in subsequent courses, then analyzes the relationship between those contributions and student evaluations. (An ungated version of the paper is also available.) Key conclusions are:

Student evaluations are positively correlated with contemporaneous professor value‐added and negatively correlated with follow‐on student achievement. That is, students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory course professor value‐added in follow‐on courses).

We find that less experienced and less qualified professors produce students who perform significantly better in the contemporaneous course being taught, whereas more experienced and highly qualified professors produce students who perform better in the follow‐on related curriculum.

Not having closely followed the research on this, I’ll simply note some key comments from other blogs.

Direct examination:

Several have posted links that suggest an endorsement of this paper’s conclusion, such as George Mason University professor of economics Tyler Cowen, Harvard professor of economics Greg Mankiw, and Northwestern professor of managerial economics Sandeep Baliga. Michael Bishop, a contributor to Permutations (“official blog of the Mathematical Sociology Section of the American Sociological Association”), provides some more detail in his analysis:

In my post on Babcock’s and Marks’ research, I touched on the possible unintended consequences of student evaluations of professors.  This paper gives new reasons for concern (not to mention much additional evidence, e.g. that physical attractiveness strongly boosts student evaluations).

That said, the scary thing is that even with random assignment, rich data, and careful analysis there are multiple, quite different, explanations.

The obvious first possibility is that inexperienced professors, (perhaps under pressure to get good teaching evaluations) focus strictly on teaching students what they need to know for good grades.  More experienced professors teach a broader curriculum, the benefits of which you might take on faith but needn’t because their students do better in the follow-up course!

After citing this alternative explanation from the authors:

Students of low value added professors in the introductory course may increase effort in follow-on courses to help “erase” their lower than expected grade in the introductory course.

Bishop also notes that motivating students to invest more effort in future courses would be a desirable effect of good professors as well. (But how to distinguish between “good” and “bad” methods for producing this motivation isn’t obvious.)

Cross-examination:

Others critique the article and defend the usefulness of student evaluations with observations that provoke further fascinating discussions.

Andrew Gelman, Columbia professor of statistics and political science, expresses skepticism about the claims:

Carrell and West estimate that the effects of instructors on performance in the follow-on class is as large as the effects on the class they’re teaching. This seems hard to believe, and it seems central enough to their story that I don’t know what to think about everything else in the paper.

At Education Sector, Forrest Hinton expresses strong reservations about the conclusions and the methods:

If you’re like me, you are utterly perplexed by a system that would mostly determine the quality of a Calculus I instructor by students’ performance in a Calculus II or aeronautical engineering course taught by a different instructor, while discounting students’ mastery of Calculus I concepts.

The trouble with complex value-added models, like the one used in this report, is that the number of people who have the technical skills necessary to participate in the debate and critique process is very limited—mostly to academics themselves, who have their own special interests.

Jeff Ely, Northwestern professor of economics, objects to the authors’ interpretation of their results:

I don’t see any way the authors have ruled out the following equally plausible explanation for the statistical findings.  First, students are targeting a GPA.  If I am an outstanding teacher and they do unusually well in my class they don’t need to spend as much effort in their next class as those who had lousy teachers, did poorly this time around, and have some catching up to do next time.  Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.

In agreement, Ed Dolan, an economist who was also for ten years “a teacher and administrator in a graduate business program that did not have tenure,” comments on Jeff Ely’s blog:

I reject the hypothesis that students give high evaluations to instructors who dumb down their courses, teach to the test, grade high, and joke a lot in class. On the contrary, they resent such teachers because they are not getting their money’s worth. I observed a positive correlation between overall evaluation scores and a key evaluation-form item that indicated that the course required more work than average. Informal conversations with students known to be serious tended to confirm the formal evaluation scores.

Re-direct:

Dean Eckles, a PhD candidate at Stanford’s CHIMe lab, offers this response to Andrew Gelman’s blog post (linked above):

Students like doing well on tests etc. This happens when the teacher is either easier (either through making evaluations easier or teaching more directly to the test) or more effective.

Conditioning on this outcome, is conditioning on a collider that introduces a negative dependence between teacher quality and other factors affecting student satisfaction (e.g., how easy they are).
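
Eckles’s collider point is subtle, so here is a toy simulation of it, with entirely made-up variables rather than anything from Carrell and West’s data: teacher quality and course “easiness” are independent overall, but once you look only at students who did well (the collider), the two become negatively related.

```python
# A toy sketch of the collider argument (made-up variables, not Carrell and West's model).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

quality = rng.normal(size=n)        # professor's deep-teaching quality
easiness = rng.normal(size=n)       # grading leniency / teaching to the test
score = quality + easiness + rng.normal(scale=0.5, size=n)  # contemporaneous performance
# Student satisfaction ("doing well") tracks the contemporaneous score.

print(np.corrcoef(quality, easiness)[0, 1])   # ~0: independent by construction

did_well = score > np.quantile(score, 0.75)   # condition on the collider
print(np.corrcoef(quality[did_well], easiness[did_well])[0, 1])   # clearly negative
```

Among the students who did well, a professor who was not particularly easy is more likely to have been genuinely good (and vice versa), so quality and easiness look like substitutes even though they are unrelated overall.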

From Jeff Ely’s blog, a comment by Brian Moore raises this critical question:

“Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.”

Do we know this for sure? Perhaps they know when they have an outstanding teacher, but by definition, those are relatively few.

Closing thoughts:

These discussions raise many key questions, among them:

  • how to measure good teaching;
  • tensions between short-term and long-term assessment and evaluation[1];
  • how well students’ grades measure learning, and how grades impact their perception of learning;
  • the relationship between learning, motivation, and affect (satisfaction);
  • but perhaps most deeply, the question of student metacognition.

The anecdotal comments others have offered about how students respond on evaluations would be more fairly couched in terms of “some students.” Given the considerable variability among students, interpreting student evaluations needs to account for those individual differences in teasing out the actual teaching and learning that underlie self-reported perceptions. Buried within those evaluations may be a valuable signal masked by a lot of noise, or, more problematically, multiple signals that cancel and drown each other out.

[1] For example, see this review of research demonstrating that training which produces better short-term performance can produce worse long-term learning:
Schmidt, R.A., & Bjork, R.A. (1992). New conceptualizations of practice: Common principles in three paradigms suggest new concepts for training. Psychological Science, 3, 207-217.

In-school vs. non-school factors

From How to fix our schools, in which Richard Rothstein, of the Economic Policy Institute, critiques Joel Klein’s and Michelle Rhee’s approach of focusing only on firing incompetent teachers as a means to improve schools:

“Differences in school quality can explain about 1/3 of the variation in student achievement. But the other 2/3 come from non-school factors.” In-school factors go beyond teacher quality: school leadership, curriculum quality, and teacher collaboration all matter. Non-school factors include the economic consequences of parental underemployment, such as geographic disruption, malnutrition, stress, and poor health.

But what do the data say?

“Perhaps this is the time for a counter-reformation” summarizes some choice tidbits on charter schools, test-based metrics & value-added modeling, and performance-based pay and firing, from a statistician’s perspective.

On charter schools:

The majority of the 5,000 or so charter schools nationwide appear to be no better, and in many cases worse, than local public schools when measured by achievement on standardized tests.

On value-added modeling:

A study [using VAM] found that students’ fifth grade teachers were good predictors of their fourth grade test scores… [which] can only mean that VAM results are based on factors other than teachers’ actual effectiveness.
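
One way such a backwards “prediction” can arise is nonrandom sorting of students into classrooms. The following is a hypothetical sketch of that falsification logic, not the cited study’s analysis: if students are tracked into fifth-grade classrooms partly by prior achievement, the fifth-grade teacher assignment will “explain” fourth-grade scores even when, by construction, teachers have no effect at all.

```python
# Hypothetical sketch: under tracking, 5th-grade classroom assignment "predicts" 4th-grade
# scores even though, by construction, no teacher affects any student's score.
import numpy as np

rng = np.random.default_rng(3)
n_students, n_classes = 3000, 100

ability = rng.normal(size=n_students)
grade4 = ability + rng.normal(scale=0.5, size=n_students)        # 4th-grade test score

# Tracking: sort students by 4th-grade score and fill 5th-grade classrooms in order.
teacher = np.empty(n_students, dtype=int)
teacher[np.argsort(grade4)] = np.repeat(np.arange(n_classes), n_students // n_classes)

# Share of 4th-grade score variance "explained" by the 5th-grade teacher assignment.
class_means = np.array([grade4[teacher == t].mean() for t in range(n_classes)])
print(class_means[teacher].var() / grade4.var())   # near 1 here; positive under any nonrandom sorting
```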

On performance-based pay and firing:

There is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones.

[A study] conducted by the National Center on Performance Incentives at Vanderbilt… found no significant difference between the test results from classes led by teachers eligible for bonuses and those led by teachers who were ineligible.

In summary:

Just for the record, I believe that charter schools, increased use of metrics, merit pay and a streamlined process for dismissing bad teachers do have a place in education, but all of these things can do more harm than good if badly implemented and, given the current state of the reform movement, badly implemented is pretty much the upper bound.

I’m less pessimistic than Mark is about the quality of implementation of these initiatives, but I agree that how effectively well-intentioned reforms are implemented is always a crucial concern.

Instruction matters, and community matters

On “In Massachusetts, Brockton High Becomes Success Story“:

Engaging the community of teachers and students can be much more effective than stripping it down and weeding it out. Bottom line: “Achievement rose when leadership teams focused thoughtfully and relentlessly on improving the quality of instruction.”

Concerns about the LA Times teacher ratings

On “L.A. Times analysis rates teachers’ effectiveness“:

A Times analysis, using data largely ignored by LAUSD, looks at which educators help students learn, and which hold them back.

I’m a huge fan of organizing, analyzing, and sharing data, but I have real concerns about finding the best means of conveying and acting upon those results: not just data quality (what gets assessed, how scores are calculated and weighted), but also contextualizing results (triangulation with qualitative data) and professional development (social comparison, ongoing support).

The importance of good teachers for all

On “Poor quality teachers may prevent children from reaching reading potential“:

However much we all might wish that great teachers leveled the playing field and brought every student up to impressive levels of achievement, this paper suggests otherwise: bad teaching hinders all students, and good teaching helps all students, just not necessarily by the same amount or to the same level. It also contradicts the notion that capable students simply teach themselves and don’t need good teachers.

Quite simply, good teaching helps all students reach their potential.

This also implies that we should be careful in how we measure achievement gaps: variability in meeting basic skill levels (which we may reasonably expect of all students) is problematic, but overall variability (in reaching higher levels of achievement) may actually be a sign of good teaching.

(Full article available via subscription at http://www.sciencemag.org/cgi/content/full/328/5977/512?rss=1.)

On good teachers

On “What Makes a Good Teacher?“:

For years, the secrets to great teaching have seemed more like alchemy than science, a mix of motivational mumbo jumbo and misty-eyed tales of inspiration and dedication. But for more than a decade, one organization has been tracking hundreds of thousands of kids, and looking at why some teachers can move them three grade levels ahead in a year and others can’t. Now, as the Obama administration offers states more than $4 billion to identify and cultivate effective teachers, Teach for America is ready to release its data.

Really fascinating; I’m looking forward to reading TFA’s report. I wish the journalist had maintained her focus on reporting behaviors and attitudes (monitoring understanding, setting and striving for goals, grit) rather than traits and past history (GPA, leadership), but there’s still a great list in there. I hope that policy will also attend closely to these process variables.

Note that the research draws on an aggregate sample across master’s programs. There could well be distinguishing factors in some master’s programs that are beneficial, which is why I’d want more process data. It’s true that they focused on the classroom (given the emphasis on what teachers do), but they did at least mention the importance of having teachers reach out to students’ families. It would be interesting to find out what kind of impact parent-education programs could have.

Wouldn’t it be wonderful to read an article about how teachers learn and improve, instead of what makes them “great”?
