Not all uses of data are equal

Gil Press worries that “big data enthusiasts may encourage (probably unintentionally) a new misguided belief, that ‘putting data in front of the teacher’ is in and of itself a solution [to what ails education today].”

As an advocate for the better use of educational data and learning analytics to serve teachers, I worry about careless endorsements and applications of “big data” that overlook these concerns:

1. Available data are not always the most important data.
2. Data should motivate providing support, not merely accountability.
3. Teachers are neither scientists nor laypeople in their use of data. They rely on data constantly, but need representations that they can interpret and turn into action readily.

Assessment specialists have long noted the many uses of assessment data; all educational data should be weighed as carefully, even more so when implemented at a large scale, which magnifies the influence of errors.

Linda Darling-Hammond on TFA and teacher preparation

Linda Darling-Hammond’s article on teacher preparation in this week’s EdWeek should be required reading for anyone interested in education policy. The article was written in recognition of Teach for America’s 20th anniversary.

Yes, my vision is that in 10 years, the United States, like other high-achieving nations, will recruit top teaching candidates, prepare them well in state-of-the-art training programs (free of charge), and support them for career-long success in high-quality schools. Today, by contrast, teachers go into debt to enter a career that pays noticeably less than their alternatives—especially if they work in high-poverty schools— and reach the profession through a smorgasbord of training options, from excellent to awful, often followed by little mentoring or help. As a result, while some teachers are well prepared, many students in needy schools experience a revolving door of inexperienced and underprepared teachers.

Darling-Hammond is probably best known for her criticism of Teach for America’s crash-course, full-steam-ahead approach to teacher preparation. She goes on to criticize the cost/benefit ratio of the program.

Where some studies have shown better outcomes for TFA teachers—generally in high school, in mathematics, and in comparison with less prepared teachers in the same high-need schools—others have found that students of new TFA teachers do less well than those of fully prepared beginners, especially in elementary grades, in fields such as reading, and with Latino students and English-language learners.

The small number of TFA-ers who stay in teaching (fewer than 20 percent by year four, according to state and district data) do become as effective as other fully credentialed teachers and, often, more effective in teaching mathematics. However, this small yield comes at substantial cost to the public for recruitment, training, and replacement. A recent estimate places recurring costs at more than $70,000 per recruit, enough to have trained numerous effective career teachers.

She doesn’t provide a source for the last figure, unfortunately, and it’s not clear what exactly is being included in the recurring cost per TFA recruit. Even disregarding that, it is still true that TFA teachers are more expensive than traditionally-certified teachers, not obviously more effective, and they leave the profession in higher numbers.

Darling-Hammond is not anti-TFA. She just believes that it (and the rest of public education) would better serve students if it focused more on quality teacher preparation: preparation that sets the stage for a lifelong career of successful teaching, not just a two-year commitment.

TFA teachers are committed, work hard, and want to do a good job. Many want to stay in the profession, but feel their lack of strong preparation makes it difficult to do so. For these reasons, alumni like Megan Hopkins have proposed that TFA evolve into a teacher-residency model that would offer recruits a full year of training under the wing of an expert urban teacher while completing tightly connected coursework for certification. Such teacher residencies, operating as partnerships with universities in cities like Chicago, Boston, and Denver, have produced strong urban teachers who stay in the profession at rates of more than 80 percent, as have many universities that have developed new models of recruitment and training.

On the occasion of its 20th anniversary, we should be building on what works for TFA and marrying it to what works for dozens of strong preparation programs to produce the highly qualified, effective teachers we need for every community in the 21st century.

Statistical issues with applying VAM

There’s a wonderful statistical discussion of Michael Winerip’s NYT article critiquing the use of value-added modeling in evaluating teachers, which I referenced in a previous post. I wanted to highlight some of the key statistical errors raised in that discussion, since I think these are important and understandable concepts for the general public to consider.

  • Margin of error: Ms. Isaacson’s 7th percentile score actually ranged from 0 to 52, yet the state is disregarding that uncertainty in making its employment recommendations. This is why I dislike the article’s headline, or more generally the saying, “Numbers don’t lie.” No, they don’t lie, but they do approximate, and can thus mislead, if those approximations aren’t adequately conveyed and recognized.
  • Reversion to the mean: (You may be more familiar with this concept as “regression to the mean,” but since it applies more broadly than linear regression, “reversion” is a more suitable term.) A single measurement can be influenced by many randomly varying factors, so one extreme value could reflect an unusual cluster of chance events. Measuring it again is likely to yield a value closer to the mean, simply because those chance events are unlikely to coincide again to produce another extreme value. Ms. Isaacson’s students could have been lucky in their high scores the previous year, causing their scores in the subsequent year to look low compared to predictions.
  • Using only 4 discrete categories (or ranks) for grades:
    • The first problem with this is the imprecision that results. The model exaggerates the impact of between-grade transitions (e.g., improving from a 3 to a 4) but ignores within-grade changes (e.g., improving from a low 3 to a high 3).
    • The second problem is that this exacerbates the nonlinearity of the assessment (discussed next). When changes that produce grade transitions are more likely than changes that don’t produce grade transitions, having so few possible grade transitions further inflates their impact.
      Another instantiation of this problem is that the imprecision also exaggerates the ceiling effects mentioned below, in that benefits to students already earning the maximum score become invisible, as noted in a comment by journalist Steve Sailer:

      Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.
      But, if she is, the formula can’t measure it because 4 is the highest score you can get.

  • Nonlinearity: Not all grade transitions are equally likely, but the model treats them as such. Here are two major reasons why some transitions are more likely than others.
    • Measurement ceiling effects: Improving at the top range is more difficult and unlikely than improving in the middle range, as discussed in this comment:

      Going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

      However, the commenter then gives an example of a natural ceiling rather than a measurement ceiling. Natural ceilings (e.g., decreasing changes in weight loss, long jump, reaction time, etc. as the values become more extreme) do translate into nonlinearity, but due to physiological limitations rather than measurement ceilings. That said, the above quote still holds true because of the measurement ceiling, which masks the upper-bound variability among students who could have scored higher but inflates the relative lower-bound variability due to missing a question (whether from carelessness, a bad day, or bad luck in the question selection for the test). These students have more opportunities to be hurt by bad luck than helped by good luck because the test imposes a ceiling (doesn’t ask all the harder questions which they perhaps could have answered).

    • Unequal responses to feedback: The students and teachers all know that some grade transitions are more important than others. Just as students invest extra effort to turn an F into a D, so do teachers invest extra resources in moving students from below-basic to basic scores.
      More generally, a fundamental tenet of assessment is to inform the students in advance of the grading expectations. That means that there will always be nonlinearity, since now the students (and teachers) are “boundary-conscious” and behaving in ways to deliberately try to cross (or not cross) certain boundaries.
  • Definition of “value”: The value-added model described compares students’ current scores against predictions based on their prior-year scores. That implies that earning a 3 in 4th grade has no more value than earning a 3 in 3rd grade. As noted in this comment:

    There appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

    Perhaps students can earn the same (high or moderate) score year after year on badly designed tests simply through good test-taking strategies, but presumably the tests being used in these models are believed to measure actual learning. A teacher who helps “proficient” students earn “proficient” scores the next year is still teaching them something worthwhile, even if there’s room for more improvement.
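To make the reversion-to-the-mean point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the model (observed score = stable ability + independent luck), the noise level, and the function name `simulate_reversion` are mine, not part of the state's value-added formula. It picks the students with the highest scores in one year and checks how the same students score the next year when nothing about them has changed.

```python
import random
import statistics

def simulate_reversion(n_students=10_000, noise_sd=1.0, seed=0):
    """Return (mean year-1 score, mean year-2 score) for the year-1 top decile."""
    rng = random.Random(seed)
    # Hypothetical model: observed score = stable ability + independent luck.
    ability = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    year1 = [a + rng.gauss(0.0, noise_sd) for a in ability]
    year2 = [a + rng.gauss(0.0, noise_sd) for a in ability]

    # Select the students whose year-1 scores fell in the top 10%.
    cutoff = sorted(year1)[int(0.9 * n_students)]
    top = [i for i in range(n_students) if year1[i] >= cutoff]

    return (statistics.mean(year1[i] for i in top),
            statistics.mean(year2[i] for i in top))

top_year1, top_year2 = simulate_reversion()
print(f"top decile, year 1:    {top_year1:.2f}")
print(f"same students, year 2: {top_year2:.2f}")
```

Under these assumptions the second-year average for that group should land well below its first-year average but still above the overall mean, even though no student's underlying ability changed. A model that predicts year 2 from year 1 without allowing for this would tend to blame the year-2 teacher for the drop.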

These criticisms can be addressed by several recommendations:

  1. Margin of error. Don’t base high-stakes decisions on highly uncertain metrics.
  2. Reversion to the mean. Use multiple measures. These could be estimates across multiple years (as in multiyear smoothing, as another commenter suggested), or values from multiple different assessments.
  3. Few grading categories. At the very least, use more scoring categories. Better yet, use the raw scores.
  4. Ceiling effect. Use tests with a higher ceiling. This could be an interesting application for using a form of dynamic assessment for measuring learning potential, although that might be tricky from a psychometric or educational measurement perspective.
  5. Nonlinearity of feedback. Draw from a broader pool of assessments that measure learning in a variety of ways, to discourage “gaming the system” on just one test (being overly sensitive to one set of arbitrary scoring boundaries).
  6. Definition of “value.” Change the baseline expectation (either in the model itself or in the interpretation of its results) to reflect the reality that earning the same score on a harder test actually does demonstrate learning.
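As a rough illustration of recommendations 3 and 4, the sketch below compares gains measured in levels with gains measured in raw points. The 0-100 raw scale, the cut points, and the helper names `level` and `level_gain` are hypothetical, chosen only to show the mechanism; they are not taken from any actual state test.

```python
import bisect

# Hypothetical cut points on a 0-100 raw scale, chosen only for illustration.
CUT_POINTS = [40, 60, 80]  # raw >= 80 maps to level 4, the maximum

def level(raw_score):
    """Map a raw score to a level from 1 to 4."""
    return bisect.bisect_right(CUT_POINTS, raw_score) + 1

def level_gain(before, after):
    """Gain as the level-based model would see it."""
    return level(after) - level(before)

cases = {
    "low 3 to high 3": (61, 79),   # large raw gain, no level change
    "high 3 to low 4": (79, 80),   # tiny raw gain, full level change
    "already a 4":     (85, 99),   # gain hidden by the ceiling
}
for name, (before, after) in cases.items():
    print(f"{name}: raw gain {after - before:+d}, level gain {level_gain(before, after):+d}")
```

With these made-up cut points, an 18-point improvement inside level 3 registers as zero, a 1-point improvement across the boundary registers as a full level, and any improvement among students already at level 4 is invisible. Using raw scores, or a test with a higher ceiling, would recover that information.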

Those are just the statistical issues. Don’t forget all the other problems we’ve mentioned, especially: the flaws in applying aggregate inferences to the individual; the imperfect link between student performance and teacher effectiveness; the lack of usable information provided to teachers; and the importance of attracting, training, and retaining good teachers.

Using student evaluations to measure teaching effectiveness

I came across a fascinating discussion on the use of student evaluations to measure teaching effectiveness upon following this Observational Epidemiology blog post by Mark, a statistical consultant. The original paper by Scott Carrell and James West uses value-added modeling to estimate teachers’ contributions to students’ grades in introductory courses and in subsequent courses, then analyzes the relationship between those contributions and student evaluations. (An ungated version of the paper is also available.) Key conclusions are:

Student evaluations are positively correlated with contemporaneous professor value‐added and negatively correlated with follow‐on student achievement. That is, students appear to reward higher grades in the introductory course but punish professors who increase deep learning (introductory course professor value‐added in follow‐on courses).

We find that less experienced and less qualified professors produce students who perform significantly better in the contemporaneous course being taught, whereas more experienced and highly qualified professors produce students who perform better in the follow‐on related curriculum.

Not having closely followed the research on this, I’ll simply note some key comments from other blogs.

Direct examination:

Several have posted links that suggest an endorsement of this paper’s conclusion, such as George Mason University professor of economics Tyler Cowen, Harvard professor of economics Greg Mankiw, and Northwestern professor of managerial economics Sandeep Baliga. Michael Bishop, a contributor to Permutations (“official blog of the Mathematical Sociology Section of the American Sociological Association”), provides some more detail in his analysis:

In my post on Babcock’s and Marks’ research, I touched on the possible unintended consequences of student evaluations of professors.  This paper gives new reasons for concern (not to mention much additional evidence, e.g. that physical attractiveness strongly boosts student evaluations).

That said, the scary thing is that even with random assignment, rich data, and careful analysis there are multiple, quite different, explanations.

The obvious first possibility is that inexperienced professors, (perhaps under pressure to get good teaching evaluations) focus strictly on teaching students what they need to know for good grades.  More experienced professors teach a broader curriculum, the benefits of which you might take on faith but needn’t because their students do better in the follow-up course!

After citing this alternative explanation from the authors:

Students of low value added professors in the introductory course may increase effort in follow-on courses to help “erase” their lower than expected grade in the introductory course.

Bishop also notes that motivating students to invest more effort in future courses would be a desirable effect of good professors as well. (But how to distinguish between “good” and “bad” methods for producing this motivation isn’t obvious.)

Cross-examination:

Others critique the article and defend the usefulness of student evaluations with observations that provoke further fascinating discussions.

Andrew Gelman, Columbia professor of statistics and political science, expresses skepticism about the claims:

Carrell and West estimate that the effects of instructors on performance in the follow-on class is as large as the effects on the class they’re teaching. This seems hard to believe, and it seems central enough to their story that I don’t know what to think about everything else in the paper.

At Education Sector, Forrest Hinton expresses strong reservations about the conclusions and the methods:

If you’re like me, you are utterly perplexed by a system that would mostly determine the quality of a Calculus I instructor by students’ performance in a Calculus II or aeronautical engineering course taught by a different instructor, while discounting students’ mastery of Calculus I concepts.

The trouble with complex value-added models, like the one used in this report, is that the number of people who have the technical skills necessary to participate in the debate and critique process is very limited—mostly to academics themselves, who have their own special interests.

Jeff Ely, Northwestern professor of economics, objects to the authors’ interpretation of their results:

I don’t see any way the authors have ruled out the following equally plausible explanation for the statistical findings.  First, students are targeting a GPA.  If I am an outstanding teacher and they do unusually well in my class they don’t need to spend as much effort in their next class as those who had lousy teachers, did poorly this time around, and have some catching up to do next time.  Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.

In agreement, Ed Dolan, an economist who was also for ten years “a teacher and administrator in a graduate business program that did not have tenure,” comments on Jeff Ely’s blog:

I reject the hypothesis that students give high evaluations to instructors who dumb down their courses, teach to the test, grade high, and joke a lot in class. On the contrary, they resent such teachers because they are not getting their money’s worth. I observed a positive correlation between overall evaluation scores and a key evaluation-form item that indicated that the course required more work than average. Informal conversations with students known to be serious tended to confirm the formal evaluation scores.

Re-direct:

Dean Eckles, PhD candidate at Stanford’s CHIMe lab, offers this response to Andrew Gelman’s blog post (linked above):

Students like doing well on tests etc. This happens when the teacher is either easier (either through making evaluations easier or teaching more directly to the test) or more effective.

Conditioning on this outcome, is conditioning on a collider that introduces a negative dependence between teacher quality and other factors affecting student satisfaction (e.g., how easy they are).
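For readers unfamiliar with colliders, a small simulation may help. This is only a sketch under assumptions of my own: teacher quality and easiness are generated as independent, student performance rises with either one, and the variable names and the cutoff of 1.0 are hypothetical. It is not a reconstruction of Carrell and West’s model.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

rng = random.Random(0)
n = 20_000
# Hypothetical model: quality and easiness are independent by construction.
quality = [rng.gauss(0, 1) for _ in range(n)]
easiness = [rng.gauss(0, 1) for _ in range(n)]
# Student performance rises with either one, plus some noise.
performance = [q + e + rng.gauss(0, 0.5) for q, e in zip(quality, easiness)]

r_all = pearson(quality, easiness)

# Condition on the collider: keep only classes where students did well.
keep = [i for i in range(n) if performance[i] > 1.0]
r_selected = pearson([quality[i] for i in keep], [easiness[i] for i in keep])

print(f"all classes:                  r = {r_all:+.2f}")
print(f"high-performing classes only: r = {r_selected:+.2f}")
```

Across all simulated classes the correlation should sit near zero, while within the high-performing subset it turns clearly negative: among classes that did well, a less effective teacher must have been an easier one, and vice versa. That is the negative dependence Eckles describes.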

From Jeff Ely’s blog, a comment by Brian Moore raises this critical question:

“Second, students recognize when they are being taught by an outstanding teacher and they give him good evaluations.”

Do we know this for sure? Perhaps they know when they have an outstanding teacher, but by definition, those are relatively few.

Closing thoughts:

These discussions raise many key questions, namely:

  • how to measure good teaching;
  • tensions between short-term and long-term assessment and evaluation[1];
  • how well students’ grades measure learning, and how grades impact their perception of learning;
  • the relationship between learning, motivation, and affect (satisfaction);
  • but perhaps most deeply, the question of student metacognition.

The anecdotal comments others have provided about how students respond on evaluations are more fairly couched in the terms “some students.” Given the considerable variability among students, interpreting student evaluations needs to account for those individual differences in teasing out the actual teaching and learning that underlie self-reported perceptions. Buried within those evaluations may be a valuable signal masked by a lot of noise, or more problematically, multiple signals that cancel and drown each other out.

[1] For example, see this review of research demonstrating that training which produces better short-term performance can produce worse long-term learning:
Schmidt, R.A., & Bjork, R.A. (1992). New conceptualizations of practice: Common principles in three paradigms suggest new concepts for training. Psychological Science, 3, 207-217.

Retrieval is only part of the picture

The latest educational research to make the rounds has been reported variously as “Test-Taking Cements Knowledge Better Than Studying,” “Simple Recall Exercises Make Science Learning Easier,” “Practising Retrieval is Best Tool for Learning,” and “Learning Science: Actively Recalling Information from Memory Beats Elaborate Study Methods.” Before anyone gets carried away seeking to apply these findings to practice, let’s correct the headlines and clarify what the researchers actually studied.

First, the “test-taking” vs. “studying” dichotomy presented by the NYT is too broad. The winning condition was “retrieval practice”, described fairly as “actively recalling information from memory” or even “simple recall exercises.” The multiple-choice questions popular on so many standardized tests don’t qualify because they assess recognition of information, not recall. In this study, participants had to report as much information as they could remember from the text, a more generative task than picking the best among the possible answers presented to them.

Nor were the comparison conditions merely “studying.” While the worst-performing conditions asked students to read (and perhaps reread) the text, they were dropped from the second experiment, which contrasted retrieval practice against “elaborative concept-mapping.” Thus, the “elaborate” (better read as “elaborative”) study methods reported in the ScienceDaily headline are overly broad, since concept-mapping is only one of many kinds of elaborative study methods. That the researchers found no benefit for students who had previous concept-mapping experience may simply mean that it requires more than one or two exposures to be useful.

The premise underlying concept-mapping as a learning tool is that re-representing knowledge in another format helps students identify and understand relationships between the concepts. But producing a new representation on paper (or some other external medium) doesn’t require constructing a new internal mental representation. In focusing on producing a concept map, students may simply have copied the information from the text to their diagram without deeply processing what they were writing or drawing. By scoring the concept maps by completeness (number of ideas) rather than quality (appropriateness of node placement and links), this study did not fully safeguard against this.

To a certain extent that may be the exact point the researchers wanted to make: That concept-mapping can be executed in an “active” yet non-generative fashion. Even reviewing a concept map (as the participants were encouraged to do with any remaining time) can be done very superficially, simply checking to make sure that all the information is present, rather than reflecting on the relationships represented—similar to making a “cheat sheet” for a test and trusting that all the formulas and definitions are there, instead of evaluating the conditions and rationale for applying them.

One may construe this as an argument against concept-mapping as a study technique, if it is so difficult to utilize it effectively. But just because a given tool can be used poorly does not mean it should be avoided completely; that could be true of any teaching or learning approach. Nor does this necessarily constitute an argument against other elaborative study methods. Explaining a text or diagram, whether to oneself or to others, is another form of elaboration that has been well documented for its effectiveness in supporting learning[1]. This constitutes an interesting hybrid between elaboration and retrieval, insofar as explanation adds information beyond the source but may also demand partial recall of the contents of the source even when present. If the value of explanation is solely in the retrieval involved, then it should fare worse against pure retrieval and better against pure elaboration.

All of this raises the question, “Better for what?” The tests in this study primarily measured retrieval, with 84% of the points counting the presence of ideas and the rest (from only two questions) assessing inference. Yet even those inference questions depended partially on retrieval, making it ambiguous whether wrong answers reflected a failure to retrieve, comprehend, or apply knowledge. What this study showed most clearly was that retrieval practice is valuable for improving retrieval. Elaboration and other activities may still be valuable for promoting transfer and inference. There could also be an interaction whereby elaboration and retrieval mutually enhance each other, since remembering and conducting inferences is easier with robust knowledge structures. The lesson may not be that elaborative activities are a poor use of time, but that they need to incorporate retrieval practice to be most effective.

I don’t at all doubt the validity of the finding, or the importance of retrieval in promoting learning. I share the authors’ frustration with the often-empty trumpeting of “active learning,” which can assume ineffective and meaningless forms [2][3]. I also recognize the value of knowing certain information in order to utilize it efficiently and flexibly. My concerns are in interpreting and applying this finding sensibly to real-life teaching and learning.

  • Retrieval is only part of the picture. Educators need to assess and support multiple skills, including and beyond retrieval. There’s a great danger of forgetting other learning goals (such as understanding, applying, creating, evaluating, etc.) when pressured to document success in retrieval.
  • Is it retrieving knowledge or generating knowledge? I also wonder whether “retrieval” may be too narrow a label for the broader phenomenon of generating knowledge. This may be a specific instance of the well-documented generation effect [4], and it may not always be most beneficial to focus only on retrieving the particular facts. There could be a similar advantage to other generative tasks, such as inventing a new application of a given phenomenon, writing a story incorporating new vocabulary words, or creating a problem that could almost be solved by a particular strategy. None of these require retrieving the phenomenon, the definitions, or the solution method to be learned, but they all require elaborating upon the knowledge-to-be-learned by generating new information and deeper understanding of it. Knowledge is more than a list of disconnected facts [5]; it needs a structure to be meaningful [6]. Focusing too heavily on retrieving the list downplays the importance of developing the supporting structure.
  • Retrieval isn’t recognition, and not all retrieval is worthwhile. Most important, I’m especially concerned that the mainstream media’s reporting of this finding may make it too easily misinterpreted. It would be a shame if this were used to justify more multiple-choice testing, or if a well-meaning student thought that accurately reproducing a graph from a textbook by memory constituted better studying than explaining the relationships embedded within that graph.

For the sake of a healthy relationship between research and practice, I hope the general public and policymakers will take this finding in context and not champion it into the latest silver bullet that will save education. Careless conversion of research into practice undermines the scientific process, effective policymaking, and teachers’ professional judgment, all of which need to collaborate instead of collide.

J. D. Karpicke, J. R. Blunt. Retrieval Practice Produces More Learning than Elaborative Studying with Concept Mapping. Science, 2011; DOI: 10.1126/science.1199327


[1] Chi, M.T.H., de Leeuw, N., Chiu, M.H., & LaVancher, C. (1994). Eliciting self-explanations improves understanding. Cognitive Science, 18, 439-477.
[2] For example, see the “Teacher A” model described in:
Scardamalia, M., & Bereiter, C. (1991). Higher levels of agency for children in knowledge building: A challenge for the design of new knowledge media. Journal of the Learning Sciences, 1, 37-68.
(There’s also a “Johnny Appleseed” project description I once read that’s a bit of a caricature of poorly-designed project-based learning, but I can’t seem to find it now. If anyone knows of this example, please share it with me!)
[3] This is one reason why some educators now advocate “minds-on” rather than simply “hands-on” learning. Of course, what those minds are focused on still deserves better clarification.
[4] e.g., Slamecka, N.J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4, 592-604.
[5] In the following study, some gifted students outscored historians in their fact recall, but could not evaluate and interpret claims as effectively:
Wineburg, S.S. (1991). Historical problem solving: A study of the cognitive processes used in the evaluation of documentary and pictorial evidence. Journal of Educational Psychology, 83, 73-87.
[6] For a fuller description of the importance of structured knowledge representations, see:
Bransford, J.D., Brown, A.L., & Cocking, R.R. (2000). How people learn: Brain, mind, experience, and school (Expanded edition). Washington DC: National Academy Press, pp. 31-50 (Ch. 2: How Experts Differ from Novices). 

In-school vs. non-school factors

From How to fix our schools, in which Richard Rothstein, of the Economic Policy Institute, critiques Joel Klein’s and Michelle Rhee’s approach of focusing only on firing incompetent teachers as a means to improve schools:

“Differences in school quality can explain about 1/3 of the variation in student achievement. But the other 2/3 come from non-school factors.” In-school factors go beyond teacher quality: school leadership, curriculum quality, teacher collaboration. Non-school factors include economic consequences of parental underemployment, such as geographic disruption, malnutrition, stress, poor health.

But what do the data say?

“Perhaps this is the time for a counter-reformation” summarizes some choice tidbits on charter schools, test-based metrics & value-added modeling, and performance-based pay and firing, from a statistician’s perspective.

On charter schools:

The majority of the 5,000 or so charter schools nationwide appear to be no better, and in many cases worse, than local public schools when measured by achievement on standardized tests.

On value-added modeling:

A study [using VAM] found that students’ fifth grade teachers were good predictors of their fourth grade test scores… [which] can only mean that VAM results are based on factors other than teachers’ actual effectiveness.

On performance-based pay and firing:

There is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones.

[A study] conducted by the National Center on Performance Incentives at Vanderbilt… found no significant difference between the test results from classes led by teachers eligible for bonuses and those led by teachers who were ineligible.

In summary:

Just for the record, I believe that charter schools, increased use of metrics, merit pay and a streamlined process for dismissing bad teachers do have a place in education, but all of these things can [do] more harm than good if badly implemented and, given the current state of the reform movement, badly implemented is pretty much the upper bound.

I’m less pessimistic than Mark is about the quality of implementation of these initiatives, but I agree that how effectively well-intentioned reforms are implemented is always a crucial concern.

Instruction matters, and community matters

On “In Massachusetts, Brockton High Becomes Success Story”:

Engaging the community of teachers and students can be much more effective than stripping it down and weeding it out. Bottom line: “Achievement rose when leadership teams focused thoughtfully and relentlessly on improving the quality of instruction.”

Concerns about the LA Times teacher ratings

On “L.A. Times analysis rates teachers’ effectiveness”:

A Times analysis, using data largely ignored by LAUSD, looks at which educators help students learn, and which hold them back.

I’m a huge fan of organizing, analyzing, and sharing data, but I have real concerns about figuring out the best means for conveying and acting upon those results. Not just data quality (what gets assessed, how scores are calculated and weighed), but contextualizing results (triangulation with qualitative data) and professional development (social comparison, ongoing support).

This is what math class should be like
