Not all uses of data are equal

Gil Press worries that “big data enthusiasts may encourage (probably unintentionally) a new misguided belief, that ‘putting data in front of the teacher’ is in and of itself a solution [to what ails education today].”

As an advocate for the better use of educational data and learning analytics to serve teachers, I worry about careless endorsements and applications of “big data” that overlook these concerns:

1. Available data are not always the most important data.
2. Data should motivate providing support, not merely accountability.
3. Teachers are neither scientists nor laypeople in their use of data. They rely on data constantly, but need representations that they can interpret and turn into action readily.

Assessment specialists have long noted the many uses of assessment data; all educational data should be weighed as carefully, even more so when implemented at a large scale, which magnifies the influence of errors.

If assessments are diagnoses, what are the prescriptions?

I happen to like statistics. I appreciate qualitative observations, too; data of all sorts can be deeply illuminating. But I also believe that the most important part of interpreting them is understanding what they do and don’t measure. And in terms of policy, it’s important to consider what one will do with the data once collected, organized, analyzed, and interpreted. What do the data tell us that we didn’t know before? Now that we have this knowledge, how will we apply it to achieve the desired change?

In an eloquent, impassioned open letter to President Obama, Education Secretary Arne Duncan, Bill Gates and other billionaires pouring investments into business-driven education reforms (revised version at Washington Post), elementary teacher and literacy coach Peggy Robertson argues that all these standardized tests don’t give her more information than what she already knew from observing her students directly. She also argues that the money that would go toward administering all these tests would be better spent on basic resources such as stocking school libraries with books for the students and reducing poverty.

She doesn’t go so far as to question the current most-talked-about proposals for using those test data: performance-based pay, tenure, and firing decisions. But I will. I can think of a much more immediate and important use for the streams of data many are proposing on educational outcomes and processes: Use them to improve teachers’ professional development, not just to evaluate, reward and punish them.

Simply put, teachers deserve formative assessment too.

Statistical issues with applying VAM

There’s a wonderful statistical discussion of Michael Winerip’s NYT article critiquing the use of value-added modeling in evaluating teachers, which I referenced in a previous post. I wanted to highlight some of the key statistical errors in that discussion, since I think these are important and understandable concepts for the general public to consider.
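To make the discussion concrete, here is a minimal sketch of how a value-added estimate is typically computed: predict each student’s current score from last year’s score, then credit each teacher with her class’s average residual. Everything here is simulated and simplified (invented noise levels, one prior year, equal class sizes); it is not New York’s actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 20 teachers x 25 students (hypothetical numbers).
n_teachers, n_students = 20, 25
true_effect = rng.normal(0, 2, n_teachers)            # true teacher effects
prior = rng.normal(70, 10, (n_teachers, n_students))  # last year's scores
noise = rng.normal(0, 8, (n_teachers, n_students))    # everything else
current = 5 + 0.9 * prior + true_effect[:, None] + noise

# Step 1: fit the prediction model (current ~ prior) across all students.
slope, intercept = np.polyfit(prior.ravel(), current.ravel(), 1)

# Step 2: a teacher's "value added" is her class's average residual.
residuals = current - (intercept + slope * prior)
value_added = residuals.mean(axis=1)

print(np.corrcoef(value_added, true_effect)[0, 1])
```

Even in this idealized setup, the aggregate correlation between estimated and true effects can look respectable while any single teacher’s estimate still carries noise comparable to the true differences being measured, which is exactly the margin-of-error problem discussed below.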

  • Margin of error: Ms. Isaacson’s 7th percentile score actually ranged from 0 to 52, yet the state is disregarding that uncertainty in making its employment recommendations. This is why I dislike the article’s headline, or more generally the saying, “Numbers don’t lie.” No, they don’t lie, but they do approximate, and can thus mislead, if those approximations aren’t adequately conveyed and recognized.
  • Reversion to the mean: (You may be more familiar with this concept as “regression to the mean,” but since it applies more broadly than linear regression, “reversion” is a more suitable term.) A single measurement can be influenced by many randomly varying factors, so one extreme value could reflect an unusual cluster of chance events. Measuring it again is likely to yield a value closer to the mean, simply because those chance events are unlikely to coincide again to produce another extreme value. Ms. Isaacson’s students could have been lucky in their high scores the previous year, causing their scores in the subsequent year to look low compared to predictions.
  • Using only 4 discrete categories (or ranks) for grades:
    • The first problem with this is the imprecision that results. The model exaggerates the impact of between-grade transitions (e.g., improving from a 3 to a 4) but ignores within-grade changes (e.g., improving from a low 3 to a high 3).
    • The second problem is that this exacerbates the nonlinearity of the assessment (discussed next). When changes that produce grade transitions are more likely than changes that don’t produce grade transitions, having so few possible grade transitions further inflates their impact.
      Another instantiation of this problem is that the imprecision also exaggerates the ceiling effects mentioned below, in that benefits to students already earning the maximum score become invisible (as noted in a comment by journalist Steve Sailer):

      Maybe this high IQ 7th grade teacher is doing a lot of good for students who were already 4s, the maximum score. A lot of her students later qualify for admission to Stuyvesant, the most exclusive public high school in New York.
      But, if she is, the formula can’t measure it because 4 is the highest score you can get.

  • Nonlinearity: Not all grade transitions are equally likely, but the model treats them as such. Here are two major reasons why some transitions are more likely than others.
    • Measurement ceiling effects: Improving at the top range is more difficult and unlikely than improving in the middle range, as discussed in this comment:

      Going from 3.6 to 3.7 is much more difficult than going from 2.0 to 2.1, simply due to the upper-bound scoring of 4.

      However, the commenter then gives an example of a natural ceiling rather than a measurement ceiling. Natural ceilings (e.g., decreasing changes in weight loss, long jump, reaction time, etc. as the values become more extreme) do translate into nonlinearity, but due to physiological limitations rather than measurement ceilings. That said, the above quote still holds true because of the measurement ceiling, which masks the upper-bound variability among students who could have scored higher but inflates the relative lower-bound variability due to missing a question (whether from carelessness, a bad day, or bad luck in the question selection for the test). These students have more opportunities to be hurt by bad luck than helped by good luck because the test imposes a ceiling (doesn’t ask all the harder questions which they perhaps could have answered).

    • Unequal responses to feedback: The students and teachers all know that some grade transitions are more important than others. Just as students invest extra effort to turn an F into a D, so do teachers invest extra resources in moving students from below-basic to basic scores.
      More generally, a fundamental tenet of assessment is to inform the students in advance of the grading expectations. That means that there will always be nonlinearity, since now the students (and teachers) are “boundary-conscious” and behaving in ways to deliberately try to cross (or not cross) certain boundaries.
  • Definition of “value”: The value-added model described compares students’ current scores against predictions based on their prior-year scores. That implies that earning a 3 in 4th grade has no more value than earning a 3 in 3rd grade. As noted in this comment:

    There appears to be a failure to acknowledge that students must make academic progress just to maintain a high score from one year to the next, assuming all of the tests are grade level appropriate.

    Perhaps students can earn the same (high or moderate) score year after year on badly designed tests simply through good test-taking strategies, but presumably the tests being used in these models are believed to measure actual learning. A teacher who helps “proficient” students earn “proficient” scores the next year is still teaching them something worthwhile, even if there’s room for more improvement.
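The cost of collapsing continuous scores into four grades is easy to demonstrate. In this toy sketch (invented numbers), every student improves by exactly the same amount, but whether that gain registers depends entirely on where the student started relative to a grade boundary or the ceiling:

```python
import math

def to_grade(score):
    """Collapse a continuous score into grades 1-4, with a ceiling at 4."""
    return min(4, max(1, math.floor(score)))

# Every student improves by exactly 0.3, but from different starting points.
before = [2.8, 2.5, 3.9, 4.0]
after = [s + 0.3 for s in before]

for b, a in zip(before, after):
    gb, ga = to_grade(b), to_grade(a)
    print(f"{b:.1f} -> {a:.1f}: grade {gb} -> {ga}, measured gain = {ga - gb}")
```

The same 0.3-point improvement shows up as a full grade transition when it happens to cross a boundary, as nothing when it stays within a grade, and as nothing again when the ceiling at 4 hides it.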

These criticisms can be addressed by several recommendations:

  1. Margin of error. Don’t base high-stakes decisions on highly uncertain metrics.
  2. Reversion to the mean. Use multiple measures. These could be estimates across multiple years (as in multiyear smoothing, as another commenter suggested), or values from multiple different assessments.
  3. Few grading categories. At the very least, use more scoring categories. Better yet, use the raw scores.
  4. Ceiling effect. Use tests with a higher ceiling. This could be an interesting application for using a form of dynamic assessment for measuring learning potential, although that might be tricky from a psychometric or educational measurement perspective.
  5. Nonlinearity of feedback. Draw from a broader pool of assessments that measure learning in a variety of ways, to discourage “gaming the system” on just one test (being overly sensitive to one set of arbitrary scoring boundaries).
  6. Definition of “value.” Change the baseline expectation (either in the model itself or in the interpretation of its results) to reflect the reality that earning the same score on a harder test actually does demonstrate learning.
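Recommendation 2 has a simple statistical rationale: averaging k independent, equally noisy estimates shrinks the standard error by a factor of √k. A quick simulation (with an assumed single-year noise level) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(1)

true_score = 0.0   # a perfectly average teacher
noise_sd = 1.0     # single-year measurement noise (assumed)
n_sims = 10_000

for k in (1, 3, 5):
    # Each simulated career: the average of k noisy single-year estimates.
    estimates = rng.normal(true_score, noise_sd, (n_sims, k)).mean(axis=1)
    print(f"{k} year(s): std of estimate = {estimates.std():.2f} "
          f"(theory: {noise_sd / np.sqrt(k):.2f})")
```

Three years of data cut the uncertainty nearly in half, which is why multiyear smoothing is such a cheap improvement over single-year rankings.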

Those are just the statistical issues. Don’t forget all the other problems we’ve mentioned, especially: the flaws in applying aggregate inferences to the individual; the imperfect link between student performance and teacher effectiveness; the lack of usable information provided to teachers; and the importance of attracting, training, and retaining good teachers.

Some limitations of value-added modeling

Following this discussion on teacher evaluation led me to a fascinating analysis by Jim Manzi.

We’ve already discussed some concerns with using standardized test scores as the outcome measures in value-added modeling; Manzi points out other problems with the model and the inputs to the model.

  1. Teaching is complex.
  2. It’s difficult to make good predictions about achievement across different domains.
  3. It’s unrealistic to attribute success or failure only to a single teacher.
  4. The effects of teaching extend beyond one school year, and therefore measurements capture influences that go back beyond one year and one teacher.

I’m not particularly fond of the above list—while I agree with all the claims, they’re not explained very clearly and they don’t capture the key issues below, which he discusses in more depth.

  1. Inferences about the aggregate are not inferences about an individual. The model is valid at the aggregate level, “but any one data point cannot be validated.” This is a fundamental problem, true of stereotypes, of generalizations, and of averages. While they may enable you to make broad claims about a population of people, you can’t apply those claims to policies about a particular individual with enough confidence to justify high-stakes outcomes such as firing decisions. As Manzi summarizes it, an evaluation system works to help an organization achieve an outcome, not to be fair to the individuals within that organization.

     This is also related to problems with data mining—by throwing a bunch of data into a model and turning the crank, you can end up with all kinds of difficult-to-interpret correlations which are excellent predictors but which don’t make a whole lot of sense from a theoretical standpoint.

  2. Basing decisions on single instead of multiple measures is flawed. From a statistical modeling perspective, it’s easier to work with a single precise, quantitative measure than with multiple measures. But this inflates the influence of that one measure, which is often limited in time and scale. Figuring out how to combine multiple measures into a single metric requires subjective judgment (and thus organizational agreement), and, in Manzi’s words, “is very unlikely to work” with value-added modeling. (I do wish he’d expanded on this point further, though.)

  3. All assessments are proxies. If the proxy is given more value than the underlying phenomenon it’s supposed to measure, this can incentivize “teaching to the test”. With much at stake, some people will try to game the system. This may motivate those who construct and rely on the model to periodically change the metrics, but that introduces more instability in interpreting and calibrating the results across implementations.
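Manzi’s aggregate-versus-individual point can be made concrete with a toy regression. In the sketch below (simulated data, invented effect sizes), the population-level relationship is estimated very precisely, yet a prediction about any one individual remains dominated by noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# A real but modest aggregate relationship: outcome = 0.3 * predictor + noise.
n = 5_000
x = rng.normal(0, 1, n)
y = 0.3 * x + rng.normal(0, 1, n)

slope, intercept = np.polyfit(x, y, 1)
resid_sd = (y - (intercept + slope * x)).std()

# Aggregate: with n = 5000 the slope is pinned down very precisely...
print(f"estimated slope: {slope:.2f}")
# ...but an individual prediction is off by about one full sd either way,
# larger than the entire systematic effect for most individuals.
print(f"individual residual sd: {resid_sd:.2f}")
```

The slope is a solid basis for claims about the population; the residual spread is why it is a shaky basis for firing any particular person.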

In highlighting these weaknesses of value-added modeling, Manzi concludes by arguing that improving teacher evaluation requires a lot more careful interpretation of its results, within the context of better teacher management. I would very much welcome hearing more dialogue about what that management and leadership should look like, instead of so much hype about impressive but complex statistical tools expected to solve the whole problem on their own.

Retrieval is only part of the picture

The latest educational research to make the rounds has been reported variously as “Test-Taking Cements Knowledge Better Than Studying,” “Simple Recall Exercises Make Science Learning Easier,” “Practising Retrieval is Best Tool for Learning,” and “Learning Science: Actively Recalling Information from Memory Beats Elaborate Study Methods.” Before anyone gets carried away seeking to apply these findings to practice, let’s correct the headlines and clarify what the researchers actually studied.

First, the “test-taking” vs. “studying” dichotomy presented by the NYT is too broad. The winning condition was “retrieval practice”, described fairly as “actively recalling information from memory” or even “simple recall exercises.” The multiple-choice questions popular on so many standardized tests don’t qualify because they assess recognition of information, not recall. In this study, participants had to report as much information as they could remember from the text, a more generative task than picking the best among the possible answers presented to them.

Nor were the comparison conditions merely “studying.” While the worst-performing conditions asked students to read (and perhaps reread) the text, they were dropped from the second experiment, which contrasted retrieval practice against “elaborative concept-mapping.” Thus, the “elaborate” (better read as “elaborative”) study methods reported in the ScienceDaily headline are overly broad, since concept-mapping is only one of many kinds of elaborative study methods. That the researchers found no benefit for students who had previous concept-mapping experience may simply mean that it requires more than one or two exposures to be useful.

The premise underlying concept-mapping as a learning tool is that re-representing knowledge in another format helps students identify and understand relationships between the concepts. But producing a new representation on paper (or some other external medium) doesn’t require constructing a new internal mental representation. In focusing on producing a concept map, students may simply have copied the information from the text to their diagram without deeply processing what they were writing or drawing. Because the concept maps were scored for completeness (number of ideas) rather than quality (appropriateness of node placement and links), the study did not fully safeguard against this possibility.

To a certain extent that may be the exact point the researchers wanted to make: That concept-mapping can be executed in an “active” yet non-generative fashion. Even reviewing a concept map (as the participants were encouraged to do with any remaining time) can be done very superficially, simply checking to make sure that all the information is present, rather than reflecting on the relationships represented—similar to making a “cheat sheet” for a test and trusting that all the formulas and definitions are there, instead of evaluating the conditions and rationale for applying them.

One may construe this as an argument against concept-mapping as a study technique, if it is so difficult to utilize it effectively. But just because a given tool can be used poorly does not mean it should be avoided completely; that could be true of any teaching or learning approach. Nor does this necessarily constitute an argument against other elaborative study methods. Explaining a text or diagram, whether to oneself or to others, is another form of elaboration that has been well documented for its effectiveness in supporting learning[1]. This constitutes an interesting hybrid between elaboration and retrieval, insofar as explanation adds information beyond the source but may also demand partial recall of the source’s contents even when the source is present. If the value of explanation is solely in the retrieval involved, then it should fare worse against pure retrieval and better against pure elaboration.

All of this raises the question, “Better for what?” The tests in this study primarily measured retrieval, with 84% of the points awarded for the presence of ideas and the rest (from only two questions) assessing inference. Yet even those inference questions depended partially on retrieval, making it ambiguous whether wrong answers reflected a failure to retrieve, comprehend, or apply knowledge. What this study showed most clearly was that retrieval practice is valuable for improving retrieval. Elaboration and other activities may still be valuable for promoting transfer and inference. There could also be an interaction whereby elaboration and retrieval mutually enhance each other, since remembering and drawing inferences is easier with robust knowledge structures. The lesson may not be that elaborative activities are a poor use of time, but that they need to incorporate retrieval practice to be most effective.

I don’t at all doubt the validity of the finding, or the importance of retrieval in promoting learning. I share the authors’ frustration with the often-empty trumpeting of “active learning,” which can assume ineffective and meaningless forms [2][3]. I also recognize the value of knowing certain information in order to utilize it efficiently and flexibly. My concerns are in interpreting and applying this finding sensibly to real-life teaching and learning.

  • Retrieval is only part of the picture. Educators need to assess and support multiple skills, including and beyond retrieval. There’s a great danger of forgetting other learning goals (such as understanding, applying, creating, evaluating, etc.) when pressured to document success in retrieval.
  • Is it retrieving knowledge or generating knowledge? I also wonder whether “retrieval” may be too narrow a label for the broader phenomenon of generating knowledge. This may be a specific instance of the well-documented generation effect [4], and it may not always be most beneficial to focus only on retrieving the particular facts. There could be a similar advantage to other generative tasks, such as inventing a new application of a given phenomenon, writing a story incorporating new vocabulary words, or creating a problem that could almost be solved by a particular strategy. None of these require retrieving the phenomenon, the definitions, or the solution method to be learned, but they all require elaborating upon the knowledge-to-be-learned by generating new information and deeper understanding of it. Knowledge is more than a list of disconnected facts [5]; it needs a structure to be meaningful [6]. Focusing too heavily on retrieving the list downplays the importance of developing the supporting structure.
  • Retrieval isn’t recognition, and not all retrieval is worthwhile. Most important, I’m especially concerned that the mainstream media’s reporting of this finding may make it too easily misinterpreted. It would be a shame if this were used to justify more multiple-choice testing, or if a well-meaning student thought that accurately reproducing a graph from a textbook by memory constituted better studying than explaining the relationships embedded within that graph.

For the sake of a healthy relationship between research and practice, I hope the general public and policymakers will take this finding in context and not champion it into the latest silver bullet that will save education. Careless conversion of research into practice undermines the scientific process, effective policymaking, and teachers’ professional judgment, all of which need to collaborate instead of collide.

Karpicke, J.D., & Blunt, J.R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science. DOI: 10.1126/science.1199327

[1] Chi, M.T.H., de Leeuw, N., Chiu, M.H., & LaVancher, C. (1994). Eliciting self-explanations improves understanding. Cognitive Science, 18, 439-477.
[2] For example, see the “Teacher A” model described in:
Scardamalia, M., & Bereiter, C. (1991). Higher levels of agency for children in knowledge building: A challenge for the design of new knowledge media. Journal of the Learning Sciences, 1, 37-68.
(There’s also a “Johnny Appleseed” project description I once read that’s a bit of a caricature of poorly-designed project-based learning, but I can’t seem to find it now. If anyone knows of this example, please share it with me!)
[3] This is one reason why some educators now advocate “minds-on” rather than simply “hands-on” learning. Of course, what those minds are focused on still deserves better clarification.
[4] e.g., Slamecka, N.J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4, 592-604.
[5] In the following study, some gifted students outscored historians in their fact recall, but could not evaluate and interpret claims as effectively:
Wineburg, S.S. (1991). Historical problem solving: A study of the cognitive processes used in the evaluation of documentary and pictorial evidence. Journal of Educational Psychology, 83, 73-87.
[6] For a fuller description of the importance of structured knowledge representations, see:
Bransford, J.D., Brown, A.L., & Cocking, R.R. (2000). How people learn: Brain, mind, experience, and school (Expanded edition). Washington DC: National Academy Press, pp. 31-50 (Ch. 2: How Experts Differ from Novices). 

The value-added wave is a tsunami

Edweek ran an article earlier this week in which economist Douglas N. Harris attempts to encourage economists and educators to get along.

He unfortunately lost me in the 3rd paragraph.

Drawing on student-level achievement data across years, linked to individual teachers, statistical techniques can be used to estimate how much each teacher contributed to student scores—the value-added measure of teacher performance. These measures in turn can be given to teachers and school leaders to inform professional development and curriculum decisions, or to make arguably higher-stakes decisions about performance pay, tenure, and dismissal.

Emphasis mine.

Economists and their education reform allies frequently make this claim, but it is not true, at least not yet. Value-added measures are based on standardized-test scores, and neither currently provides information an educator can actually use to make professional development or curriculum decisions. When the scores are released, administrators and teachers receive a composite score and a handful of subscores for each student. In math, these subscores might be for topics like “Number and Operation Sense” and “Geometry”.

It does not do an educator any good to know last year’s students struggled with a topic as broad as “Number and Operation Sense”. Which numbers? Integers? Decimals? Did the students have problems with basic place value? Which operations? The non-commutative ones? Or did they have specific problems with regrouping and carrying? In what way are the students struggling? What errors are they making? What misconceptions might these errors point to? None of this information is contained in a score report. So, as an educator faced with test scores low in “Number and Operation Sense” (and which might be low in other areas as well), where do you start? Do you throw out the entire curriculum? If not, how do you know which parts of it need to be re-examined?

People trained in education recognize a difference between formative assessment—information collected for the purpose of improving instruction and student learning, and summative assessment—information collected to determine whether a student or other entity has reached a desired endpoint. Standardized tests are summative assessments—bad scores on them are like knowing that your football team keeps losing its games. This information is not sufficient for helping the team improve.

Why do economists see the issue so differently?

An economist myself, let me try to explain. Economists tend to think like well-meaning business people. They focus more on bottom-line results than processes and pedagogy, care more about preparing students for the workplace than the ballot box or art museum, and worry more about U.S. economic competitiveness. Economists also focus on the role financial incentives play in organizations, more so than the other myriad factors affecting human behavior. From this perspective, if we can get rid of ineffective teachers and provide financial incentives for the remainder to improve, then students will have higher test scores, yielding more productive workers and a more competitive U.S. economy.

This logic makes educators and education scholars cringe: Do economists not see that drill-and-kill has replaced rich, inquiry-based learning? Do they really think test preparation is the solution to the nation’s economic prosperity? Economists do partly recognize these concerns, as the quotations from the recent reports suggest. But they also see the motivation and goals of human behavior somewhat differently from the way most educators do.

This false dichotomy makes me cringe. As a trained education research scientist who is no stranger to statistical models, I maintain that value-added is not ready for prime time because its primary input—standardized test scores—is deeply flawed. In science and statistics, if you put garbage data into your model, you will get garbage conclusions out. It has nothing to do with valuing art over economic competitiveness, and everything to do with the integrity of the science.

The divide between economists and others might be more productive if any of the reports provided specific recommendations. For example, creating better student assessments and combining value-added with classroom assessments are musts.

Thank you. Here is where I start agreeing—if only that had been the central point of the article. I don’t dismiss value-added modeling as a technique, but I do not believe we have anything resembling good measures of teaching and learning.

We also have to avoid letting the tail wag the dog: Some states and districts are trying to expand testing to nontested grades and subjects, and to change test instruments so the scores more clearly reflect student growth for value-added calculations. This thinking is exactly backwards.

I agree completely, but that won’t stop states and districts from desperately trying to game the system. Since economists focus so much on financial incentives, this should be easy for them to understand: when the penalty for having low standardized test scores (or low value-added scores) is losing your funding, you will do whatever will get those scores up fastest. In most cases, that is changing the rules by which the scores are computed. Welcome to Campbell’s law.

Drawing inferences from data is limited by what the data measure

In “Why Genomics Falls Short as a Medical Tool,” Matt Ridley points out how tracking genetic associations hasn’t yielded as much explanatory power as hoped to inform medical applications:

It’s a curious fact that genomics has always been sold as a medical story, yet it keeps underdelivering useful medical knowledge and overdelivering other stuff. … True, for many rare inherited diseases, genomics is making a big difference. But not for most of the common ailments we all get. Nor has it explained the diversity of the human condition in things like height, intelligence and extraversion.

He notes that even something as straightforward and heritable as height has been difficult to predict from the genes identified:

Your height, for example, is determined something like 90% by the tallness of your parents—so long as you and they were decently well fed as children. … In the case of height, more than 50 genetic variants were identified, but together they could account for only 5% of the heritability. Where was the other 95%?

Some may argue that it’s a case of needing to search more thoroughly for all the relevant genes:

A recent study of height has managed to push the explained heritability up to about half, by using a much bigger sample. But still only half.

Or, perhaps there are so many genetic pathways that affect height that it would be difficult to identify and generalize from them all:

Others… think that heritability is hiding in rare genetic variants, not common ones—in “private mutations,” genetic peculiarities that are shared by just a few people each. Under this theory, as Tolstoy might have put it, every tall person would be tall in a different way.

Ridley closes by emphasizing that genes influence outcomes through complex interactions and network effects.

If we expect education research and application to emulate medical research and application, then we need to recognize and beware of its limitations as well. Educational outcomes are even more multiply determined than height, personality, and intelligence. If we seek to understand and control subtle environmental influences, we need to do much more than simply measure achievement on standardized tests and manipulate teacher incentives.
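Ridley’s missing-heritability puzzle is, statistically, a question of variance explained. The sketch below is entirely simulated (invented effect sizes, drastically simplified genetics), but it shows how a trait that is roughly 90% heritable in aggregate can still leave a 50-variant study explaining only a small fraction of the variance, because the effect is spread thinly across hundreds of variants:

```python
import numpy as np

rng = np.random.default_rng(3)

n_people, n_variants = 5_000, 500
effects = rng.normal(0, 1, n_variants) / np.sqrt(n_variants)  # tiny effects
genotypes = rng.binomial(2, 0.5, (n_people, n_variants)).astype(float)
genetic = genotypes @ effects
trait = genetic + rng.normal(0, genetic.std() / 3, n_people)  # ~90% heritable

def r_squared(predictors, outcome):
    """Variance in the outcome explained by a linear fit to the predictors."""
    design = np.column_stack([np.ones(len(outcome)), predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    fitted = design @ coef
    return 1 - ((outcome - fitted) ** 2).mean() / outcome.var()

# Using only the 50 "discovered" variants explains a small slice...
print(f"R^2 with 50 variants:  {r_squared(genotypes[:, :50], trait):.2f}")
# ...even though all 500 together explain almost everything.
print(f"R^2 with all variants: {r_squared(genotypes, trait):.2f}")
```

With all 500 variants included, the fit recovers nearly all of the heritable variance; with only the first 50, most of it stays hidden.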

But what do the data say?

“Perhaps this is the time for a counter-reformation” summarizes some choice tidbits on charter schools, test-based metrics & value-added modeling, and performance-based pay and firing, from a statistician’s perspective.

On charter schools:

The majority of the 5,000 or so charter schools nationwide appear to be no better, and in many cases worse, than local public schools when measured by achievement on standardized tests.

On value-added modeling:

A study [using VAM] found that students’ fifth grade teachers were good predictors of their fourth grade test scores… [which] can only mean that VAM results are based on factors other than teachers’ actual effectiveness.

On performance-based pay and firing:

There is not strong evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones.

[A study] conducted by the National Center on Performance Incentives at Vanderbilt… found no significant difference between the test results from classes led by teachers eligible for bonuses and those led by teachers who were ineligible.

In summary:

Just for the record, I believe that charter schools, increased use of metrics, merit pay and a streamlined process for dismissing bad teachers do have a place in education, but all of these things can do more harm than good if badly implemented and, given the current state of the reform movement, badly implemented is pretty much the upper bound.

I’m less pessimistic than Mark is about the quality of implementation of these initiatives, but I agree that how effectively well-intentioned reforms are implemented is always a crucial concern.

Evidence in educational research

On “Classroom Research and Cargo Cults“:

After many years of educational research, it is disconcerting that we have little dependable research guidance for school policy. We have useful statistics in the form of test scores…. But we do not have causal analyses of these data that could reliably lead to significant improvement.

This offers powerful reading for anyone with an interest in education. Hirsch starts off a bit controversially, but he moves toward principles on which we can all converge: evidence matters, AND theoretical description of causal mechanism matters.

The challenge of completing the analogy between educational research and medical research (i.e., finding the education-research analogue to the germ theory of disease) is in developing precise assessment of knowledge. The prior knowledge that is so important in influencing how people learn does not map directly onto a particular location or even pattern of connectivity in the brain. There is no neural “germ” or “molecule” that represents some element of knowledge.

Other tidbits:

  1. Intention to learn may sometimes be a condition for learning, but it is not a necessary or sufficient condition.
  2. Neisser’s law:

    You can get a good deal from rehearsal
    If it just has the proper dispersal.
    You would just be an ass
    To do it en masse:
    Your remembering would turn out much worsal.

  3. I wouldn’t characterize the chick-sexing experiments as the triumph of explicit over implicit learning, but rather, that of carefully structured over wholly naturalistic environments. One can implicitly learn quite effectively from the presentation of examples across boundaries, from prototypes and attractors, and from extremes.

Concerns about the LA Times teacher ratings

On “L.A. Times analysis rates teachers’ effectiveness“:

A Times analysis, using data largely ignored by LAUSD, looks at which educators help students learn, and which hold them back.

I’m a huge fan of organizing, analyzing, and sharing data, but I have real concerns about figuring out the best means for conveying and acting upon those results. Not just data quality (what gets assessed, how scores are calculated and weighed), but contextualizing results (triangulation with qualitative data) and professional development (social comparison, ongoing support).
