Unfair outcomes from fair tests

[Status: I’m sure this is well known, so I’d appreciate pointers to explanations by people who are less likely to make statistical or terminological errors. I sometimes worry I do too much background reading before thinking aloud about something, so I’m experimenting with switching it up. A quick search turns up a number of papers rediscovering something like this, like Fair prediction with disparate impact from this year. Summary: Say you use a fair test to predict a quality for which other non-tested factors matter, and then you make a decision based on this test. Then people who do worse on the test measure (but not necessarily the other factors) are subject to different error rates, even if you estimate their qualities just as well. If that’s already obvious, great; I additionally try to present the notion of fairness that lets one stop at “the test is fair; all is as it should be” as a somewhat arbitrary line to draw with respect to a broader class of notions of statistical fairness.]

What’s a fair test?a Well, it probably shouldn’t become easier or harder based on who’s taking it. More strongly, we might say it should have the same test validity for the same interpretation of the same test outcome, regardless of the test-taker. For example, using different flavors of validity:

  • Construct validity: To what extent does the test measure what it claims to be measuring? “Construct unfairness” would occur when construct validity varies between test-takers. If you’re measuring “agility” by watching an animal climb a tree, that could be valid for cats, but less so (hence unfair) for dogs.b
  • Predictive validity: To what extent is the measure related to the prediction or decision criterion the test is to be used for? Imagine a test that measures what it claims to measure and isn’t biased against anyone, but isn’t predictive for some subset of the population. Filtering everyone through this test could be considered unfair. If we consider the test as administered and not just in the abstract, we also run into predictive unfairness due to differential selection bias for test-takers from different groups.

As an example of predictive unfairness, say I’m hiring college students for a programming internship, and I use in-major GPA for a cutoff.c I can say it has construct fairness if I don’t pretend it’s a measure of anything more than performance in their major.d But that number is much more predictive of job performance for Computer Science majors than for Physics majors.

This is “unfair” in a couple ways. Many CS students will be more prepared for the job by virtue of experience, but will be outranked by non-coder Physics students with high GPA. At the same time, the best Physics majors for the job can be very good at programming, but that mostly won’t show up in their physics GPA.

Can we make the use of a GPA cutoff “fair”? Well, say the correlation between physics GPA and coding experience is small but nonzero. We can raise the cutoff for non-CS majors until the expectation for job performance at the two cutoffs are the same. From the employer’s point of view, that’s the smart thing to do, assuming threshold GPA-based hiring.e Then we have a new “test” that “measures” [GPA + CS_major_bonus] that has better predictive fairness with respect to job performance.f We’re still doing poorly by the secretly-coding physicists, but it’s hard to see how one could scoop up more of them without hiring even more false positive physicists.g
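As a sketch of that adjustment (with made-up correlations, not data from anywhere): if we standardize GPA and job performance within each major and treat them as bivariate normal, then the expected performance at a given GPA is just the within-major correlation times the standardized GPA, so matching the expectations at the two cutoffs is one line of arithmetic:

```python
# Sketch of the major-specific cutoff idea. All numbers are invented.
# Model assumption: within each major, standardize GPA and job performance;
# treat them as bivariate normal, so E[performance | GPA = g] = r * g,
# where r is that major's GPA-performance correlation.

R_CS = 0.7       # assumed correlation for CS majors
R_PHYS = 0.2     # assumed (much weaker) correlation for Physics majors
CS_CUTOFF = 1.0  # CS cutoff, in standard deviations above the CS mean

# Expected job performance exactly at the CS cutoff:
target = R_CS * CS_CUTOFF

# Raise the Physics cutoff until the expectation at the cutoff matches:
phys_cutoff = target / R_PHYS

print(f"E[performance] at CS cutoff: {target:.2f}")
print(f"Physics cutoff needed:       {phys_cutoff:.2f} sd")
```

With these invented numbers the Physics cutoff lands at 3.5 standard deviations, which already hints at how few “secretly-coding physicists” a weakly predictive measure can recover.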

Intuitively, “fairness” wants to minimize the probability that you will fail the test despite your actual performance—the thing the test wanted to predict—being good enough, or that I will succeed despite falling short of the target outcome, perhaps weighted by how far you surpass or I fall short of the requirements. In these terms, we also want to minimize the effects of chance by using all the best information available to us. Varying the cutoff by major seems to have done all that.

So is it a problem that the part of the test that’s directly under the students’ control—their GPA (for the sake of argument, their major is fixed)—is now easier or harder depending on who’s taking it? In this case it seems reasonable.

But there’s still at least one thing about fairness we didn’t capture: we may want the error probabilities not to depend on which groups we fall into. Our model of fairness doesn’t say anything about why we might or might not want that. Perhaps there’s still a way to do better in terms of equalizing the two kinds of error rates between the two populations. Hmm…

If we consider the decision process as a whole, we can imagine another kind of validity and corresponding fairness:

  • Consequential validity: To what extent are the consequences of various outcomes—for a binary test, true/false positives/negatives—in line with their various likelihoods, benefits, and risks to the test taker and/or other interested partiesh? Say this varies between subjects—for example, if, on an overcrowded bus, the youngest-looking had to give up their seats, even if they had, say, an invisible disability. This could be fair in thati apparent age is a fair measure of age, and age is equally correlated with the need to sit for all subgroups; but it’s not fair in expected utility.j Distinctly, the harm posed by misclassification as young is greater for some people than for the average person who just looks young. In general, we can have consequential unfairness without prior unfairness if either the odds of some error or the consequences of that error are different for different people.k
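The arithmetic behind that last sentence is just expected harm = (probability of the error) × (cost of the error to that person), so it can differ between people through either factor alone. A toy calculation with invented numbers:

```python
# Toy illustration of consequential unfairness (all numbers invented):
# expected harm from a test error is p(error) * cost(error to you),
# so it can vary between people via either factor on its own.

def expected_harm(p_error, cost):
    return p_error * cost

# Same error odds, different stakes (e.g. an invisible disability):
print(expected_harm(0.10, 1.0))   # typical young-looking rider
print(expected_harm(0.10, 20.0))  # rider for whom standing is a real harm

# Same stakes, different error odds:
print(expected_harm(0.02, 5.0))
print(expected_harm(0.30, 5.0))
```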

We now have a validity and an unfairness for each incremental chain of steps in the process: true quality to imperfect information (construct); quality to imperfect information to prediction (predictive); quality to information to prediction to decision/outcome (consequential). Validity is the extent to which those steps are “good” or “faithful”, and unfairness is the extent to which validity varies between test-takers. We might also take a step back to “quality fairness”—or should we say that if populations differ in that quality, that’s not our problem, as long as the test is fair in these other ways? Is that even possible?

We ran into trouble above because all we had available to us was a test that was already unfair, which didn’t give much information about the physics majors at all, and there was only so much we could do to patch that. So what if, instead, we give both the CS and physics majors a practical coding test? We can posit both construct fairness and predictive fairness: it measures the same skill equally well for each, and that skill predicts job performance equally well for each. And physics majors, predictably, do worse on average.

The construct is valid. The predictions are valid. And they’re equally valid for both groups, hence “fair”.l The cutoff is the same for everyone. And it leads to physicists being under-represented in this job.

And, no, I don’t mean the obvious “they’re worse coders so they just get the job that much less often”. They become under-represented relative to their true rate of being qualified.m

It doesn’t have anything to do with their group membership, of course—you could say the same of low-scoring CS majors. If, in the end, the factors that caused your success were hard to measure, then people probably had a hard time predicting your success. The Physics majors just happen to bear that as a group.

You can think of this in terms of why the tails come apart. We have two correlated quantities: test scores and job performance. The highest scorers will not be the highest performers. This effect becomes larger the closer you are to the tails of the distribution, and you’re selecting from further into the tail of the Physics distribution than of the CS one.

And then you can say: Performance is part “whatever the test measures”, and part “other stuff”. Physicists have less of what the test measures, and the same amount of the other stuff. So the qualified ones will tend to be those with lots of the other stuff, and they will be missed by the test. Because they’re farther into the tails, which are coming apart, this happens more than with the qualified CS majors.n
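That two-component story is easy to simulate. Here is a minimal sketch with invented parameters: performance is tested skill plus untested “other stuff”, the Physics group is lower only on the tested component, and we count how often genuinely qualified people from each group are missed by one shared test cutoff:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def miss_rate(tested_mean):
    # performance = tested skill + untested "other stuff"; only the tested
    # component differs between the groups (all numbers invented).
    tested = rng.normal(tested_mean, 1.0, N)
    other = rng.normal(0.0, 1.0, N)
    performance = tested + other
    qualified = performance > 1.5  # assumed bar for being genuinely qualified
    passed = tested > 1.0          # one test cutoff for everyone
    return np.mean(~passed[qualified])  # qualified people the test misses

cs = miss_rate(0.5)     # CS majors: more of the tested skill
phys = miss_rate(-0.5)  # Physics majors: less tested skill, same "other stuff"

print(f"qualified CS majors missed by the test:      {cs:.2f}")
print(f"qualified Physics majors missed by the test: {phys:.2f}")
```

The qualified Physics majors are missed at a higher rate, exactly because their qualification is carried disproportionately by the part the test can’t see.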

Here’s a Guesstimate model putting some numbers on this. Feel free to play with it.

With these numbers, you’d need to hire about 50% more Physics majors beyond those who made the cutoff to match the rates of qualified candidates in the populations. The true positive rate is about a factor of 2 lower for Physics majors.

Is this unfair? They’re all being hired exactly according to their expected performance; the test is just a fuzzy measure. But the rates of different kinds of errors are very different for the different groups. This will keep happening as long as one group does worse on the testo while factors not tested matter. You can’t have a decision based on point estimate parity for expected performance simultaneously with error rate parity; you can’t even get parity for the two false positive rates and the two false negative rates simultaneously.p
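A quick way to see the incompatibility numerically: in an invented two-component model (performance = tested skill + untested factors with mean zero), the test score is itself the best point estimate of performance, so hiring by point estimate means one shared test cutoff—and the groups’ error rates come apart anyway. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500_000

def error_rates(test_mean, bar=1.0):
    # performance = tested skill + untested factors (invented Gaussians);
    # the untested part has mean zero, so the test score is the best point
    # estimate of performance, and "same bar on expected performance"
    # means the same test cutoff for both groups.
    tested = rng.normal(test_mean, 1.0, N)
    performance = tested + rng.normal(0.0, 1.0, N)
    hired = tested > bar
    qualified = performance > bar
    fnr = np.mean(~hired[qualified])  # qualified but rejected
    fpr = np.mean(hired[~qualified])  # not qualified but hired
    return fnr, fpr

cs_fnr, cs_fpr = error_rates(test_mean=0.5)      # stronger on the tested skill
phys_fnr, phys_fpr = error_rates(test_mean=-0.5)

print(f"CS      FNR={cs_fnr:.2f}  FPR={cs_fpr:.2f}")
print(f"Physics FNR={phys_fnr:.2f}  FPR={phys_fpr:.2f}")
```

Both error rates differ between the groups even though everyone faces the identical decision rule on the identical best estimate.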

So what does this mean for consequential fairness? This always depends on how we assess the tradeoffs between false positives and false negatives for different groups, and the benefits or harms to the employer, employees, and society. Here is where you might decide you want error rate parity or even overcorrection. In the example, wrongly rejecting a low-scoring Physics major probably isn’t any worse than wrongly rejecting a CS major with the same score, but it’s easy to imagine cases where group membership does matter. Our sense of social equality of opportunity lies somewhat in true positive rates, if not total acceptance rates. Communities that suffer from chronic under-representation likely suffer more from its perpetuation, considering the ways that the position of a community in society feeds back on itself, than a majority population would suffer from a corresponding decrease in hiring rates. Or perhaps the harm of misclassification is greater for one group: some people will be OK no matter what, while others get only one chance. Additionally, all this provokes us to think about the relative costs of improving the inputs to our predictions, as well as the risks involved in drawing on an incomplete subset of relevant predictive factors.

You can also forget about social issues and think about how to interpret this in terms of how you act under uncertainty in your own life, and what the frequencies of various events you observe actually mean.

Fairness is a nice property for a test to have, but the statistical fairness of a test-based prediction or decision is broader than the objectivity of the test construct itself. Consequential fairness is broader still, and it’s most like the kind of fairness people care about in real life (while short of equality of outcome). You can let these considerations in not as contamination of a sacred mathematical property, but as an expansion of that property towards something better corresponding to your values.

I’m intentionally wrapping this up rather abruptly. I’m sure there’s more to be said on the topic, and I want to leave some room at this point.

  1. What’s a test? For now, it’s a piece of imperfect information about some real quality that you want to use to make a decision or prediction.  (back)
  2. Dogs do have less of things like joint and spine flexibility, which may matter equally well for tree-climbing and general agility, but they also lack cats’ claws, which mostly help with the trees.  (back)
  3. That is, the average grade a student has received in classes in her major. But this is meant to be a generic example; these issues are not far-fetched and can show up in different and even qualitative forms in school, work, court, politics, daily life, charity evaluation…  (back)
  4. There’s some gerrymandering here: is it a fair measure of academic performance, or an unfair measure of coding ability?  (back)
  5. More specifically, the employer cares about expected utility of letting the candidate past the GPA screen. If false negatives are really bad, and qualifications beyond the minimum don’t matter, then they’d equalize the false negative rate, requiring an even higher cutoff for Physics. If there’s a (costly) interview rather than just hiring, then physics majors may need a lower cutoff, if the CS GPA tells you all you need to know but the physics GPA tells you little. Improving your instrument is good, if you can afford it.  (back)
  6. Note that this isn’t a correction for CS majors having a higher mean performance, despite looking kind of like it—it’s just an ad-hoc adjustment for the fact that GPA is less correlated with coding skill for Physics majors, who could be just as good at programming, on average.  (back)
  7. I keep saying “physicists” instead of Physics majors because it’s shorter, but I mean, come on, they’re undergrads.  (back)
  8. Who else might be interested parties to the consequences? Well, whoever is making decisions based on test results, for sure. But it’s not necessarily outside the scope of test fairness to worry about the effect of the test-based decision rule on society in a broad sense.  (back)
  9. for the sake of the example  (back)
  10. This idea doesn’t rely on utilitarianism; you just need to agree that certain young people standing is a worse outcome than other young people standing.  (back)
  11. Perhaps the error rates should be included in predictive validity, or split off as part of a new “decision validity” subsequent to prediction, with consequences and utilities only coming in at the very end? I lean toward the latter, but it seems to be a less common usage.  (back)
  12. Well, unless as suggested in the previous note we define a stricter sense of predictive validity in terms not of point estimates but of true positive [correctly identified given positive] and true negative rates. In which case this doesn’t meet predictive fairness. Uh, spoilers.  (back)
  13. I absolutely don’t mean to use the words “true rate of being qualified” as though it’s something innate or unchangeable in that population; obviously, in this example it’s a matter of what classes they’ve taken, plus things like the odds that they have cultural programming exposure/experience outside of class.  (back)
  14. If the test actually returns “exact performance + measurement error”, and physicists do worse because their actual performance is worse and not because of a bias in measurement error, and the test result is used as our performance estimate (wrongly, on which more momentarily), then physicists become over-represented, since those who pass the test tend (more than do CS majors) to be those with the largest measurement error in their favor. I think a lot of our intuitions about how many tests work rely on this model. It is not how most tests actually work, although there is always some measurement error. More importantly, a comprehension check: how is that not more gerrymandering? Why don’t we just say GPA directly measures programming internship performance, just very noisily? Or that a noisy test actually exactly measures one of two factors in performance, where the second is the negation of the noise? The correlation between test and performance is the same in either case. But our best estimate of performance based on the test should be different depending on which model we’re using. That is, we aren’t using all our information here—the noisy direct test isn’t our best estimate of actual performance. That would be the test result regressed toward the mean. If we use that, then we’re back to relative under-representation. (Well—regression towards whose mean, and is that fair?)   (back)
  15. to repeat myself, this is even if they do so due to being actually worse on the tested quality where the quality matters exactly in the way we always thought, although in real life, this is often mixed with ordinary construct/predictive unfairness  (back)
  16. As an exercise, what about when we have continuous rather than binary decisions?  (back)
