Fairness in Machine Learning

ProPublica's 2016 piece, Machine Bias, started a national conversation on algorithmic accountability and fairness. The story examined the workings of Northpointe's COMPAS, one of the most widely used risk assessment tools in the country, and found that black defendants were incorrectly judged to be at a higher risk of recidivism more often than white defendants. The article generated a lot of interest in fairness, and subsequent research has described six types of fairness, some of which are incompatible with each other and with accuracy. This project attempts to simplify and visualize this research.

Introduction
In many industries, ranging from criminal justice to medicine, algorithms perform risk assessments that shape consequential decisions and actions. As the scope of such algorithms grows, journalists and academics have voiced concerns that these models may encode human biases, because they are built on data that reflect biased past decisions and outcomes.

Over the last couple of years, the research community has proposed formal, mathematical definitions of fairness to guide the conversation around, and the design of, equitable risk assessment tools. These definitions are intuitive, but each has its own limitations, and none is a perfect measure for detecting discriminatory algorithms. In fact, designing algorithms that satisfy these definitions of fairness can end up negatively impacting the well-being of minority (as well as majority) communities.

What is Fairness?
Fairness, as framed by Barocas and Hardt, is the absence of an unjustified basis for differentiation. Historically, a basis has been considered unjustified when it has no practical relevance to the task being performed.


Think gender, race, or sexual preference.


Other factors are considered morally irrelevant even when they are statistically relevant: as a conscious decision, we choose not to differentiate on their basis.


Think of physical disabilities, or age.


There are two doctrines of discrimination law:
  • Disparate Treatment: the idea that a decision process is illegal if it even considers certain protected factors, whether or not those factors actually change its output. Sometimes these factors are collected as a matter of formality, like asking for age and gender on a form. Disparate treatment also occurs when proxies are used for these factors, such as redlining districts where African Americans are in the majority.

  • Disparate Impact: this doctrine avoids any factors that could cause discrimination and instead focuses on facially neutral attributes. Disparity can still appear in the output, however, which raises questions such as: is the process justified, and is systemic bias even avoidable?

As you can see, fairness is intrinsically tied to discrimination, which is not a general concept: it depends on context and domain. It does, however, typically concern important decisions that affect people's life situations.
Definitions of Fairness
So then, how does one go about defining the concept of fairness? Extensive research has been done on the topic, and certain definitions have been widely accepted to quantify fairness and discrimination.

From Disparate Treatment comes Anti-Classification fairness, where algorithms do not consider protected characteristics like race and gender, or their proxies. Classification parity requires that certain measures of predictive performance be equal across all groups defined by these protected attributes. Calibration, a form of conditional independence, requires that outcomes be independent of the protected attributes after controlling for estimated risk.
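To make the calibration criterion concrete, here is a minimal sketch in Python; the risk scores, group labels, and outcomes below are made up purely for illustration and are not drawn from any real dataset.

```python
from collections import defaultdict

# Hypothetical (score_bin, group, outcome) records; values are illustrative only.
records = [
    (2, "A", 0), (2, "A", 0), (2, "B", 0), (2, "B", 0),
    (7, "A", 1), (7, "A", 0), (7, "B", 1), (7, "B", 0),
    (9, "A", 1), (9, "A", 1), (9, "B", 1), (9, "B", 1),
]

# Calibration: within each risk-score bin, the observed outcome rate should be
# roughly the same for every protected group.
tally = defaultdict(lambda: [0, 0])  # (score, group) -> [events, count]
for score, group, outcome in records:
    tally[(score, group)][0] += outcome
    tally[(score, group)][1] += 1

for (score, group), (events, count) in sorted(tally.items()):
    print(f"score {score}, group {group}: observed rate = {events / count:.2f}")
```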

Berk et al. define fairness through simple mathematical concepts. For this, let us consider a simple construct: we want to predict whether a person will like coffee with or without sugar, and we employ a tool that makes this prediction in a binary manner. Even if we were to design a continuous algorithm that calculated the probability of a person wanting coffee with sugar, we would have to set a threshold value: if the algorithm predicted a value above the threshold, we would say they prefer coffee with sugar, and vice versa.

Let us assume our tool considers the following attributes of the person: how often they buy ice cream, how many hours they sleep, whether they have eaten before the coffee and the brand of shoes they are wearing.
Let us say we start out with 1000 people, half of whom drink coffee with sugar and half without. We create a confusion matrix like the one to the right.

The columns represent the actual number of people in each category: positive for those who take coffee with sugar and negative for those who take it without.
The rows, on the other hand, represent the algorithm's predictions for each category, with positive and negative defined the same way.

We shorten the true positives, false positives, false negatives and true negatives to TP, FP, FN and TN respectively. The confusion matrix makes it very easy to record certain simple statistics:
  • Sample size: This is the total number of observations, denoted by N, and is the sum of all four cells in the confusion matrix. So, N = TP + FP + FN + TN

  • Base Rate: This is the proportion of actual successes, or actual failures, from the total sample. The choice between success and failure depends on the experimenter. This is defined as (TP + FN)/N, that is, (TP + FN)/(TP + FP + FN + TN).

  • Prediction Distribution: The proportion of predictions predicted to fail, and the proportion predicted to succeed. This translates to (FN + TN)/N and (TP + FP)/N respectively.

  • Overall Procedure Error: This is the proportion of cases that are misclassified by our algorithm, (FP + FN)/(TP + FP + FN + TN). The complement to this is overall procedure accuracy, defined as (TP + TN)/(TP + FP + FN + TN).

  • Conditional Procedure Error: This is the proportion of cases misclassified conditional on one of the two actual outcomes. This gives us the false positive rate, FP/(FP + TN) or the false negative rate, FN/(TP + FN).

  • Conditional Use Error: This is the proportion of cases misclassified conditional on one of the two predicted outcomes. The incorrect failure prediction proportion is FN/(TN + FN) and the incorrect success prediction proportion is FP/(TP + FP).
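
All of the quantities above follow directly from the four cells of the confusion matrix. Here is a minimal sketch in Python, using hypothetical counts consistent with the coffee example (1000 people, 500 of whom actually take sugar):

```python
# Hypothetical confusion-matrix counts for the coffee example
# (1000 people, 500 actual "with sugar" cases).
TP, FP, FN, TN = 300, 100, 200, 400

N = TP + FP + FN + TN                 # sample size
base_rate = (TP + FN) / N             # proportion of actual successes
pred_success = (TP + FP) / N          # prediction distribution: predicted successes
pred_failure = (FN + TN) / N          # prediction distribution: predicted failures
overall_error = (FP + FN) / N         # overall procedure error
overall_accuracy = (TP + TN) / N      # overall procedure accuracy
fpr = FP / (FP + TN)                  # conditional procedure error: false positive rate
fnr = FN / (TP + FN)                  # conditional procedure error: false negative rate
bad_success_preds = FP / (TP + FP)    # conditional use error: incorrect success predictions
bad_failure_preds = FN / (TN + FN)    # conditional use error: incorrect failure predictions

print(f"N = {N}, base rate = {base_rate:.2f}, accuracy = {overall_accuracy:.2f}")
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}, "
      f"incorrect success predictions = {bad_success_preds:.2f}")
```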

Fairness can be defined in six different ways, according to the "State of the Art" paper on fairness in criminal justice risk assessments. Each of these definitions can be expressed using features of the confusion matrix and the concepts defined above. The exact features we use, however, change depending on the type of fairness under consideration.

The confusion matrix, however, makes it easy to see that the different kinds of fairness are related not only to each other but also to accuracy. It also makes it easier to see why total fairness cannot be achieved simultaneously.
Coming back to our example of classifying people who like sugar in their coffee, we can see that the columns should each add up to 500, since our population is split evenly between people who like coffee with and without sugar. The sums of the rows, however, are not fixed by the population; they depend on how our classifier algorithm is defined.
Keep in mind, a confusion matrix can also be created for a subset of our sample. For example, we can have a confusion matrix for people who sleep less than five hours a day, and another for people who sleep more. We now define the different types of fairness, trying to balance fairness of our algorithm across both these (or any other) groups created on the basis of sleeping patterns.
1. Overall Accuracy Equality
When the overall procedure accuracy is the same for each group, overall accuracy equality is achieved. Overall procedure accuracy, as previously mentioned, is the proportion of predictions the algorithm gets right, defined as (TP + TN)/N.

This definition assumes that true negatives and true positives are equally desirable in our system. In real-world cases, however, true positives may be more desirable than true negatives; likewise, false negatives (actual positives predicted as negative) might be, say, twice as undesirable as false positives (actual negatives predicted as positive).

Overall accuracy equality, according to the State of the Art paper, is not commonly used because it does not distinguish accuracy on successes from accuracy on failures.
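
A minimal sketch of this check, using made-up per-group counts for the coffee example's sleep groups:

```python
# Hypothetical per-group confusion-matrix counts (TP, FP, FN, TN).
groups = {
    "sleeps_less_than_5h": dict(TP=150, FP=60, FN=100, TN=190),
    "sleeps_5h_or_more":   dict(TP=170, FP=70, FN=90,  TN=170),
}

for name, m in groups.items():
    n = sum(m.values())
    accuracy = (m["TP"] + m["TN"]) / n    # overall procedure accuracy
    print(f"{name}: overall procedure accuracy = {accuracy:.2f}")
# Overall accuracy equality holds when these values match across the groups.
```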

2. Statistical Parity
Statistical parity is achieved when the prediction distribution is the same across all protected groups. In our example, this implies that if sleep patterns were a protected class, the proportion of people predicted to have coffee with (or without) sugar should be equal for people who get less than five hours of sleep and for people who get more.

Statistical parity, also known as demographic parity, has been criticized as it can lead to highly undesirable outcomes. It is defined as (TP + FP)/N or (FN + TN)/N.

By simply sampling more (or fewer) people from these groups, we could modify the prediction distribution and achieve statistical parity by changing our population, rather than our algorithm, to equalize the proportion of predictions across the sleep-pattern groups.
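
A minimal sketch of the check, again with hypothetical counts:

```python
# Hypothetical per-group confusion-matrix counts (TP, FP, FN, TN).
groups = {
    "sleeps_less_than_5h": dict(TP=150, FP=100, FN=100, TN=150),
    "sleeps_5h_or_more":   dict(TP=200, FP=50,  FN=50,  TN=200),
}

for name, m in groups.items():
    n = sum(m.values())
    predicted_with_sugar = (m["TP"] + m["FP"]) / n
    predicted_without = (m["FN"] + m["TN"]) / n
    print(f"{name}: predicted with sugar = {predicted_with_sugar:.2f}, "
          f"without = {predicted_without:.2f}")
# Statistical (demographic) parity holds when the prediction distribution
# is the same across the protected groups.
```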
3. Conditional Procedure Accuracy Equality
As the name suggests, this type of fairness is achieved when the conditional procedure accuracy is the same across all protected groups. Simply put, the accuracy of predicting positives (and negatives) among all actual positives (and negatives) should be equal across groups. Thus, the values of TP/(TP + FN) and TN/(TN + FP) should be the same. This is equivalent to requiring that the false positive and false negative rates be the same across groups.

A closely related definition of conditional procedure accuracy equality is referred to as "equalized odds"; "equality of opportunity" is the same requirement applied to the more desirable outcome only.
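
A minimal sketch of the check, with hypothetical counts:

```python
# Hypothetical per-group confusion-matrix counts (TP, FP, FN, TN).
groups = {
    "sleeps_less_than_5h": dict(TP=120, FP=30, FN=80,  TN=270),
    "sleeps_5h_or_more":   dict(TP=180, FP=20, FN=120, TN=180),
}

for name, m in groups.items():
    tpr = m["TP"] / (m["TP"] + m["FN"])   # true positive rate (1 - FNR)
    tnr = m["TN"] / (m["TN"] + m["FP"])   # true negative rate (1 - FPR)
    print(f"{name}: TPR = {tpr:.2f}, TNR = {tnr:.2f}")
# Conditional procedure accuracy equality (essentially equalized odds) holds when
# both rates match across groups; equality of opportunity compares the TPR only.
```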
4. Conditional Use Accuracy Equality
This differs from the previous type of fairness in that it conditions on the prediction rather than the actual outcome: the proportion of correct predictions among each predicted outcome should be the same across groups. This fairness asks, conditional on a prediction of success (or failure), whether the probability of actual success (or failure) is the same across groups. Thus, the ratios TP/(TP + FP) and TN/(FN + TN) should be equal across all groups.

Both conditional procedure and conditional use accuracy equality are concerns in criminal justice risk assessments (more on this later). Chouldechova frames conditional use accuracy equality in terms of the positive predictive value (PPV), which corresponds to TP/(TP + FP).
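
A minimal sketch of the check, with hypothetical counts:

```python
# Hypothetical per-group confusion-matrix counts (TP, FP, FN, TN).
groups = {
    "sleeps_less_than_5h": dict(TP=160, FP=40, FN=90, TN=210),
    "sleeps_5h_or_more":   dict(TP=120, FP=30, FN=60, TN=140),
}

for name, m in groups.items():
    ppv = m["TP"] / (m["TP"] + m["FP"])   # positive predictive value
    npv = m["TN"] / (m["TN"] + m["FN"])   # negative predictive value
    print(f"{name}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
# Conditional use accuracy equality holds when PPV and NPV match across groups.
```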
5. Treatment Equality
Treatment equality is achieved when the ratio of false negatives to false positives (or vice versa) is the same across protected groups. This equality is considered a "lever with which to achieve other kinds of fairness." It is defined as FN/FP or FP/FN.

We could treat people who get more sleep differently from people who get less sleep in order to achieve conditional use accuracy equality, for example by weighing their false negatives more heavily. This would change the treatment ratio and would be an indicator of unfair treatment of the people who get more sleep.
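
A minimal sketch of the check, with hypothetical counts:

```python
# Hypothetical per-group confusion-matrix counts (TP, FP, FN, TN).
groups = {
    "sleeps_less_than_5h": dict(TP=150, FP=40, FN=80, TN=230),
    "sleeps_5h_or_more":   dict(TP=140, FP=30, FN=60, TN=270),
}

for name, m in groups.items():
    ratio = m["FN"] / m["FP"]             # false negatives per false positive
    print(f"{name}: FN/FP = {ratio:.2f}")
# Treatment equality holds when this ratio is the same for every group.
```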
6. Total Fairness
Total fairness is achieved when all of the above types of fairness (overall accuracy equality, statistical parity, conditional procedure accuracy equality, conditional use accuracy equality, and treatment equality) are achieved simultaneously.

Recent research shows that some of the above definitions are at odds with each other, making it impossible to achieve total fairness.
Again, it is important to remember that these definitions of fairness exist only between different groups (or classes) of the sample, and they extend to settings with more than two outcome categories.
Now that we have some idea of how the different kinds of fairness are defined, let's continue with our algorithm.
One of the attributes our algorithm uses in the prediction of sugar in coffee is the brand of shoes a person is wearing. It seems pretty pointless to consider shoe brand in coffee preferences, but please humor our whimsical algorithm. On the right are two confusion matrices, each one for a different shoe brand. Out of the 1000 people, let's assume that 500 people wear Nike shoes, while 200 wear Adidas.

We will now go through a scenario in which the confusion matrices are populated based on our algorithm's predictions and people's actual coffee preferences, and I hope it becomes easier to see how different kinds of fairness are often at odds with each other, and with accuracy.

Scenario 1
Let's consider that we want to maintain Conditional Procedure Accuracy (CPA) across both groups at 0.40.
The confusion matrices will be populated accordingly. CPA is defined as TP/(TP + FN), or the number of correct positive predictions over the total positive outcomes.

As you can see, this ratio is the same for both groups. Also, notice that the accuracy (TP + TN)/N is the same for both, at 0.44.
Let's also take a look at Conditional Use Accuracy (CUA) across both groups, which is defined as TP/(TP + FP), or the number of correct positive predictions over the total positive predictions.

For now, this ratio is also the same, at 0.8. There doesn't seem to be a problem.

Now, let's say we alter the way our algorithm predicts for Adidas wearers to increase accuracy, while keeping CPA constant. Our confusion matrix for Adidas changes accordingly. As you can see, CPA remains 0.40.

The accuracy for Adidas, (TP + TN)/N, is now increased to 0.50.

However, when we consider CUA now, we can see that they are no longer the same. Nike's CUA remains at 0.80, while Adidas' has dropped to 0.50.

It's easy to see here that if we want to increase accuracy while maintaining at least one form of fairness (CPA), we will invariably lose out on another form of fairness (CUA).
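
Since the confusion matrices themselves live in the interactive, here is a minimal sketch with hypothetical matrices chosen only to reproduce the ratios quoted above (CPA 0.40 for both groups, accuracy 0.44 vs 0.50, CUA 0.80 vs 0.50); they are not the matrices from the original figures.

```python
# Hypothetical confusion matrices consistent with the ratios quoted in the text,
# not the ones shown in the original interactive.
matrices = {
    "Nike":         dict(TP=160, FP=40, FN=240, TN=60),   # N = 500
    "Adidas (new)": dict(TP=40,  FP=40, FN=60,  TN=60),   # N = 200
}

for name, m in matrices.items():
    n = sum(m.values())
    cpa = m["TP"] / (m["TP"] + m["FN"])   # conditional procedure accuracy
    cua = m["TP"] / (m["TP"] + m["FP"])   # conditional use accuracy
    acc = (m["TP"] + m["TN"]) / n         # overall accuracy
    print(f"{name}: CPA = {cpa:.2f}, accuracy = {acc:.2f}, CUA = {cua:.2f}")
# CPA is 0.40 for both groups and Adidas' accuracy has risen to 0.50,
# but CUA has fallen from 0.80 (Nike) to 0.50 (Adidas): gaining accuracy
# while holding one form of fairness fixed breaks the other.
```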
Let us now consider a real-world case: The COMPAS algorithm that was scrutinized by ProPublica, and criticized for being unfair.

Though Northpointe, the developers of COMPAS, argued that they had achieved overall accuracy equality, ProPublica reported that conditional procedure accuracy equality was missing: false positive rates were disproportionately higher for African-American defendants than for Caucasian defendants (and false negative rates higher for Caucasian defendants), which violates that form of fairness.

Below are two interactive visualizations, built using the same data ProPublica used and a methodology very similar to theirs, to evaluate the COMPAS algorithm.

The first interactive is more mathematical: once you choose a category and a fairness definition, it creates confusion matrices that achieve that fairness across your chosen category. You can also see the overall accuracy, PPV, FPR and FNR for the algorithm across all the people, and compare the values of the other definitions of fairness.
You can also see the threshold created for the algorithm, which is the score judges would use to decide whether a prisoner can or cannot be released on bail.

In some cases, multiple definitions of fairness can be achieved, and the interactive shows this by highlighting multiple fairness definitions.

However, total fairness cannot be achieved in this case (and rarely in any other), and thus choosing the last fairness option does not yield any change.

The second interactive tries to emphasize the effect of the thresholds. Again, a choice of category and fairness definition displays results. The graphs under section 4 display the number of people who received a particular COMPAS score. So, 605 Caucasian males are rated at score 0 by COMPAS, while only 365 African-Americans receive the same score. On the other hand, only 50 Caucasian males get a score of 10, while 227 African-Americans are given the high-risk score of 10.

The threshold line indicates the number of people who will not be granted bail. Thus, the number of people with scores higher than the threshold are not granted bail, as shown by the bars to the right of the threshold lines.
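
As a rough sketch of how the threshold translates scores into bail decisions (the score counts below are hypothetical, not the COMPAS data):

```python
# Hypothetical number of people receiving each decile score (index 0..10).
people_at_score = [400, 320, 260, 210, 170, 140, 110, 90, 70, 55, 40]
threshold = 7  # people with scores strictly above the threshold are denied bail

denied_bail = sum(count for score, count in enumerate(people_at_score)
                  if score > threshold)
print(f"People denied bail at threshold {threshold}: {denied_bail}")
```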

As you can see, in some cases, African-Americans are affected in much larger numbers than Caucasians, even if their threshold is higher. On the other hand, men are affected a lot more than women, who in general are considered low-risk by the COMPAS algorithm.

Again, total fairness cannot be achieved in this case (and rarely in any other), and thus, choosing the last fairness option does not yield any change.


. . .

Thank you for going through this explainer on Fairness in Machine Learning!
For more details on my methodology, and research, head over here.