ECE324 Mini Project 1

Observational Fairness and the COMPAS dataset

About the Mini-Project

This mini-project can be completed individually or in teams of two. We want all students in the class to get hands-on experience early in the semester with PyTorch and with writing code for dealing with messy data.

You must use PyTorch when developing the code for this mini-project. You may use scikit-learn to compute quantities such as false-positive rates, though that is neither necessary nor encouraged.

In this mini-project, you are reproducing results from two scientific papers. There will be multiple available reasonable choices for you to make as you interpret the papers – that is normal.

Plagiarism warning

Your submissions will be checked for plagiarism. You may discuss general issues with your classmates, but you may not share your code or look at other people’s code. Please avoid uploading your code publicly to GitHub (and similar services). You may use small snippets of code supplied by Stack Overflow or GitHub Copilot; anything longer than two lines requires a citation to your source. Obviously, if you copy the entire project from someone else and then credit them, that is not plagiarism, but it would not earn any credit either.

What to submit

You will submit a Jupyter notebook, with all your code. The results you report need to be reproducible – every figure/number you report should be computed by the code in the Jupyter notebook you submit.

The Dataset

You will be working with the COMPAS dataset, available at https://github.com/propublica/compas-analysis/blob/master/compas-scores-two-years.csv

You will be reproducing (parts of) two analyses:

Julia Dressel and Hany Farid. “The accuracy, fairness, and limits of predicting recidivism.” Science Advances 4.1 (2018) https://www.science.org/doi/10.1126/sciadv.aao5580

Tameem Adel, Isabel Valera, Zoubin Ghahramani, and Adrian Weller. “One-network adversarial fairness.” In Proceedings of the AAAI Conference on Artificial Intelligence (2019) https://ojs.aaai.org/index.php/AAAI/article/download/4085/3963

or

Christina Wadsworth, Francesca Vera, and Chris Piech. “Achieving fairness through adversarial learning: an application to recidivism prediction.” In Proc. Workshop Fairness, Accountability, Transparency Mach. Learn., (2018). http://web.stanford.edu/~cpiech/bio/papers/fairnessAdversary.pdf.

Part 1

Build a logistic regression model to predict two-year recidivism. Show that your model approximately satisfies calibration, but fails to satisfy false-positive parity. Relate this to the difference in base rates in the dataset.
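
As a starting point, here is a minimal sketch of the kind of model and per-group false-positive-rate computation this part asks for. The tensor and column names (X_train, y_train, X_test, y_test, race_test, num_features) are placeholders for whatever you construct in your notebook, and the details (optimizer, learning rate, threshold) are illustrative rather than prescribed.

    import torch
    import torch.nn as nn

    # Logistic regression = a single linear layer trained with the logistic (BCE) loss.
    model = nn.Linear(num_features, 1)
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(500):
        optimizer.zero_grad()
        logits = model(X_train).squeeze(1)   # X_train: float tensor of shape [n, num_features]
        loss = loss_fn(logits, y_train)      # y_train: float tensor of 0/1 labels
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        scores = torch.sigmoid(model(X_test).squeeze(1))
        preds = (scores > 0.5).float()

    def false_positive_rate(preds, labels):
        # FPR = FP / (FP + TN): the fraction of true negatives predicted positive.
        negatives = labels == 0
        return preds[negatives].mean().item()

    for group in ("Caucasian", "African-American"):
        mask = torch.tensor((race_test == group).to_numpy())  # race_test: pandas Series of race labels
        print(group, false_positive_rate(preds[mask], y_test[mask]))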

NEW: You should not expect calibration to be satisfied exactly. Recall that the variance of the sum of \(n\) independent Bernoulli(\(p\)) variables is \(\sigma^2 = n\,p\,(1-p)\). Recall from physics that two measured quantities cannot be said to be different if their difference is less than about \(2\sqrt{2}\sigma\). Derive what you need here, or use this: https://www.khanacademy.org/math/ap-statistics/sampling-distribution-ap/xfb5d8e68:sampling-distribution-diff-proportions/a/diff-proportions-probability-examples
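
As a concrete illustration of the difference-of-proportions check behind the link above (a sketch; \(\hat p_A\) and \(\hat p_B\) denote the empirical recidivism rates of the two groups within a given score bin, based on \(n_A\) and \(n_B\) defendants respectively):

\[
\mathrm{SE}(\hat p_A - \hat p_B) = \sqrt{\frac{\hat p_A (1-\hat p_A)}{n_A} + \frac{\hat p_B (1-\hat p_B)}{n_B}},
\]

so an observed gap \(|\hat p_A - \hat p_B|\) smaller than roughly two standard errors should not be reported as a genuine calibration failure.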

Following an idea from the paper of Corbett-Davies and Goel https://arxiv.org/abs/1808.00023, show that adjusting the thresholds for the different demographics can lead to an algorithm that does not satisfy calibration, but does satisfy false-positive parity. That is, for the classifier you trained, find two thresholds that could be applied to the scores for White and Black defendants to obtain false-positive parity while not sacrificing accuracy too much.

NEW: Describe the process you used.
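
One possible process (a sketch only, reusing the scores, labels, and group masks from the earlier sketch after converting them to NumPy arrays; the grid resolution and the 0.02 tolerance on the FPR gap are arbitrary choices, not requirements) is a grid search over pairs of thresholds:

    import numpy as np

    # scores, y_test, white_mask, black_mask: NumPy arrays (convert torch tensors with .numpy())

    def fpr_at(threshold, scores, labels):
        preds = scores >= threshold
        return preds[labels == 0].mean()

    def accuracy_at(t_white, t_black, scores, labels, white_mask, black_mask):
        preds = np.where(white_mask, scores >= t_white, scores >= t_black)
        return (preds == labels).mean()

    best = None
    for t_w in np.linspace(0.05, 0.95, 91):
        for t_b in np.linspace(0.05, 0.95, 91):
            gap = abs(fpr_at(t_w, scores[white_mask], y_test[white_mask])
                      - fpr_at(t_b, scores[black_mask], y_test[black_mask]))
            acc = accuracy_at(t_w, t_b, scores, y_test, white_mask, black_mask)
            # Keep the most accurate pair among those that (approximately) equalize FPRs.
            if gap < 0.02 and (best is None or acc > best[0]):
                best = (acc, t_w, t_b)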

Grading scheme

Trains a logistic regression model with the required properties (20%)

Computes what’s required to check FPR parity and calibration, and discusses the results correctly (10%)

Part 2

NEW: In this part, you will use insights from https://github.com/google-research/tuning_playbook in order to train a difficult-to-train system.

Using the Deep Learning Training Playbook

Read the Deep Learning Training Playbook. Adopt at least three suggestions from it, and identify each strategy from the Playbook when you use it. For two strategies that you chose not to use, explain why it did not make sense to use them. Please include the write-up on the strategies used here, but you may also refer the reader to later parts of your write-up.

You may choose to work on either Option 1 or Option 2. I recommend Option 1.

Option 2 (deprecated)

Wadsworth et al. claim that it is possible to produce a more accurate classifier that satisfies properties such as false-positive parity, by using more features, and by using an adversarial learning procedure.

Write PyTorch code to implement the adversarial learning procedure. Note that the adversarial learning procedure would need to be iterative: the adversary is optimized, and then the whole network N is optimized, in a loop. Specifically, you want to alternately minimize \(L_d\) and \(L_y - \alpha L_d\), making sure that you don’t reach convergence during either stage, since that would make further training impossible.

That is because it is very easy to make the adversary produce very bad predictions if all we ever do is try to degrade its performance.
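
A minimal sketch of one way to structure this alternation is below. The architecture, the value of \(\alpha\), the number of inner adversary steps, and the assumption that the adversary sees only the predictor’s output are illustrative choices, not the papers’ exact specifications; race_train is assumed to be a 0/1 encoding of the protected attribute.

    import torch
    import torch.nn as nn

    predictor = nn.Sequential(nn.Linear(num_features, 32), nn.ReLU(), nn.Linear(32, 1))
    adversary = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

    opt_pred = torch.optim.Adam(predictor.parameters(), lr=1e-3)
    opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()
    alpha = 1.0  # trade-off weight; needs tuning

    for outer_step in range(2000):
        # Stage 1: a few steps minimizing L_d, i.e. training the adversary to predict
        # the protected attribute from the predictor's output, with the predictor frozen.
        for _ in range(5):
            opt_adv.zero_grad()
            y_logit = predictor(X_train).detach()  # detach: no gradient flows into the predictor
            L_d = bce(adversary(y_logit).squeeze(1), race_train)
            L_d.backward()
            opt_adv.step()

        # Stage 2: one step minimizing L_y - alpha * L_d for the predictor,
        # with the adversary held fixed (only opt_pred.step() is called).
        opt_pred.zero_grad()
        y_logit = predictor(X_train)
        L_y = bce(y_logit.squeeze(1), y_train)
        L_d = bce(adversary(y_logit).squeeze(1), race_train)
        (L_y - alpha * L_d).backward()
        opt_pred.step()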

Improving the performance of the system

Report on your experiments trying to improve the performance of the initial system.

Do you observe that the false-positive disparity becomes small when using Wadsworth et al.’s method? What did you try in order to observe this result?

What data to use

You should use at least 7 features from COMPAS, including the most predictive ones: sex, age, and number of priors. For categorical features, you want to use one-hot encoding.
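
For example, feature preparation might look like the sketch below (the column names are my guesses at relevant columns; check them against the CSV and extend the lists to reach at least 7 features):

    import pandas as pd
    import torch

    df = pd.read_csv("compas-scores-two-years.csv")
    categorical = ["sex", "c_charge_degree"]   # one-hot encoded below
    numeric = ["age", "priors_count", "juv_fel_count", "juv_misd_count", "juv_other_count"]

    features = pd.get_dummies(df[categorical + numeric], columns=categorical)
    X = torch.tensor(features.to_numpy(dtype="float32"))
    y = torch.tensor(df["is_recid"].to_numpy(dtype="float32"))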

What data not to use

decile_score, v_decile_score, etc. are outputs of COMPAS, so do not use them as inputs.

We are predicting is_recid. Variables like violent_recid, is_violent_recid, and two_year_recid are related to is_recid and are about the future, so you should not use them.

Reporting the results

Report your results. Which features did you use, and how did you use them? How did you process them? Do your experiments confirm that using the network from Option 1 or Option 2 with more features produces a more accurate classifier?

Part 3

Ethical Reflection

NEW: In this section, I ask you to present good arguments for and against different propositions. It is expected that you make good arguments, either because you agree with the argument, or because you tried hard to understand what people who disagree with you say, or because of some combination of those. You do not have to disclose your personal opinions; we will grade you on the quality of the argument, not on your opinions.

A paragraph is usually enough for a good answer.

  1. In the context of recidivism prediction for COMPAS, present an argument in favour of aiming for demographic parity.

  2. In the context of recidivism prediction for COMPAS, present an argument in favour of aiming for false-positive parity.

  3. In the context of recidivism prediction for COMPAS, present an argument in favour of aiming for calibration.

  4. In the context of recidivism prediction for COMPAS, consider Sharad Goel’s concepts of aggregate social welfare, and of aggregate social welfare as computed for each demographic separately. Make one argument in favour of using those concepts. Make two arguments against the view that those concepts should guide the design of systems; at least one of those arguments should be specific to aggregate social welfare as computed for each demographic separately.

  5. There are many contexts in which prediction systems are currently being used. Examples include credit scores; university admissions; insurance rate offers; early-warning systems for deteriorating patients in hospital. Pick one field (not necessarily one of the ones I listed) where you think the ethical considerations are different from the ethical considerations for COMPAS. Argue that the ethical considerations for the two cases are different, and connect your argument to the observational and causal measures of fairness that we discussed in class.

Grading scheme

A good report on what was done (which features were used and how, how the training was done, etc.) (20%)

A good attempt to reproduce one of the papers (30%), with progress toward satisfying the observational fairness measures (20%)

A report on using the Deep Learning Training Playbook (10%)

Good answers to the ethical reflection questions (10%)

The quality of writing, clarity, presentation quality, and reproducibility of the report (10%)