SML201 Project 1 – Predictive Modelling and Fairness

General guidelines

You will submit an Rmd file and the pdf file that was knit from it. Your code should be general enough that if the dataset were changed, the code would still run.

You will be graded on the correctness of your code as well as on the professionalism of the presentation of your report. The text of your report should be clear; the figures should do a good job of visualizing the data, and should have appropriate labels and legends. Overall, the document you produce should be easy to read and understand.

You should use ggplot for plotting and dplyr (where needed) for wrangling the data.

When you are asked to compute something, use code (rather than compute things outside of R and then input the answer), and include the code that you used in the report.

Auditing the COMPAS score

ProPublica obtained the public record on over 10,000 criminal defendants in Broward County, Florida. They also computed a variable that indicates whether each person was arrested within two years of being assessed. The data is available here. Download and read in the data.

The COMPAS scores you will be analyzing are the “decile scores” in the data frame (the column decile_score).

Part 1: Comparing the scores of black and white defendants (10%)

Make two histograms: one with the decile scores for white defendants, and one with the decile scores for black defendants. The histograms should allow the reader to understand how the scores for white defendants and the scores for black defendants differ.
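As a rough illustration of one possible layout (not the required one), the sketch below facets a histogram by group so that both panels share the same axes. It runs on synthetic data; the data frame toy, the column name race, and the group labels are assumptions that you should check against the actual file.

```r
library(ggplot2)

# Synthetic stand-in for the real data. The column names `race` and
# `decile_score`, and the group labels, are assumptions.
set.seed(1)
toy <- data.frame(
  race = rep(c("African-American", "Caucasian"), each = 200),
  decile_score = c(sample(1:10, 200, replace = TRUE),
                   sample(1:10, 200, replace = TRUE, prob = 10:1))
)

# With binwidth = 1, the density within each panel equals the proportion of
# that group's defendants at each score, so groups of different sizes
# can be compared directly.
p <- ggplot(toy, aes(x = decile_score)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1,
                 boundary = 0.5, colour = "white") +
  facet_wrap(~ race, ncol = 1) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "COMPAS decile score", y = "Proportion of defendants",
       title = "Decile scores by group (synthetic data)")
p
```

Stacking the panels vertically with a shared x-axis is one design choice that makes differences in shape easy to see; overlaying or placing the panels side by side are alternatives.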

Grading scheme

  • The figure displays the right data in a good way: 6 pts.

  • The figure is easy to read and understand (incl. labels, legend, captions, etc.): 4 pts.

Part 2: Initial evaluation of the COMPAS scores (15%)

Suppose that defendants with scores that are greater than or equal to 5 are considered to be “high-risk,” and other defendants are considered to be “low-risk.”

Compute the false positive rate, the false negative rate, and the correct classification rate for the entire population, for the population of white defendants separately, and for the population of black defendants separately. State the tentative conclusions that you can draw about the fairness of the COMPAS scores.

To obtain the context for the potential informativeness of the scores, compute the overall recidivism rate in the dataset. Comment on the difference between the overall recidivism rate and the correct classification rate using the score. Use is_recid as the variable that indicates whether the person recidivated.
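To make the definitions concrete, here is a minimal sketch on a tiny hand-made data frame. It treats “high-risk” as the positive prediction, so the FPR is the share of non-recidivists labelled high-risk and the FNR is the share of recidivists labelled low-risk. The helper rates() and the data frame toy are hypothetical names used only for this illustration.

```r
library(dplyr)

# Tiny hand-made example; group labels "A" and "B" are placeholders.
toy <- data.frame(
  race = c("A", "A", "A", "A", "B", "B", "B", "B"),
  decile_score = c(7, 3, 6, 2, 8, 4, 5, 1),
  is_recid = c(1, 0, 0, 0, 1, 1, 1, 0)
)

# Hypothetical helper: works on grouped or ungrouped data frames.
rates <- function(df, threshold = 5) {
  df %>%
    mutate(high_risk = decile_score >= threshold) %>%
    summarise(
      FPR = sum(high_risk & is_recid == 0) / sum(is_recid == 0),
      FNR = sum(!high_risk & is_recid == 1) / sum(is_recid == 1),
      CCR = mean(high_risk == (is_recid == 1))
    )
}

overall  <- rates(toy)                        # one row: whole population
by_group <- toy %>% group_by(race) %>% rates()  # one row per group
base_rate <- mean(toy$is_recid)               # overall recidivism rate
```

In this toy example the overall FPR and FNR are both 0.25 and the CCR is 0.75, while the recidivism rate is 0.5. Comparing the CCR with what you would get by always predicting the more common outcome is one way to judge how informative the score is.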

Grading scheme

  • The numbers are computed in a way that basically makes sense: 7 pts

  • The numbers are displayed and presented in an understandable and easy-to-read way: 3 pts

  • The text frames the numbers well, and the answer overall is written up well: 5 pts

Part 3: Altering the threshold (15%)

For the possible thresholds [0.5, 1, 1.5, 2, 2.5, 3, ..., 9.5], compute the FPR, FNR, and correct classification rate for the entire population, for white defendants, and for black defendants. Plot the results using ggplot. You should produce three plots with three curves each (one plot per demographic group), with the thresholds being on the x-axis.

For example, when we were predicting survival for the Titanic dataset, we used \(0.5\) as the threshold: pred <- (predict(fit, newdata = dat, type = "response") > 0.5)
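One way to organize the threshold sweep is sketched below for a single population, using synthetic data. It assumes the same “greater than or equal to” rule as in Part 2, and stores the results in long format (one row per threshold and metric) so that ggplot can draw one curve per metric. The names toy and sweep are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Synthetic data: higher scores recidivate more often (illustration only).
set.seed(2)
toy <- data.frame(decile_score = sample(1:10, 300, replace = TRUE))
toy$is_recid <- rbinom(300, 1, toy$decile_score / 11)

thresholds <- seq(0.5, 9.5, by = 0.5)

# One small data frame per threshold, stacked into long format.
sweep <- bind_rows(lapply(thresholds, function(t) {
  high_risk <- toy$decile_score >= t
  data.frame(
    threshold = t,
    metric = c("FPR", "FNR", "CCR"),
    value = c(
      sum(high_risk & toy$is_recid == 0) / sum(toy$is_recid == 0),
      sum(!high_risk & toy$is_recid == 1) / sum(toy$is_recid == 1),
      mean(high_risk == (toy$is_recid == 1))
    )
  )
}))

p <- ggplot(sweep, aes(x = threshold, y = value, colour = metric)) +
  geom_line() +
  labs(x = "Threshold on decile score", y = "Rate", colour = "Metric",
       title = "FPR, FNR, and CCR by threshold (synthetic data)")
p
```

At the lowest threshold everyone is labelled high-risk, so the FPR is 1 and the FNR is 0; this is a useful sanity check on your own curves.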

Grading scheme

  • FPR, FNR, and CCR computed correctly: 7 pts

  • The graphs display the correct information: 2 pts

  • The graphs display the information in a way that’s easy to read and understand (includes choice of labels, colors, etc.): 4 pts

  • The text in the answer frames the graphical information well: 2 pts

Part 4: Trying to reproduce the score (15%)

Fit a logistic regression model that predicts the probability of recidivism using the age and the number of priors of the defendant.

State the interpretations of the coefficients of the model.

For the threshold of \(0.5\) on the probability, obtain the FPR, FNR, and correct classification rates for the model for the entire population, for black defendants, and for white defendants, on both the training and the validation sets.

For a 30-year-old with 2 priors, what is the effect on the predicted probability of a re-arrest of one more prior offense?
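The sketch below shows, on synthetic data, the general shape of such an analysis: split the data, fit with glm, classify the validation set at 0.5, and compare the predicted probabilities for two hypothetical defendants who differ only by one prior. The column name priors_count, the 70/30 split, and the coefficients used to generate the data are all assumptions made for illustration.

```r
# Synthetic data; `priors_count` is an assumed column name.
set.seed(3)
n <- 500
toy <- data.frame(
  age = sample(18:70, n, replace = TRUE),
  priors_count = rpois(n, 2)
)
# Outcome generated from an arbitrary logistic model (illustration only).
toy$is_recid <- rbinom(n, 1,
                       plogis(0.5 - 0.04 * toy$age + 0.2 * toy$priors_count))

# Assumed 70/30 training/validation split.
idx <- sample(n, size = 0.7 * n)
train <- toy[idx, ]
valid <- toy[-idx, ]

fit <- glm(is_recid ~ age + priors_count, family = binomial, data = train)

# Classify the validation set at a probability threshold of 0.5.
pred_valid <- predict(fit, newdata = valid, type = "response") > 0.5
ccr_valid <- mean(pred_valid == (valid$is_recid == 1))

# Two 30-year-olds who differ by one prior offense.
newdat <- data.frame(age = c(30, 30), priors_count = c(2, 3))
p30 <- predict(fit, newdata = newdat, type = "response")
effect <- unname(p30[2] - p30[1])  # change in predicted probability
```

Note that the change in predicted probability depends on the starting point (here, age 30 with 2 priors), whereas the change in log-odds is always equal to the priors coefficient.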

Grading scheme

  • Correct logistic regression: 3 pts

  • Correct interpretation (note: “interpretation of coefficients” is a term of art: just do what’s in the slides): 2 pts

  • The FPR, FNR, and CCR are correctly computed: 3 pts

  • The numbers are presented in an easily readable and comprehensible way: 3 pts

  • Correct and easily understandable explanation of the numbers obtained for the 30-year-old with 2 priors: 4 pts

Part 5: Adding variables (15%)

For at least 4 more input variables, try including each one in the model, and use the validation set to decide whether the variables should be included in the model. Your report should include documentation and an explanation of the process you used.

What is the model that you obtained that produces the highest correct classification rate?

N.B., you should not add variables like two_year_recid and violent_recid, which are very similar to is_recid.
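One possible process, sketched on synthetic data: fit each candidate model on the training set only, and compare correct classification rates on the validation set. The helper valid_ccr() and the candidate variable extra are hypothetical; with the real data you would substitute actual columns.

```r
# Synthetic data; `extra` stands in for a candidate input variable.
set.seed(4)
n <- 600
toy <- data.frame(
  age = sample(18:70, n, replace = TRUE),
  priors_count = rpois(n, 2),
  extra = rnorm(n)
)
toy$is_recid <- rbinom(n, 1, plogis(0.5 - 0.04 * toy$age +
                                    0.2 * toy$priors_count + 0.8 * toy$extra))

idx <- sample(n, size = 0.7 * n)
train <- toy[idx, ]
valid <- toy[-idx, ]

# Hypothetical helper: fit on the training set, score on the validation set.
valid_ccr <- function(formula) {
  fit <- glm(formula, family = binomial, data = train)
  pred <- predict(fit, newdata = valid, type = "response") > 0.5
  mean(pred == (valid$is_recid == 1))
}

ccr_base  <- valid_ccr(is_recid ~ age + priors_count)
ccr_extra <- valid_ccr(is_recid ~ age + priors_count + extra)
c(base = ccr_base, with_extra = ccr_extra)
```

A variable would be kept only if it improves the validation rate; small differences may be noise, which is worth acknowledging in your write-up.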

Grading scheme

  • Good process for trying to include variables in the model: 6 pts

  • Good description and documentation (including code that was used) of the process: 6 pts

  • Good report on the outcome: 3 pts

Part 6: Adjusting thresholds (15%)

One appealing definition of a fair model is that the model has the same probability of labelling a defendant low-risk regardless of demographics, if the defendant will not end up being re-arrested. Build such a model by finding a combination of thresholds (which can vary by demographics) that produces such a result. Try to keep the FNR and the FPR as low as possible.

Report the FNR, FPR, and the correct classification rate of the system on the validation set, for the whole population.

For this part, you may manually try different thresholds (while documenting your process) rather than write a program to do that for you.
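The fairness criterion above amounts to equal false positive rates across groups, since the probability of labelling a non-recidivist low-risk is 1 minus the FPR. As a minimal sketch of the manual search, the hypothetical helper below takes a named vector of per-group thresholds and reports each group's FPR and FNR. The data are synthetic, and the column prob stands in for your model's predicted probabilities.

```r
library(dplyr)

# Synthetic predicted probabilities for two placeholder groups.
set.seed(5)
n <- 400
toy <- data.frame(group = rep(c("A", "B"), each = n / 2))
toy$prob <- ifelse(toy$group == "A", rbeta(n, 2, 3), rbeta(n, 3, 2))
toy$is_recid <- rbinom(n, 1, toy$prob)

# Hypothetical helper: `thresholds` is a named vector, one entry per group.
rates_by_group <- function(df, thresholds) {
  df %>%
    mutate(high_risk = prob > unname(thresholds[as.character(group)])) %>%
    group_by(group) %>%
    summarise(
      FPR = sum(high_risk & is_recid == 0) / sum(is_recid == 0),
      FNR = sum(!high_risk & is_recid == 1) / sum(is_recid == 1)
    )
}

same  <- rates_by_group(toy, c(A = 0.5, B = 0.5))  # common threshold
tuned <- rates_by_group(toy, c(A = 0.5, B = 0.6))  # one manual adjustment
```

Raising one group's threshold can only lower (or leave unchanged) that group's FPR, at the cost of a higher FNR, so you can adjust the thresholds step by step until the FPRs roughly match and record each attempt.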

Grading scheme

  • Good process + description of finding a combination of demographic-dependent thresholds: 12 pts

  • Good report on the results: 3 pts

Part 7: Data visualization (15%)

Pick two variables in the dataset, and produce a piece of data visualization that shows the relationship between the two variables and the COMPAS risk scores.

Explain what trends you are observing and briefly justify your choices regarding how you made the visualization.

Grading scheme

  • Good and informative visualization: 12 pts

  • Good explanation: 3 pts