You will submit an `Rmd` file and the `pdf` file that was knit from it. Your code should be general enough that if the dataset were changed, the code would still run.

You will be graded on the correctness of your code as well as on the professionalism of the presentation of your report. The text of your report should be clear; the figures should do a good job of visualizing the data, and should have appropriate labels and legends. Overall, the document you produce should be easy to read and understand.

You should use `ggplot` for plotting and `dplyr` (where needed) for wrangling the data.

When you are asked to compute something, use code (rather than compute things outside of R and then input the answer), and include the code that you used in the report.

ProPublica obtained the public record on over 10,000 criminal defendants in Broward County, Florida. They also computed a variable that indicates whether each person was arrested within two years of being assessed. The data is available here. Download and read in the data.

The COMPAS scores you will be analyzing are the “decile scores” in the data frame (the column `decile_score`).

Make two histograms: one with the decile scores for white defendants, and one with the decile scores for black defendants. The histograms should allow the reader to understand how the scores for white defendants and the scores for black defendants differ.
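One possible layout is sketched below on simulated data: one histogram panel per group, stacked on a shared x-axis so the two distributions can be compared directly. The `race` column and its labels are assumptions about the data file, the toy scores are made up, and the plotting step is skipped if `ggplot2` is not installed.

```r
# Toy data standing in for the real file. `decile_score` is named in the
# handout; the `race` column and its labels are assumptions.
set.seed(1)
toy <- data.frame(
  race = rep(c("African-American", "Caucasian"), each = 200),
  decile_score = c(sample(1:10, 200, replace = TRUE),
                   sample(1:10, 200, replace = TRUE, prob = 10:1))
)

# Share of each group at each score, so groups of different size are comparable
shares <- prop.table(table(toy$race, toy$decile_score), margin = 1)
round(shares, 2)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = decile_score)) +
    geom_histogram(binwidth = 1, boundary = 0.5, colour = "white") +
    facet_wrap(~ race, ncol = 1) +          # one panel per group, same x-axis
    scale_x_continuous(breaks = 1:10) +
    labs(x = "COMPAS decile score", y = "Number of defendants")
  print(p)
}
```

Stacking the panels vertically keeps the score axis aligned; whether to plot counts or within-group shares is a design choice worth justifying in the report.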

*Grading scheme*

The figure displays the right data in a good way: 6 pts.

The figure is easy to read and understand (incl. labels, legend, captions, etc.): 4 pts.

Suppose that defendants with scores that are greater than or equal to 5 are considered to be “high-risk,” and other defendants are considered to be “low-risk.”

Compute the false positive rate, the false negative rate, and the correct classification rate for the entire population, for the population of white defendants separately, and for the population of black defendants separately. State the tentative conclusions that you can draw about the fairness of the COMPAS scores.

To obtain the context for the potential informativeness of the scores, compute the overall recidivism rate in the dataset. Comment on the difference between the overall recidivism rate and the correct classification rate using the score. The variable `is_recid` indicates whether the person recidivated.
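As a reminder of the definitions, the three rates can be read off a confusion matrix. The sketch below uses a hypothetical helper `rates()` and an eight-row toy data frame (not the real data), treating “high-risk” as a positive prediction.

```r
# Hypothetical helper: rates for a 0/1 prediction against a 0/1 outcome
rates <- function(pred, actual) {
  fp <- sum(pred == 1 & actual == 0)  # labelled high-risk, not re-arrested
  fn <- sum(pred == 0 & actual == 1)  # labelled low-risk, re-arrested
  tp <- sum(pred == 1 & actual == 1)
  tn <- sum(pred == 0 & actual == 0)
  c(FPR = fp / (fp + tn),
    FNR = fn / (fn + tp),
    CCR = (tp + tn) / length(actual))
}

# Made-up scores and outcomes, for illustration only
toy <- data.frame(
  decile_score = c(2, 7, 5, 1, 9, 4, 6, 3),
  is_recid     = c(0, 0, 1, 0, 1, 1, 0, 0)
)

high_risk <- as.integer(toy$decile_score >= 5)
rates(high_risk, toy$is_recid)

# Overall recidivism rate, for context
base_rate <- mean(toy$is_recid)
base_rate
```

On this toy data, a rule that labels everyone low-risk would be correct for a share `1 - base_rate` of the rows, which is the kind of baseline the correct classification rate can be compared against.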

*Grading scheme*

The numbers are computed in a way that basically makes sense: 7 pts

The numbers are displayed and presented in an understandable and easy-to-read way: 3 pts

The text frames the numbers well, and the answer overall is written up well: 5 pts

For each of the possible thresholds `[0.5, 1, 1.5, 2, 2.5, 3, ..., 9.5]`, compute the FPR, FNR, and correct classification rate for the entire population, for white defendants, and for black defendants. Plot the results using `ggplot`. You should produce three plots with three curves each (one plot per demographic group), with the thresholds being on the x-axis.

For example, when we were predicting survival for the `titanic` dataset, we used \(0.5\) as the threshold when using `pred <- (predict(fit, newdata = dat, type = "response") > 0.5)`.
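A threshold sweep of this kind might be organized as in the following sketch, which runs on simulated scores for a single population. It assumes a score strictly above the threshold counts as “high-risk” (so a threshold of 4.5 matches “greater than or equal to 5” for integer scores); the names `sweep` and `long` are illustrative, and the plot is drawn only if `ggplot2` is installed.

```r
# Simulated scores and outcomes, for illustration only
set.seed(2)
toy <- data.frame(
  decile_score = sample(1:10, 300, replace = TRUE),
  is_recid     = rbinom(300, 1, 0.4)
)

thresholds <- seq(0.5, 9.5, by = 0.5)

# One row per threshold
sweep <- do.call(rbind, lapply(thresholds, function(thr) {
  pred   <- toy$decile_score > thr
  actual <- toy$is_recid == 1
  data.frame(
    threshold = thr,
    FPR = sum(pred & !actual) / sum(!actual),
    FNR = sum(!pred & actual) / sum(actual),
    CCR = mean(pred == actual)
  )
}))

# Long format: one row per (threshold, metric) pair, convenient for ggplot
long <- data.frame(
  threshold = rep(sweep$threshold, times = 3),
  metric    = rep(c("FPR", "FNR", "CCR"), each = nrow(sweep)),
  value     = c(sweep$FPR, sweep$FNR, sweep$CCR)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(
    ggplot(long, aes(x = threshold, y = value, colour = metric)) +
      geom_line() +
      labs(x = "Score threshold", y = "Rate", colour = "Metric")
  )
}
```

Repeating the same computation on each demographic subset gives one such plot per group.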

*Grading scheme*

FPR, FNR, and CCR computed correctly: 7 pts

The graphs display the correct information: 2 pts

The graphs display the information in a way that’s easy to read and understand (includes choice of labels, colors, etc.): 4 pts

The text in the answer frames the graphical information well: 2 pts

Fit a logistic regression model that predicts the probability of recidivism using the age and the number of priors of the defendant.

State the interpretations of the coefficients of the model.

For the threshold of \(0.5\) on the probability, obtain the FPR, FNR, and correct classification rates for the model for the entire population, for black defendants, and for white defendants, on both the training and the validation sets.

For a 30-year-old with 2 priors, what is the effect on the predicted probability of a re-arrest of one more prior offense?
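The mechanics of such a comparison can be sketched on simulated data: fit the model with `glm`, then compare the predicted probabilities at 2 and at 3 priors for the same age. The column name `priors_count` and the coefficients used to simulate the outcome are assumptions made for illustration, not values from the real data.

```r
set.seed(3)
n <- 500
toy <- data.frame(
  age          = sample(18:70, n, replace = TRUE),
  priors_count = rpois(n, 2)       # assumed column name
)
# Simulated outcome with made-up coefficients
toy$is_recid <- rbinom(n, 1, plogis(0.5 - 0.04 * toy$age + 0.25 * toy$priors_count))

fit <- glm(is_recid ~ age + priors_count, data = toy, family = binomial)

# exp(coefficient): multiplicative change in the odds per one-unit increase
exp(coef(fit))

# Predicted probabilities for a 30-year-old with 2 priors and with 3 priors
newdat <- data.frame(age = 30, priors_count = c(2, 3))
p <- predict(fit, newdata = newdat, type = "response")
p[2] - p[1]   # change in predicted probability from one more prior
```

The change in probability depends on the starting point (here, age 30 and 2 priors), whereas the change in the odds is the same factor `exp(coef(fit)["priors_count"])` everywhere.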

*Grading scheme*

Correct logistic regression: 3 pts

Correct interpretation (note: “interpretation of coefficients” is a term of art: just do what’s in the slides): 2 pts

The FPR, FNR, and CCR are correctly computed: 3 pts

The numbers are presented in an easily readable and comprehensible way: 3 pts

Correct and easily understandable explanation of the numbers obtained for the 30-year-old with 2 priors: 4 pts

For at least 4 more input variables, try including each one in the model, and use the validation set to decide whether the variables should be included in the model. Your report should include documentation and an explanation of the process you used.
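One way to structure the comparison is sketched below on simulated data: fit each candidate model on the training rows only and compare correct classification rates on the validation rows. The 70/30 split, the helper `valid_ccr()`, and the candidate variable `extra` are hypothetical, as is the column name `priors_count`.

```r
set.seed(5)
n <- 600
toy <- data.frame(
  age          = sample(18:70, n, replace = TRUE),
  priors_count = rpois(n, 2),
  extra        = rnorm(n)          # hypothetical candidate variable
)
toy$is_recid <- rbinom(n, 1, plogis(0.5 - 0.04 * toy$age + 0.25 * toy$priors_count))

# Random split into training and validation rows (the ratio is an assumption)
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Fit on the training set, score on the validation set at threshold 0.5
valid_ccr <- function(formula) {
  fit  <- glm(formula, data = train, family = binomial)
  pred <- predict(fit, newdata = valid, type = "response") > 0.5
  mean(pred == (valid$is_recid == 1))
}

ccr <- c(base       = valid_ccr(is_recid ~ age + priors_count),
         with_extra = valid_ccr(is_recid ~ age + priors_count + extra))
ccr
```

A variable would be kept only if it improves the validation rate; differences of a fraction of a percentage point on a small validation set may simply be noise.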

What is the model that you obtained that produces the highest correct classification rate?

N.B., you should not add variables like `two_year_recid` and `violent_recid`, which are very similar to `is_recid`.

*Grading scheme*

Good process for trying to include variables in the model: 6 pts

Good description and documentation (including code that was used) of the process: 6 pts

Good report on the outcome: 3 pts

One appealing definition of a fair model is that the model has the same probability of labelling a defendant low-risk regardless of demographics, if the defendant will not end up being re-arrested. Build such a model by finding a combination of thresholds (which can vary by demographics) that produces such a result. Try to keep the FNR and the FPR as low as possible.

Report the FNR, FPR, and the correct classification rate of the system on the validation set, for the whole population.

For this part, you may manually try different thresholds (while documenting your process) rather than write a program to do that for you.
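The fairness criterion above amounts to equal false positive rates across groups, so a manual search could check the per-group FPR for each candidate pair of thresholds. The sketch below does this on simulated data; the group labels, the predicted probabilities in `prob`, and the thresholds in `group_thr` are all made up for illustration.

```r
set.seed(4)
n <- 400
toy <- data.frame(
  group    = rep(c("A", "B"), each = n / 2),
  is_recid = rbinom(n, 1, 0.45)
)
# Made-up model probabilities; group B is shifted upward on purpose
toy$prob <- plogis(rnorm(n, mean = ifelse(toy$group == "B", 0.4, 0) + toy$is_recid))

# FPR for one threshold: share of non-recidivists labelled high-risk
fpr <- function(d, thr) {
  neg <- d$is_recid == 0
  mean(d$prob[neg] > thr)
}

# Hypothetical hand-picked thresholds, one per group
group_thr <- c(A = 0.50, B = 0.60)
fpr_by_group <- sapply(names(group_thr), function(g) {
  fpr(toy[toy$group == g, ], group_thr[[g]])
})
fpr_by_group
```

Adjusting one threshold at a time and rerunning the last step shows how close the two rates are; each attempt can be recorded in the report as part of the documented process.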

*Grading scheme*

Good process + description of finding a combination of demographic-dependent thresholds: 12 pts

Good report on the results: 3 pts

Pick two variables in the dataset, and produce a piece of data visualization that shows the relationship between the two variables and the COMPAS risk scores.

Explain what trends you are observing and briefly justify your choices regarding how you made the visualization.

*Grading scheme*

Good and informative visualization: 12 pts

Good explanation: 3 pts