Create the file p7.Rmd, and create a report in that file. Using R Markdown is mandatory.

Problem 1: Tidy Data (same as Precept 6 Problem 2)

Displaying a legend for each curve is not entirely trivial: to map color to “train”/“validation”, ggplot needs the data to have only one y value per row, along with a column that indicates whether that y belongs to the training or the validation curve. Here is a way to transform the data:

library(tidyverse)
dat <- data.frame(a = c(4, 5, 5), B = c(10, 20, 30), C = c(100, 200, 300))
dat.longer <- pivot_longer(dat, cols = c(B, C), names_to = "B.or.C", values_to = "value")
dat
##   a  B   C
## 1 4 10 100
## 2 5 20 200
## 3 5 30 300
dat.longer
## # A tibble: 6 x 3
##       a B.or.C value
##   <dbl> <chr>  <dbl>
## 1     4 B         10
## 2     4 C        100
## 3     5 B         20
## 4     5 C        200
## 5     5 B         30
## 6     5 C        300

Use the technique above to display your graphs from Problem 1 using just one call to geom_line rather than two.
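As a minimal sketch of how the reshaped data feeds a single geom_line call, assume a hypothetical data frame `curves` with a column `n` (training set size) and one column per curve, `train` and `validation`. These names and all the numbers are made up for illustration and are not from the assignment data.

```r
library(tidyverse)

# Hypothetical learning-curve data: one row per training-set size,
# one column per curve (values are made up for illustration).
curves <- data.frame(
  n          = c(10, 20, 50, 100),
  train      = c(0.95, 0.90, 0.86, 0.84),
  validation = c(0.70, 0.75, 0.79, 0.81)
)

# Reshape so that each row holds a single y value plus a label
# saying which curve that value belongs to.
curves.longer <- pivot_longer(curves,
                              cols = c(train, validation),
                              names_to = "set",
                              values_to = "CCR")

# A single geom_line call: mapping color to `set` draws one line
# per label and produces the legend automatically.
p <- ggplot(curves.longer, aes(x = n, y = CCR, color = set)) +
  geom_line()
p
```

Because color is mapped inside aes, ggplot builds the legend from the values of the `set` column, so no manual legend code is needed.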

The information above is all you need to know for now, but there is a longer explanation in the Week 7 lecture (including a video).

Problem 2: Experiments using replicate (same as Precept 6 Problem 3)

Here is how you can use replicate to repeatedly run the same experiment.

res <- replicate(10, sample(c(1, 2, 3, 4)))
res
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    3    2    4    4    1    2    1    1    1     4
## [2,]    2    3    1    3    2    3    3    3    2     3
## [3,]    1    4    3    2    4    1    2    2    4     2
## [4,]    4    1    2    1    3    4    4    4    3     1

Here, we ran sample(c(1, 2, 3, 4)) 10 times. Each column represents the result of one experiment. You will usually obtain just a single number from each experiment. Here is an example:

replicate(10, mean(sample(c(1, 2, 3, 4), size = 2)))
##  [1] 1.5 2.5 3.0 1.5 3.5 3.0 3.0 2.0 3.5 1.5

Repeatedly sample a training set of size 15 from titanic, and create two histograms: one for the performances on the training set, and one for the performances on the test set.

Repeatedly sample a training set of size 25 from titanic, and create two histograms: one for the performances (i.e., CCRs) on the training set, and one for the performances (i.e., CCRs) on the test set. You should use ggplot’s geom_histogram geom.

Hints and suggestions

Here is a suggestion for how to proceed:

  • First, write a function that samples a small training set, fits a model on it, and returns the performance on the small training set as well as the validation set. You will return a vector of length 2.

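  • Second, use replicate to repeatedly call the function. Because the function you wrote returns a vector of length 2 every time, rather than just one value, you will get a matrix with 2 rows and one column per experiment. You can treat the matrix like a dataframe when working with rows and columns. For example, you can extract the first column (the result of the first experiment) using m[, 1], and the first row (the first returned value across all experiments) using m[1, ].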

  • Observe that you get the same kind of thing as what you got in Problem 1.

  • Make histograms (rather than curves, as in Problem 1).
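
The steps above can be sketched as follows. Because the titanic data frame is not reproduced here, the sketch uses simulated data. The names `sim`, `pool`, `valid`, `ccr`, and `one.experiment`, the 0.5 threshold, and the single-predictor logistic regression are all assumptions for illustration, not the required solution.

```r
library(tidyverse)

set.seed(1)

# Simulated stand-in for the real data (an assumption; not titanic).
n.total <- 500
sim <- data.frame(x = rnorm(n.total))
sim$y <- rbinom(n.total, size = 1, prob = plogis(1.5 * sim$x))

# Hold out a fixed validation set; sample training sets from the rest.
valid.idx <- sample(n.total, 200)
valid <- sim[valid.idx, ]
pool  <- sim[-valid.idx, ]

# Correct classification rate of a fitted model on a data frame,
# thresholding predicted probabilities at 0.5.
ccr <- function(fit, dat) {
  pred <- predict(fit, newdata = dat, type = "response") > 0.5
  mean(pred == (dat$y == 1))
}

# One experiment: sample a small training set, fit a model, and return
# a vector of length 2 (training CCR, validation CCR).
one.experiment <- function(train.size) {
  train <- pool[sample(nrow(pool), train.size), ]
  # Tiny training sets can be perfectly separable; glm then warns.
  fit <- suppressWarnings(glm(y ~ x, family = binomial, data = train))
  c(train = ccr(fit, train), validation = ccr(fit, valid))
}

# replicate returns a 2 x 200 matrix: one column per experiment.
m <- replicate(200, one.experiment(25))

# Transpose so each experiment is a row, then reshape to long form.
res <- as.data.frame(t(m)) %>%
  pivot_longer(cols = c(train, validation),
               names_to = "set", values_to = "CCR")

# One histogram panel per set.
ggplot(res, aes(x = CCR)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ set)
```

Transposing before reshaping is a design choice: it puts one experiment per row, which is the layout pivot_longer expects.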

Problem 3: Calibration

Fit a model on the Titanic dataset to predict survival using

glm(Survived ~ Sex + Age + Pclass, family = binomial, data = titanic)

Does this model satisfy False Positive Parity with respect to sex? Does it satisfy Calibration? Does it satisfy Demographic Parity?
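
One way to make these three criteria concrete is sketched below on simulated data. The data frame `sim`, the group column `g`, the 0.5 threshold, and the helper `fairness.summary` are hypothetical. The sketch assumes the common definitions: false positive parity compares P(predicted 1 | actual 0) across groups, demographic parity compares P(predicted 1) across groups, and calibration compares predicted probabilities with observed rates of positives within each group. Check these against the definitions used in lecture. The calibration check here is coarse, since it compares only group-level averages rather than bins of predicted probability.

```r
set.seed(2)

# Simulated stand-in data (an assumption; not the Titanic dataset).
n <- 2000
sim <- data.frame(g = sample(c("a", "b"), n, replace = TRUE),
                  x = rnorm(n))
sim$y <- rbinom(n, size = 1,
                prob = plogis(-0.5 + 1.0 * (sim$g == "a") + sim$x))

fit <- glm(y ~ g + x, family = binomial, data = sim)
sim$prob <- predict(fit, type = "response")  # predicted P(y = 1)
sim$pred <- as.integer(sim$prob > 0.5)       # thresholded prediction

# Per-group quantities behind the three criteria.
fairness.summary <- function(d) {
  c(fpr           = mean(d$pred[d$y == 0] == 1),  # false positive rate
    positive.rate = mean(d$pred == 1),            # for demographic parity
    mean.prob     = mean(d$prob),                 # for calibration ...
    observed.rate = mean(d$y))                    # ... compared with this
}

# One column per group, one row per quantity.
by.group <- sapply(split(sim, sim$g), fairness.summary)
by.group
```

Large gaps between the two columns in the `fpr` or `positive.rate` rows would suggest that the corresponding parity does not hold at this threshold.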