Create the file p7.Rmd, and create a report in that file. Using R Markdown is mandatory.
Displaying a legend for each curve takes some care: to map color to “train”/“validation”, ggplot needs the data to have only one y value per row, with a column that indicates whether that y belongs to the training or the validation curve. Here is a way to transform the data:
library(tidyverse)
dat <- data.frame(a = c(4, 5, 5), B = c(10, 20, 30), C = c(100, 200, 300))
dat.longer <- pivot_longer(dat, cols = c(B, C), names_to = "B.or.C", values_to = "value")
dat
## a B C
## 1 4 10 100
## 2 5 20 200
## 3 5 30 300
dat.longer
## # A tibble: 6 x 3
## a B.or.C value
## <dbl> <chr> <dbl>
## 1 4 B 10
## 2 4 C 100
## 3 5 B 20
## 4 5 C 200
## 5 5 B 30
## 6 5 C 300
Use the technique above to display your graphs from Problem 1 using just one call to geom_line rather than two.
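As a rough illustration of how the reshaped data feeds a single geom_line, here is a minimal sketch. The data frame `perf` and its columns `n.train`, `train`, and `validation` are hypothetical stand-ins with made-up numbers; your own Problem 1 results will have different names and values.

```r
library(tidyr)
library(ggplot2)

# Hypothetical learning-curve results in wide format: one row per
# training-set size, one column per curve. The numbers are made up.
perf <- data.frame(
  n.train    = c(10, 20, 50, 100),
  train      = c(0.95, 0.90, 0.86, 0.84),
  validation = c(0.70, 0.75, 0.79, 0.81)
)

# Long format: one y value per row, plus a column naming the curve.
perf.longer <- pivot_longer(perf, cols = c(train, validation),
                            names_to = "set", values_to = "CCR")

# A single geom_line: mapping color to `set` draws both curves
# and produces the legend automatically.
p <- ggplot(perf.longer, aes(x = n.train, y = CCR, color = set)) +
  geom_line()
p
```

Because color is mapped inside aes(), ggplot builds the legend from the values in the `set` column, so no manual legend code is needed.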
The information above is all you need to know for now, but there is a longer explanation in the Week 7 lecture (including a video).
replicate (same as Precept 6 Problem 3)
Here is how you can use replicate to repeatedly run the same experiment.
res <- replicate(10, sample(c(1, 2, 3, 4)))
res
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3 2 4 4 1 2 1 1 1 4
## [2,] 2 3 1 3 2 3 3 3 2 3
## [3,] 1 4 3 2 4 1 2 2 4 2
## [4,] 4 1 2 1 3 4 4 4 3 1
Here, we ran sample(c(1, 2, 3, 4)) 10 times. Each column represents the result of one experiment. You will usually just obtain a single number from one experiment. Here is an example:
replicate(10, mean(sample(c(1, 2, 3, 4), size = 2)))
## [1] 1.5 2.5 3.0 1.5 3.5 3.0 3.0 2.0 3.5 1.5
Repeatedly sample a training set of size 15 from titanic, and create two histograms: one for the performances on the training set, and one for the performances on the test set.
Repeatedly sample a training set of size 25 from titanic, and create two histograms: one for the performances (i.e., CCRs) on the training set, and one for the performances (i.e., CCRs) on the test set. You should use ggplot’s geom_histogram geom.
Here is a suggestion for how to proceed:
First, write a function that samples a small training set, fits a model on it, and returns the performance on the small training set as well as the validation set. You will return a vector of length 2.
Second, use replicate to repeatedly call the function. Because the function you wrote returns a vector of length 2 every time, rather than just one value, you will get a matrix with one column per experiment. You can index the matrix by rows and columns much like a dataframe. For example, m[, 1] extracts the first column (the result of the first experiment), and m[1, ] extracts the first row (the first returned value from every experiment).
Observe that you get the same kind of thing as what you got in Problem 1.
Make histograms (rather than curves, as in Problem 1).
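The suggested steps could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses simulated stand-in data (not the course's titanic data), a one-predictor logistic regression, a 0.5 classification threshold, and hypothetical helper names (`ccr`, `one.experiment`).

```r
library(ggplot2)

set.seed(1)  # only so that the sketch is reproducible

# Simulated stand-in data (NOT the titanic data).
n <- 500
x <- rnorm(n)
dat <- data.frame(x = x, y = rbinom(n, 1, plogis(1.5 * x)))

# Correct classification rate of a fitted model on data `d`,
# using a 0.5 threshold (an assumption).
ccr <- function(fit, d) {
  pred <- predict(fit, newdata = d, type = "response") > 0.5
  mean(pred == (d$y == 1))
}

# One experiment: sample a small training set, fit a model, and return
# the CCR on the training rows and on the remaining (validation) rows.
one.experiment <- function(n.train) {
  idx <- sample(nrow(dat), n.train)
  # Tiny training sets can trigger glm convergence warnings; silence them here.
  fit <- suppressWarnings(glm(y ~ x, family = binomial, data = dat[idx, ]))
  c(train = ccr(fit, dat[idx, ]), valid = ccr(fit, dat[-idx, ]))
}

# 2 x 200 matrix: one column per experiment, rows named "train" and "valid".
m <- replicate(200, one.experiment(25))
dim(m)

# Transpose so each row is one experiment, then draw one histogram per column.
res <- as.data.frame(t(m))
p.train <- ggplot(res, aes(x = train)) + geom_histogram(bins = 15)
p.valid <- ggplot(res, aes(x = valid)) + geom_histogram(bins = 15)
```

Printing `p.train` and `p.valid` displays the two histograms. The transpose is needed because replicate puts each experiment in a column, while ggplot expects one observation per row.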
Fit a model on the Titanic dataset to predict survival using
glm(Survived ~ Sex + Age + Pclass, family = binomial)
Does this model satisfy False Positive Parity with respect to sex? Does it satisfy Calibration? Does it satisfy Demographic Parity?
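One possible way to compute the per-group quantities behind these questions is sketched below. It assumes a 0.5 classification threshold and commonly used definitions of the criteria, which may differ from the definitions given in lecture. The data are simulated with the same column names as the formula above and are not the real Titanic data. The helper `group.stats` is hypothetical.

```r
set.seed(2)

# Simulated stand-in with the same column names as the formula above
# (NOT the real Titanic data; the numbers carry no meaning).
n <- 800
sim <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70)),
  Pclass = sample(1:3, n, replace = TRUE)
)
p.true <- plogis(2.5 - 1.5 * (sim$Sex == "male") - 0.04 * sim$Age -
                 0.5 * (sim$Pclass - 1))
sim$Survived <- rbinom(n, 1, p.true)

fit <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = sim)
sim$prob <- predict(fit, type = "response")
sim$pred <- as.integer(sim$prob > 0.5)  # the 0.5 threshold is an assumption

# Summaries for one group of rows `d`.
group.stats <- function(d) {
  c(
    # Demographic parity compares P(pred = 1) across groups.
    positive.rate = mean(d$pred == 1),
    # False positive parity compares P(pred = 1 | Survived = 0) across groups.
    false.positive.rate = mean(d$pred[d$Survived == 0] == 1),
    # A coarse calibration check: P(Survived = 1 | pred = 1) across groups.
    ppv = mean(d$Survived[d$pred == 1] == 1)
  )
}

# 3 x 2 matrix: one row per quantity, one column per sex.
stats <- sapply(split(sim, sim$Sex), group.stats)
stats
```

Each criterion holds approximately when the corresponding row of `stats` is close to equal across the two columns. A finer calibration check would compare observed survival rates within bins of `prob` rather than only among predicted positives.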