---
title: "SML201 Precept 7, Spring 2020"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Create the file `p7.Rmd`, and create a report in that file. *Using R Markdown is mandatory*.

### Problem 1: Tidy Data (same as Precept 6 Problem 2)

It will likely not be completely trivial to display legends for each curve -- that's because, for you to be able to map color to "train"/"validation", `ggplot` needs the data to have only one y value per row, with a column that indicates whether that y belongs to the training or the validation curve.

Here is a way to transform the data:

```{r message = FALSE}
library(tidyverse)

# Wide format: B and C are two separate y columns
dat <- data.frame(a = c(4, 5, 5), B = c(10, 20, 30), C = c(100, 200, 300))

# Long format: one value per row, with B.or.C recording the source column
dat.longer <- pivot_longer(dat, cols = c(B, C), names_to = "B.or.C", values_to = "value")

dat
dat.longer
```

Use the technique above to display your graphs from Problem 1 using just one call to `geom_line` rather than two.

The information above is all you need to know for now -- but there is a longer explanation in the Week 7 lecture (including a video).

### Problem 2: Experiments using `replicate` (same as Precept 6 Problem 3)

Here is how you can use `replicate` to repeatedly run the same experiment.

```{r}
res <- replicate(10, sample(c(1, 2, 3, 4)))
res
```

Here, we ran `sample(c(1, 2, 3, 4))` 10 times. Each column represents the result of one experiment.

You will usually just obtain a single number from one experiment. Here is an example:

```{r}
replicate(10, mean(sample(c(1, 2, 3, 4), size = 2)))
```

Repeatedly sample a training set of size 15 from `titanic`, and create two histograms: one for the performances on the training set, and one for the performances on the test set.

Repeatedly sample a training set of size 25 from `titanic`, and create two histograms: one for the performances (i.e., CCRs) on the training set, and one for the performances (i.e., CCRs) on the test set.
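
As a rough illustration of the `replicate` workflow, here is a minimal sketch on simulated data. The data frame `toy`, its columns `x` and `y`, and the helpers `ccr` and `one.run` are all hypothetical names made up for this example; they are not part of the `titanic` data or of any required solution. The 0.5 cutoff for classifying a prediction as positive is also an assumption.

```r
library(tidyverse)

set.seed(1)

# Hypothetical stand-in for `titanic`: simulated data, not the real dataset.
n <- 500
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(toy$x))

# Correct classification rate (CCR) of a fitted logistic model on `data`,
# assuming a prediction counts as positive when the probability exceeds 0.5.
ccr <- function(fit, data) {
  pred <- predict(fit, newdata = data, type = "response") > 0.5
  mean(pred == (data$y == 1))
}

# One experiment: sample a small training set, fit, and return both CCRs.
one.run <- function(train.size) {
  idx <- sample(nrow(toy), train.size)
  fit <- glm(y ~ x, family = binomial, data = toy[idx, ])
  c(train = ccr(fit, toy[idx, ]), test = ccr(fit, toy[-idx, ]))
}

# replicate() binds the length-2 results as columns: a 2 x 200 matrix.
res <- replicate(200, one.run(25))

# Transpose so that each run is a row, then reshape to one CCR per row.
res.longer <- pivot_longer(as.data.frame(t(res)),
                           cols = c(train, test),
                           names_to = "set", values_to = "ccr")

# One geom_histogram call, with one panel per set.
p <- ggplot(res.longer, aes(x = ccr)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ set)
p
```

Because each run returns a named vector, the rows of `res` are labeled `train` and `test`, which is what lets `pivot_longer` refer to them by name after the transpose.
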
You should use `ggplot`'s `geom_histogram` geom.

#### Hints and suggestions

Here is a suggestion for how to proceed:

* First, write a function that samples a small training set, fits a model on it, and returns the performance on the small training set as well as on the validation set. The function should return a vector of length 2.

* Second, use `replicate` to repeatedly call the function. Because the function you wrote returns a vector of length 2 every time, rather than just one value, you will get a *matrix*. You can treat the matrix like a dataframe when working with rows and columns. For example, you can extract the first column using `m[, 1]`. Since each column is the result of one experiment, the first row, `m[1, ]`, holds the first returned value from every run.

* Observe that you get the same kind of thing as what you got in Problem 1.

* Make histograms (rather than curves, as in Problem 1).

### Problem 3: Calibration

Fit a model on the Titanic dataset to predict survival using `glm(Survived ~ Sex + Age + Pclass, family = binomial)`.

Does this model satisfy False Positive Parity with respect to sex? Does it satisfy [Calibration](https://www.youtube.com/watch?v=VE4exCVC9OE)? Does it satisfy demographic parity?
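
The per-group quantities behind these questions can be sketched as follows. This is only an illustration on simulated data: the data frame `sim` and its columns `group`, `x`, and `y` are hypothetical, the 0.5 cutoff is an assumption, and the definitions used here are common ones that may differ in detail from those in the linked video. In particular, the `ppv` column is just one simple way to look at calibration.

```r
library(tidyverse)

set.seed(2)

# Hypothetical simulated data; `group`, `x`, and `y` are made-up names.
n <- 2000
sim <- data.frame(group = sample(c("a", "b"), n, replace = TRUE),
                  x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(sim$x + (sim$group == "a")))

# Fit a logistic regression and classify with an assumed 0.5 cutoff.
fit <- glm(y ~ x + group, family = binomial, data = sim)
sim$prob <- predict(fit, type = "response")
sim$pred <- as.integer(sim$prob > 0.5)

# Per-group quantities to compare across groups.
by.group <- sim %>%
  group_by(group) %>%
  summarize(
    # False positive rate: P(pred = 1 | y = 0)
    fpr = mean(pred[y == 0]),
    # Rate of positive predictions: P(pred = 1), for demographic parity
    pos.rate = mean(pred),
    # Observed outcome rate among positive predictions: P(y = 1 | pred = 1)
    ppv = mean(y[pred == 1])
  )
by.group
```

Each criterion then comes down to asking whether the corresponding column is approximately equal across the rows of `by.group`.
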