---
title: "SML201 Precept 7, Spring 2020"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


Create the file `p7.Rmd`, and create a report in that file. *Using R Markdown is mandatory*.

### Problem 1: Tidy Data (same as Precept 6 Problem 2)

Displaying a legend for each curve is not entirely trivial. To map color to "train"/"validation", `ggplot` needs the data to have only one y value per row, along with a column that indicates whether that y value belongs to the training curve or the validation curve. Here is a way to transform the data:


```{r message = FALSE}
library(tidyverse)
dat <- data.frame(a = c(4, 5, 5), B = c(10, 20, 30), C = c(100, 200, 300))
dat.longer <- pivot_longer(dat, cols = c(B, C), names_to = "B.or.C", values_to = "value")
dat
dat.longer
```

Use the technique above to display your graphs from Problem 1 of Precept 6 using just one call to `geom_line` rather than two.
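As a rough illustration of how the long format feeds a single `geom_line` call, here is a minimal sketch on a made-up data frame. The names `curves`, `n`, `train`, `validation`, `set`, and `performance`, and all of the values, are hypothetical and are not the Precept 6 data.

```{r message = FALSE}
library(tidyverse)

# Hypothetical wide data: one row per training-set size, one column per curve
curves <- data.frame(n = c(10, 20, 30),
                     train = c(0.95, 0.90, 0.88),
                     validation = c(0.70, 0.78, 0.81))

# Long format: one y value per row, plus a column naming the curve
curves.longer <- pivot_longer(curves, cols = c(train, validation),
                              names_to = "set", values_to = "performance")

# A single geom_line; mapping color to `set` draws both curves and a legend
ggplot(curves.longer, aes(x = n, y = performance, color = set)) +
  geom_line()
```

Because `color` is mapped inside `aes`, `ggplot` builds the legend on its own; no separate legend code is needed in this sketch.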

The information above is all you need to know for now, but there is a longer explanation (including a video) in the Week 7 lecture.


### Problem 2: Experiments using `replicate` (same as Precept 6 Problem 3)

Here is how you can use `replicate` to repeatedly run the same experiment.

```{r}
res <- replicate(10, sample(c(1, 2, 3, 4)))
res
```


Here, we ran `sample(c(1, 2, 3, 4))` 10 times. Each column holds the result of one experiment. Usually, though, a single experiment produces just one number. Here is an example:

```{r}
replicate(10, mean(sample(c(1, 2, 3, 4), size = 2)))

```
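To make the shape of the two results concrete, the following sketch checks their dimensions. The names `res.matrix` and `res.vector` are hypothetical and are used only here.

```{r}
# Each call returns a permutation of length 4, so 10 calls give a 4 x 10 matrix
res.matrix <- replicate(10, sample(c(1, 2, 3, 4)))
dim(res.matrix)

# Each call returns one number, so 10 calls give a plain vector of length 10
res.vector <- replicate(10, mean(sample(c(1, 2, 3, 4), size = 2)))
length(res.vector)
```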


Repeatedly sample a training set of size 25 from `titanic`, and create two histograms: one for the performances (i.e., CCRs) on the training set, and one for the performances (i.e., CCRs) on the test set. You should use `ggplot`'s `geom_histogram` geom.


#### Hints and suggestions


Here is a suggestion for how to proceed:

* First, write a function that samples a small training set, fits a model on it, and returns the performance on the small training set as well as the validation set. You will return a vector of length 2.

* Second, use `replicate` to repeatedly call the function. Because the function you wrote returns a vector of length 2 every time, rather than just one value, you will get a *matrix*. You can treat the matrix like a dataframe when working with rows and columns. For example, you can extract the first column using `m[, 1]`. Each column of the matrix holds the result of one call, so the first row, `m[1, ]`, collects the first returned value across all calls.

* Observe that you get the same kind of data as in Problem 1.

* Make histograms (rather than curves, as in Problem 1).
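The steps above can be sketched as follows. This is only one possible shape for the solution, and it runs on simulated data rather than `titanic`: the data frame `sim`, its columns `x` and `y`, the helper names `ccr` and `one.experiment`, the seed, the 0.5 threshold, and the choice to score on all rows left out of the training sample are assumptions made for illustration.

```{r message = FALSE}
library(tidyverse)

set.seed(201)  # arbitrary seed so that the sketch is repeatable

# Simulated stand-in for the real data (hypothetical columns x and y)
sim <- data.frame(x = rnorm(500))
sim$y <- rbinom(500, size = 1, prob = plogis(sim$x))

# Correct classification rate of a fitted model on a data frame
ccr <- function(fit, df) {
  pred <- predict(fit, newdata = df, type = "response") > 0.5
  mean(pred == (df$y == 1))
}

# One experiment: sample a small training set, fit, score both sets
one.experiment <- function(n.train = 25) {
  idx <- sample(nrow(sim), n.train)
  fit <- glm(y ~ x, family = binomial, data = sim[idx, ])
  c(train = ccr(fit, sim[idx, ]), validation = ccr(fit, sim[-idx, ]))
}

# replicate() returns a 2 x 200 matrix: one column per experiment
m <- replicate(200, one.experiment())

# Transpose so that each experiment is a row, then reshape to long format
perf <- pivot_longer(as.data.frame(t(m)), cols = c(train, validation),
                     names_to = "set", values_to = "ccr")

# One histogram per set, drawn with a single geom_histogram call
ggplot(perf, aes(x = ccr)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ set)
```

Transposing before `pivot_longer` is a design choice: `replicate` puts each experiment in a column, while `ggplot` expects each observation in a row.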

### Problem 3: Calibration 

Fit a model on the Titanic dataset to predict survival using

`glm(Survived ~ Sex + Age + Pclass, family = binomial)`

Does this model satisfy false positive parity with respect to sex? Does it satisfy [calibration](https://www.youtube.com/watch?v=VE4exCVC9OE)? Does it satisfy demographic parity?
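As a starting point, here is a minimal sketch of how the three quantities could be computed, using one common reading of each criterion: demographic parity compares the rate of positive predictions across groups, false positive parity compares false positive rates across groups, and calibration compares predicted probabilities with observed outcome rates within each group. Check these readings against the definitions used in lecture. The sketch runs on simulated data, not on the Titanic dataset; the names `fair.sim`, `group`, `x`, `y`, `prob`, `pred`, and `bin`, the 0.5 threshold, and the bin width are all hypothetical.

```{r}
set.seed(201)  # arbitrary seed so that the sketch is repeatable

# Simulated stand-in: a binary group, one feature, a binary outcome
n <- 2000
fair.sim <- data.frame(group = sample(c("a", "b"), n, replace = TRUE),
                       x = rnorm(n))
fair.sim$y <- rbinom(n, size = 1,
                     prob = plogis(fair.sim$x + 0.8 * (fair.sim$group == "a")))

fair.fit <- glm(y ~ x + group, family = binomial, data = fair.sim)
fair.sim$prob <- predict(fair.fit, type = "response")
fair.sim$pred <- as.integer(fair.sim$prob > 0.5)

# Demographic parity: rate of positive predictions in each group
dp <- tapply(fair.sim$pred, fair.sim$group, mean)
dp

# False positive parity: share of positive predictions among the
# actual negatives (y == 0), computed separately for each group
neg <- fair.sim[fair.sim$y == 0, ]
fpr <- tapply(neg$pred, neg$group, mean)
fpr

# Calibration: within each group and each bin of predicted probability,
# compare the mean predicted probability with the observed rate of y == 1
fair.sim$bin <- cut(fair.sim$prob, breaks = seq(0, 1, by = 0.2),
                    include.lowest = TRUE)
calib <- aggregate(cbind(prob, y) ~ group + bin, data = fair.sim, FUN = mean)
calib
```

In the `calib` table, rows where `prob` and `y` are close suggest good calibration for that group and bin; large gaps in `dp` or `fpr` between the groups suggest that the corresponding parity criterion does not hold.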