---
title: "Precept 5 Problem Set"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Please show your work to your preceptor at the end of the precept. There is nothing to submit after the precept. For this problem set, you will make an R Markdown document. Examples of R Markdown documents are available on the course website.

Create the file `p5.Rmd`, and create a report in that file.

### Problem 1: Overfitting

Read in the Titanic dataset. You will predict survival based on sex, age, and ticket class.

For the experiments below, you will split the dataset into training, validation, and test sets. Create a plot with two curves: one for the performance on the training set, and one for the performance on the validation set. Plot the size of the training set on the x axis, and the performance on the training/validation set on the y axis.

Your plot should demonstrate that performance on the validation set generally increases and performance on the training set generally decreases (especially for very small training set sizes) as the size of the training set increases.

Make your graph look professional -- the axes should be labelled and details chosen deliberately.

You should create a vector of sizes of the training sets that you will be using (e.g., use `c(3, 6, 9, 15, 20, 25, 30, 40, 50, 70, 100)`), and then use `sapply` to compute the performances.
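
The `sapply` pattern described above might look roughly like the sketch below. It runs on simulated data, not the real Titanic dataset: the data frame `toy`, its columns (`male`, `age`, `pclass`, `survived`), the helpers `accuracy` and `perf_for_size`, the split sizes, and the choice of logistic regression with accuracy as the performance measure are all assumptions made for illustration.

```{r}
set.seed(1)  # reproducible simulation and split

# Hypothetical stand-in data (simulated; NOT the real Titanic dataset)
n <- 600
toy <- data.frame(
  male   = rbinom(n, 1, 0.6),
  age    = round(runif(n, 1, 80)),
  pclass = sample(1:3, n, replace = TRUE)
)
toy$survived <- rbinom(n, 1, plogis(1.5 - 2.5 * toy$male - 0.7 * (toy$pclass - 1)))

# Disjoint sets: shuffle the row indices once, then slice
idx        <- sample(n)
valid_set  <- toy[idx[1:150], ]
test_set   <- toy[idx[151:300], ]  # held out; not used in this sketch
train_pool <- toy[idx[301:n], ]

# Fraction of rows whose predicted class matches the observed outcome
accuracy <- function(fit, data) {
  prob <- suppressWarnings(predict(fit, newdata = data, type = "response"))
  mean((prob > 0.5) == data$survived)
}

# Fit on the first `size` rows of the training pool; return both accuracies
perf_for_size <- function(size) {
  train <- train_pool[seq_len(size), ]
  # Very small training sets trigger separation/rank warnings; silenced here
  fit <- suppressWarnings(
    glm(survived ~ male + age + pclass, family = binomial, data = train)
  )
  c(train = accuracy(fit, train), valid = accuracy(fit, valid_set))
}

sizes <- c(3, 6, 9, 15, 20, 25, 30, 40, 50, 70, 100)
perf <- sapply(sizes, perf_for_size)     # 2 x 11 matrix, one column per size
perf_df <- data.frame(size = sizes, t(perf))
perf_df
```

Because `perf_for_size` returns a named vector of length two, `sapply` returns a matrix with one row per set; transposing it gives one row per training set size, which is convenient for plotting.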


#### Challenge

Displaying a legend for each curve is likely not completely trivial, because `ggplot` expects the data to be tidy. You can use `melt` to accomplish what you need. Here is an example.

```{r}
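library(reshape2)
dat <- data.frame(a = c(4, 5, 5), B = c(10, 20, 30), C = c(100, 200, 300))
# Keep column 3 (C) as the id variable; stack a and B into
# a `variable` column (the old column name) and a `value` column
melt(dat, id.vars = 3)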
```
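
To connect this to the plot in Problem 1, here is a minimal sketch of melting a wide table of performances and mapping the resulting column to `color`, which is what makes `ggplot` draw a legend. The data frame `perf_wide`, its column names, and the numbers in it are made-up placeholders, not real results.

```{r}
library(reshape2)
library(ggplot2)

# Made-up placeholder accuracies, for illustration only (not real results)
perf_wide <- data.frame(
  size  = c(3, 6, 9, 15),
  train = c(1.00, 0.95, 0.90, 0.85),
  valid = c(0.60, 0.68, 0.72, 0.75)
)

# One row per (size, set) pair: the tidy layout ggplot expects
perf_long <- melt(perf_wide, id.vars = "size",
                  variable.name = "set", value.name = "accuracy")

# Mapping `set` to color produces one curve per set and a legend
ggplot(perf_long, aes(x = size, y = accuracy, color = set)) +
  geom_line() +
  geom_point() +
  labs(x = "Training set size", y = "Accuracy", color = "Dataset")
```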


<!--
### Problem 2: Introduction to `replicate`

Here is how you can use `replicate` to repeatedly run the same experiment.

```{r}
res <- replicate(10, sample(c(1, 2, 3, 4)))
res
```


Here, we ran `sample(c(1, 2, 3, 4))` 10 times. Each column represents the result of one experiment. You will usually just obtain a single number from one experiment. Here is an example:

```{r}
replicate(10, mean(sample(c(1, 2, 3, 4), size = 2)))

```


Repeatedly sample a training set of size 15 from `titanic`, and create two histograms: one for the performances on the training set, and one for the performances on the test set.


-->