---
title: "Precept 5 Problem Set"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Please show your work to your preceptor at the end of the precept. After the precept, there is nothing to submit.

For this problem set, you will make an R markdown document. Examples of R markdown documents are available on the course website. Create the file `p5.Rmd`, and create a report in that file.

### Problem 1: Overfitting

Read in the Titanic dataset. You will predict survival based on sex, age, and ticket class.

For the experiments below, you will be splitting the dataset into training, test, and validation sets.

Create a plot with two curves: one for the performance on the training set, and one for the performance on the validation set. Plot the size of the training set on the x axis, and the performance on the training/validation set on the y axis. Your plot should demonstrate that, as the size of the training set increases, performance on the validation set generally increases and performance on the training set generally decreases (especially for very small training set sizes).

Make your graph look professional: the axes should be labelled and the details chosen deliberately.

You should create a vector of the training set sizes that you will be using (e.g., use `c(3, 6, 9, 15, 20, 25, 30, 40, 50, 70, 100)`), and then use `sapply` to compute the performances.

#### Challenge

It will likely not be completely trivial to display a legend entry for each curve. That's because `ggplot` expects the data to be tidy. You can use `melt` to accomplish what you need. Here is an example.

```{r}
library(reshape2)

dat <- data.frame(a = c(4, 5, 5),
                  B = c(10, 20, 30),
                  C = c(100, 200, 300))

# The second argument is id.vars: column 3 (C) is kept as the identifier,
# and the remaining columns (a and B) are stacked into variable/value pairs.
melt(dat, 3)
```
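
As a rough illustration of how `sapply` and `melt` can fit together, here is a minimal sketch on simulated data. It is not the Titanic data and not a solution to the problem: the data frame `sim`, the helper functions `accuracy` and `perf_for_size`, the logistic regression model, and the particular sizes are all assumptions made for this example. Very small training sets may make `glm` emit warnings about fitted probabilities of 0 or 1.

```{r}
library(reshape2)
library(ggplot2)

set.seed(0)

# Hypothetical stand-in data (NOT the Titanic data): a binary outcome y
# that depends on a single numeric predictor x.
n <- 600
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.5 * sim$x))

# Hold out the last 100 rows as a validation set.
valid <- sim[501:600, ]
sizes <- c(10, 20, 50, 100, 200, 500)

# Fraction of rows in d whose outcome is predicted correctly at a 0.5 cutoff.
accuracy <- function(fit, d) {
  mean((predict(fit, newdata = d, type = "response") > 0.5) == d$y)
}

# Fit on the first k rows; report accuracy on those rows and on valid.
perf_for_size <- function(k) {
  train <- sim[1:k, ]
  fit <- glm(y ~ x, family = binomial, data = train)
  c(train = accuracy(fit, train), valid = accuracy(fit, valid))
}

# sapply returns a matrix with one column per size and rows "train", "valid".
perf <- sapply(sizes, perf_for_size)

# Wide format: one row per size, one column per curve.
wide <- data.frame(size = sizes, t(perf))

# Long (tidy) format: one row per (size, set) pair, which is what ggplot
# needs in order to draw a legend entry for each curve.
long <- melt(wide, id.vars = "size",
             variable.name = "set", value.name = "accuracy")

ggplot(long, aes(x = size, y = accuracy, colour = set)) +
  geom_line() +
  geom_point() +
  labs(x = "Training set size", y = "Accuracy", colour = "Evaluated on")
```

Mapping `colour = set` inside `aes` is what produces the legend: `ggplot` draws one line per level of `set` and labels each automatically.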