SML201 Precept 6, Spring 2020

Create the file p6.Rmd, and create a report in that file.

Problem 1: Overfitting

Read in the Titanic dataset, and fit a model predicting survival based on sex, age, and ticket class.
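As a minimal sketch of the fitting step, the code below fits a logistic regression with glm and computes a correct classification rate (CCR). The data frame toy and its column names (Survived, Sex, Age, Pclass) are made up for illustration, so the real Titanic file and its column names may differ. Using logistic regression with a 0.5 threshold is also an assumption about which model is intended.

```r
# Synthetic stand-in for the Titanic data; the column names are assumptions.
set.seed(201)
n <- 200
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70)),
  Pclass = factor(sample(1:3, n, replace = TRUE))
)
# Made-up survival rule, only so that the model has something to learn.
p <- plogis(0.5 + 1.5 * (toy$Sex == "female") - 0.02 * toy$Age -
              0.5 * as.integer(toy$Pclass))
toy$Survived <- rbinom(n, 1, p)

# Logistic regression of survival on sex, age, and ticket class.
fit <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = toy)

# Correct classification rate: the fraction of rows where the thresholded
# predicted probability agrees with the observed outcome.
ccr <- function(fit, dat) {
  pred <- predict(fit, newdata = dat, type = "response") > 0.5
  mean(pred == (dat$Survived == 1))
}
ccr(fit, toy)
```

The same ccr helper can be applied to any subset of the rows, which is what the training/validation experiments need.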

For the experiments below, you will be splitting the dataset into training, test, and validation sets. Create a plot with two curves: one for the performance on the training set, and one for the performance on the validation set. Plot the size of the training set on the x axis, and the performance on the training/validation set on the y axis.

Your plot should demonstrate that performance on the validation set generally increases and performance on the training set generally decreases (especially for very small training set sizes) as the size of the training set increases.

Make your graph look professional – the axes should be labelled and details chosen deliberately.

You should create a vector of sizes of the training sets that you will be using (e.g., use c(6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 70, 100)), and then use sapply to compute the performances.

Here is a suggested plan (you may follow your own plan, but use this one if you are stuck):

Reread the lecture notes on variable selection
Make a validation set (you can reuse it throughout). Make a function that computes a training set of size size.
Now, make a function that takes the size of the training set, the entire dataset, idx, and the validation set, and computes the performance on the training and validation sets.
Now, use sapply to compute the performance on the training sets and the validation set for each training set size.
The output of sapply will be a matrix, like this:

mat

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

You can extract the rows of a matrix and put them in a data frame:

data.frame(A = mat[1, ], B = mat[2, ])

##   A B
## 1 1 2
## 2 3 4
## 3 5 6

You will probably want another column in your data frame: the sizes of the datasets.

You can then use ggplot to plot the columns of the data frame you made.
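The plan above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data are synthetic (hypothetical columns x and y rather than the Titanic variables), idx is taken to mean the row indices available for training, and perf.for.size and ccr are hypothetical helper names.

```r
set.seed(201)
# Synthetic stand-in data (hypothetical columns x and y), not the Titanic data.
n <- 400
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x))

# Hold out a fixed validation set; idx holds the rows left over for training.
valid.rows <- sample(n, 100)
valid <- dat[valid.rows, ]
idx <- setdiff(seq_len(n), valid.rows)

# Correct classification rate of a fitted model on a data frame d.
ccr <- function(fit, d) {
  pred <- predict(fit, newdata = d, type = "response") > 0.5
  mean(pred == (d$y == 1))
}

# Returns c(training CCR, validation CCR) for one training set of the given size.
perf.for.size <- function(size, dat, idx, valid) {
  train <- dat[sample(idx, size), ]
  # Tiny training sets are often perfectly separable, which makes glm warn.
  fit <- suppressWarnings(glm(y ~ x, family = binomial, data = train))
  c(ccr(fit, train), ccr(fit, valid))
}

sizes <- c(6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 70, 100)
# Each call returns a vector of length 2, so sapply returns a 2-row matrix.
mat <- sapply(sizes, perf.for.size, dat = dat, idx = idx, valid = valid)

# One row per training set size, ready to be plotted.
perf <- data.frame(size = sizes, train = mat[1, ], valid = mat[2, ])
perf
```

Because each size is evaluated on a single random training set, the resulting curves will be noisy; averaging over several draws per size is one way to smooth them.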

Problem 2: tidy data

Displaying a legend for the two curves is not entirely trivial. To map color to “train”/“validation”, ggplot needs the data to have only one y value per row, with a column that indicates whether that y belongs to the training or the validation curve. Here is a way to transform the data:

library(tidyverse)
dat <- data.frame(a = c(4, 5, 5), B = c(10, 20, 30), C = c(100, 200, 300))
dat.longer <- pivot_longer(dat, cols = c(B, C), names_to = "B.or.C", values_to = "value")
dat

##   a  B   C
## 1 4 10 100
## 2 5 20 200
## 3 5 30 300

dat.longer

## # A tibble: 6 x 3
##       a B.or.C value
##   <dbl> <chr>  <dbl>
## 1     4 B         10
## 2     4 C        100
## 3     5 B         20
## 4     5 C        200
## 5     5 B         30
## 6     5 C        300

Use the technique above to display your graphs from Problem 1 using just one call to geom_line rather than two.

The information above is all you need to know for now, but there is a longer explanation in the Week 7 lecture (including a video).
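Applied to a table of performances, the same reshaping might look like the sketch below. The data frame perf, its column names, and the numbers in it are placeholders for illustration only, not real results.

```r
library(tidyverse)

# Placeholder numbers for illustration only, not real results.
perf <- data.frame(
  size  = c(10, 20, 50),
  train = c(0.90, 0.85, 0.80),
  valid = c(0.70, 0.75, 0.78)
)

# One row per (size, set) pair; the "set" column says which curve a value
# belongs to.
perf.longer <- pivot_longer(perf, cols = c(train, valid),
                            names_to = "set", values_to = "ccr")

# A single geom_line: mapping color to `set` draws both curves and adds a legend.
p <- ggplot(perf.longer, aes(x = size, y = ccr, color = set)) +
  geom_line() +
  labs(x = "Training set size", y = "CCR", color = "Data set")
p
```

Mapping color inside aes is what produces the legend; setting a fixed color outside aes would not.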

Problem 3: Introduction to `replicate`

Here is how you can use replicate to repeatedly run the same experiment.

res <- replicate(10, sample(c(1, 2, 3, 4)))
res

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    1    3    4    3    2    1    4    3     1
## [2,]    4    3    1    1    1    4    2    2    4     2
## [3,]    2    2    4    2    2    3    4    3    1     3
## [4,]    3    4    2    3    4    1    3    1    2     4

Here, we ran sample(c(1, 2, 3, 4)) 10 times. Each column represents the result of one experiment. You will usually just obtain a single number from one experiment. Here is an example:

replicate(10, mean(sample(c(1, 2, 3, 4), size = 2)))

##  [1] 1.5 3.0 2.5 3.0 3.5 2.0 2.5 3.0 2.5 1.5

Repeatedly sample a training set of size 15 from titanic, and create two histograms: one for the performances on the training set, and one for the performances on the test set.

Repeatedly sample a training set of size 25 from titanic, and create two histograms: one for the performances (i.e., CCRs) on the training set, and one for the performances (i.e., CCRs) on the test set. You should use ggplot’s geom_histogram geom.

Hints and suggestions

Here is a suggestion for how to proceed:

First, write a function that samples a small training set, fits a model on it, and returns the performance on the small training set as well as the validation set. You will return a vector of length 2.
Second, use replicate to repeatedly call the function. Because the function you wrote returns a vector of length 2 every time, rather than just one value, you will get a matrix with 2 rows and one column per experiment. You can index the matrix by rows and columns much like a data frame. For example, you can extract the first column using m[, 1] and the first row using m[1, ].
Observe that you get the same kind of thing as what you got in Problem 1.
Make histograms (rather than curves, as in Problem 1).
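The steps above can be sketched as follows. The data are synthetic (hypothetical columns x and y rather than the Titanic variables), and one.experiment and ccr are hypothetical helper names, so treat this as an outline rather than a reference solution.

```r
set.seed(201)
# Synthetic stand-in data (hypothetical columns x and y), not the Titanic data.
n <- 400
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x))

# Fixed validation set; the remaining rows are available for training.
valid.rows <- sample(n, 100)
valid <- dat[valid.rows, ]
rest <- dat[-valid.rows, ]

# Correct classification rate of a fitted model on a data frame d.
ccr <- function(fit, d) {
  pred <- predict(fit, newdata = d, type = "response") > 0.5
  mean(pred == (d$y == 1))
}

# One experiment: sample a small training set, fit, return both CCRs.
one.experiment <- function(size) {
  train <- rest[sample(nrow(rest), size), ]
  # Small training sets can be perfectly separable, which makes glm warn.
  fit <- suppressWarnings(glm(y ~ x, family = binomial, data = train))
  c(train = ccr(fit, train), valid = ccr(fit, valid))
}

# Each call returns a named vector of length 2, so replicate returns a
# matrix with rows "train" and "valid" and one column per experiment.
m <- replicate(200, one.experiment(25))
dim(m)

# One row per experiment, one column per kind of performance.
results <- data.frame(train = m["train", ], valid = m["valid", ])
head(results)
```

Each column of results can then be passed to ggplot with geom_histogram to draw one histogram per set.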
