---
title: "Precept 6 Problem Set"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(reshape2)
```

### Problem 1: `replicate`

Read in the dataset, shuffle the rows, and set aside a pool of potential training rows and a validation set:

```{r}
titanic <- read.csv("titanic.csv")
idx <- sample(1:nrow(titanic))          # random permutation of the row indices
train.potential.idx <- idx[1:500]       # pool from which training sets are drawn
valid.idx <- idx[501:700]               # fixed validation set
titanic.valid <- titanic[valid.idx, ]
```

We made the vector of potential training indices large so that we can draw a different training set every time. We'll use the same validation set every time.

Now, let's write a function that returns the performance (proportion of correct classifications) on the training and validation sets.

```{r}
TrainExp15 <- function(titanic, train.potential.idx, valid.idx){
  # Draw a fresh training set of 15 rows from the pool
  train.idx <- sample(train.potential.idx, size = 15)
  titanic.train <- titanic[train.idx, ]
  # Build the validation set from the argument rather than relying on a global
  titanic.valid <- titanic[valid.idx, ]

  fit <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = titanic.train)

  # type = "response" returns probabilities (the default is the log-odds scale),
  # so 0.5 is the intended threshold
  pred.train <- predict(fit, newdata = titanic.train, type = "response") > 0.5
  pred.valid <- predict(fit, newdata = titanic.valid, type = "response") > 0.5

  perf.train <- mean(pred.train == titanic.train$Survived)
  perf.valid <- mean(pred.valid == titanic.valid$Survived)
  return(c(perf.train, perf.valid))
}
```

Let's try running this function once:

```{r}
TrainExp15(titanic, train.potential.idx, valid.idx)
```

Now, let's run it 1000 times:

```{r, warning = FALSE}
perfs <- replicate(1000, TrainExp15(titanic, train.potential.idx, valid.idx))
perfs.train <- perfs[1, ]   # row 1: training performance for each run
perfs.valid <- perfs[2, ]   # row 2: validation performance for each run
```

Next, let's get the data into tidy format:

```{r}
perfs.df.wide <- data.frame(perfs.train = perfs.train, perfs.valid = perfs.valid)
perfs.df <- melt(perfs.df.wide) %>% select(set = variable, perf = value)
```

Finally, let's display the histograms.
(N.b., we are using facets here, but that is not strictly necessary: two separate histograms would be fine.)

```{r}
ggplot(perfs.df) +
  geom_histogram(mapping = aes(x = perf), bins = 20) +
  facet_wrap(~ set)
```

Without facets, we could instead plot one set at a time. For example, for the performance on the validation set:

```{r}
perfs.valid.df <- data.frame(perf = perfs.valid)
ggplot(perfs.valid.df) +
  geom_histogram(mapping = aes(x = perf), bins = 20)
```
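
As a minimal sketch of why `perfs[1,]` and `perfs[2,]` pick out the training and validation results in Problem 1: when the replicated expression returns a vector of length 2, `replicate` simplifies its results into a matrix with one row per returned value and one column per run. `SimExp` below is a hypothetical stand-in for `TrainExp15` that uses simulated numbers rather than the Titanic data, so the chunk runs on its own.

```{r}
# Hypothetical stand-in for TrainExp15: returns two numbers per call
# (simulated values, not real model performance).
SimExp <- function() {
  x <- rnorm(15)
  c(mean(x), sd(x))
}

res <- replicate(1000, SimExp())

dim(res)            # 2 rows (one per returned value), 1000 columns (one per run)

first <- res[1, ]   # first returned value from every run
second <- res[2, ]  # second returned value from every run
length(first)
```

Indexing by row therefore recovers each quantity across all runs, which is the same pattern used to build `perfs.train` and `perfs.valid`.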