Problem 1: replicate

Read in the dataset and set aside a pool of potential training rows and a validation set:

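titanic <- read.csv("titanic.csv")
# Shuffle the row indices, then carve out disjoint blocks
idx <- sample(1:nrow(titanic))
train.potential.idx <- idx[1:500]   # pool to draw training sets from
valid.idx <- idx[501:700]           # fixed validation set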

titanic.valid <- titanic[valid.idx,]

We made the vector of potential training indices large so that we could sample a different training set every time. We’ll use the same validation set every time.
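
To see what this buys us, here is a small sketch that uses made-up index vectors instead of the Titanic rows. The total of 700 rows and the seed are assumptions chosen only so the sketch runs on its own; the pool and validation ranges mirror the code above.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Stand-ins for the index vectors built above (no data file needed)
idx <- sample(1:700)
train.potential.idx <- idx[1:500]
valid.idx <- idx[501:700]

# Two independent draws of 15 training indices from the same pool
train.a <- sample(train.potential.idx, size = 15)
train.b <- sample(train.potential.idx, size = 15)

# Do the two draws pick exactly the same rows? (almost never)
identical(train.a, train.b)

# Neither draw can overlap the fixed validation set
length(intersect(train.a, valid.idx))
length(intersect(train.b, valid.idx))
```

Because the pool and the validation indices come from disjoint parts of one shuffled vector, every training set we draw is guaranteed not to leak validation rows.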

Now, let’s write a function to get the performance on the training and validation sets.

TrainExp15 <- function(titanic, train.potential.idx, valid.idx){
  # Draw a fresh training set of 15 rows from the pool
  train.idx <- sample(train.potential.idx, size = 15)
  titanic.train <- titanic[train.idx, ]
  titanic.valid <- titanic[valid.idx, ]
  fit <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = titanic.train)
  # type = "response" gives probabilities; the default is log-odds
  pred.train <- predict(fit, newdata = titanic.train, type = "response") > 0.5
  pred.valid <- predict(fit, newdata = titanic.valid, type = "response") > 0.5
  # Proportion of correct classifications on each set
  perf.train <- mean(pred.train == titanic.train$Survived)
  perf.valid <- mean(pred.valid == titanic.valid$Survived)
  return(c(perf.train, perf.valid))
}
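
One detail worth checking when thresholding at 0.5: for a binomial `glm`, `predict()` returns the linear predictor (log-odds) unless `type = "response"` is given. The sketch below illustrates the two scales on synthetic data (not the Titanic set); the variable names and the seed are made up for the illustration.

```r
set.seed(1)  # assumed seed for reproducibility

# Synthetic data: one predictor, binary outcome
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(2 * x))
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = binomial, data = toy)

# Default: linear predictor (log-odds), can be any real number
log.odds <- predict(fit, newdata = toy)
# type = "response": fitted probabilities between 0 and 1
prob <- predict(fit, newdata = toy, type = "response")

# The two scales are linked by the logistic function
all.equal(unname(plogis(log.odds)), unname(prob))

# Thresholding log-odds at 0 matches thresholding probabilities at 0.5
all((log.odds > 0) == (prob > 0.5))
```

So a cutoff of 0.5 means "more likely than not" only on the probability scale; on the log-odds scale the equivalent cutoff is 0.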

Let’s run this function once (the numbers vary from run to run, since the training set is sampled at random):

TrainExp15(titanic, train.potential.idx, valid.idx)
## [1] 0.8666667 0.4900000

Now, let’s run it 1000 times:

perfs <- replicate(1000, TrainExp15(titanic, train.potential.idx, valid.idx))
perfs.train <- perfs[1,]
perfs.valid <- perfs[2,]
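
The indexing by row works because `replicate` binds the vectors returned by each call into the columns of a matrix. A minimal sketch, with a hypothetical `ToyExp` function standing in for `TrainExp15`:

```r
# Hypothetical stand-in for TrainExp15: returns two numbers per call
ToyExp <- function() {
  c(train = runif(1), valid = runif(1))
}

res <- replicate(5, ToyExp())
dim(res)   # 2 rows (one per returned value), 5 columns (one per run)
res[1, ]   # the five "train" values
res[2, ]   # the five "valid" values
```

With 1000 replications of a function that returns two numbers, `perfs` is therefore a 2 x 1000 matrix, and each row holds one performance measure across all runs.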

Next, let’s get the data into tidy format:

library(reshape2)  # melt()
library(dplyr)     # %>% and select()
perfs.df.wide <- data.frame(perfs.train = perfs.train, perfs.valid = perfs.valid)
perfs.df <- melt(perfs.df.wide) %>% select(set = variable, perf = value)
## No id variables; using all as measure variables
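
To make the wide-to-long reshaping concrete, here is a sketch of the same idea on a tiny made-up table. It uses base R’s `stack()` rather than `melt`, purely so that it runs without extra packages; the numbers are invented.

```r
# Tiny made-up performance table in wide format
wide <- data.frame(perfs.train = c(0.9, 0.8, 1.0),
                   perfs.valid = c(0.6, 0.7, 0.5))

# stack() puts all values in one column and the source column name in another
long <- stack(wide)
names(long) <- c("perf", "set")   # stack() calls them "values" and "ind"
long <- long[, c("set", "perf")]

nrow(long)         # 6: three rows per original column
levels(long$set)   # "perfs.train" "perfs.valid"
```

In the long format each row is a single measurement labelled with the set it came from, which is the shape `facet_wrap` needs.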

Finally, let’s display the histograms. (N.B. we are using facets here, but that is not strictly necessary: two separate histograms would be fine.)

library(ggplot2)
ggplot(perfs.df) +
  geom_histogram(mapping = aes(x = perf), bins = 20) +
  facet_wrap(~ set)

Without facets, we could plot each set separately. For example, for the performance on the validation set:

perfs.valid.df <- data.frame(perf = perfs.valid)
ggplot(perfs.valid.df) +
  geom_histogram(mapping = aes(x = perf), bins = 20)