Cross-validation

We sometimes want to try different models, and pick the best one. Generally, we would evaluate the models by measuring the performance on the validation set, pick the best one, and then evaluate the final model on the test set.

If we had used the validation set as the test set, we would be overestimating how well the model would perform on new data – basically, we’d be overfitting on the test set.

Trying out different models: variable selection

Let’s try to predict survival on the Titanic again. We’ll try different things – Pclass, as.factor(Pclass), log(Age)

We’ll use the validation set to select the model that works best on the validation set. To get an estimate of the performance on “new” data (i.e., data from the Titanic that we haven’t looked at yet), we’ll use the test set.

set.seed(0)
idx <- sample(1:length(titanic$Survived))
titanic.train <- titanic[idx[1:50],]
titanic.valid <- titanic[idx[300:600],]
titanic.test <- titanic[idx[601:800],]
GetCorrectClassificationRate <- function(fit, dat){
  pred <- predict(fit, newdata=dat, type = "response") > .5
  mean(pred == dat$Survived)
}

fit0 <- glm(Survived ~ 1, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit0, titanic.valid)
## [1] 0.5913621
fit1 <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit1, titanic.valid)
## [1] 0.7840532
fit2 <- glm(Survived ~ Sex + Age + as.factor(Pclass), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit2, titanic.valid)
## [1] 0.7807309
fit3 <- glm(Survived ~ Sex + log(Age), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit3, titanic.valid)
## [1] 0.7973422
fit3 <- glm(Survived ~ Sex + Age + Fare, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit3, titanic.valid)
## [1] 0.7641196
fit4 <- glm(Survived ~ Sex + Age + Fare + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit4, titanic.valid)
## [1] 0.7940199
fit5 <- glm(Survived ~ Sex + Age + Fare + Pclass + Parents.Children.Aboard, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit5, titanic.valid)
## [1] 0.7940199
GetCorrectClassificationRate(fit4, titanic.test)
## [1] 0.755

replicate

replicate can be used to repeat the same call multiple times. Unless there is randomness involved, that’s not very useful: you’ll get the same output every time. But with randomness involved, we can obtain important insights.

First, let’s see the syntax:

square <- function(x){
  x**2
}

replicate(10, square(5))
##  [1] 25 25 25 25 25 25 25 25 25 25

Now, let’s try this with randomness:

CoinFlip <- function(){
  sample(c("Heads", "Tails"), size = 1)
}
set.seed(0)
replicate(10, CoinFlip())
##  [1] "Tails" "Heads" "Tails" "Heads" "Heads" "Tails" "Heads" "Heads" "Heads"
## [10] "Tails"

We can now use replicate to see how the classification performance on the titanic dataset depends on the selection of the training set.

GetCorrectClassificationRate <- function(fit, dat){
  pred <- predict(fit, newdata=dat, type = "response") > .5
  mean(pred == dat$Survived)
}


GetValidPerformance <- function(valid.size, titanic){
  idx <- sample(1:length(titanic$Survived))
  titanic.train <- titanic[idx[1:500],]
  titanic.valid <- titanic[idx[501:(501 + valid.size)],]
  fit <- glm(Survived ~ Pclass,family = binomial, data = titanic.train)
  GetCorrectClassificationRate(fit, titanic.valid)
}

GetValidPerformance.fixtrained <- function(train.idx, rest.idx, valid.size, titanic){
  
  titanic.train <- titanic[train.idx,]
  valid.idx <- sample(rest.idx)[1:valid.size]
  titanic.valid <- titanic[valid.idx,]
  fit <- glm(Survived ~ Pclass,family = binomial, data = titanic.train)
  GetCorrectClassificationRate(fit, titanic.valid)
}


set.seed(0)
idx <- sample(1:length(titanic$Survived))
train.idx <- idx[1:500]
rest.idx <- idx[501:length(idx)]

perf.data <- data.frame(valid.perf = replicate(1000, GetValidPerformance.fixtrained(train.idx, rest.idx, 50, titanic)))

ggplot(data = perf.data, mapping = aes(x=valid.perf)) + 
  geom_bar()