Cross-validation

We often want to try several models and pick the best one. Generally, we evaluate each model by measuring its performance on the validation set, pick the one that does best there, and then evaluate that final model on the test set.

If we then reported the validation-set performance as if it were test-set performance, we would overestimate how well the chosen model performs on new data: because we used the validation set to pick the model, we have effectively overfit to it.
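
To see why, here is a small simulation sketch (not from the lecture; the numbers are made up for illustration). Suppose ten candidate models are all equally good, with a true accuracy of 0.7. We score each on a validation set of 200 points, select the one with the best validation score, and compare that score to the same model’s score on an independent test set of 200 points. Because the winner is the maximum of several noisy estimates, its validation score is biased upwards:

set.seed(1)
n.rep <- 2000
gap <- numeric(n.rep)
for (i in 1:n.rep){
  valid.acc <- rbinom(10, size = 200, prob = 0.7) / 200  # validation accuracy of 10 equally good models
  test.acc <- rbinom(10, size = 200, prob = 0.7) / 200   # independent test accuracy of the same models
  best <- which.max(valid.acc)                           # pick the winner on the validation set
  gap[i] <- valid.acc[best] - test.acc[best]
}
mean(gap)   # positive on average: the winner's validation score is optimistic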

Trying out different models: variable selection

Let’s try to predict survival on the Titanic again. We’ll try different sets of predictors and transformations: Pclass as a numeric variable, as.factor(Pclass), log(Age), and so on.

We’ll use the validation set to select the best model. To estimate how that model performs on “new” data (i.e., passengers from the Titanic that we haven’t looked at yet), we’ll then use the test set.

set.seed(0)
idx <- sample(1:length(titanic$Survived))   # random permutation of the row indices
titanic.train <- titanic[idx[1:50],]        # training set
titanic.valid <- titanic[idx[300:600],]     # validation set
titanic.test <- titanic[idx[601:800],]      # test set
GetCorrectClassificationRate <- function(fit, dat){
  # Proportion of rows of dat whose predicted class (predicted probability
  # of survival above 0.5) matches the observed value of Survived.
  pred <- predict(fit, newdata=dat, type = "response") > .5
  return(mean(pred == dat$Survived))
}

fit0 <- glm(Survived ~ 1, family = binomial, data = titanic)
GetCorrectClassificationRate(fit0, titanic.valid)
## [1] 0.5980066
fit1 <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit1, titanic.valid)
## [1] 0.7840532
fit2 <- glm(Survived ~ Sex + Age + as.factor(Pclass), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit2, titanic.valid)
## [1] 0.7807309
fit3 <- glm(Survived ~ Sex + log(Age), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit3, titanic.valid)
## [1] 0.7707641
fit4 <- glm(Survived ~ Sex + Age + Fare, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit4, titanic.valid)
## [1] 0.7674419
fit5 <- glm(Survived ~ Sex + Age + Fare + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit5, titanic.valid)
## [1] 0.7873754
fit6 <- glm(Survived ~ Sex + Age + Fare + Pclass + Parents.Children.Aboard, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit6, titanic.valid)
## [1] 0.744186

fit5 (Sex + Age + Fare + Pclass) achieves the best correct classification rate on the validation set, so it is the model we evaluate, once, on the test set:

GetCorrectClassificationRate(fit5, titanic.test)
## [1] 0.78

replicate

replicate can be used to evaluate the same expression multiple times. Without randomness, that’s not very useful: you’ll get the same output every time. But when the expression involves randomness, each call can produce a different result, and repeating it many times lets us study how those results are distributed.

First, let’s see the syntax:

square <- function(x){
  return(x**2)
}

replicate(10, square(5))
##  [1] 25 25 25 25 25 25 25 25 25 25

Now, let’s try this with randomness:

CoinFlip <- function(){
  sample(c("Heads", "Tails"), size = 1)
}
set.seed(0)
replicate(10, CoinFlip())
##  [1] "Tails" "Heads" "Heads" "Tails" "Tails" "Heads" "Tails" "Tails"
##  [9] "Tails" "Tails"

We can now use replicate to see how the classification performance on the Titanic data varies with the random choice of training and validation sets.

GetCorrectClassificationRate <- function(fit, dat){
  pred <- predict(fit, newdata=dat, type = "response") > .5
  return(mean(pred == dat$Survived))
}


GetValidPerformance <- function(valid.size, titanic){
  # Randomly split the data into a training set of 500 rows and a validation
  # set of valid.size rows, fit the model on the training set, and return its
  # correct classification rate on the validation set.
  idx <- sample(1:length(titanic$Survived))
  titanic.train <- titanic[idx[1:500],]
  titanic.valid <- titanic[idx[501:(500 + valid.size)],]
  fit <- glm(Survived ~ Pclass, family = binomial, data = titanic.train)
  return(GetCorrectClassificationRate(fit, titanic.valid))
}

titanic <- read.csv("titanic.csv")
perf.data <- data.frame(valid.perf = replicate(1000, GetValidPerformance(250, titanic)))

ggplot(data = perf.data, mapping = aes(x=valid.perf)) + 
  geom_bar()
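
A possible follow-up (not in the original notes) is to summarise how much the validation performance varies across the 1000 random splits:

mean(perf.data$valid.perf)                       # average validation performance
sd(perf.data$valid.perf)                         # spread due to the random split
quantile(perf.data$valid.perf, c(0.025, 0.975))  # interval covering 95% of the splits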