Week 6 Lecture

Cross-validation

We sometimes want to try different models, and pick the best one. Generally, we would evaluate the models by measuring the performance on the validation set, pick the best one, and then evaluate the final model on the test set.

If we had used the validation set as the test set, we would be overestimating how well the model would perform on new data – basically, we’d be overfitting on the test set.

Trying out different models: variable selection

Let’s try to predict survival on the Titanic again. We’ll try different things – Pclass, as.factor(Pclass), log(Age)…

We’ll use the validation set to select the model that works best on the validation set. To get an estimate of the performance on “new” data (i.e., data from the Titanic that we haven’t looked at yet), we’ll use the test set.

set.seed(0)
idx <- sample(1:length(titanic$Survived))
titanic.train <- titanic[idx[1:50],]
titanic.valid <- titanic[idx[300:600],]
titanic.test <- titanic[idx[601:800],]

GetCorrectClassificationRate <- function(fit, dat){
  pred <- predict(fit, newdata=dat, type = "response") > .5
  mean(pred == dat$Survived)
}

fit0 <- glm(Survived ~ 1, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit0, titanic.valid)

## [1] 0.5913621

fit1 <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit1, titanic.valid)

## [1] 0.7840532

fit2 <- glm(Survived ~ Sex + Age + as.factor(Pclass), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit2, titanic.valid)

## [1] 0.7807309

fit3 <- glm(Survived ~ Sex + log(Age), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit3, titanic.valid)

## [1] 0.7973422

fit3 <- glm(Survived ~ Sex + Age + Fare, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit3, titanic.valid)

## [1] 0.7641196

fit4 <- glm(Survived ~ Sex + Age + Fare + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit4, titanic.valid)

## [1] 0.7940199

fit5 <- glm(Survived ~ Sex + Age + Fare + Pclass + Parents.Children.Aboard, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit5, titanic.valid)

## [1] 0.7940199

GetCorrectClassificationRate(fit4, titanic.test)

## [1] 0.755

`replicate`

replicate can be used to repeat the same call multiple times. Unless there is randomness involved, that’s not very useful: you’ll get the same output every time. But with randomness involved, we can obtain important insights.

First, let’s see the syntax:

square <- function(x){
  x**2
}

replicate(10, square(5))

##  [1] 25 25 25 25 25 25 25 25 25 25

Now, let’s try this with randomness:

CoinFlip <- function(){
  sample(c("Heads", "Tails"), size = 1)
}
set.seed(0)
replicate(10, CoinFlip())

##  [1] "Tails" "Heads" "Tails" "Heads" "Heads" "Tails" "Heads" "Heads" "Heads"
## [10] "Tails"

We can now use replicate to see how the classification performance on the titanic dataset depends on the selection of the training set.

GetCorrectClassificationRate <- function(fit, dat){
  pred <- predict(fit, newdata=dat, type = "response") > .5
  mean(pred == dat$Survived)
}


GetValidPerformance <- function(valid.size, titanic){
  idx <- sample(1:length(titanic$Survived))
  titanic.train <- titanic[idx[1:500],]
  titanic.valid <- titanic[idx[501:(501 + valid.size)],]
  fit <- glm(Survived ~ Pclass,family = binomial, data = titanic.train)
  GetCorrectClassificationRate(fit, titanic.valid)
}

GetValidPerformance.fixtrained <- function(train.idx, rest.idx, valid.size, titanic){
  
  titanic.train <- titanic[train.idx,]
  valid.idx <- sample(rest.idx)[1:valid.size]
  titanic.valid <- titanic[valid.idx,]
  fit <- glm(Survived ~ Pclass,family = binomial, data = titanic.train)
  GetCorrectClassificationRate(fit, titanic.valid)
}


set.seed(0)
idx <- sample(1:length(titanic$Survived))
train.idx <- idx[1:500]
rest.idx <- idx[501:length(idx)]

perf.data <- data.frame(valid.perf = replicate(1000, GetValidPerformance.fixtrained(train.idx, rest.idx, 50, titanic)))

ggplot(data = perf.data, mapping = aes(x=valid.perf)) + 
  geom_bar()