Cross-validation

We often want to try several models and pick the best one. Generally, we evaluate each model by measuring its performance on the validation set, pick the one that does best there, and then evaluate that final model on the test set.

If we then reported the validation-set performance as if it were test-set performance, we would overestimate how well the chosen model performs on new data: because we used the validation set to pick the model, we have effectively overfit to it.
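
To see why, here is a small simulation sketch (not from the lecture; the numbers are made up for illustration). Suppose ten candidate models are all equally good, with a true accuracy of 0.7. We score each on a validation set of 200 points, select the one with the best validation score, and compare that score to the same model’s score on an independent test set of 200 points. Because the winner is the maximum of several noisy estimates, its validation score is biased upwards:

set.seed(1)
n.rep <- 2000
gap <- numeric(n.rep)
for (i in 1:n.rep){
  valid.acc <- rbinom(10, size = 200, prob = 0.7) / 200  # validation accuracy of 10 equally good models
  test.acc <- rbinom(10, size = 200, prob = 0.7) / 200   # independent test accuracy of the same models
  best <- which.max(valid.acc)                           # pick the winner on the validation set
  gap[i] <- valid.acc[best] - test.acc[best]
}
mean(gap)   # positive on average: the winner's validation score is optimistic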

Trying out different models: variable selection

Let’s try to predict survival on the Titanic again. We’ll try different sets of predictors and transformations: Pclass as a numeric variable, as.factor(Pclass), log(Age), and so on.

We’ll use the validation set to select the best model. To estimate how that model performs on “new” data (i.e., passengers from the Titanic that we haven’t looked at yet), we’ll then use the test set.

set.seed(0)
idx <- sample(1:length(titanic$Survived))   # random permutation of the row indices
titanic.train <- titanic[idx[1:50],]        # training set
titanic.valid <- titanic[idx[300:600],]     # validation set
titanic.test <- titanic[idx[601:800],]      # test set
GetCorrectClassificationRate <- function(fit, dat){
  # Proportion of rows of dat whose predicted class (predicted probability
  # of survival above 0.5) matches the observed value of Survived.
  pred <- predict(fit, newdata=dat, type = "response") > .5
  return(mean(pred == dat$Survived))
}

fit0 <- glm(Survived ~ 1, family = binomial, data = titanic)
GetCorrectClassificationRate(fit0, titanic.valid)
## [1] 0.5980066
fit1 <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit1, titanic.valid)
## [1] 0.7840532
fit2 <- glm(Survived ~ Sex + Age + as.factor(Pclass), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit2, titanic.valid)
## [1] 0.7807309
fit3 <- glm(Survived ~ Sex + log(Age), family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit3, titanic.valid)
## [1] 0.7707641
fit4 <- glm(Survived ~ Sex + Age + Fare, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit4, titanic.valid)
## [1] 0.7674419
fit5 <- glm(Survived ~ Sex + Age + Fare + Pclass, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit5, titanic.valid)
## [1] 0.7873754
fit6 <- glm(Survived ~ Sex + Age + Fare + Pclass + Parents.Children.Aboard, family = binomial, data = titanic.train)
GetCorrectClassificationRate(fit6, titanic.valid)
## [1] 0.744186

fit5 (Sex + Age + Fare + Pclass) achieves the best correct classification rate on the validation set, so it is the model we evaluate, once, on the test set:

GetCorrectClassificationRate(fit5, titanic.test)
## [1] 0.78

replicate

replicate can be used to evaluate the same expression multiple times. Without randomness, that’s not very useful: you’ll get the same output every time. But when the expression involves randomness, each call can produce a different result, and repeating it many times lets us study how those results are distributed.

First, let’s see the syntax:

square <- function(x){
  return(x**2)
}

replicate(10, square(5))
##  [1] 25 25 25 25 25 25 25 25 25 25

Now, let’s try this with randomness:

CoinFlip <- function(){
  sample(c("Heads", "Tails"), size = 1)
}
set.seed(0)
replicate(10, CoinFlip())
##  [1] "Tails" "Heads" "Heads" "Tails" "Tails" "Heads" "Tails" "Tails"
##  [9] "Tails" "Tails"

We can now use replicate to see how the classification performance on the Titanic data varies with the random choice of training and validation sets.

GetCorrectClassificationRate <- function(fit, dat){
  pred <- predict(fit, newdata=dat, type = "response") > .5
  return(mean(pred == dat$Survived))
}


GetValidPerformance <- function(valid.size, titanic){
  # Randomly split the data into a training set of 500 rows and a validation
  # set of valid.size rows, fit the model on the training set, and return its
  # correct classification rate on the validation set.
  idx <- sample(1:length(titanic$Survived))
  titanic.train <- titanic[idx[1:500],]
  titanic.valid <- titanic[idx[501:(500 + valid.size)],]
  fit <- glm(Survived ~ Pclass, family = binomial, data = titanic.train)
  return(GetCorrectClassificationRate(fit, titanic.valid))
}

titanic <- read.csv("titanic.csv")
perf.data <- data.frame(valid.perf = replicate(1000, GetValidPerformance(250, titanic)))

ggplot(data = perf.data, mapping = aes(x=valid.perf)) + 
  geom_bar()
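
A possible follow-up (not in the original notes) is to summarise how much the validation performance varies across the 1000 random splits:

mean(perf.data$valid.perf)                       # average validation performance
sd(perf.data$valid.perf)                         # spread due to the random split
quantile(perf.data$valid.perf, c(0.025, 0.975))  # interval covering 95% of the splits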