Week 5 Lecture 1

Categorical vs. Continuous Variables

We reviewed the material from the previous lecture. See there for details.

Let’s fit two models: one with Pclass as a continuous variable, and one with Pclass as a categorical (discrete) variable.

fit.titanic.cont <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data=titanic)
fit.titanic.categ <- glm(Survived ~ Age + Sex + as.factor(Pclass), family = binomial, data=titanic)

Let’s predict using the model that treats Pclass as a continuous variable. Here, we are predicting for the person who appears first in the Titanic dataset:

titanic[1,]

##   Survived Pclass                   Name  Sex Age Siblings.Spouses.Aboard
## 1        0      3 Mr. Owen Harris Braund male  22                       1
##   Parents.Children.Aboard Fare
## 1                       0 7.25

Here’s the prediction using plogis:

plogis(predict(fit.titanic.cont, newdata = titanic[1,]))

##         1 
## 0.1035659

We can get the same prediction directly, without calling plogis, by using type="response":

predict(fit.titanic.cont, newdata = titanic[1,], type="response")

##         1 
## 0.1035659

Finally, we can look at the fit to see what the coefficients are:

fit.titanic.cont$coefficients

## (Intercept)         Age     Sexmale      Pclass 
##  4.87851130 -0.03436144 -2.58916304 -1.23053773

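So we can predict “manually” (the result differs slightly in the last digits from the one above because we typed in rounded coefficients):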

plogis(4.87851 - 0.03436 * 22 -2.58916 - 1.23054* 3)

## [1] 0.1035684
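
The same arithmetic can be written as a dot product between the coefficient vector and the person’s feature vector. The following is a minimal sketch that types in the coefficient values printed above; the names coefs, x, log.odds, and p.manual are introduced here only for illustration.

```r
# Coefficients as printed by fit.titanic.cont$coefficients above
coefs <- c("(Intercept)" = 4.87851130, Age = -0.03436144,
           Sexmale = -2.58916304, Pclass = -1.23053773)

# Feature vector for the first passenger: intercept term, Age = 22,
# Sexmale = 1 (male), Pclass = 3
x <- c(1, 22, 1, 3)

# Linear predictor (log-odds), then the logistic function
log.odds <- sum(coefs * x)
p.manual <- plogis(log.odds)
p.manual
```

With the fitted model at hand, coef(fit.titanic.cont) could be used in place of the typed-in numbers, which avoids rounding altogether.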

Now, consider the categorical variable version.

fit.titanic.categ

## 
## Call:  glm(formula = Survived ~ Age + Sex + as.factor(Pclass), family = binomial, 
##     data = titanic)
## 
## Coefficients:
##        (Intercept)                 Age             Sexmale  
##            3.63492            -0.03427            -2.58872  
## as.factor(Pclass)2  as.factor(Pclass)3  
##           -1.19912            -2.45544  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  882 Residual
## Null Deviance:       1183 
## Residual Deviance: 801.6     AIC: 811.6

The “manual” prediction for the same person (who is in class 3, so we add the as.factor(Pclass)3 coefficient) would be:

plogis(3.63492 -0.03427 * 22 -2.58872 -2.45544 )

## [1] 0.103106

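That is almost the same as the prediction from the continuous version. As we discussed in class, the model with the categorical version has strictly more flexibility as far as predicting different values for different classes: the continuous version forces the change in log-odds from class 1 to class 2 to equal the change from class 2 to class 3, while the categorical version fits a separate coefficient for each class. We therefore expect that, on the training set, the categorical version fits at least as well as the continuous version.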
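
The claim about training-set fit can be checked on simulated data. The sketch below does not use the Titanic dataset: it generates made-up data in which class has a non-linear effect (all names and numbers here are assumptions for illustration), fits both versions, and compares the residual deviance, where lower means a better fit on the training data.

```r
set.seed(1)

# Synthetic data (NOT the Titanic dataset): class has a non-linear effect
n <- 500
pclass <- sample(1:3, n, replace = TRUE)
age <- runif(n, 1, 80)
class.effect <- c(0, -0.3, -2.5)   # made-up log-odds shifts for classes 1, 2, 3
y <- rbinom(n, 1, plogis(1.5 - 0.03 * age + class.effect[pclass]))
sim <- data.frame(y = y, age = age, pclass = pclass)

fit.cont <- glm(y ~ age + pclass, family = binomial, data = sim)
fit.categ <- glm(y ~ age + as.factor(pclass), family = binomial, data = sim)

# Residual deviance on the training data: lower means a better fit
dev.cont <- deviance(fit.cont)
dev.categ <- deviance(fit.categ)
c(continuous = dev.cont, categorical = dev.categ)
```

Because the continuous model is a special case of the categorical one, the categorical deviance should never be larger on the data used for fitting. This says nothing about performance on new data.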

Predicting “Survived” or Died

As we mentioned before, we can decide to guess “Survived” if the predicted probability is greater than or equal to 0.5.

titanic[, "pred"] <- predict(fit.titanic.cont, newdata = titanic, type="response") >= .5

(Note that this is the same as predicting “Survived” if the sum \(a_0 + a_1 x_1 + ...\) is greater than or equal to 0, since \(\sigma(0) = 0.5\) and \(\sigma\) is increasing.)
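
The equivalence of the two decision rules can be checked numerically. This is a small sketch on a made-up grid of values for the linear predictor; the names z, by.probability, by.log.odds, and rules.agree are introduced here for illustration.

```r
# A grid of values for the linear predictor a_0 + a_1 x_1 + ...
z <- seq(-3, 3, by = 0.5)

# Thresholding the probability at 0.5 ...
by.probability <- plogis(z) >= 0.5
# ... versus thresholding the linear predictor at 0
by.log.odds <- z >= 0

# Check whether the two rules agree at every grid point
rules.agree <- all(by.probability == by.log.odds)
rules.agree
```

For a fitted model, the linear predictor is what predict returns by default (type="link"), so comparing that output with 0 should give the same guesses as comparing the type="response" output with 0.5.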

The “baseline” classifier

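The simplest possible classifier predicts the same thing (in our case, “did not survive”) every time. To have something to compare it against, we first compute how often our logistic regression classifier is correct on the training set:

mean(titanic$pred == titanic$Survived)

## [1] 0.794814

The baseline classifier is correct exactly when the person did not survive, so its accuracy is the fraction of people in the dataset who did not survive:

mean(titanic$Survived == 0)

Note that this is not the same quantity as the fraction of people for whom our model predicts “did not survive”, which is:

mean(titanic$pred == 0)

## [1] 0.6347238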
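
On a small made-up example, the accuracy of a constant classifier is just the frequency of the class it always predicts. The vector survived.toy below is invented for illustration and is not taken from the Titanic data.

```r
# Toy outcomes (1 = survived, 0 = did not survive); made-up values
survived.toy <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0)

# Baseline that always predicts "did not survive" (0)
baseline.pred <- rep(0, length(survived.toy))

# Accuracy of the baseline: fraction of cases where the constant guess is right
baseline.acc <- mean(baseline.pred == survived.toy)

# This equals the fraction of actual negatives in the data
frac.negative <- mean(survived.toy == 0)
c(baseline.acc = baseline.acc, frac.negative = frac.negative)
```

A fitted model is only interesting to the extent that its accuracy exceeds this baseline.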

Other measures of how good a classifier/predictor/model is

False positive rate

The false positive rate is the fraction of the negative examples for which the model outputs “positive” (i.e., the model says “Survived” when the person did not survive):

\[FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}\]

The false negative rate is the fraction of the positive examples for which the model outputs “negative” (i.e., the model says “Did not survive” when the person actually did survive):

\[FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}\]

The positive predictive value is the rate at which the model is correct when it says “positive”:

\[PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}\]

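# False positives: the model said "Survived" but the person did not survive
total.positive.actually.negative <- sum((titanic$pred == TRUE) & (titanic$Survived == 0))
# All actual negatives (people who did not survive): the denominator of the FPR
total.negative <- sum(titanic$Survived == 0)
FPR <- total.positive.actually.negative/total.negative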
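
The three definitions above can be worked through on a small made-up example. The vectors actual and pred below are invented for illustration (they are not Titanic data), and the variable names are assumptions of this sketch.

```r
# Toy labels and predictions; made-up values, not Titanic data
actual <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)      # 1 = positive, 0 = negative
pred <- c(FALSE, TRUE, FALSE, FALSE, TRUE,
          FALSE, TRUE, FALSE, TRUE, TRUE)      # TRUE = model said "positive"

false.pos <- sum(pred & (actual == 0))     # said "positive", actually negative
false.neg <- sum((!pred) & (actual == 1))  # said "negative", actually positive
true.pos <- sum(pred & (actual == 1))      # said "positive", actually positive

FPR.toy <- false.pos / sum(actual == 0)  # denominator: all actual negatives
FNR.toy <- false.neg / sum(actual == 1)  # denominator: all actual positives
PPV.toy <- true.pos / sum(pred)          # denominator: all "positive" predictions
c(FPR = FPR.toy, FNR = FNR.toy, PPV = PPV.toy)
```

Here there are 6 actual negatives, of which 2 are predicted positive, so the FPR is 2/6; there are 4 actual positives, of which 1 is predicted negative, so the FNR is 1/4; and 3 of the 5 “positive” predictions are correct, so the PPV is 3/5. Calling table(pred, actual) shows the same counts as a 2-by-2 table.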

Which is “positive” and which is “negative”?

There is no hard-and-fast rule, but usually “positive” would be the surprising or rare or important event (e.g., the patient has a rare/significant disease).