We reviewed the material from the previous lecture. See there for details.
Let’s fit two models: one with Pclass as a continuous variable, and one with Pclass as a discrete (categorical) variable.
fit.titanic.cont <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data=titanic)
fit.titanic.categ <- glm(Survived ~ Age + Sex + as.factor(Pclass), family = binomial, data=titanic)
Let’s predict using the model that treats Pclass as a continuous variable. Here, we are predicting for the person who appears first in the Titanic dataset:
titanic[1,]
## Survived Pclass Name Sex Age Siblings.Spouses.Aboard
## 1 0 3 Mr. Owen Harris Braund male 22 1
## Parents.Children.Aboard Fare
## 1 0 7.25
Here’s the prediction using plogis, which applies the logistic function to the log-odds that predict returns by default:
plogis(predict(fit.titanic.cont, newdata = titanic[1,]))
## 1
## 0.1035659
We can get the same prediction directly using type="response":
predict(fit.titanic.cont, newdata = titanic[1,], type="response")
## 1
## 0.1035659
Finally, we can look at the fit to see what the coefficients are:
fit.titanic.cont$coefficients
## (Intercept) Age Sexmale Pclass
## 4.87851130 -0.03436144 -2.58916304 -1.23053773
So we can predict “manually” (the small difference from the value above comes from rounding the coefficients):
plogis(4.87851 - 0.03436 * 22 -2.58916 - 1.23054* 3)
## [1] 0.1035684
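The manual computation can also be done without retyping rounded coefficients. The following is a minimal sketch on simulated data: the data frame `toy`, the model `fit.toy`, and the coefficients used to simulate the outcome are all made up for illustration and are not the Titanic data.

```r
# Simulated stand-in for the Titanic data (hypothetical values, not the real dataset)
set.seed(1)
n <- 200
toy <- data.frame(
  Age    = round(runif(n, 1, 70)),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
log.odds <- 4.9 - 0.03 * toy$Age - 2.6 * (toy$Sex == "male") - 1.2 * toy$Pclass
toy$Survived <- rbinom(n, 1, plogis(log.odds))

fit.toy <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = toy)

# Row of the design matrix for the first observation:
# (Intercept), Age, Sexmale, Pclass
x1 <- model.matrix(fit.toy)[1, ]

# Manual prediction: logistic function of the dot product with the coefficients
manual <- plogis(sum(x1 * coef(fit.toy)))
automatic <- predict(fit.toy, newdata = toy[1, ], type = "response")

# The two agree up to floating-point error
all.equal(unname(manual), unname(automatic))
```

Using `coef()` and `model.matrix()` avoids the rounding discrepancy that appears when the coefficients are copied by hand.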
Now, consider the categorical variable version.
fit.titanic.categ
##
## Call: glm(formula = Survived ~ Age + Sex + as.factor(Pclass), family = binomial,
## data = titanic)
##
## Coefficients:
## (Intercept) Age Sexmale
## 3.63492 -0.03427 -2.58872
## as.factor(Pclass)2 as.factor(Pclass)3
## -1.19912 -2.45544
##
## Degrees of Freedom: 886 Total (i.e. Null); 882 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 811.6
The “manual” prediction for the same passenger (who is in class 3, so the as.factor(Pclass)3 coefficient applies) would be:
plogis(3.63492 -0.03427 * 22 -2.58872 -2.45544 )
## [1] 0.103106
That is almost the same. As we discussed in class, the model with the categorical version has strictly more flexibility as far as predicting different values for different classes: the continuous version forces the effect of going from class 1 to class 2 to equal the effect of going from class 2 to class 3, while the categorical version does not. So we expect that on the training set, the categorical version fits at least as well (its residual deviance can be no larger).
Like we mentioned before, we can decide to guess “Survived” if the probability is greater than or equal to 0.5.
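The flexibility claim can be checked by comparing residual deviances of the two fits. This is a sketch on simulated data (the data frame `toy` and the simulation coefficients are made up, not the Titanic data). Note that the guarantee concerns the training deviance, not necessarily the classification accuracy.

```r
# Simulated stand-in for the Titanic data (hypothetical values)
set.seed(2)
n <- 300
toy <- data.frame(
  Age    = round(runif(n, 1, 70)),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
log.odds <- 4.9 - 0.03 * toy$Age - 2.6 * (toy$Sex == "male") - 1.2 * toy$Pclass
toy$Survived <- rbinom(n, 1, plogis(log.odds))

fit.cont  <- glm(Survived ~ Age + Sex + Pclass,
                 family = binomial, data = toy)
fit.categ <- glm(Survived ~ Age + Sex + as.factor(Pclass),
                 family = binomial, data = toy)

# The continuous model is a special case of the categorical one,
# so the categorical model's residual deviance cannot be larger
c(continuous = deviance(fit.cont), categorical = deviance(fit.categ))

# Likelihood-ratio test of whether the extra flexibility is needed
anova(fit.cont, fit.categ, test = "Chisq")
```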
titanic[, "pred"] <- predict(fit.titanic.cont, newdata = titanic, type="response") >= .5
(Note that this is the same as predicting “Survived” if the sum \(a_0 + a_1 x_1 + \dots\) is greater than or equal to 0, since \(\sigma(0) = 0.5\) and \(\sigma\) is increasing.)
The simplest possible classifier, the baseline, predicts the same thing (in our case, “did not survive”) every time. First, we compute how often our logistic regression classifier is correct on the training set:
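The equivalence between thresholding the probability at 0.5 and thresholding the linear sum at 0 can be checked directly. This is a sketch on simulated data; `toy` and `fit.toy` are made-up names and values, not the Titanic data.

```r
# Simulated stand-in data (hypothetical values)
set.seed(3)
n <- 200
toy <- data.frame(
  Age    = round(runif(n, 1, 70)),
  Pclass = sample(1:3, n, replace = TRUE)
)
toy$Survived <- rbinom(n, 1, plogis(3 - 0.03 * toy$Age - 1.2 * toy$Pclass))

fit.toy <- glm(Survived ~ Age + Pclass, family = binomial, data = toy)

# Threshold the predicted probability at 0.5 ...
by.probability <- predict(fit.toy, type = "response") >= 0.5
# ... or threshold the linear sum (the log-odds) at 0
by.log.odds <- predict(fit.toy, type = "link") >= 0

# Both rules give the same classification for every observation
all(by.probability == by.log.odds)
```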
mean(titanic$pred == titanic$Survived)
## [1] 0.794814
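And now we can compare with the baseline classifier. Since the baseline always predicts “did not survive”, its accuracy is the fraction of passengers who did not survive, mean(titanic$Survived == 0). Note that the expression below is a different quantity: the fraction of passengers for whom the logistic regression model predicts “did not survive”.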
mean(titanic$pred == 0)
## [1] 0.6347238
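To make the comparison between a classifier and the baseline concrete, here is a small sketch using two made-up vectors, `actual` and `pred`, which are illustrative only and are not derived from the Titanic data.

```r
# Made-up labels (1 = survived) and classifier predictions
actual <- c(0, 0, 0, 1, 1, 0, 1, 0, 0, 1)
pred   <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)

# Accuracy of the classifier: fraction of predictions that match the labels
accuracy <- mean(pred == actual)

# Accuracy of the baseline that always predicts "did not survive":
# it is right exactly when the label is 0
baseline <- mean(actual == 0)

c(accuracy = accuracy, baseline = baseline)
## accuracy baseline
##      0.8      0.6
```

The baseline accuracy depends only on the labels, never on the model’s predictions.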
The false positive rate is the rate at which the model outputs “positive”, when considering the negative examples (i.e., the model says “Survived” when the person did not survive):
\[FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}\]
The false negative rate is the rate at which the model outputs “negative”, when considering the positive examples (i.e., the model says “Did not survive” when the person actually did survive):
\[FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}\]
The positive predictive value is the rate at which the model is correct when it says “positive”:
\[PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}\]
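# Number of passengers predicted "Survived" who did not actually survive
total.positive.actually.negative <- sum((titanic$pred == TRUE) & (titanic$Survived == 0))
# Number of actual negatives (passengers who did not survive)
total.negative <- sum(titanic$Survived == 0)
FPR <- total.positive.actually.negative/total.negative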
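All three rates defined above can be computed from the four cells of the confusion matrix. The following sketch uses made-up vectors `actual` and `pred` (not the Titanic data) so that the counts are easy to check by hand.

```r
# Made-up labels (1 = positive) and classifier predictions
actual <- c(0, 0, 0, 1, 1, 0, 1, 0, 0, 1)
pred   <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)

# The four cells of the confusion matrix
TP <- sum(pred & actual == 1)    # said "positive", was correct
FP <- sum(pred & actual == 0)    # said "positive", was wrong
FN <- sum(!pred & actual == 1)   # said "negative", was wrong
TN <- sum(!pred & actual == 0)   # said "negative", was correct

FPR <- FP / (FP + TN)   # denominator: all actual negatives
FNR <- FN / (FN + TP)   # denominator: all actual positives
PPV <- TP / (TP + FP)   # denominator: all "positive" predictions

# Cross-tabulation of predictions against labels
table(predicted = pred, actual = actual)
```

Here TP = 3, FP = 1, FN = 1 and TN = 5, so FPR = 1/6, FNR = 1/4 and PPV = 3/4. The denominators of FPR and FNR count actual labels, while the denominator of PPV counts predictions.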
There is no hard-and-fast rule for which outcome to call “positive”, but usually “positive” would be the surprising or rare or important event (e.g., the patient has a rare/significant disease).