Let’s load the Titanic dataset:
titanic <- read.csv("http://guerzhoy.princeton.edu/201s20/titanic.csv")
Now, let’s fit a logistic regression model:
fit.titanic <- glm(Survived ~ Sex + Age, data = titanic, family = "binomial")
Get the predictions:
titanic[, "pred"] <- predict(fit.titanic, newdata = titanic, type = "response")
View the predictions with View(titanic).
Let’s look at the coefficients:
fit.titanic
##
## Call: glm(formula = Survived ~ Sex + Age, family = "binomial", data = titanic)
##
## Coefficients:
## (Intercept) Sexmale Age
## 1.11388 -2.50000 -0.00206
##
## Degrees of Freedom: 886 Total (i.e. Null); 884 Residual
## Null Deviance: 1183
## Residual Deviance: 916 AIC: 922
So for example, to predict the probability that a 30-year-old man would survive, we’d use
plogis(1.11 - 2.5 - 0.002*30)
## [1] 0.1900016
Here, plogis is simply the logistic function \(f(z) = \frac{1}{1+\exp(-z)}\).
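As a quick sanity check, we can verify that plogis matches that formula directly, using the same linear score as in the example above:

```r
# plogis(z) is the logistic (sigmoid) function: 1 / (1 + exp(-z))
z <- 1.11 - 2.5 - 0.002 * 30   # linear score for a 30-year-old man
plogis(z)                      # built-in logistic function
1 / (1 + exp(-z))              # same value, computed by hand
```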
Suppose we try to use Pclass – the passenger’s ticket class – as a predictor. The values there are 1, 2, and 3, but it is unclear whether we should use Pclass as a continuous variable (since the classes are numbered in a way that makes sense: second class is really between first class and third class in a way that could be relevant to predicting survival) or as a categorical variable.
First, let’s deal with the mechanics. To use Pclass as a categorical variable, we need to convert it to a new data type: factor. To do that, we’ll use as.factor.
A variable of type factor will always be treated as categorical by R (and so will a variable of type character, which is why we did not have to do anything extra to use Sex as categorical). On the other hand, numeric and integer variables will be treated as continuous.
fit.categ <- glm(Survived ~ Sex + Age + as.factor(Pclass), family = binomial, data = titanic)
fit.categ
##
## Call: glm(formula = Survived ~ Sex + Age + as.factor(Pclass), family = binomial,
## data = titanic)
##
## Coefficients:
## (Intercept) Sexmale Age as.factor(Pclass)2
## 3.63492 -2.58872 -0.03427 -1.19912
## as.factor(Pclass)3
## -2.45544
##
## Degrees of Freedom: 886 Total (i.e. Null); 882 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 811.6
Compare this to using Pclass as a continuous variable:
fit.cont <- glm(Survived ~ Sex + Age + Pclass, family = "binomial", data = titanic)
fit.cont
##
## Call: glm(formula = Survived ~ Sex + Age + Pclass, family = "binomial",
## data = titanic)
##
## Coefficients:
## (Intercept) Sexmale Age Pclass
## 4.87851 -2.58916 -0.03436 -1.23054
##
## Degrees of Freedom: 886 Total (i.e. Null); 883 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 809.6
When using Pclass as a categorical variable, for a 30-year-old male in second class, we’d predict \(\sigma(3.6 - 2.6 - 0.03\times 30 - 1.2)\) (here we picked the \(-1.2\) coefficient – the one for as.factor(Pclass)2 – from the coefficients displayed in the model).
When using Pclass as a continuous variable, for the same person we’d predict \(\sigma(4.9 - 2.6 - 0.03\times 30 - 1.2\times 2)\) (here we have just one coefficient associated with class, multiplied by the class number).
Those numbers are very close, but different.
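To see this concretely, here is a sketch that plugs the full-precision coefficients printed by the two models above into plogis:

```r
# Predicted survival probability for a 30-year-old male in second class,
# using the coefficients from the two fitted models above
p_categ <- plogis(3.63492 - 2.58872 - 0.03427 * 30 - 1.19912)      # categorical Pclass
p_cont  <- plogis(4.87851 - 2.58916 - 0.03436 * 30 - 1.23054 * 2)  # continuous Pclass
c(p_categ, p_cont)  # close, but not identical
```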
Using Pclass as a categorical variable

It is possible to convert Pclass to a variable that will be treated as categorical in several ways. Anything that converts the column Pclass to factor or character will do:
glm(Survived ~ Sex + Age + as.character(Pclass), family = binomial, data = titanic)
##
## Call: glm(formula = Survived ~ Sex + Age + as.character(Pclass), family = binomial,
## data = titanic)
##
## Coefficients:
## (Intercept) Sexmale Age
## 3.63492 -2.58872 -0.03427
## as.character(Pclass)2 as.character(Pclass)3
## -1.19912 -2.45544
##
## Degrees of Freedom: 886 Total (i.e. Null); 882 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 811.6
t1 <- titanic
t1$Pclass <- as.factor(t1$Pclass)
glm(Survived ~ Sex + Age + Pclass, family = binomial, data = t1)
##
## Call: glm(formula = Survived ~ Sex + Age + Pclass, family = binomial,
## data = t1)
##
## Coefficients:
## (Intercept) Sexmale Age Pclass2 Pclass3
## 3.63492 -2.58872 -0.03427 -1.19912 -2.45544
##
## Degrees of Freedom: 886 Total (i.e. Null); 882 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 811.6
t2 <- titanic
t2$Pclass <- as.character(t2$Pclass)
glm(Survived ~ Sex + Age + Pclass, family = binomial, data = t2)
##
## Call: glm(formula = Survived ~ Sex + Age + Pclass, family = binomial,
## data = t2)
##
## Coefficients:
## (Intercept) Sexmale Age Pclass2 Pclass3
## 3.63492 -2.58872 -0.03427 -1.19912 -2.45544
##
## Degrees of Freedom: 886 Total (i.e. Null); 882 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 811.6
We can decide that if the probability the model outputs is greater than 0.5, we’ll predict “Survived,” and if it is not, we’ll predict “Did not survive.” Let’s add that as a column to the dataset:
titanic[, "pred"] <- predict(fit.titanic, newdata=titanic, type="response")>0.5
library(dplyr)  # for %>% and select()
titanic[1:10,] %>% select(Name, Age, Survived, pred)
## Name Age Survived pred
## 1 Mr. Owen Harris Braund 22 0 FALSE
## 2 Mrs. John Bradley (Florence Briggs Thayer) Cumings 38 1 TRUE
## 3 Miss. Laina Heikkinen 26 1 TRUE
## 4 Mrs. Jacques Heath (Lily May Peel) Futrelle 35 1 TRUE
## 5 Mr. William Henry Allen 35 0 FALSE
## 6 Mr. James Moran 27 0 FALSE
## 7 Mr. Timothy J McCarthy 54 0 FALSE
## 8 Master. Gosta Leonard Palsson 2 0 FALSE
## 9 Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson 27 1 TRUE
## 10 Mrs. Nicholas (Adele Achem) Nasser 14 1 TRUE
Now, we can compute the accuracy rate – the proportion of the time that the model is right:
mean(titanic$pred == titanic$Survived)
## [1] 0.7857948
(Recall that in the comparison, 0 is converted to FALSE and 1 is converted to TRUE.)
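A quick sketch of that coercion rule:

```r
# In comparisons, logical values are coerced to numbers: FALSE -> 0, TRUE -> 1
c(0 == FALSE, 1 == TRUE)           # both TRUE
mean(c(TRUE, FALSE, TRUE, TRUE))   # logicals also average like 0/1: 0.75
```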
Is the model doing well? Maybe. One way to check is to compare its accuracy rate to that of the simplest possible model: one that always says “Survived” or always says “Did not survive.”
The survival rate was
mean(titanic$Survived)
## [1] 0.3855693
So if we say “Did not survive” all the time, our correct classification rate will be
1 - mean(titanic$Survived)
## [1] 0.6144307
That is not as good: the baseline’s 61% is well below our model’s 79%.
Comparing the two ways of using Pclass

Let’s compute the correct classification rates for both variants, on the training set:
mean(titanic$Survived == (predict(fit.categ, newdata=titanic, type="response") > 0.5))
## [1] 0.794814
mean(titanic$Survived == (predict(fit.cont, newdata=titanic, type="response") > 0.5))
## [1] 0.794814
Apparently it didn’t make a difference.
In theory, we’d expect the model with the categorical variable to do at least as well on the training set as the one with the continuous variable. Consider this. With a categorical variable, we are saying:

Category 1: add a1
Category 2: add a2
Category 3: add a3
...

With a continuous variable, we are saying:

Category 1: add a*1
Category 2: add a*2
Category 3: add a*3

If it were advantageous, glm could just set a1 = a*1, a2 = a*2, a3 = a*3, and get the same results with the categorical variable as with the continuous variable. But the categorical variable allows more flexibility: the three offsets do not have to lie on a line. So with a categorical variable we’d expect performance (on the training set) that is at least as good.
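We can check this nesting argument on simulated data (a sketch with made-up data, not the Titanic dataset): because the categorical model can reproduce any fit that is linear in the class number, its residual deviance on the training data can never be higher.

```r
set.seed(1)
n <- 500
pclass <- sample(1:3, n, replace = TRUE)
# Simulate outcomes whose log-odds are exactly linear in class
y <- rbinom(n, 1, plogis(1 - 0.8 * pclass))

fit_cont <- glm(y ~ pclass, family = binomial)             # continuous
fit_cat  <- glm(y ~ as.factor(pclass), family = binomial)  # categorical

# The categorical model nests the continuous one, so its training fit
# is at least as good (lower or equal deviance)
deviance(fit_cat) <= deviance(fit_cont) + 1e-8
```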