Let’s load the Titanic dataset
titanic <- read.csv("titanic.csv")
Now, let’s fit a Logistic Regression model:
fit.titanic <- glm(Survived ~ Sex + Age, data = titanic, family=binomial)
Get the predictions:
titanic[, "pred"] = predict(fit.titanic, newdata=titanic, type = "response")
We can now view the predictions.
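For a quick look at the first few predicted probabilities (a minimal check, assuming the pred column we just created):

```r
# Peek at the first few predicted survival probabilities
round(head(titanic$pred), 3)
```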
Let’s look at the coefficients
fit.titanic
##
## Call: glm(formula = Survived ~ Sex + Age, family = binomial, data = titanic)
##
## Coefficients:
## (Intercept) Sexmale Age
## 1.11388 -2.50000 -0.00206
##
## Degrees of Freedom: 886 Total (i.e. Null); 884 Residual
## Null Deviance: 1183
## Residual Deviance: 916 AIC: 922
So for example, to predict the probability that a 30-year-old man would survive, we’d use
plogis(1.11 - 2.5 - 0.002*30)
## [1] 0.1900016
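The same number can be obtained with predict, without copying coefficients by hand (a sketch assuming fit.titanic from above; the value will differ slightly from 0.19 because we rounded the coefficients in the manual calculation):

```r
# Predict the survival probability for a hypothetical 30-year-old man
new.passenger <- data.frame(Sex = "male", Age = 30)
predict(fit.titanic, newdata = new.passenger, type = "response")
```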
Here, plogis is simply the logistic function \(f(z) = \frac{1}{1+\exp(-z)}\).
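We can check that definition directly (plogis is base R's logistic CDF):

```r
# plogis(z) is the same as 1/(1 + exp(-z))
z <- 1.11 - 2.5 - 0.002 * 30
1 / (1 + exp(-z))   # ~0.19, computed by hand
plogis(z)           # same value
```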
Suppose we try to use Pclass – the ticket class – as a predictor. The values there are 1, 2, and 3, but it is unclear whether we should use Pclass as a continuous variable (the classes are numbered in a meaningful order: second class really is between first class and third class in a way that could be relevant to predicting survival) or as a categorical variable.
First, let’s deal with the mechanics. To use Pclass as a categorical variable, we need to convert it to a new data type: factor. To do that, we’ll use as.factor.
A variable of type factor will always be treated as categorical by R (as will a variable of type character, which is why we did not have to do anything extra to use Sex as categorical). On the other hand, numeric and integer variables will be treated as continuous.
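A small self-contained illustration of the difference (a toy vector, not the Titanic data):

```r
x <- c(1, 3, 2, 3, 1)
class(x)           # "numeric" -- would be treated as continuous
f <- as.factor(x)
class(f)           # "factor" -- treated as categorical
model.matrix(~ f)  # one 0/1 indicator column per level beyond the first
```

The model.matrix call shows what glm actually sees: a factor with three levels becomes two indicator columns (the first level is absorbed into the intercept), each getting its own coefficient.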
fit.categ <- glm(Survived ~ Sex + Age + as.factor(Pclass), family = binomial, data = titanic)
fit.categ
##
## Call: glm(formula = Survived ~ Sex + Age + as.factor(Pclass), family = binomial,
## data = titanic)
##
## Coefficients:
## (Intercept) Sexmale Age
## 3.63492 -2.58872 -0.03427
## as.factor(Pclass)2 as.factor(Pclass)3
## -1.19912 -2.45544
##
## Degrees of Freedom: 886 Total (i.e. Null); 882 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 811.6
Compare this to using Pclass as a continuous variable:
fit.cont <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = titanic)
fit.cont
##
## Call: glm(formula = Survived ~ Sex + Age + Pclass, family = binomial,
## data = titanic)
##
## Coefficients:
## (Intercept) Sexmale Age Pclass
## 4.87851 -2.58916 -0.03436 -1.23054
##
## Degrees of Freedom: 886 Total (i.e. Null); 883 Residual
## Null Deviance: 1183
## Residual Deviance: 801.6 AIC: 809.6
When using Pclass as a categorical variable, for a 30-year-old male in second class, we’d predict \(\sigma(3.6 - 2.6 - 0.03\times 30 - 1.2)\), where \(\sigma\) is the logistic function plogis (here we picked the \(-1.2\) coefficient displayed for as.factor(Pclass)2, the second-class indicator).
When using Pclass as a continuous variable, for the same person we’d predict \(\sigma(4.9 - 2.6 - 0.03\times 30 - 1.2\times 2)\) (here there is just one coefficient associated with class, multiplied by the class number).
Those numbers are very close, but different.
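We can confirm this in R with the full-precision coefficients (a sketch assuming fit.categ and fit.cont from above; plugging the rounded coefficients in by hand gives roughly 0.235 vs. 0.231):

```r
# Same hypothetical passenger, run through both models.
# fit.categ applies as.factor(Pclass) inside its formula, so a numeric
# Pclass in newdata works for both fits.
passenger <- data.frame(Sex = "male", Age = 30, Pclass = 2)
predict(fit.categ, newdata = passenger, type = "response")  # categorical Pclass
predict(fit.cont,  newdata = passenger, type = "response")  # continuous Pclass
```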
We can decide that if the probability the model outputs is greater than 0.5, we’ll predict “Survived,” and if it is not, we’ll predict “Did not survive.” Let’s add that as a column to the dataset:
titanic[, "pred"] <- predict(fit.titanic, newdata=titanic, type="response")>0.5
library(dplyr)  # for select()
titanic[1:10, ] %>% select(Name, Age, Survived, pred)
## Name Age Survived pred
## 1 Mr. Owen Harris Braund 22 0 FALSE
## 2 Mrs. John Bradley (Florence Briggs Thayer) Cumings 38 1 TRUE
## 3 Miss. Laina Heikkinen 26 1 TRUE
## 4 Mrs. Jacques Heath (Lily May Peel) Futrelle 35 1 TRUE
## 5 Mr. William Henry Allen 35 0 FALSE
## 6 Mr. James Moran 27 0 FALSE
## 7 Mr. Timothy J McCarthy 54 0 FALSE
## 8 Master. Gosta Leonard Palsson 2 0 FALSE
## 9 Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson 27 1 TRUE
## 10 Mrs. Nicholas (Adele Achem) Nasser 14 1 TRUE
Now, we can compute the accuracy rate – the proportion of the time that the model is right
mean(titanic$pred == titanic$Survived)
## [1] 0.7857948
(Recall that in the comparison, TRUE is treated as 1 and FALSE as 0, so the logical predictions line up with the 0/1 values in Survived.)
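Overall accuracy compresses everything into one number; a confusion matrix breaks the outcomes down by class (a quick sketch, assuming the pred column above):

```r
# Rows: actual outcome (0/1); columns: model's prediction (FALSE/TRUE)
table(actual = titanic$Survived, predicted = titanic$pred)
```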
Is the model doing well? Maybe. One way to check is to compare its accuracy rate to that of the simplest possible model – one that always says “Survived” or always says “Did not survive.”
The survival rate was
mean(titanic$Survived)
## [1] 0.3855693
So if we say “Did not survive” all the time, our correct classification rate will be
1 - mean(titanic$Survived)
## [1] 0.6144307
That is not as good as the model’s 78.6%, so the model is learning something.

Back to the two Pclass variants: let’s compute the correct classification rates for both, on the training set.
mean(titanic$Survived == (predict(fit.categ, newdata=titanic, type="response") > 0.5))
## [1] 0.794814
mean(titanic$Survived == (predict(fit.cont, newdata=titanic, type="response") > 0.5))
## [1] 0.794814
Apparently it didn’t make a difference.
In theory, we’d expect the model with the categorical variable to do at least as well as the one with the continuous variable. Consider this: with a categorical variable, we are saying
Category 1: add a1
Category 2: add a2
Category 3: add a3
...
With a continuous variable, we are saying
Category 1: add a*1
Category 2: add a*2
Category 3: add a*3
If it were advantageous, glm could just make a1 = a*1, a2 = a*2, a3 = a*3, and get the same results with the categorical variable as with the continuous variable. But the categorical variable allows more flexibility: a1, a2, and a3 do not have to lie on a line through the origin. So with a categorical variable we’d expect performance (on the training set) that is at least as good.
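Since the continuous model is nested in the categorical one, we can also test formally whether the extra flexibility helps, with a likelihood-ratio test (a sketch assuming fit.cont and fit.categ from above):

```r
# Compare nested models: does treating Pclass as categorical
# significantly reduce the residual deviance?
anova(fit.cont, fit.categ, test = "Chisq")
```

Here the two residual deviances printed above are both about 801.6, so we’d expect the test to report essentially no improvement from the extra coefficient.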