The Titanic dataset

Let’s load the Titanic dataset:

titanic <- read.csv("titanic.csv")
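
Before fitting anything, it’s worth a quick look at the data. A minimal check (this assumes titanic.csv contains the Survived, Sex, Age, Pclass, and Name columns used later in this section):

str(titanic[, c("Survived", "Sex", "Age", "Pclass", "Name")])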

Now, let’s fit a logistic regression model:

fit.titanic <- glm(Survived ~ Sex + Age, data = titanic, family=binomial)

Get the predictions:

titanic[, "pred"] <- predict(fit.titanic, newdata=titanic, type = "response")

We can now view the predictions.
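
For example, here are the predicted probabilities for the first few passengers (using the Name column that appears in the output further down):

titanic[1:5, c("Name", "Age", "Survived", "pred")]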

Let’s look at the coefficients:

fit.titanic
## 
## Call:  glm(formula = Survived ~ Sex + Age, family = binomial, data = titanic)
## 
## Coefficients:
## (Intercept)      Sexmale          Age  
##     1.11388     -2.50000     -0.00206  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  884 Residual
## Null Deviance:       1183 
## Residual Deviance: 916   AIC: 922

So for example, to predict the probability that a 30-year-old man would survive, we’d use

plogis(1.11 - 2.5 - 0.002*30)
## [1] 0.1900016

Here, plogis is simply the logistic function \(f(z) = \frac{1}{1+\exp(-z)}\).
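
As a quick check (assuming the Sex column is coded as “male”/“female”, as the Sexmale coefficient suggests), we can write the logistic function out by hand, and also let predict do the same computation with the un-rounded coefficients:

z <- 1.11 - 2.5 - 0.002*30      # the linear combination from above
1/(1 + exp(-z))                 # same value as plogis(z)
predict(fit.titanic, newdata=data.frame(Sex="male", Age=30), type="response")  # close to 0.19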

Categorical variables vs. continuous variables

Suppose we try to use Pclass – the class of the fare – as a predictor. Its values are 1, 2, and 3, but it is unclear whether we should treat Pclass as a continuous variable (the classes are numbered in a way that makes sense: second class really is between first class and third class in a way that could be relevant to predicting survival) or as a categorical variable.

First, let’s deal with the mechanics. To use Pclass as a categorical variable, we need to convert it to a new data type: factor. To do that, we’ll use as.factor.

A variable of type factor will always be treated as categorical by R (as will a variable of type character, which is why we did not have to do anything extra to use Sex as categorical). On the other hand, numeric and integer variables will be treated as continuous.
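
As a quick check of the mechanics (assuming Pclass was read in from the CSV as a plain integer column):

class(titanic$Pclass)              # likely "integer": treated as continuous by glm
class(as.factor(titanic$Pclass))   # "factor": treated as categorical
levels(as.factor(titanic$Pclass))  # the categories, as strings: "1" "2" "3"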

fit.categ <- glm(Survived ~ Sex + Age + as.factor(Pclass), family = binomial, data = titanic)
fit.categ
## 
## Call:  glm(formula = Survived ~ Sex + Age + as.factor(Pclass), family = binomial, 
##     data = titanic)
## 
## Coefficients:
##        (Intercept)             Sexmale                 Age  
##            3.63492            -2.58872            -0.03427  
## as.factor(Pclass)2  as.factor(Pclass)3  
##           -1.19912            -2.45544  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  882 Residual
## Null Deviance:       1183 
## Residual Deviance: 801.6     AIC: 811.6

Compare this to using Pclass as a continuous variable:

fit.cont <- glm(Survived ~ Sex + Age + Pclass, family = binomial, data = titanic)
fit.cont
## 
## Call:  glm(formula = Survived ~ Sex + Age + Pclass, family = binomial, 
##     data = titanic)
## 
## Coefficients:
## (Intercept)      Sexmale          Age       Pclass  
##     4.87851     -2.58916     -0.03436     -1.23054  
## 
## Degrees of Freedom: 886 Total (i.e. Null);  883 Residual
## Null Deviance:       1183 
## Residual Deviance: 801.6     AIC: 809.6

When using Pclass as a categorical variable, for a 30-year-old male in second class, we’d predict

\(\sigma(3.6 - 2.6 - 0.03\times 30 - 1.2)\), where \(\sigma\) is the logistic function computed by plogis (here we picked the \(-1.2\) coefficient, the one labelled as.factor(Pclass)2, from the coefficients displayed in the model).

When using Pclass as a continuous variable, for the same person we’d predict

\(\sigma(4.9 - 2.6 - 0.03\times 30 - 1.2\times 2)\) (here we have just one coefficient associated with class, multiplied by the class number, 2).

Those numbers are very close, but different.
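
We can check this with plogis, using the un-rounded coefficients printed above; both probabilities come out to roughly 0.23, close but not identical:

plogis(3.63492 - 2.58872 - 0.03427*30 - 1.19912)   # Pclass categorical: use the as.factor(Pclass)2 coefficient
plogis(4.87851 - 2.58916 - 0.03436*30 - 1.23054*2) # Pclass continuous: the single Pclass coefficient, times 2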

Evaluating the logistic regression model

We can decide that if the probability the model outputs is greater than 0.5, we’ll predict “Survived,” and if it is not, we’ll predict “Did not survive.” Let’s add that as a column to the dataset:

library(dplyr)   # for %>% and select(); skip if dplyr/tidyverse is already loaded
titanic[, "pred"] <- predict(fit.titanic, newdata=titanic, type="response") > 0.5
titanic[1:10,] %>% select(Name, Age, Survived, pred)
##                                                  Name Age Survived  pred
## 1                              Mr. Owen Harris Braund  22        0 FALSE
## 2  Mrs. John Bradley (Florence Briggs Thayer) Cumings  38        1  TRUE
## 3                               Miss. Laina Heikkinen  26        1  TRUE
## 4         Mrs. Jacques Heath (Lily May Peel) Futrelle  35        1  TRUE
## 5                             Mr. William Henry Allen  35        0 FALSE
## 6                                     Mr. James Moran  27        0 FALSE
## 7                              Mr. Timothy J McCarthy  54        0 FALSE
## 8                       Master. Gosta Leonard Palsson   2        0 FALSE
## 9    Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson  27        1  TRUE
## 10                 Mrs. Nicholas (Adele Achem) Nasser  14        1  TRUE

Now, we can compute the accuracy rate – the proportion of the time that the model is right:

mean(titanic$pred == titanic$Survived)
## [1] 0.7857948

(Recall that in the comparison, R treats the logical values TRUE and FALSE as 1 and 0, so the predictions can be compared directly to the 0/1 values in Survived.)

Is the model doing well? Maybe. One way to check is to compare its accuracy rate to that of the simplest possible model – one that always says “Survived” or always says “Did not survive.”

The survival rate was

mean(titanic$Survived)
## [1] 0.3855693

So if we say “Did not survive” all the time, our correct classification rate will be

1 - mean(titanic$Survived)
## [1] 0.6144307

That is not as good as the 79% accuracy we got from the model.

First try at deciding whether to use the categorical or the continuous version of Pclass

Let’s compute the correct classification rates for both variants on the training set:

mean(titanic$Survived == (predict(fit.categ, newdata=titanic, type="response") > 0.5))
## [1] 0.794814
mean(titanic$Survived == (predict(fit.cont, newdata=titanic, type="response") > 0.5))
## [1] 0.794814

Apparently it didn’t make a difference.

In theory, we’d expect the model with the categorical variable to do at least as well as the one with the continuous variable. Consider this. With a categorical variable, we are saying:

  Category 1: add a1
  Category 2: add a2
  Category 3: add a3
  ...
  

With a continuous variable, we are saying:

  Category 1: add a*1
  Category 2: add a*2
  Category 3: add a*3
  
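A concrete way to see the difference is to peek at the columns R actually constructs for the model in each case (model.matrix shows the design matrix; the first few rows are enough):

model.matrix(~ as.factor(Pclass), data=titanic)[1:5, ]  # dummy coding: a 0/1 column for class 2 and for class 3
model.matrix(~ Pclass, data=titanic)[1:5, ]             # numeric coding: a single column with the values 1, 2, 3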

If it were advantageous, glm could just set a1 = a*1, a2 = a*2, and a3 = a*3, and get the same results with the categorical variable as with the continuous variable. But with a categorical variable there is more flexibility, since the three adjustments do not have to follow that pattern. So with a categorical variable we’d expect performance (on the training set) that is at least as good.
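
One way to sanity-check this on the training set is to compare the residual deviances of the two fits (lower deviance means a better fit to the training data). Since the continuous coding is a special case of the categorical coding, the categorical fit’s deviance should be no larger, up to the fitting algorithm’s tolerance:

deviance(fit.categ)   # Pclass as a factor
deviance(fit.cont)    # Pclass as a number
deviance(fit.categ) <= deviance(fit.cont)   # expected: TRUE (up to tiny numerical differences)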