titanic <- read.csv("http://guerzhoy.princeton.edu/201s20/titanic.csv")
As mentioned before, we can decide to guess “Survived” if the predicted probability is greater than or equal to 0.5.
fit <- glm(Survived ~ Age + Sex + Pclass, data = titanic, family = "binomial")
titanic[, "pred"] <- predict(fit, newdata = titanic, type="response") >= .5
(Note that this is the same as predicting “Survived” if the sum \(a_0 + a_1 x_1 + \dots\) is greater than or equal to 0, since \(\sigma(0) = 0.5\) and \(\sigma\) is increasing.)
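As a quick check of this equivalence, the sketch below uses base R's `plogis()`, which computes the logistic function \(\sigma\), on a few made-up values of the linear sum. The vector `z` is purely illustrative and is not taken from the Titanic fit.

```r
# Hypothetical values of the linear sum a_0 + a_1 x_1 + ... (not from the fitted model)
z <- c(-2, -0.5, 0, 0.5, 2)

# plogis() is the logistic function sigma(z) = 1 / (1 + exp(-z))
p <- plogis(z)

# Thresholding the probability at 0.5 ...
by_prob <- p >= 0.5
# ... gives the same decisions as thresholding the linear sum at 0
by_sum <- z >= 0

identical(by_prob, by_sum)
```

For the fitted model, `predict(fit, newdata = titanic, type = "link")` returns the linear sum rather than the probability, so comparing it with 0 should reproduce `titanic$pred`.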
We can compute how often our classifier is correct (its accuracy). As a point of comparison, the simplest possible classifier, the baseline, predicts the same thing (in our case, “did not survive”) every time. First, the accuracy of our classifier:
mean(titanic$pred == titanic$Survived)
## [1] 0.794814
And now we can compute how often the baseline classifier is correct:
mean(titanic$Survived == 0)
## [1] 0.6144307
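The baseline does not have to be hard-coded as class 0: one way to sketch it is to find the most common label and predict it for everyone. The vectors below are made-up toy data (not the Titanic data), and the names `actual`, `pred`, and `majority` are illustrative.

```r
# Toy labels and predictions, invented for illustration
actual <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
pred   <- c(0, 0, 1, 0, 0, 0, 1, 1, 0, 1)

# Accuracy of the classifier: fraction of predictions that match the labels
accuracy <- mean(pred == actual)

# Baseline: always predict the most common label
counts <- table(actual)
majority <- names(counts)[which.max(counts)]
baseline_accuracy <- mean(actual == as.numeric(majority))

c(accuracy = accuracy, baseline = baseline_accuracy)
```

Comparing the two numbers shows how much the classifier improves on always guessing the majority class.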
The false positive rate (FPR) is the fraction of the negative examples for which the model outputs “positive” (i.e., the model says “Survived” when the person did not survive):
\[FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}\]
sum(titanic$pred == 1 & titanic$Survived == 0)/sum(titanic$Survived == 0)
## [1] 0.1504587
The false negative rate (FNR) is the fraction of the positive examples for which the model outputs “negative” (i.e., the model says “Did not survive” when the person actually did survive):
\[FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}\]
sum(titanic$pred == 0 & titanic$Survived == 1)/sum(titanic$Survived == 1)
## [1] 0.2923977
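Both rates can also be read off a confusion matrix. The sketch below builds one with `table()` on made-up toy vectors (not the Titanic data); the names `actual`, `pred`, and `cm` are illustrative.

```r
# Toy labels and predictions, invented for illustration
actual <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
pred   <- c(0, 0, 1, 0, 0, 0, 1, 1, 0, 1)

# Rows are the model's outputs, columns are the true labels
cm <- table(pred = pred, actual = actual)

# FPR: model said "positive" among the actual negatives
fpr <- cm["1", "0"] / sum(cm[, "0"])
# FNR: model said "negative" among the actual positives
fnr <- cm["0", "1"] / sum(cm[, "1"])

c(FPR = fpr, FNR = fnr)
```

Each rate divides by a column total, that is, by the number of actual negatives or actual positives, which matches the denominators in the formulas above.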
The positive predictive value is the rate at which the model is correct when it says “positive”:
\[PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}\]
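Following the same pattern as the FPR and FNR computations, the PPV can be sketched as below. The vectors are made-up toy data (not the Titanic data), and the names `actual`, `pred`, and `ppv` are illustrative.

```r
# Toy labels and predictions, invented for illustration
actual <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
pred   <- c(0, 0, 1, 0, 0, 0, 1, 1, 0, 1)

# PPV: among the cases where the model said "positive",
# the fraction that were actually positive
ppv <- sum(pred == 1 & actual == 1) / sum(pred == 1)
ppv
```

With the data frame used above, the analogous expression would be `sum(titanic$pred == 1 & titanic$Survived == 1)/sum(titanic$pred == 1)`. Note that the denominator counts the model's positive outputs, not the actual positives.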
There is no hard-and-fast rule for which outcome to call “positive”, but usually “positive” would be the surprising, rare, or important event (e.g., the patient has a rare/significant disease).