strsplit
and sapply
Make a function that takes in a data frame in the same format as Titanic, and returns the percentage (i.e., a number between 0 and 100) of people who did not survive
precent.n.survived <- function(titanic){
100*(1-mean(titanic$Survived))
}
titanic <- read.csv("http://guerzhoy.princeton.edu/201s20/titanic.csv")
precent.n.survived(titanic)
## [1] 61.44307
Learning goal: use the fact that the mean of a vector of logicals is the proportion of TRUE
s there. (And functions.)
You can use strsplit
to split character strings into words. For example, the following splits a character string into words assuming that the words are separated by a space
words <- strsplit("Go Tigers", " ")[[1]]
words
## [1] "Go" "Tigers"
words[1]
## [1] "Go"
words[2]
## [1] "Tigers"
Write a function that takes in the name of a person (as it appears in the Titanic dataset) and returns the persons last name
get.last.name <- function(full.name){
words <- strsplit(as.character(full.name), " ")[[1]]
words[length(words)]
}
Learning goal: understand the specifications for a new function, and use that function to achieve a task
Add a column to the titanic
dataset that contains just the last name for each row. You should use sapply
and your function from Problem 1(b)
titanic <- titanic %>% mutate(Last.Name = sapply(X = Name, FUN = get.last.name))
Learning goal: use sapply
to compute a new column of a dataframe by applying a function to each element of an existing column. That is: combine using sapply
to compute a function of each element of a vector with extracting and adding columns from/to data frames.
Use logistic regression to predict survival using the last name of a person. Are you able to obtain a better accuracy than the baseline classifier? Compute and compare the false positive rate (FPR), false negative rate (FNR), and the positive predictive value (PPV).
The definitions are as follows:
\[FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}\]
\[FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}\]
\[PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}\]
Can you come up with a theory that would explain why you can predict survival using the last name?
We’ll build the model using the entire dataset.
fit <- glm(Survived ~ Last.Name, family = binomial, data = titanic)
pred <- predict(fit, newdata = titanic, type = "response") > .5
mean(pred == titanic$Survived)
## [1] 0.9289741
Most of what’s going on is that the last name identifies just one person in the dataset, for whom the answer is known. So it’s not surprising that the predictions are good.
Learning goal: understand that when predicting using a categorical variable, we are predicting the average for each category, and apply this fact to a new situation. Be able to run logistic regression in R.
FPR <- sum(pred == T & titanic$Survived == 0)/sum(titanic$Survived==0)
FNR <- sum(pred == F & titanic$Survived == 1)/sum(titanic$Survived==1)
PPV <- sum(pred == T & titanic$Survived == 1)/sum(pred == T)
Learning goal: operate on logical vectors with boolean operators. Translate the definitions of FPR/FNR/PPV/… into code.
One idea is to try to predict using the last name, but for different people than the ones in the training set.
names.test <- titanic %>% group_by(Last.Name) %>%
mutate(num = n()) %>%
filter(num >= 2) %>%
summarize(Name = Name[1]) %>%
arrange(Last.Name)
names.train <- titanic %>% group_by(Last.Name) %>%
mutate(num = n()) %>%
filter(num >= 2) %>%
summarize(Name = Name[2]) %>%
arrange(Last.Name)
test.set <- titanic %>% filter(Name %in% names.test$Name)
train.set <- titanic %>% filter(Name %in% names.train$Name)
fit <- glm(Survived ~ Last.Name, family = binomial, data = train.set)
## Warning: glm.fit: algorithm did not converge
mean((predict(fit, newdata = train.set, type = "response") > 0.5) == train.set$Survived)
## [1] 1
mean((predict(fit, newdata = test.set, type = "response") > 0.5) == test.set$Survived)
## [1] 0.6766917
Note that there actually is a correlation between the survival of people with the same name, since they are usually related.
Learning goal: demonstrate an intuitive understanding of overfitting.
ggplot
Use ggplot
to visualize the relationship between the fare and the class, as well as the sex of the passenger. Do you see any patterns?
ggplot(data = titanic, mapping = aes(x = Pclass, y = Fare)) +
geom_smooth(mapping = aes(color = Sex), method = "loess")
Learning goal: display the relationship between three variables on one plot using ggplot
.
We see that the average fare for male passengers was smaller than for female passengers, in first class. If you look further, you’ll find that the fare is actually recorded per family.
ggplot
Use ggplot
to visualize the relationship between the age of a passenger and their probability of survival, based on a model the uses the passenger’s age, sex, class, and fare, as well as based on a model that uses just the passenger’s age. Superimpose the two plots.
fit.big <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = titanic)
fit.small <- glm(Survived ~ Age, family = binomial, data = titanic)
titanic[, "pred.big"] <- predict(fit.big, newdata = titanic, type = "response")
titanic[, "pred.small"] <- predict(fit.small, newdata = titanic, type = "response")
ggplot(data = titanic, mapping = aes(x = Age)) +
geom_smooth(mapping = aes(y = pred.big), method = "loess", color = "blue") +
geom_smooth(mapping = aes(y = pred.small), method = "loess", color = "red")
Learning goal: use ggplot
to display the predictions of a model; display several curves on one plot using ggplot
.
Write a function that will compute the total profit (or loss) if the if the insurance agent uses a logistic regression model that takes that uses the sex, class, and age of a person to predict whether they will survive, for the population of people on the titanic (note: this is “cheating,” since in reality an agent would not have access to the data to fit their model; there are also other issues here, which we will discuss). The policy the insurance agent uses the following procedure:
p
, turn them away.p
, sell them insurance for premium
.benefit
to the estate.Find reasonable p
, premium
, and benefit
which would yield a profit in the case of the Titanic. You should do that by just trying to call your function manually using different values.
(Again, note: this is an exercise, and not a realistic example. Most liners did not sink; no insurer would sell insurance if they knew the liner would sink. In fact, insurers often refuse to sell insurance when the probability of a bad outcome is not very small.)
total.profit <- function(fit, p, premium, benefit, dat){
dat[, "pred"] <- predict(fit, newdata=titanic, type = "response")
n.accepted <- sum(dat$pred > p)
n.paid.benefit <- sum( (dat$pred > p) & (dat$Survived == 0) )
total.revenue <- n.accepted * premium
total.expediture <- n.paid.benefit * benefit
total.revenue - total.expediture
}
p <- 0.90
premium <- 1600
benefit <- 50000
fit <- glm(Survived ~ Age + Sex + Pclass, family=binomial, data = titanic)
total.profit(fit, p, premium, benefit, titanic)
## [1] 16800
Learning goal: use techniques similar to the ones used for computing FPR/PPV/etc. to compute a quantity of interest. Understand the use of quantities like FPR etc. to compute the cost of a classifier.