strsplit and sapply
Make a function that takes in a data frame in the same format as Titanic, and returns the percentage (i.e., a number between 0 and 100) of people who did not survive
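One possible shape for such a function is sketched below. It assumes the data frame has a column named `Survived` coded 1 for survived and 0 for died. That column name and coding are assumptions and may differ in your copy of the data. `percent_died` and `toy` are illustrative names.

```r
# Hypothetical helper: percentage (0 to 100) of passengers who did not survive.
# Assumes a column `Survived` coded 1 = survived, 0 = died.
percent_died <- function(df) {
  100 * mean(df$Survived == 0)
}

# Toy data frame standing in for the real data (3 of 4 passengers died)
toy <- data.frame(Survived = c(0, 0, 1, 0))
percent_died(toy)
## [1] 75
```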
You can use strsplit to split character strings into words. For example, the following splits a character string into words, assuming that the words are separated by a space:
words <- strsplit("Go Tigers", " ")[[1]]
words
## [1] "Go" "Tigers"
words[1]
## [1] "Go"
words[2]
## [1] "Tigers"
Write a function that takes in the name of a person (as it appears in the Titanic dataset) and returns the person's last name.
Add a column to the titanic dataset that contains just the last name for each row. You should use sapply and your function from Problem 1(b).
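The sketch below shows how sapply can apply a one-name function to a whole column. It assumes names are stored as "Last, Title First" in a column called `Name`. Both the format and the column name are assumptions about the data. `get_last_name`, the toy data frame, and the names in it are made up for illustration.

```r
# Hypothetical helper: everything before the first ", " is taken as the last name.
# Assumes names look like "Last, Title First".
get_last_name <- function(name) {
  strsplit(name, ", ", fixed = TRUE)[[1]][1]
}

# Toy data with made-up names
toy <- data.frame(
  Name = c("Smith, Mr. John", "Doe, Mrs. Jane", "Smith, Miss. Anna"),
  stringsAsFactors = FALSE
)

# sapply calls get_last_name once per element and collects the results
# into a character vector; USE.NAMES = FALSE drops the element names.
toy$LastName <- sapply(toy$Name, get_last_name, USE.NAMES = FALSE)
toy$LastName
## [1] "Smith" "Doe"   "Smith"
```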
Use logistic regression to predict survival using the last name of a person. Are you able to obtain a better accuracy than the baseline classifier? Compute and compare the false positive rate (FPR), false negative rate (FNR), and the positive predictive value (PPV).
The definitions are as follows:
\[FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}\]
\[FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}\]
\[PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}\]
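As a rough illustration of these definitions, the three rates can be computed from a vector of predicted labels and a vector of true labels. The sketch assumes both are coded 1 for "positive" and 0 for "negative". The function name and the toy vectors are made up.

```r
# Compute FPR, FNR and PPV from 0/1 predictions and 0/1 true labels.
classification_rates <- function(predicted, actual) {
  fp <- sum(predicted == 1 & actual == 0)  # said "positive" and was wrong
  fn <- sum(predicted == 0 & actual == 1)  # said "negative" and was wrong
  tp <- sum(predicted == 1 & actual == 1)  # said "positive" and was correct
  c(FPR = fp / sum(actual == 0),
    FNR = fn / sum(actual == 1),
    PPV = tp / sum(predicted == 1))
}

# Toy labels: 4 positives and 4 negatives
actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 0, 1, 1, 0, 0, 0, 1)
classification_rates(predicted, actual)
##  FPR  FNR  PPV 
## 0.25 0.25 0.75 
```

If the model never says "positive", the PPV denominator is zero and the result is NaN, so that case may need separate handling.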
Explain why many of the predicted probabilities in 2(a) are either 0.0 or 1.0. Can you come up with a theory that would explain why you can predict survival using the last name?
Using your theory of why the last name appears to predict survival, design an experiment (i.e., specify what data frames to create using the existing data, which model to fit to them, which predictions to make, etc.) that would demonstrate the theory.
Write R code to perform the experiment you designed in 2(c).
ggplot
Use ggplot to visualize the relationship between the fare and the class, as well as the sex of the passenger. Do you see any patterns?
ggplot
Use ggplot to visualize the relationship between the age of a passenger and their probability of survival, based on a model that uses the passenger’s age, sex, class, and fare, as well as based on a model that uses just the passenger’s age. Superimpose the two plots.
Write a function that will compute the total profit (or loss) if the insurance agent uses a logistic regression model that uses the sex, class, and age of a person to predict whether they will survive, for the population of people on the Titanic (note: this is “cheating,” since in reality an agent would not have access to the data to fit their model; there are also other issues here, which we will discuss). The insurance agent uses the following procedure:
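If the predicted probability of survival is below p, turn them away.
If the predicted probability of survival is above p, sell them insurance for premium.
If an insured person does not survive, pay benefit to the estate.
Find reasonable values of p, premium, and benefit which would yield a profit in the case of the Titanic. You should do that by calling your function manually with different values.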
(Again, note: this is an exercise, and not a realistic example. Most liners did not sink; no insurer would sell insurance if they knew the liner would sink. In fact, insurers often refuse to sell insurance when the probability of a bad outcome is not very small.)
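The procedure described above can be sketched in two pieces: a function that does the bookkeeping for a given vector of predicted survival probabilities, and a wrapper that fits the model. This is one possible reading, not a reference solution. The column names `Survived`, `Sex`, `Pclass`, and `Age`, the 0/1 coding of `Survived`, treating a probability exactly equal to p as insurable, and dropping rows with missing values are all assumptions. `policy_profit` and `titanic_profit` are made-up names.

```r
# Profit (or loss) given predicted survival probabilities and true outcomes.
policy_profit <- function(prob_survive, survived, p, premium, benefit) {
  insured <- prob_survive >= p             # everyone else is turned away
  payouts <- sum(insured & survived == 0)  # insured people who did not survive
  premium * sum(insured) - benefit * payouts
}

# Wrapper that fits the logistic regression; column names are assumptions.
titanic_profit <- function(df, p, premium, benefit) {
  # Drop rows with missing values so predictions line up with outcomes
  df <- df[complete.cases(df[, c("Survived", "Sex", "Pclass", "Age")]), ]
  fit <- glm(Survived ~ Sex + factor(Pclass) + Age,
             data = df, family = binomial)
  prob <- predict(fit, type = "response")  # fitted P(survive) for each row
  policy_profit(prob, df$Survived, p, premium, benefit)
}

# Hand-made probabilities to check the arithmetic:
# three people are insured (3 * 100), one of them dies (1 * 250).
policy_profit(prob_survive = c(0.9, 0.8, 0.3, 0.7),
              survived     = c(1,   0,   0,   1),
              p = 0.5, premium = 100, benefit = 250)
## [1] 50
```

With the real data, a call such as `titanic_profit(titanic, p = 0.8, premium = 100, benefit = 250)` could then be repeated with different values; the numbers here are placeholders, not suggested answers.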