SML201 Precept 5 Problem Set, Spring 2020

Problem 1: Extracting the Last Name with `strsplit` and `sapply`

Problem 1(a)

Write a function that takes in a data frame in the same format as Titanic and returns the percentage (i.e., a number between 0 and 100) of people who did not survive.

Problem 1(b)

You can use strsplit to split character strings into words. For example, the following splits a character string into words, assuming that the words are separated by a space:

words <- strsplit("Go Tigers", " ")[[1]]
words

## [1] "Go"     "Tigers"

words[1]

## [1] "Go"

words[2]

## [1] "Tigers"

Write a function that takes in the name of a person (as it appears in the Titanic dataset) and returns the person's last name.

Problem 1(c)

Add a column to the Titanic dataset that contains just the last name for each row. You should use sapply and your function from Problem 1(b).
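As a reminder of how sapply works, here is a minimal sketch on a toy character vector. The helper `first_word` and the vector `phrases` are made up for illustration and are not part of the Titanic data; adapting the idea to last names is the point of Problems 1(b) and 1(c).

```r
# Hypothetical helper for illustration only: returns the first word of a string
first_word <- function(s) {
  strsplit(s, " ")[[1]][1]
}

# Toy input, not taken from the Titanic dataset
phrases <- c("Go Tigers", "Hello World", "Old Nassau")

# sapply calls first_word once per element and collects the results in a vector;
# USE.NAMES = FALSE drops the names that sapply would otherwise attach
first_words <- sapply(phrases, first_word, USE.NAMES = FALSE)
first_words

## [1] "Go"    "Hello" "Old"
```

The result has one entry per input element, which is why it can be assigned directly to a new column of a data frame with the same number of rows.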

Problem 2: Predicting using the last name

Problem 2(a)

Use logistic regression to predict survival using the last name of a person. Are you able to obtain a better accuracy than the baseline classifier? Compute and compare the false positive rate (FPR), false negative rate (FNR), and the positive predictive value (PPV).

The definitions are as follows:

\[FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}\]

\[FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}\]

\[PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}\]
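To make the three definitions concrete, the sketch below computes them for a small made-up example. The vectors `actual` and `predicted` are invented for illustration (TRUE stands for "positive"); in the problem itself they would come from the Titanic data and your fitted model.

```r
# Toy data, made up for illustration: TRUE means "positive" (e.g., survived)
actual    <- c(TRUE, TRUE,  TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, TRUE, TRUE,  FALSE, FALSE, FALSE, TRUE)

false_pos <- sum(predicted & !actual)   # said "positive" and was wrong
false_neg <- sum(!predicted & actual)   # said "negative" and was wrong
true_pos  <- sum(predicted & actual)    # said "positive" and was correct

fpr <- false_pos / sum(!actual)    # denominator: number of negatives
fnr <- false_neg / sum(actual)     # denominator: number of positives
ppv <- true_pos / sum(predicted)   # denominator: number of "positive" predictions

c(FPR = fpr, FNR = fnr, PPV = ppv)

##  FPR  FNR  PPV 
## 0.25 0.25 0.75
```

Note that FPR and FNR divide by counts of actual outcomes, while PPV divides by the number of positive predictions.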

Problem 2(b)

Explain why many of the predicted probabilities in 2(a) are either 0.0 or 1.0. Can you come up with a theory that would explain why you can predict survival using the last name?

Problem 2(c) (Challenge)

Using your theory of why the last name appears to predict survival, design an experiment (i.e., specify which data frames to create using the existing data, which model to fit to them, which predictions to make, etc.) that would demonstrate the theory.

Problem 2(d) (Challenge: leave to the end)

Write R code to perform the experiment you designed in 2(c).

Problem 3: `ggplot`

Use ggplot to visualize the relationship between the fare and the class, as well as the sex of the passenger. Do you see any patterns?

Problem 4: `ggplot`

Use ggplot to visualize the relationship between the age of a passenger and their probability of survival, based on a model that uses the passenger’s age, sex, class, and fare, as well as based on a model that uses just the passenger’s age. Superimpose the two plots.

Problem 5: Insurance rates

Write a function that will compute the total profit (or loss) if the insurance agent uses a logistic regression model that uses the sex, class, and age of a person to predict whether they will survive, for the population of people on the Titanic (note: this is “cheating,” since in reality an agent would not have access to the data to fit their model; there are also other issues here, which we will discuss). The insurance agent uses the following procedure:

If the person’s probability of survival is less than p, turn them away.
If the person’s probability of survival is greater than or equal to p, sell them insurance for premium.
If the person does not end up surviving, pay benefit to the estate.
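The bookkeeping behind this procedure can be sketched on made-up numbers. Everything here is an assumption for illustration: the helper name `policy_profit`, the toy vectors, and the dollar amounts. In the actual problem the probabilities would come from your logistic regression model rather than being typed in by hand.

```r
# Made-up inputs for illustration; in the problem these come from a fitted model
prob_survive <- c(0.9, 0.2, 0.7, 0.6, 0.4)          # predicted survival probabilities
survived     <- c(TRUE, FALSE, FALSE, TRUE, TRUE)   # actual outcomes

# Hypothetical helper: total profit (negative means loss) under the procedure above
policy_profit <- function(prob_survive, survived, p, premium, benefit) {
  insured <- prob_survive >= p                    # everyone else is turned away
  income  <- premium * sum(insured)               # premiums collected
  payout  <- benefit * sum(insured & !survived)   # benefits paid to estates
  income - payout
}

policy_profit(prob_survive, survived, p = 0.5, premium = 100, benefit = 250)

## [1] 50
```

Here three people are insured (income 300) and one of them does not survive (payout 250), leaving a profit of 50.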

Find reasonable values of p, premium, and benefit that would yield a profit in the case of the Titanic. Do this by calling your function manually with different values.

(Again, note: this is an exercise, and not a realistic example. Most liners did not sink; no insurer would sell insurance if they knew the liner would sink. In fact, insurers often refuse to sell insurance when the probability of a bad outcome is not very small.)