There is nothing to submit on Monday this week. You will be graded on making reasonable effort toward completing the assignment in precept.

Problem 1: Functions review

Write a function that takes a data frame in the same format as the Titanic dataset and returns the percentage (i.e., a number between 0 and 100) of people who did not survive.
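As a warm-up, here is a minimal sketch of the general pattern on a tiny made-up data frame. It assumes the outcome is stored in a column named Survived and coded 1 for survived and 0 for did not survive; both the column name and the coding are assumptions, so check them against the actual dataset. The function name pct_not_survived is made up for illustration.

```r
# Hypothetical helper: percentage of rows whose Survived value is 0.
# Assumes a column called Survived coded 0/1 (an assumption, not confirmed here).
pct_not_survived <- function(df) {
  100 * mean(df$Survived == 0)
}

# Tiny made-up data frame, not the real Titanic data
toy <- data.frame(Survived = c(1, 0, 0, 0))
pct_not_survived(toy)
## [1] 75
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why no explicit counting is needed.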

Problem 2: String processing and functions

You can use strsplit to split character strings into words. For example, the following splits a character string into words, assuming that the words are separated by a single space:

words <- strsplit("Go Tigers", " ")[[1]]
words
## [1] "Go"     "Tigers"
words[1]
## [1] "Go"
words[2]
## [1] "Tigers"

Write a function that takes the name of a person (as it appears in the Titanic dataset) and returns the person's last name.
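One possible approach, sketched under the assumption that names are stored with the last name as the final word (for example "Mr. Owen Harris Braund"). This format is an assumption: if your copy of the data stores names as "Braund, Mr. Owen Harris", you would split on ", " and take the first piece instead. The function name get_last_name is made up for illustration.

```r
# Hypothetical helper: return the final space-separated word of a name.
# Assumes the last name is the last word of the string (an assumption).
get_last_name <- function(name) {
  words <- strsplit(name, " ")[[1]]
  words[length(words)]
}

get_last_name("Mr. Owen Harris Braund")
## [1] "Braund"
```

Note that this simple rule returns only the final word, so a multi-word last name would be cut short.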

Problem 3: sapply

Add a column to the Titanic dataset that contains the last name of each person. You should use sapply and your function from Problem 2.
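The following sketch shows how sapply applies a one-name-at-a-time function to a whole column. The data frame is made up, the column names Name and LastName are assumptions, and get_last_name is a hypothetical helper that assumes the last name is the final word of the name.

```r
# Hypothetical helper (assumes the last name is the final word of the name)
get_last_name <- function(name) {
  words <- strsplit(name, " ")[[1]]
  words[length(words)]
}

# Tiny made-up data frame; Name is an assumed column name
toy <- data.frame(
  Name = c("Mr. Owen Harris Braund", "Miss. Laina Heikkinen"),
  stringsAsFactors = FALSE  # keep names as character so strsplit works
)

# sapply calls get_last_name once per element and collects the results.
# USE.NAMES = FALSE stops sapply from labelling each result with its input string.
toy$LastName <- sapply(toy$Name, get_last_name, USE.NAMES = FALSE)
toy$LastName
## [1] "Braund"    "Heikkinen"
```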

Problem 4: Predicting using the last name

Use logistic regression to predict survival using the last name of a person. Are you able to obtain a better accuracy than the baseline classifier? Compute and compare the false positive rate (FPR), false negative rate (FNR), and the positive predictive value (PPV).
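To illustrate the mechanics only, here is a sketch on a small made-up dataset. It assumes columns named LastName and Survived (coded 0/1), a 0.5 threshold on the predicted probability, and that the "baseline classifier" means always predicting the more common outcome; all of these are assumptions, and the numbers below say nothing about the real data.

```r
# Made-up data: every last name has both survivors and non-survivors,
# so the logistic regression fits without perfect separation.
toy <- data.frame(
  LastName = c(rep("Smith", 4), rep("Jones", 4), rep("Brown", 3)),
  Survived = c(1, 1, 1, 0,   0, 0, 0, 1,   0, 0, 1),
  stringsAsFactors = FALSE
)

# glm treats the character column as a categorical predictor
fit <- glm(Survived ~ LastName, data = toy, family = binomial)

# Predicted survival probabilities, thresholded at 0.5 (assumed threshold)
prob <- predict(fit, type = "response")
pred <- as.numeric(prob > 0.5)

# Accuracy of the model and of the assumed baseline (always predict the majority class)
model_acc    <- mean(pred == toy$Survived)
baseline_acc <- max(mean(toy$Survived == 1), mean(toy$Survived == 0))
c(model = model_acc, baseline = baseline_acc)
```

Here both accuracies are computed on the same data used to fit the model, which tends to make the model look better than it would on new data.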

The definitions are as follows (note: there was a think-o in lecture; follow the definitions below):

\[FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives }}\]

\[FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives }}\]

\[PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}\]
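The three definitions translate directly into counts. The sketch below assumes predictions and true outcomes are both coded 1 for "positive" and 0 for "negative", and reads "# of negatives" and "# of positives" as counts of actual negatives and actual positives. The function name classification_rates and the vectors are made up for illustration.

```r
# Hypothetical helper: FPR, FNR and PPV from 0/1 predictions and 0/1 true outcomes
classification_rates <- function(pred, actual) {
  fp <- sum(pred == 1 & actual == 0)  # said "positive" and was wrong
  fn <- sum(pred == 0 & actual == 1)  # said "negative" and was wrong
  tp <- sum(pred == 1 & actual == 1)  # said "positive" and was correct
  c(FPR = fp / sum(actual == 0),
    FNR = fn / sum(actual == 1),
    PPV = tp / sum(pred == 1))
}

# Made-up example: 4 actual positives, 4 actual negatives
actual <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred   <- c(1, 1, 0, 1, 0, 0, 0, 1)
rates  <- classification_rates(pred, actual)
rates
##  FPR  FNR  PPV 
## 0.25 0.25 0.75
```

If a model never says "positive", the PPV denominator is zero and the result is NaN, which is worth checking for when comparing against a baseline that always predicts one class.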

Can you come up with a theory that would explain why you can predict survival using the last name?

Problem 6: ggplot

Use ggplot to visualize the relationship between the fare and the class, as well as the sex of the passenger. Do you see any patterns?
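One way to put three variables on a single plot is to map class to the x-axis, fare to the y-axis, and sex to a fill colour. The sketch below uses made-up values and assumes columns named Pclass, Sex, and Fare; a boxplot is just one reasonable choice of geometry, not the required one.

```r
library(ggplot2)

# Made-up values, not the real Titanic data; column names are assumptions
toy <- data.frame(
  Pclass = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Sex    = rep(c("female", "male"), times = 6),
  Fare   = c(80, 60, 95, 70, 25, 20, 30, 18, 9, 8, 12, 7)
)

# factor(Pclass) makes class a discrete axis; fill splits each class by sex
p <- ggplot(toy, aes(x = factor(Pclass), y = Fare, fill = Sex)) +
  geom_boxplot() +
  labs(x = "Passenger class", y = "Fare", fill = "Sex")
p
```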

Problem 7: ggplot

Use ggplot to visualize the relationship between the age of a passenger and their probability of survival, based on a model that uses the passenger’s age, sex, class, and fare, as well as based on a model that uses just the passenger’s age. Superimpose the two plots.
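A sketch of how two sets of predictions can be superimposed in one plot: fit both models, store each model's predicted probabilities as a column, and add one layer per model. The data below are simulated purely for illustration, and the column names (Age, Sex, Pclass, Fare, Survived) are assumptions.

```r
library(ggplot2)

# Simulated data, not the real Titanic data
set.seed(1)
n <- 200
toy <- data.frame(
  Age    = runif(n, 1, 70),
  Sex    = sample(c("female", "male"), n, replace = TRUE),
  Pclass = sample(1:3, n, replace = TRUE),
  Fare   = runif(n, 5, 100)
)
# Made-up relationship used only to generate 0/1 outcomes
lin <- 1 - 0.03 * toy$Age - 1.5 * (toy$Sex == "male") - 0.4 * toy$Pclass
toy$Survived <- rbinom(n, 1, plogis(lin))

# The two models being compared
fit_full <- glm(Survived ~ Age + Sex + factor(Pclass) + Fare,
                data = toy, family = binomial)
fit_age  <- glm(Survived ~ Age, data = toy, family = binomial)

# Predicted survival probabilities for each passenger under each model
toy$p_full <- predict(fit_full, type = "response")
toy$p_age  <- predict(fit_age, type = "response")

# One layer per model, sharing the Age axis; colour labels the model
p <- ggplot(toy, aes(x = Age)) +
  geom_point(aes(y = p_full, colour = "age, sex, class, fare")) +
  geom_line(aes(y = p_age, colour = "age only")) +
  labs(y = "Predicted probability of survival", colour = "Model")
p
```

The age-only model gives a single smooth curve, while the full model gives a cloud of points because passengers of the same age differ in sex, class, and fare.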

Problem 8: Insurance rates

Write a function that will compute the total profit (or loss) if the insurance agent uses a logistic regression model that uses the sex, class, and age of a person to predict whether they will survive, for the population of people on the Titanic (note: this is “cheating,” since in reality an agent would not have access to the data to fit their model; there are also other issues here, which we will discuss). The insurance agent uses the following procedure:

Find reasonable p, premium, and benefit which would yield a profit in the case of the Titanic.
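As a rough sketch only, the profit calculation can be organized as below. The rule used here is hypothetical and may not match the procedure intended in this assignment: it assumes the agent sells a policy only to people whose predicted probability of dying is below p, collects premium from each policyholder, and pays benefit for each policyholder who died. The function name insurance_profit, its arguments, and the numbers are all made up; in the assignment the probabilities would come from the logistic regression model.

```r
# Hypothetical profit rule (an assumption, see the text above):
#   insure a person only if predicted death probability < p;
#   collect `premium` per policy, pay `benefit` per insured person who died.
insurance_profit <- function(prob_death, died, p, premium, benefit) {
  insured <- prob_death < p
  sum(insured) * premium - sum(insured & died == 1) * benefit
}

# Made-up predicted probabilities and outcomes (1 = died, 0 = survived)
prob_death <- c(0.1, 0.2, 0.8, 0.9, 0.3)
died       <- c(0,   1,   1,   1,   0)

insurance_profit(prob_death, died, p = 0.5, premium = 100, benefit = 150)
## [1] 150
```

With these made-up numbers, three people are insured (300 in premiums) and one of them died (150 paid out), leaving a profit of 150.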

(Again, note: this is an exercise, not a realistic example. Most liners did not sink, and no insurer would sell insurance if they knew the liner would sink. In fact, insurers often refuse to sell insurance when the probability of a bad outcome is not very small.)