---
title: "Precept 4 Problem Set"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(gapminder)
```

There is nothing to submit on Monday this week. You will be graded on making reasonable effort toward completing the assignment in precept.

### Problem 1: Functions review

Make a function that takes in a data frame in the same format as Titanic and returns the percentage (i.e., a number between 0 and 100) of people who did not survive.

### Problem 2: String processing and functions

You can use `strsplit` to split character strings into words. For example, the following splits a character string into words, assuming that the words are separated by a space:

```{r}
words <- strsplit("Go Tigers", " ")[[1]]
words
words[1]
words[2]
```

Write a function that takes in the name of a person (as it appears in the Titanic dataset) and returns the person's last name.

### Problem 3: sapply

Add a column to the Titanic dataset that contains the last name of each person. You should use `sapply` and your function from Problem 2.

### Problem 4: Predicting using the last name

Use logistic regression to predict survival using the last name of a person. Are you able to obtain a better accuracy than the baseline classifier?

Compute and compare the false positive rate (FPR), false negative rate (FNR), and the positive predictive value (PPV). The definitions are as follows (note: there was a think-o in lecture; follow the definitions below):

$$FPR = \frac{\text{# of times the model said "positive" and was wrong}}{\text{# of negatives}}$$

$$FNR = \frac{\text{# of times the model said "negative" and was wrong}}{\text{# of positives}}$$

$$PPV = \frac{\text{# of times the model said "positive" and was correct}}{\text{# of times the model said "positive"}}$$

Can you come up with a theory that would explain why you can predict survival using the last name?

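As a sanity check on the three definitions above, here is a minimal sketch that computes them on a small made-up example. The vectors `actual` and `predicted` are hypothetical toy data, not taken from the Titanic dataset; 1 stands for "positive" and 0 for "negative".

```{r}
# Toy labels and predictions (hypothetical, for illustration only).
# 1 = "positive", 0 = "negative".
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 1, 0, 1, 0, 0, 1, 1)

# FPR: model said "positive" and was wrong, divided by the number of negatives
fpr <- sum(predicted == 1 & actual == 0) / sum(actual == 0)

# FNR: model said "negative" and was wrong, divided by the number of positives
fnr <- sum(predicted == 0 & actual == 1) / sum(actual == 1)

# PPV: model said "positive" and was correct, divided by the number of "positive" predictions
ppv <- sum(predicted == 1 & actual == 1) / sum(predicted == 1)

c(FPR = fpr, FNR = fnr, PPV = ppv)
```

Note that the FPR and FNR denominators count the true labels, while the PPV denominator counts the model's predictions.
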
### Problem 6: ggplot

Use `ggplot` to visualize the relationship between the fare and the class, as well as the sex of the passenger. Do you see any patterns?

### Problem 7: ggplot

Use `ggplot` to visualize the relationship between the age of a passenger and their probability of survival, based on a model that uses the passenger's age, sex, class, and fare, as well as based on a model that uses just the passenger's age. Superimpose the two plots.

### Problem 8: Insurance rates

Write a function that computes the total profit (or loss) if the insurance agent uses a logistic regression model that uses the sex, class, and age of a person to predict whether they will survive, for the population of people on the Titanic. (Note: this is "cheating," since in reality an agent would not have access to the data to fit their model; there are also other issues here, which we will discuss.)

The insurance agent uses the following procedure:

* If the person's probability of survival is less than `p`, turn them away.
* If the person's probability of survival is greater than or equal to `p`, sell them insurance for `premium`.
* If the person does not end up surviving, pay `benefit` to the estate.

Find reasonable `p`, `premium`, and `benefit` which would yield a profit in the case of the Titanic.

(Again, note: this is an exercise, and not a realistic example. Most liners did not sink; no insurer would sell insurance if they knew the liner would sink. In fact, insurers often refuse to sell insurance when the probability of a bad outcome is not very small.)
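
To make the bookkeeping in the procedure above concrete, here is a minimal sketch on made-up numbers. The function name `policy_profit` and the vectors `prob_survive` and `survived` are hypothetical and chosen only for illustration; in the actual problem the probabilities would come from your fitted logistic regression model rather than being typed in by hand.

```{r}
# Hypothetical helper: total profit under the threshold policy described above.
# prob_survive: predicted survival probabilities
# survived:     1 if the person survived, 0 otherwise
policy_profit <- function(prob_survive, survived, p, premium, benefit) {
  insured <- prob_survive >= p                       # people who are sold a policy
  income  <- premium * sum(insured)                  # each insured person pays the premium
  payout  <- benefit * sum(insured & survived == 0)  # benefit paid for insured non-survivors
  income - payout
}

# Toy inputs (made-up numbers, not Titanic data)
prob_survive <- c(0.9, 0.8, 0.3, 0.6, 0.1)
survived     <- c(1,   0,   0,   1,   0)

policy_profit(prob_survive, survived, p = 0.5, premium = 100, benefit = 150)
```

With these toy inputs, three people are insured and one of them does not survive, so the agent collects 300 and pays out 150. Raising `p` turns more people away, which lowers both the income and the expected payout.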