Question 1 (20 pts)

Read in the following dataset:

brains <- read.csv("http://guerzhoy.princeton.edu/201s20/brains.csv")

The dataset contains information about the weight of the brain, the weight of the body, the gestation period, and the litter size for a large number of mammals.

You should make a model that would do as good a job as possible (using the tools we use in the course) of predicting the weight of the brain of a new mammal, and estimate how well you’d be able to predict the weight of the brain of a new mammal. Follow the instructions below.

Question 1(a) (7 pts)

Split the dataset into training/test/validation sets. You may use a 70%/15%/15% Training/Test/Validation split. Explain why it is necessary to split the data into three sets. Explain why you would want the training set to be large. Explain why you would want the test and validation sets to be large.

Solution

idx <- sample(1:nrow(brains))
train.idx <- idx[1:round(0.7*length(idx))]
valid.idx <- idx[(round(0.7*length(idx)) + 1):(round(0.85*length(idx)))]
test.idx <-  idx[(round(0.85*length(idx)) + 1):length(idx)]

brains.train <- brains[train.idx, ]
brains.valid <- brains[valid.idx, ]
brains.test <- brains[test.idx, ]
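
(Note that the split is random, so the exact numbers below will vary a little from run to run; calling set.seed() with any fixed value before the sample() call would make the split, and hence the results, reproducible.)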

We want the training set to be large to avoid overfitting: with more training data, the model’s coefficients are estimated more reliably, which leads to more accurate predictions on new data.

We want the validation set to be large so that the validation-set estimates of performance are reliable, and we select the model that will actually generalize the best.

We want the test set to be large so that our estimate of how well we’ll do on new data will be more accurate.

Question 1(b) (6 pts)

Use at least three graphs to illustrate how you would use data visualization to make the best model possible. Use ggplot. Include the code you used in the R file that you submit. State what you see in the graphs and how you would use that.

Solution

library(tidyverse)
ggplot(data = brains.train) + 
  geom_point(mapping = aes(x = Body, y = Brain)) + 
  geom_smooth(mapping = aes(x = Body, y = Brain), method = "lm")

It seems that a transformation might be appropriate:

ggplot(data = brains.train) + 
  geom_point(mapping = aes(x = Body, y = Brain)) + 
  geom_smooth(mapping = aes(x = Body, y = Brain), method = "lm") + 
  scale_x_log10()

ggplot(data = brains.train) + 
  geom_point(mapping = aes(x = Body, y = Brain)) + 
  geom_smooth(mapping = aes(x = Body, y = Brain), method = "lm") + 
  scale_x_log10() + 
  scale_y_log10()

It seems the log(Brain) vs. log(Body) trend is linear.

What about the litter size?

ggplot(data = brains.train) + 
  geom_point(mapping = aes(x = Litter, y = Brain)) + 
  geom_smooth(mapping = aes(x = Litter, y = Brain), method = "lm") + 
  scale_x_log10() + 
  scale_y_log10()

ggplot(data = brains.train) + 
  geom_point(mapping = aes(x = Litter, y = Brain)) + 
  geom_smooth(mapping = aes(x = Litter, y = Brain), method = "lm") + 
  scale_y_log10()

ggplot(data = brains.train) + 
  geom_point(mapping = aes(x = Litter, y = Brain)) + 
  geom_smooth(mapping = aes(x = Litter, y = Brain), method = "lm") 

We definitely need to use log(Brain); it is unclear whether we should transform Litter.
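
It would also be natural to look at Gestation before building models in 1(c). A sketch of one such plot, on the log-log scale that worked well above:

ggplot(data = brains.train) + 
  geom_point(mapping = aes(x = Gestation, y = Brain)) + 
  geom_smooth(mapping = aes(x = Gestation, y = Brain), method = "lm") + 
  scale_x_log10() + 
  scale_y_log10()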

(N.B., we required at least three plots when grading. Three good plots were sufficient.)

Question 1(c) (7 pts)

Use what you found in 1(b), as well as other techniques we learned in the course, to make the best possible model you can to predict the weight of the brain. How well do you expect your model to predict the weight of the brain of a new mammal species? Give as precise and clear an answer as possible.

Solution

We’ll try different models and compare them using the RMSE on the validation set. Since the models predict log(Brain), the RMSE function exponentiates the predictions to get back to grams before computing the error.

RMSE <- function(fit, data){
  pred <- exp(predict(fit, newdata = data))
  sqrt(mean((data$Brain - pred)^2))
}
fit.1 <- lm(log(Brain) ~ log(Body), data = brains.train)
RMSE(fit.1, data = brains.valid)
## [1] 375.6541
fit.2 <- lm(log(Brain) ~ log(Body) + log(Litter), data = brains.train)
RMSE(fit.2, data = brains.valid)
## [1] 350.2811
fit.3 <- lm(log(Brain) ~ log(Body) + log(Gestation) + log(Litter), data = brains.train)
RMSE(fit.3, data = brains.valid)
## [1] 275.763
fit.4 <- lm(log(Brain) ~ log(Body) + log(Gestation) + log(Litter) + Litter, data = brains.train)
RMSE(fit.4, data = brains.valid)
## [1] 273.2065

It looks like fit.4 performs the best. We’ll evaluate it on the test set.

RMSE(fit.4, data = brains.test)
## [1] 431.9914

The test-set RMSE is about 432 grams, so we expect a typical prediction to be off by roughly 430 grams, and usually by no more than about twice that (roughly 860 grams).
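
To actually predict the brain weight of a new mammal species with fit.4, we would predict on the log scale and exponentiate. A minimal sketch, with made-up covariate values (in whatever units brains.csv uses):

new.mammal <- data.frame(Body = 5, Gestation = 60, Litter = 3)  # hypothetical values
exp(predict(fit.4, newdata = new.mammal))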

Question 2 (10 pts)

Suppose the average amount of sleep per night in the student population is 7 hours. (That is, if we ask everyone for the average amount of sleep they get per night and then average the responses, we’d get 7 hours).

Question 2(a) (2 pts)

What is a plausible value of the standard deviation of the amount of sleep per night in the student population? Explain your reasoning as precisely as possible.

Solution

Recall that the interval \((\mu-2\sigma, \mu+2\sigma)\) contains about 95% of the data. We’ll guess that 95% of the population gets between 4 and 10 hours of sleep; that interval has width \(4\sigma = 10 - 4 = 6\) hours, implying a standard deviation of 1.5 hours. (Side note: one study reports a mean of 6.42 hours and a standard deviation of 1.23 hours.)

Question 2(b) (2 pts)

We intend to survey 200 randomly selected students, record each student’s average amount of sleep per night, and compute the average of the responses. What is the probability that this average is greater than 7.2 hours? Use pnorm(..., mean = 0, sd = 1) to obtain the answer. Show your work.

Solution

We’ll use the fact that \(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\sim N(0, 1)\), so that

\(P(\bar{X} > 7.2) = P\left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} > \frac{7.2-\mu}{\sigma/\sqrt{n}}\right),\)

which we can compute with pnorm:

z <- (7.2-7)/(1.5/sqrt(200))
1 - pnorm(q = z, mean = 0, sd = 1)
## [1] 0.02967322
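
As a sanity check only (the question requires using pnorm(..., mean = 0, sd = 1)), the unstandardized call should give the same probability:

pnorm(q = 7.2, mean = 7, sd = 1.5/sqrt(200), lower.tail = FALSE)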

Question 2(c) (3 pts)

Answer Problem 2(b) again, but now you must use rnorm(..., mean = 0, sd = 1).

Solution

We’ll estimate the probability of getting a value greater than z by drawing a large sample from \(N(0, 1)\) and computing the fraction of draws that exceed z.

zzzs <- rnorm(n = 1000000, mean = 0, sd = 1)
mean(zzzs > z)
## [1] 0.029648

Question 2(d) (3 pts)

There are at least two quite different ways to answer 2(c). Answer 2(c) again, using rnorm(..., mean = 0, sd = 1) again, but now you must use it in a way that’s substantially different from the way you used rnorm(..., mean = 0, sd = 1) in 2(c). Show your work. Explain how your approaches in 2(c) and 2(d) are different.

Solution

Now, let’s use rnorm(..., mean = 0, sd = 1) to generate fake samples of 200 students from \(N(\mu, \sigma)\), take the mean of each fake sample, and compute \(P(\bar{X} > 7.2)\) more directly:

get.sample <- function(){
  mean(7 + 1.5*rnorm(n = 200, mean = 0, sd = 1))
}

means <- replicate(100000, get.sample())
mean(means > 7.2)
## [1] 0.02906

Another possible answer is to sample \(\bar{X}\) directly from \(N(\mu, \frac{\sigma}{\sqrt{n}})\), using the fact that if \(z\sim N(0, 1)\), then \(\mu + \frac{\sigma}{\sqrt{n}}z \sim N(\mu, \frac{\sigma}{\sqrt{n}})\):

z.means <- 7 + (1.5/sqrt(200)) * rnorm(n = 100000, mean = 0, sd = 1)
mean(z.means > 7.2)
## [1] 0.02902

Question 3 (20 pts)

Consider the following model and predictions:

titanic <- read.csv("http://guerzhoy.princeton.edu/201s20/titanic.csv")
fit <- glm(Survived ~ Sex + Age + Pclass, data = titanic )
titanic$pred <- predict(fit, newdata = titanic) > 0.5
titanic.male <- titanic %>% filter(Sex == "male")
titanic.female <- titanic %>% filter(Sex == "female")
mean(titanic.male$pred == titanic.male$Survived)       # 0.8097731
## [1] 0.8097731
mean(titanic.female$pred == titanic.female$Survived)   # 0.7579618
## [1] 0.7579618

Question 3(a) (8 pts)

Test the null hypothesis that this classifier does not violate accuracy parity on the training set titanic. State your conclusions. You may use either fake-data simulation or an analytic method.

Solution

There are many possible approaches here. We’ll use fake-data simulation: how often would the difference in accuracies be as large as or larger than 5.2% (the difference we observe) if the true CCR (correct classification rate) were the same for both sexes? For the common CCR, we’ll use the overall accuracy on the training set:

perf <- mean(titanic$pred == titanic$Survived)
perf
## [1] 0.7914318

Let’s record the number of men and women:

n.male <- sum(titanic$Sex == "male")       # 573
n.female <- sum(titanic$Sex == "female")   # 314
fake.diff <- function(n.male, n.female, perf){
  male.ccr <- mean(rbinom(n = n.male, size = 1, prob = perf))
  female.ccr <- mean(rbinom(n = n.female, size = 1, prob = perf))
  abs(male.ccr - female.ccr)
}

diffs <- replicate(n = 10000, fake.diff(n.male, n.female, perf))
mean(diffs > 0.052)
## [1] 0.075

Since the p-value (about 0.075) is above 0.05, we cannot reject the null hypothesis that the CCRs for male and female passengers are the same.
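
An analytic alternative (a sketch of one option; other reasonable tests are also acceptable) is a two-sample test for equality of proportions:

x.male <- sum(titanic.male$pred == titanic.male$Survived)
x.female <- sum(titanic.female$pred == titanic.female$Survived)
prop.test(x = c(x.male, x.female), n = c(n.male, n.female))

A p-value above 0.05 here would lead to the same conclusion.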

Question 3(b) (12 pts) (Challenge)

The accuracies for male and female passengers are different (though you may have concluded in 3(a) that there is not enough evidence to reject the null hypothesis that they are the same and the differences are due to chance).

Write a function with the signature predict.fair.acc <- function(fit, sex, age, pclass, male.acc, female.acc) which takes in the model fit (which is the same as what’s displayed above), the sex, age, and class of a new passenger, and the accuracy of fit for male and female passengers (so 0.81 and 0.76 in our case), and does the best job of outputting predictions that would satisfy accuracy parity while also being fairly accurate.

Use simulation to check to what extent your function achieves the objectives set out above. Explain your work.

Solution

We can’t really improve the accuracy for female passengers, so we’ll instead make the accuracy worse for male passengers, as suggested in the Fairness in Machine Learning lecture. The male accuracy needs to drop by about 5.2 percentage points, from 81% to 76%. If we intentionally “flip” the prediction for a fraction \(p\) of male passengers, we’ll get the correct answer \(0.81(1-p) + 0.19p\) of the time. To get 0.76, we need

\(0.76 = 0.81(1-p) + 0.19p\)

\(0.76 = 0.81 - 0.62p\)

\(p = (0.76-0.81)/(0.19-0.81) \approx 0.08\)

predict.fair.acc <- function(fit, sex, age, pclass, male.acc, female.acc){
  # fraction of male predictions to flip so that the male accuracy drops to the female accuracy
  p <- (female.acc-male.acc)/(1-2*male.acc)
  pred <- predict(fit, newdata = data.frame(Sex = sex, Age = age, Pclass = pclass))
  
  if(sex == "male"){
    if(rbinom(n = 1, size = 1, prob = p) == 1){
      pred < 0.5 # flip the prediction
    }else{
      pred > 0.5
    }
  }else{
    pred > 0.5
  }
}

We can check how this is working out:

male.acc <- mean(titanic.male$pred == titanic.male$Survived)       # 0.8097731
female.acc <- mean(titanic.female$pred == titanic.female$Survived)   # 0.7579618

titanic <- titanic %>% group_by(Name) %>%   
                       mutate(fair.pred = predict.fair.acc(fit, Sex, Age, Pclass, male.acc, female.acc))

titanic.male <- titanic %>% filter(Sex == "male")
titanic.female <- titanic %>% filter(Sex == "female")

mean(titanic.male$fair.pred == titanic.male$Survived)
## [1] 0.7626527
mean(titanic.female$fair.pred == titanic.female$Survived)
## [1] 0.7579618

(Note that we had to group_by(Name) in order to get a prediction for each row: that way, predict.fair.acc is called once per row, with Sex, Age, and Pclass each of length 1. A nicer option would be to use rowwise, as in the solutions to Precept 10 Problem 5, but we haven’t done that in lecture.)
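
Because the flipping is random, a single run gives only a noisy check. To see that the male accuracy really does land near 0.76 on average, we can repeat the computation; a quick sketch (the number of repetitions is arbitrary):

male.fair.acc <- replicate(100, {
  flipped <- titanic %>% group_by(Name) %>% 
                         mutate(fair.pred = predict.fair.acc(fit, Sex, Age, Pclass, male.acc, female.acc)) %>% 
                         filter(Sex == "male")
  mean(flipped$fair.pred == flipped$Survived)
})
mean(male.fair.acc)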

Question 4 (15 pts)

Suppose we are exploring a dataset similar to the finches dataset. Use ggplot to illustrate how the probability of a type S error depends on the magnitude of the difference between the population means. Explain your work.

Solution

A Type S error occurs when a statistically significant estimate of the effect (in the case of the finches dataset, the difference between the means for the two years) has a sign that’s the opposite of the sign of the true effect.

fake.finches <- function(mean.a, mean.b){
  finches.a <- rnorm(n = 89, mean = mean.a, sd = 1.5)
  finches.b <- rnorm(n = 89, mean = mean.b, sd = 1.5)
  c(mean(finches.b) - mean(finches.a), t.test(finches.a, finches.b)$p.value)
}

prob.type.s <- function(diff){
  mean.a <- 0
  mean.b <- diff #so now the true difference is diff
  diffs.pval <- replicate(n = 10000, fake.finches(mean.a, mean.b))
  diffs <- diffs.pval[1, ]
  pvals <- diffs.pval[2, ]
  mean( ((diffs > 0) != (diff > 0)) & pvals < 0.05 ) # fraction of type S errors
}

d <- c(0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5)
prob.ts <- sapply(X = d, FUN = prob.type.s)
df <- data.frame(d = d, prob.ts = prob.ts)
ggplot(df, mapping = aes(x = d, y = prob.ts)) + 
  geom_point() + 
  geom_line()
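
The resulting plot should show that the probability of a significant, wrong-sign estimate is highest when the true difference is close to zero (approaching half of the 5% significance level, i.e. about 2.5%) and falls toward zero as the magnitude of the true difference grows, since a large true effect makes a significant estimate with the wrong sign very unlikely.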

Question 5 (10 pts)

Suppose that the results of a poll of likely voters is that 51% intend to vote DEM and 49% intend to vote REP. What is the smallest sample size needed with this kind of result (a 51%/49% split) in order to reject the null-hypothesis that the true probability of intending to vote DEM is 50%?

Show your work to the extent that a grader can understand your thinking exactly. It is not necessary for your work to consist just of code in order to get full credit for this problem – some manual exploration is allowed.

Solution

We’ll use the normal approximation for the sample proportion of DEM voters under the null hypothesis: \(\bar{X}\sim N(0.5,\ 0.5/\sqrt{n})\).

The (two-sided) p-value for a poll of 1000 people is

n <- 1000
2*(pnorm(q = 0.49, mean = 0.5, sd = 0.5/sqrt(n)))
## [1] 0.5270893

We can try increasing n until we get a p-value smaller than 5%:

n <- 5000
2*(pnorm(q = 0.49, mean = 0.5, sd = 0.5/sqrt(n)))
## [1] 0.1572992
n <- 7000
2*(pnorm(q = 0.49, mean = 0.5, sd = 0.5/sqrt(n)))
## [1] 0.09426431
n <- 10000
2*(pnorm(q = 0.49, mean = 0.5, sd = 0.5/sqrt(n)))
## [1] 0.04550026
n <- 9600
2*(pnorm(q = 0.49, mean = 0.5, sd = 0.5/sqrt(n)))
## [1] 0.05004352

So at n = 9,600 the p-value is still just above 5%; we need a sample of just over 9,600 people for the difference to be significant at the 5% level.
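
As a check, we can solve for the required sample size directly: significance requires \(\frac{0.5 - 0.49}{0.5/\sqrt{n}} \ge 1.96\), i.e. \(n \ge (1.96 \cdot 0.5 / 0.01)^2 \approx 9604\).

ceiling((qnorm(0.975) * 0.5 / 0.01)^2)   # approximately 9604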

Question 6 (10 pts)

A CDC survey from 2015 found that people who sleep less have lower incomes. One possible causal story is that sleeping more causes higher incomes because job performance improves with more sleep. Outline other hypotheses that explain the data, and briefly explain each one. Your hypotheses should demonstrate four different possible causal structures. The causal structures must be substantively different, rather than being closely analogous.

Solution

  • Reverse causation: having more money results in less stress and more sleep
  • Common cause: exercise causes people to sleep better, and improves their job performance
  • Indirect causation: sleeping more causes less tiredness, which increases people’s willingness to take risks, which results in more successful business ventures by the good sleepers
  • Coincidence: region A of the country has lots of natural resources but its late-night TV is really bad, so people in region A sleep more than people in the resource-poor region B, which happens to have a one-in-a-generation talent hosting a very popular late-night show