Run the following to load a dataset that records various data about mammals, including brain weight. The brain weight is given in grams, the body weight in kilograms, and the gestation period in days.

brains <- read.csv("http://guerzhoy.princeton.edu/201s20/brains.csv")

Problem 1: Linear Regression

Part 1(a)

Suppose you want to use linear regression to investigate the relationship between brain weight and body weight. Find a way to transform the variables that would allow you to do that. (Hint: try taking the log of both variables. See Tuesday’s lecture, where we explored the relationship between GDP per capita and life expectancy.) Use a scatterplot to assess whether the relationship is linear.

Solution

A plot where we take the log of both variables works nicely.

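library(tidyverse)  # provides ggplot(); also needed later for %>% and mutate()
# Scatterplot on the log-log scale with a fitted least-squares line
ggplot(brains, mapping = aes(x = log(Body), y = log(Brain))) +
  geom_point() +
  geom_smooth(method = "lm")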
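To see why taking the log of both variables can help, here is a minimal sketch on simulated data (not the `brains` dataset). It assumes a power-law relationship with multiplicative noise; the constants 10 and 0.75 and the names `body` and `brain` are made up for illustration.

```r
# Simulated power-law data; every constant here is an assumption for illustration.
set.seed(1)
body  <- exp(runif(200, min = -3, max = 8))          # weights spanning several orders of magnitude
brain <- 10 * body^0.75 * exp(rnorm(200, sd = 0.5))  # power law with multiplicative noise

# Fit a straight line on the raw scale and on the log-log scale.
fit_raw <- lm(brain ~ body)
fit_log <- lm(log(brain) ~ log(body))

# R-squared is a rough indication of how well a straight line describes each plot.
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared

# On the log-log scale the slope estimates the power-law exponent.
slope_log <- unname(coef(fit_log)[2])
print(c(r2_raw = r2_raw, r2_log = r2_log, slope_log = slope_log))
```

If the data really follow a power law, the log-log slope should land close to the exponent used in the simulation, and the log-log scatterplot should look like a straight band.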

Part 1(b)

Produce the diagnostic plots. Display and investigate outliers, if any. (See Tuesday’s lecture on the relationship between GDP per capita and life expectancy.)

Solution

Let’s now produce the diagnostic plots:

library(ggfortify)
fit <- lm(log(Brain) ~ log(Body), data = brains)
autoplot(fit)

Let’s look at the outliers in more detail:

brains[c(58, 25, 48),]
##     X      Species Brain   Body Gestation Litter
## 58 58        Lemur    22    2.1       135      1
## 25 25      Dolphin  1600  160.0       360      1
## 48 48 Hippopotamus   590 1400.0       240      1

Interestingly, dolphins and hippos are closely related phylogenetically (but the residuals have different signs, so there is no big insight here).

Removing lemurs (and hippos) as big outliers might make sense, but would be tough to justify.

Here is how to remove a data point (row 48, the hippopotamus):

brains.no.hippos <- brains[-48,]
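One way to judge whether removing a point matters is to refit without it and compare the coefficients. The following is a minimal sketch on simulated data, with row 48 made into an outlier on purpose; `d`, `fit_all` and `fit_drop` are hypothetical names, not part of the original solution.

```r
# Simulated data with one planted outlier; all values are made up for illustration.
set.seed(3)
d <- data.frame(x = rnorm(96))
d$y <- 2 + 0.7 * d$x + rnorm(96, sd = 0.5)
d$y[48] <- d$y[48] - 3                    # make row 48 an outlier

fit_all  <- lm(y ~ x, data = d)
fit_drop <- lm(y ~ x, data = d[-48, ])    # a negative index drops row 48

# Coefficients side by side: a large change would mean the point is influential.
coef_table <- rbind(all = coef(fit_all), without_48 = coef(fit_drop))
print(coef_table)
```

If the two rows of the table are close, the conclusion does not hinge on that single observation, which is an argument for keeping it.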

There are not too many outliers, and the Q-Q plot is approximately linear, so we can run the regression.

Part 1(c)

Run the regression. What conclusions can you draw?

Solution

summary(fit)
## 
## Call:
## lm(formula = log(Brain) ~ log(Body), data = brains)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16218 -0.44640 -0.04525  0.35076  1.83561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.33235    0.07325   31.84   <2e-16 ***
## log(Body)    0.71919    0.02037   35.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5781 on 94 degrees of freedom
## Multiple R-squared:  0.9299, Adjusted R-squared:  0.9291 
## F-statistic:  1246 on 1 and 94 DF,  p-value: < 2.2e-16

There is a positive association between body weight and brain weight: the estimated coefficient of log(Body) is 0.72, and its p-value is very small (< 2e-16), so we can conclude that the coefficient is not zero.
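Because both variables are logged, the fitted line can be read on the original scale as a power law, Brain ≈ exp(intercept) × Body^slope. The sketch below works through that arithmetic using the estimates printed in the output above; the variable names are chosen here for illustration.

```r
# Numbers copied from the summary(fit) output above.
slope     <- 0.71919
slope_se  <- 0.02037
intercept <- 2.33235
df_resid  <- 94

# 95% confidence interval for the slope, using the t distribution.
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

# Doubling body weight multiplies the predicted brain weight by 2^slope.
doubling_factor <- 2^slope

# Typical (geometric-mean) brain weight in grams predicted at Body = 1 kg.
scale_constant <- exp(intercept)

print(slope_ci)
print(doubling_factor)
print(scale_constant)
```

In words: under this fit, a mammal twice as heavy is predicted to have a brain roughly 1.65 times as heavy, not twice as heavy, because the exponent is below 1. With the fitted model in memory, `confint(fit)` gives the same interval directly.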

Problem 2: Failing to meet assumptions

Suppose we want to know whether the size of the litter is related to body weight. Produce diagnostic plots for any variable transformations you can think of. Do not expect the linear regression model assumptions to hold.

Solution

fit <- lm(Litter ~ log(Body), data = brains)
autoplot(fit)
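The fit above tries one transformation. As a rough sketch of how several candidates could be compared in one go, the code below fits three transformations of a count-like response and runs a Shapiro-Wilk test on each set of residuals. The data are simulated, and the names `litter`, `log_body` and `fits` are assumptions for illustration. The test is only a numeric companion to the diagnostic plots, not a replacement for them.

```r
# Simulated count-like response; the generating model is made up for illustration.
set.seed(4)
log_body <- rnorm(96, mean = 2, sd = 3)
litter   <- 1 + rpois(96, lambda = exp(0.5 - 0.3 * log_body))
d <- data.frame(litter = litter, log_body = log_body)

# Candidate transformations of the response.
fits <- list(
  raw  = lm(litter ~ log_body, data = d),
  log  = lm(log(litter) ~ log_body, data = d),
  sqrt = lm(sqrt(litter) ~ log_body, data = d)
)

# Shapiro-Wilk p-value for the residuals of each fit;
# small values suggest the residuals are not normally distributed.
shapiro_p <- sapply(fits, function(f) shapiro.test(residuals(f))$p.value)
print(shapiro_p)
```

For small integer counts like litter sizes, it would not be surprising if none of the transformations produced well-behaved residuals, which is the point of this problem.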

Problem 3: Litter size as categorical

Treat rounded litter size as categorical (you will need to convert litter size to categorical). Plot the appropriate diagnostics. Are the model assumptions satisfied?

Solution

library(tidyverse)
ggplot(brains) + 
  geom_boxplot(mapping = aes(y = log(Body), x = as.factor(round(Litter))))

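Within each litter-size group, the distribution of log(Body) is generally not symmetrical, and the spread differs from group to group, so the variance of the residuals is not constant. The model assumptions are not satisfied.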
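The boxplots give a visual impression of unequal spread. As a hedged numeric complement, the sketch below computes the standard deviation within each group and runs Bartlett's test of equal variances. The data are simulated with deliberately unequal spread; the group labels and sizes are made up. Bartlett's test is itself sensitive to non-normality, so it should be read alongside the plots.

```r
# Simulated groups with deliberately unequal spread; all values are made up.
set.seed(5)
g <- factor(rep(c("1", "2", "3"), times = c(60, 20, 16)))
group_mean <- c(3, 1, 0)[as.integer(g)]
group_sd   <- c(3, 1.5, 0.5)[as.integer(g)]
y <- rnorm(96, mean = group_mean, sd = group_sd)

# Standard deviation of the response within each group.
sd_by_group <- tapply(y, g, sd)
print(sd_by_group)

# Bartlett's test: the null hypothesis is that all group variances are equal.
bt <- bartlett.test(y ~ g)
print(bt$p.value)
```

For the data in the text, the analogous call would use `log(Body)` as the response and `as.factor(round(Litter))` as the grouping variable, provided each group has at least two observations.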

Problem 4: Litter size as ever more categorical

Create a new variable: litter size is greater than 5. Check the model assumptions. Now, use lm to test the hypothesis that the body weight is related to the litter size being greater than 5. What conclusions can you draw?

Solution

brains <- brains %>% mutate(L5 = Litter > 5)
fit <- lm(log(Body) ~ L5, data = brains)
ggplot(brains) + 
  geom_boxplot(mapping = aes(x = L5, y = log(Body)))

Seems OK!

summary(fit)
## 
## Call:
## lm(formula = log(Body) ~ L5, data = brains)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3599 -1.5085  0.0177  2.2654  5.6520 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2854     0.3018   7.573 2.49e-11 ***
## L5TRUE       -2.4779     1.2072  -2.053   0.0429 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.863 on 94 degrees of freedom
## Multiple R-squared:  0.0429, Adjusted R-squared:  0.03272 
## F-statistic: 4.214 on 1 and 94 DF,  p-value: 0.04288

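We can reject (barely, at the 5% level: p = 0.043) the hypothesis that there is no relationship between having a litter greater than 5 and body weight. The estimated coefficient of L5TRUE is -2.48, so mammals with litters greater than 5 tend to have a lower log(Body).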
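A linear model with a single two-level predictor performs the same test as a two-sample t-test with pooled variance. The sketch below checks this on simulated data; the group sizes, means and the names `big_litter` and `log_body` are assumptions for illustration.

```r
# Simulated two-group data; sizes and means are made up for illustration.
set.seed(6)
big_litter <- rep(c(FALSE, TRUE), times = c(90, 6))
log_body   <- rnorm(96, mean = ifelse(big_litter, 0, 2), sd = 2.8)

# p-value for the group coefficient from lm().
fit_sim <- lm(log_body ~ big_litter)
p_lm <- summary(fit_sim)$coefficients["big_litterTRUE", "Pr(>|t|)"]

# p-value from a two-sample t-test that assumes equal variances.
tt  <- t.test(log_body ~ big_litter, var.equal = TRUE)
p_t <- tt$p.value

print(c(lm = p_lm, t_test = p_t))
```

The two p-values agree. Dropping `var.equal = TRUE` gives Welch's t-test instead, which does not assume equal variances and will generally give a somewhat different p-value.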

Problem 5: the F-test

Create another new variable with the categories: litter size up to 2, litter size greater than 2 and up to 7, litter size over 7. Produce appropriate diagnostic plots, and use an F-test to compute a p-value. What is the null hypothesis? What is the conclusion?

Solution

brains[, "Litter.size"] <- "Small"
brains$Litter.size[brains$Litter > 2] <- "Medium"
brains$Litter.size[brains$Litter > 7] <- "Large"

ggplot(brains) + 
  geom_boxplot(mapping = aes(x = Litter.size, y = log(Body)))

summary(lm(log(Body)~Litter.size, data = brains))
## 
## Call:
## lm(formula = log(Body) ~ Litter.size, data = brains)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1810 -1.7738  0.4722  1.7939  4.8840 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         0.3908     1.5319   0.255   0.7992  
## Litter.sizeMedium   0.2467     1.5944   0.155   0.8774  
## Litter.sizeSmall    2.7742     1.5717   1.765   0.0808 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.653 on 93 degrees of freedom
## Multiple R-squared:  0.1867, Adjusted R-squared:  0.1692 
## F-statistic: 10.68 on 2 and 93 DF,  p-value: 6.697e-05

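The null hypothesis is that the mean of log(Body) is the same in all three litter-size categories. The F-test gives a p-value of 6.697e-05, so we can reject the hypothesis that there is no relationship between body weight and litter size.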
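As a sketch of an alternative way to set this up, `cut()` can build the three categories in one step with the levels in a natural order, and `anova()` reports the F-test directly. The data below are simulated; the names `litter`, `size` and `log_body` and the group means are assumptions for illustration.

```r
# Simulated litter sizes and response; all values are made up for illustration.
set.seed(7)
litter <- sample(1:10, size = 96, replace = TRUE)

# Intervals (0, 2], (2, 7], (7, Inf), matching the thresholds used in the text.
size <- cut(litter, breaks = c(0, 2, 7, Inf),
            labels = c("Small", "Medium", "Large"))

# Response whose mean differs by category.
log_body <- rnorm(96, mean = c(3, 1, 0)[as.integer(size)], sd = 2.5)

# F-test of the null hypothesis that all three group means are equal.
fit_sim   <- lm(log_body ~ size)
aov_table <- anova(fit_sim)
p_f <- aov_table[["Pr(>F)"]][1]

# The same p-value, recomputed from the F-statistic that summary() reports.
fs <- summary(fit_sim)$fstatistic
p_from_summary <- unname(pf(fs["value"], fs["numdf"], fs["dendf"],
                            lower.tail = FALSE))
print(c(anova = p_f, summary = p_from_summary))
```

With the levels ordered Small, Medium, Large, the baseline in the coefficient table becomes Small rather than the alphabetically first level, which also puts the boxplots in a more readable order.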