Run the following to load a dataset that records various data about mammals, including brain weight. The brain weight is given in grams, the body weight in kilograms, and the gestation period in days.
library(Sleuth3)
brains <- case0902
(Note: if install.packages("Sleuth3") doesn’t work, try install.packages("Sleuth3", repos="http://R-Forge.R-project.org") instead.)
Suppose you want to use linear regression to investigate the relationship between brain weight and body weight. Find a way to transform the variables that would allow you to do that. (Hint: try taking the log of both variables. See Tuesday’s lecture, where we explored the relationship between GDP per capita and life expectancy.) Use a scatterplot to assess whether the relationship is linear.
A plot where we take the log of both variables works nicely.
library(ggplot2)
ggplot(brains, mapping = aes(x = log(Body), y = log(Brain))) +
  geom_point() +
  geom_smooth(method = "lm")
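To see why taking the log of both variables helps, here is a minimal sketch on simulated data (not the `case0902` data). If y follows a power law y = a * x^b with multiplicative noise, then log(y) is linear in log(x) with slope b and intercept log(a). The values a = 10 and b = 0.75, and all variable names, are made up for illustration.

```r
# Simulated power-law data (hypothetical values, not from case0902)
set.seed(1)
n <- 200
body  <- exp(runif(n, min = -3, max = 8))          # spans several orders of magnitude
brain <- 10 * body^0.75 * exp(rnorm(n, sd = 0.3))  # power law with multiplicative noise
sim <- data.frame(body = body, brain = brain)

# Fit a straight line on the raw scale and on the log-log scale
fit_raw <- lm(brain ~ body, data = sim)
fit_log <- lm(log(brain) ~ log(body), data = sim)

# The log-log slope estimates the exponent b; the intercept estimates log(a)
coef(fit_log)

# Compare how much variance each straight-line fit explains
summary(fit_raw)$r.squared
summary(fit_log)$r.squared
```

Plotting `log(brain)` against `log(body)` for `sim` should show the same kind of straight-line cloud as the plot above.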
Produce the diagnostic plots. Display and investigate outliers, if any. (See Tuesday’s lecture on the relationship between GDP per capita and life expectancy.)
Let’s now produce the diagnostic plots:
library(ggfortify)
fit <- lm(log(Brain) ~ log(Body), data = brains)
autoplot(fit)
Let’s look at the outliers in more detail:
brains[c(58, 25, 48),]
##         Species Brain   Body Gestation Litter
## 58        Lemur    22    2.1       135      1
## 25      Dolphin  1600  160.0       360      1
## 48 Hippopotamus   590 1400.0       240      1
Interestingly, dolphins and hippos are closely related phylogenetically (but the residuals have different signs, so there is no big insight here).
Removing lemurs (and hippos) as large outliers might make sense, but would be tough to justify.
(Here is how to remove data points; a negative index drops that row:
brains.no.hippos <- brains[-48, ]
)
There are not too many outliers, and the Q-Q plot is approximately linear, so we can run the regression.
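Rather than reading row numbers off the diagnostic plots, unusual points can also be flagged programmatically. The following is a minimal sketch on simulated data with one planted outlier (not the `case0902` data). The cutoff of 3 for standardized residuals is a common rule of thumb and an assumption here, not something prescribed by the exercise.

```r
# Simulated data with one planted outlier (hypothetical, not case0902)
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.5)
y[10] <- y[10] + 5                 # plant an outlier at row 10
sim <- data.frame(x = x, y = y)

fit_sim <- lm(y ~ x, data = sim)

# Standardized residuals; rows with |residual| > 3 deserve a closer look
std_res <- rstandard(fit_sim)
flagged <- which(abs(std_res) > 3)

# Display the flagged rows, as done for the lemur, dolphin, and hippo above
sim[flagged, ]
```

With the real data, the same two lines applied to `fit` would give candidate rows to inspect with `brains[flagged, ]`.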
Run the regression. What conclusions can you draw?
summary(fit)
##
## Call:
## lm(formula = log(Brain) ~ log(Body), data = brains)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.16218 -0.44640 -0.04525  0.35076  1.83561
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.33235    0.07325   31.84   <2e-16 ***
## log(Body)    0.71919    0.02037   35.30   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5781 on 94 degrees of freedom
## Multiple R-squared: 0.9299, Adjusted R-squared: 0.9291
## F-statistic: 1246 on 1 and 94 DF, p-value: < 2.2e-16
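There is a positive association between body weight and brain weight: the estimated coefficient of log(Body) is 0.72, and its p-value is very small (< 2e-16), so we can reject the hypothesis that the coefficient is zero. On the original scale, this means brain weight grows roughly as body weight to the power 0.72. The model explains about 93% of the variance in log brain weight.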
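The slope on the log-log scale can be translated back into a multiplicative statement. The sketch below uses only the estimate, standard error, and degrees of freedom printed in the summary above; the variable names are made up for illustration.

```r
# Values copied from the summary(fit) output above
slope <- 0.71919
se    <- 0.02037
df    <- 94

# 95% confidence interval for the slope on the log-log scale
ci <- slope + c(-1, 1) * qt(0.975, df) * se
ci

# Multiplicative change in predicted brain weight when body weight doubles
2^slope
2^ci
```

Here `2^slope` is about 1.65: under this model, doubling body weight is associated with a roughly 65% increase in predicted (median) brain weight. With the fitted model available, `confint(fit)` gives the same interval for the slope directly.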
Suppose we want to know whether the size of the litter is related to body weight. Produce diagnostic plots for any variable transformations you can think of. Do not expect the linear regression model assumptions to hold.
fit <- lm(Litter ~ log(Body), data = brains)
autoplot(fit)
### Problem 3: Litter size as categorical
Treat rounded litter size as categorical (you will need to convert litter size to categorical). Plot the appropriate diagnostics. Are the model assumptions satisfied?
library(tigerstats)
bwplot(log(Body) ~ as.factor(round(Litter)), data = brains)
The distributions within the litter-size groups are not generally symmetrical, and the variance of the residuals is not constant across groups.
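The exercise asks for rounded litter size as a category. Here is a minimal sketch of that conversion, together with a quick numerical check of the spread in each group. The data are made up (hypothetical stand-ins for `brains$Litter` and `log(brains$Body)`), and the column names are assumptions.

```r
# Hypothetical data standing in for brains$Litter and log(brains$Body)
set.seed(3)
litter   <- c(1, 1.2, 1.6, 2.4, 3.6, 4.1, 5.7, 1, 2, 3.3, 4.8, 1.1)
log_body <- rnorm(length(litter))
sim <- data.frame(litter = litter, log_body = log_body)

# Round first, then convert to a factor so each integer is one category
sim$litter_cat <- factor(round(sim$litter))
table(sim$litter_cat)

# Per-group standard deviations; very different values suggest non-constant variance
# (groups with a single observation give NA)
tapply(sim$log_body, sim$litter_cat, sd)
```

Without `round()`, every distinct litter value would become its own category, so many groups would contain only one or two species.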
Create a new variable: litter size is greater than 5. Check the model assumptions. Now, use lm to test the hypothesis that the body weight is related to the litter size being greater than 5. What conclusions can you draw?
brains <- brains %>% mutate(L5 = Litter > 5)
fit <- lm(log(Body) ~ L5, data = brains)
bwplot(L5 ~ log(Body), data = brains)
Judging by the box plots, the model assumptions seem OK.
summary(fit)
##
## Call:
## lm(formula = log(Body) ~ L5, data = brains)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.3599 -1.5085  0.0177  2.2654  5.6520
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.2854     0.3018   7.573 2.49e-11 ***
## L5TRUE       -2.4779     1.2072  -2.053   0.0429 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.863 on 94 degrees of freedom
## Multiple R-squared: 0.0429, Adjusted R-squared: 0.03272
## F-statistic: 4.214 on 1 and 94 DF, p-value: 0.04288
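We can reject (barely, p = 0.043) the hypothesis that there is no relationship between having a litter greater than 5 and body weight. Species with litters larger than 5 have a mean log body weight about 2.48 lower than the others.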
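A regression on a single TRUE/FALSE predictor is the same test as a pooled-variance two-sample t-test. The following minimal sketch checks this on simulated data (not the `case0902` data); the group sizes and means are made up for illustration.

```r
# Simulated two-group data (hypothetical, not case0902)
set.seed(4)
group <- rep(c(FALSE, TRUE), times = c(40, 8))
y <- rnorm(length(group), mean = ifelse(group, 0, 2), sd = 2.5)
sim <- data.frame(y = y, group = group)

# Regression on a logical predictor: the groupTRUE row tests the difference in means
fit_sim <- lm(y ~ group, data = sim)
p_lm <- summary(fit_sim)$coefficients["groupTRUE", "Pr(>|t|)"]

# Pooled-variance two-sample t-test; stats:: avoids any masked version of t.test
p_t <- stats::t.test(y ~ group, data = sim, var.equal = TRUE)$p.value

# The two p-values agree
c(lm = p_lm, t.test = p_t)
```

This also shows which assumptions matter here: roughly normal residuals within each group and equal variance in the two groups.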
Create another new variable with the categories: litter size up to 2, litter size up to 7, litter size over 7. Produce appropriate diagnostic plots, and use an F-test to compute a p-value. What is the null hypothesis? What is the conclusion?
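# Start with everyone as "Small", then overwrite the larger litters
brains[, "Litter.size"] <- "Small"
brains$Litter.size[brains$Litter > 2] <- "Medium"
brains$Litter.size[brains$Litter > 7] <- "Large"
bwplot(Litter.size ~ log(Body), data = brains)
# The last line of the summary reports the overall F-test
summary(lm(log(Body) ~ Litter.size, data = brains))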
##
## Call:
## lm(formula = log(Body) ~ Litter.size, data = brains)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.1810 -1.7738  0.4722  1.7939  4.8840
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)         0.3908     1.5319   0.255   0.7992
## Litter.sizeMedium   0.2467     1.5944   0.155   0.8774
## Litter.sizeSmall    2.7742     1.5717   1.765   0.0808 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.653 on 93 degrees of freedom
## Multiple R-squared: 0.1867, Adjusted R-squared: 0.1692
## F-statistic: 10.68 on 2 and 93 DF, p-value: 6.697e-05
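The null hypothesis is that the mean log body weight is the same in all three litter-size categories. The F-test gives F = 10.68 on 2 and 93 degrees of freedom (p = 6.7e-05), so we can reject the hypothesis that there is no relationship between body weight and litter size. Note that the individual coefficient p-values only compare each category with the baseline ("Large"), which is why they can be large even though the overall F-test is significant.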
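As an alternative way to build the three categories and to see the F-test on its own, here is a minimal sketch using `cut()` and `anova()` on simulated data (not the `case0902` data). The simulated relationship and the column names are assumptions made for illustration.

```r
# Simulated data (hypothetical, not case0902)
set.seed(5)
litter   <- runif(90, min = 1, max = 10)
log_body <- 3 - 0.4 * litter + rnorm(90, sd = 2)
sim <- data.frame(litter = litter, log_body = log_body)

# cut() builds the three categories in one step; intervals are (lo, hi],
# so this matches "up to 2", "over 2 and up to 7", "over 7"
sim$litter_size <- cut(sim$litter,
                       breaks = c(-Inf, 2, 7, Inf),
                       labels = c("Small", "Medium", "Large"))

# One-way ANOVA F-test: the null hypothesis is equal mean log_body in all groups
fit_sim <- lm(log_body ~ litter_size, data = sim)
anova(fit_sim)

# The same F statistic appears on the last line of summary(fit_sim)
summary(fit_sim)$fstatistic
```

Because `cut()` returns a factor with levels in the order given, "Small" becomes the baseline category here, whereas a character column is ordered alphabetically and makes "Large" the baseline. The F-test is the same either way.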