Linear Regression assumptions

Gapminder in 1982

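library(gapminder)  # the gapminder data set
library(dplyr)      # %>%, filter(), mutate()
library(ggplot2)    # ggplot(), geom_point(), geom_smooth()
# Keep only the 1982 observations, one row per country.
gap <- gapminder %>% filter(year == 1982)
ggplot(gap, mapping = aes(x = gdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm")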

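This fails the first assumption: the relationship between GDP per capita and life expectancy is not linear. Taking the log of GDP per capita straightens it out considerably: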

gap <- gap %>% mutate(logGdpPercap = log(gdpPercap))
ggplot(gap, mapping = aes(x = logGdpPercap, y = lifeExp)) + 
  geom_point() + 
  geom_smooth(method = "lm")
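As a rough numerical check of the visual impression, one can compare how much of the variance in life expectancy a straight line explains before and after the transformation. This is only a sketch: it rebuilds `gap` in base R so that it runs on its own, and the names `fit_raw` and `fit_log` are made up for the illustration.

```r
library(gapminder)

# Rebuild the 1982 data with base R so this block is self-contained.
gap <- subset(gapminder, year == 1982)
gap$logGdpPercap <- log(gap$gdpPercap)

# Fit a straight line on the raw scale and on the log scale.
fit_raw <- lm(lifeExp ~ gdpPercap, data = gap)
fit_log <- lm(lifeExp ~ logGdpPercap, data = gap)

# R-squared: share of the variance in lifeExp explained by each line.
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log = r2_log)
```

A higher R-squared for the log fit is consistent with the plots, but it does not replace looking at the residuals.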

Let’s fit a model and plot the diagnostics.

library(ggfortify)
fit <- lm(lifeExp ~ logGdpPercap, data = gap)
autoplot(fit)

Which countries seem like outliers? The largest errors (residuals) are for

gap[c(113, 4, 46),]
## # A tibble: 3 x 7
##   country      continent  year lifeExp     pop gdpPercap logGdpPercap
##   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>        <dbl>
## 1 Sierra Leone Africa     1982    38.4 3464522     1465.         7.29
## 2 Angola       Africa     1982    39.9 7016384     2757.         7.92
## 3 Gabon        Africa     1982    56.6  753874    15113.         9.62
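The row numbers above are read off the labelled points in the diagnostic plots. A minimal sketch of how the same kind of list could be pulled out directly, assuming we rank countries by the absolute value of their standardized residuals (the ranking rule and the name `worst` are choices made for this illustration, not necessarily what the plot labels use):

```r
library(gapminder)

# Rebuild the data and the model so this block runs on its own.
gap <- subset(gapminder, year == 1982)
gap$logGdpPercap <- log(gap$gdpPercap)
fit <- lm(lifeExp ~ logGdpPercap, data = gap)

# Standardized residuals: each residual divided by its estimated
# standard deviation, so values are comparable across observations.
std_res <- rstandard(fit)

# Row positions of the three largest residuals in absolute value.
worst <- order(abs(std_res), decreasing = TRUE)[1:3]
worst

# Show those countries next to their residuals.
cbind(gap[worst, c("country", "lifeExp", "gdpPercap")],
      std_res = std_res[worst])
```

The positions in `worst` can be compared with the row numbers used in `gap[c(113, 4, 46),]`.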

The countries with a lot of leverage (i.e., potential influence on the estimate) are

gap[c(110, 88),]
## # A tibble: 2 x 7
##   country      continent  year lifeExp      pop gdpPercap logGdpPercap
##   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
## 1 Saudi Arabia Asia       1982    63.0 11254672    33693.        10.4 
## 2 Myanmar      Asia       1982    58.1 34680442      424          6.05
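Leverage depends only on the predictor: a country has high leverage when its log GDP per capita is far from the average, which fits the two rows above sitting at the extremes of `logGdpPercap`. The sketch below computes leverage with `hatvalues()` and, for comparison, Cook's distance, which combines leverage with the size of the residual. The names `lev`, `cook`, and `top_lev` are introduced here for illustration only.

```r
library(gapminder)

# Rebuild the data and the model so this block runs on its own.
gap <- subset(gapminder, year == 1982)
gap$logGdpPercap <- log(gap$gdpPercap)
fit <- lm(lifeExp ~ logGdpPercap, data = gap)

# Leverage: diagonal of the hat matrix, one value per country.
lev <- hatvalues(fit)

# The leverages always sum to the number of coefficients (here 2),
# so the average leverage is 2 / n.
sum(lev)
mean(lev)

# Row positions of the two highest-leverage countries.
top_lev <- order(lev, decreasing = TRUE)[1:2]
gap[top_lev, c("country", "gdpPercap", "logGdpPercap")]

# Cook's distance measures how much the fit moves if a row is dropped.
cook <- cooks.distance(fit)
head(sort(cook, decreasing = TRUE), 3)
```

A point with high leverage only changes the estimate much if its residual is also large, which is why the two measures can rank countries differently.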

If we are satisfied that the assumptions are roughly correct, we can test the hypothesis that higher GDP per capita is associated with higher life expectancy:

summary(fit)
## 
## Call:
## lm(formula = lifeExp ~ logGdpPercap, data = gap)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7709  -2.8743   0.4812   3.6039  14.6986 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.6505     3.3463  -0.194    0.846    
## logGdpPercap   7.4936     0.3990  18.780   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.762 on 140 degrees of freedom
## Multiple R-squared:  0.7158, Adjusted R-squared:  0.7138 
## F-statistic: 352.7 on 1 and 140 DF,  p-value: < 2.2e-16

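The slope is positive and its p-value is below 2e-16, so we can reject the null hypothesis that there is no linear association between log GDP per capita and life expectancy.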
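Because the predictor is a natural log, the slope is easier to read per doubling of GDP per capita: doubling adds log(2) to `logGdpPercap`, so the fitted life expectancy changes by 7.4936 × log(2), or about 5.2 years. A small sketch of that arithmetic, with a confidence interval on the same scale (the name `per_doubling` is made up for the illustration):

```r
library(gapminder)

# Rebuild the data and the model so this block runs on its own.
gap <- subset(gapminder, year == 1982)
gap$logGdpPercap <- log(gap$gdpPercap)
fit <- lm(lifeExp ~ logGdpPercap, data = gap)

# Estimated slope on the log scale.
slope <- coef(fit)[["logGdpPercap"]]

# Doubling GDP per capita adds log(2) to the predictor, so the fitted
# life expectancy changes by slope * log(2) years.
per_doubling <- slope * log(2)
per_doubling

# 95% confidence interval for the slope, rescaled to "per doubling".
confint(fit, "logGdpPercap", level = 0.95) * log(2)
```

This describes an association across countries in 1982, not the effect of making any one country richer.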

What if we add the continent as a predictor? This is called “controlling” for a variable.

fit.cont <- lm(lifeExp ~ logGdpPercap + continent, data = gap)
autoplot(fit.cont)

We will peek at the summary, but we don’t get to test two hypotheses on the same data:

summary(fit.cont)
## 
## Call:
## lm(formula = lifeExp ~ logGdpPercap + continent, data = gap)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6618  -2.3396   0.1655   2.5571  12.4680 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        12.2639     3.6536   3.357  0.00102 ** 
## logGdpPercap        5.3415     0.4871  10.966  < 2e-16 ***
## continentAmericas   7.2994     1.3956   5.230 6.25e-07 ***
## continentAsia       6.4727     1.1945   5.419 2.65e-07 ***
## continentEurope     9.5629     1.5685   6.097 1.05e-08 ***
## continentOceania    9.5336     3.8200   2.496  0.01377 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.033 on 136 degrees of freedom
## Multiple R-squared:  0.7894, Adjusted R-squared:  0.7817 
## F-statistic:   102 on 5 and 136 DF,  p-value: < 2.2e-16
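There is no `continentAfrica` row in the output above. With R's default treatment contrasts, the first level of a factor is the reference category and is absorbed into the intercept, so each continent coefficient is a difference from Africa at the same log GDP per capita. For example, the `continentEurope` estimate of 9.5629 says the fitted life expectancy for a European country is about 9.6 years higher than for an African country with the same log GDP per capita. The sketch below shows where this comes from; `continent2` and `fit.eur` are hypothetical names, and Europe is an arbitrary alternative baseline.

```r
library(gapminder)

# Rebuild the data and the model so this block runs on its own.
gap <- subset(gapminder, year == 1982)
gap$logGdpPercap <- log(gap$gdpPercap)
fit.cont <- lm(lifeExp ~ logGdpPercap + continent, data = gap)

# The first factor level is the reference category; it gets no
# coefficient of its own.
levels(gap$continent)

# Each other continent becomes a 0/1 dummy column in the design matrix.
colnames(model.matrix(fit.cont))

# Changing the baseline changes the dummy coefficients,
# but not the fitted values.
gap$continent2 <- relevel(gap$continent, ref = "Europe")
fit.eur <- lm(lifeExp ~ logGdpPercap + continent2, data = gap)
all.equal(unname(fitted(fit.cont)), unname(fitted(fit.eur)))
```

The choice of baseline is therefore a matter of which comparison is easiest to read, not a modelling decision.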