The response \(y\) is approximately linear in the inputs \(x_1, \dots, x_m\): \[y_i \approx a_0 + a_1 x_1^{(i)} + \dots + a_m x_m^{(i)}\]
The residuals are independent of each other and normally distributed with zero mean and equal variance: \[e_i = y_i - (a_0 + a_1 x_1^{(i)} + \dots + a_m x_m^{(i)})\] \[e_i \sim N(0, \sigma^2)\]
There is no multicollinearity: no input is a linear combination of the others. For example, for any choice of \(b_0, b_1, b_2\), \[x_3^{(i)} \neq b_0 + b_1 x_1^{(i)} + b_2 x_2^{(i)}\]
The x’s are exogenous: \(e_i\) is independent of \(x^{(i)}\) (e.g., residuals are not larger for larger x’s)
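As a minimal sketch of what the residual definition means in practice, the following simulates data that satisfies the assumptions and checks that residuals computed by hand match those returned by `lm`. The simulated data, the seed, and the names `x1`, `x2`, `sim_fit`, and `e_manual` are illustrative assumptions and are not part of the gapminder analysis.

```r
# Simulated data (an assumption for illustration): linear signal plus normal noise
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, mean = 0, sd = 1)

sim_fit <- lm(y ~ x1 + x2)

# Residual by hand: e_i = y_i - (a_0 + a_1 * x_1 + a_2 * x_2)
a <- coef(sim_fit)
e_manual <- y - (a[1] + a[2] * x1 + a[3] * x2)

# Should agree with the residuals stored in the fitted model
all.equal(unname(e_manual), unname(resid(sim_fit)))

# With an intercept in the model, least-squares residuals average to zero
mean(e_manual)
```

Note that a zero sample mean is guaranteed by least squares whenever the model has an intercept, so it says nothing about whether the normality or equal-variance assumptions hold; those are what the diagnostic plots are for.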
library(gapminder)
library(dplyr)
library(ggplot2)
# Keep only the 1982 observations
gap <- gapminder %>% filter(year == 1982)
ggplot(gap, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm")
This fails the first assumption: the relationship is not linear.
gap <- gap %>% mutate(logGdpPercap = log(gdpPercap))
ggplot(gap, mapping = aes(x = logGdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm")
Let’s fit a model and plot the diagnostics.
library(ggfortify)
fit <- lm(lifeExp ~ logGdpPercap, data = gap)
autoplot(fit)
Which countries look like outliers? The largest residuals belong to
gap[c(113, 4, 46),]
## # A tibble: 3 x 7
##   country      continent  year lifeExp     pop gdpPercap logGdpPercap
##   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>        <dbl>
## 1 Sierra Leone Africa     1982    38.4 3464522     1465.         7.29
## 2 Angola       Africa     1982    39.9 7016384     2757.         7.92
## 3 Gabon        Africa     1982    56.6  753874    15113.         9.62
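The row numbers above are read off the labels in the diagnostic plots. As a hedged sketch, rows with large errors can also be located in code by ranking the absolute residuals. The name `worst` is an assumption introduced here, and this ranking may not match exactly how the plotting function chooses its labels.

```r
library(gapminder)
library(dplyr)

# Rebuild the data and model from above so the snippet runs on its own
gap <- gapminder %>%
  filter(year == 1982) %>%
  mutate(logGdpPercap = log(gdpPercap))
fit <- lm(lifeExp ~ logGdpPercap, data = gap)

# Row indices of the three largest residuals in absolute value
worst <- order(abs(resid(fit)), decreasing = TRUE)[1:3]
gap[worst, c("country", "lifeExp", "gdpPercap")]
```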
The countries with a lot of leverage (i.e., potential influence on the estimate) are
gap[c(110, 88),]
## # A tibble: 2 x 7
##   country      continent  year lifeExp      pop gdpPercap logGdpPercap
##   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
## 1 Saudi Arabia Asia       1982    63.0 11254672    33693.        10.4
## 2 Myanmar      Asia       1982    58.1 34680442      424          6.05
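Leverage can be computed directly from the fitted model with `hatvalues()` from base R's stats package. The sketch below ranks countries by leverage. The names `lev` and `top_lev` are assumptions introduced here, and the twice-the-average cutoff is only a common rule of thumb, not something the analysis above relies on.

```r
library(gapminder)
library(dplyr)

# Rebuild the data and model from above so the snippet runs on its own
gap <- gapminder %>%
  filter(year == 1982) %>%
  mutate(logGdpPercap = log(gdpPercap))
fit <- lm(lifeExp ~ logGdpPercap, data = gap)

# Leverage (diagonal of the hat matrix) for every observation
lev <- hatvalues(fit)

# The two observations with the highest leverage
top_lev <- order(lev, decreasing = TRUE)[1:2]
gap[top_lev, c("country", "gdpPercap", "logGdpPercap")]

# Rule of thumb: flag leverage above twice the average, 2 * p / n,
# where p is the number of coefficients and n the number of rows
p <- length(coef(fit))
which(lev > 2 * p / nrow(gap))
```

With a single predictor, leverage grows with the squared distance of \(x\) from its mean, which is why countries at the extremes of log GDP per capita stand out.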
If we are satisfied that the assumptions are roughly correct, we can test the hypothesis that higher GDP per capita is associated with higher life expectancy:
summary(fit)
##
## Call:
## lm(formula = lifeExp ~ logGdpPercap, data = gap)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18.7709  -2.8743   0.4812   3.6039  14.6986
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.6505     3.3463  -0.194    0.846
## logGdpPercap   7.4936     0.3990  18.780   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.762 on 140 degrees of freedom
## Multiple R-squared: 0.7158, Adjusted R-squared: 0.7138
## F-statistic: 352.7 on 1 and 140 DF, p-value: < 2.2e-16
We can reject the null hypothesis that there is no linear association: the slope on logGdpPercap is positive and its p-value is below 2e-16.
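Because the predictor is the natural log of GDP per capita, the slope is easier to read in terms of doubling. The sketch below copies the estimate, standard error, and residual degrees of freedom from the summary output above and does the arithmetic. The variable names are assumptions introduced here, and the interval is the usual t-based one, which relies on the model assumptions holding.

```r
# Values copied from the summary(fit) output above
slope    <- 7.4936
slope_se <- 0.3990
df_resid <- 140

# Expected change in life expectancy (years) when GDP per capita doubles:
# log(2 * gdp) - log(gdp) = log(2), so the change is slope * log(2)
years_per_doubling <- slope * log(2)
round(years_per_doubling, 2)
# 5.19

# Approximate 95% confidence interval for the slope
ci <- slope + c(-1, 1) * qt(0.975, df_resid) * slope_se
round(ci, 2)
```

On the fitted model itself, `confint(fit)` gives the same kind of interval without copying numbers by hand.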
What if we add the continent as a predictor? This is called “controlling” for a variable.
fit.cont <- lm(lifeExp ~ logGdpPercap + continent, data = gap)
autoplot(fit.cont)
We will peek at the summary, but having already tested one hypothesis on this data, we don’t get to treat these p-values as a second hypothesis test:
summary(fit.cont)
##
## Call:
## lm(formula = lifeExp ~ logGdpPercap + continent, data = gap)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.6618  -2.3396   0.1655   2.5571  12.4680
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)        12.2639     3.6536   3.357  0.00102 **
## logGdpPercap        5.3415     0.4871  10.966  < 2e-16 ***
## continentAmericas   7.2994     1.3956   5.230 6.25e-07 ***
## continentAsia       6.4727     1.1945   5.419 2.65e-07 ***
## continentEurope     9.5629     1.5685   6.097 1.05e-08 ***
## continentOceania    9.5336     3.8200   2.496  0.01377 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.033 on 136 degrees of freedom
## Multiple R-squared: 0.7894, Adjusted R-squared: 0.7817
## F-statistic: 102 on 5 and 136 DF, p-value: < 2.2e-16
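The coefficient table has no row for Africa. With R's default treatment coding, the first factor level becomes the baseline that is absorbed into the intercept, so each continent coefficient is a difference from Africa at the same log GDP per capita. The sketch below shows the coding on a toy factor; the toy data and the name `X` are assumptions for illustration, not the gapminder data.

```r
# Toy factor (illustrative only); levels are sorted alphabetically by default
continent <- factor(c("Africa", "Americas", "Asia", "Africa"))

# The first level is the reference category absorbed into the intercept
levels(continent)[1]

# One indicator column per remaining level
X <- model.matrix(~ continent)
colnames(X)
```

A different baseline can be chosen with `relevel()`, which changes the individual continent coefficients but not the fitted values.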