Suppose we want to predict life expectancy from the GDP per capita. As we saw, it makes sense to draw a straight line through the log-log plot of lifeExp
vs. gdpPercap
:
gap.1982 <- gapminder %>% filter(year == 1982)
ggplot(gap.1982, mapping = aes(x = log(gdpPercap), y = log(lifeExp))) +
geom_point() +
geom_smooth(method = "lm")
Let’s now get the equation for the best line through the data:
lm(log(lifeExp) ~ log(gdpPercap), data = gap.1982)
##
## Call:
## lm(formula = log(lifeExp) ~ log(gdpPercap), data = gap.1982)
##
## Coefficients:
## (Intercept) log(gdpPercap)
## 3.0527 0.1266
For example, for a country with a gdpPercap
of $10,000, we would predict a log(lifeExp)
of
3.0527 + 0.1266 * log(10000)
## [1] 4.218729
The prediction for lifeExp
will be
exp(3.0527 + 0.1266 * log(10000))
## [1] 67.94707
We can get all the predictions using
model <- lm(log(lifeExp) ~ log(gdpPercap), data = gap.1982)
predict(model, newdata = gap.1982)
## 1 2 3 4 5 6 7 8
## 3.924370 4.090434 4.148530 4.055575 4.205327 4.303094 4.316175 4.301354
## 9 10 11 12 13 14 15 16
## 3.877797 4.312504 3.958230 4.072709 4.106637 4.119034 4.174096 4.193944
## 17 18 19 20 21 22 23 24
## 3.900069 3.853690 3.867576 4.036320 4.323585 3.921588 3.898603 4.133341
## 25 26 27 28 29 30 31 32
## 3.922336 4.114688 3.957156 3.877190 4.127854 4.137426 4.048286 4.254053
## 33 34 35 36 37 38 39 40
## 4.179146 4.273172 4.316707 4.061079 4.060269 4.177349 4.085921 4.105767
## 41 42 43 44 45 46 47 48
## 3.917701 3.845579 3.857754 4.296805 4.308296 4.270981 3.904479 4.318697
## 49 50 51 52 53 54 55 56
## 3.910429 4.272273 4.126313 3.907685 3.904829 4.015643 4.071308 4.266263
## 57 58 59 60 61 62 63 64
## 4.247411 4.325618 3.907460 3.979934 4.184090 4.265892 4.248138 4.273088
## 65 66 67 68 69 70 71 72
## 4.282382 4.155452 4.302489 4.107700 3.965012 4.106019 4.145807 4.363370
## 73 74 75 76 77 78 79 80
## 4.184625 3.898501 3.856508 4.288558 3.960681 3.869253 4.128909 3.866259
## 81 82 83 84 85 86 87 88
## 3.976916 4.092412 4.213674 4.014977 4.233298 4.053055 3.829483 3.818558
## 89 90 91 92 93 94 95 96
## 4.108600 3.885310 4.315011 4.290498 4.084709 3.915207 3.984853 4.341110
## 97 98 99 100 101 102 103 104
## 4.251470 3.973651 4.173713 4.110620 4.162876 4.048313 4.197396 4.239154
## 105 106 107 108 109 110 111 112
## 4.222818 4.137534 4.213597 3.911227 4.007791 4.372479 3.980068 4.271547
## 113 114 115 116 117 118 119 120
## 3.975529 4.271448 4.234711 4.292169 3.947796 4.199133 4.260624 3.990436
## 121 122 123 124 125 126 127 128
## 4.008147 4.099337 4.310604 4.350832 4.094920 4.181025 3.910170 4.037662
## 129 130 131 132 133 134 135 136
## 3.964669 4.207027 4.087947 4.110109 3.878781 4.294734 4.334747 4.172089
## 137 138 139 140 141 142
## 4.232504 3.883331 4.112904 4.013510 3.970565 3.897159
We can also get the prediction for a gdpPercap
of 1000 by creating our own data frame:
new.df <- data.frame(gdpPercap = 10000)
exp(predict(model, newdata = new.df))
## 1
## 67.94479
For gap.1982
, we can compute the sum of squared errors as follows.
The errors (for predicting the log of the life expectancy) are
log(gap.1982$lifeExp) - predict(model, newdata = gap.1982)
## 1 2 3 4 5
## -0.2391474617 0.1640429479 -0.0316409825 -0.3681464245 0.0423392656
## 6 7 8 9 10
## 0.0109212642 -0.0232528968 -0.0664942997 0.0344064543 -0.0093854207
## 11 12 13 14 15
## -0.0282882783 -0.0863396317 0.1516670917 -0.0002570061 -0.0256424823
## 16 17 18 19 20
## 0.0698617837 -0.0263294506 0.0064294423 0.0634064373 -0.0667645308
## 21 22 23 24 25
## 0.0039859385 -0.0442599971 0.0037127148 0.1231928040 0.2600958659
## 26 27 28 29 30
## 0.0848116405 0.0118714353 -0.0104994360 -0.0901677788 0.1591792155
## 31 32 33 34 35
## -0.0596168188 0.0009920808 0.1210877484 -0.0110559754 -0.0041645515
## 36 37 38 39 40
## -0.1731031731 0.0943394883 -0.0131360316 -0.0604625511 -0.0696871195
## 41 42 43 44 45
## -0.1412229925 -0.0638921785 -0.0529600006 0.0146645909 0.0077247358
## 46 47 48 49 50
## -0.2356082938 -0.0850095476 -0.0173377733 0.0738028924 0.0484096932
## 51 52 53 54 55
## -0.0635110584 -0.1490232630 -0.2329173400 -0.0748186425 0.0380732627
## 56 57 58 59 60
## 0.0572069207 -0.0076678732 0.0180573160 0.1284787210 0.0482533036
## 61 62 63 64 65
## -0.0960990271 -0.1381450173 0.0436899365 0.0370394452 0.0348398913
## 66 67 68 69 70
## 0.1101812227 0.0427441357 0.0470963456 0.1085512262 0.1295354874
## 71 72 73 74 75
## 0.0607195187 -0.0963475500 0.0198143188 0.1102494125 -0.0531394972
## 76 77 78 79 80
## -0.1589265546 -0.0694933321 -0.0484246413 0.0905984810 -0.0839802229
## 81 82 83 84 85
## 0.0046139283 0.1079580672 -0.0029550299 0.0366170135 0.0721309750
## 86 87 88 89 90
## 0.0354392614 -0.0730613960 0.2428496873 -0.0316052279 0.0185602210
## 91 92 93 94 95
## 0.0163797550 0.0114027629 -0.0021334696 -0.1633992567 -0.1600013429
## 96 97 98 99 100
## -0.0107717725 -0.1126621322 0.0545185769 0.0815020331 0.0921902415
## 101 102 103 104 105
## -0.0453679940 0.0801426793 0.0697804592 0.0481501974 0.0778630600
## 106 107 108 109 110
## 0.1093175104 0.0300288999 -0.0778576334 0.0923863513 -0.2291541708
## 111 112 113 114 115
## -0.0215619542 -0.0207402785 -0.3263007475 0.0018796267 0.0251479358
## 116 117 118 119 120
## -0.0286023034 -0.1876433381 -0.1359179980 0.0740490892 0.2401420691
## 121 122 123 124 125
## -0.0893871225 -0.0818550787 0.0256400488 -0.0173395343 0.0731394514
## 126 127 128 129 130
## 0.0978606743 0.0139394701 0.1305055461 0.0511911356 0.0246417919
## 131 132 133 134 135
## 0.0716860514 0.0013546448 0.0302175178 0.0098712168 -0.0219369264
## 136 137 138 139 140
## 0.0878408426 -0.0048383826 0.1910824288 0.0523026663 -0.1193859855
## 141 142
## -0.0227698185 0.2032176331
The sum of the squared errors is then
sum((log(gap.1982$lifeExp) - predict(model, newdata = gap.1982))^2)
## [1] 1.484817
Let’s now use both the log(gdpPercap)
and the year to predict the log of the life expectancy. We’ll use the full gapminder
dataset – otherwise using the year doesn’t really make sense.
model2 <- lm(log(lifeExp) ~ log(gdpPercap) + year, data = gapminder)
We’ll also define the old model that ignores the year:
model1 <- lm(log(lifeExp) ~ log(gdpPercap), data = gapminder)
Let’s look at model2
:
model2
##
## Call:
## lm(formula = log(lifeExp) ~ log(gdpPercap) + year, data = gapminder)
##
## Coefficients:
## (Intercept) log(gdpPercap) year
## -4.054829 0.135050 0.003543
The implied equation is
\(\log(\text{lifeExp}) \approx -4.054829 + 0.135050\log(\text{gdpPercap}) + 0.003543\text{year}\)
What does that mean? Here’s an example.
How is the predictions of \(\log(\text{lifeExp})\) are going to differ between 1980 and 1981, if the gdpPercap is still the same? The predictions would differ by \(0.003543\times(1981-1980) = 0.003543\) (Why?).
\(\log(ab) = \log(a) + \log(b)\). This means that if \(\log(b)\) increases by 2, \(ab\) increases by a factor of \(\exp(2)\).
In our case, if year increases by 1, \(0.003543\text{year}\) increases by \(0.003543\), so the predicted \(\log(\text{lifeExp})\) increases by \(0.003543\), so the predicted lifeExp
increases by a factor of \(\exp(0.003543)\) (i.e., the new prediction is \(\exp(0.003543)\) times the old one).
We can compare the two models:
model1 <- lm(log(lifeExp) ~ log(gdpPercap), data = gapminder)
model2 <- lm(log(lifeExp) ~ log(gdpPercap) + year, data = gapminder)
sum((log(gapminder$lifeExp) - predict(model1, newdata = gapminder))^2)
## [1] 35.54009
sum((log(gapminder$lifeExp) - predict(model2, newdata = gapminder))^2)
## [1] 29.51551
We expect model2
to do a better job of predicting the life expectancy, since it can also take the year into account. (And if it didn’t make sense to do so, it wouldn’t – it can always just multiply the log(gdpPercap) by 0 and get the same prediction as model 1, if that’s the best that can be done.)
As we might expect, the sum of squared errors is larger for model1
than for model2`.
Let’s predict the life expectancy using continent and the year:
model3 <- lm(lifeExp ~ continent + year, data = gapminder)
model3
##
## Call:
## lm(formula = lifeExp ~ continent + year, data = gapminder)
##
## Coefficients:
## (Intercept) continentAmericas continentAsia continentEurope
## -596.2613 15.7934 11.1996 23.0384
## continentOceania year
## 25.4609 0.3259
For Americas in 1980, the prediction will be -596.2613 + 15.7934 + 1980*0.3259
. For Africa in 1990, the prediction will be -596.2613 + 0.3259*1990
(Note that we don’t add a coefficient for Africa, since R chose to only use indicator variables for the other four continents).