Predicting Life Expectancy

Suppose we want to predict life expectancy from GDP per capita. As we saw, it makes sense to draw a straight line through the log-log plot of lifeExp vs. gdpPercap:

gap.1982 <- gapminder %>% filter(year == 1982)
ggplot(gap.1982, mapping = aes(x = log(gdpPercap), y = log(lifeExp))) +
  geom_point() +
  geom_smooth(method = "lm")

Let’s now get the equation of the best-fitting line through the data:

lm(log(lifeExp) ~ log(gdpPercap), data = gap.1982)
## 
## Call:
## lm(formula = log(lifeExp) ~ log(gdpPercap), data = gap.1982)
## 
## Coefficients:
##    (Intercept)  log(gdpPercap)  
##         3.0527          0.1266

For example, for a country with a gdpPercap of $10,000, we would predict a log(lifeExp) of

3.0527 + 0.1266 * log(10000)
## [1] 4.218729
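
This used the rounded coefficients from the printout. We can avoid the rounding by extracting the coefficients with coef() (a small sketch; b is just a temporary name):

b <- coef(lm(log(lifeExp) ~ log(gdpPercap), data = gap.1982))
unname(b[1] + b[2] * log(10000))  # slightly different from above, since nothing is rounded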

The prediction for lifeExp will be

exp(3.0527 + 0.1266 * log(10000))
## [1] 67.94707

We can get all the predictions using

model <- lm(log(lifeExp) ~ log(gdpPercap), data = gap.1982)
predict(model, newdata = gap.1982)
##        1        2        3        4        5        6        7        8 
## 3.924370 4.090434 4.148530 4.055575 4.205327 4.303094 4.316175 4.301354 
##        9       10       11       12       13       14       15       16 
## 3.877797 4.312504 3.958230 4.072709 4.106637 4.119034 4.174096 4.193944 
##       17       18       19       20       21       22       23       24 
## 3.900069 3.853690 3.867576 4.036320 4.323585 3.921588 3.898603 4.133341 
##       25       26       27       28       29       30       31       32 
## 3.922336 4.114688 3.957156 3.877190 4.127854 4.137426 4.048286 4.254053 
##       33       34       35       36       37       38       39       40 
## 4.179146 4.273172 4.316707 4.061079 4.060269 4.177349 4.085921 4.105767 
##       41       42       43       44       45       46       47       48 
## 3.917701 3.845579 3.857754 4.296805 4.308296 4.270981 3.904479 4.318697 
##       49       50       51       52       53       54       55       56 
## 3.910429 4.272273 4.126313 3.907685 3.904829 4.015643 4.071308 4.266263 
##       57       58       59       60       61       62       63       64 
## 4.247411 4.325618 3.907460 3.979934 4.184090 4.265892 4.248138 4.273088 
##       65       66       67       68       69       70       71       72 
## 4.282382 4.155452 4.302489 4.107700 3.965012 4.106019 4.145807 4.363370 
##       73       74       75       76       77       78       79       80 
## 4.184625 3.898501 3.856508 4.288558 3.960681 3.869253 4.128909 3.866259 
##       81       82       83       84       85       86       87       88 
## 3.976916 4.092412 4.213674 4.014977 4.233298 4.053055 3.829483 3.818558 
##       89       90       91       92       93       94       95       96 
## 4.108600 3.885310 4.315011 4.290498 4.084709 3.915207 3.984853 4.341110 
##       97       98       99      100      101      102      103      104 
## 4.251470 3.973651 4.173713 4.110620 4.162876 4.048313 4.197396 4.239154 
##      105      106      107      108      109      110      111      112 
## 4.222818 4.137534 4.213597 3.911227 4.007791 4.372479 3.980068 4.271547 
##      113      114      115      116      117      118      119      120 
## 3.975529 4.271448 4.234711 4.292169 3.947796 4.199133 4.260624 3.990436 
##      121      122      123      124      125      126      127      128 
## 4.008147 4.099337 4.310604 4.350832 4.094920 4.181025 3.910170 4.037662 
##      129      130      131      132      133      134      135      136 
## 3.964669 4.207027 4.087947 4.110109 3.878781 4.294734 4.334747 4.172089 
##      137      138      139      140      141      142 
## 4.232504 3.883331 4.112904 4.013510 3.970565 3.897159

We can also get the prediction for a gdpPercap of 10,000 by creating our own data frame:

new.df <- data.frame(gdpPercap = 10000)
exp(predict(model, newdata = new.df))
##        1 
## 67.94479
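
The same approach gives predictions for several values of gdpPercap at once (the values below are just examples):

new.df <- data.frame(gdpPercap = c(1000, 10000, 100000))
exp(predict(model, newdata = new.df))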

For gap.1982, we can compute the sum of squared errors as follows.

The errors (for predicting the log of the life expectancy) are

log(gap.1982$lifeExp) - predict(model, newdata = gap.1982)
##             1             2             3             4             5 
## -0.2391474617  0.1640429479 -0.0316409825 -0.3681464245  0.0423392656 
##             6             7             8             9            10 
##  0.0109212642 -0.0232528968 -0.0664942997  0.0344064543 -0.0093854207 
##            11            12            13            14            15 
## -0.0282882783 -0.0863396317  0.1516670917 -0.0002570061 -0.0256424823 
##            16            17            18            19            20 
##  0.0698617837 -0.0263294506  0.0064294423  0.0634064373 -0.0667645308 
##            21            22            23            24            25 
##  0.0039859385 -0.0442599971  0.0037127148  0.1231928040  0.2600958659 
##            26            27            28            29            30 
##  0.0848116405  0.0118714353 -0.0104994360 -0.0901677788  0.1591792155 
##            31            32            33            34            35 
## -0.0596168188  0.0009920808  0.1210877484 -0.0110559754 -0.0041645515 
##            36            37            38            39            40 
## -0.1731031731  0.0943394883 -0.0131360316 -0.0604625511 -0.0696871195 
##            41            42            43            44            45 
## -0.1412229925 -0.0638921785 -0.0529600006  0.0146645909  0.0077247358 
##            46            47            48            49            50 
## -0.2356082938 -0.0850095476 -0.0173377733  0.0738028924  0.0484096932 
##            51            52            53            54            55 
## -0.0635110584 -0.1490232630 -0.2329173400 -0.0748186425  0.0380732627 
##            56            57            58            59            60 
##  0.0572069207 -0.0076678732  0.0180573160  0.1284787210  0.0482533036 
##            61            62            63            64            65 
## -0.0960990271 -0.1381450173  0.0436899365  0.0370394452  0.0348398913 
##            66            67            68            69            70 
##  0.1101812227  0.0427441357  0.0470963456  0.1085512262  0.1295354874 
##            71            72            73            74            75 
##  0.0607195187 -0.0963475500  0.0198143188  0.1102494125 -0.0531394972 
##            76            77            78            79            80 
## -0.1589265546 -0.0694933321 -0.0484246413  0.0905984810 -0.0839802229 
##            81            82            83            84            85 
##  0.0046139283  0.1079580672 -0.0029550299  0.0366170135  0.0721309750 
##            86            87            88            89            90 
##  0.0354392614 -0.0730613960  0.2428496873 -0.0316052279  0.0185602210 
##            91            92            93            94            95 
##  0.0163797550  0.0114027629 -0.0021334696 -0.1633992567 -0.1600013429 
##            96            97            98            99           100 
## -0.0107717725 -0.1126621322  0.0545185769  0.0815020331  0.0921902415 
##           101           102           103           104           105 
## -0.0453679940  0.0801426793  0.0697804592  0.0481501974  0.0778630600 
##           106           107           108           109           110 
##  0.1093175104  0.0300288999 -0.0778576334  0.0923863513 -0.2291541708 
##           111           112           113           114           115 
## -0.0215619542 -0.0207402785 -0.3263007475  0.0018796267  0.0251479358 
##           116           117           118           119           120 
## -0.0286023034 -0.1876433381 -0.1359179980  0.0740490892  0.2401420691 
##           121           122           123           124           125 
## -0.0893871225 -0.0818550787  0.0256400488 -0.0173395343  0.0731394514 
##           126           127           128           129           130 
##  0.0978606743  0.0139394701  0.1305055461  0.0511911356  0.0246417919 
##           131           132           133           134           135 
##  0.0716860514  0.0013546448  0.0302175178  0.0098712168 -0.0219369264 
##           136           137           138           139           140 
##  0.0878408426 -0.0048383826  0.1910824288  0.0523026663 -0.1193859855 
##           141           142 
## -0.0227698185  0.2032176331

The sum of the squared errors is then

sum((log(gap.1982$lifeExp) - predict(model, newdata = gap.1982))^2)
## [1] 1.484817
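
Since model was fit on gap.1982 itself, those errors are exactly the model’s residuals, so the same number (up to floating-point rounding) comes out of:

sum(resid(model)^2)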

Multiple linear regression

Let’s now use both the log(gdpPercap) and the year to predict the log of the life expectancy. We’ll use the full gapminder dataset – otherwise using the year doesn’t really make sense.

model2 <- lm(log(lifeExp) ~ log(gdpPercap) + year, data = gapminder)

We’ll also define the old model that ignores the year:

model1 <- lm(log(lifeExp) ~ log(gdpPercap), data = gapminder)

Let’s look at model2:

model2
## 
## Call:
## lm(formula = log(lifeExp) ~ log(gdpPercap) + year, data = gapminder)
## 
## Coefficients:
##    (Intercept)  log(gdpPercap)            year  
##      -4.054829        0.135050        0.003543

The implied equation is

\(\log(\text{lifeExp}) \approx -4.054829 + 0.135050\log(\text{gdpPercap}) + 0.003543\text{year}\)

What does that mean? Here’s an example.

How will the predictions of \(\log(\text{lifeExp})\) differ between 1980 and 1981, if the gdpPercap stays the same? The predictions would differ by \(0.003543\times(1981-1980) = 0.003543\) (Why?).
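
We can check this with predict(); the gdpPercap value below (10000) is arbitrary, since it cancels in the difference:

years.df <- data.frame(gdpPercap = c(10000, 10000), year = c(1980, 1981))
diff(predict(model2, newdata = years.df))  # equals the year coefficient, ~0.003543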

An aside on \(\log\)

\(\log(ab) = \log(a) + \log(b)\). This means that if \(\log(b)\) increases by 2 (with \(a\) held fixed), then \(\log(ab)\) also increases by 2, and \(ab\) increases by a factor of \(\exp(2)\).

In our case, if year increases by 1, \(0.003543\text{year}\) increases by \(0.003543\), so the predicted \(\log(\text{lifeExp})\) increases by \(0.003543\), so the predicted lifeExp increases by a factor of \(\exp(0.003543)\) (i.e., the new prediction is \(\exp(0.003543)\) times the old one).
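
Numerically, that factor is very close to 1:

exp(0.003543)  # about 1.00355, i.e. roughly a 0.35% increase in lifeExp per year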

Which model would have the smaller sum of squared errors?

We can compare the two models:

model1 <- lm(log(lifeExp) ~ log(gdpPercap), data = gapminder)
model2 <- lm(log(lifeExp) ~ log(gdpPercap) + year, data = gapminder)
sum((log(gapminder$lifeExp) - predict(model1, newdata = gapminder))^2)
## [1] 35.54009
sum((log(gapminder$lifeExp) - predict(model2, newdata = gapminder))^2)
## [1] 29.51551
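
Equivalently, deviance() returns the residual sum of squares of an lm fit directly:

deviance(model1)
deviance(model2)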

We expect model2 to do a better job of predicting the life expectancy, since it can also take the year into account. (And if it didn’t make sense to do so, it wouldn’t have to – it can always just multiply the year by 0 and get the same predictions as model1, if that’s the best that can be done.)

As we might expect, the sum of squared errors is larger for model1 than for model2.

Predictions using categorical variables

Let’s predict the life expectancy using continent and the year:

model3 <- lm(lifeExp ~ continent + year, data = gapminder)
model3
## 
## Call:
## lm(formula = lifeExp ~ continent + year, data = gapminder)
## 
## Coefficients:
##       (Intercept)  continentAmericas      continentAsia    continentEurope  
##         -596.2613            15.7934            11.1996            23.0384  
##  continentOceania               year  
##           25.4609             0.3259

For the Americas in 1980, the prediction will be -596.2613 + 15.7934 + 1980*0.3259, or approximately 64.81. For Africa in 1990, the prediction will be -596.2613 + 0.3259*1990, or approximately 52.28. (Note that we don’t add a continent coefficient for Africa: R treats Africa as the baseline level and only creates indicator variables for the other four continents.)
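
We can check these by hand with predict() (a sketch; the continent strings must match the factor levels in gapminder):

new.df <- data.frame(continent = c("Americas", "Africa"), year = c(1980, 1990))
predict(model3, newdata = new.df)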