Linear regression

Let’s run our first linear regression on the gapminder dataset for 1982.

library(gapminder)
library(tidyverse)   # for %>% and ggplot2

gap.1982 <- gapminder %>% filter(year == 1982)
p <- ggplot(data = gap.1982,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point() +
    geom_smooth(method = "lm")

fit.gap.1 <- lm(lifeExp ~ gdpPercap, data=gap.1982)

We are trying to predict lifeExp (so the y here) using gdpPercap (the x). The fitted model gives us the coefficients, \(a_0\) (the intercept) and \(a_1\) (the slope).

We can extract the individual coefficients using

fit.gap.1$coefficients[1]
## (Intercept) 
##    53.96495
fit.gap.1$coefficients[2]
##   gdpPercap 
## 0.001006563
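Equivalently, the generic accessor coef() returns the whole coefficient vector at once (a sketch that refits the model so it runs on its own):

```r
library(gapminder)

gap.1982 <- subset(gapminder, year == 1982)
fit.gap.1 <- lm(lifeExp ~ gdpPercap, data = gap.1982)

# coef() returns a named vector with entries "(Intercept)" and "gdpPercap"
a <- coef(fit.gap.1)
a["(Intercept)"]   # same as fit.gap.1$coefficients[1]
a["gdpPercap"]     # same as fit.gap.1$coefficients[2]
```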

So we can get predictions using

my.pred <- fit.gap.1$coefficients[1] + fit.gap.1$coefficients[2] * gap.1982$gdpPercap
my.pred[1:10]
##  [1] 54.94938 57.61966 59.74781 56.74000 63.02190 73.56979 75.70377
##  [8] 73.30218 54.64637 75.08249

We can also get predictions using

predict(fit.gap.1, newdata=gap.1982)[1:10]
##        1        2        3        4        5        6        7        8 
## 54.94938 57.61966 59.74781 56.74000 63.02190 73.56979 75.70377 73.30218 
##        9       10 
## 54.64637 75.08249

(newdata needs to be a data frame with the column gdpPercap)
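For instance, to predict the life expectancy of a hypothetical country with a GDP per capita of 10,000 (a made-up value for illustration), pass a one-row data frame:

```r
library(gapminder)

gap.1982 <- subset(gapminder, year == 1982)
fit.gap.1 <- lm(lifeExp ~ gdpPercap, data = gap.1982)

# newdata must be a data frame with a gdpPercap column
predict(fit.gap.1, newdata = data.frame(gdpPercap = 10000))
# about 64 years: 53.96 + 0.001007 * 10000
```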

Let’s now compute the cost function \(\sum_i (y^{(i)}-pred(x^{(i)}))^2\):

sum((predict(fit.gap.1, gap.1982) - gap.1982$lifeExp)**2)
## [1] 7812.27

This looks large, but partly that’s because we are summing up all the squared errors. Let’s compute the mean squared error instead:

sum((predict(fit.gap.1, gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982)
## [1] 55.01599

(Note: nrow computes the number of rows in a data frame).

This corresponds to the mean square error, \(\frac{\sum_i (y^{(i)}-pred(x^{(i)}))^2}{m}\).

Now if we want to get some idea of what our actual errors might be, we’d take the square root: \(\sqrt{\frac{\sum_i (y^{(i)}-pred(x^{(i)}))^2}{m}}\)

sqrt(sum((predict(fit.gap.1, gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982))
## [1] 7.417276
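Since we’ll be computing this quantity repeatedly, it’s convenient to wrap it in a small helper (a sketch; the name rmse is our own, not a built-in):

```r
library(gapminder)

gap.1982 <- subset(gapminder, year == 1982)
fit.gap.1 <- lm(lifeExp ~ gdpPercap, data = gap.1982)

# root mean squared error of a fitted model on a data set
# (mean() replaces sum()/nrow(); same quantity)
rmse <- function(fit, data, y) {
  sqrt(mean((predict(fit, newdata = data) - y)^2))
}

rmse(fit.gap.1, gap.1982, gap.1982$lifeExp)
## [1] 7.417276
```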

So in some sense, we can expect to be off by about 7 years. Pretty bad!

Recall that before we had the idea of plotting the graph with the x-axis scaled logarithmically.

p <- ggplot(data = gap.1982,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point() +
    scale_x_log10() +  
    geom_smooth(method = "lm")  

It looks like the line fits much better. Will this be reflected in the root mean squared prediction error?

fit.xlog.1982 <- lm(lifeExp ~ log(gdpPercap), data = gap.1982)
sqrt(sum((predict(fit.xlog.1982, newdata = gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982))
## [1] 5.721202
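Incidentally, the plot used scale_x_log10() while the model uses the natural log; this makes no difference to the fit, since log10(x) = log(x)/log(10) is just a rescaling of the predictor. The predictions come out identical; only the slope coefficient changes. A quick check:

```r
library(gapminder)

gap.1982 <- subset(gapminder, year == 1982)
fit.log   <- lm(lifeExp ~ log(gdpPercap),   data = gap.1982)
fit.log10 <- lm(lifeExp ~ log10(gdpPercap), data = gap.1982)

# same fitted values; the slope is rescaled by log(10)
all.equal(predict(fit.log, gap.1982), predict(fit.log10, gap.1982))
## [1] TRUE
```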

Yes!

Now, let’s try to use the continent as a predictor variable.

fit.xlog.cont.1982 <- lm(lifeExp ~ log(gdpPercap) + continent, data = gap.1982)
sqrt(sum((predict(fit.xlog.cont.1982, newdata = gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982))
## [1] 4.925259

We are now below an RMSE of 5 years.

Let’s see what the coefficients for the categorical variable are:

fit.xlog.cont.1982
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap) + continent, data = gap.1982)
## 
## Coefficients:
##       (Intercept)     log(gdpPercap)  continentAmericas  
##            12.264              5.342              7.299  
##     continentAsia    continentEurope   continentOceania  
##             6.473              9.563              9.534

So, for example, if the continent is Asia, we’d add 6.473 to the predicted life expectancy. Africa is the baseline category here: its effect is absorbed into the intercept, which is why there is no continentAfrica coefficient.
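To check the arithmetic, consider a hypothetical Asian country with a GDP per capita of 10,000 (a made-up value):

```r
library(gapminder)

gap.1982 <- subset(gapminder, year == 1982)
fit.xlog.cont.1982 <- lm(lifeExp ~ log(gdpPercap) + continent, data = gap.1982)

# by hand, with the printed (rounded) coefficients:
12.264 + 5.342 * log(10000) + 6.473
## [1] 67.93864

# predict() agrees, up to the rounding of the printed coefficients
predict(fit.xlog.cont.1982,
        newdata = data.frame(gdpPercap = 10000, continent = "Asia"))
```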