Let’s run our first linear regression on the gapminder dataset for 1982.
library(gapminder)
library(dplyr)    # for %>% and filter()
library(ggplot2)  # for ggplot()
gap.1982 <- gapminder %>% filter(year == 1982)
p <- ggplot(data = gap.1982,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point() +
  geom_smooth(method = "lm")
fit.gap.1 <- lm(lifeExp ~ gdpPercap, data = gap.1982)
We are trying to predict lifeExp (so the y here) using gdpPercap (the x). The fit gives us the two coefficients, \(a_0\) (the intercept) and \(a_1\) (the slope). We can extract the individual coefficients using
fit.gap.1$coefficients[1]
## (Intercept)
## 53.96495
fit.gap.1$coefficients[2]
## gdpPercap
## 0.001006563
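Equivalently, the coef() accessor returns the whole named coefficient vector at once:
coef(fit.gap.1)  # named vector with entries (Intercept) and gdpPercap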
So we can compute predictions by hand (note that here we use the gdpPercap column of the full gapminder dataset, so these are predictions for every country and year, not just 1982):
my.pred <- fit.gap.1$coefficients[1] + fit.gap.1$coefficients[2] * gapminder$gdpPercap
my.pred[1:10]
## [1] 54.74951 54.79119 54.82365 54.80663 54.70979 54.75622 54.94938
## [8] 54.82294 54.61855 54.60446
We can also get predictions using predict; this time we predict for the 1982 data only, which is why the values differ from the first ten values of my.pred above:
predict(fit.gap.1, newdata=gap.1982)[1:10]
## 1 2 3 4 5 6 7 8
## 54.94938 57.61966 59.74781 56.74000 63.02190 73.56979 75.70377 73.30218
## 9 10
## 54.64637 75.08249
(newdata needs to be a data frame with the column gdpPercap.)
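For example, to predict the life expectancy of a hypothetical country with a GDP per capita of 20000 (an illustrative value, not taken from the dataset):
# newdata can be any data frame that contains a gdpPercap column
predict(fit.gap.1, newdata = data.frame(gdpPercap = 20000))
# roughly 53.96 + 0.001007 * 20000, i.e. about 74.1 years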
Let’s now compute the cost function \(\sum_i (y^{(i)}-pred(x^{(i)}))^2\):
sum((predict(fit.gap.1, gap.1982) - gap.1982$lifeExp)**2)
## [1] 7812.27
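As a sanity check, this is simply the residual sum of squares of the fit, which R also exposes directly; both of the following reproduce the value above:
sum(residuals(fit.gap.1)^2)  # residuals are lifeExp minus the predictions
deviance(fit.gap.1)          # for an lm fit, the deviance is the residual sum of squares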
This looks large, but that’s partly because we are summing up all the squared errors. Let’s compute the mean squared error instead:
sum((predict(fit.gap.1, gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982)
## [1] 55.01599
(Note: nrow computes the number of rows in a data frame.) This corresponds to the mean squared error, \(\frac{\sum_i (y^{(i)}-pred(x^{(i)}))^2}{m}\).
Now, if we want to get some idea of what our actual errors might be, we’d take the square root: \(\sqrt{\frac{\sum_i (y^{(i)}-pred(x^{(i)}))^2}{m}}\)
sqrt(sum((predict(fit.gap.1, gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982))
## [1] 7.417276
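Since we will be computing this quantity a few more times, we could wrap it in a small helper function (our own convenience function, not part of any package):
# root mean squared error of a fitted model on a given data set
rmse <- function(fit, data, y) {
  sqrt(mean((predict(fit, newdata = data) - y)^2))
}
rmse(fit.gap.1, gap.1982, gap.1982$lifeExp)  # same value as above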
So, in some sense, we can expect to be off by about 7 years. Pretty bad!
Recall that earlier we had the idea of plotting the graph with the x-axis scaled logarithmically.
p <- ggplot(data = gap.1982,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point() +
  scale_x_log10() +
  geom_smooth(method = "lm")
It looks like this line fits much better. Will this be reflected in the root mean squared prediction error?
fit.xlog.1982 <- lm(lifeExp ~ log(gdpPercap), data = gap.1982)
sqrt(sum((predict(fit.xlog.1982, newdata = gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982))
## [1] 5.721202
Yes!
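One subtlety: scale_x_log10() uses base-10 logarithms, while the model above uses the natural log. For the fit this makes no difference: changing the base only rescales the slope coefficient, so the fitted values, and hence the RMSE, are identical. We can check this with the rmse helper from above:
# log10(x) = log(x)/log(10), so this is the same model with a rescaled slope
fit.xlog10.1982 <- lm(lifeExp ~ log10(gdpPercap), data = gap.1982)
rmse(fit.xlog10.1982, gap.1982, gap.1982$lifeExp)  # same RMSE as with log()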
Now, let’s try to use the continent as a predictor variable.
fit.xlog.cont.1982 <- lm(lifeExp ~ log(gdpPercap) + continent, data = gap.1982)
sqrt(sum((predict(fit.xlog.cont.1982, newdata = gap.1982) - gap.1982$lifeExp)**2)/nrow(gap.1982))
## [1] 4.925259
We are now below an RMSE of 5 years.
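Before looking at the coefficients, recall that continent is a factor; R encodes a factor predictor as one 0/1 dummy variable per level, except for the baseline level:
levels(gap.1982$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"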
Let’s see what the coefficients for the categorical variable are:
fit.xlog.cont.1982
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap) + continent, data = gap.1982)
##
## Coefficients:
## (Intercept) log(gdpPercap) continentAmericas
## 12.264 5.342 7.299
## continentAsia continentEurope continentOceania
## 6.473 9.563 9.534
So, for example, if the continent is Asia, we’d add 6.473 to the predicted life expectancy. (Africa does not appear in the list because it is the baseline level; its effect is absorbed into the intercept.)
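We can check this by hand. For a hypothetical Asian country with a GDP per capita of 10000 (an illustrative value), adding up the rounded coefficients printed above:
# intercept + slope * log(gdpPercap) + Asia offset
# (coefficients as rounded in the printout, so this is approximate)
12.264 + 5.342 * log(10000) + 6.473  # roughly 67.9 years
# the same prediction computed by predict()
predict(fit.xlog.cont.1982,
        newdata = data.frame(gdpPercap = 10000, continent = "Asia"))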