SML480 PSet 1

Problem 1

Using the gapminder data for 1982, compare three possible ways of predicting life expectancy:

Predict \(\log(\text{lifeExp})\) using \(\log(\text{gdpPercap})\), penalizing the squared difference between the predicted \(\log(\text{lifeExp})\) and the actual \(\log(\text{lifeExp})\).
Predict \(\log(\text{lifeExp})\) using \(\log(\text{gdpPercap})\), penalizing the squared difference between the predicted \(\text{lifeExp}\) and the actual \(\text{lifeExp}\).
Predict \(\text{lifeExp}\) using \(\text{gdpPercap}\), penalizing the squared difference between the predicted \(\text{lifeExp}\) and the actual \(\text{lifeExp}\).

Which method gives the best predictions, as measured by the sum of squared differences between the predicted and actual \(\text{lifeExp}\)? What is the intuition behind the result?

In Problem 1, you should use the usual quadratic cost function.
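The second option mixes scales: the model is linear on the log scale, but the quadratic cost is evaluated on the original scale, so `lm` does not fit it directly. The following is a minimal sketch of one way to set it up, assuming the `gapminder` and `dplyr` packages are installed. The names `gap82`, `cost_original_scale`, `fit_log`, and `fit_opt` are illustrative and not part of the assignment, and `optim` is only one of several possible optimizers.

```r
library(gapminder)  # provides the `gapminder` data frame
library(dplyr)

gap82 <- gapminder %>% filter(year == 1982)

# Cost for the second option: predictions are linear in log(gdpPercap) on the
# log scale, but the penalty is computed after mapping them back with exp().
cost_original_scale <- function(beta, data) {
  pred_log <- beta[1] + beta[2] * log(data$gdpPercap)
  sum((exp(pred_log) - data$lifeExp)^2)
}

# Use the ordinary log-log least-squares fit (the first option) as a start.
fit_log <- lm(log(lifeExp) ~ log(gdpPercap), data = gap82)
start <- unname(coef(fit_log))

# Nelder-Mead (the optim default) needs no gradient.
fit_opt <- optim(start, cost_original_scale, data = gap82)
fit_opt$par    # intercept and slope on the log scale
fit_opt$value  # sum of squared differences in lifeExp at the optimum found
```

Evaluating the same sum of squared differences in \(\text{lifeExp}\) for the other two fits puts all three methods on a common footing.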

Problem 2(a)

Suppose we use the Laplace distribution to model the distribution of the residuals in a regression instead of the Gaussian distribution. Write down a simple formula for the cost function that we can minimize to obtain the coefficients in this scenario. (Clear and readable photos of the write-up are fine.)
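For reference, a reminder of the Laplace density, assuming the usual location-scale parametrization. The symbols \(\mu\) (location) and \(b\) (scale) are introduced here for illustration and do not appear in the problem statement.

\[
f(x \mid \mu, b) = \frac{1}{2b} \exp\!\left(-\frac{|x - \mu|}{b}\right), \qquad b > 0.
\]

In the regression setting, each residual would be modeled as a draw from this density with \(\mu = 0\).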

Problem 2(b)

Explain why using the cost function from Problem 2(a) can make the coefficient estimates more robust to outliers (compared to the quadratic cost function). An estimate is more robust to outliers if the presence of a single outlier has less effect on the coefficient estimates.

Problem 2(c)

Show empirically that the cost function from Problem 2(a) can make coefficient estimates more robust to outliers: implement a function that does regression using the cost function from 2(a), make a synthetic dataset, and try adding outliers to it.
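One possible scaffold for this experiment is sketched below, using base R only. It fits a line by numerically minimizing an arbitrary cost on the residuals and is demonstrated with the quadratic cost, so the cost from 2(a) can be passed in the same way. The helper names (`fit_line`, `quadratic_cost`), the synthetic line \(y = 2 + 3x\), the seed, and the outlier value are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Synthetic data from a known line: y = 2 + 3x + Gaussian noise.
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1)

# Fit y ~ b0 + b1 * x by numerically minimizing cost(residuals).
fit_line <- function(x, y, cost) {
  objective <- function(beta) cost(y - (beta[1] + beta[2] * x))
  optim(c(0, 0), objective)$par  # Nelder-Mead by default; needs no gradient
}

quadratic_cost <- function(r) sum(r^2)

# Add one gross outlier and compare the fits with and without it.
x_out <- c(x, 10)
y_out <- c(y, 500)

coef_clean <- fit_line(x, y, quadratic_cost)
coef_out   <- fit_line(x_out, y_out, quadratic_cost)
rbind(clean = coef_clean, with_outlier = coef_out)
```

Repeating the comparison with a different `cost` argument shows how much each cost function lets the single outlier move the intercept and slope.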

Problem 3

Pick a dataset from library(fivethirtyeight), and make up a problem that involves some combination of summarize/mutate/filter/select/distinct with at least 4 steps (more is better). Write down both the problem and the solution.
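As an illustration of the expected shape of an answer, here is a hedged sketch of a made-up problem on the `bechdel` dataset. It assumes the columns `year`, `binary`, `budget_2013`, and `domgross_2013` exist as recalled here, which is worth confirming with `glimpse(bechdel)`. The problem itself and the derived names (`decade`, `return_ratio`, `result`) are invented for this example.

```r
library(fivethirtyeight)
library(dplyr)

# Example problem: among films released from 1990 onward with a known
# 2013-dollar budget and domestic gross, what is the median ratio of domestic
# gross to budget for each decade and Bechdel test outcome?
result <- bechdel %>%
  filter(year >= 1990, !is.na(budget_2013), !is.na(domgross_2013)) %>%
  mutate(decade = 10 * (year %/% 10),
         return_ratio = domgross_2013 / budget_2013) %>%
  select(decade, binary, return_ratio) %>%
  group_by(decade, binary) %>%
  summarize(n_films = n(),
            median_ratio = median(return_ratio),
            .groups = "drop")

result
```

The pipeline chains filter, mutate, select, and a grouped summarize, which meets the four-step minimum; a `distinct` step or further filters would extend it.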