Precept 3 Problem Set

Create and save the file p3.R on your computer. Once you have completed the required functions, submit p3.R, which should contain all of them, on Blackboard. You must work with one partner, and you should work collaboratively. Both partners are responsible for being able to explain the work to the preceptor. Work submitted on Blackboard will also be graded for correctness. Only one partner should submit the code on Blackboard.

The file p3.R should include the following code, with the NetIDs of the two partners substituted in:

student.netid.1 <- "netid1"
student.netid.2 <- "netid2"
assignment <- "precept3"

In addition, a comment should contain the full names of the partners.

Some of you will be tempted to use for and while loops to solve some of the problems below (if you’ve used those before). Please don’t do this: the goal here is to use R the way professional data scientists use it, which usually means no loops.
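As a rough illustration of what "no loops" can look like in practice, here is a minimal sketch using only base R. The numbers and variable names are made up for this example and are not part of the assignment.

```r
# Hypothetical GDP-per-capita values for four made-up countries.
gdp <- c(500, 1200, 8000, 30000)

# Vectorized arithmetic: log() is applied to every element at once.
log.gdp <- log(gdp)

# sapply() applies a function to each element and collects the results.
squares <- sapply(1:5, function(k) k^2)

# Logical indexing replaces "loop over the values and check a condition".
rich <- gdp[gdp > 1000]
```

Each of these lines does the work that a for loop with an accumulator variable would otherwise do.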

Problem 1: Predicting Life Expectancy

Problem 1(a) (submit on Blackboard)

Write a function called MostAccuratePred that takes in a dataset in the same format as gapminder %>% filter(year == 1982) (i.e., the year is always the same), and finds the 10 countries for which the predictions of life expectancy based on log(gdpPercap) are the most accurate.
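If the predictions come from a linear model fit with lm (an assumption; the problem statement does not name the method), the prediction errors can be read off the fitted model. The sketch below uses a made-up four-row data frame, not gapminder, and only shows how to get one error per row.

```r
# Made-up data (not from gapminder): four hypothetical countries.
toy <- data.frame(country   = c("A", "B", "C", "D"),
                  gdpPercap = c(500, 2000, 8000, 32000),
                  lifeExp   = c(50, 58, 69, 75))

# Fit life expectancy against log GDP per capita.
toy.fit <- lm(lifeExp ~ log(gdpPercap), data = toy)

# residuals() returns actual minus predicted, one value per row.
toy$abs.error <- abs(residuals(toy.fit))
toy
```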

Problem 1(b) (not for submission on Blackboard)

Review the prediction errors (what is a good systematic way of doing that?). Do you notice any patterns? Discuss your hypotheses about the patterns, if any, with the preceptors.

Problem 2: Approximating Linear Regression

In this problem, you will write a function that finds good coefficients for linear regression.

Problem 2(a): Finding a good coefficient for linear regression (submit on Blackboard)

Write a function called my.lm that can be used to find the coefficient in simple linear regression when the intercept is given. The function could be used like this:

my.data <- data.frame(X = c(1, 2, 3), 
                      Y = c(3.1, 4.9, 7.05))

intercept <- 1  # Y ~ 2*X + 1, so the intercept is taken to be 1 here
my.lm(my.data, intercept)  # Returns approximately 2, since Y ~ 2*X + 1

The function should work as follows. First, generate possible values for the coefficient using

seq(-5, 5, 0.1)

##   [1] -5.0 -4.9 -4.8 -4.7 -4.6 -4.5 -4.4 -4.3 -4.2 -4.1 -4.0 -3.9 -3.8 -3.7
##  [15] -3.6 -3.5 -3.4 -3.3 -3.2 -3.1 -3.0 -2.9 -2.8 -2.7 -2.6 -2.5 -2.4 -2.3
##  [29] -2.2 -2.1 -2.0 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9
##  [43] -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4  0.5
##  [57]  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9
##  [71]  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3
##  [85]  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7
##  [99]  4.8  4.9  5.0

Then, of those possible coefficients, find the one that produces the smallest error.
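The "try every candidate and keep the best" idea can be sketched on a simpler, different problem: finding the single constant that best predicts a vector of numbers. The data, the candidate grid, and all names below are made up for illustration, and the sum of squared errors is assumed as the error measure.

```r
# Made-up data for illustration.
y <- c(2.9, 3.2, 3.1, 2.8)

# Candidate values for the constant prediction.
candidates <- seq(0, 5, 0.1)

# Sum of squared errors for each candidate, computed without an explicit loop.
sse <- sapply(candidates, function(c0) sum((y - c0)^2))

# The candidate with the smallest error.
best <- candidates[which.min(sse)]
best
```

Here best should land close to mean(y), which is a useful sanity check: for this simpler problem the exact answer is known.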

Try different inputs, and make sure that the answer you are getting is close to the answer you would expect. Explain to your preceptor how you came up with the different inputs.

Problem 2(b): Finding a good intercept and coefficient (challenge, not for submission on Blackboard)

Now, write a function that finds both a good intercept and a good coefficient. Hint: use a modification of the function in 2(a) that returns both the coefficient and the sum of squared errors it produces. Then repeatedly use that function for every possible intercept hypothesis.

Problem 3: Categorical Variables (submit on Blackboard)

In class, we ran the following:

fit <- lm(gdpPercap ~ continent, data = gapminder)
fit

## 
## Call:
## lm(formula = gdpPercap ~ continent, data = gapminder)
## 
## Coefficients:
##       (Intercept)  continentAmericas      continentAsia  
##              2194               4942               5708  
##   continentEurope   continentOceania  
##             12276              16428

fit$coefficients

##       (Intercept) continentAmericas     continentAsia   continentEurope 
##          2193.755          4942.356          5708.396         12275.721 
##  continentOceania 
##         16427.855

Write a function called predictGdpCont that takes in a vector like fit$coefficients and the name of a continent, and returns the prediction for that continent. Inside the function, you may not refer to gapminder or lm.

For example, the following should run:

fit <- lm(gdpPercap ~ continent, data = gapminder)
predictGdpCont(fit$coefficients, "Asia") # Returns the prediction for Asia
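To see how predict relates to the coefficients of a model with a categorical predictor, it can help to experiment on a tiny made-up dataset first. The sketch below does not use gapminder; the data frame and the names toy, toy.fit, pred.a, and pred.b are hypothetical.

```r
# Made-up data: a numeric outcome and a three-level categorical variable.
toy <- data.frame(value = c(1, 3, 10, 12, 20, 22),
                  group = factor(c("a", "a", "b", "b", "c", "c")))

toy.fit <- lm(value ~ group, data = toy)
coefs <- toy.fit$coefficients  # names: "(Intercept)", "groupb", "groupc"

# Helper to build one-row newdata with the same factor levels as the fit.
new.row <- function(g) data.frame(group = factor(g, levels = levels(toy$group)))

# Predictions for the baseline level "a" and for level "b".
pred.a <- predict(toy.fit, newdata = new.row("a"))
pred.b <- predict(toy.fit, newdata = new.row("b"))

# Compare the predictions with the coefficients and with the group means.
coefs
c(pred.a, pred.b)
tapply(toy$value, toy$group, mean)
```

Note that the first factor level has no coefficient of its own in coefs; comparing the three printed results shows how its prediction is obtained.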

Make sure that the numbers you return correspond to what predict computes. You may assume the order of the coefficients will always be:

"(Intercept)"       "continentAmericas" "continentAsia"     "continentEurope"   "continentOceania"