Midterm 1 Solutions

Q1(a)

Write a function named SpecialSquare which takes in a number, and returns the square of that number, unless the number is 42. If the number is 42, the function returns 0.

SpecialSquare <- function(num){
  if(num == 42){
    return(0)
  }else{
    return(num^2)
  }
}

Q1(b)

Write code that calls a function named SpecialSquare2 using the inputs \(2\) and then \(42\) (so that there would be two calls in total), and prints “All tests passed” if the function returned what the function from Part (a) is supposed to return for both inputs, and “At least one test failed” otherwise. (Think of SpecialSquare2 as a function that another student wrote to answer 1(a).)

SpecialSquare2 <- SpecialSquare # Not part of the solution -- needed
                                # for the code to run

if (SpecialSquare2(2) == 4 && SpecialSquare2(42) == 0){
  cat("All tests passed")
}else{
  cat("At least one test failed")
}

## All tests passed
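
With more than two test cases, the same check can be written as a table of inputs and expected outputs. The following is a minimal sketch, not part of the exam solution: it assumes SpecialSquare2 handles one number at a time (so it is called through sapply), and the names test.inputs, expected and actual are illustrative.

```r
# Stand-in for the function under test (not part of the solution).
SpecialSquare2 <- function(num){
  if(num == 42){
    return(0)
  }else{
    return(num^2)
  }
}

test.inputs <- c(2, 42)   # inputs required by the question
expected    <- c(4, 0)    # what the function from Part (a) should return

# Call the function once per input and collect the results.
actual <- sapply(test.inputs, SpecialSquare2)

if (all(actual == expected)){
  cat("All tests passed")
}else{
  cat("At least one test failed")
}
```

Adding a test case then only requires extending the two vectors.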

Q2(a)

The first several rows of the gapminder dataframe are shown below.

   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.

Write code to print out the number of different countries in Asia that appear in gapminder.

library(dplyr)      # Not part of the solution -- needed
library(gapminder)  # for the code to run

gapminder %>% filter(continent == "Asia") %>%
              select(country) %>%
              n_distinct()

## [1] 33

Q2(b)

Write a function that takes in a year and the dataset gapminder, and returns the name of the continent on which the average GDP per capita was the highest during that year. The average GDP per capita on a continent is the mean of the GDPs per capita of all the countries on that continent.

GetRichestContinent <- function(gapminder, y){
  mGdp <- gapminder %>% filter(year == y) %>% 
                  group_by(continent) %>% 
                  summarize(mean.GDPpercap = mean(gdpPercap)) %>% 
                  arrange(desc(mean.GDPpercap))
  return(mGdp$continent[1])
}
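
One way to sanity-check the logic is to run it on a tiny dataset where the answer can be worked out by hand. The sketch below is an assumption-laden illustration rather than the solution: toy is made-up data with the same column names as gapminder, and GetRichestContinentBase is a hypothetical base R variant that returns a character string (the dplyr version above returns a factor when continent is a factor).

```r
# Hypothetical toy data with the same column names as gapminder.
toy <- data.frame(continent = c("Asia", "Asia", "Europe", "Europe", "Asia", "Europe"),
                  year      = c(1952, 1952, 1952, 1952, 1957, 1957),
                  gdpPercap = c(100, 300, 500, 700, 900, 100),
                  stringsAsFactors = FALSE)

GetRichestContinentBase <- function(gapminder, y){
  # Keep only the rows for year y.
  this.year <- gapminder[gapminder$year == y, ]
  # Mean GDP per capita for each continent in year y.
  mGdp <- tapply(this.year$gdpPercap, this.year$continent, mean)
  # Name of the continent with the largest mean.
  return(names(which.max(mGdp)))
}

GetRichestContinentBase(toy, 1952)   # "Europe": mean 600 vs. 200 for Asia
GetRichestContinentBase(toy, 1957)   # "Asia": 900 vs. 100 for Europe
```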

Q3(a)

Suppose the following code was run to obtain fit:

fit <- lm(Y ~ X1 + X2 + X3, data = my.data)

Write the function RMSE, which computes the Root-Mean-Square Error (RMSE) of a model on a dataset, and then write code to use the function to compute the RMSE of the model fit on the dataset my.data.

### Not part of the answer ###################
my.data <- data.frame(X1 = c(1, 2, 50, 10, 15, 50), 
                      X2 = c(1.3, 2.9, 3.2, 2, 5, 1),
                      X3 = c(0.9, 0.9, 1.5, 3, 6, 2 ),
                      Y = c(3.1, 4.9, 7.05, 4, 7, 3))

fit <- lm(Y ~ X1 + X2 + X3, data = my.data)
##############################################

RMSE <- function(fit, my.data){
  return(sqrt(mean((predict(fit, newdata = my.data) - my.data$Y)^2)))
}

RMSE(fit, my.data)

## [1] 0.3274562
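
As a cross-check, when the dataset is the one the model was fit on, the same number can be obtained from the model's residuals, since the residuals are the differences between the observed and the fitted values. A small sketch using the data above (the names rmse.from.predict and rmse.from.residuals are illustrative):

```r
my.data <- data.frame(X1 = c(1, 2, 50, 10, 15, 50),
                      X2 = c(1.3, 2.9, 3.2, 2, 5, 1),
                      X3 = c(0.9, 0.9, 1.5, 3, 6, 2),
                      Y = c(3.1, 4.9, 7.05, 4, 7, 3))

fit <- lm(Y ~ X1 + X2 + X3, data = my.data)

RMSE <- function(fit, my.data){
  return(sqrt(mean((predict(fit, newdata = my.data) - my.data$Y)^2)))
}

rmse.from.predict   <- RMSE(fit, my.data)
rmse.from.residuals <- sqrt(mean(residuals(fit)^2))

# The two values should agree up to floating-point error.
all.equal(rmse.from.predict, rmse.from.residuals)
```

The residual shortcut only applies to the training data; for any other dataset, the predictions have to be computed with predict, as in the function above.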

Q3(b)

Give two reasons that the RMSE is more informative than the SSE when figuring out whether a model works well.

First, the SSE grows with the number of data points even if the model works well. The RMSE does not, because it averages the squared errors before taking the square root, so RMSEs can be interpreted without taking the size of the dataset into account.
Second, the RMSE is in the same units as the quantity being predicted, so it gives us an idea of how far off the predictions typically are. For example, when predicting the weight of something, the predictions might be off by about 3 grams in general. Since the SSE is a sum of squared errors, it is in squared units and its interpretation is not as straightforward.
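
The first point can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the answer: the prediction errors are simulated directly as normal noise with a standard deviation of 3 (standing in for a model that is off by about 3 units), and the seed and sample sizes are arbitrary.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

sizes <- c(100, 10000)          # two hypothetical dataset sizes
sse   <- numeric(length(sizes))
rmse  <- numeric(length(sizes))

for (i in seq_along(sizes)){
  # Simulated prediction errors of equal quality for both dataset sizes.
  err <- rnorm(sizes[i], mean = 0, sd = 3)
  sse[i]  <- sum(err^2)
  rmse[i] <- sqrt(mean(err^2))
}

data.frame(n = sizes, SSE = sse, RMSE = rmse)
```

The SSE should be roughly 100 times larger for the larger dataset, while both RMSE values should stay close to 3.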

Q4

Explain why it is necessary to split the data into a training and a test set in a predictive modelling setting. Make sure to mention how the training and the test set are used, and what the problem would be with just using the training set.

The training set is used to fit the model parameters/coefficients – that is, to find the parameters/coefficients that produce the best predictions on the training set.

The test set is used for estimating how good the predictions would be on new data.

Because we may be overfitting (finding coefficients that work well for predicting on the training set but not in general), it can be misleading to try to estimate how good the predictions are by looking at the predictions on the training set. On the other hand, the test set is similar to new data in that it wasn’t used for fitting the model parameters, so performance on the test set is a better measure of how well we’ll do on new data.
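
The procedure described above can be sketched on simulated data. Everything in this example is an assumption made for illustration: the data are generated from a straight line plus noise, the split is 20/20, and the flexible model is a degree-10 polynomial chosen only because it can overfit.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Simulated data: a linear relationship plus noise.
n <- 40
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(n, sd = 1)

# Random split into a training set and a test set.
train.idx <- sample(n, size = 20)
train <- sim[train.idx, ]
test  <- sim[-train.idx, ]

RMSEOn <- function(fit, dat){
  return(sqrt(mean((predict(fit, newdata = dat) - dat$y)^2)))
}

# Both models are fit on the training set only.
fit.simple   <- lm(y ~ x, data = train)
fit.flexible <- lm(y ~ poly(x, 10), data = train)

data.frame(model      = c("linear", "degree-10 polynomial"),
           train.RMSE = c(RMSEOn(fit.simple, train), RMSEOn(fit.flexible, train)),
           test.RMSE  = c(RMSEOn(fit.simple, test),  RMSEOn(fit.flexible, test)))
```

The flexible model can never have a higher training RMSE than the linear one, because the linear model is a special case of it. Its test RMSE is typically noticeably worse than its training RMSE, which is the gap that looking at the training set alone would hide.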

Q5

Consider the following code:

> titanic$Pclass <- as.factor(titanic$Pclass)
> glm(Survived ~ Age + Sex + Pclass, family = binomial, data = titanic)

Call:  glm(formula = Survived ~ Age + Sex + Pclass, family = binomial, 
    data = titanic)

Coefficients:
(Intercept)          Age      Sexmale      Pclass2      Pclass3  
    3.63492     -0.03427     -2.58872     -1.19912     -2.45544

How would you compute the probability of survival of a 45-year-old male passenger in first class? Write down what you would need to compute, but there is no need to perform the actual computation. An answer in the style of \(\sqrt{25+2.4^2}\) is perfectly OK. Your answer should not contain R code or any functions that you did not define.

\[\sigma(3.63492 - 0.03427\times 45 - 2.58872)\]

Here,

\[\sigma(z) = \frac{1}{1+\exp(-z)}\]
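
There is no Pclass term in the expression because first class is the baseline level of Pclass, while the Sexmale coefficient is included because the passenger is male. The exam answer should not contain R code, but the expression can be checked numerically after the fact. In the sketch below, Sigma is a helper defined here for illustration; plogis is R's built-in logistic function.

```r
# The logistic function from the answer above.
Sigma <- function(z){
  return(1 / (1 + exp(-z)))
}

# Intercept + Age coefficient * 45 + Sexmale coefficient (first class is the baseline).
z <- 3.63492 - 0.03427 * 45 - 2.58872
p <- Sigma(z)

p           # about 0.378
plogis(z)   # built-in equivalent, should agree with Sigma(z)
```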

Q6

Suppose you are considering whether a variable should be treated as continuous or as categorical. Describe a situation where you would expect that using the categorical version of the variable would lead to a much lower cost on the training set than using the continuous version.

Suppose we are predicting \(y\) using a single input variable \(x\) (say \(y\) is average GPA and \(x\) is precept number). If \(y \approx a_0 + a_1 x\), we’d expect that it doesn’t matter whether we are using \(x\) as continuous or categorical, since we’d get very good predictions either way.

On the other hand, suppose that it is not the case that \(y \approx a_0 + a_1 x\); that is, you cannot draw a straight line through the data points. For example, perhaps the GPA is around 3.5 for odd-numbered precepts, but 3.9 for even-numbered precepts:

gpa <- data.frame(precept = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                  GPA = c(3.5, 3.9, 3.5, 3.9, 3.5, 3.9, 3.5, 3.9, 3.5, 3.9))

library(ggplot2) # Not part of the solution -- needed for the code to run

ggplot(data = gpa, mapping = aes(x = as.factor(precept), y = GPA)) +
  geom_point() +
  ylim(0, 4)

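We can still predict perfectly on the training set if \(x\) is categorical, since the model then fits a separate value for each precept, but not if it is continuous, since no straight line passes through all the points.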

(N.B., you could also use the example from class).
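
The difference in training cost can be checked directly by fitting both versions with lm and comparing the sums of squared errors. This is a sketch for illustration: SSE is a helper defined here, and fit.cont and fit.cat are illustrative names.

```r
gpa <- data.frame(precept = 1:10,
                  GPA = rep(c(3.5, 3.9), times = 5))

# Precept number treated as continuous: a single slope.
fit.cont <- lm(GPA ~ precept, data = gpa)

# Precept number treated as categorical: one coefficient per precept.
fit.cat <- lm(GPA ~ as.factor(precept), data = gpa)

SSE <- function(fit){
  return(sum(residuals(fit)^2))
}

SSE(fit.cont)   # about 0.39
SSE(fit.cat)    # essentially 0, up to floating-point error
```

With one observation per precept, the categorical model has as many coefficients as data points, which is why its training error is zero. That says nothing about how it would do on new data.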

Q7

Consider a dataframe whose first several lines look as follows:

   student GPA year
1      Bob 3.9 2015
2    Alice 4.1 2015
3     Matt 1.3 2015
4      Tim 4.0 2014
5 Samantha 2.0 2014
6     Jane 3.8 2014
7      Bob 2.0 2014

Write a function that takes in a dataframe like the one above, and returns a vector that contains the names of the students whose GPA was above that year’s average in at least one year. For example, Bob’s GPA in 2015 was \(3.9\) and the average for 2015 was \(\frac{3.9 + 4.1 + 1.3}{3} = 3.1\), so Bob was above average in 2015 and would be included, even though his GPA in 2014 was below average. Samantha would not be included (absent additional unseen data), since only her GPA for 2014 is available, and it is below average.

GPAs <- data.frame(student = c("Bob", "Alice", "Matt", "Tim", "Samantha", "Jane", "Bob"),
                   GPA = c(3.9, 4.1, 1.3, 4.0, 2.0, 3.8, 2.0),
                   year = c(2015, 2015, 2015, 2014, 2014, 2014, 2014))

AboveAvgOnce <- function(GPAs){
  student.t <-  GPAs %>% group_by(year) %>% 
                         mutate(year.avg = mean(GPA)) %>% 
                         ungroup() %>% 
                         filter(GPA > year.avg) %>% 
                         select(student) %>% 
                         distinct()
  return(student.t$student)
}

AboveAvgOnce(GPAs)

## [1] Bob   Alice Tim   Jane 
## Levels: Alice Bob Jane Matt Samantha Tim
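
The same computation can also be sketched in base R with ave, which returns, for each row, the mean GPA over all rows from the same year. This is an alternative for comparison, not the solution above: AboveAvgOnceBase is a hypothetical name, and stringsAsFactors = FALSE is set so that a character vector is returned regardless of the R version.

```r
GPAs <- data.frame(student = c("Bob", "Alice", "Matt", "Tim", "Samantha", "Jane", "Bob"),
                   GPA = c(3.9, 4.1, 1.3, 4.0, 2.0, 3.8, 2.0),
                   year = c(2015, 2015, 2015, 2014, 2014, 2014, 2014),
                   stringsAsFactors = FALSE)

AboveAvgOnceBase <- function(GPAs){
  # For every row, the average GPA over all rows from the same year.
  year.avg <- ave(GPAs$GPA, GPAs$year)
  # Keep each qualifying student once, in order of first appearance.
  return(unique(GPAs$student[GPAs$GPA > year.avg]))
}

AboveAvgOnceBase(GPAs)   # "Bob" "Alice" "Tim" "Jane"
```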