---
title: "Midterm 1 Solutions"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(gapminder)
```

### Q1(a)

Write a function named `SpecialSquare` which takes in a number, and returns the square of that number, unless the number is 42. If the number is 42, the function returns 0.

```{r}
SpecialSquare <- function(num){
  if(num == 42){
    return(0)
  }else{
    return(num**2)
  }
}
```

### Q1(b)

Write code that calls a function named `SpecialSquare2` using the inputs $2$ and then $42$ (so that there would be two calls in total), and prints "All tests passed" if the function returned what the function from Part (a) is supposed to return for both inputs, and "At least one test failed" otherwise. (Think of `SpecialSquare2` as a function that another student wrote to answer 1(a).)

```{r}
SpecialSquare2 <- SpecialSquare # Not part of the solution -- needed
                                # for the code to run

if (SpecialSquare2(2) == 4 & SpecialSquare2(42) == 0){
  cat("All tests passed")
}else{
  cat("At least one test failed")
}
```

### Q2(a)

The first several rows of the `gapminder` dataframe are shown below.

          country continent year lifeExp      pop gdpPercap
    1 Afghanistan      Asia 1952    28.8  8425333      779.
    2 Afghanistan      Asia 1957    30.3  9240934      821.
    3 Afghanistan      Asia 1962    32.0 10267083      853.
    4 Afghanistan      Asia 1967    34.0 11537966      836.

Write code to print out the number of different countries in Asia that appear in `gapminder`.

```{r}
gapminder %>% filter(continent == "Asia") %>%
              select(country) %>%
              n_distinct()
```

### Q2(b)

Write a function that takes in a year and the dataset `gapminder`, and returns the name of the continent on which the average GDP per capita was the highest during that year. The average GDP per capita on a continent is the mean of the GDPs per capita of all the countries on that continent.

```{r}
GetRichestContinent <- function(gapminder, y){
  # Mean GDP per capita of each continent during year y,
  # sorted from richest to poorest
  mGdp <- gapminder %>% filter(year == y) %>%
                        group_by(continent) %>%
                        summarize(mean.GDPpercap = mean(gdpPercap)) %>%
                        arrange(desc(mean.GDPpercap))
  return(mGdp$continent[1])
}
```

### Q3(a)

Suppose the following code was run to obtain `fit`

    fit <- lm(Y ~ X1 + X2 + X3, data = my.data)

Write the function `RMSE`, which computes the Root-Mean-Square Error (RMSE) of a model on a dataset, and then write code to use the function to compute the RMSE of the model `fit` on the dataset `my.data`.

```{r}
### Not part of the answer ###################
my.data <- data.frame(X1 = c(1, 2, 50, 10, 15, 50),
                      X2 = c(1.3, 2.9, 3.2, 2, 5, 1),
                      X3 = c(0.9, 0.9, 1.5, 3, 6, 2),
                      Y = c(3.1, 4.9, 7.05, 4, 7, 3))
fit <- lm(Y ~ X1 + X2 + X3, data = my.data)
##############################################

RMSE <- function(fit, my.data){
  return(sqrt(mean((predict(fit, newdata = my.data) - my.data$Y)**2)))
}

RMSE(fit, my.data)
```

### Q3(b)

Give two reasons that the RMSE is more informative than the SSE when figuring out whether a model works well.

1. The SSE grows with the number of datapoints even if the model works well. The RMSE does not, since we compute the *mean* squared error. That means that we do not need to consider the size of the dataset when looking at the RMSE.

2. The RMSE gives us an idea of how far off the predictions typically are -- e.g., the predictions might be off by about 3 grams in general when we're predicting the weight of something. Since the SSE is the sum of *squared* errors, the interpretation of that quantity is not as straightforward.
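Both points can be seen in a quick simulation (a sketch, not part of the answer): we generate prediction errors whose typical size is about 3 units, and compare the SSE and the RMSE for a small and a large dataset.

```{r}
# Not part of the answer: errors of typical size ~3 units,
# for datasets of two different sizes
set.seed(1)
for (n in c(10, 10000)) {
  errors <- rnorm(n, mean = 0, sd = 3)  # predictions off by about 3 units
  cat("n =", n,
      "  SSE =", round(sum(errors**2)),
      "  RMSE =", round(sqrt(mean(errors**2)), 2), "\n")
}
```

The SSE grows roughly in proportion to the number of datapoints, while the RMSE stays near 3 for both dataset sizes, matching the interpretation in point 2.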
### Q4

Explain why it is necessary to split the data into a training and a test set in a predictive modelling setting. Make sure to mention how the training and the test set are used, and what the problem would be with just using the training set.

The training set is used to fit the model parameters/coefficients -- that is, to find the parameters/coefficients that produce the best predictions on the training set. The test set is used for estimating how good the predictions would be on new data. Because we may be overfitting -- finding coefficients that work well for predicting on the training set but not in general -- it can be misleading to estimate how good the predictions are by looking at the predictions on the training set. On the other hand, the test set is similar to new data in that it wasn't used for fitting the model parameters, so performance on the test set is a better measure of how well we'll do on new data.

### Q5

Consider the following code:

    > titanic$Pclass <- as.factor(titanic$Pclass)
    > glm(Survived ~ Age + Sex + Pclass, family = binomial, data = titanic)

    Call:
    glm(formula = Survived ~ Age + Sex + Pclass, family = binomial,
        data = titanic)

    Coefficients:
    (Intercept)          Age      Sexmale      Pclass2      Pclass3
        3.63492     -0.03427     -2.58872     -1.19912     -2.45544

How would you compute the probability of survival of a 45-year-old male passenger in first class? Write down what you would need to compute, but there is no need to perform the actual computation. An answer in the style of $\sqrt{25+2.4^2}$ is perfectly OK. Your answer should not contain R code or any functions that you did not define.

$$\sigma(3.63492 - 0.03427\times 45 - 2.58872)$$

Here,

$$\sigma(z) = \frac{1}{1+\exp(-z)}$$

(Since first class is the baseline level of `Pclass`, the `Pclass2` and `Pclass3` coefficients do not enter the computation.)

### Q6

Suppose you are considering whether a variable should be treated as continuous or as categorical. Describe a situation where you would expect that using the categorical version of the variable would lead to a much lower cost on the training set than using the continuous version.

Suppose we are predicting $y$ using a single input variable $x$ (say $y$ is average GPA and $x$ is precept number). If $y \approx a_0 + a_1 x$, we'd expect that it doesn't matter whether we are using $x$ as continuous or categorical, since we'd get very good predictions either way.

On the other hand, suppose that it is not the case that $y \approx a_0 + a_1 x$ -- that is, you cannot draw a straight line through the data points. For example, perhaps the GPA is around 3.5 for odd-numbered precepts, but 3.9 for even-numbered precepts:

```{r}
gpa <- data.frame(precept = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                  GPA = c(3.5, 3.9, 3.5, 3.9, 3.5, 3.9, 3.5, 3.9, 3.5, 3.9))
ggplot(data = gpa, mapping = aes(x = as.factor(precept), y = GPA)) +
       geom_point() +
       ylim(0, 4)
```

We can still predict perfectly if $x$ is categorical, but not if it is continuous, as the sketch below demonstrates. (N.B., you could also use the example from class.)
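Here is that comparison made concrete (a sketch using the `gpa` dataframe defined above; not part of the answer): fit the same model with `precept` treated as continuous and as categorical, and compare the training RMSEs.

```{r}
# Not part of the answer: compare the two treatments of precept
fit.cont <- lm(GPA ~ precept, data = gpa)             # precept as continuous
fit.cat  <- lm(GPA ~ as.factor(precept), data = gpa)  # precept as categorical

sqrt(mean(residuals(fit.cont)**2))  # roughly 0.2: a line can't fit the zigzag
sqrt(mean(residuals(fit.cat)**2))   # essentially 0: one coefficient per precept
```

With one coefficient per precept, the categorical model can match each precept's GPA exactly, while the continuous model is constrained to a straight line.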
### Q7

Consider a dataframe whose first several lines look as follows

       student GPA year
    1      Bob 3.9 2015
    2    Alice 4.1 2015
    3     Matt 1.3 2015
    4      Tim 4.0 2014
    5 Samantha 2.0 2014
    6     Jane 3.8 2014
    7      Bob 2.0 2014

Write a function that takes in a dataframe like the one above, and returns a vector that contains the names of the students whose GPA was above average for the year during at least one year. For example, Bob's GPA in 2015 was $3.9$ and the average for 2015 was $\frac{3.9 + 4.1 + 1.3}{3} = 3.1$, so Bob was above average in 2015 and would be included, even though his GPA in 2014 was below average. Samantha, on the other hand, would not be included: only her GPA for 2014 is available, and it is below the average for that year.

```{r}
GPAs <- data.frame(student = c("Bob", "Alice", "Matt", "Tim",
                               "Samantha", "Jane", "Bob"),
                   GPA = c(3.9, 4.1, 1.3, 4.0, 2.0, 3.8, 2.0),
                   year = c(2015, 2015, 2015, 2014, 2014, 2014, 2014))

AboveAvgOnce <- function(GPAs){
  student.t <- GPAs %>% group_by(year) %>%
                        mutate(year.avg = mean(GPA)) %>%  # each year's average
                        ungroup() %>%
                        filter(GPA > year.avg) %>%        # above-average rows
                        select(student) %>%
                        distinct()                        # each student once
  return(student.t$student)
}

AboveAvgOnce(GPAs)
```
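As a quick check in the style of Q1(b) (a sketch, not part of the answer): the yearly averages in the sample dataframe are $3.1$ for 2015 and $2.95$ for 2014, so the function should return Bob, Alice, Tim, and Jane (in some order), and exclude Matt and Samantha.

```{r}
# Not part of the answer: compare the sorted output to the expected set
expected <- c("Bob", "Alice", "Tim", "Jane")
if (identical(sort(AboveAvgOnce(GPAs)), sort(expected))){
  cat("All tests passed")
}else{
  cat("At least one test failed")
}
```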