In the babynames dataset, we computed the number of distinct names per capita, as well as the total number of distinct names, for each year. In plain English, what questions could be answered with each of those measures? What are the arguments for using each measure?
Explain the effect of putting the x-axis of a scatterplot on a log scale. Give specific examples of how curves plotted in regular Cartesian coordinates are transformed when the x-axis and y-axis are changed to a log scale.
Explain the use of “training sets” for building models for predicting, for example, house prices.
How do you obtain predictions from a linear regression model?
How can you measure how well a linear regression model is working? Be precise.
State the criterion according to which lm() selects the best coefficients.
Why might transforming inputs (e.g., using the log function) lead to better predictions?
Explain the difference between categorical variables and continuous variables. Explain why some variables might be plausibly thought of as either categorical or continuous.
In the case of categorical variables, state the criterion for the optimal coefficients that lm() uses.
Write down lm()’s cost function when using several categorical variables.
When doing prediction using logistic regression, how do we obtain a probability? How do we obtain guesses (about 0 or 1)?
What is the false positive rate? In the context of detecting disease, why are false positive rates important?
What is the false negative rate? In the context of detecting disease, why are false negative rates important?
What is the positive predictive value? In the context of detecting disease, why is the positive predictive value important?
Why is it important to use a test set (rather than simply using the training set for everything)?
What is overfitting? Explain overfitting by analogy with “teaching to the test”.
Explain the difference between stat = "identity" and stat = "count" when using geom_bar().
What are two ways to display two curves on the same graph at the same time using ggplot? Explain with reference to the kinds of data frames with which those two ways are compatible.