SML 201 Midterm 1 Theory Study Problems

  1. In the babynames dataset, we computed both the number of distinct names per capita and the total number of distinct names for each year. In plain English, what questions could each of these measures answer? What are the arguments for using each measure?

  2. Explain the effect of putting the x-axis of a scatterplot on a log scale. In particular, if the points in the scatterplot follow a curve, what happens to the curve? Why might this be useful?

  3. Explain why the chart on p.6 of the Intro to DataViz slides is misleading. (N.B., Healy’s book discusses this chart.)

  4. How can the fact that color “pops out” more than shape be used in creating effective DataViz?

  5. What are Gestalt rules? How are Gestalt rules related to misleading DataViz?

  6. Explain the use of “training sets” for building models for predicting, for example, house prices.

  7. How do you obtain predictions from a linear regression model?

  8. How can you measure how well a linear regression model is working? Be precise.

  9. State the criterion that lm() uses to choose the optimal coefficients.

  10. Why might transforming inputs (e.g., using the \(\log\) function) lead to better predictions?

  11. Explain the difference between categorical variables and continuous variables. Explain why some variables might be plausibly thought of as either categorical or continuous.

  12. In the case of categorical variables, state the criterion for the optimal coefficients that lm() uses.

  13. Explain why we have one fewer dummy variable than we have categories when using a categorical variable.

  14. Write down the cost function when using several categorical variables.

  15. When doing prediction using logistic regression, how do we obtain a probability? How do we obtain guesses (about 0 or 1)?

  16. Recall that the cost function used when running logistic regression is \[-\sum_i \left(y^{(i)}\log p^{(i)} + (1-y^{(i)})\log(1-p^{(i)})\right)\]

Explain how each term in the formula is computed, and why the formula makes sense. (N.B., you are not expected to remember this formula, only to be aware of its existence and to be able to explain its terms.)

  17. What are false positive rates? In the context of detecting disease, why are false positive rates important?

  18. What are false negative rates? In the context of detecting disease, why are false negative rates important?

  19. What is the positive predictive value? In the context of detecting disease, why is the positive predictive value important?

  20. Why is it important to use a test set rather than simply using the training set?
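
As a self-check for the disease-detection problems above, the three quantities can be computed from the four counts of a confusion matrix. The following is a minimal sketch in Python (not the R used with lm()). The function name `screening_rates` and the counts are hypothetical, chosen only for illustration.

```python
def screening_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate, positive predictive value).

    tp, fp, tn, fn are counts of true positives, false positives,
    true negatives, and false negatives.
    """
    fpr = fp / (fp + tn)  # share of people without the disease who test positive
    fnr = fn / (fn + tp)  # share of people with the disease who test negative
    ppv = tp / (tp + fp)  # share of positive tests that come from people with the disease
    return fpr, fnr, ppv


# Hypothetical counts: 1,000 people tested, 10 of whom have the disease.
fpr, fnr, ppv = screening_rates(tp=9, fp=99, tn=891, fn=1)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}, PPV = {ppv:.3f}")
```

With these made-up counts, the test looks accurate by both error rates, yet only about 8% of positive results correspond to people who actually have the disease, because the disease is rare in the tested group.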