## SML 201 Midterm 2 Theory Study Problems

1. Explain how to use a validation set when selecting variables. What are the training and the test sets used for?

2. The prediction for lifeExp is made using $$\text{lifeExp} \approx 12.26 + 5.34\log(\text{gdpPercap}) + a_{\text{asia}}I_{\text{asia}} + \ldots$$. (See here, slide 2.) Do the algebra to explain why an increase by 1 in $$\log(\text{gdpPercap})$$ corresponds to an increase by a factor of approximately $$2.7$$ in gdpPercap.

3. For slide 2 here, explain how to interpret the coefficients that correspond to categories in continent. Relate the coefficients to the differences in predictions for Africa and the other continents.

4. If we observe an association between A and B, list the reasons the relationship might not be causal. Tell “stories” about the potential reasons that longer time spent on studying is associated with higher GPAs.

5. How do we compute the fair odds of an event?

6. Suppose the fair odds for candidate A winning are 2:5. Explain what these odds mean as the terms of a bet. Compute the probability that candidate A will win.

7. Explain how to obtain all the numbers on slide 7.

8. For all of the following, define the criterion, and explain whether a violation of the criterion would constitute disparate impact, disparate treatment, both, or neither: demographic parity; accuracy parity; true positive parity; predictive value parity; fairness through unawareness.

9. What are the two ways we discussed of achieving accuracy parity?

10. Explain the importance of the distinction between reoffending and being re-arrested in the context of model fairness.

11. Why might sample size disparity contribute to unfairness?

12. Suppose we are trying to estimate the bias (a.k.a. weight) of a coin by tossing it many times. Sketch what the graph of the estimated bias versus the number of tosses looks like. Explain the intuition for why it looks the way it does.

13. Explain why $$f(k; p) = p^k (1-p)^{1-k}$$ makes sense as the pmf for a Bernoulli distribution (work out what the expression equals for $$k = 0$$ and $$k = 1$$).

14. What is the pmf for the number of Heads we would get if we were tossing a fair coin twice?

15. What is the cumulative distribution function (cdf) of the number of Heads we would get if we were tossing a fair coin twice?

16. Describe the strategies for variable selection based on the performance on the validation set of models that include/don’t include particular variables. (See the last slide here.)

17. Define the p-value. Why is a p-value only defined in reference to a particular null hypothesis?

18. Give an example of a p-value calculation where we assume the Binomial distribution. Explain how in the case of a Binomial model, the Gaussian approximation can be used to compute the p-value.

19. Explain the idea behind using fake data simulation to approximate p-values.

20. What is a null hypothesis?

21. Without using rt, generate datapoints from a t-distribution with 10 degrees of freedom.

22. What is the null hypothesis in the Darwin’s Finches example? If the null hypothesis is true, what quantity would be t-distributed? How can we use pt to compute the p-value in the Darwin’s Finches example?

23. What does a low p-value mean? What does a high p-value mean?

24. Suppose we have a sample of 5 heights of Princeton students x = c(7.1, 6.2, 5.5, 5.4, 6.0). Assuming that the heights of Princeton students are normally distributed, how can we test the hypothesis that the average height of a Princeton student is 5.9?

25. Sketch a boxplot. Explain all the components of the boxplot.

26. Define Type I and Type II errors.

27. For a scenario of your choice where the null hypothesis is true, sketch a graph of the probability of Type I error versus sample size. Explain the choices you made.

28. In project 2, you computed about 7000 t statistics. How could you use an individual statistic to test a hypothesis? (Provide R code involving pt.) What would be the problem with conducting 7000 separate hypothesis tests?

29. What is the point of pre-registering null hypotheses?

30. What is the problem with the “file drawer effect”? How would publishing negative results help the scientific community?

31. What is p-hacking?

32. Explain what happened in “Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon.”

33. Explain why a Type II error would occur very often if you toss a coin 5 times and have a null hypothesis about the probability of Heads. Support your argument using outputs of pnorm.

34. What are Type M and Type S errors?

35. What are some appropriate uses of Q-tips?
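
For problem 24, one possible approach is a one-sample t-test computed by hand with pt. The following is a minimal sketch and not the only acceptable answer. It assumes a two-sided alternative and that the heights are in feet, neither of which is stated in the problem. The names mu0, se, t_stat, and p_value are illustrative.

```r
# Sample of heights from problem 24 (units assumed to be feet)
x <- c(7.1, 6.2, 5.5, 5.4, 6.0)
mu0 <- 5.9                       # hypothesized mean under the null hypothesis

n <- length(x)
se <- sd(x) / sqrt(n)            # estimated standard error of the sample mean
t_stat <- (mean(x) - mu0) / se   # t-distributed with n - 1 df if the null is true

# Two-sided p-value (assumption: the alternative is two-sided)
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Cross-check against R's built-in one-sample t-test
builtin <- t.test(x, mu = mu0)
print(all.equal(p_value, builtin$p.value))
```

The same pattern of statistic, degrees of freedom, and pt applies to a single t statistic in problem 28, provided the degrees of freedom match how that statistic was computed.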