SML310 Mini-Project 1: Statistical Inference and Multi-Level Models

Overview

For this mini-project, we are concerned with statistical inference and generative models. You will be “generating” fake data – i.e., writing functions that use random-number generators to create a new dataset. If you are using R, you may find Ch. 5 of Shalizi helpful.

Part 1: Low P-Values (20 pts)

For this part, you should demonstrate that low p-values do not necessarily imply an important association between two variables.

Write a function to generate a dataset \((X, Y)\) containing \(n\) data points, such that

\[Y^{(i)}=a_0 + a_1 X^{(i)} + \epsilon_i \text{ , with } \epsilon_i\sim N(0, \sigma) \]

Set the parameters to be such that the relationship between \(X\) and \(Y\) is very weak. State what the parameters are, and plot \(Y\) vs. \(X\).

Now, show that for larger \(n\), you will nearly always reject the hypothesis that \(a_1 = 0\), even if \(a_1\) is very close to 0. Make an appropriate figure to demonstrate this fact. Think of a way to make the figure convincing.

Part 2: Lurking Variables

Part 2(a): Warm-up (15 pts)

Generate a dataset \((X_1, X_2, Y)\) where the correlation between \(X_1\) and \(Y\) is negative, but the coefficient \(a_1\) for \(X_1\) is positive when you run the regression

\[Y = a_0 + a_1 X_1 + a_2 X_ 2.\]

State how you created the dataset, and show that it conforms to the required property.

Part 2(b): Challenge (10 pts)

Now, come up with a (semi-)plausible scenario where \(X_2\) is a lurking variable. That is, in your report, come up with a story about the data \((X_1, X_2, Y)\). Now, re-generate the data in a way that makes clear that the data could be generated if the scenario actually happened. In order to constrain the task, the scenario has to take place at an Ivy League university.

Include your code, an explanation, and a demonstration of the effect of \(X_2\) as a lurking variable.

For this question, you may not consult with other people regarding general ideas

Part 3: Free Throws

Free throws are an important part of basketball. Here, we will be using a dataset of the number of free throws that Shaquille O’Neal, one of the greatest basketball players in the last few decades who was also known for his poor free-throwing ability, has taken in 23 different games. The total number of attempted free throws is recorded, as well as the number of times that the attempts were successful.

lines <- 
"Game   Scored  N.Attempts
1   4   5
2   5   11
3   5   14
4   5   12
5   2   7
6   7   10
7   6   14
8   9   15
9   4   12
10  1   4
11  13  27
12  5   17
13  6   12
14  9   9
15  7   12
16  3   10
17  8   12
18  1   6
19  18  39
20  3   13
21  10  17
22  1   6
23  3   12"
con <- textConnection(lines)
shaq <- read.csv(con, sep="")
shaq

##    Game Scored N.Attempts
## 1     1      4          5
## 2     2      5         11
## 3     3      5         14
## 4     4      5         12
## 5     5      2          7
## 6     6      7         10
## 7     7      6         14
## 8     8      9         15
## 9     9      4         12
## 10   10      1          4
## 11   11     13         27
## 12   12      5         17
## 13   13      6         12
## 14   14      9          9
## 15   15      7         12
## 16   16      3         10
## 17   17      8         12
## 18   18      1          6
## 19   19     18         39
## 20   20      3         13
## 21   21     10         17
## 22   22      1          6
## 23   23      3         12

We are supplying the following Stan code that you can use in completing the following questions (you will need to modify it with some of them).

shaq_model_stan <- "
data{
  int<lower=0> N;
  int scored[N];
  int attempted[N];
}

parameters{
  real mu;
  real<lower=0> sigma;
  vector[N] alpha_norm;
}

transformed parameters{
  real alpha[N];
  for(n in 1:N)
    alpha[n] = mu + sigma * alpha_norm[n];
}

model{
  alpha_norm ~ normal(0, 1);
  scored ~ binomial(attempted, inv_logit(alpha));
}"

Part 3(a): Good Days and Bad Days (10)

We are interested in determining the true probability of successfully scoring a free throw in each game. What is a hierarchical model that would make sense if Shaq had a different probability of scoring on different days?

Part 3(b): Fit a hierarchical model in Stan (15)

Fit the model you proposed using Stan. Display the posterior distribution for each of the parameters of the model.

From the posteriors that you plotted, does it appear that Shaq has good days and bad days? What are you basing this conclusion on?

Explain each line of the Stan code you are using. (You may want to review the Binomial distribution)

Part 3(c): No-pooling, partial pooling, and complete pooling (15)

Suppose we want to model the probability of Shaq’s scoring a free throw, for each game. What are the no-pooling, partial-pooling, and complete-pooling hierarchical models that would be used to estimate the probabilities?

Run the models, and display the posterior distributions for each day using each of the pooling methods. One way to display the distributions it to display the 95% predictive interval for each day, similarly to what we did with Minnesota counties in lecture. (But note that you’d be displaying e.g. the 70% predictive interval, i.e., an interval that contains the parameter with 70% probability, rather than the confidence interval)

Report quality (10 pts)

Your report should be readable and reproducible. 10 pts will be awarded for very readable and professional reports. 5 pts will be awarded for reports that are readable with some effort. Points will be deducted for lack of reproducibility.

Note that the quality of the write-up is graded both separately (under the Report quality rubric) and when grading your answers to the individual questions.