For this mini-project, we are concerned with statistical inference and generative models. You will be “generating” fake data – i.e., writing functions that use random-number generators to create a new dataset. If you are using R, you may find Ch. 5 of Shalizi helpful.
For this part, you should demonstrate that low p-values do not necessarily imply an important association between two variables.
Write a function to generate a dataset \((X, Y)\) containing \(n\) data points, such that
\[Y^{(i)}=a_0 + a_1 X^{(i)} + \epsilon_i \text{ , with } \epsilon_i\sim N(0, \sigma) \]
Set the parameters to be such that the relationship between \(X\) and \(Y\) is very weak. State what the parameters are, and plot \(Y\) vs. \(X\).
Now, show that for larger \(n\), you will nearly always reject the hypothesis that \(a_1 = 0\), even if \(a_1\) is very close to 0. Make an appropriate figure to demonstrate this fact. Think of a way to make the figure convincing.
Generate a dataset \((X_1, X_2, Y)\) where the correlation between \(X_1\) and \(Y\) is negative, but the coefficient \(a_1\) for \(X_1\) is positive when you run the regression
\[Y = a_0 + a_1 X_1 + a_2 X_ 2.\]
State how you created the dataset, and show that it conforms to the required property.
Now, come up with a (semi-)plausible scenario where \(X_2\) is a lurking variable. That is, in your report, come up with a story about the data \((X_1, X_2, Y)\). Now, re-generate the data in a way that makes clear that the data could be generated if the scenario actually happened. In order to constrain the task, the scenario has to take place at an Ivy League university.
Include your code, an explanation, and a demonstration of the effect of \(X_2\) as a lurking variable.
For this question, you may not consult with other people regarding general ideas
lines <-
"Game Scored N.Attempts
1 4 5
2 5 11
3 5 14
4 5 12
5 2 7
6 7 10
7 6 14
8 9 15
9 4 12
10 1 4
11 13 27
12 5 17
13 6 12
14 9 9
15 7 12
16 3 10
17 8 12
18 1 6
19 18 39
20 3 13
21 10 17
22 1 6
23 3 12"
con <- textConnection(lines)
shaq <- read.csv(con, sep="")
shaq
## Game Scored N.Attempts
## 1 1 4 5
## 2 2 5 11
## 3 3 5 14
## 4 4 5 12
## 5 5 2 7
## 6 6 7 10
## 7 7 6 14
## 8 8 9 15
## 9 9 4 12
## 10 10 1 4
## 11 11 13 27
## 12 12 5 17
## 13 13 6 12
## 14 14 9 9
## 15 15 7 12
## 16 16 3 10
## 17 17 8 12
## 18 18 1 6
## 19 19 18 39
## 20 20 3 13
## 21 21 10 17
## 22 22 1 6
## 23 23 3 12
We are supplying the following Stan code that you can use in completing the following questions (you will need to modify it with some of them).
shaq_model_stan <- "
data{
int<lower=0> N;
int scored[N];
int attempted[N];
}
parameters{
real mu;
real<lower=0> sigma;
vector[N] alpha_norm;
}
transformed parameters{
real alpha[N];
for(n in 1:N)
alpha[n] = mu + sigma * alpha_norm[n];
}
model{
alpha_norm ~ normal(0, 1);
scored ~ binomial(attempted, inv_logit(alpha));
}"
We are interested in determining the true probability of successfully scoring a free throw in each game. What is a hierarchical model that would make sense if Shaq had a different probability of scoring on different days?
Fit the model you proposed using Stan. Display the posterior distribution for each of the parameters of the model.
From the posteriors that you plotted, does it appear that Shaq has good days and bad days? What are you basing this conclusion on?
Explain each line of the Stan code you are using. (You may want to review the Binomial distribution)
Suppose we want to model the probability of Shaq’s scoring a free throw, for each game. What are the no-pooling, partial-pooling, and complete-pooling hierarchical models that would be used to estimate the probabilities?
Run the models, and display the posterior distributions for each day using each of the pooling methods. One way to display the distributions it to display the 95% predictive interval for each day, similarly to what we did with Minnesota counties in lecture. (But note that you’d be displaying e.g. the 70% predictive interval, i.e., an interval that contains the parameter with 70% probability, rather than the confidence interval)
Your report should be readable and reproducible. 10 pts will be awarded for very readable and professional reports. 5 pts will be awarded for reports that are readable with some effort. Points will be deducted for lack of reproducibility.
Note that the quality of the write-up is graded both separately (under the Report quality rubric) and when grading your answers to the individual questions.