Problem 1 (Practice generating fake data)
Consider a scenario in which each data point consists of:
- Tar level
- Asbestos level
- Cancer (a binary variable)
For this scenario:
- Draw a causal graph.
- Propose a data model (i.e., specify the distribution of each variable).
- Demonstrate what happens when you regress asbestos level on tar level, both with and without controlling for cancer. (Note: “regress Y on X” means predicting Y from X.)
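One possible way to approach the last step is sketched below. It assumes the causal graph Tar → Cancer ← Asbestos, with tar and asbestos independent of each other. The standardized normal exposure levels, the logistic link, and every coefficient are illustrative assumptions rather than part of the problem statement, and the helper `ols` is a hypothetical name introduced only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Assumed data model: tar and asbestos are independent standardized levels,
# and both raise the probability of cancer through a logistic link.
tar = rng.normal(0.0, 1.0, n)
asbestos = rng.normal(0.0, 1.0, n)
logit = -1.0 + 1.5 * tar + 1.5 * asbestos
cancer = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

def ols(y, *predictors):
    """Return least-squares coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Regress asbestos on tar alone (not controlling for cancer).
slope_uncontrolled = ols(asbestos, tar)[1]

# Regress asbestos on tar and cancer (controlling for cancer).
slope_controlled = ols(asbestos, tar, cancer)[1]

print(f"tar slope, not controlling for cancer: {slope_uncontrolled:+.3f}")
print(f"tar slope, controlling for cancer:     {slope_controlled:+.3f}")
```

Under these assumptions, the uncontrolled slope should be close to zero, while the controlled slope should be clearly negative even though tar has no effect on asbestos. Cancer is a common effect of both variables, so controlling for it induces an association between them.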
Problem 2 (Why “controlling” and “conditioning” are kind of the same thing)
In this problem, you will investigate the relationship between controlling for a variable and conditioning on it.
Each data point consists of:
- Hours spent studying per day
- Major (either “Math” or “Physics”; assume those are the only majors available)
- GPA
Come up with a plausible model. Draw the causal graph and write down the data model equations. Generate fake data.
Using the fake data you generated, display P(GPA | hours ≥ 8, major = Math) and P(GPA | hours ≥ 8, major = Physics) as histograms. Note the difference between their means. Then set up a regression and use it to estimate that difference.
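A minimal sketch of one way to carry this out follows. It assumes a model in which major affects both hours and GPA, and hours affect GPA. All distributions and coefficients are made up for illustration. The histograms are computed with `np.histogram` rather than drawn, to keep the example dependent on NumPy only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Assumed data model (illustrative numbers only):
#   major -> hours, major -> GPA, hours -> GPA
is_math = rng.binomial(1, 0.5, n)  # 1 = Math, 0 = Physics
hours = np.clip(rng.normal(7.0 - 1.0 * is_math, 2.0), 0.0, 16.0)
noise = rng.normal(0.0, 0.25, n)
gpa = np.clip(2.0 + 0.08 * hours + 0.3 * is_math + noise, 0.0, 4.0)

# Condition on hours >= 8 and split by major.
heavy = hours >= 8
gpa_math = gpa[heavy & (is_math == 1)]
gpa_phys = gpa[heavy & (is_math == 0)]
diff_means = gpa_math.mean() - gpa_phys.mean()

# Histogram estimates of P(GPA | hours >= 8, major) on a shared set of bins.
bins = np.linspace(0.0, 4.0, 21)
hist_math, _ = np.histogram(gpa_math, bins=bins, density=True)
hist_phys, _ = np.histogram(gpa_phys, bins=bins, density=True)

# Regression 1: within the hours >= 8 subset, regress GPA on the major indicator.
X_sub = np.column_stack([np.ones(heavy.sum()), is_math[heavy]])
coef_sub, *_ = np.linalg.lstsq(X_sub, gpa[heavy], rcond=None)

# Regression 2: on all data, regress GPA on hours and the major indicator.
X_full = np.column_stack([np.ones(n), hours, is_math])
coef_full, *_ = np.linalg.lstsq(X_full, gpa, rcond=None)

print(f"difference of conditional means:        {diff_means:.4f}")
print(f"major coefficient, subset regression:   {coef_sub[1]:.4f}")
print(f"major coefficient, controlling for hours: {coef_full[2]:.4f}")
```

The subset regression reproduces the difference of means exactly, because regressing on a single binary indicator always yields the gap between the two group means. The full regression that controls for hours should give a similar but not identical number: the two majors still differ slightly in average hours among students with hours ≥ 8, and that contributes to the raw gap. To draw the histograms rather than just compute them, `gpa_math` and `gpa_phys` could be passed to a plotting routine such as matplotlib's `hist`.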