Problem 1 (Practice generating fake data)

Consider a scenario where a data point is:

  1. Draw a causal graph for this situation.
  2. Come up with a data model for this situation (i.e., indicate the distribution of each variable).
  3. Demonstrate what happens when you regress asbestos level on tar level, both with and without controlling for cancer. (Note: “regress Y on X” means predicting Y from X.)
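
As a starting point for part 3, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration, since the scenario does not pin down a model: tar and asbestos levels are taken to be independent standardized normals, cancer is a binary common effect of both (tar → cancer ← asbestos), and the coefficient 2.0 in the logistic link is arbitrary. The helper names `ols` and `simulate` are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X, with an intercept prepended."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # coef[0] is the intercept


def simulate(n=20_000, seed=0):
    """Return the slope on tar without and with cancer as a control."""
    rng = np.random.default_rng(seed)

    # Assumed data model (not specified by the problem):
    tar = rng.normal(0.0, 1.0, n)       # standardized tar level
    asbestos = rng.normal(0.0, 1.0, n)  # standardized asbestos level, independent of tar

    # Cancer is assumed to be a common effect (collider) of both exposures.
    p_cancer = 1.0 / (1.0 + np.exp(-2.0 * (tar + asbestos)))
    cancer = rng.binomial(1, p_cancer).astype(float)

    # Regress asbestos on tar alone.
    naive = ols(tar, asbestos)[1]
    # Regress asbestos on tar while controlling for cancer.
    controlled = ols(np.column_stack([tar, cancer]), asbestos)[1]
    return naive, controlled


naive_slope, controlled_slope = simulate()
print(f"asbestos ~ tar:          slope on tar = {naive_slope:+.3f}")
print(f"asbestos ~ tar + cancer: slope on tar = {controlled_slope:+.3f}")
```

Under these assumptions the first slope should sit near zero, while the second should come out clearly negative: controlling for a common effect induces an association between two variables that were generated independently.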

Problem 2 (Why “controlling” and “conditioning” are kind of the same thing)

In this problem, you will be investigating the relationship between controlling and conditioning on a variable.

A data point will consist of:

Come up with a plausible model. Draw the causal graph and write down the data model equations. Generate fake data.
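
One possible data model, written out as equations, might look like the following. It is only an illustration: the graph (major → hours, major → GPA, hours → GPA), the symbols \(M\) (1 for Math, 0 for Physics) and \(H\) (hours), and every numeric parameter are assumptions, not part of the problem.

\[
\begin{aligned}
M &\sim \text{Bernoulli}(0.5) \\
H \mid M &\sim \mathcal{N}(7.5 - 0.5\,M,\; 1.5^2) \\
\text{GPA} \mid H, M &\sim \mathcal{N}(2.0 + 0.1\,H + 0.2\,M,\; 0.3^2)
\end{aligned}
\]

Each line corresponds to one node of the graph, conditioned on its parents, so fake data can be generated by sampling the lines from top to bottom.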

Using the fake data you generated, display \(P(\text{GPA} \mid \text{hours} \geq 8, \text{major} = \text{Math})\) and \(P(\text{GPA} \mid \text{hours} \geq 8, \text{major} = \text{Physics})\) as histograms. Note the difference between their means. Now set up a regression whose output predicts that difference.
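
The comparison can be sketched in Python with NumPy as follows. The data model here is assumed purely for illustration (a linear GPA model in which major affects both hours and GPA, with arbitrary coefficients and no clipping of GPA to a fixed scale), and the helper names `generate` and `ols` are hypothetical. Histogram counts are computed with `np.histogram`; plotting them is left to whatever plotting tool you prefer.

```python
import numpy as np


def generate(n=20_000, seed=0):
    """Fake data under an assumed model: major -> hours, major -> GPA, hours -> GPA."""
    rng = np.random.default_rng(seed)
    math_major = rng.binomial(1, 0.5, n).astype(float)  # 1 = Math, 0 = Physics
    hours = rng.normal(7.5 - 0.5 * math_major, 1.5)
    gpa = 2.0 + 0.1 * hours + 0.2 * math_major + rng.normal(0.0, 0.3, n)
    return math_major, hours, gpa


def ols(X, y):
    """Least-squares coefficients for y ~ X, with an intercept prepended."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # coef[0] is the intercept


math_major, hours, gpa = generate()

# Conditioning: keep only rows with hours >= 8, then split by major.
keep = hours >= 8
gpa_math = gpa[keep & (math_major == 1)]
gpa_phys = gpa[keep & (math_major == 0)]
diff_means = gpa_math.mean() - gpa_phys.mean()

# Histogram counts for the two conditional distributions on shared bins.
bins = np.linspace(gpa[keep].min(), gpa[keep].max(), 21)
counts_math, _ = np.histogram(gpa_math, bins=bins)
counts_phys, _ = np.histogram(gpa_phys, bins=bins)

# Regression on the same subset: the slope on the major dummy
# reproduces the difference in conditional means.
coef_subset = ols(math_major[keep], gpa[keep])[1]

# Regression on all rows, controlling for hours as a linear term.
coef_controlled = ols(np.column_stack([math_major, hours]), gpa)[1]

print(f"difference in conditional means:   {diff_means:.3f}")
print(f"GPA ~ major (hours >= 8 only):     {coef_subset:.3f}")
print(f"GPA ~ major + hours (all rows):    {coef_controlled:.3f}")
```

The subset regression matches the difference in means up to floating-point error, because ordinary least squares with an intercept and a single dummy variable fits the two group means. The full-sample regression that controls for hours should land close to the same number but not exactly on it, since under this assumed model the distribution of hours above 8 differs slightly between majors.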