SML310 Precept 4

Problem 1 (Practice generating fake data)

Consider a scenario where each data point consists of:

Tar level
Asbestos level
Cancer (a binary variable)

Draw a causal graph for this situation.
Come up with a data model for this situation (i.e., specify the distribution of each variable).
Demonstrate what happens when you regress asbestos level on tar level, both with and without controlling for cancer. (Note: “regress Y on X” means you are predicting Y using X.)
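
One possible way to set this up is sketched below; it is not the only reasonable answer. It assumes the causal graph Tar -> Cancer <- Asbestos, with tar and asbestos independent of each other. The distributions and coefficients (standard normal levels, a logistic model for cancer with weights 2, 2 and intercept -2) and the helper name `ols` are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed data model (illustrative choices):
#   tar      ~ Normal(0, 1)   (standardized level)
#   asbestos ~ Normal(0, 1)   (standardized level, independent of tar)
#   cancer   ~ Bernoulli(sigmoid(2*tar + 2*asbestos - 2))
tar = rng.normal(size=n)
asbestos = rng.normal(size=n)
logit = 2.0 * tar + 2.0 * asbestos - 2.0
p_cancer = 1.0 / (1.0 + np.exp(-logit))
cancer = rng.binomial(1, p_cancer)

def ols(y, *predictors):
    """Least-squares fit of y on an intercept plus the given predictors.

    Returns the coefficients as [intercept, coef_1, coef_2, ...].
    """
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Regress asbestos on tar, without and with cancer as an extra predictor.
slope_no_control = ols(asbestos, tar)[1]
slope_control = ols(asbestos, tar, cancer)[1]

print("tar coefficient, not controlling for cancer:", slope_no_control)
print("tar coefficient, controlling for cancer:    ", slope_control)
```

Under these assumptions, the tar coefficient should be close to zero when cancer is left out, and clearly negative when cancer is included: among people with the same cancer status, a high tar level makes a high asbestos level less necessary to explain the outcome, so controlling for the common effect induces a dependence between its two causes.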

Problem 2 (Why “controlling” and “conditioning” are kind of the same thing)

In this problem, you will be investigating the relationship between controlling and conditioning on a variable.

A datapoint will consist of:

Hours spent studying per day
Major (one of “Math” and “Physics”; assume those are the only majors available)
GPA

Come up with a plausible model. Draw the causal graph and write down the data model equations. Generate fake data.

Using the fake data you generated, display \(P(\text{GPA}|\text{hours} \geq 8, \text{major} = \text{Math})\) and \(P(\text{GPA}|\text{hours} \geq 8, \text{major} = \text{Physics})\) as histograms. Note the difference between the means. Now set up a regression and use it to predict the difference between those means.
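
The following is a minimal sketch of one way to approach this, under an assumed model in which major affects both hours and GPA, and hours affects GPA. All numeric parameters and variable names are made up for illustration. The histograms are computed with `np.histogram` to keep the sketch dependency-free; a plotting library could be used to draw them instead.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed data model (illustrative parameters):
#   is_math ~ Bernoulli(0.5)           (1 = Math, 0 = Physics)
#   hours   ~ Normal(6 + is_math, 2)   clipped to [0, 24]
#   gpa     = 2.0 + 0.1*hours + 0.3*is_math + Normal(0, 0.2)
# For simplicity, gpa is not capped at 4.0.
is_math = rng.binomial(1, 0.5, size=n)
hours = np.clip(rng.normal(6.0 + is_math, 2.0), 0.0, 24.0)
gpa = 2.0 + 0.1 * hours + 0.3 * is_math + rng.normal(0.0, 0.2, size=n)

# Conditioning: keep only students with hours >= 8, split by major.
math_sel = (hours >= 8) & (is_math == 1)
phys_sel = (hours >= 8) & (is_math == 0)

# Histogram counts on shared bins for the two conditional distributions.
bins = np.linspace(gpa.min(), gpa.max(), 31)
math_counts, _ = np.histogram(gpa[math_sel], bins=bins)
phys_counts, _ = np.histogram(gpa[phys_sel], bins=bins)

empirical_diff = gpa[math_sel].mean() - gpa[phys_sel].mean()

# Controlling: regress GPA on hours and major using all the data.
X = np.column_stack([np.ones(n), hours, is_math])
coef, *_ = np.linalg.lstsq(X, gpa, rcond=None)
fitted = X @ coef

# Predict each conditional mean by averaging the fitted values over the
# same subgroup, then take the difference.
predicted_diff = fitted[math_sel].mean() - fitted[phys_sel].mean()

print("difference in conditional means (data):      ", empirical_diff)
print("difference in conditional means (regression):", predicted_diff)
print("coefficient on major:                        ", coef[2])
```

In this sketch the regression-based difference should closely match the one computed directly from the conditioned data. It differs slightly from the coefficient on major alone because, under the assumed model, Math and Physics students with at least 8 hours do not study the same number of hours on average.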