Lecture calendar

Lectures | Reading and Materials
Week 1

Intro

Probability review

Maximum likelihood, R code
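
Sketch (illustrative Python companion to the R code; the coin-flip data are made up): maximum likelihood for a Bernoulli model, numerically and in closed form.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # toy data: 1 = heads, 0 = tails (not the course dataset)
    flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

    def neg_log_lik(p):
        # negative Bernoulli log-likelihood under heads-probability p
        return -np.sum(flips * np.log(p) + (1 - flips) * np.log(1 - p))

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x)         # numeric MLE
    print(flips.mean())  # closed form: the MLE is the sample proportion of heads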

Bayesian inference intro (to be continued), R code
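
Sketch (illustrative Python; the uniform Beta(1, 1) prior and the toy counts are my own choices): a conjugate Beta-Binomial update, a standard first example of Bayesian inference.

    from scipy import stats

    heads, tails = 7, 3          # toy coin-flip counts
    prior_a, prior_b = 1, 1      # Beta(1, 1) = uniform prior on p

    # conjugacy: Beta prior + Binomial likelihood -> Beta posterior
    post = stats.beta(prior_a + heads, prior_b + tails)
    print(post.mean())           # posterior mean of p
    print(post.interval(0.95))   # central 95% credible interval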

Reading: Review conditional probability. Shalizi Ch. 5

Studies mentioned in class: sex ratio at birth. Cohen et al., Insult, Aggression, and the Southern Culture of Honor: An "Experimental Ethnography"

Week 2

Bayesian inference intro (cont'd), R code

Introduction to statistical inference, R code

The Truth About Linear Regression

Reading: Shalizi Ch. 2 and Ch. 5

Reading: When to control for lurking variables? Harvard admissions, the gender wage gap

Video: Bayesian Inference about Unicorns

Reading: OpenIntro Statistics 4.3.4, 5.1-5.2.1

Reading: the American Statistical Association's statement on p-values. Unilever statement on Q-tips: "People may use [Q-tips] for ear cleaning, but we instruct against it," said Stanton of Unilever. Andrew Gelman: "I've never in my professional life made a Type I error or a Type II error"

Just for fun: the Replication Crisis

Just for fun: the scandal around the Stanford Prison Experiment

Just for fun: Psychology journal bans P values (N.B., this did not catch on.)

Just for fun: the Princeton connection to Darwin's finches

Week 3

Hierarchical models

Hierarchical models case study: restaurant chains
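
Sketch (illustrative Python; toy data and hand-set variances, not the case study's numbers): the partial-pooling idea behind hierarchical models, where small groups are shrunk toward the grand mean.

    import numpy as np

    # toy setting: a few restaurant locations ("groups") with unequal sample sizes
    rng = np.random.default_rng(0)
    sizes, true_means = [3, 10, 50], [4.0, 3.5, 4.2]
    data = [rng.normal(m, 1.0, n) for m, n in zip(true_means, sizes)]

    sigma2, tau2 = 1.0, 0.25     # assumed within- and between-group variances
    grand = np.mean(np.concatenate(data))

    for y in data:
        n = len(y)
        # normal-normal shrinkage: small n -> estimate pulled toward the grand mean
        w = (n / sigma2) / (n / sigma2 + 1 / tau2)
        print(n, round(w * y.mean() + (1 - w) * grand, 3))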

Causal inference

Code (radon): multi.Rmd (html). Data: srrs2.dat

Code (finches): finches.Rmd (html)

Reading: Shalizi, Chapters 1-3. Shalizi & Gelman, Philosophy and the Practice of Bayesian Inference. Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Ch. 12.

Reading: Shalizi, Chapters 21-24.

Week 4

Causal inference (cont'd).

Code: fake_shaq.R

Code: polls.R (polls.dta)

Linear classifiers. logreg2d.html (Rmd)
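
Sketch (illustrative Python with scikit-learn; the two 2-D point clouds are made up, in the spirit of the logreg2d demo): fitting a logistic regression classifier in two dimensions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # two toy 2-D point clouds as the two classes
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    clf = LogisticRegression().fit(X, y)
    print(clf.coef_, clf.intercept_)        # the decision boundary is w . x + b = 0
    print(clf.predict_proba([[1.0, 1.0]]))  # class probabilities for a new point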

Precept: Precept 4 (Rmd)

Reading: Shalizi, Chapters 21-24.

Just for fun: the decision in SFFA vs. Harvard

Just for fun: Basketball skills and height, conditioned on being in the NBA

Week 5

Intro to k-Nearest Neighbors
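
Sketch (illustrative Python; toy points): k-Nearest Neighbors in its entirety is a distance computation plus a majority vote.

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        # majority vote among the k nearest training points (Euclidean distance)
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        return np.bincount(nearest).argmax()

    X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y_train = np.array([0, 0, 0, 1, 1, 1])
    print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> 1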

Logistic Regression on high-dimensional datasets

Word embeddings
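
Sketch (illustrative Python; the 4-dimensional vectors are made up, whereas real embeddings are learned from corpora and have hundreds of dimensions): comparing word embeddings by cosine similarity.

    import numpy as np

    def cosine(u, v):
        # cosine similarity: direction of the vectors, ignoring their length
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    emb = {
        "cat": np.array([0.9, 0.1, 0.3, 0.0]),
        "dog": np.array([0.8, 0.2, 0.4, 0.1]),
        "car": np.array([0.0, 0.9, 0.1, 0.8]),
    }
    print(cosine(emb["cat"], emb["dog"]))  # higher: related words
    print(cosine(emb["cat"], emb["car"]))  # lower: unrelated words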

Precept: asbestos_causal.R, heart_causal_handout.R

Reading: Sen and Wasow, Race as a Bundle of Sticks: Designs that Estimate Effects of Seemingly Immutable Characteristics

Reading: CIML Ch. 3, CIML Ch. 7

Week 6

Generative models

Handout: gaussian_cancer.R
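
Sketch (illustrative Python, a 1-D toy version of the idea in gaussian_cancer.R; the data are simulated): a generative classifier fits one Gaussian per class and classifies by Bayes' rule.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x0 = rng.normal(0.0, 1.0, 100)   # class 0 measurements
    x1 = rng.normal(2.0, 1.0, 100)   # class 1 measurements

    g0 = stats.norm(x0.mean(), x0.std())
    g1 = stats.norm(x1.mean(), x1.std())
    prior0 = prior1 = 0.5            # assumed equal class priors

    def posterior1(x):
        # Bayes' rule: p(class 1 | x) from the two fitted class densities
        p0, p1 = prior0 * g0.pdf(x), prior1 * g1.pdf(x)
        return p1 / (p0 + p1)

    print(posterior1(1.0))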

Precept: Python and word embeddings

Reading: CIML Ch. 9

Week 7

Intro to neural networks

ImageNet demo

Demo: AlexNet

Gradient descent
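
Sketch (illustrative Python; the quadratic objective is my own example): gradient descent is nothing more than repeatedly stepping against the gradient.

    # minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2(w - 3)
    w, lr = 0.0, 0.1
    for step in range(50):
        grad = 2 * (w - 3)
        w -= lr * grad       # the whole algorithm: step downhill
    print(w)                 # close to the minimizer, 3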

Overfitting

Training neural networks

Intro to Convolutional Networks

Precept: nb.html

Reading: CIML, Ch. 10

Paper: Caliskan et al., Semantics derived automatically from language corpora contain human-like biases, Science 356, (2017).

Paper: Greenwald et al., Measuring Individual Differences in Implicit Cognition: The Implicit Association Test, J. of Personality and Social Psychology Vol. 74, No. 6 (1998). Project implicit at Harvard. Can We Really Measure Implicit Bias? Maybe Not in the Chronicle of Higher Education.

Paper (seminar on Friday): Spirling and Rodriguez, Word Embeddings: What works, what doesn’t, and how to tell the difference for applied research

Week 8

Neural networks handout

Intro to Convolutional Networks (cont'd)

Precept: Intro to NumPy (prez.jpg), Gradient Descent, Intro to PyTorch (solution: linear regression with PyTorch)
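
Sketch (illustrative Python; simulated data, not the precept's solution code): linear regression in PyTorch with autograd and SGD.

    import torch

    # noisy data from y = 2x + 1
    x = torch.linspace(0, 1, 100).unsqueeze(1)
    y = 2 * x + 1 + 0.1 * torch.randn_like(x)

    model = torch.nn.Linear(1, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    for _ in range(500):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()      # autograd computes the gradients
        opt.step()           # SGD updates the weight and bias

    print(model.weight.item(), model.bias.item())  # near 2 and 1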

Paper (advanced): Belkin, Reconciling modern machine-learning practice and the classical bias–variance trade-off, PNAS 116(32) (2019)

Paper (mentioned briefly): Dissecting racial bias in an algorithm used to manage the health of populations, Science Vol. 366, Issue 6464, pp. 447-453 (2019)

Reading: CIML, Ch. 10 (continue)

Reading (reference): SciPy Lecture Notes

Reading (reference): Deep Learning with PyTorch: A 60 Minute Blitz

Week 9

Recap: Training machine learning models with gradient descent

Overfitting (cont'd)

Maximum Likelihood with PyTorch

Classifying digits with PyTorch (mnist_all.mat)

For fun: The Bee Gees' How Deep is Your Love in PyTorch (source code)

Precept: Fitting neural networks in PyTorch (some solutions)

Reading: CIML, Ch. 10 (continue)

Reading: Deep Learning with PyTorch: A 60 Minute Blitz (continue)

Week 10

Transfer Learning and Unsupervised Learning
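
Sketch (illustrative Python with torchvision; the 10-class task is hypothetical): the standard transfer-learning recipe of freezing a pretrained network and training a new head.

    import torch
    import torchvision

    # ImageNet-pretrained ResNet-18, with its feature layers frozen
    model = torchvision.models.resnet18(weights="DEFAULT")
    for p in model.parameters():
        p.requires_grad = False

    # replace the final layer with a fresh head for a 10-class task
    model.fc = torch.nn.Linear(model.fc.in_features, 10)

    # only the new head's parameters are updated during training
    opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)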

Intro to RNNs — generating language
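
Sketch (illustrative PyTorch; a tiny vocabulary and untrained weights, just to show the shapes): a character-level RNN maps a sequence of character indices to one next-character distribution per position.

    import torch
    import torch.nn as nn

    vocab = sorted(set("hello world"))
    stoi = {ch: i for i, ch in enumerate(vocab)}

    class CharRNN(nn.Module):
        def __init__(self, vocab_size, hidden=32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.rnn = nn.RNN(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, vocab_size)

        def forward(self, idx, h=None):
            out, h = self.rnn(self.embed(idx), h)
            return self.head(out), h    # logits over the next character

    model = CharRNN(len(vocab))
    x = torch.tensor([[stoi[c] for c in "hello"]])
    logits, _ = model(x)
    print(logits.shape)  # (1, 5, vocab_size): one prediction per position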

Machine Translation with RNNs

RNN worksheet

Precept: Mini-Project 3

Reading: cs231n notes on transfer learning

Reading: Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks (2015)

Just for fun: Multilingual Neural Machine Translation and "Machine Interlingua"

Week 11

Fairness in Machine Learning

Fun with ConvNets + transfer learning

Presentations

Reading: The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. NIPS 2017 tutorial on Fairness in Machine Learning (slides, video). The analysis by Corbett-Davies et al. of the COMPAS dataset in a WaPo blog post.

Papers mentioned: Farid and Dressel, The accuracy, fairness, and limits of predicting recidivism. Mitchell et al., Model Cards for Model Reporting.

Paper on monkey brains and transfer learning: Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition

Papers: Neural Style Transfer from the Bethge lab at the University of Tübingen.

Just for fun: Deep Dream grocery trip

Just for fun: the Matthew Effect

Week 12

To what extent is published research reproducible?

Presentations

Papers on reproducible science and false discovery rates.

Just for fun: was the Stanford prison experiment a real experiment?

Python practicals calendar

Practicals | Homework
Week 1

Intro to Python (problems)

Booleans (problems)

Lists and loops (problems)

Functions (problems)

W1 exercises

Please submit exercises 4.1, 4.2, and 5.2 before the next precept. Other exercises are recommended, but don't need to be submitted.

Week 2

Operating on lists and strings

Types in Python

Parallel lists (problems)
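
Sketch (illustrative; the names and scores are made up): two parallel lists share an index, and zip walks them together.

    names = ["Ada", "Grace", "Alan"]    # names[i] and scores[i] describe
    scores = [93, 88, 95]               # the same student

    for name, score in zip(names, scores):
        print(name, score)

    # equivalently, by shared index:
    for i in range(len(names)):
        print(names[i], scores[i])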

W2 exercises

Please submit exercises 6.3, 6.4(a), 6.4(b), 7.1, and 7.5 before the next precept. Other exercises are recommended, but don't need to be submitted. Exercises 7.1-7.5 are probably easier than 6.3 and 6.4, so you might like to start with them.

Week 3

Dictionaries (problems)
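
Sketch (illustrative; the sentence is made up): the classic dictionary exercise of counting word frequencies.

    text = "the cat sat on the mat the end"
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    print(counts)          # {'the': 3, 'cat': 1, ...}
    print(counts["the"])   # look up a single key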

Nothing to submit, but completing the problems is strongly recommended if you weren't already fluent in Python. We may ask you to complete the problems if you need substantial help with Project 2.