Lectures: Reading and Materials
Week 1

Welcome to SML201

Lecture 1 (Rmd source): evaluating R expressions, printing to the console, variables, conditionals, functions.

Lecture 2 (Rmd source): comments, syntax, vectors, indexing vectors, operating on logical values, parallel vectors, intro to data frames.
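
Lecture 2's vector topics can be sketched in a few lines of base R (a minimal illustration, not taken from the lecture notes; the values are made up):

```r
# A numeric vector, positional indexing, and logical indexing
x <- c(4, 8, 15, 16, 23, 42)
x[2]           # second element: 8
x[x > 10]      # the elements greater than 10
big <- x > 10  # a parallel logical vector (TRUE/FALSE per element)
sum(big)       # TRUE counts as 1, so this counts the elements > 10
```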

Reading: DataCamp's Intro to R, Ch. 1, 2, 5

Just for fun: Physician salary data

Week 2

Lecture 1 (Rmd source): wrangling data with dplyr (pipes, group_by, summarize)
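
The pipe/group_by/summarize pattern from Lecture 1 looks like this (a minimal sketch using the built-in mtcars data as a stand-in for the course datasets):

```r
library(dplyr)

# Per-group summaries with the pipe, group_by(), and summarize()
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg))
by_cyl  # one row per cylinder count
```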

Lecture 2 (Rmd source): review of dplyr (on the board), n(), per capita statistics, an aside on rounding, sapply, grep, and an application to the OKCupid dataset.
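
The sapply and grep pieces of Lecture 2 can be previewed with a toy character vector (the names here are illustrative, not from the OKCupid data):

```r
# sapply() applies a function to each element; grep() searches for a pattern
people <- c("alice", "bob", "carol")
lens <- sapply(people, nchar)    # character counts, named by element
grep("o", people)                # indices of names containing "o": 2 3
grep("o", people, value = TRUE)  # the matching names themselves
```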

Reading: R for Data Science, Ch. 5

Just for fun: the history of grep

Week 3

Lecture 1 (Rmd source): selecting columns, more on sapply

Lecture 2 (Rmd source): tidy data (Rmd source), intro to DataViz. On the board: plotting x on a log scale.
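
The log-scale plot from the board can be reproduced in ggplot2 in a few lines (a sketch using mtcars as a stand-in dataset):

```r
library(ggplot2)

# A scatterplot with the x-axis on a log scale via scale_x_log10()
p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  scale_x_log10()
p
```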

Reading: continue reading R for Data Science, Ch. 5. Data Visualization, Ch. 1-3 (focus on Ch. 1 and Ch. 3; not everything in Ch. 2 will be discussed).

Week 4

Lecture 1: a complicated example with dplyr, more DataViz with ggplot (Rmd source), intro to predictive modelling (slides). Predictive modelling with linear regression (Rmd source).

Lecture 2: Intro to predictive modelling, cont'd. Predictive modelling with logistic regression (Rmd source). Download titanic.csv
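
The two modelling approaches from this week can be sketched with base R's lm() and glm() (mtcars stands in here for the course datasets such as titanic.csv):

```r
# Linear regression: predict fuel economy from weight
fit_lin <- lm(mpg ~ wt, data = mtcars)
coef(fit_lin)  # intercept and slope

# Logistic regression: predict transmission type (a 0/1 outcome) from weight
fit_log <- glm(am ~ wt, data = mtcars, family = binomial)
predict(fit_log, newdata = data.frame(wt = 3), type = "response")  # a predicted probability
```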

Reading: OpenIntro Statistics, Ch. 7.1-7.2.1, 7.2.3-7.2.5, 7.2.7; Ch. 8.1-8.1.2.

OpenIntro Statistics, Ch. 8.4-8.4.3

Continue reading Data Visualization, Ch. 3

Just for fun: the Titanic may have been atypical in following the "women and children first" policy. Typically, more men than women survive. See M. Elinder and O. Erixson, Gender, social norms, and survival in maritime disasters, PNAS vol. 109 no. 33, 2012.

Week 5

Lecture 1 (Rmd source), continued ggplot lecture (Rmd source), barcharts and histograms in ggplot (Rmd source)

Lecture 2 (Rmd source): review of string literals, sample, a quick intro to pseudo-random numbers, overfitting, the training/test/validation split. A very brief intro to model selection.
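
The sample/seed/split ideas from Lecture 2 fit in a short sketch (mtcars stands in for the course data; the 80/20 split ratio is illustrative):

```r
# A reproducible train/test split using set.seed() and sample()
set.seed(201)                                  # fix the pseudo-random seed
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of the row indices, at random
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]              # the held-out 20%
```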

Reading: continue reading Data Visualization, Ch. 3 and Ch. 4

Continue reading OpenIntro Statistics, Ch. 7.1-7.2.1, 7.2.3-7.2.5, 7.2.7; Ch. 8.1-8.1.2; Ch. 8.4-8.4.3.

Just for fun: the Dennis the Dentist study.

Just for fun: Pseudo-random numbers

Week 6

Lecture: Cross-Validation, Test/Train/Validation split, variable selection. Interpreting regression coefficients, intro to association vs. causation. Fairness in Machine Learning.
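
Cross-validation can be written out by hand in a few lines (a sketch, not the course's code; mtcars and the mpg ~ wt model are stand-ins):

```r
# k-fold cross-validation for a linear model
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # a fold label for each row
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(mpg ~ wt, data = mtcars[folds != i, ])    # train on the other folds
  held_out <- mtcars[folds == i, ]
  mean((held_out$mpg - predict(fit, held_out))^2)     # error on the held-out fold
})
mean(cv_mse)  # cross-validated mean squared error
```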

Reading: Julia Dressel and Hany Farid, The accuracy, fairness, and limits of predicting recidivism. Sam Corbett-Davies and Sharad Goel, The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning (more technical). Margaret Mitchell et al., Model Cards for Model Reporting.

Week 7

Lecture 1 (code, Rmd source): probability, probability mass functions, cumulative mass functions.
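
R's d*/p* functions give the mass and cumulative mass functions directly; here is a minimal example with a binomial variable (10 fair coin flips, chosen for illustration):

```r
# Probability mass function (PMF) and cumulative mass function (CDF)
dbinom(5, size = 10, prob = 0.5)  # P(exactly 5 heads)
pbinom(5, size = 10, prob = 0.5)  # P(at most 5 heads)
sum(dbinom(0:5, 10, 0.5))         # the cumulative probability is a sum of the PMF
```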

Lecture 2: Mid-semester discussion, training/test/validation splits: review. Project overview. Review of the Precept 5 solutions.

Reading: OpenIntro Statistics Ch. 2, with emphasis on 2.1, 2.3, 2.4, 2.5.

Week 8

Lecture 1 (Rmd source): review of probability. Cumulative probability.

Lecture 2: Intro to p-values and the normal distribution (Rmd source). More p-values (Rmd source).
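
A p-value from the normal distribution is one call to pnorm (a sketch with a hypothetical observed z-score of 2):

```r
# Two-sided p-value under the standard normal null
z <- 2
2 * pnorm(-abs(z))  # P(|Z| >= 2), about 0.046
```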

Reading: OpenIntro Statistics Ch. 2 (review). Ch. 3.1.1, 3.1.5, 3.3.1, 3.4.1-3.4.2, 4.1-4.3. Note: the reading goes into more detail than is needed for understanding the lecture. Use the reading if you're interested in learning more.

Just for fun: Seasonality of births in schizophrenia. College athletics and month of birth.

Just for fun: How race and religions match in online dating + bonus astrology content

Week 9

Lecture 1: Fairness recap, Intro to Project 2, p-values.

Lecture 2: Normal approximations and the t-test (Rmd). Reading CSV files (salaries.csv, salaries1.csv).
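
The t-test from Lecture 2 is one line in R (a sketch on made-up measurements, not the salaries data):

```r
# A two-sample t-test with base R's t.test()
a <- c(5.1, 4.9, 5.3, 5.0, 5.2)
b <- c(5.6, 5.8, 5.5, 5.9, 5.7)
t.test(a, b)$p.value  # small: evidence that the group means differ
```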

Reading: OpenIntro Statistics Ch. 2, with emphasis on 2.1, 2.3, 2.4, 2.5.

Reading: Julia Dressel and Hany Farid, The accuracy, fairness, and limits of predicting recidivism (repeat)

Reading: Laura Hoopes, Genetic Diagnosis: DNA Microarrays and Cancer

Just for fun: FiveThirtyEight's p-value video

Just for fun: Data Scientist: the sexiest job of the 21st century at the Harvard Business Review.

Week 10

Lecture 1 (Rmd source): P-values review

Lecture 2: Hypothesis testing. Code: Rmd, html

Reading: OpenIntro Statistics 4.3.4, 5.1-5.2.1

Reading: the American Statistical Association's statement on p-values. Unilever statement on Q-tips: "People may use [Q-tips] for ear cleaning, but we instruct against it," said Stanton of Unilever. Andrew Gelman: "I've never in my professional life made a Type I error or a Type II error"

Just for fun: the Amazon reviews of A Million Random Digits with 100,000 Normal Deviates (the review mentioned in class)

Just for fun: the Replication Crisis

Just for fun: the scandal around the Stanford Prison Experiment

Just for fun: Psychology journal bans P values (N.B., this did not catch on.)

Week 11

Lecture 1: Hypothesis testing recap, Confidence Intervals
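
In R, a confidence interval for a mean falls out of the same t.test() machinery (a sketch on made-up numbers):

```r
# A 95% confidence interval for a mean, via the one-sample t.test()
x <- c(12, 15, 11, 14, 13, 16, 12)
t.test(x)$conf.int  # lower and upper bounds bracketing mean(x)
```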

Reading: OpenIntro Statistics Ch. 4.1-4.3

Week 12

Lecture 1: Inference with linear regression (Rmd, html). Linear regression (Rmd, html).
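
Inference for regression coefficients comes straight from summary() and confint() (a sketch with mtcars standing in for the course data):

```r
# Estimates, standard errors, t-statistics, and p-values for each coefficient
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$coefficients  # one row per coefficient, four columns
confint(fit)               # 95% confidence intervals for the coefficients
```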

Lecture 2: Introduction to artificial neural networks, looking ahead

Reading: OpenIntro Statistics Ch. 5, Ch. 6.1-6.2, Ch. 7.

Reading (books on machine learning): Neural Networks and Deep Learning by Michael Nielsen. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning.

Neural network success stories: Mastering the game of Go without human knowledge, Human-level control through deep reinforcement learning

Just for fun: the room draw analysis

Just for fun: are you better at object recognition than a neural network?