SML201: Introduction to Data Science (Calendar)

Calendar

Week	Lectures	Reading and Materials
Week 1	Welcome to SML201. Lecture 1 (Rmd source): evaluating R expressions, printing to the console, variables, conditionals, functions. Lecture 2 (Rmd source): comments, syntax, vectors, indexing vectors, operating on logical values, parallel vectors, intro to data frames.	Reading: DataCamp's Intro to R, Ch. 1, 2, 5. Just for fun: Physician salary data.
Week 2	Lecture 1 (Rmd source): wrangling data with dplyr (pipes, group_by, summarize). Lecture 2 (Rmd source): review of dplyr (on the board), n(), per capita statistics, an aside on rounding, sapply, grep, and an application to the OKCupid dataset.	Reading: R for Data Science, Ch. 5. Just for fun: the history of grep.
Week 3	Lecture 1 (Rmd source): selecting columns, more on sapply. Lecture 2 (Rmd source). Tidy data (Rmd source). Intro to DataViz. On the board: plotting x on a log scale.	Reading: continue reading R for Data Science, Ch. 5. Data Visualization, Ch. 1-3 (focus on Ch. 1 and Ch. 3; not everything in Ch. 2 will be discussed).
Week 4	Lecture 1: a complicated example with dplyr, more DataViz with ggplot (Rmd source), intro to predictive modelling (slides). Predictive modelling with linear regression (Rmd source). Lecture 2: Intro to predictive modelling, cont'd. Predictive modelling with logistic regression (Rmd source). Download titanic.csv	Reading: OpenIntro Statistics, Ch. 7.1-7.2.1, 7.2.3-7.2.5, 7.2.7. Ch. 8.1-8.1.2. OpenIntro Statistics, Ch. 8.4-8.4.3. Continue reading Data Visualization, Ch. 3. Just for fun: the Titanic may have been atypical in following the "women and children first" policy. Typically, more men than women survive. See M. Elinder and O. Erixson, Gender, social norms, and survival in maritime disasters, PNAS vol. 109 no. 33, 2012.
Week 5	Lecture 1 (Rmd source), continued ggplot lecture (Rmd source), bar charts and histograms in ggplot (Rmd source). Lecture 2 (Rmd source): review of string literals, sample, a quick intro to pseudo-random numbers, overfitting, the training/test/validation split. A very brief intro to model selection.	Reading: continue reading Data Visualization, Ch. 3 and Ch. 4. Continue reading OpenIntro Statistics, Ch. 7.1-7.2.1, 7.2.3-7.2.5, 7.2.7. Ch. 8.1-8.1.2. Ch. 8.4-8.4.3. Just for fun: the Dennis the Dentist study. Just for fun: Pseudo-random numbers.
Week 6	Lecture: Cross-Validation, Test/Train/Validation split, variable selection. Interpreting regression coefficients, intro to association vs. causation. Fairness in Machine Learning.	Reading: Julia Dressel and Hany Farid, The accuracy, fairness, and limits of predicting recidivism. Sam Corbett-Davies and Sharad Goel, The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning (more technical). Margaret Mitchell et al., Model Cards for Model Reporting.
Week 7	Lecture 1 (code, Rmd source): Probability, probability mass functions; cumulative mass functions. Lecture 2: Mid-semester discussion, training/test/validation splits: review. Project overview. Review of the Precept 5 solutions.	Reading: OpenIntro Statistics Ch. 2, with emphasis on 2.1, 2.3, 2.4, 2.5.
Week 8	Lecture 1 (Rmd source): review of probability. Cumulative probability. Lecture 2: Intro to p-values and the normal distribution (Rmd source). More p-values (Rmd source).	Reading: OpenIntro Statistics Ch. 2 (review). Ch. 3.1.1, 3.1.5, 3.3.1, 3.4.1-3.4.2, 4.1-4.3. Note: the reading goes into more detail than is needed for understanding the lecture. Use the reading if you're interested in learning more. Just for fun: Seasonality of births in schizophrenia. College athletics and month of birth. Just for fun: How race and religions match in online dating + bonus astrology content. Just for fun: horoscope.com
Week 9	Lecture 1: Fairness recap, Intro to Project 2, p-values. Lecture 2: Normal approximations and the t-test (Rmd). Reading CSV files (salaries.csv, salaries1.csv).	Reading: OpenIntro Statistics Ch. 2, with emphasis on 2.1, 2.3, 2.4, 2.5. Reading: Julia Dressel and Hany Farid, The accuracy, fairness, and limits of predicting recidivism (repeat). Reading: Laura Hoopes, Genetic Diagnosis: DNA Microarrays and Cancer. Just for fun: FiveThirtyEight's p-value video. Just for fun: Data Scientist: the sexiest job of the 21st century at the Harvard Business Review.
Week 10	Lecture 1 (Rmd source): p-values review. Lecture 2: Hypothesis testing. Code: Rmd, html	Reading: OpenIntro Statistics 4.3.4, 5.1-5.2.1. Reading: the American Statistical Association's statement on p-values. Unilever statement on Q-tips: "People may use [Q-tips] for ear cleaning, but we instruct against it," said Stanton of Unilever. Andrew Gelman: "I've never in my professional life made a Type I error or a Type II error". Just for fun: the Amazon reviews of A Million Random Digits with 100,000 Normal Deviates (the review mentioned in class). Just for fun: the Replication Crisis. Just for fun: the scandal around the Stanford Prison Experiment. Just for fun: Psychology journal bans P values (N.B., this did not catch on.)
Week 11	Lecture 1: Hypothesis testing recap, Confidence Intervals	Reading: OpenIntro Statistics Ch. 4.1-4.3
Week 12	Lecture 1: Inference with linear regression, Rmd, html. Linear regression, Rmd, html. Lecture 2: Introduction to artificial neural networks, looking ahead.	Reading: OpenIntro Statistics Ch. 5, Ch. 6.1-6.2, Ch. 7. Reading (books on machine learning): Neural Networks and Deep Learning by Michael Nielsen. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning. Neural network success stories: Mastering the game of Go without human knowledge, Human-level control through deep reinforcement learning. Just for fun: the room draw analysis. Just for fun: are you better at object recognition than a neural network?