SML201: Introduction to Data Science

Spring 2020

Course team


Course description

Introduction to Data Science provides a practical introduction to the burgeoning field of data science. The course introduces students to the essential tools for conducting data-driven research, including the fundamentals of programming techniques and the essentials of statistics. Students will work with real-world datasets from various domains; write computer code to manipulate, explore, and analyze data; use basic techniques from statistics and machine learning to analyze data; learn to draw conclusions using sound statistical reasoning; and produce scientific reports. No prior knowledge of programming or statistics is required.

Course assignments

Problem sets
Problem set 1: Vectors and Data Frames (3%). Due: Feb. 24, 9 p.m. (originally Feb. 21). Solutions.
Problem set 2: Wrangling Data Frames and Linear Regression (3%). Due: March 5, 9 p.m. (originally March 2). Solutions.
Projects (topics are tentative)
Project 1: Auditing the COMPAS score (11%). Due: April 6, 9 p.m. (originally March 30). Solutions (Rmd)
Project 2: Cancer and Gene Expression Data (12%). Due: April 20, 9 p.m. (originally April 13).
Project 3 (ICU option): Risk Prediction for ICU Patients (12%). Due: May 12, 5 p.m. (originally May 11).
Project 3 (COVID option): SARS-CoV-2 and the SIR model (12%). Due: May 12, 5 p.m. (originally May 11).

Every student has a total of 6 grace days they can use throughout the term (except for Project 3) to avoid a lateness penalty of 10% per 24 hours, rounded up to the nearest whole number of days. You cannot use more than three grace days at a time.
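As a rough illustration of the lateness rule above, the following Python sketch computes the penalty for a single submission. It rests on assumptions not stated in the policy: the function name `lateness_penalty` is hypothetical, the penalty is read as percentage points per late day, each grace day is read as cancelling one late day, and "three at a time" is read as three per submission. The term-wide budget of 6 grace days and the Project 3 exception are not modeled.

```python
import math

# Assumption: "no more than three grace days at a time" means per submission.
MAX_GRACE_DAYS_PER_SUBMISSION = 3
PENALTY_PER_DAY = 10  # percent, taken from the policy text


def lateness_penalty(hours_late, grace_days_used=0):
    """Return the penalty (in percent) for a submission that is hours_late late.

    Late time is rounded up to whole days, and each grace day is assumed
    to cancel one late day.
    """
    if grace_days_used > MAX_GRACE_DAYS_PER_SUBMISSION:
        raise ValueError("at most three grace days may be used at a time")
    days_late = math.ceil(hours_late / 24) if hours_late > 0 else 0
    penalized_days = max(0, days_late - grace_days_used)
    # Assumption: the penalty cannot exceed the full value of the assignment.
    return min(100, PENALTY_PER_DAY * penalized_days)
```

Under these assumptions, a submission 25 hours late counts as two days late, so `lateness_penalty(25)` gives 20, and using two grace days brings the penalty to 0.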

Tests
Term Test 1: Tuesday March 24 (originally Thursday March 12)
Term Test 2: Saturday April 25, 1pm-5:30pm.

Precept Assignments
Week of Feb. 3: Precept 1: Intro, functions (Rmd source). Solutions (Rmd source)
Week of Feb. 10: Precept 2: Vectors and Data Frames (Rmd source). Solutions. Q4 video solutions.
Week of Feb. 17: Precept 3: Wrangling Data Frames (Rmd source). Solutions. Video solutions: Q1, Q2, Q3, Q4
Week of Feb. 24: Precept 4: Linear Regression and sapply (Rmd source). Solutions (Rmd source). Video solutions: Q1, Q2, Q3.
Week of Mar. 2: Precept 5: Logistic regression and ggplot (Rmd source). Solutions (Rmd source). Video solutions: Q1, Q2, Q2b-end, Q3, Q4, Q5.
Week of Mar. 9: No new assignment, but course staff available to answer questions during precept time.
Week of Mar. 23: Precept 6: Overfitting, a preview of tidy data (Rmd source). Solutions for Q1 (Rmd source). Solutions (Rmd source). Video solutions: Problems 1 and 2.
Week of Mar. 30: Precept 7: R Markdown, tidy data, and fairness criteria (Rmd source). Solutions (Rmd source).
Week of Apr. 6: Precept 8: Cumulative probability and p-values (Rmd source). Solutions (Rmd source). Video solutions.
Week of Apr. 13: Precept 9: P-values II (Rmd source). Solutions (Rmd source). Video solutions
Week of Apr. 20: Precept 10: P-values review (Rmd source). Solutions (Rmd source). Video solutions: Q1, Q2-Q4, Q5
Week of Apr. 27: Precept 11: Linear Regression (Rmd source). Solutions (Rmd source).

Logistics

Class meetings

The morning section meets on Zoom (originally McCormick Hall 101) on Tues 11:00am-12:20pm and Thurs 11:00am-12:20pm Eastern Time.

The afternoon section meets on Zoom (originally Robertson Hall 001) on Tues 3:00pm-4:20pm and Thurs 3:00pm-4:20pm Eastern Time.

Precept meetings

See here for precept logistics/assignments and here for links.

Instructor office hours
Mondays and Fridays 2:30-3:30, or email for an appointment. See the link below for Zoom links.
Preceptor office hours
Preceptors will be available during scheduled office hours or by appointment.

Course information

Evaluation
35%: Projects
6%: Problem Sets
32%: Tests (best of 10% Test 1 and 22% Test 2 or 22% Test 1 and 10% Test 2)
5%: iClicker quizzes + lecture viewing declarations
22%: In-precept assignments

Resources

Software

Please install R and RStudio as soon as possible.

Reading

Textbooks
Statistical Thinking for the 21st Century by Russell A. Poldrack (Free e-book available from the book website)
Data Visualization: A practical introduction by Kieran Healy (Free e-book draft available from the book website)
R for Data Science by Garrett Grolemund and Hadley Wickham. (Free e-book available from the book website)
SML201 students will have access to online DataCamp courses for free, courtesy of DataCamp. See the course Piazza for details on how to sign up.

An inclusive environment

We strive to build and maintain an inclusive environment in class, one that allows every student to reach their full potential. Please do not hesitate to contact me or your preceptor if you need special accommodation or have any concerns.

Design credit: CS229, Jan 2019.