SML201: Introduction to Data Science

Spring 2019

Course staff


Course description

Introduction to Data Science provides a practical introduction to the burgeoning field of data science. The course introduces students to the essential tools for conducting data-driven research, including the fundamentals of programming techniques and the essentials of statistics. Students will work with real-world datasets from various domains; write computer code to manipulate, explore, and analyze data; use basic techniques from statistics and machine learning to analyze data; learn to draw conclusions using sound statistical reasoning; and produce scientific reports. No prior knowledge of programming or statistics is required.

Course assignments

Projects
Project 1: Auditing the COMPAS score (10%). Due: Monday April 1 at 11PM (late submission with no penalty is allowed until Friday April 5 at 11PM).
Project 2: Cancer and Microarray Data (10%). Due: Monday April 15 at 11PM (late submission with no penalty is allowed until Tuesday April 23 at 11PM).
Project 3: Risk Prediction for ICU Patients (15%). Due: Tuesday May 14 at 5PM (firm deadline).

Precept Problem Sets
Week of Feb 11: Precept 1: Intro, functions, and dataframes (Rmd source). No take-home component. Solutions: html (Rmd source). Worth: 2%
Week of Feb 18: Precept 2: dplyr and intro to ggplot (Rmd source). check2.R, submission2.R, instructions for using check2.R, Part 1 and Part 2. Solutions: p2_soln.R. Worth: 2% in-precept, 1% complete-at-home (Due Monday Feb 25 at 11PM).
Week of Feb 25: Precept 3: predictive modelling (Rmd source). You must pass check3.R to earn grades for the problem set. Worth: 2% in-precept, 1% complete-at-home (Due Monday March 4 at 11PM). Solutions: Part 1, Part 2, Part 3. Solutions code
Week of March 4: Precept 4: predictive modelling with logistic regression, review, ggplot (Rmd source). Worth: 2% in-precept. Solutions.
Week of March 11: Precept 5: Overfitting and an introduction to sampling (Rmd source). Worth: 2% in-precept. Solutions (Rmd source). Video solutions: Part 1, Part 2, Part 3
Week of March 25: Precept 6: replicate and histograms (Rmd source). Worth: 2% in-precept. Solutions: html, Rmd
Week of April 1: Precept 7: cumulative probability (Rmd source). Worth: 2% in-precept. Solutions: html, Rmd
Week of April 8: Precept 8: P-values I (Rmd source). Worth: 2% in-precept. Solutions: html, Rmd
Week of April 15: Precept 9: P-values II (Rmd source). Worth: 2% in-precept. Solutions: html, Rmd
Week of April 22: No new precept assignment. Students can come in to make up past in-precept problem sets.
Week of April 29: Precept 10: Regression and Inference (Rmd source). Worth: 2% in-precept. Solutions: html, Rmd. Just for fun: some of the outliers you'll see are phylogenetically close.

Tests

Midterm test (15%): Tuesday March 12, in class. Reference sheet. Study problems. Solutions (Rmd source)
End-of-term test (15%): Thursday April 25, in class. Reference sheet. Study problems. Solutions (Rmd source)

Extra Credit

Class-wide DataViz contest (submissions due April 28, on Blackboard)

Midterm 1 Make-Up Assignment on DataCamp: details here (completion due May 1 on DataCamp; please complete earlier if you need to catch up)

Logistics

Class meetings
The morning section meets in East Pyne 010 on Tuesdays and Thursdays, 11:00am-12:20pm. The afternoon section meets in Aaron Burr 219 on Tuesdays and Thursdays, 3:00pm-4:20pm.
Instructor office hours
Wednesday 2:30pm-3:30pm, Friday 12:00pm-1:00pm in CSML 202. Or email for an appointment. Or drop by to see if I'm in. Feel free to chat with me after lecture.
Preceptor office hours
Preceptors are available during scheduled office hours or by appointment.

Course information

Evaluation
35%: Projects
30%: Precept problems + follow-up problem sets on selected weeks
30%: Tests
5%: iClicker quizzes

Resources

Software

Please install R and RStudio as soon as possible.

Reading

Textbooks
R for Data Science by Garrett Grolemund and Hadley Wickham. (Free e-book available from the book website)
Data Visualization: A Practical Introduction by Kieran Healy. (Free e-book draft available from the book website)
OpenIntro Statistics 3rd Edition by David Diez and Mine Cetinkaya-Rundel. (Free e-book available from the book website)
SML201 students have access to online DataCamp courses for free, courtesy of DataCamp. See the course Piazza for details on how to sign up.

An inclusive environment

We strive to build and maintain an inclusive environment in class, one that allows every student to reach their full potential. Please do not hesitate to contact me and/or your preceptor if you need special accommodation or have any concerns.

Design credit: CS229, Jan 2019.