The UofT Data Science Group (UTDS)

Email guerzhoy@cs for details on meetings in January 2017

Facebook group

Chat with other people in the FB group. The idea is to let everyone know which features and learning methods you're using. For the submission, we could potentially combine everyone's work.

GitHub repo

Our GitHub repo is here. If you're using the code, please join the UTDS team. Please email Michael (guerzhoy@cs.toronto.edu) so he can send you an invite.

Meetings

2016 meetings will be on Fridays. We are working on the Home Depot dataset. See the Facebook group for meeting announcements.

Meetings are Fridays at 5 in BA4261 (behind the elevators near the CS Undergrad Office). The Tuesday 5:30 slot no longer applies.

Initial meeting: Friday Oct. 9 at 3:30 in BA5256

Minutes

Tuesdays or Fridays work for Michael, but not necessarily for other people. We'll try to figure this out soon.
Problem to work on: the Rossmann Store Sales problem looks good
The plan is to work as a team and submit as a team. That way, we can share code and ideas
Recommended software: the Anaconda distribution contains most of what you'd need for working in Data Science with Python. I'll also try to set up Caffe for people who want to do neural networks
My rossmann.py file reads in the data and runs a regularized (ridge) linear regression on it using Pandas and Scikit-Learn (NOTE: this has been superseded by new and better versions posted on the FB group)
Discussion at the meeting: given the evaluation metric (where what's important is the ratio of the prediction to the actual value), it makes sense to predict the log-sales rather than the sales
More discussion: it seems like there are two clusters in the data on log-sales (you can't quite see it in the histogram without taking the log): 0 sales, and log-sales centered at a certain mean. Can we run logistic regression to distinguish the two? (Or is it even as complicated as that? Maybe 0 sales just means it's a holiday.)
How do sales depend on the day of the week?
Idea: run decision tree regression
General discussion at the end of the meeting: linear regression, ridge regression, logistic regression, cross validation
Recommended course for beginners: CS 109 at Harvard
Another reasonable reference for beginners: Data Science for Business
Other good advanced texts: The Elements of Statistical Learning (the pdf is free on the web from the book homepage) and Information Theory, Inference, and Learning Algorithms (the pdf is free on the web from the book homepage).
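
The log-sales idea from the meeting discussion can be tried out on made-up data. The sketch below is not the group's rossmann.py. It assumes the metric is a root mean squared percentage error (RMSPE) that skips days with zero sales, and the synthetic data, the helper names rmspe, make_synthetic_sales and cv_rmspe, and the value of alpha are all illustrative assumptions. It fits a ridge regression once on raw sales and once on log-sales, and scores both with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def rmspe(y_true, y_pred):
    """Root mean squared percentage error, skipping rows with zero actual sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true > 0
    rel_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(rel_err ** 2)))


def make_synthetic_sales(n=2000, seed=0):
    """Made-up data: sales depend multiplicatively on day of week and a promo flag."""
    rng = np.random.default_rng(seed)
    day = rng.integers(0, 7, size=n)
    promo = rng.integers(0, 2, size=n)
    day_effect = np.linspace(-0.4, 0.4, 7)
    log_sales = 8.5 + day_effect[day] + 0.4 * promo + rng.normal(0.0, 0.2, size=n)
    X = np.column_stack([np.eye(7)[day], promo])  # one-hot day of week + promo
    return X, np.exp(log_sales)


def cv_rmspe(X, sales, log_target, alpha=1.0, n_splits=5):
    """Mean RMSPE of ridge regression over K-fold cross-validation."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in folds.split(X):
        y_train = np.log(sales[train_idx]) if log_target else sales[train_idx]
        model = Ridge(alpha=alpha).fit(X[train_idx], y_train)
        pred = model.predict(X[test_idx])
        if log_target:
            pred = np.exp(pred)  # map log-sales predictions back to sales
        scores.append(rmspe(sales[test_idx], pred))
    return float(np.mean(scores))


X, sales = make_synthetic_sales()
print("RMSPE, raw-sales target:", cv_rmspe(X, sales, log_target=False))
print("RMSPE, log-sales target:", cv_rmspe(X, sales, log_target=True))
```

On real data the zero-sales days would need separate handling before taking the log, and Ridge could be swapped for sklearn.tree.DecisionTreeRegressor to try the decision tree idea with the same scoring loop.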

Resources

scikit-learn (Available on CDF if you run /local/packages/anaconda3/bin/python)
Weka
LIBSVM

About

Data Science is a new field devoted to the analysis of large datasets. Everything from Google's cat detection algorithms to Nate Silver's election predictions can be considered applications of Data Science. As petabytes of data become available, the opportunities for data scientists are growing.

The goal of UTDS is to create an environment where people interested in Data Science can get their feet wet in the field, improve their skills, share knowledge, network, and build their resumes with the goal of working professionally in Data Science. We will have internal Data Science competitions, tutorials, and informal discussion groups. We will also combine forces to compete in external competitions on Kaggle (which has cash prizes).

The Organizer

Michael Guerzhoy is a Lecturer in the Department of Computer Science. When he's not teaching, he works in machine learning, Bayesian statistics, and computer vision. Michael is the winner of the Canadian Conference on Artificial Intelligence Best Paper Award for 2014 and holds three patents in the fields of machine learning and computer vision. Michael worked in R&D in computer vision and machine learning. Michael occasionally works as a statistical and data science consultant; he has consulted for CBC as well as for several researchers at UofT.

Meeting Format

We will work together on one or several Kaggle challenges. The meetings will be a mix of brainstorming and tutorials by Michael and other members of UTDS.

Who Can Participate?

Everyone is welcome, space permitting. If you can code well and know a little probability theory, you can start learning, although the learning curve might be steep. If you know more, all the better!

What About CSC321, CSC411, STA412, STA414, CSC2515, etc.?

They're great! You will benefit from having taken those if you want to participate in the group, but all that's required to get started is decent coding skills, a little knowledge of probability theory, and a desire to learn. At UTDS we will focus on the practical side of things, while the machine learning courses focus more on the theoretical side (though of course there's always a mix of theory and practice).

I'm sold. Where do I sign up?

Please fill out a survey and include your email address to be notified about meeting schedules.