Advanced Data Systems (CSC2508), Fall 2019
Course Description
The maturity of several Deep Learning technologies has influenced the design of data management systems and
architectures and prompted a re-thinking of several of their design principles.
The goal of this course is two-fold. First, to present a review of the fundamental design components of
modern data management architectures, including a review of relational and NoSQL systems. Second, to
review and explore how fundamental components can be re-designed by incorporating Deep Learning principles
and techniques, and to explore the resulting (performance and system) implications. We will also review
and investigate a few novel data management application scenarios that are uniquely enabled by merging Deep Learning
and query processing technologies.
This is a graduate seminar course. There will be a combination of
presentations by the instructor and the participants. All participants
are expected to actively engage in the course, be familiar with all the
material presented and drive the discussions for the part of the course
they are responsible for. The course involves a project;
more details will be available in class.
Announcements and clarifications
Administrivia
Instructor: Nick Koudas
Lectures: BA025
Office: BA 5240
TA: TBD
Office hours: by appointment
Instructor telephone: 416 946-5819
Instructor email: my last name @ uoft cs domain
Course web page: here
Course structure
At the start of every lecture, I will ask a member of the class to summarise the main
topic that we will discuss. I would be interested to hear your thoughts on why this paper is
important and whether there is anything you would challenge in the methodology
or thesis of this paper. This is your chance to bring up any issues that demonstrate your deep understanding
of the topic.
You are expected to actively participate in the discussions for each lecture
and be fully familiar with the papers presented.
For each paper you are assigned to present, you are expected to do all the
background research and collect all suitable references. You will share your
slide deck with the class and make it available through the course shared folder, along
with all references you used.
The class folder with access to reading material (and presentations as they become available) is here
Readings
Review of relational technology (9/9)
- Background: Read Chapters 3, 4, 10, 13, 14, and 15 of Ramakrishnan's book (Database Management Systems), 3rd Edition.
- An Overview of Query Optimization in Relational Systems, S. Chaudhuri, PODS 1998
- R-Trees: A Dynamic Index Structure for Spatial Searching, A. Guttman, SIGMOD 1984
- Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses, Muller et al., VLDB 2018
Overview of NoSQL (9/16)
- M. Stonebraker, SQL databases v. NoSQL databases, in Communications of the ACM 53(4): 10-11 (2010).
- J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004.
- J. Dean and S. Ghemawat, MapReduce: A Flexible Data Processing Tool, in CACM, January 2010, pp. 72-77.
- Zaharia et al., Apache Spark: A Unified Engine for Big Data Processing, in CACM 2016.
Indexing (9/23)
- SageDB: A Learned Database System, CIDR 2019
- The Case for Learned Index Structures, SIGMOD 2018
- Considerations for Handling Updates in Learned Index Structures, aiDM 2019
- ALEX: An Updatable Adaptive Learned Index, 2019
Query Optimization (9/30)
- Towards A Hands Free Query Optimizer with Deep Learning, CIDR 2019
- Plan-Structured Deep Neural Network Models for Query Performance Prediction, VLDB 2019
- Learning State Representations for Query Optimization with Deep Reinforcement Learning, DEEM 2018
- Deep Reinforcement Learning for Join Order Enumeration, aiDM 2018
- Learning to Optimize Join Queries with Deep Reinforcement Learning
Selectivity Estimation (10/7)
- Learned Cardinalities: Estimating Correlated Joins with Deep Learning, CIDR 2019
- Multi-Attribute Selectivity Estimation Using Deep Learning
- MADE: Masked Autoencoder for Distribution Estimation, ICML 2015
Data Exploration (10/21)
- Intelligent Rollups in Multidimensional Data, VLDB 2001
- Interactive Data Exploration with Smart Drill-Down, VLDB 2015
- Neural Cubes: Deep Representations for Visual Data Exploration
- Visual Exploration of Machine Learning Results Using Data Cube Analysis, HILDA 2016
Entity Resolution (10/28)
- A Theory of Record Linkage
- Magellan: Toward Building Entity Matching Systems, VLDB 2016
- Entity Matching: How Similar is Similar, VLDB 2011
- Entity Resolution Tutorial
- Deep Learning for Entity Matching: A Design Space Exploration, SIGMOD 2018
Entity Resolution Optimizations (11/11)
Data Management for Video Streams (11/18)
RDBMS for Machine Learning (11/25)
- Scalable Linear Algebra on a Relational Database System, ICDE 2017
- Scaling Machine Learning via Compressed Linear Algebra, VLDB 2016
- SystemML: Declarative Machine Learning on Spark, VLDB 2016
- Materialization Optimizations for Feature Selection Workloads, SIGMOD 2014
Project Presentations (12/2)
Other Resources
Breakdown of marks
The course mark will be broken down into
the categories listed below, with points assigned as indicated:
Weight | Item | Minimal mark | Moderate mark | High mark |
30% | Participation | Present | Talkative | Insightful comments or questions |
20% | Presentations | Factually correct | Designed and delivered well | Effectively conveys key points, implications, etc. |
5% | Quality of feedback to peers | Focus on nitpicks and minutiae | Suggest incremental improvements | Identify structural strengths and flaws |
45% | Final project | Unambitious and/or badly planned | Partially implemented and/or poorly presented | Implemented successfully with key learning points presented |
Project proposals
The course is associated with a
project. Proposed class projects will be described by the instructor.
Feel free to discuss your ideas with the instructor and propose your
own project. However, the project you propose HAS to be associated with the material in the class.
This is very important and it is not up for discussion. The project should have a research component.
Project ideas will be outlined in class, but you are responsible for proposing your project. Some background reading
is associated with each project.
The project proposal (due date Oct 21) should contain the following information:
- Topic to be addressed and the nature of the problem
- State of the art (prior work, what remains unsolved, etc.)
- The proposed technique to be implemented/evaluated
- To what degree the project will repeat existing work
- Specific, measurable goals: deliverables, and dates you expect to produce them
Project proposals should be a couple of pages at most. A project status report is due on Nov 11. The status report should include a description of progress to date and what is expected to be accomplished by the final project presentation day.