Advanced Data Systems (CSC2508), Fall 2019

Course Description

The maturity of several Deep Learning technologies has influenced the design of data management systems and architectures and has prompted a rethinking of several of their design principles. The goal of this course is two-fold. First, we will review the fundamental design components of modern data management architectures, including relational and NoSQL systems. Second, we will explore how these fundamental components can be redesigned by incorporating Deep Learning principles and techniques, and examine the resulting performance and system implications. We will also review and investigate a few novel data management application scenarios that are uniquely enabled by merging Deep Learning and query processing technologies.

This is a graduate seminar course. There will be a combination of presentations by the instructor and the participants. All participants are expected to actively engage in the course, be familiar with all the material presented, and drive the discussions for the part of the course they are responsible for. The course involves a project; more details will be available in class.

Announcements and clarifications

Administrivia

Instructor: Nick Koudas
Lectures: BA025
Office: BA 5240
TA: TBD

Office hours: by appointment
Instructor telephone: 416 946-5819
Instructor email: my last name @ uoft cs domain
Course web page: here

Course structure

At the start of every lecture, I will ask a member of the class to summarise the main topic that we will discuss. I would be interested to hear your thoughts on why the paper is important and whether there is anything you would challenge in its methodology or thesis. This is your chance to bring up any issues that demonstrate your deep understanding of the topic.

You are expected to actively participate in the discussions for each lecture and be fully familiar with the papers presented. For each paper you are assigned to present, you are expected to do all the background research and collect all suitable references. You will share your slide deck with the class and make it available through the course shared folder, along with all the references you used.

The class folder with access to reading material (and presentations as they become available) is here

Readings

Review of relational technology (9/9)

Background: Read Chapters 3, 4, 10, 13, 14, and 15 of Ramakrishnan's book (Database Management Systems), 3rd Edition.
An Overview of Query Optimization in Relational Systems, S. Chaudhuri, PODS 1998
R-Trees: A Dynamic Index Structure for Spatial Searching, A. Guttman, SIGMOD 1984
Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses, Muller et al., VLDB 2018

Overview of NoSQL (9/16)

M. Stonebraker, SQL databases v. NoSQL databases, in Communications of the ACM 53(4): 10-11 (2010).
J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004.
J. Dean and S. Ghemawat, MapReduce: A Flexible Data Processing Tool, in CACM, January 2010, pp. 72-77.
Zaharia et al., Apache Spark: A Unified Engine for Big Data Processing, in CACM 2016.

Indexing (9/23)

SageDB: A Learned Database System, CIDR 2019
The Case for Learned Index Structures, SIGMOD 2018
Considerations for Handling Updates in Learned Index Structures, aiDM 2019
ALEX: An Updatable Adaptive Learned Index, 2019

Query Optimization (9/30)

Towards a Hands-Free Query Optimizer through Deep Learning, CIDR 2019
Plan-Structured Deep Neural Network Models for Query Performance Prediction, VLDB 2019
Learning State Representations for Query Optimization with Deep Reinforcement Learning, DEEM 2018
Deep Reinforcement Learning for Join Order Enumeration, aiDM 2018
Learning to Optimize Join Queries with Deep Reinforcement Learning

Selectivity Estimation (10/7)

Learned Cardinalities: Estimating Correlated Joins with Deep Learning, CIDR 2019
Multi-Attribute Selectivity Estimation Using Deep Learning
MADE: Masked Autoencoder for Distribution Estimation, ICML 2015

Data Exploration (10/21)

Intelligent Rollups in Multidimensional Data, VLDB 2001
Interactive Data Exploration with Smart Drill-Down, VLDB 2015
Neural Cubes: Deep Representations for Visual Data Exploration
Visual Exploration of Machine Learning Results Using Data Cube Analysis, HILDA 2016

Entity Resolution (10/28)

A Theory of Record Linkage
Magellan: Toward Building Entity Matching Systems, VLDB 2016
Entity Matching: How Similar Is Similar, VLDB 2011
Entity Resolution Tutorial
Deep Learning for Entity Matching: A Design Space Exploration, SIGMOD 2018

Entity Resolution Optimizations (11/11)

Data Management for Video Streams (11/18)

RDBMS for Machine Learning (11/25)

Scalable Linear Algebra on a Relational Database System, ICDE 2017
Scaling Machine Learning via Compressed Linear Algebra, VLDB 2016
SystemML: Declarative Machine Learning on Spark, VLDB 2016
Materialization Optimizations for Feature Selection Workloads, SIGMOD 2014

Project Presentations (12/2)

Other Resources

PhD Comics
Grad school

Food for thought: grad school

Food for thought: more talks

Breakdown of marks

The course mark will be broken down into the categories listed below, with points assigned as indicated:

Weight	Item	Minimal mark	Moderate mark	High mark
30%	Participation	Present	Talkative	Insightful comments or questions
20%	Presentations	Factually correct	Designed and delivered well	Effectively conveys key points, implications, etc.
5%	Quality of feedback to peers	Focus on nitpicks and minutiae	Suggest incremental improvements	Identify structural strengths and flaws
45%	Final project	Unambitious and/or badly planned	Partially implemented and/or poorly presented	Implemented successfully with key learning points presented

Project proposals

The course is associated with a project. Proposed class projects will be described by the instructor. Feel free to discuss your ideas with the instructor and propose your own project. However, the project you propose HAS to be associated with the material in the class. This is very important and it is not up for discussion. The project should have a research component. Project ideas will be outlined in class, but you are responsible for proposing your project. Some background reading is associated with each project. The project proposal (due Oct 21) should contain the following information:

Topic to be addressed and the nature of the problem
State of the art (prior work, what remains unsolved, etc.)
The proposed technique to be implemented/evaluated
To what degree the project will repeat existing work
Specific, measurable goals: deliverables, and dates you expect to produce them

Project proposals should be a couple of pages at most. A project status report is due on Nov 11. The status report should include a description of progress to date and what is expected to be accomplished by the final project presentation day.