Advanced database systems (CSC2508), Fall 2016
Over the last few years a plethora of new data management systems and
architectures have become mainstream. Such systems vary widely in
application focus and architecture, and differ significantly from
traditional transactional database management systems and data warehouses. The goal of this course is to
explore these systems, broadly characterised as noSQL/newSQL systems, and
understand their strengths and limitations. We will also explore
new trends in data management fueled by application needs, such as support for advanced analytics,
stream processing systems and main memory data processing.
This is a graduate seminar course. Lectures will combine
presentations by the instructor and by the participants. All participants
are expected to actively engage in the course, be familiar with all the
material presented and drive the discussions for the part of the course
they are responsible for. The course involves a project;
more details will be available in class.
Announcements and clarifications
Instructor: Nick Koudas
Office: BA 5240
Office hours: by appointment
Instructor telephone: 416 946-5819
Instructor email: my last name @ uoft cs domain
Course web page: here
At the start of every lecture, I will ask a member of the class to summarise the main
topic that we will discuss. I would be interested to hear your thoughts on why this paper is
important and whether there is anything you would challenge in the methodology
or thesis of this paper. This is your chance to bring up any issues that demonstrate your deep understanding
of the topic.
You are expected to actively participate in the discussions for each lecture
and be fully familiar with the paper presented.
For each system you are assigned to present, you are expected to do all the
background research and collect all suitable references. You will share your
slide deck with the class and make it available through this website along
with all references you used.
For each type of system presented, make sure you structure your presentation
along the following lines:
- Programming Language Interface
- Data Model + Operators
- Physical Structures (e.g., indexes, hash tables)
- Users + Applications + Target Workloads
- Transaction Support:
- Consistency Model
- Failure Model + Replica Management
- ACID Properties
- Elasticity + Data Redistribution
- System Architecture
- What makes the system different from others?
Parallel Databases (9/12)
- Background: Read Chapters 16 and 22 from Ramakrishnan's book.
- Overview of Transactions, NoSQL
- DeWitt et al., Parallel Database Systems: The Future of High
Performance Processing, CACM 1992 (*)
Introduction to noSQL (9/19)
- M. Stonebraker, SQL databases v. NoSQL databases, in Commun. ACM 53(4): 10-11 (2010).
- J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004.
- J. Dean and S. Ghemawat, MapReduce: A Flexible Data Processing Tool, in CACM, January 2010, pp. 72-77.
- The Performance of MapReduce: An In-depth Study, VLDB 2010 (suggested)
SQL on Hadoop (9/26)
- SQL on Hadoop Systems (Tutorial), VLDB 2015
- Bigtable: A Distributed Storage System for Structured Data, OSDI 2006
- Hive: A Warehousing Solution Over a Map-Reduce Framework, VLDB 2009
- Major Technical Advancements in Apache Hive, SIGMOD 2014 (suggested)
- Dryad: Distributed Data Parallel Programs from Sequential Building Blocks, EuroSys 2007 (suggested)
SQL on Hadoop, Cont. (10/3)
- SQL on Hadoop Systems, VLDB 2015
- Impala: A Modern Open Source SQL Engine on Hadoop, CIDR 2015
- HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, VLDB 2009
- SQL on Hadoop: Full Circle Back to Parallel Database Architectures (suggested)
- Query Optimization discussion
- Spark: Cluster Computing with Working Sets. Matei Zaharia et al., HotCloud, June 2010.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia et al., NSDI 2012
- Scaling Spark in the Real World, VLDB 2015
- Lessons from Running Large Scale Spark Workloads (presentation)
- Spark SQL: Relational Data Processing in Spark, SIGMOD 2015
- MapReduce vs Spark for Large Scale Data Analytics, VLDB 2016
- A Top-Down Method for Performance Analysis and Counters Architecture, ISPASS 2014
- Profiling R on a Contemporary Processor, VLDB 2015
- Deep-dive Analysis of the Data Analytics Workload in CloudSuite, IISWC 2014
Analytics I (11/14)
- Learning Generalized Linear Models over Normalized Data, SIGMOD 2014
- Optimizing Machine Learning over normalized Data, VLDB 2015
Analytics II (11/21)
- To join or not to join: Thinking twice about joins before feature extraction, SIGMOD 2016
- Materialization Optimizations for Feature Selection Workloads, SIGMOD 2014
Analytics III (11/28 and 12/5)
- Data Cleaning: Problems and Current Approaches, IEEE DEB 2000
- ActiveClean: Interactive Data Cleaning for Statistical Modeling, VLDB 2016
- Progressive Approach to Relational Entity Resolution, VLDB 2014
- Guided Data Repair, VLDB 2011
Project Proposals / Paper presentations (12/12)
Breakdown of marks
The course mark will be broken down into
the categories listed below, with points assigned as indicated:
| Weight | Item | Minimal mark | Moderate mark | High mark |
|---|---|---|---|---|
| 30% | Participation | Present | Talkative | Insightful comments or questions |
| 20% | Presentations | Factually correct | Designed and delivered well | Transmits effectively key points, implications, etc. |
| 5% | Quality of feedback to peers | Focus on nitpicks and minutiae | Suggest incremental improvements | Identify structural strengths and flaws |
| 45% | Final project | Unambitious and/or badly planned | Partially implemented and/or poorly presented | Implemented successfully with key learning points presented |
The course is associated with a
project. Proposed class projects will be described by the instructor.
Feel free to discuss your ideas with the instructor and propose your
own project. However, the project you propose HAS to be related to the material covered in class.
This is very important and it is not up for discussion. The project should have a research component. A simple
implementation using the systems we discuss in class, on a data set you find interesting, does not constitute a project for this class. The projects will be outlined in class
and descriptions will be distributed in class. Some background reading
is associated with each project. The relevant technical papers will be
distributed in class.
Project proposals are due Nov 7 and should be a couple of pages at most.
The proposal should contain the following information:
- Topic to be addressed and the nature of the problem
- State of the art (prior work, what remains unsolved, etc.)
- The proposed technique to be implemented/evaluated
- To what degree the project will repeat existing work
- Specific, measurable goals: deliverables and the dates by which you expect to produce them