CSC2412 Algorithms for Private Data Analysis (Fall 2018)

Course Information

Instructor: Aleksandar Nikolov
Office: Sandford Fleming 2301B
E-mail:
Location: Bahen 3116 (BA3116)
Time: Thursday, 2-4pm
Office hours: by appointment.
Reading As a reference we will use the monograph by Dwork and Roth (from Aaron Roth's website).
We will also use the recent tutorial notes by Salil Vadhan.
Grading Based on one problem set (20%), a written project (50%), and a presentation (30%).

Course Description

The practical applicability of data analysis to sciences and decision-making is often limited by individual privacy concerns. Some of the most interesting data sets - medical studies, genetic data, movie reviews, search logs, social network connections - contain sensitive information about people represented in the data set. If individuals do not think their privacy concerns are addressed, they may refuse to participate in data collection, or not answer surveys truthfully. These issues limit the validity of the data analysis results, and make balancing privacy and usefulness essential to many applications of big data.

In this course we will study privacy in data analysis from a rigorous theoretical perspective. We will focus on Differential Privacy: a recent approach to achieving strong provable privacy guarantees in the analysis of sensitive data. Informally, a data analysis algorithm is differentially private if changing the data of a single individual changes the output distribution of the algorithm only slightly. This guarantee ensures that the privacy risk to any individual increases only slightly by participating in data collection. Our focus will be on the design of efficient differentially private algorithms. In the process, we will learn about fascinating connections between differential privacy and game theory, learning theory, and geometry.
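As a concrete illustration of this guarantee (a sketch of my own, not part of the course materials), here is Randomized Response, the classic mechanism covered early in the course: each person reports their true bit only with probability p, so any single report reveals little, yet the aggregate still lets us estimate population statistics. The function names are my own.

```python
import random

def randomized_response(true_bit: int, p: float = 0.75) -> int:
    """Report the true bit with probability p, and its flip otherwise.

    For p = e^eps / (1 + e^eps), this is eps-differentially private:
    changing one person's bit changes the probability of any output
    by at most a factor of e^eps.
    """
    return true_bit if random.random() < p else 1 - true_bit

def estimate_fraction(reports: list[int], p: float = 0.75) -> float:
    """Unbiased estimate of the true fraction of 1s from noisy reports."""
    observed = sum(reports) / len(reports)
    # E[observed] = p * f + (1 - p) * (1 - f); solve for f.
    return (observed - (1 - p)) / (2 * p - 1)
```

Note the trade-off the course studies in general: a larger p (weaker privacy) makes the estimate less noisy, while p close to 1/2 (strong privacy) requires more samples for the same accuracy.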

While we focus on the algorithms, there will be proofs. It is important to prove formally that an algorithm provides privacy: this is not something that can be verified experimentally. Most proofs are simple, but the course does require mathematical maturity and a background in probability theory and in the design and analysis of algorithms.

Problem Set

The Problem Set is posted. It is due in three weeks, on November 2 at noon. You should submit via MarkUs at markus.teach.cs.toronto.edu/csc2412-2018-09. See the problem set file for more detailed instructions.

Problem Set Solutions

Class Projects

General Information

At the end of the semester you need to submit a written project report and present the project in front of the class. Projects can be done in groups of two or individually. I strongly encourage you to work in groups of two. This way you get to know other students, you get to collaborate on research projects, and you can aim for a more ambitious project.

Your project should fall in one of the following categories:

Project Proposal

You should submit a Project Proposal by noon on Friday, October 26. The proposal should be one page. It is ok if it is a little longer but I will stop reading after the second page. It should describe the question you are investigating and your goals for the project. You should propose a range of goals, from ambitious ones to ones that you are certain you can achieve. The proposal should also briefly describe existing work related to your question and goals; you can limit yourself to the one or two most relevant papers. Aim for one paragraph to describe the question, another for the research goals, and a third for the related work.

Proposals should be submitted via MarkUs (markus.teach.cs.toronto.edu/csc2412-2018-09) by noon on October 26. They will not be graded: the goal is to give you feedback before you start working on your project. If you are working in a group, identify both members of the group in the proposal, and submit as a group in MarkUs. (Let me know if you do not know how to do that.)

Here are some suggested project ideas. These are not project proposals: before submitting one of them, you should flesh it out by specifying concrete goals and giving a more careful comparison with the one or two most relevant papers. These project ideas can be claimed on a first-come, first-served basis, so email me as soon as you decide that you want to pursue one of them.

You can also take a look at the recent workshops TPDP 2018 and Banff, the privacy sessions of CCS 2018, and the publications list of the Harvard Privacy Tools Project.

I have created a Piazza discussion board for the class. Feel free to use it to ask questions or to find a partner for the project.

Project Presentation

The presentations for your projects will be on December 6 (the last day of class), at the usual hour and in the usual meeting place. Prepare a 10 minute presentation that briefly describes what you studied and what results you achieved. It is ok to indicate that something is still in progress. If your project is a survey, your presentation should point out some key ideas in the work you are summarizing, present an open problem, and describe why it is important or interesting. The main thing I will be looking for in the presentations is clarity.

I will bring cookies.

Project Report

Your written project reports are due on December 20. Aim for 5-7 typed pages in A4 format with a reasonable font and margins (for example, 10-11pt font and 1 inch margins all around). You can submit your report by emailing it directly to me.

Your report should clearly describe the problem, or set of problems, that is addressed by your project. Then summarize the most relevant related work, including a brief summary of the main results. Then describe your findings and end with a discussion of directions for future research.

Of course, reports will be different for different types of projects:

Schedule of Lectures

Find a tentative schedule of lectures below. [V] denotes Vadhan's notes, and [DR] denotes Dwork and Roth's monograph: see links above.

  1. Attacks: reconstruction and tracing attacks against privacy;
    Reading: Lecture notes from 2017.
  2. Differential Privacy: definition; equivalent formulations; Randomized Response;
    Reading: Section 1 of [V], Chapter 2 of [DR]
  3. Basic Mechanisms and Properties of DP: Laplace and Gaussian noise mechanisms; composition; group privacy;
    Reading: Section 2 of [V]; Sections 3.1-3.3, the beginning of 3.5, and Appendix A of [DR]
  4. Exponential Mechanism: computing the median; synthetic data;
    Reading: Sections 3.4 and 4.1 of [DR]
  5. Learning the Database: the multiplicative weight update algorithm;
    Reading: Section 4.2 of [DR], and Lecture notes from 2017
  6. The Projection Mechanism;
    Reading: Section 12.4 of [DR], Section 7.3 of [V], and Lecture notes from 2017
  7. Projection Mechanism Analysis, Marginal Queries
    Reading: Lecture notes from 2017
  8. Empirical Risk Minimization: Private Gradient Descent;
    Reading: Lecture notes.
  9. The Sparse Vector Technique: answering queries online;
    Reading: Section 3.6 of [DR]
  10. Privacy and Adaptive Data Analysis: why private data analysis on the sample generalizes to the population;
    Reading: Lecture notes.
  11. Class Presentations.
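As a small preview of the basic mechanisms in lecture 3 (again a sketch of my own, not from the course notes): the Laplace mechanism answers a counting query of sensitivity 1 with eps-differential privacy by adding Laplace(1/eps) noise. The sketch below samples the Laplace distribution as a difference of two exponentials, using only the standard library.

```python
import random

def laplace_mechanism(true_count: float, eps: float) -> float:
    """Answer a sensitivity-1 query with eps-differential privacy.

    Adds Laplace noise with scale 1/eps: since changing one record
    changes the count by at most 1, the output densities on neighboring
    data sets differ by at most a factor of e^eps.
    """
    # A difference of two independent Exp(eps) variables is Laplace(1/eps).
    noise = random.expovariate(eps) - random.expovariate(eps)
    return true_count + noise
```

The noise has standard deviation sqrt(2)/eps, independent of the data set size, which is why counting queries over large data sets can be answered both privately and accurately.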