CSC2412 Algorithms for Private Data Analysis (Fall 2018)

Course Information

Instructor: Aleksandar Nikolov
Office: Sandford Fleming 2301B
E-mail:
Location: Bahen 3116 (BA3116)
Time: Thursday, 2-4pm
Office hours: by appointment.
Reading As a reference we will use the monograph by Dwork and Roth (from Aaron Roth's website).
We will also use the recent tutorial notes by Salil Vadhan.
Grading Based on one problem set (20%), a written project (50%), and a presentation (30%).

Course Description

The practical applicability of data analysis to sciences and decision-making is often limited by individual privacy concerns. Some of the most interesting data sets - medical studies, genetic data, movie reviews, search logs, social network connections - contain sensitive information about people represented in the data set. If individuals do not think their privacy concerns are addressed, they may refuse to participate in data collection, or not answer surveys truthfully. These issues limit the validity of the data analysis results, and make balancing privacy and usefulness essential to many applications of big data.

In this course we will study privacy in data analysis from a rigorous theoretical perspective. We will focus on Differential Privacy: a recent approach to achieving strong provable privacy guarantees in the analysis of sensitive data. Informally, a data analysis algorithm is differentially private if changing the data of a single individual changes the output distribution of the algorithm only slightly. This guarantee ensures that the privacy risk to any individual increases only slightly by participating in data collection. Our focus will be on the design of efficient differentially private algorithms. In the process, we will learn about fascinating connections between differential privacy and game theory, learning theory, and geometry.
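As a concrete illustration of this guarantee (a sketch of my own, not part of the course materials), here is Randomized Response, the classic mechanism covered early in the course: each person reports their true bit only with probability p, so any single report reveals little, yet the aggregate still lets us estimate population statistics. The function names are my own.

```python
import random

def randomized_response(true_bit: int, p: float = 0.75) -> int:
    """Report the true bit with probability p, and its flip otherwise.

    For p = e^eps / (1 + e^eps), this is eps-differentially private:
    changing one person's bit changes the probability of any output
    by at most a factor of e^eps.
    """
    return true_bit if random.random() < p else 1 - true_bit

def estimate_fraction(reports: list[int], p: float = 0.75) -> float:
    """Unbiased estimate of the true fraction of 1s from noisy reports."""
    observed = sum(reports) / len(reports)
    # E[observed] = p * f + (1 - p) * (1 - f); solve for f.
    return (observed - (1 - p)) / (2 * p - 1)
```

Note the trade-off the course studies in general: a larger p (weaker privacy) makes the estimate less noisy, while p close to 1/2 (strong privacy) requires more samples for the same accuracy.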

While we focus on the algorithms, there will be proofs. It is important to prove formally that an algorithm provides privacy: this is not something that can be verified experimentally. Most proofs are simple, but the course does require mathematical maturity and a background in probability theory and in the design and analysis of algorithms.

Problem Set

The Problem Set is posted. It is due in three weeks, on November 2 at noon. You should submit via MarkUs at markus.teach.cs.toronto.edu/csc2412-2018-09. See the problem set file for more detailed instructions.

Problem Set Solutions

Class Projects

General Information

At the end of the semester you need to submit a written project report and present the project in front of the class. Projects can be done in groups of two or individually. I strongly encourage you to work in groups of two. This way you get to know other students, you get to collaborate on research projects, and you can aim for a more ambitious project.

Your project should fall in one of the following categories:

Project Proposal

You should submit a Project Proposal by noon on Friday, October 26. The proposal should be one page. It is ok if it is a little longer but I will stop reading after the second page. It should describe the question you are investigating and your goals for the project. You should propose a range of goals, from ambitious ones to ones that you are certain you can achieve. The proposal should also briefly describe existing work related to your question and goals; you can limit yourself to the one or two most relevant papers. Aim for one paragraph to describe the question, another for the research goals, and a third for the related work.

Proposals should be submitted via MarkUs (markus.teach.cs.toronto.edu/csc2412-2018-09) by noon on October 26. They will not be graded: the goal is to give you feedback before you start working on your project. If you are working in a group, identify both members of the group in the proposal, and submit as a group in MarkUs. (Let me know if you do not know how to do that.)

Here are some suggested project ideas. These are not project proposals: before submitting one of them, you should flesh it out by specifying concrete goals and giving a more careful comparison with the one or two most relevant papers. These project ideas can be claimed on a first-come, first-served basis, so email me as soon as you decide that you want to pursue one of them.

You can also take a look at the recent workshops TPDP 2018 and Banff, the privacy sessions of CCS 2018, and the publications list of the Harvard Privacy Tools Project.

I have created a Piazza discussion board for the class. Feel free to use it to ask questions or to find a partner for the project.

Project Presentation

The presentations for your projects will be on December 6 (the last day of class), at the usual hour and in the usual meeting place. Prepare a 10 minute presentation that briefly describes what you studied and what results you achieved. It is ok to indicate that something is still in progress. If your project is a survey, your presentation should point out some key ideas in the work you are summarizing, present an open problem, and describe why it is important or interesting. The main thing I will be looking for in the presentations is clarity.

I will bring cookies.

Project Report

Your written project reports are due on December 20. Aim for 5-7 typed pages in A4 format with a reasonable font and margins (for example, 10-11pt font and 1 inch margins all around). You can submit your report by emailing it directly to me.

Your report should clearly describe the problem, or set of problems, that is addressed by your project. Then summarize the most relevant related work, including a brief summary of the main results. Then describe your findings and end with a discussion of directions for future research.

Of course, reports will be different for different types of projects:

Schedule of Lectures

Find a tentative schedule of lectures below. [V] denotes Vadhan's notes, and [DR] denotes Dwork and Roth's monograph: see links above.

  1. Attacks: reconstruction and tracing attacks against privacy;
    Reading: Lecture notes from 2017.
  2. Differential Privacy: definition; equivalent formulations; Randomized Response;
    Reading: Section 1 of [V], Chapter 2 of [DR]
  3. Basic Mechanisms and Properties of DP: Laplace and Gaussian noise mechanisms; composition; group privacy;
    Reading: Section 2 of [V]; Sections 3.1-3.3, the beginning of 3.5, and Appendix A of [DR]
  4. Exponential Mechanism: computing the median; synthetic data;
    Reading: Sections 3.4 and 4.1 of [DR]
  5. Learning the Database: the multiplicative weight update algorithm;
    Reading: Section 4.2 of [DR], and Lecture notes from 2017
  6. The Projection Mechanism;
    Reading: Section 12.4 of [DR], Section 7.3 of [V], and Lecture notes from 2017
  7. Projection Mechanism Analysis, Marginal Queries
    Reading: Lecture notes from 2017
  8. Empirical Risk Minimization: Private Gradient Descent;
    Reading: Lecture notes.
  9. The Sparse Vector Technique: answering queries online;
    Reading: Section 3.6 of [DR]
  10. Privacy and Adaptive Data Analysis: why private data analysis on the sample generalizes to the population;
    Reading: Lecture notes.
  11. Class Presentations.
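As a small preview of the basic mechanisms in lecture 3 (again a sketch of my own, not from the course notes): the Laplace mechanism answers a counting query of sensitivity 1 with eps-differential privacy by adding Laplace(1/eps) noise. The sketch below samples the Laplace distribution as a difference of two exponentials, using only the standard library.

```python
import random

def laplace_mechanism(true_count: float, eps: float) -> float:
    """Answer a sensitivity-1 query with eps-differential privacy.

    Adds Laplace noise with scale 1/eps: since changing one record
    changes the count by at most 1, the output densities on neighboring
    data sets differ by at most a factor of e^eps.
    """
    # A difference of two independent Exp(eps) variables is Laplace(1/eps).
    noise = random.expovariate(eps) - random.expovariate(eps)
    return true_count + noise
```

The noise has standard deviation sqrt(2)/eps, independent of the data set size, which is why counting queries over large data sets can be answered both privately and accurately.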