CSC2412 (Fall 2019): Algorithms for Private Data Analysis

Course Information

Instructor	Aleksandar Nikolov Office: Sandford Fleming 2301B E-mail:
Location	Bissel 112 (BL 112)
Time	Thursday, 2-4pm Office hours by appointment.
Reading	As a reference we will use the monograph by Dwork and Roth (from Aaron Roth's website). We will also use the tutorial notes by Salil Vadhan. There will be lecture notes posted for some of the lectures.
Grading	Based on two problem sets (20%), a written project (50%), and a presentation (30%).
Discussion Forum	Feel free to sign up for and use Piazza.

Course Description

The practical applicability of data analysis to sciences and decision-making is often limited by individual privacy concerns. Some of the most interesting data sets - medical studies, genetic data, movie reviews, search logs, social network connections - contain sensitive information about people represented in the data set. If individuals do not think their privacy concerns are addressed, they may refuse to participate in data collection, or not answer surveys truthfully. These issues limit the validity of the data analysis results, and make balancing privacy and usefulness essential to many applications of big data.

In this course we will study privacy in data analysis from a rigorous theoretical perspective. We will focus on Differential Privacy: a recent approach to achieving strong provable privacy guarantees in the analysis of sensitive data. Informally, a data analysis algorithm is differentially private if changing the data of a single individual changes the output distribution of the algorithm only slightly. This guarantee ensures that the privacy risk to any individual increases only slightly by participating in data collection. Our focus will be on the design of efficient differentially private algorithms. In the process, we will learn about fascinating connections between differential privacy and machine learning, geometry, and game theory.

While we focus on the algorithms, there will be proofs. It is important to prove formally that an algorithm provides privacy: this is not something that can be verified experimentally. Most proofs are simple, but the course does require mathematical maturity and a background in probability theory, and the design and analysis of algorithms.

Problem Sets

Problem Set 1 is posted. It is due in three weeks, on October 25, by 11:59pm. You should submit via MarkUs at markus.teach.cs.toronto.edu/csc2412-2019-09. See the problem set file for more detailed instructions.

Problem Set Solutions

Problem Set 2 is posted. It is due in two weeks, on November 29, by 11:59pm. You should submit via MarkUs at markus.teach.cs.toronto.edu/csc2412-2019-09. See the problem set file for more detailed instructions.

Class Projects

General Information

At the end of the semester you need to submit a written project report and present the project in front of the class. Projects can be done in groups of two or individually. I strongly encourage you to work in groups of two. This way you get to know other students, you get to collaborate on research projects, and you can aim for a more ambitious project.

Your project should fall in one of the following categories:

Theory Project. Your base goal should be a survey of the theoretical literature on a particular question or several related questions in private data analysis. In addition to the survey, you should aim at some theoretical research goals, like designing a new algorithm, proving an impossibility result, or investigating a new definition. It is great if the research is successful, but there will be no penalty if it is not, and your project ends up as just the survey.
Empirical Project. Your goal should be to implement one or several private data analysis algorithms, or attacks, and evaluate them empirically.
Combined Project. Such a project has a combination of theoretical and empirical components. For example, you can propose an algorithm, and attempt to both analyze it theoretically, and evaluate it empirically.

Here are some suggested project ideas, as examples.

Project Proposal

You should submit a Project Proposal by 11:59pm on Friday, October 25. The proposal should be one page. It is ok if it is a little longer but I will stop reading after the second page. It should describe the question you are investigating and your goals for the project. You should propose a range of goals, from ambitious ones to ones that you are certain you can achieve. The proposal should also briefly describe existing work related to your question and goals; you can limit yourself to the one or two most relevant papers. Aim for one paragraph to describe the question, another for the research goals, and a third for the related work.

Proposals should be submitted via MarkUs (markus.teach.cs.toronto.edu/csc2412-2019-09) by 11:59pm on October 25. They will not be graded: the goal is to give you feedback before you start working on your project. If you are working in a group, identify both members of the group in the proposal, and submit as a group in MarkUs. (Let me know if you do not know how to do that.)

Here are some suggested project ideas. These are not project proposals: before submitting one of them, you want to flesh them out by specifying concrete goals and giving a more careful comparison with one or two most relevant papers. These project ideas can be claimed on a first-come, first-served basis, so email me as soon as you decide that you want to pursue one of them.

You can take a look at the accepted papers in the upcoming workshop TPDP 2019, differential privacy papers from ICML, and presentations from the five workshops during the Simons Institute Semester. You can also take a look at the workshops TPDP 2018 and Banff, the privacy sessions of CCS 2018, and the publications list of the Harvard Privacy Tools Project.

I have created a Piazza discussion board for the class. Feel free to use it to ask questions or to find a partner for the project.

Project Presentation

The presentations for your projects will be on December 5 (the last day of class), at the usual hour, and in the usual meeting place. Prepare an 8 minute (do not go over time!) presentation that describes briefly what you are studying, and any (partial) results you have already achieved. It is ok to indicate that something is still in progress. In case your project is a survey, your presentation should point out some key ideas in the work you are summarizing, and present an open problem, and describe why it is important or interesting. The main thing I will be looking for in the presentations will be clarity.

I will bring cookies.

Project Report

Your written project reports are due on December 20. Aim for 5-7 typed pages, in A4 format with reasonable font and margins (so, for example, 10-11pt font and 1 inch margins all aroound). You can submit your report by emailing it directly to me.

Your report should clearly describe the problem, or set of problems, that is addressed by your project. Then summarize the most relevant related work, including a brief summary of the main results. Then describe your findings and end with a discussion of directions for future research.

Of course, reports will be different for different types of projects:

If your project is a survey of theoretical work, then you want to go into more details of existing research. Beyond summarizing the main results, you also want to give an idea of the main techiques and definitions, sketch one or several proofs of important results. You also should identify open problems, speculate how they can be solved, and maybe make some concrete conjectures. Basically, in a survey I want to see that you have understood how different papers relate to each other, what their main insights are, and what they leave open, and how you would attack the open problems.
For a theoretical project that proves some new results, you should carefully write out the definitions, assumptions, and, of course your proofs. Try to get your proofs to be as clear and as simple as you can.
For an empirical project, you should describe exactly what data you are using (synthetic or real), how you generated/preprocessed the data, what algorithms you used, and how you chose their paramaters. You should describe the experimental set up. Your findings should be presented cleanly in plots and tables.

Schedule of Lectures

Find a tentative schedule of lectures below. [V] denotes Vadhan's notes, and [DR] denotes Dwork and Roth's monograph: see links above.

Attacks: reconstruction and tracing attacks against privacy;
Reading: Lecture notes.
Differential Privacy: definition; equivalent formulations; Randomized Response;
Reading: Section 1 of [V], Chapter 2 of [DR]
Basic Mechanisms and Properties of DP: Laplace and Gaussian noise mechanisms; composition; group privacy;
Reading: Section 2 of [V], Sections 3.1-3.3 and beginning of 3.5, Appendix A of [DR]
Exponential Mechanism: private PAC learning;
Reading: Sections 3.4 and 11.1 of [DR], Section 8.1. of [V]
Empirical Risk Minimization: Private Gradient Descent;
Reading: Lecture notes.
Privacy and Adaptive Data Analysis: why private data analysis on the sample generalizes to the population;
Reading: Lecture notes.
Learning the Database: the multiplicative weight update algorithm;
Reading: Sections 4.2. of [DR], and Lecture notes
The Projection Mechanism;
Reading: Section 12.4. of [DR], Sections 7.3. of [V], and Lecture notes
Projection Mechanism Analysis, Marginal Queries
Reading: Lecture notes
The Sparse Vector Technique: answering queries online;
Reading: Section 3.6. of [DR]
Class Presentations.

CSC2412 Algorithms for Private Data Analysis (Fall 2019)