Announcements
- Please see the Quercus page for this course for announcements.
- Zoom link for rehearsals.
- Before making a presentation or handing in a piece of work, please read the corresponding rubric (below), as it contains detailed information on what is expected.
Overview
Generative AI has recently achieved revolutionary performance and burst into public view through such systems as ChatGPT and DALL-E. This course examines the techniques that have made this possible, with an emphasis on machine vision and image synthesis. Topics will be selected from diffusion models, score matching, normalizing flows, neural differential equations, variational autoencoders, transformers, and large language models. Many of these techniques are mathematically sophisticated.
This is primarily a seminar course in which students read and present papers from the literature, though there may be some supplementary lectures on advanced material. There will also be a major course project. The goal is to bring students to the state of the art in this exciting field.
Prerequisites:
An advanced course in Machine Learning (such as csc413 or csc2516), especially neural nets; a solid knowledge of linear algebra; the basics of multivariate calculus and probability; and programming skills, especially programming with vectors and matrices (e.g., NumPy). Some knowledge of differential equations would be an asset. Mathematical maturity will be assumed.
Classes:
- Friday, 1-3pm. The first class will be on Fri Sept 6.
- All classes are in person, starting at 1:10pm.
- Location: MY 330 (Myhal Centre, 55 St. George St.)
Instructor:
- Anthony Bonner
- Email: anthony [dot] bonner [at] utoronto [dot] ca
- Office: Pratt 396B
- Phone: 416-997-3463
- Office Hours: by appointment
Teaching Assistants:
- Shuhong Zheng
- Jesse Bettencourt
Textbook:
There is no required textbook for this course. However, the following two books contain essential material at the graduate level. Both are available as free PDF downloads for U of T students and faculty:
- Christopher Bishop and Hugh Bishop, Deep Learning - Foundations and Concepts, Springer, 2024. An up-to-date book with good descriptions of many of the methods covered in this course.
- Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2006. Good treatment of essential background material, both basic and advanced. The material on variational inference (Chapter 10) is particularly relevant to this course, including the excellent illustrations of KL divergence.
Course Structure
The course is structured along the same lines as the csc2547 course I gave in spring 2022, though the topics and papers covered this year are quite different:
- First two classes: lectures on background material.
- Next eight classes: student presentations of papers from the literature.
- Last two classes: project presentations.
Paper presentations (tentative):
- The goal is for each paper presentation to be a high-quality, accessible tutorial.
- Each week will focus on one or two topics.
- You will vote for your choice of topic/week (soon).
- I will assign you to a week (soon).
- Papers on each topic will be listed below.
- If you have a particular paper you would like to add to the list, let me know.
- With 8 weeks and 60 students, there will be 7 or 8 presenters per week, giving each presenter about 13 minutes (including questions and transition time).
- Two-week planning cycle:
- About two weeks before your presentation, you will vote online for the particular paper you will present.
- (If you have not voted by the deadline, you will be assigned a paper that no one else wants to read.)
- The following week, you will vote online for a rehearsal date and time.
- At the agreed-upon date and time, meet online with me or the TA for a practice presentation (required).
- Present in class under strict time constraints (just like a conference).
- Papers may be presented in teams of two or three with longer presentations (13 minutes per team member).
- Unless a paper is particularly difficult or long, a team will be expected to cover a group of related papers (one paper per team member).
- A team may cover one paper listed below and one or more of its references.
- Each team member should talk for a total of about 13 minutes (though possibly in multiple short segments).
- These presentations should be related; e.g., one presentation may give background material for the others, or one may expand upon the others.
- Each member of a team will receive the same grade (based on overall team performance).
- Feel free to suggest other possibilities.
- Before presenting a paper, practice the talk by yourself or in front of a friend to make sure you can finish in the required time without rushing.
- Guidelines for presenting a paper:
- Make eye contact with the audience.
- Do not rush.
- Speak clearly, and do not say "Um" or "Uh".
- Do not read from a prepared script. Instead, elaborate on what is on the slides.
- Be clear on what problem the paper is trying to solve. (It is far too common to describe a solution without describing the problem it is solving.)
- The overall presentation should be understandable to most of the audience.
- Do not go into too much detail. (What you do not say is as important as what you do.)
- Each slide should describe at most one idea.
- No more than one idea per minute (so no more than one slide per minute, unless a single idea is spread over several slides).
- You should be able to explain anything that you put on a slide. (e.g., If you mention MCMC on a slide, then you should be able to say something informative about it, even if you don't understand it completely.)
- Don't just explain how a system works. Also explain why it works (e.g., why are the latent variables interpretable?). This will require you to understand the system.
- If it isn't clear why a system works, you should think about it and speculate on possible reasons.
- Be able to answer questions about what you have presented.
- Minimize the use of formulas.
- Use pictures wherever possible.
- Describe how the system is trained: end-to-end? Supervised? Unsupervised? Pre-trained with synthetic data?
- Describe the loss function for training.
Projects:
- You may propose any project you like, as long as it is about generative AI for images and it has a major technical component.
- Here are some project ideas and considerations.
- Projects may be done individually or in teams of up to three. More will be expected of a team project.
- The grade will depend on the ideas, how well you present them in the report, how clearly you position your work relative to existing literature, how illuminating your experiments are, and how well-supported your conclusions are. Full marks will require a novel contribution.
- Each team will write a short (2-4 pages) research project proposal, which ideally will be structured similarly to a standard paper. It should include a description of a minimum viable project, some nice-to-haves if time allows, and a short review of related work. You don't have to do exactly what your proposal says; the point of the proposal is mainly to have a plan and to make it easy for me to give you feedback. (Team proposals should be close to 4 pages and should describe the division of labour among team members.)
- Towards the end of the course everyone will present their project in a short, roughly 4-minute presentation.
- At the end of the class you’ll hand in a project report (around 4 to 8 pages), ideally in the format of a machine learning conference paper such as NeurIPS. (Team reports should be close to 8 pages.)
- Each member of a team will receive the same grade (based on overall project quality).
- Project Considerations:
- Is your idea sensible?
- Can you download all the necessary data and software?
- Can you run the software? (on the data?)
- Do you need to modify the software?
- Can you make the modifications?
- Can you compile and run the source code?
- Are you writing software from scratch?
- Do you have the computational resources (GPUs)?
- Do you have the time to complete the project?
- Start by duplicating the results in a paper (if it gives enough detail).
- The following compute resources have GPUs that you may be able to use: CSLab, Google Colab, and the Digital Research Alliance of Canada.
Marking Scheme:
- [20%] Paper presentation. Rubric
- [20%] Project proposal, due Tuesday October 15. Rubric
- [20%] Project presentations, Friday November 22 and 29. Rubric
- [40%] Project report (and optional code), due Tuesday December 17. Rubric
All course work should be submitted through Quercus.
Tentative Schedule: (details to be added)
- September 6: Lecture
- References:
- Bishop, 2024, Chapter 19
- Bishop, 2006, Chapter 10
- Kingma and Welling, Auto-Encoding Variational Bayes. The original paper on Variational Autoencoders.
- September 13: Lecture
- References:
- Bishop, 2024, Chapter 19
- Mnih and Gregor, Neural Variational Inference and Learning in Belief Networks. Backpropagation through sampling of discrete latent variables (see the summary below).
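The key technical issue in both of these lectures is backpropagating through sampling. For orientation, here are the two standard gradient estimators, written in my own notation rather than either paper's exact formulation:

```latex
% Reparameterization trick (Kingma and Welling): for continuous latents, write
% the sample as a deterministic function of independent noise, so gradients
% with respect to phi flow through mu and sigma.
\[
\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x})
  + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon},
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\]

% Score-function (REINFORCE) estimator (Mnih and Gregor): applies even to
% discrete latents, where the reparameterization trick does not.
\[
\nabla_\phi \, \mathbb{E}_{q_\phi(\mathbf{z})}\!\left[ f(\mathbf{z}) \right]
= \mathbb{E}_{q_\phi(\mathbf{z})}\!\left[ f(\mathbf{z}) \,
  \nabla_\phi \log q_\phi(\mathbf{z}) \right]
\]
```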
Student Presentations
- September 20: Foundations and Background
There is a long history of AI techniques for generating images, forming the foundation and inspiration of many current methods. Here are some of the seminal papers. Some methods generate random images. Others generate images satisfying specified conditions (such as a text description).
Suggested papers for presentation:
- Pixel Recurrent Neural Networks. Autoregressive image generation. (Anandita Mahika)
- Attention is all you need. The original paper on Transformers. (Abhinav Muraleedharan)
- An Image is Worth 16x16 Words. The original paper on Vision Transformers. (Xiajun Deng)
- NICE: Non-Linear Independent Components Estimation. Flow-based image generation. (Chen-Hao Chow)
- Variational Inference with Normalizing Flows. The paper that coined the term "Normalizing Flow". (Andreas Burger)
- Deep Conditional Generative Models. Using Variational Autoencoders to generate images subject to conditions.
- Generative Adversarial Networks. The original paper on GANs. (Yichen Cai)
- Conditional Adversarial Networks. Using GANs to generate images subject to conditions. (Sean Xiao)
- Deep Residual Learning for Image Recognition. The original paper on ResNets. (Sherwin Bahmani)
- September 27: Autoregressive Models
Autoregressive models generate images sequentially, one pixel at a time or one line at a time, using previously generated pixels as context for generating the next pixels. This is a compute-intensive process, and making it hierarchical speeds up image generation and leads to higher image quality.
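Concretely, the models below share the chain-rule factorization of the image distribution (standard notation, not specific to any one paper):

```latex
% Autoregressive factorization: pixels x_1, ..., x_n are generated in a fixed
% (e.g., raster-scan) order, each conditioned on all previously generated pixels.
\[
p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
\]
```

Training maximizes the log-likelihood of this product, which can be computed in parallel across pixels, but sampling must evaluate the conditionals one pixel at a time, which is why generation is slow and why the hierarchical variants below help.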
Suggested papers for presentation:
- Neural Discrete Representation Learning. Describes VQ-VAE, a variational autoencoder with discrete latent variables. (Mickell Als and Justin Wu)
- Conditional Image Generation with PixelCNN Decoders. Image generation with input prompts. (Oufan Ouyang)
- PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. Describes an improved implementation of PixelCNN. (Or Perel)
- WaveNet: A Generative Model for Raw Audio. Develops techniques for audio that are frequently applied to images. (Murdock Aubry)
- Hierarchical image generation:
- Generating Diverse High-Fidelity Images with VQ-VAE-2 (Mickell Als and Justin Wu)
- PixelCNN Models with Auxiliary Variables for Natural Image Modeling (Michael Yuan)
- Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling
- Hierarchical Autoregressive Image Models with Auxiliary Decoders (Xiaoyan Li)
- October 4: Generative Transformer Models
Influenced by the success of Transformers in generating text, researchers are adapting them to the generation of images. The list below sets the stage with two papers on text generation.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. The seminal paper on unsupervised pretraining of transformers for natural language. (Ken Shi)
- Language Models are Unsupervised Multitask Learners. The GPT-2 paper, an early demonstration of the revolutionary potential of the technology behind ChatGPT. (Steven Yuan)
- Generative Pretraining From Pixels (Jiawei Wang)
- Masked Autoencoders Are Scalable Vision Learners (Mingjie Zhao)
- Image Transformer (Eric Zheng)
- BEiT: BERT Pre-Training of Image Transformers (Yi Lu)
- MaskGIT: Masked Generative Image Transformer (Courtney Amm)
- Taming Transformers for High-Resolution Image Synthesis (Wentao Ma)
- Zero-Shot Text-to-Image Generation. A description of Dall-E. (Ujan Sen)
- October 11: Diffusion Models and Score Matching
Diffusion models have recently burst onto the scene for image generation, both in research and commercially in systems such as Dall-E, Emu and SDXL, producing state-of-the-art images of amazing detail and complexity. They have deep mathematical foundations, rooted in statistical physics. Application papers are much less mathematical, but generally assume that you have grasped the fundamentals. This week's papers focus on those fundamentals.
There is a close connection between diffusion models and score matching, and an excellent description of both can be found in Bishop, 2024, Chapter 20. The related topic of Langevin sampling is described in Chapter 14, Section 3.
Essential Reading: Bishop, 2024, Sections 20.1 to 20.3 and 14.3, and Papers 1, 2 and 4 below.
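For orientation, the objects that recur throughout these papers can be written in a few lines (standard notation, following Bishop, Chapter 20; individual papers differ in details):

```latex
% Forward (noising) process: data x_0 is gradually corrupted by Gaussian noise
% according to a variance schedule beta_1, ..., beta_T.
\[
q(\mathbf{x}_t \mid \mathbf{x}_{t-1})
= \mathcal{N}\!\left( \mathbf{x}_t;\ \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1},\ \beta_t \mathbf{I} \right)
\]

% Score matching: train a network s_theta to approximate the score, i.e., the
% gradient of the log data density.
\[
\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})
\]

% Langevin sampling (Section 14.3): draw samples using only the score.
\[
\mathbf{x}_{k+1} = \mathbf{x}_k + \frac{\epsilon}{2}\, \mathbf{s}_\theta(\mathbf{x}_k)
+ \sqrt{\epsilon}\, \mathbf{z}_k,
\qquad \mathbf{z}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\]
```

The connection exploited by Paper 4 below is that training a model to denoise x_t amounts to estimating this score at every noise level.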
Suggested papers for presentation:
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics. The seminal paper on diffusion processes for generative modelling in machine learning. (Shaohong Chen)
- Generative Modeling by Estimating Gradients of the Data Distribution. The seminal paper on score matching for generative models. (Felix Taubner)
- Improved Techniques for Training Score-Based Generative Models. Making score-based models state-of-the-art. (Jinyang Zhao)
- Denoising Diffusion Probabilistic Models. The seminal paper connecting diffusion models to score matching, dramatically increasing image quality. (Yifeng Chen)
- Variational Diffusion Models (Jinyu Liu)
- Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation (Remi Grzeczkowicz)
- Improved Denoising Diffusion Probabilistic Models
- October 18: Guided Diffusion Models
Instead of generating random images, guided diffusion models respond to prompts for specific kinds of images. These prompts may be a simple class label (e.g., cat), an arbitrary text description (e.g., an Italian village with a Romanesque church on a hillside at sunset), an incomplete image that needs to be filled in (in-painting), a low-resolution image that is to be converted to high-resolution (super-resolution), or in general, any condition that constrains the kind of image to be generated.
Essential reading: Bishop, 2024, Section 20.4, and Papers 1, 2 and 8 below.
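For orientation, the two guidance mechanisms that organize this week's papers fit in two standard equations (notation varies slightly across papers):

```latex
% Classifier guidance (Paper 1): by Bayes' rule, the conditional score is the
% unconditional score plus the gradient of a separately trained classifier.
\[
\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid y)
= \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)
+ \nabla_{\mathbf{x}_t} \log p(y \mid \mathbf{x}_t)
\]

% Classifier-free guidance (Papers 2 and 8): one network is trained both with
% and without the condition y; at sampling time the two noise predictions are
% extrapolated with a guidance weight w (w = 0 gives plain conditional sampling).
\[
\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, y)
= (1 + w)\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, y)
- w\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \varnothing)
\]
```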
Suggested papers for presentation:
- Diffusion Models Beat GANs on Image Synthesis. The paper that introduced guided diffusion: generating images from a given class, a technique known as classifier guidance. (Mohammad Abdul Basit)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. The first diffusion paper to generate images from a text prompt, a form of classifier-free guidance. (Navtegh Singh Gill)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Generating images based on a text prompt. (Xuanchi Ren)
- Image Super-Resolution via Iterative Refinement. Turning low-resolution images into high-resolution images. (David Tomarov)
- Cascaded Diffusion Models for High Fidelity Image Generation. Turning low-resolution images into very high-resolution images. (Jianzhong You)
- Diffusion Probabilistic Modeling for Video Generation. Generating video with diffusion models. (Anny Dai)
- Controllable and Compositional Generation with Latent-Space Energy-Based Models (Daniel Eftekhari)
- Classifier-Free Diffusion Guidance. The seminal paper on classifier-free guidance.
- October 25: Latent Diffusion and Accelerated Sampling
Diffusion models are compute-intensive and can take a long time to generate large, detailed images. Latent diffusion models speed things up by working in a lower-dimensional latent space instead of high-dimensional pixel space. This is the basis of some commercially developed systems, such as Emu, SDXL and Dall-E. Another approach is to reduce the number of steps needed in the diffusion process. This is called accelerated sampling.
Essential reading: Papers 1 and 4 below.
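To make the division of labour concrete, here is a minimal sketch of sampling from a latent diffusion model, assuming a pretrained decoder and a latent-space noise-prediction network (eps_model, decoder, and the DDPM-style schedule below are illustrative placeholders, not code from any of these papers):

```python
import torch

@torch.no_grad()
def sample_latent_diffusion(eps_model, decoder, betas, latent_shape):
    """DDPM-style ancestral sampling in latent space, then one decode to pixels."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(latent_shape)  # start from pure Gaussian noise in latent space
    for t in reversed(range(len(betas))):
        eps = eps_model(z, t)  # predicted noise at step t (placeholder network)
        # Standard DDPM reverse-step mean, computed entirely in the small latent space.
        z = (z - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:  # add noise on every step except the last
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return decoder(z)  # a single decoder pass maps the clean latent back to pixels
```

The expensive denoising loop runs in the low-dimensional latent space; accelerated samplers such as DDIM (Paper 4) additionally replace the many stochastic steps with a few deterministic ones.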
Suggested papers for presentation:
- High-Resolution Image Synthesis with Latent Diffusion Models. The seminal paper on latent diffusion models. (Yasamin Zarghami)
- Score-based Generative Modeling in Latent Space. Score-based generative modeling can also be done in latent space.
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack. Increasing image quality with a small amount of supervised fine-tuning. (Stuti Wadhwa)
- Denoising Diffusion Implicit Models. The seminal paper on accelerated sampling. (Soroush Mehraban and Amirhossein Kazerouni)
- Progressive Distillation for Fast Sampling of Diffusion Models. Converting a slow model into a fast model using distillation.
- On Distillation of Guided Diffusion Models
- Consistency Models. Generating samples in a single step, instead of many steps. (Baorun Mu)
- Geometric Latent Diffusion Models for 3D Molecule Generation. Generating 3-dimensional models instead of 2-dimensional images.
- Hierarchical Text-Conditional Image Generation with CLIP Latents. A description of Dall-E 2. (Peiwen Guo)
- November 1: No class (Reading Week)
- November 8: Network Architectures for Diffusion Models
- U-Net: Convolutional Networks for Biomedical Image Segmentation. The original paper on U-Nets. (Naomi Kothiyal)
- All are Worth Words: A ViT Backbone for Diffusion Models (Yanting Chen)
- Scalable Diffusion Models with Transformers (Tom Blanchard)
- One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale (Younwoo Choi)
- Simple diffusion: End-to-end diffusion for high resolution images (Daihao Wu)
- November 15: Neural Differential Equations
Residual networks tend to make small non-linear changes at each layer, and the cumulative effect of many such layers is a complex non-linear model. Such networks are naturally viewed as discretized versions of continuous differential equations. This groundbreaking view of neural nets simplifies many problems. It also allows us to exploit the rich body of work on differential equations and to use well-developed numerical solvers to improve the efficiency of forward propagation. The field of differential equations itself has also been rejuvenated by this development, which introduces a whole new class of non-linearities to the field: non-linearities defined by neural nets.
The first paper below sets the stage by describing reversible residual nets, a precursor to neural ordinary differential equations (ODEs), the simplest kind of differential equation. A gentle introduction to neural ODEs can be found here. Normally, the output of a neural network is computed from the input, but in a reversible net, the input can also be computed from the output. In addition, when random noise is injected into a residual net at each layer, the net becomes a discretized version of a stochastic differential equation (SDE). A good introduction to SDEs can be found in Bishop, 2024, Chapter 20, Section 3.4.
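The two constructions behind this week's readings fit in a few lines (standard formulations, written in my own notation):

```latex
% A residual layer makes a small additive update; shrinking the step size and
% treating the layer index t as continuous time yields a neural ODE.
\[
\mathbf{x}_{t+1} = \mathbf{x}_t + f(\mathbf{x}_t, \theta_t)
\quad \longrightarrow \quad
\frac{d\mathbf{x}}{dt} = f(\mathbf{x}(t), t, \theta)
\]

% Reversible residual (RevNet) coupling: activations are split into two halves,
% and the forward map can be inverted exactly, so intermediate activations need
% not be stored for backpropagation.
\[
\mathbf{y}_1 = \mathbf{x}_1 + F(\mathbf{x}_2), \qquad
\mathbf{y}_2 = \mathbf{x}_2 + G(\mathbf{y}_1)
\]
\[
\mathbf{x}_2 = \mathbf{y}_2 - G(\mathbf{y}_1), \qquad
\mathbf{x}_1 = \mathbf{y}_1 - F(\mathbf{x}_2)
\]
```

Injecting noise at each residual step turns the same picture into a discretized SDE, which is the view taken by the score-based SDE paper below.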
- The Reversible Residual Network: Backpropagation Without Storing Activations (Alexei Ivanov)
- Neural Ordinary Differential Equations. The seminal paper on treating residual networks as ODEs.
- Score-Based Generative Modeling through Stochastic Differential Equations
- Classifier-Free Diffusion Guidance. The seminal paper on classifier-free guidance, which treats a diffusion model as continuous instead of discrete. (Sumin Lee)
- Elucidating the Design Space of Diffusion-Based Generative Models
- SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Project Presentations
- November 22:
- November 29: