CSC 2547, Winter 2022:

Machine Learning for Machine Vision as Inverse Graphics

Department of Computer Science

University of Toronto

Announcements

Please hand in all your course work on Markus. This includes the slides from your paper presentation, your project proposal, and your final course project.

I have created a Piazza page for this class. Please sign yourself up.

If you are not presenting a paper on February 2 or 9, you can now vote for the week in which you would like to present, using a Google Docs survey form.

You should vote for your first, second and third choice of week.

You should vote by 8pm on FEBRUARY 2.

You will not be voting for a particular paper, just a particular week, as the papers assigned to each week are not yet finalized. I have added some papers to the web site to give you a flavour of the papers involved. I will be adding more papers in the next few days.

If you plan a team presentation, the team should submit a single voting form (not one for each team member).

I will assign students (and teams) to weeks after 8pm on February 2. I will try to give everyone their first choice.

The particular paper(s) that you present will be decided two weeks before your presentation date.

Transition to in-person classes:

I am currently thinking that we will continue with on-line classes until reading week, and then convert to in-person classes after reading week (which is the earliest date at which we are allowed to have purely in-person classes). This means that the first 3 weeks of paper presentations will be on-line, and the rest will be in-person. I will keep you posted on this.

We need 5 or 6 students to volunteer to present papers on February 9. Papers are suggested below. These papers are probably less abstract and less mathematical than those in later weeks. If you are interested, please let me know ASAP. Advantages of volunteering now:
- More support.
- No overlap with course project or project proposal deadlines.

Papers for the week of February 9 are now listed below. Feel free to suggest additional papers.

Topics have been assigned to each of the remaining weeks in the semester, below. You will soon vote on which week/topic you wish to present a paper to the class.

Lecture recordings are now available on Quercus.

We need 5 or 6 students to volunteer to present papers on February 2, the first week of presentations. If you are interested, please let me know ASAP. Advantages of being first:
- More support.
- No overlap with course project or project proposal deadlines.

Papers for the week of February 2 are now listed below. Feel free to suggest additional papers.

Please see the Quercus page for this course.

The first class will be on Weds Jan 12 via Zoom.

Overview

Convolutional neural networks have achieved astounding breakthroughs on a number of machine vision tasks, especially object classification. However, unlike people, they can require vast amounts of data to train, and their (sometimes comical) mistakes show that they do not truly understand what they see. This limits their abilities and leaves them short of the full promise of Artificial Intelligence.

To fully understand a scene, a computer must have a rich, 3-dimensional representation of the world. It must be able to infer what objects are in a scene, their position, orientation, size, shape, color, texture, category, what parts they are composed of, their relationship to other objects in the scene, as well as the illumination and position and viewing angle of the camera. In other words, a scene understanding program must be able to represent the world in much the same way as a computer graphics program does. The main difference is that computer graphics generates a 2-dimensional image from a 3-dimensional representation, while scene understanding aims to do the reverse: to infer a 3-dimensional representation of a scene from a 2-dimensional image. Note that once a 3-dimensional representation has been inferred, it should be possible to answer many common-sense questions about an image. It should also be possible to use a graphics program to regenerate the image from the 3-dimensional representation, and moreover, to generate modified versions of the image, in which objects have been moved or rotated and illumination or camera positions have changed.

This view of scene understanding is known as inverse graphics. Inverting the graphics process to generate a 3-dimensional representation of an image is a difficult, non-deterministic problem. This course approaches the problem with machine learning. That is, we investigate techniques for learning programs that do inverse graphics, as well as related techniques for overcoming the limitations of convolutional neural networks for vision.
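As a toy illustration of the distinction above (not a method from the course), the sketch below treats a hypothetical one-dimensional `render` function as the graphics program and inverts it by brute-force search over scene parameters. The function names, the 16-pixel image and the 3-pixel object are all assumptions made for illustration. The course is concerned with learning such inverses rather than searching for them.

```python
def render(position, width=16):
    """Toy 'graphics': draw a 3-pixel-wide bright object starting at `position`."""
    return [1.0 if position <= x < position + 3 else 0.0 for x in range(width)]


def infer_position(image):
    """Toy 'inverse graphics': return the scene whose rendering best matches the image."""
    width = len(image)

    def reconstruction_error(position):
        rendered = render(position, width)
        return sum((a - b) ** 2 for a, b in zip(rendered, image))

    # Candidate scenes: every position at which the whole object fits in the image.
    return min(range(width - 2), key=reconstruction_error)


image = render(5)                    # graphics: scene -> image
position = infer_position(image)     # inverse graphics: image -> scene
moved_image = render(position + 4)   # regenerate the image with the object moved
```

Once the scene parameter has been recovered, regenerating a modified image is just another call to the renderer, which is the point made in the overview.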

This is an advanced graduate course in machine learning. It is primarily a seminar course in which students will read and present papers from the literature. There will also be a major course project. The goal is to bring students to the state of the art in this exciting field. Tentative topics include discriminative and generative approaches, variational inference and autoencoders, capsule networks, group symmetries and equivariance, visual attention and transformers, point nets, inferring 3D structure and part-whole relationships, self-supervised and contrastive learning, and adversarial learning.

Prerequisites:

A solid introduction to Machine Learning (such as csc411 or a graduate course in ML), especially neural nets, a solid knowledge of linear algebra, the basics of multivariate calculus and probability, and programming skills, especially programming with vectors and matrices. Mathematical maturity will be assumed. This is primarily a machine-learning course, and a background in computer vision or computer graphics is not required.

Classes:

Wednesday, 4-6pm. The first class will be on Weds Jan 12.

Classes start at 4:10pm.

BA 2175 when in-person classes are allowed. Otherwise, online via Zoom. (See Quercus for the Zoom link.)


Instructor:

Anthony Bonner

Email: bonner [at] cs [dot] toronto [dot] edu

Office: BA 5230

Phone: 416-997-3463

Office Hours: by appointment

Teaching Assistant:

Saba Ale Ebrahim

Course Structure

The course is an updated version of csc2547 that I gave in spring 2020, and is organized along the lines of csc2547: Learning to Search given by David Duvenaud, though the course content is quite different.

First few classes: lectures on background material.

Next several classes: student presentations of papers from the literature.

Last two classes: project presentations.

Paper presentations:

The goal is for each paper presentation to be a high-quality, accessible tutorial.

Each week will focus on one or two topics.

You will vote for your choice of topic/week (soon).

I will assign you to a week (soon).

Papers on each topic will be listed below.

If you have a particular paper you would like to add to the list, let me know.

With 7 weeks and 40 students, there will be about 6 students per week and about 18 minutes per student (including questions).
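The arithmetic behind these numbers can be checked with a short sketch. It assumes that each class runs from 4:10pm to 6:00pm (110 minutes) with no break, which is an inference from the class times listed above rather than something stated on this page. The variable names are illustrative.

```python
import math

weeks = 7
students = 40
class_minutes = 110  # assumption: 4:10pm to 6:00pm, with no break

# 40 / 7 is about 5.7, so some weeks need 6 presenters.
students_per_week = math.ceil(students / weeks)

# Whole minutes available to each presenter, including questions.
minutes_per_student = class_minutes // students_per_week
```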

Two-week planning cycle:

Two weeks before your presentation, meet me after class to discuss and assign papers.

The following week, meet with me and/or the TA for a practice presentation (required).

Present in class under strict time constraints (just like a conference).

Papers may be presented in teams of two or three with longer presentations (18 minutes per team member).

Unless a paper is particularly difficult or long, a team will be expected to cover a group of related papers (one paper per team member).

A team may cover one paper listed below and one or more of its references.

Each team member should talk for a total of about 18 minutes (though possibly in multiple short segments).

These presentations should be related; e.g., one presentation may give background material for the others, or one may expand upon the others.

Each member of a team will receive the same grade (based on overall team performance).

Feel free to suggest other possibilities.

When presenting a paper,

Each slide should describe at most one idea.

No more than one idea per minute (so no more than one slide per minute, unless a single idea is spread over several slides).

You should be able to explain anything that you put on a slide. (e.g., if you mention MCMC on a slide, then you should be able to say something informative about it, even if you don't understand it completely.)

Don't just explain how a system works. Also explain why it works. (e.g., why are the latent variables interpretable?) This will require you to understand the system.

If it isn't clear why a system works, you should think about it and speculate on possible reasons.

Be able to answer questions about what you have presented.

Minimize the use of formulas.

Use pictures wherever possible.

Describe how the system is trained: End-to-end? Supervised? Unsupervised? Pre-trained with synthetic data?

Describe the loss function for training.

Projects:

You may propose any project you like, as long as it is about machine learning and vision and it has a major technical component.

Here are some project ideas and considerations.

Projects may be done individually or in teams of up to three. More will be expected of a team project.

The grade will depend on the ideas, how well you present them in the report, how clearly you position your work relative to existing literature, how illuminating your experiments are, and how well-supported your conclusions are. Full marks will require a novel contribution.

Each team will write a short (2-4 pages) research project proposal, which ideally will be structured similarly to a standard paper. It should include a description of a minimum viable project, some nice-to-haves if time allows, and a short review of related work. You don’t have to do what your project proposal says - the point of the proposal is mainly to have a plan and to make it easy for me to give you feedback. (Team proposals should be close to 4 pages and should describe the division of labour among team members.)

Towards the end of the course everyone will present their project in a short, roughly 8-minute presentation.

At the end of the class you’ll hand in a project report (around 4 to 8 pages), ideally in the format of a machine learning conference paper such as NeurIPS. (Team reports should be close to 8 pages.)

Each member of a team will receive the same grade (based on overall project quality).

Marking Scheme:

[20%] Paper presentation. Rubric

[20%] Project proposal, due February 23. Rubric

[20%] Project presentations, March 30 and April 6. Rubric

[40%] Project report and code, due April 18 (April 13 if you are graduating in June). Rubric

All course work should be submitted through Markus.

Tentative Schedule

Lectures:

January 12: lecture

Intro and overview

Slides on vision as inverse graphics, by Vinjai Vale.

Variational autoencoders

Quick intro

Mathematical details

References:

Short video on how to trick a neural net. Read about it here.

Tutorial on variational autoencoders by Carl Doersch.

Video on variational autoencoders

The original paper on variational autoencoders by Kingma and Welling, with the reparameterization trick.
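The reparameterization trick mentioned in the last reference can be sketched in a few lines. The idea is to write a Gaussian sample as z = mu + sigma * eps, where eps is standard normal noise, so that z becomes a differentiable function of mu and sigma. The test function f(z) = z^2, the parameter values and the function names below are assumptions chosen so that the answer can be checked by hand: E[z^2] = mu^2 + sigma^2, so the true gradients are 2*mu and 2*sigma.

```python
import random


def sample_reparameterized(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise eps."""
    eps = rng.gauss(0.0, 1.0)  # the noise does not depend on mu or sigma
    return mu + sigma * eps, eps


rng = random.Random(0)
mu, sigma = 1.5, 0.5
num_samples = 100_000

grad_mu = 0.0
grad_sigma = 0.0
for _ in range(num_samples):
    z, eps = sample_reparameterized(mu, sigma, rng)
    # Chain rule for f(z) = z^2: df/dz = 2z, dz/dmu = 1, dz/dsigma = eps.
    grad_mu += 2.0 * z
    grad_sigma += 2.0 * z * eps
grad_mu /= num_samples
grad_sigma /= num_samples
```

The Monte Carlo averages should land close to 2*mu = 3.0 and 2*sigma = 1.0. In a variational autoencoder the same construction lets gradients flow from the decoder loss back into the encoder outputs.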

Background material:

Introduction to autoencoders and variational autoencoders from csc413.

January 19: lecture

Overview of presentations, topics and discriminative models

Lecture on variational inference and autoencoders

January 26: lecture

Overview of projects and generative models

Lecture on variational inference and autoencoders (continued) with the REINFORCE algorithm

Intro to control variates
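As a small worked example of REINFORCE with a control variate, the sketch below computes the exact mean and variance of the estimator for a single Bernoulli variable by enumerating both outcomes. The objective f(b) = (b - 0.45)^2, the value p = 0.3 and the use of a constant baseline (the simplest kind of control variate) are assumptions made for illustration, not the lecture's own example.

```python
def f(b):
    """Illustrative objective evaluated on a binary sample b."""
    return (b - 0.45) ** 2


def score(b, p):
    """Derivative of log Bernoulli(b; p) with respect to p."""
    return 1.0 / p if b == 1 else -1.0 / (1.0 - p)


def estimator_stats(p, baseline):
    """Exact mean and variance of (f(b) - baseline) * score(b, p) for b ~ Bernoulli(p)."""
    probs = {1: p, 0: 1.0 - p}
    values = {b: (f(b) - baseline) * score(b, p) for b in (0, 1)}
    mean = sum(probs[b] * values[b] for b in (0, 1))
    variance = sum(probs[b] * (values[b] - mean) ** 2 for b in (0, 1))
    return mean, variance


p = 0.3
true_grad = f(1) - f(0)  # d/dp of E[f(b)] = p*f(1) + (1-p)*f(0)

mean_plain, var_plain = estimator_stats(p, baseline=0.0)
expected_f = p * f(1) + (1.0 - p) * f(0)
mean_cv, var_cv = estimator_stats(p, baseline=expected_f)
```

Both estimators have the same mean, equal to the true gradient, because the score function has zero expectation. Subtracting the baseline changes only the variance, which drops substantially in this example.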

Readings:

Backpropagation through discrete random variables based on REINFORCE

The Concrete Distribution and Categorical Reparameterization. These two papers (published simultaneously) both introduce the Gumbel-softmax trick for estimating low-variance (but biased) gradients for backpropagating through discrete random variables using reparameterization.

REBAR. Combines REINFORCE and Gumbel-softmax to estimate low-variance and unbiased gradients for backpropagating through discrete random variables.

Tutorial on variational autoencoders by Jaan Altosaar
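The Gumbel-softmax trick from the readings above can be sketched as follows: add independent Gumbel noise to the logits, divide by a temperature, and apply a softmax. This is a minimal standard-library sketch. The function name, the temperature of 0.5 and the example probabilities are assumptions, and a real implementation would use an automatic-differentiation library so that gradients can flow through the sample.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Return a relaxed (continuous) one-hot sample from the categorical given by `logits`."""
    perturbed = []
    for logit in logits:
        # Clamp the uniform draw away from 0 so that both logarithms are defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed logits.
    top = max(perturbed)
    exps = [math.exp(v - top) for v in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
probs = [0.1, 0.3, 0.6]
logits = [math.log(p) for p in probs]
sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)

# The argmax of a sample follows the original categorical distribution.
num_samples = 20_000
counts = [0, 0, 0]
for _ in range(num_samples):
    s = gumbel_softmax_sample(logits, 0.5, rng)
    counts[s.index(max(s))] += 1
frequencies = [c / num_samples for c in counts]
```

Lower temperatures push each sample closer to a one-hot vector, at the cost of higher-variance gradients. This is the bias-variance trade-off that the two papers discuss.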

Student Presentations:

February 2: Discriminative Approaches (Supervised)

Since the deep-learning revolution of 2012, there has been a surge of work on using convolutional neural nets in a feed-forward, discriminative fashion to address a large number of problems in machine vision.

Suggested papers for presentation:

Human pose estimation:

cascade of CNNs

Markov random field

spatial model (Balagopal Unnikrishnan)

Object detection and localization:

based on region proposals.

multi-scale sliding window

faster region proposal network (Xiaohui Zeng)

spatial pyramid pooling

based on regression/classification.

single shot multibox detector

you only look once

objects as points (Lunjun Zhang)

Image transformation:

texture synthesis

semantic segmentation (Ziyi Wu)

depth prediction (Zian Wang)

scene labeling

artistic style

colorization

feature interpolation (Huan Ling)

If you would like to present a different paper on this topic, please let me know.

February 9: Generative Models (Unsupervised)

Since the development of variational autoencoders (VAEs) in 2014, there has been extensive research on using them to learn representations of images. The accuracy and completeness of a representation can be tested by generating the image from the representation and comparing this to the original image. In this way, representations can be learned in an unsupervised way, without the need for labelled data.

Suggested papers for presentation:

Learning disentangled representations:

Beta-VAE (constrained variational autoencoders)

weakly-supervised disentangling

semi-supervised disentangling

with Boltzmann machines

training CNNs to disentangle

Learning 3D structure of single objects:

3D structure from images

via 2.5D sketches (a representation intermediate between 2D and 3D)

deep voxels (3D pixels as a representation) (Kin Chau)

multi-view stereo images

Scene understanding:

neural scene derendering

scene representation networks

occlusion

Towards inverse graphics:

scenes with multiple objects

overcoming occlusion, chapter 3

describing scenes with programs (You would not be expected to cover the entire paper)

Conditional image generation (background: Conditional VAE)

arbitrary conditions

attributes to images

Other:

learning a compositional representation

visually grounded imagination

visual analogies

visual question answering

If you would like to present a different paper on this topic, please let me know.

February 16: Visual Attention and Transformers

Background:

The original paper on Transformers, in Natural Language Understanding.

Suggested papers for presentation:

Non-local Neural Nets. Long-range dependencies in video

Object-centric learning with slot attention

Augmenting CNNs with attention (Tianshu Zhu)

channel attention in convolutional nets

Split-attention networks. Combining channel attention with multi-path inference

Self-attention and convolution (Haozhen Shen)

Visual transformers:

Vision Transformers. Adapting transformers for use in Machine Vision. (Nikhil Verma)

Data-efficient image transformers. Using distillation and attention to reduce data requirements (Runqing Zhang)

Emergent properties in self-supervised vision transformers (Seung Wook Kim)

Object detection with transformers (Mengyang Liu)

Visual transformers for token-based image representation and processing.

March 2: Self-supervised and Contrastive Learning

Suggested papers for presentation:

Unsupervised representation learning by solving pretext tasks:

Predicting image context (Weiming Ren)

Solving jigsaw puzzles

Predicting image rotation (Cong Wei)

Noise-Contrastive Estimation: Unsupervised representation learning by discriminating between images and noise

Basic statistical theory

Applied to word embeddings

Applied to images

Applied to pretext problems

Contrastive Predictive Coding. Inferring long-distance interactions in images, text and audio.

Contrastive learning:

An early paper on contrastive learning establishing many of the basic ideas

Using unsupervised discrimination to treat each image as a separate class

Barlow Twins. Using cross correlation to prevent encoder collapse (Shea Cardozo)

A simple framework for contrastive learning in vision (Vasudev Sharma)

Contrasting cluster assignments (Deepkamal Gill)

Bootstrap your own latent

March 9: 3D Structure, 3D Point Clouds and Implicit Functions

Learning shape abstractions by assembling volumetric primitives

A learnable convex decomposition of shape (Haohua Tang)

Non-rigid structure from motion

3D point clouds as input:

Category-specific, symmetric keypoints from point clouds

Learning canonical point-cloud representations

3D shape completion in the wild (Yujie Wu)

Representing 3D structure with implicit functions:

Representing shape with signed distance functions

Representing shape with occupancy fields (Yun-Chun Chen)

Representing structured shape with local implicit functions

Representing texture with implicit fields

Neural Radiance Fields. A 5D representation: 3D structure + 2D viewing angle. (Fengjia Zhang)

March 16: Capsules and Point/Graph Nets

Point/Graph Nets:

In a traditional neural net, the input is a vector of fixed size. In contrast, in point nets the input is a set of points, and in graph nets it is a graph of nodes and edges. In both cases, the set or the graph can be of variable size. This is particularly useful for dealing with sparse visual data, such as point clouds, which arise in many problems in computer vision. Note that point nets can be viewed as a special kind of graph net in which the graph has nodes but no edges. Point and graph nets display permutation equivariance, that is, the points, nodes and edges are sets, so their order is unimportant. Finally, note that although the input is of variable size, the net itself has only a fixed number of (learnable) weights.
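The properties described above can be seen in a minimal sketch, assuming a single shared per-point layer followed by max-pooling over the set. The weights, point coordinates and function names are made up for illustration and are not taken from any of the papers below. The per-point features are permutation equivariant (reordering the points reorders the features in the same way), and the pooled feature is permutation invariant.

```python
def point_feature(point, weights):
    """Shared per-point layer: a linear map followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, point))) for row in weights]


def point_net_global_feature(points, weights):
    """Apply the same layer to every point, then max-pool over the whole set."""
    features = [point_feature(p, weights) for p in points]
    return [max(column) for column in zip(*features)]


# A fixed number of weights (4 x 3), regardless of how many points are given.
weights = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
    [-1.0, 2.0, 0.0],
    [0.0, -1.0, 1.0],
]
points = [(0.0, 1.0, 2.0), (1.0, 0.0, -1.0), (2.0, 2.0, 2.0), (-1.0, 0.5, 0.0)]

feature = point_net_global_feature(points, weights)
feature_reordered = point_net_global_feature(list(reversed(points)), weights)
feature_smaller_set = point_net_global_feature(points[:2], weights)
```

Reordering the points leaves `feature` unchanged, and the same weights handle a set of two points as easily as a set of four.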

Suggested papers for presentation:

Point Nets (Zixuan Pan)

Dynamic graph CNN for learning on point clouds

Point Nets ++ (Yihan Duan)

Attentive context normalization for robust permutation-equivariant learning

Capsules:

Although convolutional neural networks have achieved amazing breakthroughs in computer vision, they require vast amounts of data for learning, make silly mistakes and do not understand what they see. Capsule networks are a recently developed alternative that addresses these problems. The notions of object and geometry are built into capsule networks and do not have to be learned, so less data is required and silly, non-geometric images are not misinterpreted.

Each capsule in a network represents an object, and unlike a neuron, which has a single output, a capsule has many outputs, representing the many properties of an object, such as its position, orientation, and texture. Moreover, unlike convolutional networks, which throw away positional information in the pooling layers, capsule networks keep track of the spatial relationships between objects. During training, a capsule network learns a model of common types of objects, including their parts and the spatial relationship of the parts to the whole. In effect, a capsule network learns a spatial grammar during training, and builds a spatial parse tree of an image during inference.

Foundations:

Transforming autoencoders (the original paper, with an emphasis on the first layer)

Dynamic routing (higher layers: building objects out of parts)

EM routing (building objects out of parts)

Suggested papers for presentation:

Stacked capsule autoencoders

Part representation by flow capsules

Deep equivariant capsule networks

Canonical capsules for 3D point clouds

Geometric capsule autoencoders for 3D point clouds

March 23: Adversarial Learning and Equivariance

Adversarial Learning:

Since the development of Generative Adversarial Networks (GANs) in 2014, there has been an explosion of work in adversarial methods and amazing breakthroughs in the generation of photo-realistic images. Here, we look at some of the applications of adversarial methods to inverse graphics. As with VAEs, the idea is to learn an image generator and then invert the image generation process. However, the way in which a GAN learns to generate images is completely different from that of a VAE and has added a whole new dimension to machine learning.

Background:

the original GAN paper

Wasserstein GAN. An improvement that reduces many of the problems with GAN training.

Improved Training of Wasserstein GANs

Conditional GANs. Placing conditions on what images are generated.

progressive growing of GANs Demo

Suggested papers for presentation:

GAN dissection. Visualizing and understanding GANs. (Yunqing Zeng)

Teaching a GAN what not to learn

Instance-conditioned GAN. Generating new images based on a given image.

Self-attention GANs

Training GANs with limited data

StyleGAN. Generating images with different artistic styles. (This is a relatively easy paper, but you will also have to understand parts of this background paper on artistic style transfer, especially Section 3.)

Stabilizing GAN training with noise

Equivariance:

Convolutional neural networks have translational invariance and equivariance built in. That is, because the weights used at different locations are the same, they can recognize objects no matter where they are located in an image. There is now a substantial body of research into extending invariance and equivariance to geometric transformations other than translation, so that CNNs can recognize objects no matter how large or small they are, and no matter how much they are rotated, skewed or transformed in other ways. Depending on your background, the papers below may introduce you to new mathematical concepts (such as Fourier transforms, eigenfunctions or group representations, depending on the paper). These concepts are not difficult and have wide application, and Google and Wikipedia will answer all your questions.
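Translation equivariance and invariance can be checked directly on a one-dimensional signal. The sketch below uses a circular (wrap-around) boundary so that the equalities hold exactly. That boundary choice, the signal, the kernel and the function names are assumptions made for illustration.

```python
def circular_correlate(signal, kernel):
    """Slide the same kernel over every position of the signal, wrapping at the ends."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, k):
    """Translate the signal k positions to the right, wrapping at the ends."""
    k %= len(signal)
    return signal[-k:] + signal[:-k]


signal = [0, 1, 3, 2, 0, 0, 5, 1]
kernel = [1, -2, 1]

# Equivariance: filtering a shifted input gives the shifted output.
filtered_then_shifted = shift(circular_correlate(signal, kernel), 3)
shifted_then_filtered = circular_correlate(shift(signal, 3), kernel)

# Invariance: a global max-pool over positions ignores the shift entirely.
pooled_original = max(circular_correlate(signal, kernel))
pooled_shifted = max(shifted_then_filtered)
```

The papers below ask for the same commuting property with rotations, scalings or other transformations in place of `shift`.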

Suggested papers for presentation:

Rotation:

harmonic networks (2D rotation)

spherical CNNs (3D rotation for data on the surface of a sphere, as with omni-directional vision)

filter decomposition (rotational and radial basis filters)

Scale:

local scale invariance (a basic approach to scale invariance)

deep scale spaces (a sophisticated approach based on group theory)

Other transformations:

affine and non-linear transformations

global scale and rotation equivariance (section 3 can be skipped)

Vector neurons for 3D rotations on 3D data.

Learning to orient 3D surfaces

Project Presentations:

March 30: project presentations

April 6: project presentations