David Ross - Ph.D. Thesis

Abstract

A fundamental goal of computer vision is the ability to analyze motion. This can range from the simple task of locating or tracking a single rigid object as it moves across an image plane, to recovering the full pose parameters of a collection of nonrigid objects interacting in a scene. As with most of the challenges that make up ``artificial intelligence'', the current state of computer vision research is that human abilities can be matched only in very narrow domains, and only by carefully engineered, task-specific systems.

The key to broadening the applicability of these successful systems is to imbue them with the flexibility to handle new inputs and to adapt automatically, without the manual intervention of human engineers. In this research we address this challenge by proposing solutions to motion analysis tasks that are based on machine learning.

We begin by addressing the challenge of tracking a rigid object in video, presenting two complementary approaches. First, we explore the problem of learning one particular appearance model, principal components analysis (PCA), from a very limited set of training data. However, PCA is far from the only appearance model available. This raises the question: given a new tracking task, how should one select the most appropriate models of appearance and dynamics? Our second approach proposes a data-driven solution to this problem, allowing the choice of models, along with their parameters, to be learned from a labelled video sequence.

Next, we consider motion analysis at a higher level of organization. Given a set of trajectories obtained by tracking various feature points, how can we discover the underlying non-rigid structure of the object or objects? We propose a solution that models the observed sequence in terms of probabilistic ``stick figures'', under the assumption that the relative joint angles between sticks can change over time, but the lengths of the sticks and their connectivity are fixed. We demonstrate the ability to recover the invariant structure and the pose of articulated objects on a number of challenging datasets.