Summer 2004 Talk Descriptions
Reconstruction of 3D models
from intensity images and partial depth
Abstract: This talk addresses the probabilistic inference of geometric structures from images. Specifically, we synthesize range data to enhance the reconstruction of a 3D model of an indoor environment from video images and (very) partial depth information. In our method, we interpolate the available range data using statistical inferences learned from the concurrently available video images and from those (sparse) regions where both range and intensity information are available. The spatial relationships between the variations in intensity and range can be efficiently captured by the neighborhood system of a Markov Random Field (MRF). In contrast to classical approaches to depth recovery (e.g., stereo, shape from shading), we need only weak prior assumptions about specific surface geometries or surface reflectance functions, since we compute the relationship between the existing range data and the images we start with. Experimental results show the feasibility of our method.
Motion Segmentation Incorporating
Active Contours for Spatial Coherence
Abstract: We present a novel method for performing spatially coherent motion estimation by integrating region and boundary information. The method begins with a layered parametric flow model. Since the resulting flow estimates are typically sparse, we use the computed motion in a novel way to compare intensity values between images, thereby providing improved spatial coherence of a moving region. This dense set of intensity constraints is then used to initialize an active contour, which is influenced by both motion and intensity data to track the object's boundary. The active contour, in turn, provides additional spatial coherence by identifying motion constraints within the object boundary and using them exclusively in subsequent motion estimation for that object. The active contour is therefore automatically initialized once and, in subsequent frames, is warped forward based on the motion model. The spatial coherence constraints provided by both the motion and the boundary information act together to overcome their individual limitations. Furthermore, the approach is general and makes no assumptions about a static background and/or a static camera. We apply the method to image sequences in which both the object and the background are moving.
Continuous Image Representation
for Retrieval and Clustering
Abstract: In this talk I will present an image retrieval system based on a mixture-of-Gaussians (MoG) image representation. I will present a method, based on the unscented transform, for approximating the Kullback-Leibler (KL) divergence between two mixtures of Gaussians. The proposed method is used to obtain an efficient image similarity measure. I will also discuss a method for unsupervised image clustering that is based on an information-theoretic principle, the information bottleneck (IB) principle. Images are clustered such that the mutual information between the clusters and the image content is maximally preserved.
Joint work with Hayit Greenspan and Shiri Gordon, Tel Aviv University.
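The unscented approximation mentioned in the abstract above can be illustrated with a minimal sketch. It is not the speakers' implementation: it assumes diagonal covariances (so the sigma points lie along the coordinate axes), uses the common 2d-sigma-point form of the unscented transform, and all function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_mog(x, mog):
    """Log density of a mixture given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in mog]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def sigma_points(mean, var):
    """2d sigma points: mean +/- sqrt(d * var_k) along each axis k."""
    d = len(mean)
    points = []
    for k in range(d):
        offset = math.sqrt(d * var[k])
        for sign in (1.0, -1.0):
            p = list(mean)
            p[k] += sign * offset
            points.append(p)
    return points

def unscented_kl(f, g):
    """Approximate KL(f || g) by averaging log(f/g) over the sigma
    points of each component of f, weighted by the component weight."""
    total = 0.0
    for w, mean, var in f:
        pts = sigma_points(mean, var)
        avg = sum(log_mog(p, f) - log_mog(p, g) for p in pts) / len(pts)
        total += w * avg
    return total

# Single-component check: KL(N(0,1) || N(1,1)) is exactly 0.5.
f = [(1.0, [0.0], [1.0])]
g = [(1.0, [1.0], [1.0])]
kl_fg = unscented_kl(f, g)
```

For two single Gaussians with equal variance the log-ratio is linear in x, so the sigma-point average reproduces the closed-form value; for genuine mixtures the result is only an approximation.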
Stochastic Spatio-Temporal Grammars
for Images and Video
Abstract: Probabilistic Context-Free Grammars (PCFGs) induce distributions over strings. Strings can be viewed as observations that are maps from indices to terminals. The domains of such maps are totally ordered and the terminals are discrete. We extend PCFGs to induce densities over observations with unordered domains and continuous-valued terminals. We call our extension Spatial Random Tree Grammars (SRTGs). While SRTGs are context sensitive, the inside-outside algorithm can be extended to support exact likelihood calculation, MAP estimates, and ML estimation updates in polynomial time on SRTGs. We call this extension the center-surround algorithm. SRTGs extend mixture models by adding hierarchical structure that can vary across observations. The center-surround algorithm can recover the structure of observations, learn structure from observations, and classify observations based on their structure. We have used SRTGs and the center-surround algorithm to process both static images and dynamic video. In static images, SRTGs have been trained to distinguish houses from cars. In dynamic video, SRTGs have been trained to distinguish entering from exiting. We demonstrate how the structural priors provided by SRTGs support these tasks.
Joint work with Charles Bouman, Shawn Brownfield, Bingrui Foo, Mary Harper, Ilya Pollak, and James Sherman.
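For readers unfamiliar with the starting point of the abstract above, the following sketch shows the classical inside pass for an ordinary PCFG over strings, which sums the probabilities of all parses of a string. It is not the center-surround algorithm and does not handle SRTGs; the grammar encoding (rules in Chomsky normal form stored in dictionaries) and the toy grammar are assumptions made for illustration.

```python
from collections import defaultdict

def inside_probability(words, lexical, binary, start="S"):
    """Probability that `start` derives `words` under a PCFG in
    Chomsky normal form.

    lexical: {(A, terminal): prob} for rules A -> terminal
    binary:  {(A, B, C): prob}     for rules A -> B C
    """
    n = len(words)
    # chart[(i, j)][A] = probability that A derives words[i:j]
    chart = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for (a, term), p in lexical.items():
            if term == w:
                chart[(i, i + 1)][a] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for (a, b, c), p in binary.items():
                    chart[(i, j)][a] += p * chart[(i, k)][b] * chart[(k, j)][c]
    return chart[(0, n)][start]

# Hypothetical toy grammar: S -> S S (0.4), S -> 'a' (0.6)
toy_lexical = {("S", "a"): 0.6}
toy_binary = {("S", "S", "S"): 0.4}
p_aaa = inside_probability(["a", "a", "a"], toy_lexical, toy_binary)
```

The string "a a a" has two parses under the toy grammar, and the chart sums over both; the running time is cubic in the string length.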
Learning Models of Human Behavior
using a Value Directed Approach
Abstract: In this talk, I will discuss learning partially observable Markov decision process (POMDP) models of human interactive tasks, in which a decision-making agent must interact with a human to achieve its goals. Central to our work is the modeling of human behaviors within this context using computer vision. The functions we seek to learn are mappings from spatially and temporally extended observations (video sequences) to actions which are optimal with respect to some utility function (i.e., a policy). Given the large observation space of the video sequences, this task is one of learning spatio-temporal abstractions of the observations that can enable optimal action choices at the level of goals and utility. One of the most significant advantages of this type of learning is that it does not require labeled data from expert knowledge about which behaviors are significant in a particular interaction. Rather, the learning process discovers the significant behaviors, leading to a transportable system which can be applied to any task. A key idea is that only those behaviors which are "important" (in the sense that they enable or help the system to complete its goals) need to be distinguished perceptually. That is, we assume some observation subspace that spans the perceptual distinctions that need to be made in order to achieve goals. We find this subspace by learning dynamic Bayesian network models of human behaviors. The parameters of these models are learned from training data using an a posteriori constrained optimization technique based on the expectation-maximization algorithm.
I will briefly overview POMDPs and multi-agent POMDPs, and expose some of the difficulties inherent in learning to act in a multi-agent situation. I will then describe the computer vision observation functions which we have experimented with. I will show how to learn the parameters of these models, and show how they can be applied to a gestural robot control experiment and a collaborative card game. Finally, I will describe my current work towards building these models for an assisted living task, and will discuss some of the outstanding computer vision issues.
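As background for the POMDP overview mentioned above, here is a minimal sketch of the standard discrete belief update, b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s). It is textbook material rather than the speaker's model: the two states, the single action, and all probabilities below are hypothetical.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """Bayes filter step for a discrete POMDP.

    belief:            {state: prob}
    transition:        {(state, action): {next_state: prob}}
    observation_model: {(next_state, action): {observation: prob}}
    """
    states = list(belief)
    # Prediction: push the belief through the transition model.
    predicted = {
        s2: sum(transition[(s, action)][s2] * belief[s] for s in states)
        for s2 in states
    }
    # Correction: weight by the likelihood of the observation.
    unnormalized = {
        s2: observation_model[(s2, action)][observation] * predicted[s2]
        for s2 in states
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical example: is the human engaged with the agent or idle?
T = {
    ("engaged", "wait"): {"engaged": 1.0, "idle": 0.0},
    ("idle", "wait"): {"engaged": 0.0, "idle": 1.0},
}
O = {
    ("engaged", "wait"): {"wave": 0.8, "none": 0.2},
    ("idle", "wait"): {"wave": 0.2, "none": 0.8},
}
prior = {"engaged": 0.5, "idle": 0.5}
posterior = belief_update(prior, "wait", "wave", T, O)
```

With a uniform prior and a static transition model, seeing "wave" shifts the belief toward "engaged" in proportion to the observation likelihoods.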
Abstract: I will discuss the problem of temporally aligning two or more videos of a dynamic 3D scene, captured simultaneously from distinct viewpoints. Informally, the problem involves computing the relative frame rate between the videos and their temporal shift, if any. After motivating the problem and describing potential applications in video analysis, I will talk about a solution that will be presented at CVPR'04.
Unlike existing methods, which only work for two videos and rely on a computationally intensive search in the space of temporal alignments, we reduce the problem for N views to the robust estimation of a single line in R^N. This line captures all temporal relations between the videos and can be computed without any prior knowledge of these relations. Experimental results show that the method can accurately align videos even when they have large misalignments (e.g., hundreds of frames), when the problem is seemingly ambiguous (e.g., scenes with roughly periodic motion), and when accurate manual alignment is difficult (e.g., due to slow-moving objects).
This is joint work with R. Carceroni, F. Padua, and G. Santos at UFMG-Brazil.
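To make the idea of robustly estimating a line concrete, the sketch below handles the two-video case (N = 2), where the line t2 = rate * t1 + shift encodes the relative frame rate and the temporal shift. The abstract does not describe the estimator used in the talk, so this is a generic consensus (RANSAC-style) fit that tries every pair of candidates; the input format, the tolerance `tol`, and the function name are assumptions.

```python
from itertools import combinations

def fit_timeline(candidates, tol=0.5):
    """Fit t2 = rate * t1 + shift to candidate frame correspondences.

    candidates: list of (t1, t2) pairs, possibly with many outliers.
    Returns (rate, shift, inliers) for the line supported by the
    largest number of candidates, or None if no line can be formed.
    """
    best = None
    for (a1, a2), (b1, b2) in combinations(candidates, 2):
        if a1 == b1:
            continue  # vertical line: no valid frame-rate ratio
        rate = (b2 - a2) / (b1 - a1)
        shift = a2 - rate * a1
        inliers = [(t1, t2) for t1, t2 in candidates
                   if abs(rate * t1 + shift - t2) <= tol]
        if best is None or len(inliers) > len(best[2]):
            best = (rate, shift, inliers)
    return best

# Hypothetical data: true relation t2 = 2 * t1 + 100, plus three outliers.
matches = [(t, 2 * t + 100) for t in range(10)]
matches += [(3, 500), (7, 12), (1, 250)]
rate, shift, inliers = fit_timeline(matches)
```

Trying every pair costs time cubic in the number of candidates; random sampling of pairs, as in standard RANSAC, is the usual choice when the candidate set is large.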

Page last modified on Saturday, November 20, 2004