IEEE CVPR Workshop, June 22nd 2006, New York


Learning, Representation and Context for Human Sensing in Video






Cristian Sminchisescu


crismin [at] nagoyas [dot] uchicago [dot] edu (replace nagoyas with nagoya)


Fernando De la Torre

Carnegie Mellon University

fdtorre [at] cs [dot] cmus [dot] edu (replace cmus with cmu)





Over the past decade, research on human motion analysis from video (e.g. bodies, faces, gestures) has made progress that has significantly extended the range of scenes and motions we can interpret. However, many existing human tracking systems are complex to build and experiment with, computationally expensive and fragile. The human structural and appearance models they use are often built off-line and learned only to a limited extent. The algorithms cannot seamlessly handle high structural variability, multiple interacting people, severe occlusion or lighting changes, and the resulting full-body reconstructions are often qualitative rather than photorealistic. A convincing transition from the laboratory to the real world has not yet been achieved.


The goal of this workshop is to encourage a debate aimed at understanding the current successes and failures in the field: the types of motions, people and scenes that we can successfully analyze, and at what level of detail. Besides identifying major challenges, this appears to be a good time to think about the future research directions needed to automatically build the next generation of more flexible and reliable human models and algorithms. Given the complexity of the task, it is probable that learning will play a major role. Central themes are likely to be the choice of representation and its generalization properties, the role of bottom-up and top-down processing, and the importance of efficient search methods. Exploiting the problem structure and the scene context can also be critical for limiting inferential ambiguities.


In this spirit, participation and discussion are encouraged on the following topics:



•  The role of representation. Methods to automatically extract complex, possibly hierarchical models (of structure, shape, appearance and dynamics) with the optimal level of complexity for various tasks, from typical, supervised and unsupervised datasets. Models that can gracefully handle partial views and multiple levels of detail.


•  Cost functions adapted for learning human models with good generalization properties. Algorithms that can learn reliably from small training sets.


•  Relative advantages of bottom-up (discriminative, conditional) and top-down (generative) models and ways to combine them for initialization and for recovery from tracking failures.


•  Inference methods for multiple people and for scenes where the data association is difficult. Algorithms and models able to handle occlusion, clutter and lighting changes. The relative advantages of 2D and 3D models and ways to jointly use them.


•  The role of context in resolving ambiguities during state inference. Methods for combining reconstruction and recognition.


The workshop is intended primarily as a series of invited talks, short thematic overviews and panels (for instance Representation and Learning, Inference Algorithms, Bottom-up and Top-down Human Models, etc.), both retrospective and prospective.



Length: Full day (8 hours)



Program Committee (Speakers and interested parties):


Simon Baker (CMU)

Michael Black (Brown)

Trevor Darrell (MIT)

David Fleet (UToronto)

Bill Freeman (MIT)

David Forsyth (UIUC)

Pascal Fua (EPFL)

Luc van Gool (ETHZ / ULeuven)

Daniel Huttenlocher (Cornell)

Allan Jepson (UToronto)

Yann LeCun (NYU)

Stan Li (CASIA)

Jitendra Malik (UCBerkeley)

Dimitris Metaxas (Rutgers)

Pietro Perona (Caltech)

Deva Ramanan (TTI-C)

Sami Romdhani (UBasel)

James Rehg (Georgia Tech)

Phil Torr (Oxford Brookes)

Bill Triggs (INRIA)

Song-Chun Zhu (UCLA)