Learning to Reconstruct 3D Human Motion from Bayesian Mixture of Experts

*A Probabilistic Discriminative Approach*

The **BM³E Model** = A Conditional **B**ayesian **M**ixture of Experts **M**arkov **M**odel

__Cristian Sminchisescu, Atul Kanaujia, Zhiguo Li, Dimitris Metaxas__

**Introduction**

We propose a mixture density propagation algorithm to estimate 3D human
motion in *monocular video sequences*, based on observations encoding the
appearance of image silhouettes. Our approach is **discriminative** rather
than generative, so it does not require the probabilistic inversion of a
predictive observation model. Instead, it uses a human motion capture database
and a 3D computer graphics human model to synthesize training pairs of typical
human configurations together with their realistically rendered 2D silhouettes.
These are used to directly **learn** the conditional state distributions
required for 3D body pose tracking, and thus avoid using the 3D model for **
inference**. We aim for probabilistically motivated tracking algorithms and
for models that can estimate complex multivalued mappings common in inverse,
uncertain perception inferences.

Our work has **three contributions**:

(1) we clarify the assumptions and derive the density propagation rules for
**discriminative inference** in continuous, temporal chain models

(2) we propose flexible representations for **learning** multimodal
conditional state distributions, based on compact Bayesian mixture of experts
models

(3) we present empirical results on real and motion capture-based test sequences, and give comparisons against nearest-neighbor and regression methods

Our results suggest that correct, flexible conditional modeling and optimal density propagation are critical for robust, successful tracking.

**Keywords:** *discriminative models, density propagation, mixture modeling, hierarchical
mixture of experts, 3D human tracking, Bayesian methods, sparse regression.*

Reference Paper

- Cristian Sminchisescu, Atul Kanaujia, Zhiguo Li, Dimitris Metaxas:
**Learning to Reconstruct 3D Human Motion from Bayesian Mixture of Experts, A Probabilistic Discriminative Approach**, *Technical Report CSRG-502, University of Toronto, October 2004.* **[PDF]**

**Density Propagation for Discriminative Temporal Chain Models**

Tracking involves optimal estimates for the temporal distribution over states **x**,
given sequences of observations **r**. Typically, a generative architecture
like the one on the right is used. This causal model represents, and possibly
learns (the hyperparameters of), the observation conditional *p*(**r**|**x**),
and then uses Bayes' rule to invert it and infer *p*(**x**|**r**).

*(Figure: Discriminative Temporal Chain, left; Generative Temporal Chain, right.)*

A discriminative approach models and learns *p*(**x|r**)
directly from training data and therefore avoids the more expensive inference
step. A temporal chain architecture of this type is shown in the left figure
above (the model conditions on the observation without spending any effort
modeling it). The difficulty of such an approach lies in modeling and
learning complex multimodal distributions. Before showing how we obtain these,
let us first see which distributions we actually need to model, and how we
should combine them temporally, for optimal solutions.

For a **discriminative chain model (left)**, with a first-order Markov
assumption on the temporal states and marginal independence of the
observations, the filtering density is:
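With **R**_t = (**r**_1, ..., **r**_t) denoting the observation sequence, this recursion takes the form below (a hedged reconstruction following the companion report CSRG-501; note that the transition kernel conditions on both the previous state and the current observation):

```latex
p(\mathbf{x}_t \mid \mathbf{R}_t) =
\int_{\mathbf{x}_{t-1}}
p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{r}_t)\,
p(\mathbf{x}_{t-1} \mid \mathbf{R}_{t-1})\,
d\mathbf{x}_{t-1}
```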

For a **generative chain model (right)**, the filtering propagation rule
can be derived in terms of discriminative conditionals (to simplify inference):

where:
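*p*(**x**_t) is the equilibrium distribution over states, i.e. the stationary marginal of the dynamics (a sketch of the definition, consistent with the discussion below):

```latex
p(\mathbf{x}_t) =
\int_{\mathbf{x}_{t-1}}
p(\mathbf{x}_t \mid \mathbf{x}_{t-1})\,
p(\mathbf{x}_{t-1})\,
d\mathbf{x}_{t-1}
```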

The above is a direct consequence of the classical generative
propagation rule, as derived in particle filtering, under similar assumptions:
a first-order Markov model for the temporal states and conditional independence
of the observations given the states. We however work with the discriminative
conditional *p*(**x**|**r**) and not with the generative one, *p*(**r**|**x**).
Moreover, we need to weight by the equilibrium distribution over states. This
can be obtained, e.g., by integrating over the conditional state dynamics; an
approximation can be precomputed.

In practice, we perform estimation using conditionals represented as Bayesian
mixtures (see next section); the temporal prior also propagates as a
Gaussian mixture, each with, say, M components. We first integrate the M^2
pairwise products of Gaussians inside the integral on the r.h.s. analytically. This
requires the linearization of our (generally) non-linear, but parametric (i.e.,
easily differentiable) state conditionals. The means of this expanded posterior
are clustered, and the cluster centers are used to initialize a reduced M-component
approximation that is refined using variational optimization.
See the technical report (CSRG-501) below
for detailed derivations of filtering and smoothing for discriminative and
generative models and for a generalization to windows of observations of
arbitrary size.
See the CVPR 2004 paper below for a variational mixture optimization algorithm.
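As an illustration of this expand-then-reduce step, here is a minimal 1-D sketch (hypothetical helper names, scalar Gaussians only; the actual implementation uses multivariate components and variational refinement rather than the plain moment-matching below):

```python
import numpy as np

def gauss(x, mu, var):
    """Scalar Gaussian density N(x; mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def product_mixture(w1, mu1, var1, w2, mu2, var2):
    """All M*M pairwise products of two 1-D Gaussian mixtures (analytic)."""
    W, MU, VAR = [], [], []
    for a in range(len(w1)):
        for b in range(len(w2)):
            v = 1.0 / (1.0 / var1[a] + 1.0 / var2[b])
            m = v * (mu1[a] / var1[a] + mu2[b] / var2[b])
            z = gauss(mu1[a], mu2[b], var1[a] + var2[b])  # product normalizer
            W.append(w1[a] * w2[b] * z)
            MU.append(m)
            VAR.append(v)
    W = np.array(W)
    return W / W.sum(), np.array(MU), np.array(VAR)

def reduce_mixture(w, mu, var, M, iters=20):
    """Collapse an expanded mixture back to M components: cluster the
    means (1-D k-means) and moment-match within each cluster."""
    centers = mu[np.argsort(w)[-M:]].astype(float)  # init at heaviest components
    for _ in range(iters):
        assign = np.argmin(np.abs(mu[:, None] - centers[None, :]), axis=1)
        for k in range(M):
            if np.any(assign == k):
                centers[k] = np.average(mu[assign == k], weights=w[assign == k])
    W = np.array([w[assign == k].sum() for k in range(M)])
    VAR = np.array([np.average(var[assign == k] + (mu[assign == k] - centers[k]) ** 2,
                               weights=w[assign == k]) if np.any(assign == k) else 1.0
                    for k in range(M)])
    return W, centers, VAR
```

Multiplying two M-component mixtures yields M^2 components, which the reduction step collapses back to M so that the representation stays bounded over time.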

**Modeling Complex Multimodal Conditional Distributions using
Bayesian Mixture of Experts**

The models described above compute the filtering distribution in terms of a
number of simpler conditional distributions. But even these conditionals, e.g. *p*(**x**|**r**),
are complex to construct, because the mapping from observations **r** to states **x** is multivalued
in many inverse perception problems. Consider the case where the observations
are silhouette features of a human body and the state is a vector of 3D human joint
angles. In the figure below, different 3D interpretations produce very
similar silhouettes (other interpretations are possible):

This multimodality is contextual: it depends on what is observed and on which
region of the input domain it falls in. The distribution of human joint angles may be
highly ambiguous for some inputs, and moderately ambiguous or unambiguous for
others. Each branch of the inverse mapping can be modeled with a simple function
approximator (e.g. a perceptron or a regressor; call this an **expert**), and
the overall output comes from a mixture of experts. But which of these are
active for any given input *depends on the input*. This contextual
dependency is an essential aspect of the problem and requires that the mixture
coefficients for each expert (called gates) are themselves functions of the
input.
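Schematically, the conditional then takes a gated-mixture form along the following lines (a sketch with softmax gates parameterized by λ_i and Gaussian experts around regressors W_i over input features Φ(**r**); the precise parameterization and priors are in the paper):

```latex
p(\mathbf{x} \mid \mathbf{r}) =
\sum_{i=1}^{M} g_i(\mathbf{r})\,
\mathcal{N}\!\left(\mathbf{x};\; \mathbf{W}_i \Phi(\mathbf{r}),\; \boldsymbol{\Omega}_i\right),
\qquad
g_i(\mathbf{r}) =
\frac{\exp(\boldsymbol{\lambda}_i^\top \mathbf{r})}
     {\sum_{k=1}^{M} \exp(\boldsymbol{\lambda}_k^\top \mathbf{r})}
```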

For instance, consider the following toy problem, where data is sampled from an S-like shape. For any input (on the horizontal axis) there are many possible outputs, and the different branches of the map (green, red, blue) can each be modeled by a simple expert, here a linear or Gaussian-kernel regressor. On the left we show the experts and the original data; in the middle, the gates; and on the right, samples generated from the mixture model (note the similarity to the input distribution, i.e. the data is well modeled). The middle plot of the first row also shows the uniform mixture coefficients of the joint distribution (clustering the input-output pairs under the input-output dependency of each assumed local model, e.g. linear or kernelized). The first two rows show models based on random regression and joint density estimation (the conditional is obtained from the joint distribution by applying Bayes' rule, conditioning and marginalization). The bottom row shows results from a directly estimated conditional model. See the paper for details.
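A minimal numerical sketch of such a gated mixture (hand-set linear experts and illustrative quadratic-score gates, not the fitted model from the figure; all constants below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hand-set linear experts x = a_i * r + b_i, one per branch (illustrative)
EXPERTS = [(0.5, -1.0), (-1.0, 0.0), (0.5, 1.0)]

def gates(r):
    """Input-dependent mixing coefficients: softmax of a per-expert score."""
    centers = np.array([-1.0, 0.0, 1.0])  # hypothetical regions of expertise
    s = -(r - centers) ** 2 / 0.5
    e = np.exp(s - s.max())
    return e / e.sum()

def sample_conditional(r, n=1000, noise=0.05):
    """Draw n samples from p(x | r) = sum_i g_i(r) N(x; a_i*r + b_i, noise^2)."""
    g = gates(r)
    idx = rng.choice(len(EXPERTS), size=n, p=g)
    a = np.array([EXPERTS[i][0] for i in idx])
    b = np.array([EXPERTS[i][1] for i in idx])
    return a * r + b + noise * rng.standard_normal(n)
```

Sampling at a fixed input produces a multimodal output distribution, mirroring the several branches visible in the toy S-curve.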

We use Bayesian methods and sparse approximations, in order to obtain compact
models and avoid overfitting. Our model is thus a **Bayesian Mixture of Experts**.

**Three-Dimensional Human Motion Reconstruction and Tracking Results**

__Image features__ We use shape context histograms extracted on the silhouette
and pairwise edge histograms collected inside the silhouette contour. Below is
how the affinity matrices look for a simple walk seen from a side view, a complex
walk, and conversations (3D joint angles, 2D shape-context silhouette-contour
histograms, pairwise edge histograms inside the silhouette). Notice that for
walking seen from the side, the 3D joint angles and the 2D observables correlate
much better than for other, more complex activities:
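One way such state-observable agreement could be quantified is by comparing affinity matrices built from each representation (a hypothetical sketch using RBF affinities over pairwise distances and their flattened correlation; the paper's exact construction may differ):

```python
import numpy as np

def affinity(D):
    """Turn a set of descriptors, shape (n, d), into an (n, n) RBF affinity matrix."""
    d2 = ((D[:, None, :] - D[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (d2.mean() + 1e-12))

def agreement(A, B):
    """Flattened Pearson correlation between two affinity matrices."""
    a, b = A.ravel() - A.mean(), B.ravel() - B.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

High agreement between the joint-angle affinity and the silhouette-descriptor affinity indicates that the observables discriminate the states well; low agreement signals ambiguity.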

__Training Database__ We used about 3000
training examples from a variety of human activities, including running, walking,
conversation, quarreling, and pantomime, acquired using motion capture data (CMU
database). We use a computer graphics human model to generate training data
consisting of 54-dimensional joint-angle state vectors paired with
corresponding, realistically rendered image silhouettes (shape context and
pairwise edge histograms are extracted for these as above). There is ambiguity
in the database. Below, we show data analysis results, where we performed
a fine-grained, independent clustering of the inputs and outputs, and computed
histograms of the number of joint-angle clusters falling within each input
cluster. The plots show results for observation versus state and for (observation,
state) input pairs versus state. The rightmost plot shows how the multimodality
increases with a higher granularity of input clustering, again for observation versus
state. These results show, quantitatively, that the state conditionals are
multimodal, even in the presence of prior knowledge. This justifies why we work
with a conditional mixture model.
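The analysis above can be sketched as follows (a minimal stand-in with a plain k-means and hypothetical function names; the paper's clustering granularity and distance measures may differ):

```python
import numpy as np

def kmeans_labels(X, k, iters=30):
    """Plain k-means returning cluster labels; deterministic init by
    spreading the initial centers along the first coordinate."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X[:, 0])
    C = X[order[np.linspace(0, len(X) - 1, k).astype(int)]].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)
    return labels

def multimodality_histogram(R, X, k_in, k_out):
    """Cluster observations R and states X independently, then histogram the
    number of distinct state clusters falling inside each observation cluster."""
    lab_r = kmeans_labels(R, k_in)
    lab_x = kmeans_labels(X, k_out)
    counts = [len(np.unique(lab_x[lab_r == j]))
              for j in range(k_in) if np.any(lab_r == j)]
    return np.bincount(counts)
```

Mass at indices greater than one in the returned histogram is direct evidence of multimodality: a single observation cluster maps to several distinct state clusters.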

**Comparisons on Synthetically Generated Data**

*Walking, Running, Complex Walking, Conversations and
Pantomime*

We compared the predictions of our Bayesian Mixture of Experts (BME) with nearest-neighbor (NN) and regression (RVM) methods. For these tests there is no probabilistic tracking, only reconstruction at each time step (although the conditional on the r.h.s. has memory). The results show improvements in both average error and average maximum error:
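The two error measures can be computed per sequence along these lines (a sketch assuming (T, J) arrays of per-frame joint-angle estimates; function and argument names are illustrative):

```python
import numpy as np

def reconstruction_errors(pred, gt):
    """pred, gt: (T, J) arrays of per-frame joint-angle estimates (degrees).
    Returns (average error, average maximum error) over the sequence."""
    err = np.abs(np.asarray(pred, float) - np.asarray(gt, float))
    # Mean over all frames and joints, and mean of the worst joint per frame
    return err.mean(), err.max(axis=1).mean()
```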

Below we show some results and data we tested on:

**Running** (original, and the reconstruction seen from 2 different viewpoints)

**Complex Walking with Hand Shaking and Turning**
(original, first and second most probable hypotheses; notice the forward-backward
ambiguities)

**Conversations** (original; the most probable, second most probable,
and least probable reconstructions)

**Bayesian Tracking on Real Image Data**

We also ran experiments on reconstructing human motion using various propagation rules and conditional models, for walking, running, dancing, and bending-and-picking. For these sequences, our Bayesian tracker uses 5 hypotheses.

**Walking**

**Dancing**

**Bending and Picking**

References

- Cristian Sminchisescu, Atul Kanaujia, Zhiguo Li, Dimitris Metaxas:
**Learning to Reconstruct 3D Human Motion from Bayesian Mixture of Experts, A Probabilistic Discriminative Approach**, *Technical Report CSRG-502, University of Toronto, October 2004.*
- Cristian Sminchisescu, Allan Jepson:
**Density Propagation for Continuous Temporal Chains. Generative and Discriminative Models**, *Technical Report CSRG-501, University of Toronto, October 2004.*
- Cristian Sminchisescu, Allan Jepson:
**Generative Modeling for Continuous Non-Linearly Embedded Visual Inference**, *International Conference on Machine Learning, ICML 2004.*
- Cristian Sminchisescu, Allan Jepson:
**Variational Mixture Smoothing for Non-Linear Dynamical Systems**, *IEEE International Conference on Computer Vision and Pattern Recognition, CVPR 2004.*
- Cristian Sminchisescu, Bill Triggs:
**Kinematic Jump Processes for Monocular 3D Human Tracking**, *IEEE International Conference on Computer Vision and Pattern Recognition, CVPR 2003.*