Learning to Reconstruct 3D Human Motion from Bayesian Mixture of Experts
A Probabilistic Discriminative Approach
The BM3E Model = A Conditional Bayesian Mixture of Experts Markov Model
Cristian Sminchisescu, Atul Kanaujia, Zhiguo Li, Dimitris Metaxas
We propose a mixture density propagation algorithm to estimate 3D human motion in monocular video sequences, based on observations encoding the appearance of image silhouettes. Our approach is discriminative rather than generative, so it does not require the probabilistic inversion of a predictive observation model. Instead, it uses a human motion capture database and a 3D computer graphics human model to synthesize training pairs of typical human configurations together with their realistically rendered 2D silhouettes. These are used to directly learn the conditional state distributions required for 3D body pose tracking, and thus avoid using the 3D model for inference. We aim for probabilistically motivated tracking algorithms and for models that can estimate the complex multivalued mappings common in inverse, uncertain perception inferences.
Our work has three contributions:
(1) we clarify the assumptions and derive the density propagation rules for discriminative inference in continuous, temporal chain models
(2) we propose flexible representations for learning multimodal conditional state distributions, based on compact Bayesian mixture of experts models
(3) we present empirical results on real and motion capture-based test sequences, and give comparisons against nearest-neighbor and regression methods
Our results suggest that correct, flexible conditional modeling and optimal density propagation are critical for robust, successful tracking.
Keywords: discriminative models, density propagation, mixture modeling, hierarchical mixture of experts, 3D human tracking, Bayesian methods, sparse regression.
Density Propagation for Discriminative Temporal Chain Models
Tracking involves optimal estimates for the temporal distribution over states x, given sequences of observations r. Typically, a generative architecture like the one on the right is used. This causal model represents, and possibly learns (the hyperparameters of), the observation conditional p(r|x), and then uses Bayes' rule to invert it and infer p(x|r).
Figure: discriminative temporal chain (left) and generative temporal chain (right).
A discriminative approach models and learns p(x|r) directly from training data and therefore avoids the more expensive inference step. A temporal chain architecture of this type is shown in the left figure above (the model conditions on the observation without spending any effort modeling it). The difficulty of such an approach lies in modeling and learning complex multimodal distributions. Before showing how we can obtain these, let us first see which distributions we actually need to model, and how we should combine them temporally, for optimal solutions.
For a discriminative chain model (left), with a first-order Markov assumption on the temporal states and marginal independence of the observations, the filtering density is:
See the paper for the proof.
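For reference, the rule implied by these assumptions can be sketched as follows. The notation is an assumption made here for illustration: x_t is the state at time t, r_t the observation at time t, and R_t = (r_1, ..., r_t) the observation history. The local conditional depends on both the previous state and the current observation, and the previous filtered density does not depend on r_t because the observations are marginally independent.

```latex
p(x_t \mid R_t) \;=\; \int p(x_t \mid x_{t-1}, r_t)\; p(x_{t-1} \mid R_{t-1})\; dx_{t-1},
\qquad R_t = (r_1, \ldots, r_t)
```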
For a generative chain model (right), the filtering propagation rule can be derived in terms of discriminative conditionals (to simplify inference):
The above is a direct consequence of the classical generative propagation rule, as derived in particle filtering. Similar assumptions are made: first-order Markov for the temporal states, and conditional independence of the observations given the states. However, we work with the discriminative conditional p(x|r) and not with the generative one p(r|x). Moreover, we need to weight by the equilibrium distribution over states. This can be obtained, e.g., by integrating the conditional state dynamics. An approximation can be precomputed.
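The relation between the two forms can be checked numerically on a small discrete model. The sketch below is only an illustration under assumed, randomly generated distributions (the names `joint`, `dynamics`, and the state and observation counts are hypothetical). It relies on Bayes' rule, p(r|x) = p(x|r) p(r) / p(x), so the weighting by the state distribution enters as a division by p(x), and p(r) disappears in the normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs = 4, 5

# Hypothetical joint p(x, r) over discrete states (rows) and observations (columns).
joint = rng.random((n_states, n_obs))
joint /= joint.sum()
p_x = joint.sum(axis=1)               # state marginal p(x), standing in for the equilibrium distribution
p_r = joint.sum(axis=0)               # observation marginal p(r)
p_r_given_x = joint / p_x[:, None]    # generative conditional p(r | x)
p_x_given_r = joint / p_r[None, :]    # discriminative conditional p(x | r)

# Hypothetical dynamics: column j holds p(x_t | x_{t-1} = j).
dynamics = rng.random((n_states, n_states))
dynamics /= dynamics.sum(axis=0, keepdims=True)

def normalize(v):
    return v / v.sum()

def generative_step(prior, r):
    """Classical rule: weight the predicted density by the likelihood p(r_t | x_t)."""
    predicted = dynamics @ prior
    return normalize(p_r_given_x[:, r] * predicted)

def discriminative_step(prior, r):
    """Same update written with p(x_t | r_t) divided by the state marginal p(x_t)."""
    predicted = dynamics @ prior
    return normalize(p_x_given_r[:, r] / p_x * predicted)

prior = normalize(rng.random(n_states))
post_gen, post_disc = prior, prior
for r in [2, 0, 4, 1]:                # an arbitrary observation sequence
    post_gen = generative_step(post_gen, r)
    post_disc = discriminative_step(post_disc, r)
```

Both updates should agree up to numerical precision, since they differ only by the factor p(r_t), which is constant in x_t.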
In practice, we do estimation using conditionals represented as Bayesian mixtures (see next section), and the temporal prior also propagates as a Gaussian mixture, each having, say, M components. We first analytically integrate the M^2 pairwise products of Gaussians inside the integral on the r.h.s. This requires the linearization of our (generally) non-linear, but parametric (i.e. easily differentiable) state conditionals. The means of this expanded posterior are clustered and the centers are used to initialize a reduced M-component approximation that is refined using variational optimization. See this paper for detailed derivations of filtering and smoothing for discriminative and generative models and for a generalization to windows of observations of arbitrary size. See this paper for a variational mixture optimization algorithm.
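The pairwise integration step can be sketched as follows, assuming each conditional component has already been linearized to the form N(x_t; A x_{t-1} + b, S) and that its gate weight is treated as a constant at the linearization point (both are simplifying assumptions of this sketch, and all names are hypothetical). For a prior component N(x_{t-1}; mu, P), the integral is then the Gaussian N(x_t; A mu + b, A P A^T + S). The clustering and variational refinement steps are not shown.

```python
import numpy as np

def propagate_mixture(prior, conditional):
    """Expand M prior components against M linearized conditional components.

    prior:       list of (weight, mean, covariance)
    conditional: list of (weight, A, b, S) describing N(x_t; A x_prev + b, S)
    Returns the M*M-component Gaussian mixture over x_t with normalized weights.
    """
    expanded = []
    for w_p, mu, P in prior:
        for w_c, A, b, S in conditional:
            # Closed-form integral of a linear-Gaussian conditional against a Gaussian prior.
            expanded.append((w_p * w_c, A @ mu + b, A @ P @ A.T + S))
    total = sum(w for w, _, _ in expanded)
    return [(w / total, m, C) for w, m, C in expanded]

I2 = np.eye(2)
prior = [
    (0.6, np.array([0.0, 0.0]), I2),
    (0.4, np.array([1.0, -1.0]), 0.5 * I2),
]
conditional = [
    (0.7, I2, np.zeros(2), 0.1 * I2),
    (0.3, 0.5 * I2, np.array([1.0, 0.0]), 0.2 * I2),
]
expanded_posterior = propagate_mixture(prior, conditional)
```

With M = 2 components on each side, `expanded_posterior` holds M^2 = 4 components, which would then be reduced back to M components.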
Modeling Complex Multimodal Conditional Distributions using Bayesian Mixture of Experts
The models described above compute the filtering distribution in terms of a number of simpler conditional distributions. But even these are complex to construct, e.g. p(x|r), because the mapping from observations r to states x is multivalued in many inverse perception problems. Consider the case where the observations are silhouette features of a human body and the state is a vector of 3D human joint angles. In the figure below, different 3D interpretations produce very similar silhouettes (other interpretations are possible):
This multimodality is contextual. It depends on what is observed and on the region of the input domain. The distribution of human joint angles may be highly ambiguous for some inputs and moderately ambiguous or unambiguous for others. Each branch of the inverse mapping can be modeled with a simple function approximator (e.g. a perceptron or a regressor, which we call an expert), and the overall output comes from a mixture of experts. But which of these are active for any given input depends on the input. This contextual dependency is an essential aspect of the problem and requires that the mixture coefficients for each expert (which we call gates) are themselves functions of the input. Notice that there is an essential difference between modeling the joint distribution and modeling the conditional. The mixture coefficients of the joint are constant across the input domain (see the first row, middle, in the figure below). Therefore the extent to which each expert is good at predicting a given input is unknown.
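A minimal sketch of such a conditional mixture, p(x|r) = sum_i g_i(r) p_i(x|r), is given below for a scalar state. The specific choices are assumptions made for illustration and are not taken from the model description above: softmax gates that are linear in the input, linear experts with Gaussian noise, and the hypothetical parameter names `gate_weights`, `expert_weights`, and `expert_vars`.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gates(r, gate_weights):
    """Input-dependent mixing coefficients g_i(r); they sum to one for every r."""
    return softmax(gate_weights @ r)

def conditional_density(x, r, gate_weights, expert_weights, expert_vars):
    """p(x | r) as a gated sum of Gaussian experts with means linear in r."""
    g = gates(r, gate_weights)
    means = expert_weights @ r        # one predicted mean per expert
    lik = np.exp(-0.5 * (x - means) ** 2 / expert_vars) / np.sqrt(2 * np.pi * expert_vars)
    return float(g @ lik)

# Two hypothetical experts; the input vector is (feature, bias).
gate_weights = np.array([[4.0, 0.0], [-4.0, 0.0]])    # expert 0 active for positive features
expert_weights = np.array([[1.0, 0.0], [-1.0, 0.0]])  # each expert models one branch
expert_vars = np.array([0.25, 0.25])

g_pos = gates(np.array([2.0, 1.0]), gate_weights)
g_neg = gates(np.array([-2.0, 1.0]), gate_weights)
g_mid = gates(np.array([0.0, 1.0]), gate_weights)     # ambiguous input: both experts active
```

Because the gates depend on r, the model can be unimodal for some inputs and multimodal for others, which a mixture with constant coefficients cannot express.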
For instance, consider the following toy problem, where data is sampled from an S-like shape. For any input (on the horizontal axis), there are many possible outputs, and the different branches of the map (green, red, blue) can each be modeled by a simple expert, here linear or Gaussian kernel regressors. On the left, we show the experts and the original data, in the middle we show the gates, and on the right samples generated from the mixture model (note the similarity to the input distribution, i.e. the data is well modeled). The middle panel of the first row also shows the uniform mixture coefficients of the joint distribution (clustering the input-output pairs, under the input-output dependency of each assumed local model, e.g. linear or kernelized). The first two rows show models that are based on random regression and joint density (the conditional is obtained from the joint distribution by applying Bayes' rule, conditioning and marginalization). The bottom row shows results from a directly estimated conditional model. See the paper for details.
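To make the multivalued nature of such a map concrete, the sketch below builds a noise-free S-shaped curve and lists the outputs compatible with a given input. The particular curve, u = t + 0.3 sin(2 pi t), is a common textbook choice assumed here for illustration; it is not necessarily the data used in the figure.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 2001)            # output variable
u = t + 0.3 * np.sin(2 * np.pi * t)        # input variable; the inverse map u -> t is S-shaped

def outputs_near(u0, tol=0.005):
    """Return all outputs t whose input lies within tol of u0."""
    return t[np.abs(u - u0) < tol]

near_mid = outputs_near(0.5)               # input in the ambiguous middle region
near_edge = outputs_near(0.05)             # input in an unambiguous region

# Count branches as groups of outputs separated by large gaps.
branches_mid = 1 + int(np.sum(np.diff(near_mid) > 0.1))
branches_edge = 1 + int(np.sum(np.diff(near_edge) > 0.1))
```

Here the middle input is consistent with three separate branches while the edge input has a single one, so a single regressor would average across branches in the middle region.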
We use Bayesian methods and sparse approximations, in order to obtain compact models and avoid overfitting. Our model is thus a Bayesian Mixture of Experts.
Three-Dimensional Human Motion Reconstruction and Tracking Results
Image features. We use shape context histograms extracted on the silhouette and pairwise edge histograms collected inside the silhouette contour. This is what an affinity matrix looks like for simple walking seen from a side view, complex walking, and conversations (3D joint angles, 2D shape context silhouette contour histograms, pairwise edge histograms inside the silhouette). Notice that for walking seen from the side, the 3D joint angles and the 2D observables correlate much better than for other, more complex activities:
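An affinity matrix of this kind can be sketched as below. How the affinities were actually computed is not stated here, so the Gaussian kernel on Euclidean distances, the bandwidth `sigma`, and the synthetic periodic features are all assumptions made for illustration.

```python
import numpy as np

def affinity_matrix(features, sigma=1.0):
    """Pairwise affinities exp(-||f_i - f_j||^2 / (2 sigma^2)) for a (frames, dims) array."""
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)   # clip small negatives caused by rounding
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Hypothetical periodic descriptor sequence, standing in for a walking cycle of 20 frames.
phase = 2 * np.pi * np.arange(40) / 20
features = np.column_stack([np.cos(phase), np.sin(phase)])
A = affinity_matrix(features, sigma=0.5)
```

For a periodic motion, frames one period apart have affinity close to 1, which produces the diagonal band structure typical of such plots.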
Training database. We used about 3000 training examples from a variety of human activities involving running, walking, conversation, quarrelling, and pantomime, acquired using motion capture data (CMU database). We use a computer graphics human model to generate training data consisting of 54-dimensional joint angle state vectors paired with corresponding, realistically rendered image silhouettes (shape context and pairwise edge histograms are extracted for these as above). There is ambiguity in the database. Below, we show data analysis results, where we performed a fine-grained, independent clustering of the inputs and outputs, and computed histograms showing the number of joint angle clusters falling within each input cluster. The plots show results for observation versus state and (observation, state) versus state input pairs. The rightmost plot shows how the multimodality increases for a higher granularity of input clustering, again for observation versus state. These results show, quantitatively, that the state conditionals are multimodal, even in the presence of prior knowledge. This justifies why we work with a conditional mixture model.
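The counting step of this analysis can be sketched as follows, assuming the inputs and outputs have already been clustered independently and each training example carries one input label and one output label. The clustering algorithm is not specified here, and the label arrays below are made-up values used only to exercise the function.

```python
import numpy as np

def output_modes_per_input_cluster(input_labels, output_labels):
    """Map each input cluster to the number of distinct output clusters it contains."""
    members = {}
    for i, o in zip(input_labels, output_labels):
        members.setdefault(int(i), set()).add(int(o))
    return {i: len(outs) for i, outs in members.items()}

# Hypothetical cluster assignments for nine training examples.
input_labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
output_labels = np.array([0, 1, 1, 2, 2, 0, 3, 4, 4])

modes = output_modes_per_input_cluster(input_labels, output_labels)
# histogram[k] = number of input clusters that contain exactly k output clusters
histogram = np.bincount(list(modes.values()))
```

Input clusters that contain more than one output cluster indicate inputs whose state conditional is multimodal.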
Comparisons on Synthetically Generated Data
Walking, Running, Complex Walking, Conversations and Pantomime
We compared the predictions of our Bayesian Mixture of Experts (BME) with nearest-neighbor (NN) and regression (RVM) methods. For these tests, there is no probabilistic tracking, only reconstruction at each time step (but the conditional on the r.h.s. has memory). The results show improvements in both average errors and average maximum errors:
Below we show some results and data we tested on:
Running (original and reconstruction seen from 2 different viewpoints)
Complex Walking with Hand Shaking and Turning (original, first and second most probable hypothesis; notice forward-backward ambiguities)
Conversations (original, most probable, second and least probable reconstruction)
Bayesian Tracking on Real Image Data
We also ran experiments on reconstructing human motion based on various propagation rules and conditional models for walking, running, dancing, and bending and picking. For these sequences, our Bayesian tracker uses 5 hypotheses.
Bending and Picking