Modeling Facial Displays

We have developed a generative model for constructing temporally and spatially abstract descriptions of sequences. The model is a mixture of coupled hidden Markov models, shown here as a Bayesian network being used to assess a sequence of a person smiling.

See some movies of this model analyzing sequences.

We consider that spatially abstracting a video frame during a human non-verbal display involves modeling both the current configuration and the dynamics of the face. Configuration and dynamics complement one another, much as position and momentum do in classical mechanics, and both are useful for describing how a human face moves and for tracking it. In particular, modeling the configuration of the face disambiguates temporal sequences of dynamics, while the dynamics of the face can predict future configurations.

Our observations consist of the video image regions, I, and the temporal derivatives, f_t, between pairs of images over these regions. We assume here that the image regions are given at each frame. The temporal derivatives, together with the spatial derivatives, yield a dense optical flow field under the assumption that the image intensity structure is locally constant over short periods of time (the brightness constancy assumption). The optical flow field is a projection of the 3D scene velocity onto the image plane, and gives the motion in the image at each pixel. Thus, the measurements we start from contain simultaneous descriptions of the instantaneous configuration and dynamics of the body. The task is first to spatially summarize both of these quantities, and then to temporally compress the entire sequence to a distribution over high-level descriptors, D.
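
As a rough illustration of how brightness constancy links the temporal derivative f_t to image motion, the sketch below solves the linearized constraint f_x*u + f_y*v + f_t = 0 by least squares over one region, giving a single translation (u, v). This is a minimal sketch and not the flow estimator used in this work: the function names, the synthetic test pattern, and the single-translation simplification are all assumptions made for illustration. A dense flow field would need a separate estimate at each pixel, for example over small local windows.

```python
import numpy as np


def estimate_translation(frame0, frame1):
    """Least-squares (u, v) satisfying f_x*u + f_y*v + f_t = 0 over a region.

    Hypothetical helper: one translation for the whole region, not dense flow.
    """
    f0 = np.asarray(frame0, dtype=float)
    f1 = np.asarray(frame1, dtype=float)
    # Spatial derivatives: axis 0 is rows (y), axis 1 is columns (x).
    f_y, f_x = np.gradient(f0)
    # Temporal derivative between the pair of images.
    f_t = f1 - f0
    # One brightness-constancy equation per pixel, stacked as A @ [u, v] = b.
    A = np.stack([f_x.ravel(), f_y.ravel()], axis=1)
    b = -f_t.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v


def synthetic_frame(shift_x=0.0, shift_y=0.0, size=64):
    """Smooth test pattern translated by (shift_x, shift_y) pixels (assumed data)."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    return np.sin(0.2 * (x - shift_x)) + np.sin(0.25 * (y - shift_y))


# The pattern moves 0.3 pixels in x and -0.2 pixels in y between the frames.
u, v = estimate_translation(synthetic_frame(), synthetic_frame(0.3, -0.2))
print(f"estimated motion: u={u:.3f}, v={v:.3f}")
```

The linearization only holds for small displacements and smooth intensity, which is why the synthetic shift is kept well under one pixel.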

The spatial abstraction of images and temporal derivatives occurs in the two vertical chains in the figure above, culminating in distributions over the multivariate random variables, W and X, for images and temporal derivatives, respectively. W and X correspond to classes of instantaneous configuration and dynamics of the region of interest in the training data. For example, the configuration classes may correspond to characteristic facial poses, such as the apex of a smile. The dynamics classes are motion classes, and may correspond to, for example, motion during expansion of the face to a smile.
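
To make the roles of W, X, and D concrete, here is a minimal sketch of how a sequence could be scored under a two-chain coupled hidden Markov model and then turned into a distribution over descriptors. It assumes the common coupled-HMM factorization in which each chain's next state depends on the previous states of both chains; the exact coupling in our network may differ, and every name and array layout below is hypothetical.

```python
import numpy as np


def coupled_forward_loglik(lik_w, lik_x, trans_w, trans_x, prior_w, prior_x):
    """Log-likelihood of one sequence under a two-chain coupled HMM (sketch).

    lik_w[t, i]      : p(configuration features at time t | W_t = i)
    lik_x[t, j]      : p(dynamics features at time t | X_t = j)
    trans_w[i, j, k] : P(W_t = k | W_{t-1} = i, X_{t-1} = j)  (assumed coupling)
    trans_x[i, j, l] : P(X_t = l | W_{t-1} = i, X_{t-1} = j)  (assumed coupling)
    """
    # Joint forward message over (W_0, X_0), rescaled at every step.
    alpha = np.outer(prior_w * lik_w[0], prior_x * lik_x[0])
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha = alpha / scale
    for t in range(1, lik_w.shape[0]):
        # Predict the joint state: sum over the previous (W, X) pair.
        pred = np.einsum("ij,ijk,ijl->kl", alpha, trans_w, trans_x)
        # Weight by the evidence from both observation streams.
        alpha = pred * np.outer(lik_w[t], lik_x[t])
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale
    return loglik


def descriptor_posterior(logliks, prior_d):
    """P(D | sequence) by Bayes' rule, one coupled HMM per value of D."""
    log_post = np.asarray(logliks, dtype=float) + np.log(prior_d)
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```

In this sketch each value of D owns its own set of transition parameters, so the mixture amounts to running the forward pass once per descriptor and normalizing the results.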

The same method is used for spatial abstraction of both the configuration and dynamics of the face. Image regions and optical flow fields are each projected onto a pre-determined set of basis functions, yielding finite-dimensional feature vectors, Z_w and Z_x, respectively. The basis set is complete and orthogonal, such that Z_w and Z_x can be used to reconstruct images and flow fields to an arbitrary degree of accuracy, given sufficient basis projections. The basis functions are ordered by their spatial frequencies, such that low orders represent gross structure in images and flow fields, and higher orders represent more complex structures. Using a pre-determined basis set defers any commitment to particular types of motion to higher levels of processing, without affecting computational efficiency.

We use the basis of Zernike polynomials, which have useful properties for modeling flow fields and images (Teh and Chin 1988). Zernike polynomials are defined over the unit disk and satisfy the completeness and orthogonality requirements above.

The distribution of each feature vector (Z_w for configuration, Z_x for dynamics) is modeled by a mixture of Gaussians, where the mixture components are labeled as states of W and X. The mixture models at this stage also include feature weights as priors on the cluster means. These feature weights obviate the need to choose which basis functions are useful for classification.
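
The projection onto Zernike polynomials described above can be sketched as follows. This is a minimal illustration using the standard definition of the Zernike radial polynomials and a plain pixel-sum approximation of the integral over the unit disk. The function names, the grid sampling, and the choice to keep only orders with m >= 0 are assumptions made for the example, not a description of our implementation. A flow field could be handled by projecting each of its two components separately, which is likewise an assumption.

```python
import numpy as np
from math import factorial


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_n^|m|(rho), standard definition."""
    m = abs(m)
    if m > n or (n - m) % 2:
        raise ValueError("need |m| <= n and n - |m| even")
    rho = np.asarray(rho, dtype=float)
    out = np.zeros_like(rho)
    for k in range((n - m) // 2 + 1):
        coeff = ((-1) ** k * factorial(n - k)
                 / (factorial(k)
                    * factorial((n + m) // 2 - k)
                    * factorial((n - m) // 2 - k)))
        out += coeff * rho ** (n - 2 * k)
    return out


def zernike_moment(field, n, m):
    """Approximate projection of a square 2D array onto basis function V_nm.

    The array is assumed to sample [-1, 1] x [-1, 1]; samples outside the
    unit disk are ignored.
    """
    size = field.shape[0]
    coords = np.linspace(-1.0, 1.0, size)
    x, y = np.meshgrid(coords, coords)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0
    basis = radial_poly(n, m, rho) * np.exp(1j * m * theta)
    pixel_area = (coords[1] - coords[0]) ** 2
    total = np.sum(field[inside] * np.conj(basis[inside]))
    return (n + 1) / np.pi * total * pixel_area


def zernike_features(field, max_order):
    """Feature vector of moments, ordered from low to high order n."""
    feats = []
    for n in range(max_order + 1):
        for m in range(n % 2, n + 1, 2):  # m >= 0 with n - m even
            feats.append(zernike_moment(field, n, m))
    return np.array(feats)
```

Because the basis is orthogonal, truncating the feature vector at a low max_order keeps the gross structure and discards the finer detail carried by the higher orders.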

Read more about this Bayesian network, or read about Zernike polynomials for modeling dynamics and configuration, or see some movies of this model analyzing sequences.