|
Decision Theoretic Learning of Facial Displays and Gestures
Joint work with Jim Little
A rational agent attempts to select actions which maximize their utility function over possible outcomes.
The agent can predict outcomes based upon its beliefs about the current situation (the context) and a learned model of how the
environment reacts to the agent's actions. If this agent is interacting with another rational agent, then the context includes
the strategies, preferences and internal states of these other agents. Further, the outcomes are now dependent upon the joint action
choices of all agents in the interaction. Each decision maker can use observations of other agent's behaviors
to help it predict outcomes of it's actions, and hence help it choose actions which are maximally beneficial.
In particular, many of the other agent's behaviors will be non-verbal indications of directions in which
they want the interaction to proceed. Gestures, body postures, and facial displays are all non-verbal behaviors
which are used extensively by humans in interactions.
Our research is centered on building decision theoretic models of interactions between rational agents and
humans, in which non-verbal behaviors play an important role.
This figure shows two agents interacting:
The agent on the left is considering his action choices. He is trying to predict outcomes (that have utility for him)
based on the context and on what his partner (the agent on the right) is telling him using his non-verbal displays (e.g. facial
displays). See what happens, or watch the whole sequence.
If we assume that only the partner's facial display is unobservable, then the arrows and nodes in the figure above
correspond to links in a partially observable Markov decision process, or
POMDP. This is the model our research focusses on.
While the high-level variables (context, actions, outcomes) can be described by discrete multinomial variables and functions,
the non-verbal displays of the interactant are a much more complex function, since they describe observations which are
entire video sequences, a temporally and spatially abstract set of data.
Read on about this observation function,
jump to POMDPs for facial display understanding.
|