Fahiem Bacchus
Department of Computer Science
University of Waterloo
Waterloo, ON, CANADA N2L 3G1
email: fbacchus@logos.uwaterloo.ca
Craig Boutilier
Department of Computer Science
University of British Columbia
Vancouver, BC, CANADA, V6T 1Z4
email: cebly@cs.ubc.ca
Adam Grove
NEC Research Institute
4 Independence Way
Princeton NJ 08540, USA
email: grove@research.nj.nec.com
Abstract
Markov decision processes (MDPs) are a very popular tool
for decision-theoretic planning (DTP), partly because of the
well-developed, expressive theory that includes effective
solution techniques. But the Markov assumption---that dynamics
and rewards depend on the current state only, and not on
history---is often inappropriate. This is especially true
of rewards: we frequently wish to associate rewards with behaviors
that extend over time. Of course, such reward processes can be
encoded in an MDP if we have a rich enough state space
(where states encode enough history). However,
it is often difficult to ``hand craft'' suitable state spaces
that encode an appropriate amount of history.
We consider this problem in the case where non-Markovian rewards are
encoded by assigning values to formulas of a temporal logic.
These formulas characterize the value of temporally
extended behaviors. We argue that this allows a
natural representation of many commonly
encountered non-Markovian rewards. The main result is an
algorithm which, given a decision process with non-Markovian
rewards expressed in this manner, automatically
constructs an equivalent MDP (with Markovian reward structure),
allowing optimal policy construction using standard techniques.
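To make the history-encoding idea from the abstract concrete, here is a minimal sketch in Python (not the paper's temporal-logic construction; the state names ``key'' and ``goal'' and the single-bit encoding are illustrative assumptions). A reward such as ``reach the goal, but only after having visited the key'' is non-Markovian in the base states, yet becomes Markovian once each state is paired with one bit of history:

    # Illustrative sketch: a history-dependent reward made Markovian
    # by expanding each state with one bit recording whether "key"
    # has been visited. The names and reward values are hypothetical.

    def augment(state, have_key):
        """Expanded state: the base state paired with the history bit."""
        return (state, have_key or state == "key")

    def reward(aug_state):
        """Markovian reward on the expanded space: it depends only on
        the current augmented state, not on the trajectory."""
        state, have_key = aug_state
        return 1.0 if state == "goal" and have_key else 0.0

    # A trajectory in the base process, tracked in the expanded space.
    trajectory = ["start", "key", "hall", "goal"]
    aug = ("start", False)
    total = 0.0
    for nxt in trajectory[1:]:
        aug = augment(nxt, aug[1])
        total += reward(aug)
    print(total)  # 1.0: "goal" was reached after "key"

The paper's contribution, per the abstract, is to automate exactly this kind of expansion: rather than hand-crafting history bits, the algorithm derives the extra state distinctions from the temporal-logic reward formulas themselves.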
To appear, AAAI-96