**Fahiem Bacchus**

Department of Computer Science

University of Waterloo

Waterloo, ON, CANADA N2L 3G1

*email:* `fbacchus@logos.uwaterloo.ca`

**Craig Boutilier**

Department of Computer Science

University of British Columbia

Vancouver, BC, CANADA, V6T 1Z4

*email:* `cebly@cs.ubc.ca`

**Adam Grove**

NEC Research Institute

4 Independence Way

Princeton NJ 08540, USA

*email:* `grove@research.nj.nec.com`

**Abstract**

Markov decision processes (MDPs) are a very popular tool
for decision theoretic planning (DTP), partly because of the
well-developed, expressive theory that includes effective
solution techniques. But the Markov assumption---that dynamics
and rewards depend on the current state only, and not on
history---is often inappropriate. This is especially true
of rewards: we frequently wish to associate rewards with behaviors
that extend over time. Of course, such reward processes can be
encoded in an MDP should we have a rich enough state space
(where states encode enough history). However,
it is often difficult to "hand craft" suitable state spaces
that encode an appropriate amount of history.
We consider this problem in the case where non-Markovian rewards are
encoded by assigning values to formulas of a temporal logic.
These formulas characterize the value of temporally
extended behaviors. We argue that this allows a
natural representation of many commonly
encountered non-Markovian rewards. The main result is an
algorithm which, given a decision process with non-Markovian
rewards expressed in this manner, *automatically*
constructs an equivalent MDP (with Markovian reward structure),
allowing optimal policy construction using standard techniques.
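The core idea of the translation can be illustrated with a toy example. The sketch below is not the paper's algorithm; it is a minimal, hypothetical illustration of the general technique the abstract describes: a reward that depends on the whole history (here, "pay 1 at each step as long as proposition `p` has held at every state so far") is made Markovian by augmenting each state with just enough history, in this case a single boolean flag.

```python
# Hedged sketch (illustrative names, not from the paper): converting a
# non-Markovian reward into an equivalent Markovian one by augmenting
# the state space.
#
# Non-Markovian reward: at each step, pay 1 iff proposition p has held
# at every state visited so far (the formula "always p" over the prefix).

def non_markovian_reward(history, p):
    """Reward for the latest step, given the full state history."""
    return 1 if all(p(s) for s in history) else 0

def augment(trajectory, p):
    """Equivalent Markovian process: each augmented state carries one
    extra bit ("has p held so far?"), so the reward depends only on the
    current augmented state, not on the history."""
    flag = True
    augmented = []
    for s in trajectory:
        flag = flag and p(s)
        augmented.append((s, flag))
    return augmented

def markovian_reward(aug_state):
    _, flag = aug_state
    return 1 if flag else 0

# The two formulations assign identical rewards along any trajectory.
p = lambda s: s != "bad"
traj = ["ok", "ok", "bad", "ok"]
nm = [non_markovian_reward(traj[:i + 1], p) for i in range(len(traj))]
mk = [markovian_reward(a) for a in augment(traj, p)]
assert nm == mk == [1, 1, 0, 0]
```

For richer temporal formulas the extra "history bits" generalize to the state of an automaton tracking the formula's progress, but the principle is the same: once the augmented state is Markovian, standard MDP solution techniques apply unchanged.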

*To appear, AAAI-96*