**Fahiem Bacchus**

Department of Computer Science

University of Waterloo

Waterloo, ON, CANADA N2L 3G1

*email:* `fbacchus@logos.uwaterloo.ca`

**Craig Boutilier**

Department of Computer Science

University of British Columbia

Vancouver, BC, CANADA, V6T 1Z4

*email:* `cebly@cs.ubc.ca`

**Adam Grove**

NEC Research Institute

4 Independence Way

Princeton NJ 08540, USA

*email:* `grove@research.nj.nec.com`

**Abstract**

Markov decision processes (MDPs) are a very popular tool
for decision theoretic planning (DTP), partly because of the
well-developed, expressive theory that includes effective
solution techniques. But the Markov assumption---that dynamics
and rewards depend on the current state only, and not on
history---is often inappropriate. This is especially true
of rewards: we frequently wish to associate rewards with behaviors
that extend over time. Of course, such reward processes can be
encoded in an MDP should we have a rich enough state space
(where states encode enough history). However,
it is often difficult to "hand craft" suitable state spaces
that encode an appropriate amount of history.
We consider this problem in the case where non-Markovian rewards are
encoded by assigning values to formulas of a temporal logic.
These formulas characterize the value of temporally
extended behaviors. We argue that this allows a
natural representation of many commonly
encountered non-Markovian rewards. The main result is an
algorithm which, given a decision process with non-Markovian
rewards expressed in this manner, *automatically*
constructs an equivalent MDP (with Markovian reward structure),
allowing optimal policy construction using standard techniques.
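The core idea of the translation can be illustrated with a toy example. The sketch below is not the paper's algorithm; it is a minimal, hypothetical illustration of the general technique the abstract describes: a reward that depends on the whole history (here, "pay 1 at each step as long as proposition `p` has held at every state so far") is made Markovian by augmenting each state with just enough history, in this case a single boolean flag.

```python
# Hedged sketch (illustrative names, not from the paper): converting a
# non-Markovian reward into an equivalent Markovian one by augmenting
# the state space.
#
# Non-Markovian reward: at each step, pay 1 iff proposition p has held
# at every state visited so far (the formula "always p" over the prefix).

def non_markovian_reward(history, p):
    """Reward for the latest step, given the full state history."""
    return 1 if all(p(s) for s in history) else 0

def augment(trajectory, p):
    """Equivalent Markovian process: each augmented state carries one
    extra bit ("has p held so far?"), so the reward depends only on the
    current augmented state, not on the history."""
    flag = True
    augmented = []
    for s in trajectory:
        flag = flag and p(s)
        augmented.append((s, flag))
    return augmented

def markovian_reward(aug_state):
    _, flag = aug_state
    return 1 if flag else 0

# The two formulations assign identical rewards along any trajectory.
p = lambda s: s != "bad"
traj = ["ok", "ok", "bad", "ok"]
nm = [non_markovian_reward(traj[:i + 1], p) for i in range(len(traj))]
mk = [markovian_reward(a) for a in augment(traj, p)]
assert nm == mk == [1, 1, 0, 0]
```

For richer temporal formulas the extra "history bits" generalize to the state of an automaton tracking the formula's progress, but the principle is the same: once the augmented state is Markovian, standard MDP solution techniques apply unchanged.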

*To appear, AAAI-96*