Using Free Energies to Represent Q-values in a Multiagent
  Reinforcement Learning Task
  Brian Sallans
  Department of Computer Science
  University of Toronto
  Toronto M5S 2Z9 Canada
  
  Geoffrey Hinton
  Gatsby Computational Neuroscience Unit
  University College London
  17 Queen Square, London WC1N 3AR, UK
  Abstract
  The problem of reinforcement learning in large factored Markov decision
  processes is explored.  The Q-value of a state-action pair is approximated
  by the free energy of a product of experts network.  Network parameters are
  learned on-line using a modified SARSA algorithm that minimizes the
  inconsistency between the Q-values of consecutive state-action pairs.
  Actions are chosen based on the current value estimates by fixing the
  current state and sampling actions from the network using Gibbs sampling.
  The algorithm is tested on a cooperative multiagent task.  The product of
  experts model is found to perform comparably to table-based Q-learning for
  small instances of the task, and continues to perform well when the problem
  becomes too large for a table-based representation.
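  As a rough illustration of the approach described above, the sketch below is
  not taken from the paper; every name and size in it (N_STATE, q_value,
  sarsa_update, and so on) is a hypothetical choice.  It uses a single
  restricted Boltzmann machine over binary state and action units, whose hidden
  units play the role of the experts: the negative free energy serves as the
  Q-value, actions are drawn by Gibbs sampling with the state clamped, and the
  weights are adjusted with a SARSA-style temporal-difference error.

  import numpy as np

  rng = np.random.default_rng(0)

  # Hypothetical problem sizes: binary state vector, binary action vector, hidden units.
  N_STATE, N_ACTION, N_HIDDEN = 12, 4, 16

  # Weights from (state, action) visible units to hidden units, plus biases.
  W_s = 0.01 * rng.standard_normal((N_HIDDEN, N_STATE))
  W_a = 0.01 * rng.standard_normal((N_HIDDEN, N_ACTION))
  b_h = np.zeros(N_HIDDEN)   # hidden biases
  b_s = np.zeros(N_STATE)    # state biases
  b_a = np.zeros(N_ACTION)   # action biases

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def q_value(s, a):
      # Negative free energy of the clamped (state, action) configuration:
      # Q(s, a) = sum_j softplus(W_s[j].s + W_a[j].a + b_h[j]) + b_s.s + b_a.a
      x = W_s @ s + W_a @ a + b_h
      return np.sum(np.logaddexp(0.0, x)) + b_s @ s + b_a @ a

  def sample_action(s, n_gibbs=20):
      # Clamp the state and alternate Gibbs sampling of hidden and action
      # units, so actions are drawn with probability tied to their value.
      a = (rng.random(N_ACTION) < 0.5).astype(float)
      for _ in range(n_gibbs):
          h = (rng.random(N_HIDDEN) < sigmoid(W_s @ s + W_a @ a + b_h)).astype(float)
          a = (rng.random(N_ACTION) < sigmoid(W_a.T @ h + b_a)).astype(float)
      return a

  def sarsa_update(s, a, r, s2, a2, lr=0.01, gamma=0.95):
      # Move Q(s, a) toward r + gamma * Q(s2, a2); the gradient of Q with
      # respect to each weight is the expected hidden activation times the
      # corresponding clamped visible value.
      global W_s, W_a, b_h, b_s, b_a
      td_err = r + gamma * q_value(s2, a2) - q_value(s, a)
      h_mean = sigmoid(W_s @ s + W_a @ a + b_h)   # E[h | s, a]
      W_s += lr * td_err * np.outer(h_mean, s)
      W_a += lr * td_err * np.outer(h_mean, a)
      b_h += lr * td_err * h_mean
      b_s += lr * td_err * s
      b_a += lr * td_err * a

  In this sketch an agent would alternate calls to sample_action and
  sarsa_update as it interacts with the environment; the paper itself applies
  the idea to a cooperative multiagent task with a factored action space.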
  Submitted to Advances in Neural Information Processing Systems
  13, MIT Press, Cambridge, MA