











• What if we use an approximation to the posterior distribution over hidden configurations?




– e.g. assume the posterior factorizes into a product of distributions for each separate hidden cause.
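
To make the factorization assumption concrete, here is a minimal sketch on a hypothetical toy model: two binary hidden causes and one binary observation. The prior, the likelihood table, and all function names are assumptions chosen for illustration. The factorized distribution used here is simply the product of the exact marginals, which is one possible factorized approximation and not necessarily the one a variational method would choose.

```python
from itertools import product

# Hypothetical toy model (all numbers are assumptions for illustration):
# two independent binary hidden causes h1, h2 and one binary observation v.
PRIOR = (0.1, 0.1)  # assumed P(h_i = 1) for each hidden cause


def likelihood(h1, h2):
    """Assumed P(v = 1 | h1, h2): v is likely if either cause is on."""
    return 0.9 if (h1 or h2) else 0.01


def exact_posterior():
    """Exact P(h1, h2 | v = 1), found by enumerating all four configurations."""
    joint = {}
    for h1, h2 in product((0, 1), repeat=2):
        p1 = PRIOR[0] if h1 else 1 - PRIOR[0]
        p2 = PRIOR[1] if h2 else 1 - PRIOR[1]
        joint[(h1, h2)] = p1 * p2 * likelihood(h1, h2)
    z = sum(joint.values())  # P(v = 1)
    return {h: p / z for h, p in joint.items()}


def factorized_approximation(post):
    """Product of one distribution per hidden cause (here: the exact marginals)."""
    q1 = sum(p for (h1, _), p in post.items() if h1 == 1)
    q2 = sum(p for (_, h2), p in post.items() if h2 == 1)
    approx = {}
    for h1, h2 in product((0, 1), repeat=2):
        f1 = q1 if h1 else 1 - q1
        f2 = q2 if h2 else 1 - q2
        approx[(h1, h2)] = f1 * f2
    return approx


posterior = exact_posterior()
approx = factorized_approximation(posterior)
for h in sorted(posterior):
    print(h, round(posterior[h], 3), round(approx[h], 3))
```

In this toy model the two causes compete to explain the observation, so the exact posterior puts little mass on both being on at once. The factorized form cannot represent that dependence and assigns that configuration noticeably more probability.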



• If we use the approximation for learning, there is no guarantee that learning will increase the probability that the model would generate the observed data.



• But there is a different function, called “variational free energy”, that is guaranteed to improve:

