Approximate inference
What if we use an approximation to the posterior
distribution over hidden configurations?
For example, assume the posterior factorizes into a product
of separate distributions, one for each hidden cause.
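
As a rough illustration of what a factorized posterior gives up, the sketch below builds a toy model that is not from the lecture: two binary hidden causes and one observed effect, linked by a noisy-OR. The prior and the noisy-OR parameters are made-up values, and the names (joint, posterior, factorized) are hypothetical. The exact posterior is found by enumerating all four hidden configurations, then compared with the product of its own marginals, which is one possible factorized approximation and not necessarily the best one.

```python
import itertools
import math

# Assumed toy model (illustrative values only):
# two binary hidden causes h1, h2 and one observed binary effect d = 1.
PRIOR_ON = 0.2  # assumed prior probability that each cause is on


def p_effect_given_causes(h1, h2):
    """Noisy-OR likelihood p(d = 1 | h1, h2) with made-up parameters."""
    return 1.0 - 0.95 * (0.1 ** h1) * (0.2 ** h2)


def joint(h1, h2):
    """Joint probability p(h1, h2, d = 1)."""
    p_h1 = PRIOR_ON if h1 else 1.0 - PRIOR_ON
    p_h2 = PRIOR_ON if h2 else 1.0 - PRIOR_ON
    return p_h1 * p_h2 * p_effect_given_causes(h1, h2)


configs = list(itertools.product([0, 1], repeat=2))

# Exact posterior over hidden configurations, by brute-force enumeration.
evidence = sum(joint(h1, h2) for h1, h2 in configs)
posterior = {c: joint(*c) / evidence for c in configs}

# Marginal posterior probability that each cause is on.
q1 = sum(p for (h1, _), p in posterior.items() if h1 == 1)
q2 = sum(p for (_, h2), p in posterior.items() if h2 == 1)


def bernoulli(q, h):
    """Probability of value h under a Bernoulli with p(on) = q."""
    return q if h else 1.0 - q


# Factorized approximation: one independent distribution per hidden cause.
factorized = {(h1, h2): bernoulli(q1, h1) * bernoulli(q2, h2)
              for h1, h2 in configs}

# KL(factorized || posterior) measures what the factorization loses.
kl = sum(q * math.log(q / posterior[c]) for c, q in factorized.items())

for c in configs:
    print(c, round(posterior[c], 4), round(factorized[c], 4))
print("KL(factorized || posterior) =", kl)
```

In this toy model the two causes compete to explain the same observation, so the exact posterior is not a product of independent distributions and the divergence is strictly positive. Enumeration is only feasible here because there are four configurations; the number grows exponentially with the number of hidden causes, which is the reason for approximating in the first place.
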
If we use the approximation for learning, there is no
guarantee that learning will increase the probability that
the model would generate the observed data.
But there is a different function, called the variational
free energy, that is guaranteed to improve: