Approximate inference
For models that use distributed non-linear
representations, it is intractable to compute the exact
posterior distribution over hidden configurations. So what
happens if we use a tractable approximation to the
posterior?
e.g. assume the posterior over hidden configurations
for each datavector factorizes into a product of
distributions for each separate hidden cause.
If we use this approximation for learning, there is no
guarantee that learning will increase the probability that
the model would generate the observed data.
But maybe we can find a different and sensible objective
function that is guaranteed to improve at each update of
the parameters.