• For models that use distributed non-linear representations, it is intractable to compute the exact posterior distribution over hidden configurations. So what happens if we use a tractable approximation to the posterior?
   – e.g. assume the posterior over hidden configurations for each datavector factorizes into a product of distributions for each separate hidden cause.
• If we use this approximation for learning, there is no guarantee that learning will increase the probability that the model would generate the observed data.
• But maybe we can find a different and sensible objective function that is guaranteed to improve at each update of the parameters.
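
To make the factorial approximation concrete, here is a minimal sketch on a toy model small enough to enumerate. Everything in it is an assumption for illustration and not taken from the slide: the tiny sigmoid belief net (parameters `b`, `W`, `c`), the datavector `v`, and the choice of objective. The objective used is the standard variational lower bound, the expected log joint under the factorial distribution Q plus the entropy of Q, which is one common candidate for the "different objective function" the last bullet asks for. The sketch checks that this bound never exceeds log p(v) and that the gap equals the KL divergence from Q to the exact posterior.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical toy model: 3 independent binary hidden causes, 4 binary visibles.
rng = np.random.default_rng(0)
n_hidden, n_visible = 3, 4
b = rng.normal(size=n_hidden)               # prior biases of the hidden causes
W = rng.normal(size=(n_visible, n_hidden))  # hidden-to-visible weights
c = rng.normal(size=n_visible)              # visible biases

def log_bernoulli(x, p):
    """Log probability of binary vector x under independent Bernoulli(p)."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def log_joint(v, h):
    """log p(v, h) = log p(h) + log p(v | h)."""
    return log_bernoulli(h, sigmoid(b)) + log_bernoulli(v, sigmoid(W @ h + c))

v = np.array([1.0, 0.0, 1.0, 1.0])          # an arbitrary observed datavector

# Exact quantities by brute-force enumeration (only feasible for tiny models).
configs = [np.array(h, dtype=float)
           for h in itertools.product([0, 1], repeat=n_hidden)]
log_joints = np.array([log_joint(v, h) for h in configs])
log_pv = np.logaddexp.reduce(log_joints)    # log p(v)
exact_posterior = np.exp(log_joints - log_pv)

def bound(q):
    """Variational bound: sum_h Q(h) [log p(v, h) - log Q(h)], Q factorial."""
    total = 0.0
    for h, lj in zip(configs, log_joints):
        log_qh = log_bernoulli(h, q)        # Q(h) is a product over hidden causes
        total += np.exp(log_qh) * (lj - log_qh)
    return total

def kl_to_posterior(q):
    """KL(Q || exact posterior), computed by enumeration."""
    total = 0.0
    for h, p in zip(configs, exact_posterior):
        log_qh = log_bernoulli(h, q)
        total += np.exp(log_qh) * (log_qh - np.log(p))
    return total

q = np.full(n_hidden, 0.5)                  # one factorial guess: each cause on w.p. 0.5
print("log p(v):", log_pv)
print("bound   :", bound(q))
print("KL gap  :", kl_to_posterior(q))
```

Because the bound equals log p(v) minus a non-negative KL term, raising it with respect to the parameters either raises log p(v) or shrinks the mismatch between Q and the exact posterior, even though log p(v) itself is not guaranteed to go up at every step.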