







After learning the first layer of weights:




If we freeze the generative weights that define the likelihood term and the recognition weights that define the distribution over hidden configurations, we get:




Maximizing the RHS is equivalent to maximizing the log probability of data that occurs with probability
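
As a hedged sketch of the argument in symbols: the notation here is an assumption, not taken from the text. Write $x$ for a data vector, $h^1$ for the first hidden layer, $\alpha$ for an index over hidden configurations, and $Q(h^1 \mid x)$ for the distribution defined by the recognition weights. Under those assumptions, the standard variational bound and its frozen-weights form can be written as:

```latex
% Assumed notation: x = data vector, h^1 = first hidden layer,
% \alpha = index over hidden configurations,
% Q = distribution defined by the recognition weights.
% Requires amsmath for align* and \text.

% Variational bound after learning the first layer of weights:
\begin{align*}
\log p(x) \;\ge\;
  & \sum_{\alpha} Q(h^1{=}\alpha \mid x)
    \bigl[\, \log p(h^1{=}\alpha) + \log p(x \mid h^1{=}\alpha) \,\bigr] \\
  & - \sum_{\alpha} Q(h^1{=}\alpha \mid x)\, \log Q(h^1{=}\alpha \mid x)
\end{align*}

% With p(x | h^1) (generative weights) and Q (recognition weights) frozen,
% the likelihood term and the entropy term no longer change, so the
% right-hand side becomes:
\begin{equation*}
\sum_{\alpha} Q(h^1{=}\alpha \mid x)\, \log p(h^1{=}\alpha) \;+\; \text{const}
\end{equation*}
```

Read this way, the only term left to improve is the prior over hidden configurations, $p(h^1)$. Maximizing $\sum_{\alpha} Q(h^1{=}\alpha \mid x) \log p(h^1{=}\alpha)$ is the same as maximum-likelihood learning of a model for hidden configurations $\alpha$ drawn with probability $Q(h^1{=}\alpha \mid x)$, which is what training the next layer on the first layer's hidden activities does.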

