Why the hidden configurations should be treated
as data when learning the next layer of weights
• After learning the first layer of weights:

  \log p(x) \;\ge\; \sum_h q(h|x)\,\big[\log p(h) + \log p(x|h)\big] \;-\; \sum_h q(h|x)\,\log q(h|x)

• If we freeze the generative weights that define the likelihood term, \log p(x|h), and the recognition weights that define the distribution q(h|x) over hidden configurations, the only part of the bound that can still change is:

  \sum_h q(h|x)\,\log p(h) \;+\; \text{const.}

• Maximizing the RHS is equivalent to maximizing the log prob of “data” h that occurs with probability q(h|x).
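The argument above can be checked numerically. Below is a minimal sketch (all probabilities are made-up illustrative numbers, not from the slide) of the variational bound for a tiny discrete model with two hidden configurations and three visible configurations: the bound never exceeds the true \log p(x), and once p(x|h) and q(h|x) are frozen, only the prior term \sum_h q(h|x) \log p(h) varies.

```python
import numpy as np

# Hypothetical tiny model: 2 hidden configs, 3 visible configs.
p_h = np.array([0.6, 0.4])                 # prior over hidden configurations, p(h)
p_x_given_h = np.array([[0.7, 0.2, 0.1],   # likelihood p(x|h); each row sums to 1
                        [0.1, 0.3, 0.6]])

x = 0                                      # an observed visible configuration
q = np.array([0.5, 0.5])                   # recognition distribution q(h|x)

# True log probability of the data: log sum_h p(h) p(x|h)
log_p_x = np.log(np.sum(p_h * p_x_given_h[:, x]))

# Variational bound: sum_h q(h|x)[log p(h) + log p(x|h)] + entropy of q
bound = (np.sum(q * (np.log(p_h) + np.log(p_x_given_h[:, x])))
         - np.sum(q * np.log(q)))

assert log_p_x >= bound                    # the bound cannot exceed log p(x)

# With p(x|h) and q(h|x) frozen, only this term depends on the prior p(h);
# raising it means raising the log prob of "data" h sampled with probability q(h|x).
prior_term = np.sum(q * np.log(p_h))
```

Improving the bound therefore reduces to a standard maximum-likelihood problem for the next layer, with hidden configurations sampled from q(h|x) playing the role of training data.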