An application to language modeling
[Figure labels: time steps t-1, t]
• Use the previous hidden state to transmit hundreds of bits of long-range semantic information (don’t try this with an HMM: an HMM with N hidden states can carry only log2(N) bits about the past).
– The hidden states are only trained to help model the current word, but this causes them to contain lots of useful semantic information.
• Optimize the CRBM to predict the conditional probability distribution for the most recent word.
– With 17,000 words and 1,000 hidden units this requires 52,000,000 parameters.
– The corresponding autoregressive model requires 578,000,000 parameters.
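
The slide does not say how the two parameter counts are obtained. The sketch below shows one decomposition that reproduces both figures, under assumptions that are not stated in the source: two context words (suggested by the time-step labels t-2, t-1, t), one word-to-hidden weight matrix per word position, one hidden-to-hidden matrix from the previous hidden state, and biases ignored. The function names are hypothetical.

```python
def crbm_param_count(vocab_size, n_hidden, n_context_words):
    """Count weights under the assumed CRBM layout (biases ignored)."""
    # Assumption: one word-to-hidden matrix per word position,
    # i.e. each context word plus the current word.
    word_to_hidden = (n_context_words + 1) * vocab_size * n_hidden
    # Assumption: one hidden-to-hidden matrix that carries the
    # previous hidden state forward in time.
    hidden_to_hidden = n_hidden * n_hidden
    return word_to_hidden + hidden_to_hidden


def autoregressive_param_count(vocab_size, n_context_words):
    """Count weights when each context word connects directly to the current word."""
    # Assumption: one vocab-by-vocab matrix per context word.
    return n_context_words * vocab_size * vocab_size


V, H, CONTEXT = 17_000, 1_000, 2
print(crbm_param_count(V, H, CONTEXT))         # 52000000
print(autoregressive_param_count(V, CONTEXT))  # 578000000
```

Under these assumptions the saving comes from routing every word through the 1,000 hidden units rather than through direct word-to-word tables, so the cost grows with vocabulary size times hidden size instead of vocabulary size squared.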
[Figure labels: time steps t-2, t-1, t]