lecture 25

An application to language modeling

[Figure: diagram labeled with time steps t-1 and t]

• Use the previous hidden state to transmit hundreds of bits of long-range semantic information (don’t try this with an HMM)
  – The hidden states are only trained to help model the current word, but this causes them to contain lots of useful semantic information.

• Optimize the CRBM to predict the conditional probability distribution for the most recent word.
  – With 17,000 words and 1000 hiddens this requires 52,000,000 parameters.
  – The corresponding autoregressive model requires 578,000,000 parameters.
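The two parameter counts can be reproduced with a short calculation. This is only a sketch under assumptions the slide does not state: the model conditions on the two previous words (as the t-2, t-1, t labels suggest), each word is a one-of-17,000 vector, the CRBM has word-to-hidden, context-to-hidden and previous-hidden-to-hidden weight matrices, and biases are ignored. The function names are hypothetical.

```python
def crbm_params(vocab, hidden, context_words=2):
    """Weight count for the assumed CRBM layout (biases ignored)."""
    word_to_hidden = vocab * hidden                      # current word <-> hiddens
    context_to_hidden = context_words * vocab * hidden   # previous words -> hiddens
    hidden_to_hidden = hidden * hidden                   # previous hiddens -> hiddens
    return word_to_hidden + context_to_hidden + hidden_to_hidden


def autoregressive_params(vocab, context_words=2):
    """Weight count when each previous word connects directly to the current word."""
    return context_words * vocab * vocab


crbm_total = crbm_params(vocab=17_000, hidden=1_000)
ar_total = autoregressive_params(vocab=17_000)
print(f"CRBM:           {crbm_total:,}")
print(f"Autoregressive: {ar_total:,}")
```

Under these assumptions the CRBM total is 17M + 34M + 1M weights, and the autoregressive total is 2 x 17,000 x 17,000, which match the figures on the slide. The saving comes from routing the context through 1000 hidden units instead of connecting every previous word directly to every current word.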

[Figure: diagram labeled with time steps t-2, t-1 and t]