Comparison with hidden Markov models
Our inference procedure is incorrect because it ignores
the future.
Our learning procedure is slightly wrong because the
inference is wrong and also because we use contrastive
divergence.
But the model is exponentially more powerful than an
HMM because it uses distributed representations.
Given N binary hidden units, it can use N bits of
information to constrain the future.
An HMM with N hidden states can only use log2 N bits
of history.
This is a huge difference if the data has any kind of
componential structure. It means we need far fewer
parameters than an HMM, so training is actually
easier, even though we do not have an exact
maximum likelihood algorithm.
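
The capacity argument above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the hidden units are assumed to be binary, and the HMM is assumed to have a dense transition matrix. It compares how many bits of history each model can carry, and how many HMM states (and transition entries) would be needed to match N binary hidden units.

```python
import math


def hmm_history_bits(num_states: int) -> float:
    """Bits of history an HMM can carry: it is in exactly one of
    num_states discrete states, so it carries log2(num_states) bits."""
    return math.log2(num_states)


def distributed_history_bits(num_hidden_units: int) -> int:
    """Bits carried by a distributed code of binary hidden units:
    each unit contributes one bit (assumption: units are binary)."""
    return num_hidden_units


def hmm_states_needed(bits: int) -> int:
    """Number of HMM states needed to carry the given number of bits."""
    return 2 ** bits


def hmm_transition_entries(num_states: int) -> int:
    """Entries in a dense HMM transition matrix (assumption: no sparsity)."""
    return num_states * num_states


if __name__ == "__main__":
    for n in (10, 20, 100):
        states = hmm_states_needed(distributed_history_bits(n))
        print(
            f"N={n}: distributed code carries {distributed_history_bits(n)} bits; "
            f"an HMM with N states carries {hmm_history_bits(n):.2f} bits; "
            f"matching HMM needs {states} states and "
            f"{hmm_transition_entries(states)} transition entries"
        )
```

The number of HMM states needed grows as 2^N, and the dense transition matrix as 4^N, which is the sense in which the distributed model needs far fewer parameters.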