Comparison with hidden Markov models
The inference procedure is incorrect because it ignores
the future.
The learning procedure is wrong because it relies on
the incorrect inference and also because contrastive
divergence only approximates the maximum likelihood
gradient.
But the model is exponentially more powerful than an
HMM because it uses distributed representations.
Given N hidden units, it can use N bits of information
to constrain the future. An HMM with N hidden states
can use only log2 N bits.
This is a huge difference if the data has any kind of
componential structure. It means we need far fewer
parameters than an HMM, so training is not much
slower, even though we do not have an exact
maximum likelihood algorithm.
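
The bit-counting argument above can be made concrete with a small sketch. It compares how many hidden states an HMM would need to carry the same number of bits as N binary hidden units, and what that implies for the size of the transition parameters. The function names are hypothetical, and the parameter count for the distributed model assumes a dense N x N hidden-to-hidden weight matrix, which the slide does not specify.

```python
import math


def hmm_bits(num_states: int) -> float:
    """Bits carried by one discrete hidden variable with num_states values."""
    return math.log2(num_states)


def hmm_states_needed(bits: int) -> int:
    """Number of HMM states required to carry the given number of bits."""
    return 2 ** bits


def hmm_transition_params(num_states: int) -> int:
    """Entries in a full state-to-state transition matrix."""
    return num_states * num_states


def distributed_transition_params(num_units: int) -> int:
    """Assumption: dense hidden-to-hidden weights between N binary units."""
    return num_units * num_units


if __name__ == "__main__":
    n = 10  # number of binary hidden units, i.e. bits of state
    states = hmm_states_needed(n)
    print("HMM states needed:", states)
    print("HMM transition parameters:", hmm_transition_params(states))
    print("Distributed transition parameters:", distributed_transition_params(n))
```

With 10 binary hidden units, a matching HMM needs 1024 states and about a million transition entries, against 100 weights under the dense-matrix assumption. The gap widens exponentially as N grows.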