• Our inference procedure is incorrect because it ignores the future.
• Our learning procedure is slightly wrong because the inference is wrong and also because we use contrastive divergence.
• But the model is exponentially more powerful than an HMM because it uses distributed representations.
    – Given N hidden units, it can use N bits of information to constrain the future.
    – An HMM with N hidden states can only use log2 N bits of history.
    – This is a huge difference if the data has any kind of componential structure. It means we need far fewer parameters than an HMM, so training is actually easier, even though we do not have an exact maximum likelihood algorithm.
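The capacity argument in the last bullet can be made concrete with a little arithmetic. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the hidden units are assumed to be binary, and the parameter counts assume a dense K x K transition matrix for the HMM and a dense N x N hidden-to-hidden weight matrix for the distributed model, ignoring biases and connections to the visible units. None of these details come from the slide itself.

```python
import math


def hmm_history_bits(num_states: int) -> float:
    """Bits of history one discrete hidden state out of K can carry: log2(K)."""
    return math.log2(num_states)


def distributed_history_bits(num_units: int) -> int:
    """Bits carried by N binary hidden units: one bit per unit."""
    return num_units


def hmm_states_to_match(num_units: int) -> int:
    """HMM states needed to carry as many bits as N binary units: 2**N."""
    return 2 ** num_units


def hmm_transition_params(num_states: int) -> int:
    """Entries in a dense K x K transition matrix (assumed parameterisation)."""
    return num_states ** 2


def distributed_transition_params(num_units: int) -> int:
    """Entries in a dense N x N hidden-to-hidden weight matrix (assumed)."""
    return num_units ** 2


# Compare the two models at a few hidden-layer sizes.
for n in (10, 20, 30):
    k = hmm_states_to_match(n)
    print(
        f"N={n}: distributed model uses {distributed_transition_params(n)} "
        f"temporal weights; an HMM with the same capacity needs {k} states "
        f"and {hmm_transition_params(k)} transition entries"
    )
```

Under these assumptions, matching 10 binary hidden units already takes an HMM with 1024 states, and the gap in transition parameters grows exponentially with N, which is the sense in which the distributed model needs far fewer parameters.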