• The inference procedure is incorrect because it ignores the future.
• The learning procedure is wrong because the inference is wrong and also because we use contrastive divergence.
• But the model is exponentially more powerful than an HMM because it uses distributed representations.
  – Given N hidden units, it can use N bits of information to constrain the future. An HMM with N hidden states uses only log2 N bits.
  – This is a huge difference if the data has any kind of componential structure. It means we need far fewer parameters than an HMM, so training is not much slower, even though we do not have an exact maximum likelihood algorithm.
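
The capacity argument in the bullets above can be checked with a little arithmetic. The sketch below is illustrative only and is not taken from the slides. It assumes binary hidden units, a dense state-to-state transition matrix for the HMM, and a dense hidden-to-hidden weight matrix for the distributed model, and it ignores emission parameters, visible weights, and biases. All function names are hypothetical.

```python
import math


def bits_distributed(n_hidden_units: int) -> int:
    """N binary hidden units have 2**N joint configurations, i.e. N bits."""
    return n_hidden_units


def bits_hmm(n_states: int) -> float:
    """An HMM is in exactly one of its states, so it carries log2(n_states) bits."""
    return math.log2(n_states)


def hmm_states_to_match(n_hidden_units: int) -> int:
    """Number of HMM states needed to carry as many bits as N binary units."""
    return 2 ** n_hidden_units


def hmm_transition_params(n_states: int) -> int:
    """Assumption: a dense n_states x n_states transition matrix."""
    return n_states * n_states


def distributed_temporal_params(n_hidden_units: int) -> int:
    """Assumption: a dense hidden-to-hidden weight matrix, N x N."""
    return n_hidden_units * n_hidden_units


if __name__ == "__main__":
    for n in (10, 20, 30):
        states = hmm_states_to_match(n)
        print(
            f"N={n}: distributed model uses {bits_distributed(n)} bits "
            f"with {distributed_temporal_params(n)} temporal weights; "
            f"an HMM needs {states} states "
            f"and {hmm_transition_params(states)} transition entries"
        )
```

Under these assumptions, matching the N bits of the distributed model takes an HMM with 2^N states, so its transition matrix grows as 2^(2N) while the distributed model's temporal weights grow as N^2.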