Why does the shortcut work?
If we start at the data, the Markov chain wanders away
from the data and towards things that it likes more. We
can see what direction it is wandering in after only a few
steps. It’s a big waste of time to let it go all the way to
equilibrium.
All we need to do is lower the probability of the
“confabulations” it produces and raise the probability
of the data. Then it will stop wandering away.
The learning cancels out once the confabulations and the
data have the same distribution.
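The shortcut can be sketched as a single weight update. This is only a minimal illustration, assuming a binary restricted Boltzmann machine with no bias terms and one step of the chain; the function name `cd1_update`, the weight matrix `W`, and the learning rate `lr` are hypothetical names chosen here, not taken from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v_data, rng, lr=0.1):
    """One shortcut update for a bias-free binary RBM (illustrative assumption).

    W      : (n_visible, n_hidden) weight matrix
    v_data : (n_cases, n_visible) batch of binary data vectors
    rng    : numpy random Generator used to sample hidden states
    """
    n_cases = v_data.shape[0]

    # Start the chain at the data: hidden probabilities given the data.
    h_prob = sigmoid(v_data @ W)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)

    # Take one step: the "confabulation" (reconstruction) the model produces.
    v_confab = sigmoid(h_sample @ W.T)
    h_confab = sigmoid(v_confab @ W)

    # Raise the probability of the data, lower that of the confabulations.
    positive = v_data.T @ h_prob / n_cases
    negative = v_confab.T @ h_confab / n_cases

    # The two terms cancel when data and confabulations have the same statistics.
    return W + lr * (positive - negative)

rng = np.random.default_rng(0)
v = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0]], dtype=float)
W_new = cd1_update(np.zeros((4, 3)), v, rng)
```

Starting from zero weights, the update raises the weights of visible units that are on more often in the data than in the confabulations, and lowers the weights of those that are on less often.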
We need to worry about regions of the data-space that
the model likes but which are very far from any data.
These regions make the normalization term (the partition
function) big, and we cannot sense them if we use the shortcut,
because a chain started at the data does not reach them in
only a few steps.
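A toy calculation may make the worry concrete. The ten-state space, the energy values, and the choice of states 1 and 2 as "near the data" and state 8 as "far from the data" are all made-up assumptions for illustration, not anything from the slides.

```python
import numpy as np

def partition_function(energy):
    # Normalization term: sum of unnormalized probabilities over all states.
    return np.sum(np.exp(-energy))

# Hypothetical 10-state space; the model likes states 1 and 2 (near the data).
energy = np.full(10, 5.0)
energy[[1, 2]] = 0.0
Z_without = partition_function(energy)
p_data_without = np.exp(-energy[[1, 2]]).sum() / Z_without

# Same model, but it also likes state 8, which is far from any data.
energy_far = energy.copy()
energy_far[8] = -3.0
Z_with = partition_function(energy_far)
p_data_with = np.exp(-energy_far[[1, 2]]).sum() / Z_with

print(Z_without, Z_with)
print(p_data_without, p_data_with)
```

The far-away state inflates the normalization term and takes probability away from the data, yet a chain that starts at states 1 or 2 and runs for only a few local steps would be unlikely to visit state 8, so the shortcut gets no signal to fix it.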