The shortcut
• Instead of taking the negative samples from the equilibrium
distribution, use slight corruptions of the datavectors. Only add random
momentum once, and only follow the dynamics for a few steps.
– Much less variance because a datavector and its confabulation
form a matched pair.
– Gives a very biased estimate of the gradient of the log likelihood.
– Gives a good estimate of the gradient of the contrastive divergence
(i.e. the amount by which F falls during the brief HMC.)
• It's very hard to say anything about what this method does to the log
likelihood because it only looks at rivals in the vicinity of the data.
• It's hard to say exactly what this method does to the contrastive
divergence because the Markov chain defines what we mean by
“vicinity”, and the chain keeps changing as the parameters change.
– But it works well empirically, and it can be proved to work well in
some very simple cases.
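
The shortcut above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the method as implemented in any particular system: the toy free energy F(x) = 0.5 * sum_i w_i * x_i^2, the function names, the step size and the step count are all hypothetical, and the HMC accept/reject test is left out for brevity.

```python
import numpy as np

# Hypothetical toy model (an assumption for illustration only):
# F(x; w) = 0.5 * sum_i w_i * x_i^2, i.e. a zero-mean Gaussian with precisions w.
def free_energy(x, w):
    return 0.5 * np.sum(w * x**2, axis=-1)

def grad_x(x, w):
    # dF/dx, the force term used by the dynamics
    return w * x

def grad_w(x, w):
    # dF/dw, evaluated separately for every datavector
    return 0.5 * x**2

def brief_hmc(x0, w, rng, n_steps=3, eps=0.1):
    """Start at the datavectors, add random momentum once, and follow
    leapfrog dynamics for a few steps (no accept/reject step here)."""
    x = x0.copy()
    p = rng.standard_normal(x.shape)   # random momentum, drawn only once
    p -= 0.5 * eps * grad_x(x, w)      # initial half step for the momentum
    for step in range(n_steps):
        x += eps * p                   # full step for the position
        if step < n_steps - 1:
            p -= eps * grad_x(x, w)    # full step for the momentum
    return x                           # the confabulations

def cd_gradient(data, w, rng, n_steps=3, eps=0.1):
    """Contrastive divergence estimate: dF/dw at each datavector minus
    dF/dw at its matched confabulation, averaged over the batch."""
    confab = brief_hmc(data, w, rng, n_steps, eps)
    grad = np.mean(grad_w(data, w) - grad_w(confab, w), axis=0)
    return grad, confab

# Usage: descend the estimate so that F falls on the data relative to
# the confabulations.
rng = np.random.default_rng(0)
data = 2.0 * rng.standard_normal((5000, 2))   # toy data with std 2
w = np.ones(2)
for _ in range(200):
    grad, _ = cd_gradient(data, w, rng)
    w -= 0.05 * grad
```

Because each confabulation starts from its own datavector, the two terms in the difference are strongly correlated, which is where the variance reduction in the matched-pair bullet comes from. In this toy Gaussian case the expected update vanishes when each w_i matches the inverse variance of the data, which gives one example of a very simple case where the method can be checked by hand.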