Stochastic MDL using the wrong distribution
over codes
If we want to communicate the code for a datavector, the
most efficient method requires us to pick a code
randomly from the posterior distribution over codes.
This is easy if there is only a small number of possible
codes. It is also easy if the posterior distribution has a
nice form (like a Gaussian or a factored distribution).
But what should we do if the posterior is intractable?
This is typical for non-linear distributed representations.
We do not have to use the most efficient coding scheme!
If we use a suboptimal scheme we will get a bigger
description length.
The bigger description length is an upper bound on the minimal
description length.
Minimizing this bound is a sensible thing to do.
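A small numerical sketch may make the bound concrete. It assumes the usual accounting for stochastic codes, which the slide does not spell out: if codes are picked from a distribution Q, the expected description length is the expected energy under Q minus the entropy of Q, where the energy of a code is E(c) = -log p(c) - log p(d|c). The three energies below are hypothetical toy values, not taken from any model.

```python
import math

def description_length(q, energies):
    """Expected cost in nats when codes are picked from q.

    Assumed accounting: sum_c q(c) * E(c) minus the entropy of q.
    """
    expected_energy = sum(qc * e for qc, e in zip(q, energies))
    entropy = -sum(qc * math.log(qc) for qc in q if qc > 0)
    return expected_energy - entropy

# Hypothetical energies E(c) = -log p(c) - log p(d|c) for three possible codes.
energies = [1.0, 2.0, 4.0]

# The true posterior over codes is the Boltzmann distribution of the energies.
z = sum(math.exp(-e) for e in energies)
posterior = [math.exp(-e) / z for e in energies]
minimal = -math.log(z)  # minimal description length, -log p(d)

# A "wrong" distribution over codes: uniform.
uniform = [1 / 3, 1 / 3, 1 / 3]

dl_posterior = description_length(posterior, energies)
dl_uniform = description_length(uniform, energies)

# The excess cost of the wrong distribution is KL(uniform || posterior).
kl = sum(u * math.log(u / p) for u, p in zip(uniform, posterior))

print(f"minimal        : {minimal:.4f} nats")
print(f"using posterior: {dl_posterior:.4f} nats")
print(f"using uniform  : {dl_uniform:.4f} nats (excess {kl:.4f})")
```

Picking from the true posterior reaches the minimum exactly. Any other distribution costs more, by the KL divergence from that distribution to the posterior, which is why the bigger description length is a bound that is safe to minimize.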
So replace the true posterior distribution by a simpler
distribution.
The simpler distribution is typically a factored one.
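As an illustration of the factored case, the sketch below uses two binary code units whose true posterior is strongly correlated, so that no factored distribution can match it. The energies are hypothetical toy values, and the grid search is only a stand-in for whatever optimizer would be used in practice. The cost uses the same assumed accounting as before: expected energy minus entropy.

```python
import math
from itertools import product

# Hypothetical energies E(c1, c2) for two binary code units.
# The low-energy codes are (0, 0) and (1, 1), so the units are correlated.
energies = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 4.0, (1, 1): 1.0}

def factored_cost(a, b):
    """Cost in nats when unit 1 is on with probability a and unit 2
    with probability b, chosen independently of each other."""
    cost = 0.0
    for (c1, c2), e in energies.items():
        q = (a if c1 else 1 - a) * (b if c2 else 1 - b)
        if q > 0:
            cost += q * (e + math.log(q))  # q*E contributes energy, q*log q minus entropy
    return cost

# Minimal description length, reached only by the true (correlated) posterior.
minimal = -math.log(sum(math.exp(-e) for e in energies.values()))

# Crude grid search for the best factored distribution.
grid = [i / 100 for i in range(1, 100)]
best_cost, best_a, best_b = min(
    (factored_cost(a, b), a, b) for a, b in product(grid, grid)
)

print(f"minimal description length: {minimal:.4f} nats")
print(f"best factored distribution: {best_cost:.4f} nats at a={best_a}, b={best_b}")
```

Even the best factored distribution pays more than the minimum here, because it cannot represent the correlation between the two units. The gap is the price of tractability.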