Unsupervised “pre-training” also helps for
models that have more data and better priors
Ranzato et al. (NIPS 2006) used an additional
600,000 distorted digits.
They also used convolutional multilayer neural
networks that have some built-in, local
translational invariance.
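A minimal PyTorch sketch (not the authors' code; filter count and sizes are assumptions) of why convolution plus pooling gives some built-in, local translational invariance: shifting the input by one pixel barely changes the pooled feature map.

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(1, 8, kernel_size=5, padding=2)   # shared local filters
    pool = nn.MaxPool2d(kernel_size=2)                 # pooling discards exact position

    x = torch.randn(1, 1, 28, 28)                      # a 28x28 "digit"
    x_shifted = torch.roll(x, shifts=1, dims=-1)       # same digit, shifted one pixel right

    f = pool(conv(x))
    f_shifted = pool(conv(x_shifted))
    print((f - f_shifted).abs().mean())                # small: pooled features barely move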
MNIST test error rates:
Back-propagation alone:                              0.49%
Unsupervised layer-by-layer pre-training
followed by backprop:                                0.39% (record)
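A minimal sketch of the two-stage recipe behind the second result: greedy layer-by-layer unsupervised pre-training, then supervised backprop fine-tuning. This is a generic stacked-autoencoder version of the idea, not Ranzato et al.'s actual convolutional model; the layer widths, optimizer, and toy data below are assumptions for illustration only.

    import torch
    import torch.nn as nn

    sizes = [784, 500, 500, 2000]          # assumed layer widths
    layers = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

    def pretrain_layer(layer, data, epochs=1, lr=1e-3):
        """Train one layer, unsupervised, to reconstruct its own input."""
        decoder = nn.Linear(layer.out_features, layer.in_features)
        opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
        for _ in range(epochs):
            for x in data:
                recon = decoder(torch.sigmoid(layer(x)))
                loss = ((recon - x) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()

    def pretrain_stack(layers, data):
        """Greedy layer-by-layer pre-training, feeding each layer's codes to the next."""
        for layer in layers:
            pretrain_layer(layer, data)
            with torch.no_grad():                       # push the data up one level
                data = [torch.sigmoid(layer(x)) for x in data]

    def finetune(layers, labelled_data, epochs=1, lr=1e-4):
        """Supervised backprop through the whole pre-trained stack plus a classifier."""
        net = nn.Sequential(*sum([[l, nn.Sigmoid()] for l in layers], []),
                            nn.Linear(sizes[-1], 10))
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in labelled_data:
                loss = loss_fn(net(x), y)
                opt.zero_grad(); loss.backward(); opt.step()
        return net

    # Toy random batches standing in for the (distorted) MNIST digits:
    data = [torch.rand(64, 784) for _ in range(5)]
    labelled = [(x, torch.randint(0, 10, (64,))) for x in data]
    pretrain_stack(layers, data)
    net = finetune(layers, labelled)

The pre-training stage uses no labels, so it can exploit all of the extra distorted digits; backprop then only has to fine-tune weights that already encode useful features.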