Self-supervised backprop and PCA
If the hidden and output layers are linear, the network learns hidden units that are a linear function of the data and minimize the squared reconstruction error.
The m hidden units will span the same space as
the first m principal components
Their weight vectors may not be orthogonal
They will tend to have equal variances (see the sketch below)
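A minimal numerical sketch of the subspace claim, not from the original slides: it assumes NumPy, synthetic data, and plain gradient descent on a two-layer linear autoencoder. It trains the encoder/decoder weights (`W_enc`, `W_dec` are names chosen here), then compares the subspace spanned by the hidden units' weight vectors with the subspace of the first m principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 points in 10-D whose variance is concentrated
# in a 3-dimensional subspace, plus a little isotropic noise.
n, d, m = 500, 10, 3
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, d)) + 0.1 * rng.normal(size=(n, d))
X -= X.mean(axis=0)                       # centre the data

# Linear autoencoder: hidden = X @ W_enc, reconstruction = hidden @ W_dec.
W_enc = 0.1 * rng.normal(size=(d, m))
W_dec = 0.1 * rng.normal(size=(m, d))
lr = 0.01

for step in range(5000):
    H = X @ W_enc                         # linear hidden units
    X_hat = H @ W_dec                     # linear output units
    err = X_hat - X                       # reconstruction error
    # Gradients of the mean squared reconstruction error (backprop
    # through two linear layers).
    g_dec = H.T @ err / n
    g_enc = X.T @ (err @ W_dec.T) / n
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# First m principal directions from the SVD of the centred data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pcs = Vt[:m].T                            # d x m, columns are principal directions

def proj(A):
    """Orthogonal projector onto the column space of A."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

# Near zero: the hidden units' weight vectors span (approximately)
# the same subspace as the first m principal components.
print("subspace gap:", np.abs(proj(W_enc) - proj(pcs)).max())

# Generally not diagonal: the weight vectors need not be orthogonal to
# each other, and the hidden units are not ordered by variance the way
# principal components are.
print("encoder Gram matrix:\n", W_enc.T @ W_enc)
print("hidden-unit variances:", (X @ W_enc).var(axis=0))
```

The gradient expressions are just backprop through two linear layers; an autodiff library would give the same updates. The printed Gram matrix is typically not the identity, which illustrates that the solution matches PCA only up to an invertible linear transformation within the subspace.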