The square, noise-free case
We eliminate the noise model for each data component,
and we use the same number of factors as data
components.
Given the weight matrix, there is now a one-to-one
mapping between data vectors and hidden activity
vectors.
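
A minimal sketch of this one-to-one mapping, assuming NumPy and a hypothetical invertible 3x3 weight matrix `W` that maps hidden activities `s` to data vectors `x` (the names and values are illustrative and do not come from the source):

```python
import numpy as np

# Hypothetical square weight matrix: maps hidden activities s to data x = W @ s.
W = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.1, 0.0, 1.5]])

def hidden_to_data(W, s):
    """Generate the data vector produced by hidden activities s."""
    return W @ s

def data_to_hidden(W, x):
    """Recover the unique hidden activities for x by solving W s = x.

    This only works when W is square and invertible.
    """
    return np.linalg.solve(W, x)

s = np.array([0.2, -1.0, 0.7])
x = hidden_to_data(W, s)
s_recovered = data_to_hidden(W, x)  # matches s up to rounding error
```

With no noise and as many factors as data components, no inference over hidden states is needed: each data vector has exactly one hidden explanation.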
To make the data probable we want two things:
The hidden activity vectors that correspond to data
vectors should have high prior probabilities.
The mapping from hidden activities to data vectors
should compress the hidden density to get high density
in the data space, i.e. the matrix that maps hidden
activities to data vectors should have a small
determinant (in absolute value). Equivalently, its
inverse should have a large determinant.
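
The two requirements above correspond to the two terms of the change-of-variables formula, log p(x) = log p_s(W^-1 x) - log|det W|. The sketch below illustrates this, assuming NumPy, a hypothetical 2x2 weight matrix `W`, and a unit-scale Laplace prior on each hidden activity; the prior is an assumption chosen only for illustration, since the source does not specify one.

```python
import numpy as np

def log_prior(s):
    # Assumed factorial prior: independent unit-scale Laplace on each hidden unit.
    return np.sum(-np.abs(s) - np.log(2.0))

def log_data_density(W, x):
    """Log density of x under x = W s, with W square and invertible.

    log p(x) = log p_s(W^{-1} x) - log|det W|
    """
    s = np.linalg.solve(W, x)               # the unique hidden vector for x
    sign, logabsdet = np.linalg.slogdet(W)  # stable log|det W|
    if sign == 0:
        raise ValueError("W is singular, so the mapping is not one-to-one")
    return log_prior(s) - logabsdet

# Hypothetical weights and hidden activities.
W = np.array([[2.0, 0.5],
              [0.0, 1.0]])
s = np.array([0.3, -0.2])

# Same hidden vector, two mappings: the second squeezes the hidden density
# into a quarter of the area, so the density in data space is higher.
lp_wide = log_data_density(W, W @ s)
lp_narrow = log_data_density(0.5 * W, (0.5 * W) @ s)
```

Here `lp_narrow - lp_wide` equals 2 log 2, which is exactly the change in -log|det W|. For a fixed data vector the two terms pull against each other: shrinking `W` raises the determinant term but pushes `W^-1 x` further out, where the prior is lower.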