Supervised Maximum Likelihood Learning
Finding a set of weights, W, that minimizes the
squared errors is exactly the same as finding a W
that maximizes the log probability that the model
would produce the desired outputs on all the
training cases.
We implicitly assume that zero-mean Gaussian
noise is added to the model’s actual output.
We do not need to know the variance of the
noise because we are assuming it’s the same
in all cases. So it just scales the squared error.