Why we maximize sums of log probabilities
We want to maximize the product of the probabilities
that the model assigns to the target outputs on the
training cases.
Assume the output errors on different training cases,
c, are independent, so the joint probability of all the
outputs factorizes into a product over c.
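Written as a formula, the quantity being maximized might look like the following. The symbols are assumptions chosen for illustration and do not appear in the text above: W stands for the model's weights and d_c for the target output on training case c.

```latex
\[
  p(d \mid W) \;=\; \prod_{c} p(d_c \mid W)
\]
```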
Because the log function is monotonically increasing,
it does not change where the maxima are. So we can
maximize the sum of the log probabilities instead.
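A minimal sketch of this argument in Python, under assumed conditions: a toy Bernoulli model with one parameter `theta`, and a made-up list of binary targets (neither comes from the text above). It checks that the product of the per-case probabilities and the sum of their logs peak at the same parameter value. As a side observation, it also shows that the raw product underflows to zero on many cases while the log sum stays finite.

```python
import math

# Hypothetical data: binary targets for 10 independent training cases.
targets = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def case_probability(theta, d):
    """Probability a Bernoulli model with parameter theta assigns to target d."""
    return theta if d == 1 else 1.0 - theta


def likelihood(theta, data):
    """Product of the per-case probabilities (requires Python 3.8+ for math.prod)."""
    return math.prod(case_probability(theta, d) for d in data)


def log_likelihood(theta, data):
    """Sum of the per-case log probabilities."""
    return sum(math.log(case_probability(theta, d)) for d in data)


# Candidate parameter values, strictly inside (0, 1) so every log is finite.
grid = [i / 100 for i in range(1, 100)]

# The log is monotonically increasing, so both objectives pick the same theta.
best_by_product = max(grid, key=lambda t: likelihood(t, targets))
best_by_log_sum = max(grid, key=lambda t: log_likelihood(t, targets))

# With 5000 cases the raw product underflows, but the log sum does not.
many_targets = targets * 500
product_many = likelihood(0.7, many_targets)
log_sum_many = log_likelihood(0.7, many_targets)

print(best_by_product, best_by_log_sum)
print(product_many, log_sum_many)
```

The grid search is only there to make the comparison explicit; in practice the log sum would be maximized with a gradient-based method.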