The most important tool for understanding training methods is the "mountain landscape" analogy. Let's play with it some more.

After training a neural net to read handwritten digits, with a softmax output group, we use it to read new digits as follows: we provide the image as input, we see which of the 10 possible outputs has the greatest probability, and we declare that output to be the network's guess. No matter what the details of the training method are, what we care about most is that it gives the right answer for as many digits as possible, using the above procedure. Here's an alternative loss function (a.k.a. error measure / error function) for training: simply the number of incorrectly classified training cases. After all, that's closest to what we really care about. If we train with that loss function, what does the "mountain landscape" for optimization look like, and why won't training work? Explain in terms of the "mountain landscape" analogy.

Answer notes: the landscape consists of flat plateaus separated by sudden steps (a tiny change in the weights almost never changes how many cases are misclassified), so the slope is zero almost everywhere and gradient descent gets no signal about which way to move.

The momentum method includes a multiplication of the velocity by 0.9 (or so), every iteration. What's the mountain landscape analogue of that? If we omit it, what's the result in mountain landscape terms, and why is that bad?

Answer notes: the factor of 0.9 acts like friction (viscosity), gradually draining the ball's energy. Without it, the "ball" will reach the same height at the other side of the valley and keep rolling back and forth instead of settling at the bottom.
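For the first exercise, here is a minimal sketch (a toy 1-D logistic classifier on made-up data, not the digit network; all names and numbers below are illustrative assumptions) of why counting misclassified cases gives a flat landscape: nudging a parameter slightly almost never flips any classification, so a finite-difference estimate of the slope comes out as zero, whereas a smooth loss such as cross-entropy responds to the same nudge.

```python
import numpy as np

# Toy data: 1-D inputs with noisy binary labels (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + rng.normal(scale=0.5, size=100) > 0).astype(float)

def zero_one_loss(w, b):
    """Number of misclassified cases: the proposed loss function."""
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return float(np.sum((p > 0.5) != y))

def cross_entropy(w, b):
    """A smooth loss, for comparison."""
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Finite-difference slope with respect to the bias, at an arbitrary point.
w, b, eps = 0.7, -0.2, 1e-3
print("0-1 loss slope:      ", (zero_one_loss(w, b + eps) - zero_one_loss(w, b)) / eps)
print("cross-entropy slope: ", (cross_entropy(w, b + eps) - cross_entropy(w, b)) / eps)
# The first slope is (almost surely) exactly 0: the landscape is flat plateaus,
# so gradient-based training gets no information about which way to go.
```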
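And for the momentum exercise, a minimal sketch (a hypothetical 1-D quadratic valley f(w) = w**2 / 2, with made-up learning rate and starting point) contrasting one common form of the momentum update with the 0.9 factor against the same update with the factor removed. With the factor, the ball's energy drains away and it settles at the bottom; without it, the ball keeps climbing back to roughly its starting height on the other side of the valley and never comes to rest at the minimum.

```python
def run_momentum(damping, steps=400, lr=0.05, w=3.0):
    """Gradient descent with momentum on the 1-D valley f(w) = w**2 / 2."""
    v = 0.0
    late_peak = 0.0
    for t in range(steps):
        grad = w                        # derivative of f(w) = w**2 / 2
        v = damping * v - lr * grad     # 'damping' is the multiplication by 0.9
        w = w + v
        if t > steps // 2:              # how high does the ball still climb, late in training?
            late_peak = max(late_peak, 0.5 * w * w)
    return late_peak

print("late peak height, with the 0.9 factor:", run_momentum(0.9))  # ~0: the ball has settled
print("late peak height, factor omitted:     ", run_momentum(1.0))  # ~4.5: roughly the starting height
```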