Almost all neural networks are trained using derivatives, as was explained in the video. Why do we use a derivative to decide how to change the weights? Do you think babies use derivatives when they're learning things?

Let's talk about training a linear output neuron in a situation where there is only one training case. Remember that the weight update is computed by taking the learning rate and multiplying it by the gradient of the error function on that one training case. Three questions:
- What happens to the amount of error if we use a learning rate of zero?
- What if we use a learning rate that's much too large?
- What if we use a learning rate that's just right - will that take us to the best weight vector in one step?

A perceptron stops changing its weights once it gets every training case right, and if it can't find a weight vector that accomplishes that, it never stops changing its weights. What condition(s) must be met before a linear neuron with a single training case will stop changing its weights?

Now let's talk about the "batch" learning rule, where there are multiple training cases and the weight update is the learning rate multiplied by the gradient of the error function, as computed on the entire training set (with a sum over cases). Same questions as before:
- What happens to the amount of error if we use a learning rate of zero?
- What if we use a learning rate that's much too large?
- What if we use a learning rate that's just right - will that take us to the best weight vector in one step?

What condition(s) must be met before a linear neuron with batch learning will stop changing its weights?

There's also the "online" learning rule. Here, the error function (and its gradient) is computed on only one training case from the training set, but on every iteration of the learning procedure it's a different training case (we cycle through the training set, just as a perceptron does).
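If you'd like to experiment with these learning-rate questions numerically rather than just on paper, here is a minimal sketch of the three update rules for one linear neuron with squared error. The function names, data, and learning rates are my own illustration, not from the lecture:

```python
import numpy as np

# One linear neuron; per-case squared error E = 0.5 * (t - w.x)^2.

def grad(w, x, t):
    """Gradient of the squared error on a single case with respect to w."""
    return -(t - w @ x) * x

def single_case_step(w, x, t, lr):
    """One update using only one training case."""
    return w - lr * grad(w, x, t)

def batch_step(w, X, T, lr):
    """One update using the gradient summed over the whole training set."""
    return w - lr * sum(grad(w, x, t) for x, t in zip(X, T))

def online_sweep(w, X, T, lr):
    """Cycle through the training set once, updating after each case."""
    for x, t in zip(X, T):
        w = single_case_step(w, x, t, lr)
    return w

# --- Single training case: try a zero, a too-large, and a well-chosen rate.
x, t = np.array([2.0, 0.0]), 6.0
for lr in (0.0, 1.0, 1.0 / (x @ x)):
    w = single_case_step(np.zeros(2), x, t, lr)
    print(f"lr={lr:.2f}: w={w}, error={0.5 * (t - w @ x) ** 2:.1f}")

# --- Two training cases: compare one batch step with one online sweep.
X, T = np.array([[1.0, 2.0], [3.0, 1.0]]), np.array([4.0, 5.0])
print("batch: ", batch_step(np.zeros(2), X, T, 0.05))
print("online:", online_sweep(np.zeros(2), X, T, 0.05))
```

Trying each learning rate by hand before running the script, and then checking your predictions against its output, is a good way to work through the questions above.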
Can an update that's computed on training case #1 make the output on training case #2 worse? Can it make the prediction on training case #1 itself worse? What condition(s) must be met before a linear neuron with online learning will stop changing its weights?

Let's talk about training a linear neuron with two input dimensions and two training cases. For now, let's say there's no bias, to simplify things. The first training case has input values (2, 0) and intended output 6. In 2D weight space, draw a vector for that training case, and indicate which weight vectors would produce exactly the right output for it. The second training case has input values (0, 10) and intended output 20. Indicate which weight vectors would produce the right answer for this case. Are there weight vectors that give the right answer for both cases? Roughly sketch the contour lines of the error function; as mentioned in the video, they'll be elliptical.

Remember that we were discussing whether perceptrons can ever have a feasible region but not a generously feasible region. It turns out that this is related to whether we've appended a 1 to every input vector (as a way to implement the bias). Try to figure out how that's related.
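After sketching the weight-space picture by hand, you can check your answer numerically. The short script below (my own illustration; the `error` helper is a hypothetical name) solves for the weight vector that satisfies both training cases and evaluates the total squared error, whose level sets are the elliptical contour lines you sketched:

```python
import numpy as np

# Case 1: input (2, 0),  target 6   -> satisfied when 2 * w1  = 6
# Case 2: input (0, 10), target 20  -> satisfied when 10 * w2 = 20
X = np.array([[2.0, 0.0],
              [0.0, 10.0]])
T = np.array([6.0, 20.0])

# The weight vector satisfying both cases (exact here, since X is invertible).
w, *_ = np.linalg.lstsq(X, T, rcond=None)
print("weights satisfying both cases:", w)

# Total squared error at an arbitrary weight vector; contour lines of this
# function are the ellipses mentioned in the video.
def error(w1, w2):
    y = X @ np.array([w1, w2])
    return 0.5 * np.sum((T - y) ** 2)

print("error at the solution:", error(w[0], w[1]))
print("error at the origin:  ", error(0.0, 0.0))
```

Evaluating `error` on a grid of (w1, w2) values and contouring the result is one way to compare your hand-drawn sketch with the actual error surface.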