Why would it be a bad idea to give the "distributed encoding of person 1" layer 1 unit instead of 6 units? What would happen? What if we used 24 instead of 6?

Each of the 6 units learned to represent something different. The second one learned to represent nationality. Why was it the second one, i.e. why not, for example, the fifth one? Why did the neural network end up using the particular strategy that it did?

See the diagram of Bengio's language model. Could the two table look-ups use the same table? Could they each have their own table? Name advantages of both approaches.

What are the advantages of distributed representations over a one-of-N encoding? What are the advantages of one-of-N? Give concrete examples of when you'd use one and when you'd use the other.

Let's talk about the softmax cost function and its derivatives. You've seen this described two ways, and a good way to verify that you understand it (i.e. that you're ready for a midterm) is to describe it yet another way. Let's talk about it in terms of the input, a.k.a. logit (z), to a unit, and its output (y). That way, we can connect it to the way we discussed the error backpropagation algorithm last class.

1. The output (y) of one unit in a softmax group is its output probability. You've seen how it is calculated, but start by trying to remember it (i.e. without looking it up). Why do we call it a probability?

2. Let's talk about derivatives now. Let's focus on unit #7 of a softmax group of 10 units. Say we want to know dC/dz7. What would be the backpropagation-style approach to computing that?

3. Actually, it's easier to calculate dC/dz7 directly (i.e. without that intermediate). What is the answer, i.e. what is the formula for dC/dz7? Doing the derivation yourself is a helpful exercise, but we don't have time for it in class. Try it at home.
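As a concrete aid for the table look-up and one-of-N questions above, here is a minimal NumPy sketch (the vocabulary size of 8 is a made-up number; the 6 units match the distributed encoding discussed above). It shows that looking up a row of an embedding table is equivalent to multiplying a one-hot (one-of-N) vector by that table, i.e. a table look-up is just a linear layer whose input happens to be one-hot.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 8   # hypothetical number of distinct people/words
embed_dim = 6    # the "6 units" of the distributed encoding

# Embedding table: one learned row per symbol. If the two look-ups in
# Bengio's model share a table, both use this same array; if not, each
# look-up gets its own array of this shape.
table = rng.standard_normal((vocab_size, embed_dim))

person = 3

# One-of-N encoding: a length-8 vector with a single 1.
one_hot = np.zeros(vocab_size)
one_hot[person] = 1.0

# Distributed encoding: the corresponding row of the table.
distributed = table[person]

# The matrix product with the one-hot vector selects the same row.
assert np.allclose(one_hot @ table, distributed)
```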
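For questions 2 and 3: with the cross-entropy cost C = -sum_j t_j log y_j over a softmax group, the direct formula the question is after is dC/dz_k = y_k - t_k. Rather than give away the derivation, this sketch verifies the formula numerically for one unit of a 10-unit group by comparing it to a finite-difference estimate (the particular logits, target, and the 0-based index k = 7 are arbitrary choices for the check, not from the lecture).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, t):
    # C = -sum_j t_j * log y_j, with y = softmax(z)
    return -np.sum(t * np.log(softmax(z)))

rng = np.random.default_rng(1)
z = rng.standard_normal(10)  # logits for a 10-unit softmax group
t = np.zeros(10)
t[2] = 1.0                   # one-hot target

k = 7                        # the unit we differentiate with respect to

# Direct formula: dC/dz_k = y_k - t_k
analytic = softmax(z)[k] - t[k]

# Central finite-difference estimate of the same derivative.
eps = 1e-6
z_plus, z_minus = z.copy(), z.copy()
z_plus[k] += eps
z_minus[k] -= eps
numeric = (cross_entropy(z_plus, t) - cross_entropy(z_minus, t)) / (2 * eps)

assert np.isclose(analytic, numeric, atol=1e-6)
```

Note that the simplicity of y_k - t_k is exactly why computing dC/dz_k directly beats chaining through the intermediate dC/dy: the awkward terms cancel.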