Assignment 1

In this assignment, you will run code that trains a simple neural language model on a dataset of sentences culled from a large corpus of newspaper articles. The sentences were selected so that they use a highly restricted vocabulary of only 250 words.

First, download all the files associated with the assignment, including the data file and the MATLAB code you will need. Put everything in the same directory and, from within MATLAB, cd to that directory. The files are:

http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/4grams.mat
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/bprop.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/computeCrossEntropy.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/fprop.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/loadData.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/prepMinibatch.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/train.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/wordDistance.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/zeroOneLoss.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/rawSentences.tar.gz

Before you can run the code, you will need to load the data it expects to find in the workspace. To do that, execute the following command:

>> [trainbatches, validbatches, testbatches, vocab] = loadData(100, 1000);

The line above creates the training, validation, and test data variables, as well as a variable holding the 250 words in the vocabulary. It loads a training set with 1000 minibatches of 100 cases each.

The model you will train on these data produces a distribution over the next word given the previous three words as input. Since the neural network is trained on 4-grams extracted from isolated sentences, it is never asked to predict any of the first three words of a sentence.
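Conceptually, each training case is a 4-gram: three context words and one target word. The exact data layout is defined by loadData.m and prepMinibatch.m, but a minibatch can be pictured as a matrix of vocabulary indices, one column per case. The sketch below is purely illustrative (the indexing into trainbatches and the 4 x 100 shape are assumptions, not a description of the actual code):

```matlab
% Illustrative only: picture one minibatch as a 4 x 100 matrix of
% vocabulary indices (rows 1-3 = context words, row 4 = target word).
% The indexing below assumes a 3-D array layout, which may not match
% the actual structure produced by loadData.m.
batch   = trainbatches(:, :, 1);  % hypothetical: the first minibatch
context = batch(1:3, :);          % previous three words (network input)
target  = batch(4, :);            % fourth word (what the network predicts)
```

Looking at a few columns of such a batch next to the vocab variable is a quick way to convince yourself what the network is actually being asked to do.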
The neural network learns a d-dimensional embedding for the words in the vocabulary and has a hidden layer with numHid hidden units that is fully connected to the softmax output layer. If we are so inclined, we can view the embedding as another (earlier) hidden layer whose weights are shared across the three input words.

After you have loaded the data, you can set d and numHid like so:

>> d = 8; numHid = 64;

and then run the main training script:

>> train;

which will train the neural network using the embedding dimensionality and number of hidden units specified. The training script monitors the cross-entropy error on the validation data and uses that information to decide when to stop training. Training stops as soon as the validation error increases, and the final weights are taken to be the weights from immediately before this increase. This procedure is a form of "early stopping" and is a common method for avoiding overfitting in neural network training.

Here is a list of the variables of interest that the train script puts in the workspace:

wordRepsFinal - the learned word embedding
repToHidFinal - the learned embedding-to-hidden unit weights
hidToOutFinal - the learned hidden-to-output unit weights
hidBiasFinal - the learned hidden biases
outBiasFinal - the learned output biases
epochsBeforeVErrUptick - the number of epochs of training before the validation error increased; in other words, the number of training epochs used to produce the final weights above
finalTrainCEPerCase - the per-case cross-entropy error on the training set for the weights with the best validation error
finalValidCEPerCase - the per-case cross-entropy error on the validation set for the final weights (this will always be the best validation error the training script has seen)
finalTestCEPerCase - the per-case cross-entropy error on the test set for the final weights

You must train the model four times, trying all possible combinations of d=8, d=32 and numHid=64, numHid=256.
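To make the architecture concrete, here is a sketch of the forward pass for a single case. The real fprop.m is vectorized over a whole minibatch and may differ in its variable names and choice of hidden nonlinearity; the logistic hidden units and the names below are assumptions for illustration only:

```matlab
% Illustrative forward pass for one 4-gram (w1, w2, w3 -> target).
% Assumes wordReps stores one d-dimensional column per vocabulary word
% and that the hidden units are logistic; the actual fprop.m may differ.
embedded = [wordReps(:, w1); wordReps(:, w2); wordReps(:, w3)]; % 3d x 1
hidIn    = repToHid * embedded + hidBias;      % numHid x 1
hidOut   = 1 ./ (1 + exp(-hidIn));             % logistic hidden units
outIn    = hidToOut * hidOut + outBias;        % 250 x 1, one per word
outIn    = outIn - max(outIn);                 % subtract max for stability
probs    = exp(outIn) ./ sum(exp(outIn));      % softmax over the vocabulary
crossEnt = -log(probs(target));                % per-case cross-entropy error
```

Note how the same embedding matrix is used for all three context words; this weight sharing is what lets us view the embedding as an extra hidden layer.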
You must record the final cross-entropy error on the training, validation, and test sets (stored in the appropriate variables mentioned above) for each of these runs. You must also record the number of epochs before the validation error increased (stored in epochsBeforeVErrUptick) for each of the runs. Select the best configuration that you ran.

The function wordDistance has been provided so that you can compute the distance between the learned representations of two words. It takes two strings, the wordReps matrix that you learned (use wordRepsFinal unless you have a good reason not to), and the vocabulary. For example, to compute the distance between the words "and" and "but", you would do the following (after training the model, of course):

>> wordDistance('and', 'but', wordRepsFinal, vocab)

The wordDistance function simply takes the feature vector corresponding to each word and computes the L2 norm of the difference vector. Because of this, you can only meaningfully compare the relative distances between two pairs of words and discover things like "the word 'and' is closer to the word 'but' than it is to the word 'or' in the learned embedding." If you are especially enterprising, you can compare the distance between two words to the average distance to each of those words.

Remember that if you want to enter a string that contains the single-quote character in MATLAB, you must escape it with another single quote. So the vocabulary item apostrophe-s ('s) would have to be entered as

>> '''s'

in MATLAB.

Compute the distances between a few words and look for patterns. See if you can discover a few interesting things about the learned word embedding by looking at the distances between various pairs of words. What words would you expect to be close together? Are they? Think about what factors contribute to words being given nearby feature vectors.
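The distance computation described above amounts to just a few lines. The sketch below assumes the embedding stores one column per vocabulary word and that vocab is a cell array of strings; the actual wordDistance.m may differ in these details:

```matlab
% Sketch of what wordDistance computes (assumptions: one embedding column
% per word, vocab is a cell array of strings; the real code may differ).
i    = find(strcmp(word1, vocab));            % vocabulary index of word 1
j    = find(strcmp(word2, vocab));            % vocabulary index of word 2
dist = norm(wordReps(:, i) - wordReps(:, j)); % L2 norm of the difference
```

Because the overall scale of the embedding is arbitrary, only comparisons between distances carry meaning, which is why the handout asks you to look at relative distances rather than absolute ones.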
You can access the vocabulary of words through the 'vocab' variable and the raw sentences from the file rawSentences.tar.gz.

What to hand in:

You should hand in, along with the results you were instructed to record above, some clear, well-written, and concise prose providing a brief commentary on the results. At most two pages will be graded, so anything beyond two pages will be ignored during marking.

Here are a few suggestions on what you should comment on, but they are certainly not exhaustive and are not meant to limit your discussion. You will be expected to offer insights beyond what is mentioned below; the questions are merely there to guide your thinking, so please make your analysis go beyond them.

Give your reasoning about why the d and numHid you found to be best actually are best, and explain how you defined "best". Comment a bit on everything you were asked to record, and also on what seemed to be happening during training. Explain all your observations. What did you discover about the learned word embedding? Why did these properties of the embedding arise? What can cause two words to be placed near each other? What can you say about overfitting and generalization in light of your results?