Assignment 1

In this assignment, you will run code that trains a simple neural language model on a dataset of sentences culled from a large corpus of newspaper articles. The sentences were selected so that they use a highly restricted vocabulary of only 250 words.

First, download all the files associated with the assignment, including the data file and the MATLAB code you will need. Put everything in the same directory and, from within MATLAB, cd to that directory. The files are:

http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/4grams.mat
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/bprop.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/computeCrossEntropy.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/fprop.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/loadData.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/prepMinibatch.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/train.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/wordDistance.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/zeroOneLoss.m
http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment1/rawSentences.tar.gz

Before you can run the code, you will need to load the data it expects to find in the workspace. To do that, execute the following command:

>> [trainbatches, validbatches, testbatches, vocab] = loadData(100, 1000);

The line above creates the training, validation, and test data variables, as well as a variable holding the 250 words in the vocabulary. It loads a training set with 1000 minibatches of 100 cases each.

The model you will train on these data produces a distribution over the next word given the previous three words as input. Since the neural network is trained on 4-grams extracted from isolated sentences, it is never asked to predict any of the first three words of a sentence.
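Conceptually, each training case is a 4-gram: three context words and one target word. The exact data layout is defined by loadData.m and prepMinibatch.m, but a minibatch can be pictured as a matrix of vocabulary indices, one column per case. The sketch below is purely illustrative (the indexing into trainbatches and the 4 x 100 shape are assumptions, not a description of the actual code):

```matlab
% Illustrative only: picture one minibatch as a 4 x 100 matrix of
% vocabulary indices (rows 1-3 = context words, row 4 = target word).
% The indexing below assumes a 3-D array layout, which may not match
% the actual structure produced by loadData.m.
batch   = trainbatches(:, :, 1);  % hypothetical: the first minibatch
context = batch(1:3, :);          % previous three words (network input)
target  = batch(4, :);            % fourth word (what the network predicts)
```

Looking at a few columns of such a batch next to the vocab variable is a quick way to convince yourself what the network is actually being asked to do.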
The neural network learns a d-dimensional embedding for the words in the vocabulary and has a hidden layer with numHid hidden units that is fully connected to the softmax output layer. If we are so inclined, we can view the embedding as another (earlier) hidden layer whose weights are shared across the three input words.

After you have loaded the data, you can set d and numHid like so:

>> d = 8; numHid = 64;

and then run the main training script:

>> train;

which will train the neural network using the embedding dimensionality and number of hidden units specified. The training script monitors the cross-entropy error on the validation data and uses that information to decide when to stop training. Training stops as soon as the validation error increases, and the final weights are taken to be the weights from immediately before this increase. This procedure is a form of "early stopping" and is a common method for avoiding overfitting in neural network training.

Here is a list of the variables of interest that the train script puts in the workspace:

wordRepsFinal - the learned word embedding
repToHidFinal - the learned embedding-to-hidden unit weights
hidToOutFinal - the learned hidden-to-output unit weights
hidBiasFinal - the learned hidden biases
outBiasFinal - the learned output biases
epochsBeforeVErrUptick - the number of epochs of training before the validation error increased; in other words, the number of training epochs used to produce the final weights above
finalTrainCEPerCase - the per-case cross-entropy error on the training set for the weights with the best validation error
finalValidCEPerCase - the per-case cross-entropy error on the validation set for the final weights (this will always be the best validation error the training script has seen)
finalTestCEPerCase - the per-case cross-entropy error on the test set for the final weights

You must train the model four times, trying all possible combinations of d=8, d=32 and numHid=64, numHid=256.
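To make the architecture concrete, here is a sketch of the forward pass for a single case. The real fprop.m is vectorized over a whole minibatch and may differ in its variable names and choice of hidden nonlinearity; the logistic hidden units and the names below are assumptions for illustration only:

```matlab
% Illustrative forward pass for one 4-gram (w1, w2, w3 -> target).
% Assumes wordReps stores one d-dimensional column per vocabulary word
% and that the hidden units are logistic; the actual fprop.m may differ.
embedded = [wordReps(:, w1); wordReps(:, w2); wordReps(:, w3)]; % 3d x 1
hidIn    = repToHid * embedded + hidBias;      % numHid x 1
hidOut   = 1 ./ (1 + exp(-hidIn));             % logistic hidden units
outIn    = hidToOut * hidOut + outBias;        % 250 x 1, one per word
outIn    = outIn - max(outIn);                 % subtract max for stability
probs    = exp(outIn) ./ sum(exp(outIn));      % softmax over the vocabulary
crossEnt = -log(probs(target));                % per-case cross-entropy error
```

Note how the same embedding matrix is used for all three context words; this weight sharing is what lets us view the embedding as an extra hidden layer.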
You must record the final cross-entropy error on the training, validation, and test sets (stored in the appropriate variables mentioned above) for each of these runs. You must also record the number of epochs before the validation error increased (stored in epochsBeforeVErrUptick) for each of the runs. Select the best configuration that you ran.

The function wordDistance has been provided so that you can compute the distance between the learned representations of two words. It takes two strings, the wordReps matrix that you learned (use wordRepsFinal unless you have a good reason not to), and the vocabulary. For example, to compute the distance between the words "and" and "but", you would do the following (after training the model, of course):

>> wordDistance('and', 'but', wordRepsFinal, vocab)

The wordDistance function simply takes the feature vector corresponding to each word and computes the L2 norm of the difference vector. Because of this, you can only meaningfully compare the relative distances between two pairs of words and discover things like "the word 'and' is closer to the word 'but' than it is to the word 'or' in the learned embedding." If you are especially enterprising, you can compare the distance between two words to the average distance to each of those words.

Remember that if you want to enter a string that contains the single-quote character in MATLAB, you must escape it with another single quote. So the vocabulary item apostrophe-s ('s) would have to be entered as

>> '''s'

in MATLAB.

Compute the distances between a few words and look for patterns. See if you can discover a few interesting things about the learned word embedding by looking at the distances between various pairs of words. What words would you expect to be close together? Are they? Think about what factors contribute to words being given nearby feature vectors.
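The distance computation described above amounts to just a few lines. The sketch below assumes the embedding stores one column per vocabulary word and that vocab is a cell array of strings; the actual wordDistance.m may differ in these details:

```matlab
% Sketch of what wordDistance computes (assumptions: one embedding column
% per word, vocab is a cell array of strings; the real code may differ).
i    = find(strcmp(word1, vocab));            % vocabulary index of word 1
j    = find(strcmp(word2, vocab));            % vocabulary index of word 2
dist = norm(wordReps(:, i) - wordReps(:, j)); % L2 norm of the difference
```

Because the overall scale of the embedding is arbitrary, only comparisons between distances carry meaning, which is why the handout asks you to look at relative distances rather than absolute ones.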
You can access the vocabulary of words through the 'vocab' variable and the raw sentences from the file rawSentences.tar.gz.

What to hand in:

You should hand in, along with the results you were instructed to record above, some clear, well-written, and concise prose providing a brief commentary on the results. At most two pages will be graded, so anything beyond two pages will be ignored during marking.

Here are a few suggestions on what you should comment on, but they are certainly not exhaustive and are not meant to limit your discussion. You will be expected to offer insights beyond what is mentioned below; the questions are merely there to guide your thinking, so please make your analysis go beyond them.

Give your reasoning about why the d and numHid you found to be best actually are best, and explain how you defined "best". Comment a bit on everything you were asked to record, and also on what seemed to be happening during training. Explain all your observations. What did you discover about the learned word embedding? Why did these properties of the embedding arise? What can cause two words to be placed near each other? What can you say about overfitting and generalization in light of your results?