Look at the file raw_sentences.txt. It contains the sentences that we will be using for this assignment. Sequences of 4 adjacent words (called 4-grams) were extracted from them. These sentences are fairly simple ones and cover a vocabulary of only 250 words.
The data set consists of 4-grams extracted from raw_sentences.txt. All the words involved come from a small vocabulary of 250 words.
Open a MATLAB terminal and load data.mat. It should contain the training, validation and test sets, along with the 250-word vocabulary. 'data.trainData' is a 4 x 372,500 matrix: there are 372,500 training cases and 4 words per training case. Each entry is an integer giving the index of a word in the vocabulary, so each column represents a sequence of 4 words. 'data.validData' and 'data.testData' have the same layout and contain 46,500 4-grams each.
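As a minimal sketch of how a column of indices maps back to words, the snippet below decodes one column of a 4 x N index matrix into a 4-gram. It uses a tiny made-up vocabulary and matrix so that it runs on its own. The assumption that the vocabulary is stored as a cell array of strings, and the field name data.vocab mentioned in the comment, are not confirmed by this handout; check the loaded struct with fieldnames(data).

```matlab
% Toy stand-ins for the real data. Assumption: the real vocabulary is a cell
% array of strings (possibly data.vocab; verify with fieldnames(data)).
vocab = {'we', 'are', 'going', 'home', '.'};
trainData = [1 2; 2 3; 3 4; 4 5];   % 4 x 2: each column is one 4-gram

% Decode the second training case into words.
col = trainData(:, 2);              % the four word indices of that case
words = vocab(col(:)');             % 1 x 4 cell array of words
fourgram = strjoin(words, ' ');     % join the words with spaces
disp(fourgram);
```

With the real data, the same indexing would apply to a column of data.trainData once the vocabulary field has been identified.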
You must train the model four times, once for each combination of d and num_hid: (d=8, num_hid=64), (d=8, num_hid=256), (d=32, num_hid=64) and (d=32, num_hid=256). For each run, you must record the final cross entropy error on the training, validation, and test sets. You must also record the number of epochs at which the training stops. These four quantities are written into the model struct.
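One way to keep the four runs organized is a small loop over the configurations that stores one record per run. The sketch below is only an illustration: train_stub is a hypothetical placeholder for the provided training routine, and the field names train_CE, valid_CE, test_CE and epochs are made up here. Substitute the real training call and the actual fields of the model struct.

```matlab
% Hypothetical stand-in for the provided training routine. It returns a struct
% with made-up fields (filled with NaN) so that this loop runs on its own.
train_stub = @(d, num_hid) struct('d', d, 'num_hid', num_hid, ...
    'train_CE', NaN, 'valid_CE', NaN, 'test_CE', NaN, 'epochs', NaN);

d_values = [8 32];
num_hid_values = [64 256];
results = cell(1, numel(d_values) * numel(num_hid_values));

k = 0;
for d = d_values
    for num_hid = num_hid_values
        k = k + 1;
        results{k} = train_stub(d, num_hid);   % replace with the real run
    end
end

% Print one line per configuration for the write-up.
for k = 1:numel(results)
    r = results{k};
    fprintf('d=%d num_hid=%d train=%.3f valid=%.3f test=%.3f epochs=%g\n', ...
        r.d, r.num_hid, r.train_CE, r.valid_CE, r.test_CE, r.epochs);
end
```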
Select the best configuration that you ran. The function word_distance has been provided for you so that you can compute the distance between the learned representations of two words. The word_distance function takes two strings and the model as arguments. For example, to compute the distance between the words "and" and "but", you would do the following (after training the model, of course):
word_distance('and', 'but', model)
The word_distance function simply takes the feature vector corresponding to each word and computes the L2 norm of the difference vector. Because of this, you can only meaningfully compare the relative distances between two pairs of words and discover things like "the word 'and' is closer to the word 'but' than it is to the word 'or' in the learned embedding." If you are especially enterprising, you can compare the distance between two words to the average distance to each of those words. Remember that if you want to enter a string that contains the single quote character in MATLAB you must escape it with another single quote. So the string apostrophe s, which is in the vocabulary, would have to be entered as '''s' in MATLAB.
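To make the computation concrete, here is a minimal sketch of the same idea on a toy embedding. It is not the provided word_distance implementation: the vocab and features variables, the row-per-word layout, and the helper names idx and dist are all assumptions chosen for illustration.

```matlab
% Toy embedding. Assumption: one feature vector per word, stored as rows.
vocab = {'and', 'but', 'or', '''s'};   % '''s' is the two-character string 's
features = [0 0; 3 4; 6 8; 1 1];

% Look up a word's row by exact string match.
idx = @(w) find(strcmp(vocab, w), 1);

% L2 norm of the difference between two words' feature vectors.
dist = @(w1, w2) norm(features(idx(w1), :) - features(idx(w2), :));

d_and_but = dist('and', 'but');
d_and_or  = dist('and', 'or');
fprintf('and-but: %g, and-or: %g\n', d_and_but, d_and_or);

% The doubled quote is an escape, so the string has two characters.
apostrophe_s = '''s';
disp(length(apostrophe_s));
```

In this toy example 'and' is closer to 'but' than to 'or', which is the kind of relative comparison the distances support; the absolute values carry no meaning on their own.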
Compute the distances between a few words and look for patterns. See if you can discover a few interesting things about the learned word embedding by looking at the distances between various pairs of words. What words would you expect to be close together? Are they? Think about what factors contribute to words being given nearby feature vectors.
You should hand in, along with the results you were instructed to record above, some clear, well-written, and concise prose providing a brief commentary on the results. At most two pages will be graded, so anything beyond two pages will be ignored during marking.
Here are a few suggestions on what you should comment on. They are not exhaustive and are not meant to limit your discussion: the questions below are only there to guide your thinking, and you will be expected to offer insights beyond them. You should explain why the d and num_hid you found to be the best actually are the best, and how you defined "best". Comment a bit on everything you were asked to record and also on what seemed to be happening during training. Explain all your observations. What did you discover about the learned word embedding? Why did these properties of the embedding arise? What things can cause two words to be placed near each other?