In this assignment we will be using neural networks for classifying handwritten digits.

First copy and unzip the archive below into your directory:

    http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment2.tar.gz

An easy way to do this is:

    $ wget http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment2.tar.gz
    $ tar -xf assignment2.tar.gz
    $ cd assignment2

Then invoke matlab. If you want to keep it simple, you can use matlab without the graphical user interface by typing:

    matlab -nojvm

At the >> prompt, type

    load assign2data2012.mat

This will load the training and test data into your workspace. The training data consists of 150 images of handwritten digits (15 of each class). Each image is 16 x 16 pixels, and each pixel is represented by a real number between 0 (black) and 1 (white). The test data contains 3000 images (300 of each class). To get a feel for what we are dealing with, you can view the training data by typing

    >> showimages(data);

The network we use consists of a single hidden layer of sigmoid units and a 10-way softmax output. The network tries to minimize the cross-entropy error (the error on a case is the negative log probability that the network gives to the correct answer). It uses momentum to speed up the learning and a weightcost to keep the weights small. It expects you to set several global variables that control these behaviours by hand (see the code file). This makes it easy to set up experiments in which you try many different settings. Now type

    restart = 1;
    maxepoch = 2000;
    numhid = 60;
    epsilon = .01;
    finalmomentum = 0.8;
    weightcost = 0;
    classbp2;

classbp2 trains the network. It prints out the cross-entropy (E) on the training set and on the test set, as well as the number of errors (it takes the maximum output as the network's answer). When it has finished, it plots graphs of the number of test errors and of the cross-entropy cost on the test set.
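To make the cost and the update rule concrete, here is an illustrative sketch of the forward pass, the cross-entropy error, and a momentum step with a weight penalty. This is Python/NumPy, not the course's MATLAB code; the function names and shapes are my own assumptions, not those used in classbp2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the row max for numerical stability before exponentiating
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """One forward pass: sigmoid hidden layer, 10-way softmax output."""
    H = sigmoid(X @ W1 + b1)   # hidden activations, one row per case
    P = softmax(H @ W2 + b2)   # class probabilities, rows sum to 1
    return H, P

def cross_entropy(P, labels):
    """Mean negative log probability assigned to the correct class."""
    return -np.mean(np.log(P[np.arange(len(labels)), labels]))

def momentum_step(W, grad, velocity, epsilon, momentum, weightcost):
    """Momentum update with an L2 penalty (the role of 'weightcost')."""
    velocity = momentum * velocity - epsilon * (grad + weightcost * W)
    return W + velocity, velocity
```

For the digit task above, X would be 256-dimensional rows (the 16 x 16 images flattened) and the hidden layer would have numhid = 60 units.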
The plots will be on top of each other, so you will need to move the top window to see the one underneath. You can set the variable errorprintfreq in classbp2 to make it measure the error as frequently as you want. You must always reset the restart variable to 1 whenever you want to redo the training from random initial weights; if you don't, the network will start learning from where it left off. If you call showweights(inhid) you can see the input-to-hidden weights as gray-level images.

PART 1 (3 points)

Using numhid=60, maxepoch=2000, and weightcost=0, play around with epsilon and finalmomentum to find settings that make tE low after 2000 epochs. Briefly report what you discover. Include the values of epsilon and finalmomentum that work best and say what values they produce for the test errors and the cross-entropy error. If you were instead asked to find the epsilon that produced the best minimum value (not the best final value) for the test set cross-entropy, would you expect to find a larger or smaller epsilon (assuming the minimum occurs before the last epoch)? In a sentence, justify your answer.

PART 2 (2 points)

Using numhid=60, maxepoch=2000, and finalmomentum=0.8, set epsilon to a sensible value based on your experiments in Part 1 and then try various values of weightcost to see how it affects the final value of tE. You may find the file experiment.m useful, but you will have to edit it. Briefly report what you discovered and include a plot of the final value of tE against the weightcost. Your report on Parts 1 and 2 combined should be NOT MORE THAN ONE PAGE long, but graphs and printouts of runs can be attached.

PART 3 (5 points)

In this part we will try to find a Bayesian solution to a small toy problem. The net looks like this:

            t
            O
           / \
      w_4 /   \ w_3
         /     O
        |     / \
        | w_1/   \ w_2
         \  /     \
         x_1      x_2

Each data point consists of two real-valued numbers x_1 and x_2. Each 'O' represents a sigmoid unit. We are interested in learning the w's.
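Reading the diagram as: x_1 and x_2 feed the hidden sigmoid unit through w_1 and w_2, the hidden unit feeds the output through w_3, and x_1 also connects directly to the output through w_4, the forward pass is just two nested sigmoids. Here is an illustrative Python sketch; the wiring is my interpretation of the diagram, so check it against the assignment's MATLAB code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_net_output(x1, x2, w):
    """Output of the toy net for weights w = (w1, w2, w3, w4).

    Assumed wiring: the hidden sigmoid unit sees x1 (via w1) and
    x2 (via w2); the output sigmoid unit sees the hidden unit (via w3)
    and x1 directly (via w4)."""
    h = sigmoid(w[0] * x1 + w[1] * x2)
    return sigmoid(w[2] * h + w[3] * x1)
```

The output is a number in (0, 1), which Part 3 treats as the probability that the target is 1.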
Recall that in a Bayesian setting, we have a prior probability distribution over the parameters w. Each setting of the parameters to values constitutes a hypothesis, and we have some prior belief about which hypotheses are more likely to be true. One such belief could be that hypotheses with small values of the w's are more likely (a Gaussian prior). Another could be that all hypotheses are equally likely (a uniform prior). In this assignment we will be working with a uniform prior over w.

We then observe the data and modify our belief by weighting each hypothesis by its likelihood. In other words, the modified belief about hypothesis H is the prior probability of H multiplied by the likelihood that the data could be explained by H. We then use the weighted average of the hypotheses to make decisions; the model so obtained is called a Bayesian estimate. Another way of making decisions would be to choose the hypothesis with the highest modified belief (instead of taking a weighted average of all of them); this is called the maximum a-posteriori (MAP) estimate. Still another way would be to choose the hypothesis with the largest likelihood, ignoring our prior beliefs; this is called the maximum likelihood (ML) estimate. In this assignment we will be comparing the Bayesian and ML/MAP estimators. Since we are using a uniform prior, the ML and MAP estimates are the same.

We will generate our training data randomly. Each dimension of each data point will be a sample drawn from a normal distribution with mean 0 and standard deviation inputstandev (currently set to 4). The weights w will be set to uniform random values in [-1, 1], and we assert that this is the 'true' net (also called a teacher net).
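The estimators described above can be written down concretely for a discrete hypothesis space and binary targets, which is exactly the setting of this part. The following is an illustrative Python sketch; the function names are my own, not the assignment code's.

```python
import numpy as np

def posterior(pred_train, targets, prior):
    """Posterior over hypotheses: prior times likelihood, normalized.

    pred_train: (num_hyp, num_cases) array; entry [h, c] is hypothesis h's
    predicted probability that training case c has target 1.
    targets: (num_cases,) array of 0/1 targets."""
    # Bernoulli likelihood of the observed targets under each hypothesis
    lik = np.prod(np.where(targets == 1, pred_train, 1.0 - pred_train), axis=1)
    post = prior * lik
    return post / post.sum()

def bayes_estimate(pred_new, post):
    """Posterior-weighted average of every hypothesis's prediction."""
    return post @ pred_new

def map_estimate(pred_new, post):
    """Prediction of the single most probable hypothesis (with a uniform
    prior this is also the maximum-likelihood hypothesis)."""
    return pred_new[np.argmax(post)]
```

Note that the Bayesian estimate averages predictions, not weights: every hypothesis votes in proportion to how well it explains the training data.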
The targets are binary valued and will be sampled from a Bernoulli distribution defined by the predictions of the teacher net (i.e., the target for input x will be 1 with probability p, where p is the output of the teacher network on input x).

To start off, copy the following files: Then type

    >> makeallvecs;

This makes a matrix in which each row is a possible weight vector; that is, the matrix allvecs contains all possible hypotheses. Each w_i can take on 9 possible values (-1 to 1 in intervals of 0.25), and the prior is uniform over these 9^4 possible weight vectors. Then type

    >> maketeacher;

This makes a teacher network. Then type:

    >> numcases = 10;
    >> bayeswithbest;

numcases is the number of training cases that the model will use; the number of test cases is fixed at 100. The code will print out the Bayesian estimate's error on the training and test sets, as well as the error of the MAP/ML estimate.

Figure 1 shows how well the outputs of the teacher on the training data can be predicted by Bayes-averaging the outputs of all possible nets. "Bayes-averaging" means weighting the prediction of each net by the posterior probability of that net given the training data and the prior (which is flat in this example). Each column corresponds to a data point. The first row gives the output of the teacher net, the second row the output of the Bayesian estimate, and the third row the output of the MAP/ML estimate. The size of each square is proportional to the output value. Figure 2 shows the same for the test data. Figure 3 shows a histogram of the posterior probability distribution across all 9^4 weight vectors. Notice that the posterior can be very spread out, so that even the best net gets a very small posterior probability.

Your report should be at most half a page and should describe the effects of changing the number of training cases.
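The hypothesis grid built by makeallvecs and the Bernoulli target sampling described above are easy to reproduce. Here is an illustrative Python sketch (makeallvecs itself is MATLAB; the names here are my own):

```python
import itertools
import numpy as np

# Each of the four weights takes 9 values: -1, -0.75, ..., 0.75, 1.
values = np.linspace(-1.0, 1.0, 9)

# All 9^4 = 6561 candidate weight vectors, one per row
# (this plays the role of allvecs).
allvecs = np.array(list(itertools.product(values, repeat=4)))

def sample_targets(teacher_probs, rng):
    """Binary targets: target c is 1 with probability teacher_probs[c],
    i.e., the teacher's output on training case c."""
    return (rng.random(teacher_probs.shape) < teacher_probs).astype(int)
```

With a uniform prior, every row of allvecs starts with prior probability 1/9^4, and the posterior histogram in Figure 3 shows how the training data redistributes that mass.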
You should also try modifying maketeacher to create different kinds of teacher nets and see how the results depend on the particular teacher net.