In this assignment we will be using neural networks for classifying handwritten digits.

First copy and unzip the archive below into your directory:

    http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment2.tar.gz

An easy way to do this is:

    $ wget http://www.cs.toronto.edu/~hinton/csc321/matlab/assignment2.tar.gz
    $ tar -xf assignment2.tar.gz
    $ cd assignment2

Then invoke matlab. If you want to keep it simple, you can use matlab without the graphical user interface by typing:

    matlab -nojvm

At the >> prompt, type

    load assign2data2012.mat

This will load the training and test data into your workspace. The training data consists of 150 images of handwritten digits (15 of each class). Each image is 16 x 16 pixels, and each pixel is represented by a real number between 0 (black) and 1 (white). The test data contains 3000 images (300 of each class). To get a feel for what we are dealing with, you can view the training data by typing

    >> showimages(data);

The network we use consists of a single hidden layer of sigmoid units and a 10-way softmax output. The network tries to minimize the cross-entropy error (the error on a case is the negative log probability that the network gives to the correct answer). It uses momentum to speed up the learning and a weightcost to keep the weights small. It expects you to set several global variables that control these behaviours by hand (see the code file). This makes it easy to set up experiments in which you try many different settings. Now type

    restart = 1;
    maxepoch = 2000;
    numhid = 60;
    epsilon = .01;
    finalmomentum = 0.8;
    weightcost = 0;
    classbp2;

classbp2 trains the network. It prints out the cross-entropy (E) on the training set and on the test set, as well as the number of errors (it takes the maximum output as the network's answer). When it has finished, it plots graphs of the number of test errors and of the cross-entropy cost on the test set.
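To make the cost and the update rule concrete, here is an illustrative sketch of the forward pass, the cross-entropy error, and a momentum step with a weight penalty. This is Python/NumPy, not the course's MATLAB code; the function names and shapes are my own assumptions, not those used in classbp2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the row max for numerical stability before exponentiating
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """One forward pass: sigmoid hidden layer, 10-way softmax output."""
    H = sigmoid(X @ W1 + b1)   # hidden activations, one row per case
    P = softmax(H @ W2 + b2)   # class probabilities, rows sum to 1
    return H, P

def cross_entropy(P, labels):
    """Mean negative log probability assigned to the correct class."""
    return -np.mean(np.log(P[np.arange(len(labels)), labels]))

def momentum_step(W, grad, velocity, epsilon, momentum, weightcost):
    """Momentum update with an L2 penalty (the role of 'weightcost')."""
    velocity = momentum * velocity - epsilon * (grad + weightcost * W)
    return W + velocity, velocity
```

For the digit task above, X would be 256-dimensional rows (the 16 x 16 images flattened) and the hidden layer would have numhid = 60 units.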
The plots will be on top of each other, so you will need to move the top window to see the one underneath. You can set the variable errorprintfreq in classbp2 to make it measure the error as frequently as you want. You must always reset the restart variable to 1 whenever you want to redo the training from random initial weights; if you don't, the network will start learning from where it left off. If you call showweights(inhid) you can see the input-to-hidden weights as gray-level images.

PART 1 (3 points)

Using numhid=60, maxepoch=2000, and weightcost=0, play around with epsilon and finalmomentum to find settings that make tE low after 2000 epochs. Briefly report what you discover. Include the values of epsilon and finalmomentum that work best and say what values they produce for the test errors and the cross-entropy error. If you were instead asked to find the epsilon that produced the best minimum value (not the best final value) for the test set cross-entropy, would you expect to find a larger or smaller epsilon (assuming the minimum occurs before the last epoch)? In a sentence, justify your answer.

PART 2 (2 points)

Using numhid=60, maxepoch=2000, and finalmomentum=0.8, set epsilon to a sensible value based on your experiments in Part 1 and then try various values of weightcost to see how it affects the final value of tE. You may find the file experiment.m useful, but you will have to edit it. Briefly report what you discovered and include a plot of the final value of tE against the weightcost. Your report on Parts 1 and 2 combined should be NOT MORE THAN ONE PAGE long, but graphs and printouts of runs can be attached.

PART 3 (5 points)

In this part we will try to find a Bayesian solution to a small toy problem. The net looks like this:

            t
            O
           / \
      w_4 /   \ w_3
         /     O
        |     / \
        | w_1/   \ w_2
         \  /     \
         x_1      x_2

Each data point consists of two real-valued numbers x_1 and x_2. Each 'O' represents a sigmoid unit. We are interested in learning the w's.
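Reading the diagram as: x_1 and x_2 feed the hidden sigmoid unit through w_1 and w_2, the hidden unit feeds the output through w_3, and x_1 also connects directly to the output through w_4, the forward pass is just two nested sigmoids. Here is an illustrative Python sketch; the wiring is my interpretation of the diagram, so check it against the assignment's MATLAB code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_net_output(x1, x2, w):
    """Output of the toy net for weights w = (w1, w2, w3, w4).

    Assumed wiring: the hidden sigmoid unit sees x1 (via w1) and
    x2 (via w2); the output sigmoid unit sees the hidden unit (via w3)
    and x1 directly (via w4)."""
    h = sigmoid(w[0] * x1 + w[1] * x2)
    return sigmoid(w[2] * h + w[3] * x1)
```

The output is a number in (0, 1), which Part 3 treats as the probability that the target is 1.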
Recall that in a Bayesian setting, we have a prior probability distribution over the parameters w. Each setting of the parameters to values constitutes a hypothesis, and we have some prior belief about which hypotheses are more likely to be true. One such belief could be that hypotheses with small values of the w's are more likely (a Gaussian prior). Another could be that all hypotheses are equally likely (a uniform prior). In this assignment we will be working with a uniform prior over w.

We then observe the data and modify our belief by weighting each hypothesis by its likelihood. In other words, the modified belief about hypothesis H is the prior probability of H multiplied by the likelihood that the data could be explained by H. We then use the weighted average of the hypotheses to make decisions; the model so obtained is called a Bayesian estimate. Another way of making decisions would be to choose the hypothesis with the highest modified belief (instead of taking a weighted average of all of them); this is called the maximum a-posteriori (MAP) estimate. Still another way would be to choose the hypothesis with the largest likelihood, ignoring our prior beliefs; this is called the maximum likelihood (ML) estimate. In this assignment we will be comparing the Bayesian and ML/MAP estimators. Since we are using a uniform prior, the ML and MAP estimates are the same.

We will generate our training data randomly. Each dimension of each data point will be a sample drawn from a normal distribution with mean 0 and standard deviation inputstandev (currently set to 4). The weights w will be set to uniform random values in [-1, 1], and we assert that this is the 'true' net (also called a teacher net).
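The estimators described above can be written down concretely for a discrete hypothesis space and binary targets, which is exactly the setting of this part. The following is an illustrative Python sketch; the function names are my own, not the assignment code's.

```python
import numpy as np

def posterior(pred_train, targets, prior):
    """Posterior over hypotheses: prior times likelihood, normalized.

    pred_train: (num_hyp, num_cases) array; entry [h, c] is hypothesis h's
    predicted probability that training case c has target 1.
    targets: (num_cases,) array of 0/1 targets."""
    # Bernoulli likelihood of the observed targets under each hypothesis
    lik = np.prod(np.where(targets == 1, pred_train, 1.0 - pred_train), axis=1)
    post = prior * lik
    return post / post.sum()

def bayes_estimate(pred_new, post):
    """Posterior-weighted average of every hypothesis's prediction."""
    return post @ pred_new

def map_estimate(pred_new, post):
    """Prediction of the single most probable hypothesis (with a uniform
    prior this is also the maximum-likelihood hypothesis)."""
    return pred_new[np.argmax(post)]
```

Note that the Bayesian estimate averages predictions, not weights: every hypothesis votes in proportion to how well it explains the training data.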
The targets are binary valued and will be sampled from a Bernoulli distribution defined by the predictions of the teacher net (i.e., the target for input x will be 1 with probability p, where p is the output of the teacher network on input x).

To start off, copy the following files: Then type

    >> makeallvecs;

This makes a matrix in which each row is a possible weight vector; that is, the matrix allvecs contains all possible hypotheses. Each w_i can take on 9 possible values (-1 to 1 in intervals of 0.25), and the prior is uniform over these 9^4 possible weight vectors. Then type

    >> maketeacher;

This makes a teacher network. Then type:

    >> numcases = 10;
    >> bayeswithbest;

numcases is the number of training cases that the model will use; the number of test cases is fixed at 100. The code will print out the Bayesian estimate's error on the training and test sets, as well as the error of the MAP/ML estimate.

Figure 1 shows how well the outputs of the teacher on the training data can be predicted by Bayes-averaging the outputs of all possible nets. "Bayes-averaging" means weighting the prediction of each net by the posterior probability of that net given the training data and the prior (which is flat in this example). Each column corresponds to a data point. The first row gives the output of the teacher net, the second row the output of the Bayesian estimate, and the third row the output of the MAP/ML estimate. The size of each square is proportional to the output value. Figure 2 shows the same for the test data. Figure 3 shows a histogram of the posterior probability distribution across all 9^4 weight vectors. Notice that the posterior can be very spread out, so that even the best net gets a very small posterior probability.

Your report should be at most half a page and should describe the effects of changing the number of training cases.
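The hypothesis grid built by makeallvecs and the Bernoulli target sampling described above are easy to reproduce. Here is an illustrative Python sketch (makeallvecs itself is MATLAB; the names here are my own):

```python
import itertools
import numpy as np

# Each of the four weights takes 9 values: -1, -0.75, ..., 0.75, 1.
values = np.linspace(-1.0, 1.0, 9)

# All 9^4 = 6561 candidate weight vectors, one per row
# (this plays the role of allvecs).
allvecs = np.array(list(itertools.product(values, repeat=4)))

def sample_targets(teacher_probs, rng):
    """Binary targets: target c is 1 with probability teacher_probs[c],
    i.e., the teacher's output on training case c."""
    return (rng.random(teacher_probs.shape) < teacher_probs).astype(int)
```

With a uniform prior, every row of allvecs starts with prior probability 1/9^4, and the posterior histogram in Figure 3 shows how the training data redistributes that mass.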
You should also try modifying maketeacher to create different kinds of teacher nets and see how the results depend on the particular teacher net.