THE MLP-BGD-1 METHOD

Regression and classification with multilayer perceptron networks
trained using batch gradient descent

Radford M. Neal, 23 July 1997

This method is applicable to regression and classification tasks with a
single target. For regression tasks, the network is trained to minimize
squared error; for binary targets, it is trained by maximum likelihood
with a logistic model; for categorical targets, a generalized logistic
("softmax") model is used. The method produces a single guess as its
prediction (the same guess for any loss function). Generalizations to
more than one target and to producing predictive distributions would be
possible, but are not currently supported.

The method uses multilayer perceptron networks with one hidden layer of
20 tanh units, with no direct input-output connections. An ensemble of
four such networks is trained for each task instance, based on a
division of the training data into four parts. Each network in the
ensemble is trained on 3/4 of the data, with the remaining 1/4 used to
choose the best network from the training run to use for predictions.
For a real-valued target, the prediction is the average of the outputs
of the four networks in the ensemble applied to the inputs in the test
case. For a binary or categorical target, the prediction is the target
value with highest probability in the predictive distribution found by
averaging the network outputs.

The standard DELVE encoding is used for all inputs and targets, except
that categorical targets must be encoded in the 0-up format (the
ordering of the categories is arbitrary), and binary targets as 0/1
(which of the two values is treated as passive is arbitrary).

Training is by batch gradient descent, using the program for this that
is part of Radford Neal's flexible Bayesian modeling software (release
of 1997-07-22). The shell files in the Source directory (described
below) contain the detailed commands for training using this software.
The software documentation also provides relevant details. The
following description is a brief summary.

Weights in the network are initialized randomly to values in the
interval (-0.01,+0.01), with a different random seed being used for
each task instance and each partition of the data. (It is hoped that
this range is small enough that its exact size has little effect on the
results.) Weights are then updated by gradient descent, with gradients
computed based on the entire set of cases being used for training (3/4
of the total data available). A fairly small stepsize is used, which it
is hoped will be adequate for most problems, given that the data is
normalized in the default DELVE manner. The stability of the learning
(in terms of training error) should be checked in any particular
application, and the stepsize reduced if required, but this has not
proved necessary in the tests on DELVE datasets done to date.

Training is continued for 20,000 iterations, which appeared to be
sufficient in preliminary tests on artificial non-DELVE datasets, and
on the kin family (i.e., the network selected for use as described
below is often from an early iteration). Networks are saved at 63
iterations, distributed approximately logarithmically from iteration 1
to iteration 20,000 (see the shell files for the exact set). Once
training has finished, the best network from among those saved is
chosen based on the error (minus log likelihood) on the 1/4 of the
training set that was not used for gradient descent training. Four such
networks form the ensemble.
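As a concrete illustration of this procedure, the following is a rough
sketch in Python. It is not the actual implementation (which uses Neal's
software and the shell files described below), and the stepsize, random
number generation, and exact checkpoint schedule in the sketch are
assumptions made only for illustration. It shows, for the regression
case, how each network could be trained by batch gradient descent on
3/4 of the data, checkpointed at roughly logarithmically spaced
iterations, selected on the held-out 1/4, and combined by averaging:

  # Rough sketch only -- not Neal's fbm software. The stepsize and the
  # details of the checkpoint schedule are illustrative assumptions.

  import numpy as np

  N_HIDDEN = 20      # one hidden layer of 20 tanh units
  N_ITERS  = 20000   # total gradient descent iterations
  N_SAVED  = 63      # networks saved at ~63 log-spaced iterations
  STEPSIZE = 0.05    # assumed; the real stepsize is set in the shell files

  def init_net(n_in, rng):
      # weights initialized uniformly in (-0.01, +0.01)
      return {"W1": rng.uniform(-0.01, 0.01, (n_in, N_HIDDEN)),
              "b1": rng.uniform(-0.01, 0.01, N_HIDDEN),
              "W2": rng.uniform(-0.01, 0.01, N_HIDDEN),
              "b2": rng.uniform(-0.01, 0.01, 1)}

  def forward(net, X):
      h = np.tanh(X @ net["W1"] + net["b1"])   # hidden layer
      return h, h @ net["W2"] + net["b2"]      # linear output (regression)

  def train_one_net(X_tr, y_tr, X_val, y_val, seed):
      rng = np.random.default_rng(seed)        # different seed per partition
      net = init_net(X_tr.shape[1], rng)
      # iterations at which networks are saved, spaced roughly
      # logarithmically from 1 to 20,000
      save_at = set(np.unique(np.round(
          np.logspace(0, np.log10(N_ITERS), N_SAVED)).astype(int)))
      best, best_err = None, np.inf
      for it in range(1, N_ITERS + 1):
          h, out = forward(net, X_tr)
          err = out - y_tr                     # d(squared error)/d(output)
          # batch gradients over the full 3/4 training split
          gW2 = h.T @ err
          gb2 = err.sum()
          dh  = np.outer(err, net["W2"]) * (1 - h**2)
          gW1 = X_tr.T @ dh
          gb1 = dh.sum(axis=0)
          for k, g in zip(("W1", "b1", "W2", "b2"), (gW1, gb1, gW2, gb2)):
              net[k] = net[k] - STEPSIZE * g / len(y_tr)
          if it in save_at:
              # keep the saved network with lowest error on the held-out 1/4
              _, val_out = forward(net, X_val)
              val_err = np.mean((val_out - y_val) ** 2)
              if val_err < best_err:
                  best_err = val_err
                  best = {k: v.copy() for k, v in net.items()}
      return best

  def ensemble_predict(X_train, y_train, X_test, seed0=0):
      # divide the training data into four parts; train one network with
      # each part held out, then average the four outputs on the test cases
      parts = np.array_split(np.arange(len(y_train)), 4)
      preds = []
      for i, held_out in enumerate(parts):
          mask = np.ones(len(y_train), dtype=bool)
          mask[held_out] = False
          net = train_one_net(X_train[mask], y_train[mask],
                              X_train[~mask], y_train[~mask], seed0 + i)
          preds.append(forward(net, X_test)[1])
      return np.mean(preds, axis=0)

The binary and categorical cases differ only in the output model
(logistic or softmax) and in using minus log likelihood as the training
and selection criterion, as described above.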
The method is implemented using shell files specific to the target
type: "runr" is used for regression tasks (real-valued target), "runb"
for binary classification tasks, and "runc" for multi-way
classification tasks. The runr and runb shell files take the number of
inputs and the instance number as arguments; the runc shell file takes
the number of categories, the number of inputs, and the instance
number. When applied to instance N, the shell files apply the method to
the training data in the DELVE train.N file and make predictions for
the cases in test.N, storing the predictions in cguess.N. The iteration
numbers of the four networks selected for use are stored in find-num.N.
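For example (the argument values here are hypothetical, chosen only to
illustrate the argument order just described), instance 2 of a
regression task with 8 inputs, and instance 2 of a 3-way classification
task with 8 inputs, would be run as something like:

  runr 8 2      # regression: 8 inputs, instance 2
  runc 3 8 2    # classification: 3 categories, 8 inputs, instance 2

producing the files cguess.2 and find-num.2 for that instance.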