THE MLP-BGD-1 METHOD

Regression and classification with multilayer perceptron networks
trained using batch gradient descent

Radford M. Neal, 23 July 1997

This method is applicable to regression and classification tasks with a
single target. For regression tasks, the network is trained to minimize
squared error; for binary targets, it is trained by maximum likelihood
with a logistic model; for categorical targets, a generalized logistic
("softmax") model is used. The method produces a single guess as its
prediction (the same guess for any loss function). Generalizations to
more than one target and to producing predictive distributions would be
possible, but are not currently supported.

The method uses multilayer perceptron networks with one hidden layer of
20 tanh units, with no direct input-output connections. An ensemble of
four such networks is trained for each task instance, based on a
division of the training data into four parts. Each network in the
ensemble is trained on 3/4 of the data, with the remaining 1/4 used to
choose the best network from the training run to use for predictions.
For a real-valued target, the prediction is the average of the outputs
of the four networks in the ensemble applied to the inputs in the test
case. For a binary or categorical target, the prediction is the target
value with highest probability in the predictive distribution found by
averaging the network outputs.

The standard DELVE encoding is used for all inputs and targets, except
that categorical targets must be encoded in the 0-up format (the
ordering of the categories is arbitrary), and binary targets as 0/1
(which of the two values is treated as passive is arbitrary).

Training is by batch gradient descent, using the program for this that
is part of Radford Neal's flexible Bayesian modeling software (release
of 1997-07-22). The shell files in the Source directory (described
below) contain the detailed commands for training using this software.
The software documentation also provides relevant details. The
following description is a brief summary.

Weights in the network are initialized randomly to values in the
interval (-0.01,+0.01), with a different random seed being used for
each task instance and each partition of the data. (It is hoped that
this range is small enough that its exact size has little effect on the
results.) Weights are then updated by gradient descent, with gradients
computed based on the entire set of cases being used for training (3/4
of the total data available). A fairly small stepsize is used, which it
is hoped will be adequate for most problems, given that the data is
normalized in the default DELVE manner. The stability of the learning
(in terms of training error) should be checked in any particular
application, and the stepsize reduced if required, but this has not
proved necessary in the tests on DELVE datasets done to date.

Training is continued for 20,000 iterations, which appeared to be
sufficient in preliminary tests on artificial non-DELVE datasets, and
on the kin family (i.e., the network selected for use as described
below is often from an early iteration). Networks are saved at 63
iterations, distributed approximately logarithmically from iteration 1
to iteration 20,000 (see the shell files for the exact set). Once
training has finished, the best network from among those saved is
chosen based on the error (minus log likelihood) on the 1/4 of the
training set that was not used for gradient descent training. Four such
networks form the ensemble.
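As a concrete illustration of this procedure, the following is a rough
sketch in Python. It is not the actual implementation (which uses Neal's
software and the shell files described below), and the stepsize, random
number generation, and exact checkpoint schedule in the sketch are
assumptions made only for illustration. It shows, for the regression
case, how each network could be trained by batch gradient descent on
3/4 of the data, checkpointed at roughly logarithmically spaced
iterations, selected on the held-out 1/4, and combined by averaging:

  # Rough sketch only -- not Neal's fbm software. The stepsize and the
  # details of the checkpoint schedule are illustrative assumptions.

  import numpy as np

  N_HIDDEN = 20      # one hidden layer of 20 tanh units
  N_ITERS  = 20000   # total gradient descent iterations
  N_SAVED  = 63      # networks saved at ~63 log-spaced iterations
  STEPSIZE = 0.05    # assumed; the real stepsize is set in the shell files

  def init_net(n_in, rng):
      # weights initialized uniformly in (-0.01, +0.01)
      return {"W1": rng.uniform(-0.01, 0.01, (n_in, N_HIDDEN)),
              "b1": rng.uniform(-0.01, 0.01, N_HIDDEN),
              "W2": rng.uniform(-0.01, 0.01, N_HIDDEN),
              "b2": rng.uniform(-0.01, 0.01, 1)}

  def forward(net, X):
      h = np.tanh(X @ net["W1"] + net["b1"])   # hidden layer
      return h, h @ net["W2"] + net["b2"]      # linear output (regression)

  def train_one_net(X_tr, y_tr, X_val, y_val, seed):
      rng = np.random.default_rng(seed)        # different seed per partition
      net = init_net(X_tr.shape[1], rng)
      # iterations at which networks are saved, spaced roughly
      # logarithmically from 1 to 20,000
      save_at = set(np.unique(np.round(
          np.logspace(0, np.log10(N_ITERS), N_SAVED)).astype(int)))
      best, best_err = None, np.inf
      for it in range(1, N_ITERS + 1):
          h, out = forward(net, X_tr)
          err = out - y_tr                     # d(squared error)/d(output)
          # batch gradients over the full 3/4 training split
          gW2 = h.T @ err
          gb2 = err.sum()
          dh  = np.outer(err, net["W2"]) * (1 - h**2)
          gW1 = X_tr.T @ dh
          gb1 = dh.sum(axis=0)
          for k, g in zip(("W1", "b1", "W2", "b2"), (gW1, gb1, gW2, gb2)):
              net[k] = net[k] - STEPSIZE * g / len(y_tr)
          if it in save_at:
              # keep the saved network with lowest error on the held-out 1/4
              _, val_out = forward(net, X_val)
              val_err = np.mean((val_out - y_val) ** 2)
              if val_err < best_err:
                  best_err = val_err
                  best = {k: v.copy() for k, v in net.items()}
      return best

  def ensemble_predict(X_train, y_train, X_test, seed0=0):
      # divide the training data into four parts; train one network with
      # each part held out, then average the four outputs on the test cases
      parts = np.array_split(np.arange(len(y_train)), 4)
      preds = []
      for i, held_out in enumerate(parts):
          mask = np.ones(len(y_train), dtype=bool)
          mask[held_out] = False
          net = train_one_net(X_train[mask], y_train[mask],
                              X_train[~mask], y_train[~mask], seed0 + i)
          preds.append(forward(net, X_test)[1])
      return np.mean(preds, axis=0)

The binary and categorical cases differ only in the output model
(logistic or softmax) and in using minus log likelihood as the training
and selection criterion, as described above.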
The method is implemented using shell files specific to the target
type: "runr" is used for regression tasks (real-valued target), "runb"
for binary classification tasks, and "runc" for multi-way
classification tasks. The runr and runb shell files take the number of
inputs and the instance number as arguments; the runc shell file takes
the number of categories, the number of inputs, and the instance
number. When applied to instance N, the shell files apply the method to
the training data in the DELVE train.N file and make predictions for
the cases in test.N, storing the predictions in cguess.N. The iteration
numbers of the four networks selected for use are stored in find-num.N.
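For example (the argument values here are hypothetical, chosen only to
illustrate the argument order just described), instance 2 of a
regression task with 8 inputs, and instance 2 of a 3-way classification
task with 8 inputs, would be run as something like:

  runr 8 2      # regression: 8 inputs, instance 2
  runc 3 8 2    # classification: 3 categories, 8 inputs, instance 2

producing the files cguess.2 and find-num.2 for that instance.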