Carl Edward Rasmussen
May 1, 1996
The linear model is one of the most popular models used in data analysis. The reason for its popularity is that it is both conceptually and computationally simple to fit a linear model, and the resulting model is easily given an intuitive representation. The most prominent deficiency of the linear model is its strong assumptions about the true relationship in the data; for data which do not conform well to a linear model, predictions may be inaccurate.
I have implemented the linear model, which I will call lin-1. The strategy for fitting a linear model is well known, and the only heuristic necessary is a principled way of handling the situation where the linear system used to determine the parameters of the model is close to singular.
For simplicity, I will initially assume that the targets are scalar; later on a simple extension to handle multiple outputs will be given. The linear model is defined as
$$ y(\mathbf{x}) = \sum_{i=0}^{m} w_i x_i = \mathbf{w}^\top \mathbf{x} \qquad (1) $$

where $\mathbf{w} = [w_0, w_1, \ldots, w_m]^\top$ are the $m+1$ model parameters, $w_0$ being the bias, and $\mathbf{x} = [x_0, x_1, \ldots, x_m]^\top$ are the inputs, augmented by $x_0 = 1$ to take care of the bias. The model is fit to the training data $\{\mathbf{x}^{(c)}, t^{(c)}\}_{c=1}^{n}$ by maximum likelihood, assuming zero mean Gaussian noise with variance $\sigma^2$. The likelihood is

$$ p(\mathbf{t} \mid X, \mathbf{w}, \sigma^2) = \prod_{c=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\bigl(t^{(c)} - \mathbf{w}^\top \mathbf{x}^{(c)}\bigr)^2}{2\sigma^2}\right) \qquad (2) $$
The maximum likelihood estimate for the weights is independent of $\sigma^2$ and is found by minimizing the cost function

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{c=1}^{n} \bigl(t^{(c)} - \mathbf{w}^\top \mathbf{x}^{(c)}\bigr)^2 \qquad (3) $$

with respect to the model parameters. Notice that the Gaussian noise assumption gives rise to a squared error cost function; this cost function will always be used regardless of which loss function we choose to evaluate the linear model. The solution to this optimization problem is well known from linear algebra; if the solution is unique, then

$$ \hat{\mathbf{w}} = A^{-1} X^\top \mathbf{t}, \qquad \text{where } A = X^\top X, \qquad (4) $$

with $X$ the $n \times (m+1)$ matrix whose rows are the augmented training inputs and $\mathbf{t}$ the vector of training targets.
If $A$ is ill-conditioned, numerical evaluation of eq. (4) may be troublesome, and even if fairly accurate solutions could be obtained these would not necessarily lead to good model predictions. In an attempt to define a method which has a high degree of reproducibility, I propose to choose $\mathbf{w}$ such that directions in input space with insufficient variation in the training set are ignored by the model. This can conveniently be computed using singular value decomposition (SVD), see for example [Press et al. 1992]. The decomposition is

$$ A = U \,\mathrm{diag}(\lambda_1, \ldots, \lambda_{m+1})\, V^\top, \qquad (5) $$

where $U$ and $V$ are orthonormal matrices and $\lambda$ is a vector of length $m+1$ containing the singular values of $A$. A regularised $A^{-1}$ can be computed by setting $1/\lambda_i = 0$ for those $i$ whose $\lambda_i$ are too small. The exact criterion for regularisation is: set $1/\lambda_i = 0$ whenever $\lambda_i < \epsilon \max_j \lambda_j$, i.e., whenever $A$ is close to singular. The constant $\epsilon$ is chosen as a rather conservative estimate of machine precision which will not interfere when $A$ is well-conditioned. The condition number of $A$ depends on the scale of the inputs, so the procedure should always be used in conjunction with the standard normalisations provided by DELVE. The (modified) maximum likelihood weights can then be computed as

$$ \hat{\mathbf{w}} = V \,\mathrm{diag}(1/\lambda_1, \ldots, 1/\lambda_{m+1})\, U^\top X^\top \mathbf{t}. \qquad (6) $$
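As a concrete illustration of eqs. (4)-(6), here is a minimal NumPy sketch of the fitting procedure. It assumes the inputs have already been normalised and augmented with a column of ones; the function name `fit_lin1` and the default threshold `eps` are illustrative choices, not part of the original specification.

```python
import numpy as np

def fit_lin1(X, t, eps=1e-10):
    """Fit the linear model by (regularised) maximum likelihood.

    X   : (n, m+1) array of inputs, already augmented with a column of ones.
    t   : (n,) array of targets.
    eps : relative threshold below which singular values are discarded;
          plays the role of the machine-precision constant epsilon in the
          text (the default value here is illustrative).
    Returns the weights w_hat, the regularised inverse of A, and k, the
    effective number of fitted parameters.
    """
    A = X.T @ X                                     # eq. (4): A = X^T X
    U, lam, Vt = np.linalg.svd(A)                   # eq. (5): A = U diag(lam) V^T
    inv_lam = np.where(lam > eps * lam.max(), 1.0 / lam, 0.0)  # zero out tiny singular values
    A_inv = Vt.T @ np.diag(inv_lam) @ U.T           # regularised inverse of A
    w_hat = A_inv @ X.T @ t                         # eq. (6): regularised ML weights
    k = int(np.count_nonzero(inv_lam))              # number of retained directions
    return w_hat, A_inv, k
```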
We now derive the predictive distribution for the model. For simplicity we will derive the predictive distribution for a fixed value of the noise, and only account for uncertainty in the predictions arising from the estimated noise inherent in the data and from uncertainty in the estimate for the weights. The noise is estimated by

$$ \hat{\sigma}^2 = \frac{1}{n-k} \sum_{c=1}^{n} \bigl(t^{(c)} - \hat{\mathbf{w}}^\top \mathbf{x}^{(c)}\bigr)^2, \qquad (7) $$

where $k$ is the number of parameters in $\hat{\mathbf{w}}$ which were fit by the data; this is $m+1$ minus 1 for every singular value whose reciprocal was set to zero in eq. (5). This estimate may break down if there are too few training cases (if $n \le k$), in which case we can't compute a predictive distribution.
Since the prior on the weights is uniform (non-existent), the posterior for the weights is proportional to the likelihood. Thus, the predictive distribution for a test case with input $\mathbf{x}^{(n+1)}$ is

$$ p\bigl(t^{(n+1)} \mid \mathbf{x}^{(n+1)}, D, \hat{\sigma}^2\bigr) \;\propto\; \int p\bigl(t^{(n+1)} \mid \mathbf{x}^{(n+1)}, \mathbf{w}, \hat{\sigma}^2\bigr)\, p\bigl(\mathbf{t} \mid X, \mathbf{w}, \hat{\sigma}^2\bigr)\, d\mathbf{w}. \qquad (8) $$

This Gaussian integral can be solved exactly; we get a Gaussian predictive distribution with mean and variance

$$ \hat{y}^{(n+1)} = \hat{\mathbf{w}}^\top \mathbf{x}^{(n+1)}, \qquad \hat{\sigma}^2_{(n+1)} = \hat{\sigma}^2 \bigl(1 + \mathbf{x}^{(n+1)\top} A^{-1} \mathbf{x}^{(n+1)}\bigr), \qquad (9) $$

where $A^{-1}$ is the regularised inverse from eq. (5).
The optimal point predictions for any symmetric loss function are given by $\hat{y}^{(n+1)}$. Consequently, these predictions can be used for both absolute error and squared error loss functions. For the negative log density loss function, we compute the log of the density of the targets under the predictive Gaussian distribution,

$$ \log p\bigl(t^{(n+1)}\bigr) = -\tfrac{1}{2} \log\bigl(2\pi\hat{\sigma}^2_{(n+1)}\bigr) - \frac{\bigl(t^{(n+1)} - \hat{y}^{(n+1)}\bigr)^2}{2\hat{\sigma}^2_{(n+1)}}. \qquad (10) $$
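A small sketch of how eqs. (7), (9) and (10) could be computed, continuing the hypothetical `fit_lin1` above; the function name `predict_lin1` and its interface are again illustrative.

```python
import numpy as np

def predict_lin1(X, t, x_star, w_hat, A_inv, k):
    """Predictive mean, variance and log density for one test input x_star.

    Follows eqs. (7), (9) and (10); assumes n > k so the noise estimate
    is well defined. Names and interface are illustrative.
    """
    n = X.shape[0]
    resid = t - X @ w_hat
    sigma2 = resid @ resid / (n - k)                  # eq. (7): inherent noise estimate
    mean = w_hat @ x_star                             # eq. (9): predictive mean
    var = sigma2 * (1.0 + x_star @ A_inv @ x_star)    # eq. (9): predictive variance
    def log_density(t_star):                          # eq. (10): log predictive density
        return -0.5 * np.log(2.0 * np.pi * var) - (t_star - mean) ** 2 / (2.0 * var)
    return mean, var, log_density
```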
For tasks that have multiple outputs we can re-use the decomposition of A from eq. (5) for every set of weights. The maximum likelihood weights and inherent noise estimates can be found by using eq. (6) and (7) for each target attribute, and point predictions can be obtained from the maximum likelihood weights as before. For the log density predictions we assume that the joint density for the targets is Gaussian with a diagonal covariance matrix, such that the density of the targets can be obtained by summing (in the log domain) contributions of the form in eq. (10).
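For the multi-output case, a sketch of how the single decomposition of $A$ might be shared across all target columns (hypothetical names; `eps` again stands in for the constant $\epsilon$):

```python
import numpy as np

def fit_lin1_multi(X, T, eps=1e-10):
    """Multi-output variant: one decomposition of A serves every output.

    X : (n, m+1) augmented inputs; T : (n, p) targets, one column per output.
    """
    A = X.T @ X
    U, lam, Vt = np.linalg.svd(A)                        # decomposition reused for every output
    inv_lam = np.where(lam > eps * lam.max(), 1.0 / lam, 0.0)
    A_inv = Vt.T @ np.diag(inv_lam) @ U.T
    k = int(np.count_nonzero(inv_lam))                   # same k for every output
    W_hat = A_inv @ X.T @ T                              # eq. (6) applied column by column
    Sigma2 = ((T - X @ W_hat) ** 2).sum(axis=0) / (X.shape[0] - k)  # eq. (7) per output
    return W_hat, Sigma2, A_inv, k
```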
This completes the definition of the linear model; following this recipe the model will always be uniquely defined from the data. Many elaborations to this basic linear model exist, but they will not be pursued further here.
The computational complexity involved in fitting lin-1 is $O(nm^2)$ and $O(m^3)$, for computing and decomposing the matrix $A$ respectively. Even for fairly large tasks this can usually be considered trivial.
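Purely as an illustration, the sketches above could be exercised end to end on synthetic data along these lines:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, m))])   # augmented inputs
w_true = rng.standard_normal(m + 1)
t = X @ w_true + 0.1 * rng.standard_normal(n)                   # noisy linear targets

w_hat, A_inv, k = fit_lin1(X, t)                                # eqs. (4)-(6)
mean, var, log_density = predict_lin1(X, t, X[0], w_hat, A_inv, k)  # eqs. (7), (9), (10)
print(mean, var, log_density(t[0]))
```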