Approximating full Bayesian learning in a
neural network
If the neural net has only a few parameters, we could put
a grid over the parameter space and evaluate p( W | D )
at each grid point.
This is expensive, but it does not involve any gradient
descent, so there are no problems with local optima.
After evaluating every grid point, we use all of them to
make predictions on test data, weighting each grid
point's prediction by its posterior probability.
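
Written out, the prediction step might look like the following. The symbols W_g (the g-th grid point), x_test and t_test (a test input and its target) are notation assumed here for illustration; only p( W | D ) appears in the text above.

```latex
p(t_{\text{test}} \mid x_{\text{test}}, D)
  \;\approx\; \sum_{g} p(W_g \mid D)\, p(t_{\text{test}} \mid x_{\text{test}}, W_g),
\qquad
p(W_g \mid D) \;=\; \frac{p(W_g)\, p(D \mid W_g)}{\sum_{j} p(W_j)\, p(D \mid W_j)}
```

The denominator sums over grid points only, so the posterior is normalized over the grid rather than over the full continuous parameter space.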
This is also expensive, but it works much better than
maximum likelihood learning when the posterior is vague
or multimodal (this happens when data is scarce).
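
The grid procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the two-parameter net, the Gaussian noise model with noise_std = 0.5, the uniform prior over the grid, the 9 grid values per parameter, and the toy data are all hypothetical choices made for the example.

```python
import itertools
import numpy as np

def net(w, x):
    """Hypothetical tiny net: one tanh hidden unit, two parameters."""
    w1, w2 = w
    return w2 * np.tanh(w1 * x)

def grid_posterior(x, t, grid_values, noise_std=0.5):
    """Evaluate p(W | D) at every grid point.

    Assumes a uniform prior over the grid and Gaussian output noise,
    so the posterior is proportional to the likelihood.
    """
    # Every combination of allowed values for the two parameters.
    grid = np.array(list(itertools.product(grid_values, repeat=2)))
    log_lik = np.array([
        -0.5 * np.sum((t - net(w, x)) ** 2) / noise_std ** 2 for w in grid
    ])
    # Subtract the max before exponentiating for numerical stability.
    post = np.exp(log_lik - log_lik.max())
    return grid, post / post.sum()

def predict(x_test, grid, post):
    """Average the predictions of all grid points, weighted by p(W | D)."""
    preds = np.array([net(w, x_test) for w in grid])  # shape (G, N)
    return post @ preds

# Scarce toy data (made up for illustration).
x_train = np.array([-1.0, 0.5, 1.0])
t_train = np.array([-0.8, 0.4, 0.7])

grid_values = np.linspace(-2.0, 2.0, 9)  # 9 values per parameter, 81 grid points
grid, post = grid_posterior(x_train, t_train, grid_values)

x_test = np.array([-0.5, 0.0, 2.0])
y_bayes = predict(x_test, grid, post)   # uses every grid point
w_best = grid[np.argmax(post)]          # single best grid point, for comparison
y_best = net(w_best, x_test)
```

No gradient is computed anywhere: the cost is one likelihood evaluation per grid point, and the number of grid points grows exponentially with the number of parameters. In this toy net (w1, w2) and (-w1, -w2) define the same function, so the grid posterior has two equal modes, which makes it a small example of the multimodal case.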