Making predictions once the tree has been
learned
If we are doing regression and our loss function
is the squared error, then average the outputs of
all the experts using the probabilities of paths
through the gating tree as the weights.
This avoids discontinuities at boundaries
between regions, because the probabilities are
soft.
Its like the use of sigmoid units to allow
gradient descent in training feed-forward nets