• If we use a large set of nonadaptive features, we can often make the two classes linearly separable.
  – But if we just fit any old separating plane, it will not generalize well to new cases.


• If we fit the separating plane that maximizes the margin (the minimum distance to any of the data points), we will get much better generalization.
  – Intuitively, by maximizing the margin we are squeezing out all the surplus capacity that came from using a high-dimensional feature space.
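The max-margin idea can be sketched numerically: minimizing the SVM primal objective (hinge loss plus a small L2 penalty) by subgradient descent drives the plane toward the maximum-margin separator, and the resulting geometric margin is 1/||w||. The toy data, penalty, and learning rate below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy 2-D linearly separable data (hypothetical example); labels in {-1, +1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w = np.zeros(2)
b = 0.0
lam = 0.01   # small L2 penalty -> close to a hard-margin fit
lr = 0.01

for _ in range(20000):
    scores = y * (X @ w + b)
    viol = scores < 1.0  # points inside the margin contribute to the hinge loss
    # Subgradient of (lam/2)||w||^2 + mean(max(0, 1 - y*(w.x + b)))
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
    grad_b = -y[viol].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

# Geometric margin of the learned plane: the minimum distance from any
# point to the plane is approximately 1/||w|| at the max-margin solution.
margin = 1.0 / np.linalg.norm(w)
dists = np.abs(X @ w + b) / np.linalg.norm(w)
```

Any plane with all `scores` positive separates the data; only the max-margin one also makes `dists.min()` as large as possible.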


• This can be justified by a whole lot of clever mathematics which shows that:
  – large-margin separators have lower VC dimension.
  – models with lower VC dimension have a smaller gap between the training and test error rates.
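The second point can be made concrete with one standard form of the VC generalization bound (a sketch, not the slides' own derivation): with probability at least 1 − η over a training set of n cases, a model class of VC dimension h satisfies

\[
E_{\text{test}} \;\le\; E_{\text{train}} \;+\; \sqrt{\frac{h\left(\ln(2n/h) + 1\right) - \ln(\eta/4)}{n}}
\]

so lowering h (as large margins do) directly shrinks the gap term added to the training error.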

