• If we use a large set of nonadaptive features, we can often make the two classes linearly separable.
  – But if we just fit any old separating plane, it will not generalize well to new cases.


• If we fit the separating plane that maximizes the margin (the minimum distance to any of the data points), we will get much better generalization.
  – Intuitively, by maximizing the margin we are squeezing out all the surplus capacity that came from using a high-dimensional feature space.
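The max-margin idea can be sketched numerically: minimizing the SVM primal objective (hinge loss plus a small L2 penalty) by subgradient descent drives the plane toward the maximum-margin separator, and the resulting geometric margin is 1/||w||. The toy data, penalty, and learning rate below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy 2-D linearly separable data (hypothetical example); labels in {-1, +1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w = np.zeros(2)
b = 0.0
lam = 0.01   # small L2 penalty -> close to a hard-margin fit
lr = 0.01

for _ in range(20000):
    scores = y * (X @ w + b)
    viol = scores < 1.0  # points inside the margin contribute to the hinge loss
    # Subgradient of (lam/2)||w||^2 + mean(max(0, 1 - y*(w.x + b)))
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
    grad_b = -y[viol].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

# Geometric margin of the learned plane: the minimum distance from any
# point to the plane is approximately 1/||w|| at the max-margin solution.
margin = 1.0 / np.linalg.norm(w)
dists = np.abs(X @ w + b) / np.linalg.norm(w)
```

Any plane with all `scores` positive separates the data; only the max-margin one also makes `dists.min()` as large as possible.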


• This can be justified by a whole lot of clever mathematics which shows that:
  – large-margin separators have lower VC dimension.
  – models with lower VC dimension have a smaller gap between the training and test error rates.
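The second point can be made concrete with one standard form of the VC generalization bound (a sketch, not the slides' own derivation): with probability at least 1 − η over a training set of n cases, a model class of VC dimension h satisfies

\[
E_{\text{test}} \;\le\; E_{\text{train}} \;+\; \sqrt{\frac{h\left(\ln(2n/h) + 1\right) - \ln(\eta/4)}{n}}
\]

so lowering h (as large margins do) directly shrinks the gap term added to the training error.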

