• If we use a large set of non-adaptive features, we can often make the two classes linearly separable.
  – But if we just fit any old separating plane, it will not generalize well to new cases.
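The first point can be sketched with a toy example (the data and the fixed feature map here are hypothetical, chosen only for illustration): a 1-D problem that is not linearly separable becomes separable after mapping each input through non-adaptive features.

```python
import numpy as np

# A hypothetical 1-D two-class problem that is NOT linearly separable:
# class +1 sits in the middle, class -1 on both sides.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# A fixed (non-adaptive) feature map: phi(x) = (x, x^2).
phi = np.stack([x, x**2], axis=1)

# In this 2-D feature space the plane x^2 = 2.5
# (weights w = (0, -1), bias b = 2.5) separates the classes.
w, b = np.array([0.0, -1.0]), 2.5
predictions = np.sign(phi @ w + b)
print(np.all(predictions == y))  # True: separable in feature space
```

The features are "non-adaptive" in the sense that phi is chosen once, by hand, and is not learned from the data.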
• If we fit the separating plane that maximizes the margin (the minimum distance to any of the data points), we will get much better generalization.
  – Intuitively, by maximizing the margin we are squeezing out all the surplus capacity that came from using a high-dimensional feature space.
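Margin maximization can be sketched numerically (a minimal toy, not a production SVM; the data, learning rate, and regularization constant are all illustrative assumptions): minimizing the hinge objective lam/2 * ||w||^2 + mean(max(0, 1 - y*(X@w + b))) by sub-gradient descent pushes every point to the correct side with the largest affordable margin.

```python
import numpy as np

# Two hypothetical linearly separable clusters of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.3, (20, 2)),
               rng.normal([-2.0, -2.0], 0.3, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                  # points inside the margin band
    # Sub-gradient of the hinge objective above.
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
    grad_b = -y[viol].sum() / len(y)
    w -= lr * grad_w
    b -= lr * grad_b

# Geometric margin: the minimum distance from any point to the plane.
margin = (y * (X @ w + b) / np.linalg.norm(w)).min()
print(margin > 0)  # True: every point is on the correct side
```

A positive geometric margin means the learned plane separates the two classes; shrinking lam lets the optimizer widen the margin further on separable data.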
• This can be justified by a whole lot of clever mathematics which shows that
  – large margin separators have lower VC dimension.
  – models with lower VC dimension have a smaller gap between the training and test error rates.
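One standard form of the VC generalization bound (due to Vapnik) makes the last point concrete: with probability at least 1 - eta over the choice of N training cases, a model class of VC dimension h satisfies

```latex
E_{\text{test}} \;\le\; E_{\text{train}}
  + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) + \ln\frac{4}{\eta}}{N}}
```

so the gap between training and test error shrinks as h decreases (or as N grows), which is why squeezing out surplus capacity via a large margin helps.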