 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
 |
| • |
If we use a
large set of non-adaptive features, we can
|
|
|
often make the
two classes linearly separable.
|
|
|
– |
But
if we just fit any old separating plane, it will not
|
|
|
generalize
well to new cases.
|
|
| • |
If we fit the
separating plane that maximizes the margin
|
|
|
(the minimum
distance to any of the data points), we will
|
|
|
get much better
generalization.
|
|
|
– |
Intuitively,
we are squeezing out all the surplus
|
|
|
capacity
that came from using a high-dimensional
|
|
|
feature
space.
|
|
| • |
This can be
justified by a whole lot of clever mathematics
|
|
which shows that
|
|
|
– |
large
margin separators have lower VC dimension.
|
|
|
– |
models
with lower VC dimension have a smaller gap
|
|
|
between
the training and test error rates.
|
|