• If we use a large set of nonadaptive features, we can often make the two classes linearly separable.
  – But if we just fit any old separating plane, it will not generalize well to new cases.


• If we fit the separating plane that maximizes the margin (the minimum distance to any of the data points), we will get much better generalization.
  – Intuitively, by maximizing the margin we are squeezing out all the surplus capacity that came from using a high-dimensional feature space.
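The max-margin idea can be sketched numerically: minimizing the SVM primal objective (hinge loss plus a small L2 penalty) by subgradient descent drives the plane toward the maximum-margin separator, and the resulting geometric margin is 1/||w||. The toy data, penalty, and learning rate below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy 2-D linearly separable data (hypothetical example); labels in {-1, +1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

w = np.zeros(2)
b = 0.0
lam = 0.01   # small L2 penalty -> close to a hard-margin fit
lr = 0.01

for _ in range(20000):
    scores = y * (X @ w + b)
    viol = scores < 1.0  # points inside the margin contribute to the hinge loss
    # Subgradient of (lam/2)||w||^2 + mean(max(0, 1 - y*(w.x + b)))
    grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
    grad_b = -y[viol].sum() / len(X)
    w -= lr * grad_w
    b -= lr * grad_b

# Geometric margin of the learned plane: the minimum distance from any
# point to the plane is approximately 1/||w|| at the max-margin solution.
margin = 1.0 / np.linalg.norm(w)
dists = np.abs(X @ w + b) / np.linalg.norm(w)
```

Any plane with all `scores` positive separates the data; only the max-margin one also makes `dists.min()` as large as possible.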


• This can be justified by a whole lot of clever mathematics which shows that:
  – large-margin separators have lower VC dimension.
  – models with lower VC dimension have a smaller gap between the training and test error rates.
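The second point can be made concrete with one standard form of the VC generalization bound (a sketch, not the slides' own derivation): with probability at least 1 − η over a training set of n cases, a model class of VC dimension h satisfies

\[
E_{\text{test}} \;\le\; E_{\text{train}} \;+\; \sqrt{\frac{h\left(\ln(2n/h) + 1\right) - \ln(\eta/4)}{n}}
\]

so lowering h (as large margins do) directly shrinks the gap term added to the training error.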

