lecture 12

The story so far

•

If we use a large set of non-adaptive features, we can

often make the two classes linearly separable.

–

But if we just fit any old separating plane, it will not

generalize well to new cases.

•

If we fit the separating plane that maximizes the margin

(the minimum distance to any of the data points), we will

get much better generalization.

–

Intuitively, we are squeezing out all the surplus

capacity that came from using a high-dimensional

feature space.

•

This can be justified by a whole lot of clever mathematics

which shows that

–

large margin separators have lower VC dimension.

–

models with lower VC dimension have a smaller gap

between the training and test error rates.