• If we use a neural net to define the features, maybe we can use convex optimization for the final layer of weights and then backpropagate derivatives to “learn the kernel”.

• The convex optimization is quadratic in the number of training cases. So this approach works best when most of the data is unlabelled.

– Unsupervised pretraining can then use the unlabelled data to learn features that are appropriate for the domain.

– The final convex optimization can use these features as well as possible and also provide derivatives that allow them to be finetuned.

– This seems better than just trying lots of kernels and selecting the best ones (which is the current method).


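A minimal NumPy sketch of the scheme described above, under loud assumptions: the toy data, network shape, and all names are illustrative; the unsupervised pretraining step is elided (the feature weights start random); and ridge regression stands in for the convex optimization of the final layer, with its closed-form solution recomputed before each finetuning step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled set. The convex step's cost grows quadratically with the
# number of labelled cases, which is why the scheme favours having few.
X = rng.normal(size=(40, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer defines the features phi(x) = tanh(x W1).
# (In the slide's scheme W1 would come from unsupervised pretraining;
# here it is just randomly initialised.)
W1 = rng.normal(scale=0.5, size=(5, 20))

lam, lr = 1e-2, 0.02   # ridge penalty; finetuning step size
losses = []

for step in range(300):
    H = np.tanh(X @ W1)  # features produced by the net

    # Convex step: ridge regression gives the final layer of weights
    # in closed form for the current features.
    W2 = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

    # Backpropagate squared-error derivatives into W1 to "learn the
    # kernel" (W2 is held fixed during this gradient step).
    err = H @ W2 - y
    losses.append(float(np.mean(err ** 2)))
    dH = err @ W2.T
    dW1 = X.T @ (dH * (1.0 - H ** 2)) / len(X)
    W1 -= lr * dW1
```

Alternating an exact convex solve for the last layer with gradient steps on the feature weights is one simple way to realise "learn the kernel"; the training loss under this loop should fall as the features adapt to the task.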