A hybrid approach
If we use a neural net to define the features, maybe we
can use convex optimization for the final layer of weights
and then backpropagate derivatives to “learn the kernel”.
The convex optimization is quadratic in the number of
labelled training cases, so it only scales to small
labelled sets. This approach therefore works best when
most of the data is unlabelled.
Unsupervised pre-training can then use the
unlabelled data to learn features that are appropriate
for the domain.
The final convex optimization can use these features
as well as possible and also provide derivatives that
allow them to be fine-tuned.
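The loop described above can be sketched as alternating optimization: a convex solve fits the final-layer weights exactly on the current features, and a backprop step then fine-tunes the feature weights. The sketch below is purely illustrative — it uses closed-form ridge regression as a stand-in for the convex optimizer, one layer of tanh features in place of a pre-trained net, and toy sine-curve data; none of these specifics come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: learn y = sin(x) from 200 labelled points.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

n_hidden, lam, lr = 20, 1.0, 0.02
W = rng.normal(scale=0.5, size=(1, n_hidden))  # feature ("kernel") parameters
b = rng.normal(size=n_hidden)

def features(X):
    """Neural-net features; they define the kernel k(x, x') = phi(x).phi(x')."""
    return np.tanh(X @ W + b)

history = []
for step in range(200):
    Phi = features(X)                                   # N x H feature matrix
    # Convex step: ridge regression gives the optimal final-layer
    # weights in closed form (cost quadratic in the labelled cases).
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_hidden), Phi.T @ y)
    resid = Phi @ w - y
    history.append(np.mean(resid ** 2))
    # Backprop step: gradient of the squared error w.r.t. the feature
    # parameters, holding the final-layer weights w fixed.
    grad_pre = (2.0 / len(X)) * resid[:, None] * w[None, :] * (1 - Phi ** 2)
    W -= lr * X.T @ grad_pre
    b -= lr * grad_pre.sum(axis=0)

print(f"MSE: {history[0]:.4f} -> {history[-1]:.4f}")
```

Holding w fixed during the gradient step is a simplification; differentiating through the convex solution itself would give the exact "learn the kernel" derivatives but is messier to write out.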
This seems better than just trying lots of hand-chosen
kernels and selecting the one that performs best, which
is the current practice.