lec2a

• Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.

• Most of the information in the final weights comes from modeling the distribution of input vectors.
  – The precious information in the labels is only used for the final fine-tuning.

• We do not start backpropagation until we have sensible weights that already do well at the task.
  – So the learning is well-behaved and quite fast.
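
The recipe in the bullets above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the lecture's own code: it assumes each layer is a binary restricted Boltzmann machine trained with one step of contrastive divergence (CD-1), and the names `train_rbm`, `pretrain_stack`, `fine_tune` and `predict`, along with all hyperparameters, are hypothetical. Note that `pretrain_stack` never sees the labels; they enter only in `fine_tune`, which runs backpropagation starting from the pretrained weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=20, lr=0.1, rng=None):
    """Train one binary RBM with CD-1 (assumed learner); returns (W, hidden_bias)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities driven by the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one reconstruction step.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, layer_sizes, rng=None):
    """Greedy layer-wise pretraining; uses input vectors only, no labels."""
    if rng is None:
        rng = np.random.default_rng(0)
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden, rng=rng)
        layers.append((W, b))
        x = sigmoid(x @ W + b)  # features become the next layer's "data"
    return layers

def fine_tune(layers, data, labels, epochs=200, lr=0.5, rng=None):
    """Backpropagation through the pretrained stack plus a new logistic output unit."""
    if rng is None:
        rng = np.random.default_rng(1)
    layers = list(layers)
    w_out = 0.01 * rng.standard_normal(layers[-1][0].shape[1])
    b_out = 0.0
    n = data.shape[0]
    for _ in range(epochs):
        # Forward pass, keeping every layer's activations.
        acts = [data]
        for W, b in layers:
            acts.append(sigmoid(acts[-1] @ W + b))
        p = sigmoid(acts[-1] @ w_out + b_out)
        # Backward pass for the cross-entropy loss.
        delta_out = (p - labels) / n
        delta = np.outer(delta_out, w_out) * acts[-1] * (1 - acts[-1])
        w_out = w_out - lr * (acts[-1].T @ delta_out)
        b_out = b_out - lr * delta_out.sum()
        for i in reversed(range(len(layers))):
            W, b = layers[i]
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:
                # Propagate the error with the pre-update weights.
                delta = (delta @ W.T) * acts[i] * (1 - acts[i])
            layers[i] = (W - lr * grad_W, b - lr * grad_b)
    return layers, w_out, b_out

def predict(layers, w_out, b_out, data):
    x = data
    for W, b in layers:
        x = sigmoid(x @ W + b)
    return sigmoid(x @ w_out + b_out)

# Toy usage: random binary inputs; the label is simply the first input bit.
rng = np.random.default_rng(0)
X = (rng.random((40, 6)) < 0.5).astype(float)
y = X[:, 0]
layers = pretrain_stack(X, [5, 4], rng=rng)        # unsupervised stage
layers, w_out, b_out = fine_tune(layers, X, y)     # labels used only here
probs = predict(layers, w_out, b_out, X)
```

The split into two functions mirrors the slide: all of the layer-wise work is done on input vectors alone, and the labelled data is touched only in the final, comparatively short, backpropagation stage.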