CSC2515 Fall 2007
Introduction to Machine Learning
Lecture 7: Decision Trees and Mixtures of Experts
Decision Trees: Non-linear regression or classification with very little computation
A very fast way to decide if a datapoint lies in a region
How do we decide what tests to use?
Two measures of classification “impurity”
When should we stop adding nodes?
Advantages and disadvantages of decision trees
An alternative to axis-aligned hyper-planes
Computing on which side a training point lies
Partitioning based on input alone versus partitioning based on input-output relationship
A picture of why averaging is bad
Making an error function that encourages specialization instead of cooperation
The mixture of experts architecture
The derivatives of the simple cost function
Another view of mixtures of experts
Giving a whole distribution as output
The probability distribution that is implicitly assumed when using squared error
The probability of the correct answer under a mixture of Gaussians
A natural error measure for a Mixture of Experts
A picture of two imaginary vowels and a mixture of two linear experts after learning
Hierarchical mixtures of experts
The generative model for an HMoE
Making predictions once the tree has been learned
Newton updates for a logistic output with cross-entropy error
Is an HMoE better than a flat MoE?
A different (and better?) type of hierarchy for a mixture of experts
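
The outline pairs squared error with an implicit Gaussian assumption and then scores a mixture of experts by the probability of the correct answer under a mixture of Gaussians. A minimal sketch of that error measure follows. It assumes softmax gating, a shared isotropic variance `sigma`, and hypothetical names (`moe_negative_log_likelihood`, `gate_logits`, `expert_outputs`), none of which appear in the slide titles.

```python
import numpy as np

def moe_negative_log_likelihood(gate_logits, expert_outputs, target, sigma=1.0):
    """Negative log probability of `target` under a mixture of Gaussians.

    gate_logits:    shape (n_experts,), unnormalised gating scores (assumed softmax gate)
    expert_outputs: shape (n_experts, dim), each expert's prediction (Gaussian mean)
    target:         shape (dim,), the correct answer
    sigma:          assumed shared isotropic standard deviation
    """
    gate_logits = np.asarray(gate_logits, dtype=float)
    expert_outputs = np.asarray(expert_outputs, dtype=float)
    target = np.asarray(target, dtype=float)

    # Softmax over the gating scores gives the mixing proportions p_i.
    shifted = gate_logits - np.max(gate_logits)
    mixing = np.exp(shifted) / np.sum(np.exp(shifted))

    # Log density of the target under each expert's Gaussian.
    dim = target.shape[0]
    squared_dist = np.sum((expert_outputs - target) ** 2, axis=1)
    log_density = (-0.5 * dim * np.log(2.0 * np.pi * sigma ** 2)
                   - squared_dist / (2.0 * sigma ** 2))

    # E = -log sum_i p_i N(target | y_i, sigma^2 I), via log-sum-exp for stability.
    log_terms = np.log(mixing) + log_density
    peak = np.max(log_terms)
    return -(peak + np.log(np.sum(np.exp(log_terms - peak))))
```

With a single expert this reduces to squared error plus a constant, which is the Gaussian assumption behind squared error. With several experts, the expert closest to the target dominates the sum, so reducing the error rewards specialization rather than averaging.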