CSC2515 Fall 2007
Introduction to Machine Learning

Lecture 7: Decision Trees and Mixtures of Experts

Decision Trees: Non-linear regression or classification with very little computation

A very fast way to decide if a datapoint lies in a region

An axis-aligned decision tree

How do we decide what tests to use?

Two measures of classification “impurity”
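The slide title does not name the two measures. As a minimal sketch, assume they are entropy and the Gini index, two common ways to score how mixed the class labels are at a node. The function names below are illustrative and do not come from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: chance that two labels drawn with replacement disagree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(left, right, measure):
    """Size-weighted impurity of the two children produced by a test."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
```

Under this assumption, a candidate test would be preferred when it gives a lower `split_impurity` than the alternatives; both children must be non-empty for the sketch to work.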

When should we stop adding nodes?

Advantages and disadvantages of decision trees

An alternative to axis-aligned hyper-planes

Computing on which side a training point lies
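A minimal sketch, assuming the non-axis-aligned test is a hyperplane w·x + b = 0, so that the side a point lies on is the sign of a dot product. The names `w`, `b`, and `side_of_hyperplane` are hypothetical, chosen for illustration.

```python
def side_of_hyperplane(x, w, b):
    """Return +1 if w.x + b >= 0 (one side of the hyperplane), else -1."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1
```

An axis-aligned test is the special case in which `w` has a single non-zero entry, so only one input dimension is compared against a threshold.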

A spectrum of models

Multiple local models

Partitioning based on input alone versus partitioning based on input-output relationship

Mixtures of Experts

A picture of why averaging is bad

Making an error function that encourages specialization instead of cooperation

The mixture of experts architecture

The derivatives of the simple cost function
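The cost function itself is not written out in this outline. As a worked sketch, assume the simple cost is the gate-weighted squared error, with target d, expert outputs y_i, and mixing proportions p_i given by a softmax over gating inputs x_i (all of these symbols are assumptions):

```latex
E = \sum_i p_i \,(d - y_i)^2,
\qquad
p_i = \frac{e^{x_i}}{\sum_j e^{x_j}}

\frac{\partial E}{\partial y_i} = -2\, p_i \,(d - y_i)

\frac{\partial E}{\partial x_i}
  = \sum_j \frac{\partial p_j}{\partial x_i}\,(d - y_j)^2
  = p_i \left( (d - y_i)^2 - E \right)
```

Under these assumptions, an expert's error signal is scaled by its mixing proportion, and the gating input for expert i is lowered (under gradient descent) when that expert's squared error is above the weighted average E and raised when it is below.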

Another view of mixtures of experts

Giving a whole distribution as output

The probability distribution that is implicitly assumed when using squared error
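A short worked equation for this point: minimizing squared error is equivalent to maximizing the log probability of the target under a Gaussian centred on the prediction. The symbols d (target), y (prediction), and σ (a fixed standard deviation) are chosen here for illustration.

```latex
p(d \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma}
              \exp\!\left( -\frac{(d - y)^2}{2\sigma^2} \right)

-\log p(d \mid y) = \frac{(d - y)^2}{2\sigma^2} + \log\!\left( \sqrt{2\pi}\,\sigma \right)
```

With σ held fixed, the second term is a constant, so the negative log probability differs from the squared error only by a scale factor and an offset.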

The probability of the correct answer under a mixture of Gaussians

A natural error measure for a Mixture of Experts
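A minimal sketch of one candidate for this error measure, assuming it is the negative log probability of the target under a mixture of one-dimensional Gaussians, each centred on one expert's output and weighted by a softmax gate. The function names and the shared `sigma` parameter are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn gating logits into mixing proportions that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_neg_log_prob(target, expert_outputs, gate_logits, sigma=1.0):
    """-log p(target) under a gate-weighted mixture of 1-D Gaussians."""
    mixing = softmax(gate_logits)
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    likelihood = sum(
        p * norm * math.exp(-((target - y) ** 2) / (2.0 * sigma ** 2))
        for p, y in zip(mixing, expert_outputs)
    )
    return -math.log(likelihood)
```

In this sketch the error drops when the gate puts more weight on whichever expert is already close to the target, which is the behaviour that rewards specialization.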

What are vowels?

A picture of two imaginary vowels and a mixture of two linear experts after learning

Hierarchical mixtures of experts

The generative model for an HMoE

Making predictions once the tree has been learned

Learning a simplified HMoE

Using EM to fit an HMoE

IRLS

Newton updates for a logistic output with cross-entropy error
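A minimal sketch of a Newton update, restricted to a single scalar weight so that the Hessian is one number. It assumes a logistic output y = σ(w x) and the cross-entropy error against binary targets t. The function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradient(w, xs, ts):
    """dE/dw = sum_n (y_n - t_n) x_n for the cross-entropy error."""
    return sum((sigmoid(w * x) - t) * x for x, t in zip(xs, ts))

def newton_step(w, xs, ts):
    """One Newton update w <- w - g / h for a scalar weight."""
    ys = [sigmoid(w * x) for x in xs]
    g = sum((y - t) * x for y, x, t in zip(ys, xs, ts))      # gradient
    h = sum(y * (1.0 - y) * x * x for y, x in zip(ys, xs))   # second derivative
    return w - g / h

# Toy, non-separable data, so the optimum is at a finite weight.
xs = [-2.0, -1.0, 1.0, 2.0, 1.0, -1.0]
ts = [0, 0, 1, 1, 0, 1]
w = 0.0
for _ in range(20):
    w = newton_step(w, xs, ts)
```

With a weight vector, the division by `h` becomes solving a linear system with the Hessian matrix, whose entries are sums of y_n(1 - y_n) weighted outer products of the inputs. That reweighting on every iteration is where the name "iteratively reweighted least squares" comes from.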

Is an HMoE better than a flat MoE?

A different (and better?) type of hierarchy for a mixture of experts

The advantage of “speciation”

Does speciation work better than a standard HMoE?