Boosting constructs an additive model, composed of a weighted sum of simpler functions, or base classifiers. A boosting algorithm is primarily defined by three choices: (1) the form of the base classifiers; (2) their weighting, which can be derived from the overall objective function being minimized; and (3) the weighting of the training examples.
In this project you are expected to try two varieties of base classifiers and two types of weightings, which correspond to different training objectives. Some standard choices for the base classifiers are decision stumps (single-node decision trees), small decision trees, and perceptrons. Notable objective functions include exponential loss (AdaBoost), minimum classification error, and maximum minimum margin (Breiman's "arc-gv"). Different varieties of example re-weighting can also be found in standard boosting methods such as AdaBoost and LogitBoost.
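To make the three choices above concrete, here is a minimal sketch of AdaBoost with decision stumps, written from scratch in NumPy. The function and variable names are my own; the classifier weight alpha and the example re-weighting rule are the standard ones derived from the exponential loss.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Decision stump: +1/-1 by thresholding a single feature."""
    return polarity * np.where(X[:, feature] <= threshold, 1.0, -1.0)

def fit_stump(X, y, w):
    """Exhaustively pick the stump minimizing weighted training error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                err = np.sum(w * (stump_predict(X, f, t, pol) != y))
                if err < best[0]:
                    best = (err, f, t, pol)
    return best  # (weighted error, feature, threshold, polarity)

def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)        # choice (3): example weights, uniform at start
    ensemble = []
    for _ in range(rounds):
        err, f, t, pol = fit_stump(X, y, w)   # choice (1): the base classifier
        err = max(err, 1e-12)                  # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # choice (2): weight from exp-loss
        pred = stump_predict(X, f, t, pol)
        w *= np.exp(-alpha * y * pred)         # upweight mistakes, downweight hits
        w /= w.sum()
        ensemble.append((alpha, f, t, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of the base classifiers."""
    score = sum(a * stump_predict(X, f, t, pol) for a, f, t, pol in ensemble)
    return np.sign(score)
```

Swapping in a different base learner (e.g. a perceptron) or a different rule for alpha and the example weights changes only the corresponding lines, which is the modularity mentioned below.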
You should compare and contrast these choices by plotting the training error, test error, and minimum margin for your base classifiers and objectives, on a couple of datasets.
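For the margin plots, recall that the (normalized) margin of an example is its label times the weighted vote of the base classifiers, divided by the total classifier weight; the minimum margin is the smallest such value over the training set. A small helper, with names of my own choosing:

```python
import numpy as np

def normalized_margins(alphas, base_preds, y):
    """
    Margin of example i:  y_i * sum_t alpha_t * h_t(x_i) / sum_t alpha_t.
    alphas:     (T,) classifier weights
    base_preds: (T, n) array of +1/-1 base-classifier outputs
    y:          (n,) labels in {+1, -1}
    Returns an (n,) array of margins in [-1, 1].
    """
    score = alphas @ base_preds          # weighted vote for each example
    return y * score / np.sum(alphas)

def minimum_margin(alphas, base_preds, y):
    return normalized_margins(alphas, base_preds, y).min()
```

Tracking this quantity after each boosting round gives the minimum-margin curve to plot alongside the error curves.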
A key question that you should address concerns overfitting. There have been a number of attempts to explain boosting's tendency not to overfit the training data. One hypothesis is that the generalization error depends on the minimum margin over the training examples, i.e., the strength of agreement of the base classifiers. Analyze the impact of these choices, in order to draw conclusions about the factors that appear to determine generalization in your experiments. You can also experiment with the robustness of the methods by exploring the effect of corrupting the labels of some of the training examples.
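For the label-corruption experiment, a simple sketch (the function name and seed handling are my own) is to flip a random fraction of the +1/-1 labels before training and then compare the resulting error and margin curves against the clean-label runs:

```python
import numpy as np

def corrupt_labels(y, fraction, seed=0):
    """Flip the +1/-1 labels of a random `fraction` of the examples."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(fraction * len(y)), replace=False)
    y_noisy[idx] = -y_noisy[idx]   # flip the selected labels
    return y_noisy
```

Exponential-loss methods such as AdaBoost tend to concentrate weight on the flipped examples, which is exactly the behavior this experiment is meant to expose.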
Note that this project is ideal for partners, as the work can easily be split due to the algorithm's modularity.
Boosting will be covered in the November 20th lecture. To get a head start, you can read a tutorial by Rob Schapire, or check out his video lecture. For datasets, you can choose one from the UCI repository, such as the ionosphere dataset or ocr49 (found here). Ideally you would also find one of your own; suggestions include email spam filtering and face detection.