Straightforward Project: Automatic Data Cleaning

This project is meant to be a manageable extension of the assignments, which should let you leverage your existing code. Again, the focus is on exploring the strengths and weaknesses of different approaches, and putting an idea in the context of other techniques.

In real machine learning applications, we often spend a long time cleaning the data, looking for mislabeled or otherwise corrupted examples. But the methods we've learned in this class can help with this task!

This project proposes a simple extension of any classifier to allow it to learn from potentially mislabeled data. By using a probabilistic model, we can simultaneously learn to classify data as well as guess which examples are mislabeled.

The proposed model

The standard logistic regression model gives the following probability of a label \(c\) given an example \(x\): \[ p(c | x, w) = \frac{\exp(w_c^T x)}{\sum_{c' = 0}^9 \exp(w_{c'}^T x)} \]
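As a concrete reference, here is a minimal NumPy sketch of this softmax likelihood. The names (`softmax_probs`, a weight matrix `W` of shape 10 × D with one row per class) are illustrative assumptions, not part of the model statement above:

```python
import numpy as np

def softmax_probs(W, x):
    """p(c | x, w) for 10-class logistic regression.

    W : (10, D) weight matrix, one row w_c per class
    x : (D,) feature vector
    returns : (10,) vector of class probabilities
    """
    logits = W @ x              # w_c^T x for each class c
    logits -= logits.max()      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```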

But what if the person building the dataset sometimes made errors when writing down the labels? We could imagine that each image has a true label, \(r\), which is recorded correctly with some high probability \(p(c = r) = \theta\). There is also some chance of mislabeling, perhaps different for each pair of digits. In general, we can write the probability of each recorded label given each true label, \(p(c|r)\), as a \(10 \times 10\) matrix.
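One simple way to parameterize this matrix, sketched below, puts probability \(\theta\) on the diagonal and spreads the remaining mass uniformly over the other nine digits. The uniform off-diagonal choice is only an illustrative assumption; the per-pair version described above would replace it with a full table of learned entries:

```python
def noise_matrix(theta, num_classes=10):
    """Builds p(c | r): entry [c, r] is the probability of
    recording label c when the true label is r.

    Diagonal entries are theta (label written down correctly);
    the remaining 1 - theta is spread uniformly, as an illustration.
    """
    off = (1.0 - theta) / (num_classes - 1)
    M = np.full((num_classes, num_classes), off)
    np.fill_diagonal(M, theta)
    return M    # each column sums to 1
```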

We've introduced a new latent variable for each example, but since it's discrete, we can simply sum it out to get back a predictive probability: \[ p(c | x, w) = \sum_{r = 0}^9 p(c | r) p(r | x, w) \]
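In code, this sum over the ten possible true labels is just a matrix-vector product between the noise matrix and the softmax output. A sketch, reusing the two helpers defined above:

```python
def noisy_label_probs(W, x, theta):
    """p(c | x, w) after summing out the latent true label r."""
    p_r = softmax_probs(W, x)    # p(r | x, w), shape (10,)
    C = noise_matrix(theta)      # p(c | r), shape (10, 10)
    return C @ p_r               # sum_r p(c | r) p(r | x, w)
```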

This new model will take 10 times as long to evaluate, but will be robust to mislabeled examples.
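Because the latent variable is discrete, the same machinery also gives an exact posterior over the true label by Bayes' rule, which is what lets the model guess which examples are mislabeled. A sketch, again assuming the helpers above; the function names and the joint fit of `theta` are illustrative choices:

```python
def true_label_posterior(W, x, c, theta):
    """p(r | c, x, w): posterior over the true label given the
    recorded label c. A low posterior on r = c flags the example
    as possibly mislabeled."""
    joint = noise_matrix(theta)[c] * softmax_probs(W, x)  # p(c | r) p(r | x, w)
    return joint / joint.sum()

def nll(W, x, c, theta):
    """Per-example negative log marginal likelihood; summing this
    over the dataset gives a training objective for W (and theta)."""
    return -np.log(noisy_label_probs(W, x, theta)[c])
```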

Proposed content

To turn this idea into a project, you'll need to:

Proposed extensions

For full marks, you should compare between the simplest possible model and some extension, such as: