Straightforward Project: Automatic Data Cleaning

This project is meant to be a manageable extension of the assignments, which should let you leverage your existing code. Again, the focus is on exploring the strengths and weaknesses of different approaches, and putting an idea in the context of other techniques.

In real machine learning applications, we often spend a long time cleaning the data, looking for mislabeled or otherwise corrupted examples. But the methods we've learned in this class can help with this task!

This project proposes a simple extension of any classifier to allow it to learn from potentially mislabeled data. By using a probabilistic model, we can simultaneously learn to classify data as well as guess which examples are mislabeled.

The proposed model

The standard logistic regression model gives the following probability of a label \(c\) given an example \(x\): \[ p(c | x, w) = \frac{\exp(w_c^T x)}{\sum_{c' = 0}^9 \exp(w_{c'}^T x)} \]
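As a concrete reference, here is a minimal NumPy sketch of this softmax likelihood. The names (`softmax_probs`, a weight matrix `W` of shape 10 × D with one row per class) are illustrative assumptions, not part of the model statement above:

```python
import numpy as np

def softmax_probs(W, x):
    """p(c | x, w) for 10-class logistic regression.

    W : (10, D) weight matrix, one row w_c per class
    x : (D,) feature vector
    returns : (10,) vector of class probabilities
    """
    logits = W @ x              # w_c^T x for each class c
    logits -= logits.max()      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```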

But what if the person building the dataset sometimes made errors when writing down the labels? We could imagine that each image has a true label, \(r\), which is recorded correctly with some high probability \(p(c = r) = \theta\). There is also some chance of mislabeling, perhaps different for each pair of digits. In general, we can write the probability of each recorded label given each true label, \(p(c|r)\), as a \(10 \times 10\) matrix.
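One simple way to parameterize this matrix, sketched below, puts probability \(\theta\) on the diagonal and spreads the remaining mass uniformly over the other nine digits. The uniform off-diagonal choice is only an illustrative assumption; the per-pair version described above would replace it with a full table of learned entries:

```python
def noise_matrix(theta, num_classes=10):
    """Builds p(c | r): entry [c, r] is the probability of
    recording label c when the true label is r.

    Diagonal entries are theta (label written down correctly);
    the remaining 1 - theta is spread uniformly, as an illustration.
    """
    off = (1.0 - theta) / (num_classes - 1)
    M = np.full((num_classes, num_classes), off)
    np.fill_diagonal(M, theta)
    return M    # each column sums to 1
```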

We've introduced a new latent variable for each example, but since it's discrete, we can simply sum it out to get back a predictive probability: \[ p(c | x, w) = \sum_{r = 0}^9 p(c | r) p(r | x, w) \]
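In code, this sum over the ten possible true labels is just a matrix-vector product between the noise matrix and the softmax output. A sketch, reusing the two helpers defined above:

```python
def noisy_label_probs(W, x, theta):
    """p(c | x, w) after summing out the latent true label r."""
    p_r = softmax_probs(W, x)    # p(r | x, w), shape (10,)
    C = noise_matrix(theta)      # p(c | r), shape (10, 10)
    return C @ p_r               # sum_r p(c | r) p(r | x, w)
```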

This new model will take 10 times as long to evaluate, but will be robust to mislabeled examples.
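Because the latent variable is discrete, the same machinery also gives an exact posterior over the true label by Bayes' rule, which is what lets the model guess which examples are mislabeled. A sketch, again assuming the helpers above; the function names and the joint fit of `theta` are illustrative choices:

```python
def true_label_posterior(W, x, c, theta):
    """p(r | c, x, w): posterior over the true label given the
    recorded label c. A low posterior on r = c flags the example
    as possibly mislabeled."""
    joint = noise_matrix(theta)[c] * softmax_probs(W, x)  # p(c | r) p(r | x, w)
    return joint / joint.sum()

def nll(W, x, c, theta):
    """Per-example negative log marginal likelihood; summing this
    over the dataset gives a training objective for W (and theta)."""
    return -np.log(noisy_label_probs(W, x, theta)[c])
```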

Proposed content

To turn this idea into a project, you'll need to:

Proposed extensions

For full marks, you should compare between the simplest possible model and some extension, such as: