SML310 Mini-Project 3: Fake News Detection

Overview

For this project, you will build and analyze several algorithms for determining whether a headline is real or fake news.

Data: Fake and real news headlines

You will be working with a dataset consisting of 1298 “fake news” headlines and 1968 “real” news headlines, where the “fake news” headlines are from https://www.kaggle.com/mrisdal/fake-news/data/ and the “real news” headlines are from https://www.kaggle.com/therohk/million-headlines/. The headlines were cleaned by removing words from fake news titles that are not a part of the headline, removing special characters from the headlines, and restricting real news headlines to those after October 2016 containing the word “trump”.

The data is available at:

Each headline appears as a single line in the data file. Words in the headline are separated by spaces, so just use str.split() in Python to split the headlines into words.
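For concreteness, here is a minimal loading sketch. The file names clean_fake.txt and clean_real.txt are placeholders; substitute whatever names the provided data files actually have.

```python
# Minimal loading sketch; the file names below are placeholders for the
# actual data files.
def load_headlines(path):
    with open(path) as f:
        # one headline per line; words are separated by spaces
        return [line.strip().split() for line in f if line.strip()]

fake_headlines = load_headlines("clean_fake.txt")   # assumed file name
real_headlines = load_headlines("clean_real.txt")   # assumed file name
```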

Part 1: Exploratory analysis (10 pts)

Describe the dataset. In particular, focus on the advantages and the limitations of the dataset if we want to use it as a training set for fake news detection. Include any summary or descriptive statistics that you think would be relevant to a data scientist working on this dataset. Include at least one figure that summarizes the contents (i.e., the text) of the “fake news” headlines and of the “real news” headlines.
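One example of such a figure (a sketch only, assuming the tokenized lists from the loading sketch above) is a side-by-side comparison of the most frequent words in each class:

```python
from collections import Counter
import matplotlib.pyplot as plt

def top_words(headlines, n=20):
    """Most frequent words across a list of tokenized headlines."""
    counts = Counter(w for h in headlines for w in h)
    return counts.most_common(n)

# Side-by-side bar charts of the most frequent words in each class
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (name, headlines) in zip(axes, [("fake", fake_headlines), ("real", real_headlines)]):
    words, freqs = zip(*top_words(headlines))
    ax.bar(words, freqs)
    ax.set_title(f"Most frequent words in {name} headlines")
    ax.tick_params(axis="x", rotation=90)
plt.tight_layout()
plt.show()
```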

Part 2 (15 pts)

Implement the Naive Bayes algorithm for predicting whether a headline is real or fake. Tune the parameters of the prior (called \(m\) and \(\hat{p}\) in the slides) using the validation set (i.e., obtain the values of the parameters that work best on the validation set). Report how you did it, and the result. Report the performance on the training and the test sets that you obtain using the parameters that work best on the validation set. Note that computing products of many small probabilities leads to numerical underflow. Use the fact that
$$a_1 a_2 \cdots a_n = \exp(\log a_1 + \log a_2 + \cdots + \log a_n)$$

to get around this issue.

In your report, explain how you used that fact.
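For illustration only, here is a minimal sketch of how the identity can be used when scoring a headline. The exact smoothing formula for \(m\) and \(\hat{p}\) should follow the convention from the slides; the one below is just a common choice.

```python
import math

def class_log_score(headline_words, count, n_class, p_class, vocab, m, p_hat):
    """log [ P(class) * prod_w P(w present/absent | class) ] for one headline.

    count[w] is the number of training headlines of this class containing w,
    n_class is the number of training headlines of this class, and (m, p_hat)
    are the prior parameters, here used as
    P(w | class) = (count[w] + m * p_hat) / (n_class + m)   # assumed convention
    """
    words = set(headline_words)
    score = math.log(p_class)
    for w in vocab:
        p_w = (count.get(w, 0) + m * p_hat) / (n_class + m)
        # add log-probabilities instead of multiplying many tiny probabilities
        score += math.log(p_w) if w in words else math.log(1.0 - p_w)
    return score

# Predict "real" if the score under the real-class parameters exceeds the
# score under the fake-class parameters (comparing logs is enough).
```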

Part 3 (10 pts)

In this part, you will analyze your Naive Bayes classifier.

List the 10 words whose presence most strongly predicts that the news is real.

List the 10 words whose absence most strongly predicts that the news is real.

List the 10 words whose presence most strongly predicts that the news is fake.

List the 10 words whose absence most strongly predicts that the news is fake.

State how you obtained these lists in terms of the conditional probabilities used in the Naive Bayes algorithm.

Compare the influence of the presence vs. the absence of words on predicting whether a headline is real or fake news: in general, which is more important? Explain what evidence from the data you used to reach this conclusion.
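One reasonable way to interpret “most strongly predicts” (an assumption, not a requirement) is to rank words by the posterior probabilities obtained from the Naive Bayes estimates via Bayes’ rule. In the sketch below, p_w_real and p_w_fake are hypothetical dictionaries holding the smoothed conditionals \(P(w \mid \text{real})\) and \(P(w \mid \text{fake})\) from Part 2.

```python
def p_real_given_presence(word, p_w_real, p_w_fake, p_real, p_fake):
    """P(real | word present), via Bayes' rule on the Naive Bayes estimates."""
    num = p_w_real[word] * p_real
    return num / (num + p_w_fake[word] * p_fake)

def p_real_given_absence(word, p_w_real, p_w_fake, p_real, p_fake):
    """P(real | word absent)."""
    num = (1 - p_w_real[word]) * p_real
    return num / (num + (1 - p_w_fake[word]) * p_fake)

# Rank every word in the vocabulary by these quantities (and by the analogous
# P(fake | ...) = 1 - P(real | ...)) and report the top 10 of each list.
```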

Part 4 (10 pts)

Does the Naive Bayes assumption hold for the fake/real news dataset? Answer the question, and give empirical evidence for the answer.
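One possible empirical check (a sketch only, assuming the tokenized headline lists from Part 1) is to compare, within a class, the joint frequency of a pair of words with the product of their individual frequencies; the word pair in the usage comment is purely illustrative.

```python
def independence_check(headlines, w1, w2):
    """Compare the empirical P(w1, w2 | class) with P(w1 | class) * P(w2 | class)
    within one class; a large gap is evidence against the Naive Bayes assumption."""
    sets = [set(h) for h in headlines]
    n = len(sets)
    p1 = sum(w1 in s for s in sets) / n
    p2 = sum(w2 in s for s in sets) / n
    p12 = sum((w1 in s) and (w2 in s) for s in sets) / n
    return p12, p1 * p2

# e.g. independence_check(fake_headlines, "hillary", "clinton")  # illustrative pair
```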

Part 5 (10 pts)

Naive Bayes is a generative model. Generate “fake” and “real” headlines using the model you built. In your report, explain how you did that.
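As a hint on one possible approach: since the model treats each word’s presence as an independent Bernoulli variable given the class, a headline can be generated by sampling those variables. The dictionary p_word_given_class below is a hypothetical placeholder for the smoothed conditionals from Part 2.

```python
import random

def generate_headline(p_word_given_class):
    """Sample one headline from the Naive Bayes model of one class: each
    vocabulary word is included independently with its estimated probability.
    p_word_given_class maps each word w to the smoothed P(w | class)."""
    words = [w for w, p in p_word_given_class.items() if random.random() < p]
    random.shuffle(words)   # the model carries no information about word order
    return " ".join(words)
```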

Part 6 (10 pts)

Use the dataset of the “real” headlines to build a Naive Bayes model that computes the probability of a new input headline. Make a small test set consisting of headlines collected from the web as well as non-headlines, and test your model by computing the probability of each item. Report your findings. Explain whether the model seems to produce good probability estimates. In your report, include both the code and the formula for computing the probability of new text, and explain what the “probability of a new headline” means. (N.B.: the probability of a headline goes down as the headline gets longer; to make the comparison fair, you could either compute the log-probability per word rather than the raw probability, or select headlines and non-headlines that have the same length on average. You should do one of these.)
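One way to make the per-word normalization concrete (a sketch, not the required approach): under the presence/absence Naive Bayes model built from the “real” headlines, the log-probability of a tokenized text \(h\) over vocabulary \(V\) is
$$\log P(h) = \sum_{w \in V} \big( \mathbb{1}[w \in h] \log p_w + \mathbb{1}[w \notin h] \log(1 - p_w) \big),$$
where \(p_w\) is the smoothed estimate of \(P(w \mid \text{real})\); dividing by the number of words in \(h\) gives a log-probability per word. The dictionary p_word_real below is a hypothetical placeholder for those estimates.

```python
import math

def log_prob_per_word(text_words, p_word_real):
    """Average log-probability per word of a tokenized piece of text under the
    'real headlines' model; p_word_real maps w to the smoothed P(w | real)."""
    words = set(text_words)
    logp = sum(math.log(p) if w in words else math.log(1.0 - p)
               for w, p in p_word_real.items())
    return logp / max(len(text_words), 1)
```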

Part 7 (15 pts)

Train a logistic regression model for classifying the “fake”/“real” headline dataset. For a single headline \(h\), the input to the logistic regression model will be a \(k\)-dimensional vector \(v\), where \(v[j]=1\) if the \(j\)-th keyword appears in the headline \(h\), and \(v[j]=0\) otherwise. The set of keywords consists of all the words that appear in any of the headlines (so \(k\) is the size of the vocabulary). Report the results on the test set. Plot the learning curves.
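For illustration, here is one possible way to build the keyword-indicator features and fit the model. scikit-learn is used as an example library (not a requirement), and fake_headlines and real_headlines refer to the tokenized lists from the earlier loading sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Keyword list: every word that appears in any headline.
vocab = sorted({w for h in fake_headlines + real_headlines for w in h})
index = {w: j for j, w in enumerate(vocab)}

def featurize(headlines):
    """Binary keyword-indicator vectors, one row per headline."""
    X = np.zeros((len(headlines), len(vocab)))
    for i, h in enumerate(headlines):
        for w in set(h):
            X[i, index[w]] = 1.0
    return X

X = featurize(fake_headlines + real_headlines)
y = np.array([0] * len(fake_headlines) + [1] * len(real_headlines))  # 0 = fake, 1 = real

# In practice, split X, y into training/validation/test sets before fitting.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
```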

Part 8 (10 pts)

In this part, repeat the analysis from Part 3, but for the trained logistic regression classifier.
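Continuing the hypothetical Part 7 sketch, one way to carry out this analysis is to rank the learned coefficients (with the 0 = fake, 1 = real encoding used there):

```python
import numpy as np

# Large positive weights push predictions toward "real"; large negative
# weights push them toward "fake". clf and vocab come from the Part 7 sketch.
weights = clf.coef_[0]
order = np.argsort(weights)
top_real_words = [vocab[j] for j in order[-10:][::-1]]
top_fake_words = [vocab[j] for j in order[:10]]
```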

What to submit

Please submit all of your Python code, as well as a report in PDF format. Your report should address every one of the tasks above. Your report should be reproducible: in the report, include the function calls that the TA should make to get the outputs that you are showing in your report. (You do not need to include helper functions in your report.)

Report quality (10 pts)

Your report should be readable. 10 pts will be awarded for very readable and professional reports. 5 pts will be awarded for reports that are readable with some effort.