Assignment 5

Deadline: March 7, 9pm

Late Penalty: See Syllabus

TA: Kingsley Chang

In this assignment, we will build a recurrent neural network to classify an SMS text message as "spam" or "not spam". In the process, you will

  1. Clean and process text data for machine learning.
  2. Understand and implement a character-level recurrent neural network.
  3. Use torchtext to build recurrent neural network models.
  4. Understand batching for a recurrent neural network, and use torchtext to implement RNN batching.

What to submit

Submit a PDF file containing all your code and outputs. Do not submit any other files produced by your code.

Completing this assignment using Jupyter Notebook is recommended (though this will not necessarily be the case for all subsequent assignments). If you are using Jupyter Notebook, you can export a PDF file using the menu option File -> Download As -> PDF via LaTeX (pdf).

In [ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

Part 1. Data Cleaning [12 pt]

We will be using the "SMS Spam Collection Data Set" available at http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Download and unzip the Data Folder, and move the file SMSSpamCollection to your working directory. (Same folder as this notebook)

Part (a) [1 pt]

Open up the file in Python. Print out one example of a spam SMS, and one example of a non-spam SMS.

In [ ]:
# Each line of SMSSpamCollection has the form "<label>\t<message>",
# where <label> is either "ham" (not spam) or "spam".
for line in open('SMSSpamCollection'):
    break

Part (b) [1 pt]

How many spam messages and non-spam messages are there in the data set?
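One possible counting sketch is shown below (it assumes the tab-separated "<label>\t<message>" format of the data file, with labels "ham" and "spam"):

# Sketch: tally the labels by splitting each tab-separated line.
counts = {'ham': 0, 'spam': 0}
for line in open('SMSSpamCollection'):
    label = line.split('\t')[0]
    counts[label] += 1
print(counts)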

Part (c) [2 pt]

We will be using the package torchtext to load, process, and batch the data. A tutorial on torchtext is available below. This tutorial uses the same Sentiment140 data set that we explored during lecture.

https://medium.com/@sonicboom8/sentiment-analysis-torchtext-55fb57b1fab8

One major difference is that we will be building a character-level RNN. That is, we will treat each character as a token in our sequence, rather than each word.

Identify one advantage and one disadvantage of modelling SMS text messages as a sequence of characters rather than a sequence of words.

Part (d) [1 pt]

We will be loading our data set using torchtext.data.TabularDataset. The constructor will read directly from the SMSSpamCollection file.

For the data file to be read successfully, we need to specify the fields (columns) in the file. In our case, the dataset has two fields:

  • a text field containing the SMS messages,
  • a label field which will be converted into a binary label.

Split the dataset into train, valid, and test. Use a 60-20-20 split. You may find this torchtext API page helpful: https://torchtext.readthedocs.io/en/latest/data.html#dataset

In [ ]:
import torchtext

text_field = torchtext.data.Field(sequential=True,      # text sequence
                                  tokenize=lambda x: x, # because we are building a character-level RNN
                                  include_lengths=True, # to track the length of sequences, for batching
                                  batch_first=True,
                                  use_vocab=True)       # to turn each character into an integer index
label_field = torchtext.data.Field(sequential=False,    # not a sequence
                                   use_vocab=False,     # don't need to track vocabulary
                                   is_target=True,      
                                   batch_first=True,
                                   preprocessing=lambda x: int(x == 'spam')) # convert text to 0 and 1

fields = [('label', label_field), ('sms', text_field)]
dataset = torchtext.data.TabularDataset("SMSSpamCollection", # name of the file
                                        "tsv",               # fields are separated by a tab
                                        fields)
# dataset[0].sms
# dataset[0].label

# train, valid, test =  ...
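For reference, here is one possible way to perform the split using Dataset.split() (a sketch only; the exact interpretation of split_ratio may differ between torchtext versions, so check the API page linked above):

# Sketch of a 60-20-20 split. In the legacy torchtext API the split_ratio list
# may be read as (train, test, valid) while the returned tuple is
# (train, valid, test); with both held-out fractions equal to 0.2, the
# distinction does not matter here.
train, valid, test = dataset.split(split_ratio=[0.6, 0.2, 0.2])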

Part (e) [2 pt]

You saw in part (b) that there are many more non-spam messages than spam messages. This imbalance in our training data will be problematic for training.

We can fix this disparity by duplicating the spam messages in the training set, so that the training set is roughly balanced.

Explain why having a balanced training set is helpful for training our neural network.

Note: if you are not sure, try removing the code below and training your model.

In [ ]:
# save the original training examples
old_train_examples = train.examples
# get all the spam messages in `train`
train_spam = []
for item in train.examples:
    if item.label == 1:
        train_spam.append(item)
# duplicate each spam message 6 more times
train.examples = old_train_examples + train_spam * 6

Part (f) [1 pt]

We need to build the vocabulary on the training data by running the code below. This finds all the possible character tokens in the training set.

Explain what the variables text_field.vocab.stoi and text_field.vocab.itos represent.

In [ ]:
text_field.build_vocab(train)
#text_field.vocab.stoi
#text_field.vocab.itos

Part (g) [2 pt]

The tokens <unk> and <pad> were not in our SMS text messages. What do these two values represent?

Part (h) [2 pt]

Since text sequences are of variable length, torchtext provides a BucketIterator data loader, which batches sequences of similar lengths together. The iterator also pads the sequences in each batch automatically.

Take a look at ~10 batches in train_iter. What is the maximum length of the input sequence in each batch? How many <pad> tokens are used in each of the ~10 batches?

In [ ]:
train_iter = torchtext.data.BucketIterator(train,
                                           batch_size=32,
                                           sort_key=lambda x: len(x.sms), # to minimize padding
                                           sort_within_batch=True,        # sort within each batch
                                           repeat=True)                   # repeat the iterator for multiple epochs
In [ ]:
for i, batch in enumerate(train_iter):
    if i >= 10:
        break
    #print(batch.sms)
    #print(batch.label)
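If you are unsure how to inspect the batches, here is a possible sketch (it assumes the vocabulary from Part (f) has been built, and relies on batch_first=True as set in text_field above):

# Sketch: maximum sequence length and number of <pad> tokens per batch.
pad_idx = text_field.vocab.stoi['<pad>']      # integer index of the <pad> token
for i, batch in enumerate(train_iter):
    if i >= 10:
        break
    sms_tensor = batch.sms[0]                 # shape [batch_size, max_len]
    max_len = sms_tensor.shape[1]
    num_pad = int((sms_tensor == pad_idx).sum())
    print("batch", i, "max length:", max_len, "<pad> tokens:", num_pad)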

Part 2. Model Building [10 pt]

Build a recurrent neural network model, using an architecture of your choosing. Use the one-hot embedding of each character as input to your recurrent network. Use one or more fully-connected layers to make the prediction based on your recurrent network output.

Instead of using the RNN output value for the final token, another commonly used strategy is to max-pool over the entire output array. That is, instead of calling something like:

out, _ = self.rnn(x)
self.fc(out[:, -1, :])

where self.rnn is an nn.RNN or nn.LSTM module, and self.fc is a linear layer, we use:

out, _ = self.rnn(x)
self.fc(torch.max(out, dim=1)[0])

This strategy works reasonably well in practice.

In [ ]:
# You might find this code helpful for obtaining
# PyTorch one-hot vectors.

ident = torch.eye(10)
print(ident[0]) # one-hot vector
print(ident[1]) # one-hot vector
x = torch.tensor([[1, 2], [3, 4]])
print(ident[x]) # one-hot vectors
In [ ]:
# This code is here to help you test your model.
# You may need to change this depending on how your forward
# function is set up.

model = YourModelName()
sample_batch = next(iter(train_iter))
sms = sample_batch.sms[0]
length = sample_batch.sms[1]
y = model(sms)
print(y.shape)
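If you are unsure how to put the pieces together, here is one minimal sketch combining the one-hot trick above with the max-pooling strategy described at the start of this part. The class name, the choice of a GRU, and the layer sizes are illustrative assumptions, not the required design:

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=64, num_classes=2):
        super(CharRNN, self).__init__()
        self.ident = torch.eye(vocab_size)   # used to build one-hot inputs
        self.rnn = nn.GRU(vocab_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.ident[x]                    # [batch, seq_len] -> [batch, seq_len, vocab_size]
        out, _ = self.rnn(x)                 # [batch, seq_len, hidden_size]
        out = torch.max(out, dim=1)[0]       # max-pool over the sequence dimension
        return self.fc(out)

# e.g. model = CharRNN(len(text_field.vocab))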

Part 3. Training [15 pt]

Part (a) [8 pt]

Train your model. Plot the training curve of your final model. Your training curve should have the training/validation loss and accuracy plotted periodically. You can use the following code to compute your accuracy.

In [ ]:
def get_accuracy(model, data):
    data_iter = torchtext.data.BucketIterator(data, 
                                              batch_size=64, 
                                              sort_key=lambda x: len(x.sms), 
                                              repeat=False)
    correct, total = 0, 0
    for i, batch in enumerate(data_iter):
        output = model(batch.sms[0]) # You may need to modify this, depending on your model setup
        pred = output.max(1, keepdim=True)[1]
        correct += pred.eq(batch.label.view_as(pred)).sum().item()
        total += batch.sms[1].shape[0]
    return correct / total
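If you need a starting point, here is a minimal training-loop sketch. The optimizer, learning rate, batch size, and number of epochs are illustrative assumptions, and you will likely want to record losses more frequently for your plots:

def train_model(model, train_data, valid_data, num_epochs=5, learning_rate=1e-3):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    train_iter = torchtext.data.BucketIterator(train_data,
                                               batch_size=32,
                                               sort_key=lambda x: len(x.sms),
                                               sort_within_batch=True,
                                               repeat=False)
    train_losses, train_accs, valid_accs = [], [], []
    for epoch in range(num_epochs):
        for batch in train_iter:
            optimizer.zero_grad()
            output = model(batch.sms[0])          # adjust if your forward() differs
            loss = criterion(output, batch.label)
            loss.backward()
            optimizer.step()
        train_losses.append(float(loss))
        train_accs.append(get_accuracy(model, train_data))
        valid_accs.append(get_accuracy(model, valid_data))
        print("Epoch %d: loss %.4f, train acc %.4f, valid acc %.4f" %
              (epoch, train_losses[-1], train_accs[-1], valid_accs[-1]))
    return train_losses, train_accs, valid_accs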

Part (b) [5 pt]

Choose at least 4 hyperparameters to tune. Explain how you tuned the hyperparameters. You don't need to include your training curve for every model you trained. Instead, explain what hyperparameters you tuned, what the validation accuracy was, and the reasoning behind the hyperparameter decisions you made.

For this assignment, you should tune more than just your learning rate and number of epochs. Choose at least 2 hyperparameters that are unrelated to the optimizer.

Part (c) [2 pt]

Report the final test accuracy of your model. You should be able to obtain fairly good accuracy for this model.

Part 4. Baseline Model [3 pt]

Do you think detecting spam is an easy or difficult task? One way to answer this question is to think of a baseline model: a simple model that is easy to build and inexpensive to run, that we can compare our recurrent neural network model against.

Explain how you might build a simple baseline model. This baseline model can be a simple neural network (with very few weights), a hand-written algorithm, or any other strategy that is easy to build and test.

Since machine learning models are expensive to train and deploy, it is very important to compare our models against baseline models.
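For example (purely illustrative; the keyword list is an arbitrary assumption), a hand-written baseline could flag a message as spam whenever it contains one of a few hand-picked keywords:

def baseline_predict(sms_text):
    keywords = ['free', 'win', 'claim', 'prize', 'txt']
    text = sms_text.lower()
    return int(any(k in text for k in keywords))   # 1 = spam, 0 = not spam

# possible evaluation on the test split
acc = sum(baseline_predict(''.join(ex.sms)) == ex.label for ex in test.examples) / len(test.examples)
print(acc)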