GloVe vectors for sentiment analysis

Sentiment Analysis

Sentiment Analysis is the problem of identifying the writer's sentiment given a piece of text. Sentiment Analysis can be applied to movie reviews, other forms of feedback, emails, tweets, course evaluations, and much more.

Rudimentary forms of sentiment analysis might involve scoring each word on a scale from "sad" to "happy", then averaging the "happiness score" of each word in a piece of text. This technique has obvious drawbacks: it won't be able to handle negation, sarcasm, or any complex syntactical form. We can do better. In fact, we will use the sentiment analysis task as an example in the next few lectures.
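
To make the idea concrete, here is a minimal sketch of this word-scoring approach; the word scores below are made up purely for illustration:

# made-up "happiness scores" -- illustration only
happiness = {"great": 1.0, "happy": 0.9, "fine": 0.3, "sad": -0.8, "terrible": -1.0}

def naive_sentiment(text):
    words = text.lower().split()
    scores = [happiness[w] for w in words if w in happiness]
    return sum(scores) / len(scores) if scores else 0.0

naive_sentiment("not a great day")  # returns 1.0 -- "not" is ignored, so the negation is lost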

Today, we'll focus on the problem of classifying tweets as having positive or negative emotions. We use the Sentiment140 data set, which contains tweets with either a positive or negative emoticon. Our goal is to determine which type of emoticon the tweet (with the emoticon removed) originally contained. The dataset was actually collected by a group of students, just like you, who were doing their first machine learning project, just like you will be soon.

You can download the data here: http://help.sentiment140.com/

Let's look at the data:

In [1]:
import csv

def get_data():
    return csv.reader(open("training.1600000.processed.noemoticon.csv", "rt", encoding="latin-1"))

for i, line in enumerate(get_data()):
    if i > 10:
        break
    print(line)
['0', '1467810369', 'Mon Apr 06 22:19:45 PDT 2009', 'NO_QUERY', '_TheSpecialOne_', "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"]
['0', '1467810672', 'Mon Apr 06 22:19:49 PDT 2009', 'NO_QUERY', 'scotthamilton', "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"]
['0', '1467810917', 'Mon Apr 06 22:19:53 PDT 2009', 'NO_QUERY', 'mattycus', '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds']
['0', '1467811184', 'Mon Apr 06 22:19:57 PDT 2009', 'NO_QUERY', 'ElleCTF', 'my whole body feels itchy and like its on fire ']
['0', '1467811193', 'Mon Apr 06 22:19:57 PDT 2009', 'NO_QUERY', 'Karoli', "@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. "]
['0', '1467811372', 'Mon Apr 06 22:20:00 PDT 2009', 'NO_QUERY', 'joy_wolf', '@Kwesidei not the whole crew ']
['0', '1467811592', 'Mon Apr 06 22:20:03 PDT 2009', 'NO_QUERY', 'mybirch', 'Need a hug ']
['0', '1467811594', 'Mon Apr 06 22:20:03 PDT 2009', 'NO_QUERY', 'coZZ', "@LOLTrish hey  long time no see! Yes.. Rains a bit ,only a bit  LOL , I'm fine thanks , how's you ?"]
['0', '1467811795', 'Mon Apr 06 22:20:05 PDT 2009', 'NO_QUERY', '2Hood4Hollywood', "@Tatiana_K nope they didn't have it "]
['0', '1467812025', 'Mon Apr 06 22:20:09 PDT 2009', 'NO_QUERY', 'mimismo', '@twittera que me muera ? ']
['0', '1467812416', 'Mon Apr 06 22:20:16 PDT 2009', 'NO_QUERY', 'erinx3leannexo', "spring break in plain city... it's snowing "]

The columns we care about are the first one and the last one. The first column is the label (the label 0 means "sad" tweet, 4 means "happy" tweet), and the last column contains the tweet. Our task is to predict the sentiment of the tweet given the text.

The approach today is as follows, for each tweet:

  1. We will split the text into words. We will do so by splitting at all whitespace characters. There are better ways to perform the split, but let's keep our dependencies light.
  2. We will look up the GloVe embedding of each word. Words that do not have a GloVe vector will be ignored.
  3. We will sum up all the embeddings to get an embedding for the entire tweet. (A quick sketch of steps 1-3 on a single tweet appears right after this list.)
  4. Finally, we will use a fully-connected neural network (a multi-layer perceptron or MLP) to predict whether the tweet has positive or negative sentiment.
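
For concreteness, here is a quick sketch of steps 1-3 on a single made-up tweet. This is just an illustration (the full pipeline is below); it assumes torchtext can download the 6B/50d GloVe vectors, as in the cells that follow.

import torchtext

# illustration only: embed one made-up tweet by summing its GloVe vectors
glove = torchtext.vocab.GloVe(name="6B", dim=50)        # downloads the vectors on first use

tweet = "long time no see"
words = tweet.lower().split()                           # step 1: split on whitespace
vectors = [glove[w] for w in words if w in glove.stoi]  # step 2: look up embeddings, skip unknown words
tweet_emb = sum(vectors)                                # step 3: one 50-dimensional vector for the tweet
print(tweet_emb.shape)                                  # torch.Size([50])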

First, let's sanity check that there are enough words for us to work with.

In [2]:
import torchtext
glove = torchtext.vocab.GloVe(name="6B", dim=50)

def split_tweet(tweet):
    # separate punctuation marks from the words
    tweet = tweet.replace(".", " . ") \
                 .replace(",", " , ") \
                 .replace(";", " ; ") \
                 .replace("?", " ? ")
    return tweet.lower().split()

for i, line in enumerate(get_data()):
    if i > 30:
        break
    print(sum(int(w in glove.stoi) for w in split_tweet(line[-1])))
21
23
17
10
22
4
3
21
4
3
9
4
19
15
19
18
18
4
9
13
11
23
8
9
4
11
13
6
23
20
13

Looks like each of these tweets has at least one word that has an embedding.

Now, steps 1-3 from above can be done ahead of time, just like the transfer learning portion of Lab 3. So, we will write a function that reads the tweets data file, computes the tweet embeddings, and splits the data into training, validation, and test sets.

We will only use $\frac{1}{59}$ of the data in the file, so that this demo runs relatively quickly.

In [3]:
import torch
import torch.nn as nn

def get_tweet_vectors(glove_vector):
    train, valid, test = [], [], []
    for i, line in enumerate(get_data()):
        tweet = line[-1]
        if i % 59 == 0:
            # obtain an embedding for the entire tweet
            tweet_emb = sum(glove_vector[w] for w in split_tweet(tweet))
            # generate a label: 1 = happy, 0 = sad
            label = torch.tensor(int(line[0] == "4")).long()
            # place the data set in either the training, validation, or test set
            if i % 5 < 3:
                train.append((tweet_emb, label)) # 60% training
            elif i % 5 == 4:
                valid.append((tweet_emb, label)) # 20% validation
            else:
                test.append((tweet_emb, label)) # 20% test
    return train, valid, test

I'm making glove_vector a parameter so that we can test the effect of using a higher-dimensional GloVe embedding later. Now, let's get our training, validation, and test sets. Each set is a list of (tweet embedding, label) pairs, which is the format that torch.utils.data.DataLoader expects.

In [4]:
import torchtext

glove = torchtext.vocab.GloVe(name="6B", dim=50)

train, valid, test = get_tweet_vectors(glove)

train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=128, shuffle=True)
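
Each element of train, valid, and test is a (tweet embedding, label) pair, and DataLoader's default collate function stacks these pairs into batched tensors. A minimal sketch with made-up data:

import torch

# made-up data: four (embedding, label) pairs, batched two at a time
fake_data = [(torch.randn(50), torch.tensor(1).long()) for _ in range(4)]
fake_loader = torch.utils.data.DataLoader(fake_data, batch_size=2)
for embs, labels in fake_loader:
    print(embs.shape, labels.shape)  # torch.Size([2, 50]) torch.Size([2])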

Now, our actual training script! Note that we will use CrossEntropyLoss, have two neurons in our output layer, and use a softmax (applied implicitly by CrossEntropyLoss) instead of a sigmoid activation. This is different from our choice in the earlier weeks! Typically, machine learning practitioners will choose to use two output neurons instead of one, even in a binary classification task. The reason is that the extra neuron adds a few more parameters to the network, which tends to make the network a little easier to train (and perform a bit better).
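
Here is a minimal sketch (with made-up logits) contrasting the two output setups for binary classification:

import torch
import torch.nn as nn

labels = torch.tensor([0, 1, 1, 0])             # class indices

# (a) two output neurons + CrossEntropyLoss (softmax is applied internally)
logits_two = torch.randn(4, 2)
loss_two = nn.CrossEntropyLoss()(logits_two, labels)

# (b) one output neuron + BCEWithLogitsLoss (sigmoid is applied internally)
logit_one = torch.randn(4, 1)
loss_one = nn.BCEWithLogitsLoss()(logit_one, labels.float().unsqueeze(1))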

In [5]:
import matplotlib.pyplot as plt

def train_network(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-5):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    losses, train_acc, valid_acc = [], [], []
    epochs = []
    for epoch in range(num_epochs):
        for tweets, labels in train_loader:
            optimizer.zero_grad()
            pred = model(tweets)
            loss = criterion(pred, labels)
            loss.backward()
            optimizer.step()
        losses.append(float(loss))  # record the loss of the last batch in this epoch
        if epoch % 5 == 4:
            epochs.append(epoch)
            train_acc.append(get_accuracy(model, train_loader))
            valid_acc.append(get_accuracy(model, valid_loader))
            print("Epoch %d; Loss %f; Train Acc %f; Val Acc %f" % (
                epoch+1, loss, train_acc[-1], valid_acc[-1]))

    # plotting
    plt.title("Training Curve")
    plt.plot(losses, label="Train")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()

    plt.title("Training Curve")
    plt.plot(epochs, train_acc, label="Train")
    plt.plot(epochs, valid_acc, label="Validation")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend(loc='best')
    plt.show()

def get_accuracy(model, data_loader):
    correct, total = 0, 0
    for tweets, labels in data_loader:
        output = model(tweets)
        pred = output.max(1, keepdim=True)[1]
        correct += pred.eq(labels.view_as(pred)).sum().item()
        total += labels.shape[0]
    return correct / total

As for the actual model, we will start with a 3-layer neural network. Since this is a fairly straightforward network, we won't create our own class; an nn.Sequential object will do. Let's build and train our network.

In [6]:
mymodel = nn.Sequential(nn.Linear(50, 30),
                        nn.ReLU(),
                        nn.Linear(30, 10),
                        nn.ReLU(),
                        nn.Linear(10, 2))
train_network(mymodel, train_loader, valid_loader, num_epochs=100, learning_rate=1e-4)
print("Final test accuracy:", get_accuracy(mymodel, test_loader))
Epoch 5; Loss 0.554681; Train Acc 0.650974; Val Acc 0.649705
Epoch 10; Loss 0.616420; Train Acc 0.666032; Val Acc 0.669248
Epoch 15; Loss 0.787748; Train Acc 0.673284; Val Acc 0.675516
Epoch 20; Loss 0.700041; Train Acc 0.672792; Val Acc 0.673488
Epoch 25; Loss 0.606776; Train Acc 0.675435; Val Acc 0.675701
Epoch 30; Loss 0.435568; Train Acc 0.675804; Val Acc 0.675516
Epoch 35; Loss 0.778874; Train Acc 0.677586; Val Acc 0.674594
Epoch 40; Loss 0.577560; Train Acc 0.680474; Val Acc 0.677360
Epoch 45; Loss 0.501676; Train Acc 0.681827; Val Acc 0.675885
Epoch 50; Loss 0.488300; Train Acc 0.684039; Val Acc 0.676622
Epoch 55; Loss 0.629617; Train Acc 0.685207; Val Acc 0.677913
Epoch 60; Loss 0.489515; Train Acc 0.686252; Val Acc 0.677913
Epoch 65; Loss 0.528483; Train Acc 0.687665; Val Acc 0.676438
Epoch 70; Loss 0.548636; Train Acc 0.688341; Val Acc 0.679941
Epoch 75; Loss 0.681070; Train Acc 0.690492; Val Acc 0.680310
Epoch 80; Loss 0.533619; Train Acc 0.686620; Val Acc 0.680494
Epoch 85; Loss 0.542890; Train Acc 0.693258; Val Acc 0.675701
Epoch 90; Loss 0.635204; Train Acc 0.688526; Val Acc 0.676622
Epoch 95; Loss 0.590568; Train Acc 0.692951; Val Acc 0.679388
Epoch 100; Loss 0.502876; Train Acc 0.693196; Val Acc 0.678835
Final test accuracy: 0.6814159292035398
In [7]:
def test_model(model, glove_vector, tweet):
    # embed the tweet the same way we embedded the training data
    emb = sum(glove_vector[w] for w in split_tweet(tweet))
    out = model(emb.unsqueeze(0))
    pred = out.max(1, keepdim=True)[1]
    return pred

test_model(mymodel, glove, "very happy")
Out[7]:
tensor([[1]])
In [8]:
test_model(mymodel, glove, "This is a terrible tragedy")
Out[8]:
tensor([[0]])

Note that the model does not perform very well at all, and still misclassifies about one third of the tweets in the test set. Just for fun, we can try a smaller neural network, even one with just a single layer.

In [9]:
mymodel = nn.Linear(50, 2)
train_network(mymodel, train_loader, valid_loader, num_epochs=100, learning_rate=1e-4)
print("Final test accuracy:", get_accuracy(mymodel, test_loader))
Epoch 5; Loss 0.822915; Train Acc 0.576301; Val Acc 0.584255
Epoch 10; Loss 0.547365; Train Acc 0.618770; Val Acc 0.622972
Epoch 15; Loss 0.764643; Train Acc 0.643968; Val Acc 0.644174
Epoch 20; Loss 0.716287; Train Acc 0.655461; Val Acc 0.657448
Epoch 25; Loss 0.479809; Train Acc 0.665540; Val Acc 0.668695
Epoch 30; Loss 0.755674; Train Acc 0.667998; Val Acc 0.672013
Epoch 35; Loss 0.649504; Train Acc 0.669965; Val Acc 0.674226
Epoch 40; Loss 0.611969; Train Acc 0.670764; Val Acc 0.674594
Epoch 45; Loss 0.700614; Train Acc 0.670948; Val Acc 0.675147
Epoch 50; Loss 0.416567; Train Acc 0.672669; Val Acc 0.674041
Epoch 55; Loss 0.563433; Train Acc 0.672854; Val Acc 0.674963
Epoch 60; Loss 0.626608; Train Acc 0.671194; Val Acc 0.675332
Epoch 65; Loss 0.630606; Train Acc 0.673530; Val Acc 0.673673
Epoch 70; Loss 0.494006; Train Acc 0.671194; Val Acc 0.675147
Epoch 75; Loss 0.700075; Train Acc 0.671317; Val Acc 0.674779
Epoch 80; Loss 0.707087; Train Acc 0.671133; Val Acc 0.674963
Epoch 85; Loss 0.421029; Train Acc 0.671563; Val Acc 0.673488
Epoch 90; Loss 0.561721; Train Acc 0.672055; Val Acc 0.675701
Epoch 95; Loss 0.555329; Train Acc 0.671686; Val Acc 0.674226
Epoch 100; Loss 0.589026; Train Acc 0.671440; Val Acc 0.673304
Final test accuracy: 0.6803097345132744

We don't have to stick to a 50-dimensional GloVe embedding. To build a potentially stronger model, we can choose a larger GloVe embedding. However, to get really good accuracy, we are better off using a more powerful architecture.

In [10]:
glove = torchtext.vocab.GloVe(name="6B", dim=100, max_vectors=20000)
train, valid, test = get_tweet_vectors(glove)
train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=128, shuffle=True)

mymodel = nn.Sequential(nn.Linear(100, 45), nn.ReLU(), nn.Linear(45, 2))
train_network(mymodel, train_loader, valid_loader, num_epochs=100, learning_rate=1e-4)
get_accuracy(mymodel, test_loader)
Epoch 5; Loss 0.622846; Train Acc 0.665663; Val Acc 0.669801
Epoch 10; Loss 0.639257; Train Acc 0.678262; Val Acc 0.679388
Epoch 15; Loss 0.742817; Train Acc 0.691353; Val Acc 0.686763
Epoch 20; Loss 0.582444; Train Acc 0.694794; Val Acc 0.690450
Epoch 25; Loss 0.678450; Train Acc 0.700449; Val Acc 0.693768
Epoch 30; Loss 0.559904; Train Acc 0.702968; Val Acc 0.694137
Epoch 35; Loss 0.651240; Train Acc 0.708746; Val Acc 0.693584
Epoch 40; Loss 0.543273; Train Acc 0.709790; Val Acc 0.694690
Epoch 45; Loss 0.519296; Train Acc 0.712310; Val Acc 0.696534
Epoch 50; Loss 0.661541; Train Acc 0.715076; Val Acc 0.696534
Epoch 55; Loss 0.452139; Train Acc 0.717288; Val Acc 0.697271
Epoch 60; Loss 0.616408; Train Acc 0.719071; Val Acc 0.693768
Epoch 65; Loss 0.654103; Train Acc 0.722635; Val Acc 0.695981
Epoch 70; Loss 0.601188; Train Acc 0.724295; Val Acc 0.695243
Epoch 75; Loss 0.422498; Train Acc 0.726569; Val Acc 0.696534
Epoch 80; Loss 0.624082; Train Acc 0.726999; Val Acc 0.698562
Epoch 85; Loss 0.394686; Train Acc 0.726815; Val Acc 0.698746
Epoch 90; Loss 0.430444; Train Acc 0.727614; Val Acc 0.699115
Epoch 95; Loss 0.408311; Train Acc 0.733083; Val Acc 0.695612
Epoch 100; Loss 0.425894; Train Acc 0.732530; Val Acc 0.695059
Out[10]:
0.6911873156342183