Sentiment analysis is the problem of identifying the writer's sentiment given a piece of text. It can be applied to movie reviews, other forms of feedback, emails, tweets, course evaluations, and much more.
Rudimentary forms of sentiment analysis might involve scoring each word on a scale from "sad" to "happy", then averaging the "happiness score" of each word in a piece of text. This technique has obvious drawbacks: it won't be able to handle negation, sarcasm, or any complex syntactical form. We can do better. In fact, we will use the sentiment analysis task as an example in the next few lectures.
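To make the idea concrete, here is a minimal sketch of such a word-scoring approach; the tiny happiness lexicon is made up purely for illustration:

happiness = {"happy": 1.0, "great": 0.8, "okay": 0.0, "sad": -1.0, "terrible": -0.9}

def happiness_score(text):
    # average the happiness scores of the words we know;
    # words outside the lexicon are simply skipped
    words = text.lower().split()
    scores = [happiness[w] for w in words if w in happiness]
    return sum(scores) / len(scores) if scores else 0.0

print(happiness_score("what a great movie"))  # 0.8
print(happiness_score("not a great movie"))   # also 0.8: the negation is lost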
Today, we'll focus on the problem of classifying tweets as having positive or negative emotions. We use the Sentiment140 data set, which contains tweets that originally included either a positive or negative emoticon. Our goal is to determine which type of emoticon each tweet contained, given the tweet with the emoticon removed. The dataset was collected by a group of students working on their first machine learning project, just like you will be soon.
You can download the data here: http://help.sentiment140.com/
Let's look at the data:
import csv

def get_data():
    return csv.reader(open("training.1600000.processed.noemoticon.csv", "rt", encoding="latin-1"))

for i, line in enumerate(get_data()):
    if i > 10:
        break
    print(line)
The columns we care about are the first and the last: the first column is the label (0 means a "sad" tweet, 4 means a "happy" tweet), and the last column contains the text of the tweet. Our task is to predict the sentiment of the tweet given its text.
The approach today is as follows. For each tweet:
1. Split the tweet into words.
2. Look up the GloVe embedding of each word.
3. Sum the embeddings of the words to obtain an embedding for the entire tweet.
4. Use the tweet embedding as input to a fully-connected neural network that classifies the tweet as happy or sad.
First, let's sanity check that each tweet has enough words with GloVe embeddings for us to work with.
import torchtext

glove = torchtext.vocab.GloVe(name="6B", dim=50)

def split_tweet(tweet):
    # separate punctuation from words, so that punctuation marks
    # match the corresponding GloVe tokens
    tweet = tweet.replace(".", " . ") \
                 .replace(",", " , ") \
                 .replace(";", " ; ") \
                 .replace("?", " ? ")
    return tweet.lower().split()

for i, line in enumerate(get_data()):
    if i > 30:
        break
    # count how many words in this tweet have a GloVe embedding
    print(sum(int(w in glove.stoi) for w in split_tweet(line[-1])))
Looks like each tweet has at least one word that has an embedding.
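One caveat: by default, torchtext returns an all-zero vector for a word that is not in the GloVe vocabulary, so out-of-vocabulary words will simply contribute nothing when we sum embeddings below. A quick check (the nonsense token is just an example):

print(glove["xyzzynotaword"].norm())  # tensor(0.) for an out-of-vocabulary word
print(glove["happy"].norm())          # non-zero for a word in the vocabulary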
Now, steps 1-3 from above can be done ahead of time, just like the transfer learning portion of Lab 3. So, we will write a function that takes the tweet data file, computes the tweet embeddings, and splits the data into training, validation, and test sets.
We will only use $\frac{1}{59}$ of the data in the file (roughly 27,000 of the 1.6 million tweets), so that this demo runs relatively quickly.
import torch
import torch.nn as nn

def get_tweet_vectors(glove_vector):
    train, valid, test = [], [], []
    for i, line in enumerate(get_data()):
        tweet = line[-1]
        if i % 59 == 0:
            # obtain an embedding for the entire tweet
            tweet_emb = sum(glove_vector[w] for w in split_tweet(tweet))
            # generate a label: 1 = happy, 0 = sad
            label = torch.tensor(int(line[0] == "4")).long()
            # place the example into the training, validation, or test set
            if i % 5 < 3:
                train.append((tweet_emb, label))  # 60% training
            elif i % 5 == 4:
                valid.append((tweet_emb, label))  # 20% validation
            else:
                test.append((tweet_emb, label))   # 20% test
    return train, valid, test
I'm making glove_vector a parameter so that we can test the effect of using a higher-dimensional GloVe embedding later. Now, let's get our training, validation, and test sets. The format is what torch.utils.data.DataLoader expects.
import torchtext
glove = torchtext.vocab.GloVe(name="6B", dim=50)
train, valid, test = get_tweet_vectors(glove)
train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=128, shuffle=True)
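As a quick sanity check, each element of our dataset is a (tweet_emb, label) pair, so a batch from train_loader is a pair of batched tensors. The shapes below assume the 50-dimensional embedding and a full batch:

tweets, labels = next(iter(train_loader))
print(tweets.shape)  # torch.Size([128, 50]): one embedding per tweet
print(labels.shape)  # torch.Size([128]):     one integer label per tweet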
Now, our actual training script! Note that we will use CrossEntropyLoss, have two neurons in the output layer, and use a softmax instead of a sigmoid activation. This is different from our choice in the earlier weeks! Typically, machine learning practitioners will choose to use two output neurons instead of one, even in a binary classification task. The reason is that the extra neuron adds a few more parameters to the network and tends to make the network a little easier to train, so it often performs better.
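For comparison, here is a sketch of the two alternatives side by side; only choice (a) is used in the rest of this lecture:

# (a) two output neurons + CrossEntropyLoss (used below);
#     CrossEntropyLoss applies a log-softmax internally
two_neuron_head = nn.Linear(50, 2)
criterion_a = nn.CrossEntropyLoss()

# (b) one output neuron + BCEWithLogitsLoss (our choice in earlier weeks);
#     BCEWithLogitsLoss applies a sigmoid internally
one_neuron_head = nn.Linear(50, 1)
criterion_b = nn.BCEWithLogitsLoss()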
import matplotlib.pyplot as plt

def train_network(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-5):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    losses, train_acc, valid_acc = [], [], []
    epochs = []
    for epoch in range(num_epochs):
        for tweets, labels in train_loader:
            optimizer.zero_grad()
            pred = model(tweets)
            loss = criterion(pred, labels)
            loss.backward()
            optimizer.step()
        losses.append(float(loss))  # record the last batch loss of each epoch
        if epoch % 5 == 4:
            epochs.append(epoch)
            train_acc.append(get_accuracy(model, train_loader))
            valid_acc.append(get_accuracy(model, valid_loader))
            print("Epoch %d; Loss %f; Train Acc %f; Val Acc %f" % (
                  epoch+1, loss, train_acc[-1], valid_acc[-1]))
    # plotting
    plt.title("Training Curve")
    plt.plot(losses, label="Train")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()

    plt.title("Training Curve")
    plt.plot(epochs, train_acc, label="Train")
    plt.plot(epochs, valid_acc, label="Validation")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend(loc='best')
    plt.show()

def get_accuracy(model, data_loader):
    correct, total = 0, 0
    for tweets, labels in data_loader:
        output = model(tweets)
        pred = output.max(1, keepdim=True)[1]  # index of the larger logit = predicted class
        correct += pred.eq(labels.view_as(pred)).sum().item()
        total += labels.shape[0]
    return correct / total
As for the actual model, we will start with a 3-layer neural network. We won't create our own class, since this is a fairly straightforward network; an nn.Sequential object will do. Let's build and train our network.
mymodel = nn.Sequential(nn.Linear(50, 30),
                        nn.ReLU(),
                        nn.Linear(30, 10),
                        nn.ReLU(),
                        nn.Linear(10, 2))
train_network(mymodel, train_loader, valid_loader, num_epochs=100, learning_rate=1e-4)
print("Final test accuracy:", get_accuracy(mymodel, test_loader))
def test_model(model, glove_vector, tweet):
    # embed the tweet the same way the training data was embedded
    emb = sum(glove_vector[w] for w in split_tweet(tweet))
    out = model(emb.unsqueeze(0))
    pred = out.max(1, keepdim=True)[1]  # 1 = happy, 0 = sad
    return pred

test_model(mymodel, glove, "very happy")
test_model(mymodel, glove, "This is a terrible tragedy")
Note that the model does not perform very well at all, and still misclassifies about one-third of the tweets in the test set. Just for fun, we can try a smaller neural network, even one with just a single layer.
mymodel = nn.Linear(50, 2)
train_network(mymodel, train_loader, valid_loader, num_epochs=100, learning_rate=1e-4)
print("Final test accuracy:", get_accuracy(mymodel, test_loader))
We don't have to stick to a 50-dimensional GloVe embedding. To build a potentially stronger model, we can choose a larger GloVe embedding. However, to get really good accuracy, we are better off using a more powerful architecture.
glove = torchtext.vocab.GloVe(name="6B", dim=100, max_vectors=20000)
train, valid, test = get_tweet_vectors(glove)
train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=128, shuffle=True)
mymodel = nn.Sequential(nn.Linear(100, 45), nn.ReLU(), nn.Linear(45, 2))
train_network(mymodel, train_loader, valid_loader, num_epochs=100, learning_rate=1e-4)
print("Final test accuracy:", get_accuracy(mymodel, test_loader))