Last time, we discussed how GloVe vectors are trained. Today, we will put them to use for sentiment analysis. We will use a package called torchtext, which works with PyTorch, to explore and use GloVe vectors.
import csv
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchtext
import numpy as np
import matplotlib.pyplot as plt
We will use the 6B version of the GloVe vectors. There are several versions of the embedding available. We will start with the smallest one, the 50-dimensional vectors. Later on, we will use the 100-dimensional word vectors.
# The first time you run this will download a ~823MB file
glove = torchtext.vocab.GloVe(name="6B", dim=50, max_vectors=20000)
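Before using the vectors, it helps to peek at the vocabulary object itself. The short sketch below uses attributes of torchtext's GloVe object to show the index-to-token list, the token-to-index dictionary, and the embedding matrix:
# glove.itos maps an index to its token, glove.stoi maps a token to its index,
# and glove.vectors holds the embedding matrix itself
print(len(glove.itos))      # 20000, since we set max_vectors=20000
print(glove.stoi['car'])    # the integer index of the token "car"
print(glove.vectors.shape)  # torch.Size([20000, 50])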
Let's look at what the embedding of the word "car" looks like:
glove['car']
It is a torch tensor of shape (50,). It is difficult to determine what each number in this embedding means, if anything. However, we know that there is structure in this embedding space: distances in this embedding space are meaningful.
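As a quick check (a small sketch; the nonsense string below is just a made-up out-of-vocabulary example), we can confirm the shape, and note that torchtext returns an all-zero vector for words that are not in the vocabulary:
print(glove['car'].shape)             # torch.Size([50])
print(glove['qwertyzz'].abs().sum())  # tensor(0.) -- out-of-vocabulary words map to zeros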
So, let's compute the Euclidean distance between (the embedding of) the word "car" and several other words. The Euclidean distance (or the L2 norm of the difference vector) is computed as $\sqrt{\sum_i (x_i - y_i)^2}$ for word vectors $x$ and $y$.
word = 'car'
other = ['bike', 'girl', 'computer', 'space', 'was', 'kite']
for w in other:
    dist = torch.norm(glove[word] - glove[w])  # euclidean distance
    print(w, float(dist))
The Euclidean distance is a distance measure, and the smaller the distance, the closer the embeddings are to each other. The word "bike" is closest to the word "car" out of all the words in that list.
Instead of using the Euclidean distance, we can use a different distance measure. For example, we can compute the cosine similarity, a measure of the angle between the two vectors:
for w in other:
dist = torch.nn.functional.cosine_similarity(glove['car'].unsqueeze(0), glove[w].unsqueeze(0))
print(w, float(dist))
The cosine similarity is a similarity measure, and the larger the similarity, the "closer" the word embeddings are to each other. The word "bike" is still closest to the word "car" out of all the words in that list using this measure.
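Pushing this one step further, here is a sketch (using the glove.vectors matrix and glove.itos list shown earlier) that ranks the entire truncated vocabulary by cosine similarity to "car" and prints the closest tokens; the word itself will come out on top:
# compute one cosine-similarity score per word in the vocabulary
sims = torch.nn.functional.cosine_similarity(glove.vectors, glove['car'].unsqueeze(0), dim=1)
values, indices = sims.topk(6)
print([glove.itos[int(i)] for i in indices])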
We can compare other pairs of words in the same way:
word = 'candy'
other = ['chocolate', 'sugar', 'cartoon', 'computer', 'bike', 'girl', 'was', 'car']
for w in other:
dist = torch.norm(glove[word] - glove[w])
cosdist = torch.nn.functional.cosine_similarity(glove[word].unsqueeze(0), glove[w].unsqueeze(0))
print(w, float(dist), float(cosdist), sep="\t")
word = 'happy'
other = ['good', 'sad', 'tired', 'bad', 'unhappy', 'cry']
for w in other:
dist = torch.norm(glove[word] - glove[w])
cosdist = torch.nn.functional.cosine_similarity(glove[word].unsqueeze(0), glove[w].unsqueeze(0))
print(w, float(dist), float(cosdist), sep="\t")
The second example is interesting, because "happy" and "sad" represent human sentiments at opposite ends of the sentiment spectrum. For more examples, see https://lamyiowce.github.io/word2viz/
Sentiment analysis is the problem of identifying the writer's sentiment given a piece of text. It can be applied to movie reviews, other forms of feedback, emails, tweets, and even course evaluations.
Rudimentary forms of sentiment analysis might involve scoring each word on a scale from "sad" to "happy", then averaging the "happiness score" of each word in a piece of text. This technique has obvious drawbacks: it won't be able to handle negation, sarcasm, or any complex syntactical form. We can do better. In fact, we will use the sentiment analysis task as an example in the next few lectures.
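To make the drawback of the rudimentary word-scoring approach concrete, here is a toy sketch; the happiness lexicon below is entirely made up for illustration:
# hypothetical per-word "happiness scores" -- not a real lexicon
happiness = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def naive_sentiment(text):
    words = text.lower().split()
    scores = [happiness.get(w, 0.0) for w in words]  # unknown words contribute 0
    return sum(scores) / len(scores)

print(naive_sentiment("not a good movie"))  # comes out positive, despite the negation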
Specifically, we will use the Sentiment140 dataset. This dataset contains tweets that included either a positive or a negative emoticon. Our goal is to determine which type of emoticon the tweet contained, given the text with the emoticon removed. The dataset was collected by a group of students, just like you, who were doing their first machine learning project.
You can download the data here: http://help.sentiment140.com/for-students
Let's look at the data:
import csv
def get_data():
return csv.reader(open("training.1600000.processed.noemoticon.csv", "rt", encoding="latin-1"))
for i, line in enumerate(get_data()):
if i > 10:
break
print(line)
The columns we care about are the first one and the last one. The first column is the label (the label 0 means a "sad" tweet, 4 means a "happy" tweet), and the last column contains the tweet. Our task is to predict the sentiment of the tweet given the text.
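As a quick sketch of a check on the label column (this pass reads the whole file, so it takes a moment), we expect to see only the labels "0" and "4":
from collections import Counter
label_counts = Counter(line[0] for line in get_data())  # count each value of the first column
print(label_counts)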
The approach today is as follows. For each tweet, we will:
1. Split the tweet into individual words.
2. Look up the GloVe embedding of each word.
3. Sum the word embeddings to obtain a single embedding for the entire tweet.
4. Use that tweet embedding as the input to a fully-connected neural network that predicts the sentiment.
First, let's sanity check that each tweet contains enough words that actually appear in the GloVe vocabulary for us to work with.
def split_tweet(tweet):
    # add spaces around punctuation so that punctuation marks become separate tokens
    tweet = tweet.replace(".", " . ") \
                 .replace(",", " , ") \
                 .replace(";", " ; ") \
                 .replace("?", " ? ")
    return tweet.split()
for i, line in enumerate(get_data()):
if i > 30:
break
print(sum(int(w in glove.stoi) for w in split_tweet(line[-1])))
Looks like each tweet has at least one word that has an embedding.
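To make steps 1-3 from the list above concrete, here is a small sketch that computes the summed GloVe embedding for a single tweet (using glove and split_tweet defined above):
first_tweet = next(get_data())[-1]                           # text of the first tweet in the file
tweet_emb = sum(glove[w] for w in split_tweet(first_tweet))  # sum of the 50-dim word embeddings
print(tweet_emb.shape)                                       # torch.Size([50])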
Now, steps 1-3 from above can be done ahead of time, just like in our transfer learning assignment. So, we will write a function that takes the tweets data file, computes the tweet embeddings, and splits the data into train/validation/test sets. We will only use 1/59 of the data in the file, so that this demo runs relatively quickly.
def get_tweet_vectors(glove_vector):
    train, valid, test = [], [], []
    for i, line in enumerate(get_data()):
        tweet = line[-1]
        if i % 59 == 0:  # keep only every 59th tweet, so the demo runs quickly
            # sum the GloVe embeddings of the words in the tweet
            vector_sum = sum(glove_vector[w] for w in split_tweet(tweet))
            # label 4 means "happy" (class 1); label 0 means "sad" (class 0)
            label = torch.tensor(int(line[0] == "4")).long()
            # split the kept tweets roughly 60/20/20 into train/validation/test
            if i % 5 < 3:
                train.append((vector_sum, label))
            elif i % 5 == 4:
                valid.append((vector_sum, label))
            else:
                test.append((vector_sum, label))
    return train, valid, test
I'm making glove_vector a parameter so that we can use a higher-dimensional embedding later. Now, let's get our training, validation, and test sets. The format is what torch.utils.data.DataLoader expects.
train, valid, test = get_tweet_vectors(glove)
train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=128, shuffle=True)
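As a quick sanity check (a sketch using the loaders we just built), each batch the DataLoader produces is a pair of tensors: the summed GloVe embeddings and the corresponding labels.
tweets, labels = next(iter(train_loader))
print(tweets.shape)  # torch.Size([128, 50]) -- one 50-dim tweet embedding per example
print(labels.shape)  # torch.Size([128])     -- one label (0 or 1) per example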
Now, our actual training script! Note that we will use CrossEntropyLoss, have two neurons in the output layer, and use a softmax instead of a sigmoid activation. This is different from our choice in the earlier weeks! Typically, having two output neurons with a softmax performs better than having only a single neuron. (Note that nn.CrossEntropyLoss applies the softmax internally, so the model itself only needs to output the raw logits.)
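To see the relationship between the two-neuron output, the softmax, and CrossEntropyLoss, here is a minimal sketch with made-up logits for a single tweet:
logits = torch.tensor([[2.0, -1.0]])            # hypothetical model output (raw logits)
label = torch.tensor([0])                       # true class: 0 = "sad"
probs = F.softmax(logits, dim=1)                # softmax turns the logits into probabilities
manual = -torch.log(probs[0, label[0]])         # negative log-probability of the true class
builtin = nn.CrossEntropyLoss()(logits, label)  # CrossEntropyLoss computes the same quantity
print(probs, float(manual), float(builtin))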
def train_network(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-5):
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
losses, train_acc, valid_acc = [], [], []
epochs = []
for epoch in range(num_epochs):
for tweets, labels in train_loader:
optimizer.zero_grad()
pred = model(tweets)
loss = criterion(pred, labels)
loss.backward()
optimizer.step()
losses.append(float(loss))
if epoch % 5 == 4:
epochs.append(epoch)
train_acc.append(get_accuracy(model, train_loader))
valid_acc.append(get_accuracy(model, valid_loader))
print("Epoch %d; Loss %f; Train Acc %f; Val Acc %f" % (
epoch+1, loss, train_acc[-1], valid_acc[-1]))
# plotting
plt.title("Training Curve")
plt.plot(losses, label="Train")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()
plt.title("Training Curve")
plt.plot(epochs, train_acc, label="Train")
plt.plot(epochs, valid_acc, label="Validation")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend(loc='best')
plt.show()
def get_accuracy(model, data_loader):
    correct, total = 0, 0
    for tweets, labels in data_loader:
        output = model(tweets)
        pred = output.max(1, keepdim=True)[1]  # index of the larger logit = predicted class
        correct += pred.eq(labels.view_as(pred)).sum().item()
        total += labels.shape[0]
    return correct / total
As for the actual model, we will start with a 3-layer neural network. Since this is a fairly straightforward network, we won't create our own class; an nn.Sequential object will do. Let's build and train our network.
mymodel = nn.Sequential(nn.Linear(50, 40),
nn.ReLU(),
nn.Linear(40, 20),
nn.ReLU(),
nn.Linear(20, 2))
train_network(mymodel, train_loader, valid_loader, num_epochs=50, learning_rate=1e-4)
get_accuracy(mymodel, test_loader)
We can try a smaller network:
mymodel = nn.Linear(50, 2)
train_network(mymodel, train_loader, valid_loader, num_epochs=50, learning_rate=1e-4)
get_accuracy(mymodel, test_loader)
We can also try using a larger dimensional embedding:
glove = torchtext.vocab.GloVe(name="6B", dim=100, max_vectors=20000)
train, valid, test = get_tweet_vectors(glove)
train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=128, shuffle=True)
mymodel = nn.Sequential(nn.Linear(100, 45), nn.ReLU(), nn.Linear(45, 2))
train_network(mymodel, train_loader, valid_loader, num_epochs=100, learning_rate=1e-4)
get_accuracy(mymodel, test_loader)
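Finally, here is a sketch of how the trained model could be used on a new tweet; the tweet text is made up, and the snippet assumes the 100-dimensional glove, split_tweet, and mymodel from above:
new_tweet = "what a wonderful day"                   # made-up example text
emb = sum(glove[w] for w in split_tweet(new_tweet))  # 100-dim tweet embedding
out = mymodel(emb.unsqueeze(0))                      # add a batch dimension before the forward pass
print(int(out.argmax(dim=1)))                        # 0 = "sad", 1 = "happy"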