GloVe Vectors

Last time, we saw how autoencoders can be used to learn a latent embedding space: an alternative, low-dimensional representation of a set of data with some appealing properties. For example, we saw that interpolating in the latent space is a way of generating new examples. In particular, interpolation in the latent space generates more compelling examples than interpolating in, say, the raw pixel space.

The idea of learning an alternative representation (features, embeddings) of data is a prevalent one in machine learning. You have already seen how we used features computed by AlexNet as a component of a model. Good representations will make downstream tasks (like generating new data, clustering, and computing distances) perform much better.

GloVe embeddings provide a similar kind of pre-trained embedding, but for words. The way that GloVe embeddings are generated is related to what we did in Project 2, but somewhat different. The specifics of the algorithm, loss function, and training are beyond the scope of this course. You should think of GloVe embeddings similarly to pre-trained AlexNet weights. More information about GloVe is available here: https://nlp.stanford.edu/projects/glove/

GloVe Embeddings

Just like for AlexNet, PyTorch makes it easy for us to use pre-trained GloVe embeddings. There are several variations of GloVe embeddings available; they differ in the corpus (data) that the embeddings are trained on, and the size (length) of each word embedding vector.

These embeddings were trained by the authors of GloVe (Pennington et al. 2014), and are also available on the website https://nlp.stanford.edu/projects/glove/

To load pre-trained GloVe embeddings, we'll use a package called torchtext. The package torchtext contains other useful tools for working with text that we will see later in the course. The documentation for torchtext GloVe vectors is available at: https://torchtext.readthedocs.io/en/latest/vocab.html#glove

In [1]:
import torch
import torchtext

# The first time you run this will download a ~823MB file
glove = torchtext.vocab.GloVe(name="6B",  # trained on the 6B-token Wikipedia 2014 + Gigaword 5 corpus
                              dim=100)    # embedding size = 100
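
The name and dim arguments select which pre-trained variant to load. As a minimal sketch (the variable name glove_small is just illustrative), the same 6B-token vectors are also available with 50-, 200-, and 300-dimensional embeddings:

glove_small = torchtext.vocab.GloVe(name="6B",  # same 6B-token corpus
                                    dim=50)     # smaller, 50-dimensional embeddings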

Let's look at what the embedding of the word "cat" looks like:

In [2]:
glove['cat']
Out[2]:
tensor([ 0.2309,  0.2828,  0.6318, -0.5941, -0.5860,  0.6326,  0.2440, -0.1411,
         0.0608, -0.7898, -0.2910,  0.1429,  0.7227,  0.2043,  0.1407,  0.9876,
         0.5253,  0.0975,  0.8822,  0.5122,  0.4020,  0.2117, -0.0131, -0.7162,
         0.5539,  1.1452, -0.8804, -0.5022, -0.2281,  0.0239,  0.1072,  0.0837,
         0.5501,  0.5848,  0.7582,  0.4571, -0.2800,  0.2522,  0.6896, -0.6097,
         0.1958,  0.0442, -0.3114, -0.6883, -0.2272,  0.4618, -0.7716,  0.1021,
         0.5564,  0.0674, -0.5721,  0.2374,  0.4717,  0.8277, -0.2926, -1.3422,
        -0.0993,  0.2814,  0.4160,  0.1058,  0.6220,  0.8950, -0.2345,  0.5135,
         0.9938,  1.1846, -0.1636,  0.2065,  0.7385,  0.2406, -0.9647,  0.1348,
        -0.0072,  0.3302, -0.1236,  0.2719, -0.4095,  0.0219, -0.6069,  0.4076,
         0.1957, -0.4180,  0.1864, -0.0327, -0.7857, -0.1385,  0.0440, -0.0844,
         0.0491,  0.2410,  0.4527, -0.1868,  0.4618,  0.0891, -0.1819, -0.0152,
        -0.7368, -0.1453,  0.1510, -0.7149])

It is a torch tensor with dimension (100,). It is difficult to determine what each number in this embedding means, if anything. However, we know that there is structure in this embedding space. That is, distances in this embedding space are meaningful.
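
As a quick sanity check, we can confirm the size of the embedding tensor; a minimal sketch:

print(glove['cat'].shape)  # torch.Size([100])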

Measuring Distance

To explore the structure of the embedding space, it is necessary to introduce a notion of distance. You are probably already familiar with the Euclidean distance: the Euclidean distance between two vectors $x = [x_1, x_2, ... x_n]$ and $y = [y_1, y_2, ... y_n]$ is the 2-norm of their difference $x - y$, namely $\sqrt{\sum_i (x_i - y_i)^2}$.

The PyTorch function torch.norm computes the 2-norm of a vector for us, so we can compute the Euclidean distance between two vectors like this:

In [3]:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)
Out[3]:
tensor(2.6811)
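
We can sanity-check this against the formula by computing $\sqrt{\sum_i (x_i - y_i)^2}$ directly; a minimal sketch:

torch.sqrt(torch.sum((y - x) ** 2))  # should match torch.norm(y - x) above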

An alternative measure of distance is the Cosine Similarity. The cosine similarity measures the cosine of the angle between two vectors, and has the property that it only depends on the direction of the vectors, not their magnitudes. (We'll use this property next class.)
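
Concretely, the cosine similarity between vectors $x$ and $y$ is the cosine of the angle $\theta$ between them:

$$ \text{cosine\_similarity}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \cos(\theta) $$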

In [4]:
x = torch.tensor([1., 1., 1.]).unsqueeze(0)
y = torch.tensor([2., 2., 2.]).unsqueeze(0)
torch.cosine_similarity(x, y) # should be one
Out[4]:
tensor([1.])

The cosine similarity is a similarity measure rather than a distance measure: the larger the similarity, the "closer" the word embeddings are to each other. (The unsqueeze(0) calls above add a batch dimension, since torch.cosine_similarity computes the similarity along dimension 1 by default.)

In [5]:
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(glove['cat'].unsqueeze(0),
                        glove['dog'].unsqueeze(0))
Out[5]:
tensor([0.8798])
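
We can also compute this value directly from the formula above; a minimal sketch using the 1-dimensional embedding vectors:

x = glove['cat']
y = glove['dog']
torch.dot(x, y) / (torch.norm(x) * torch.norm(y))  # should match the value above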

Word Similarity

Now that we have a notion of distance in our embedding space, we can talk about words that are "close" to each other in the embedding space. For now, let's use Euclidean distances to look at how close various words are to the word "cat".

In [6]:
word = 'cat'
other = ['dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w]) # euclidean distance
    print(w, float(dist))
dog 2.6811304092407227
bike 6.022563934326172
kitten 4.454165935516357
puppy 3.9275598526000977
kite 5.859299659729004
computer 6.960631847381592
neuron 7.568033218383789

In fact, we can look through our entire vocabulary for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word like "cat". (You did this in Project 2!)

Keep in mind that GloVe vectors are trained on word co-occurrences, so words with similar embeddings tend to co-occur with the same other words. For example, "cat" and "dog" tend to appear alongside similar words---even more so than "cat" and "kitten", because "cat" and "kitten" tend to occur in different contexts!

In [7]:
def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)              # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1])  # sort by distance
    for idx, dist in lst[1:n+1]:                                # take the top n, skipping the first result (typically the query word itself)
        print(glove.itos[idx], dist)

print_closest_words(glove["cat"], n=10)
dog 2.6811304
rabbit 3.6489706
cats 3.6892
monkey 3.746932
puppy 3.9275599
pet 3.9499722
dogs 4.0555873
rat 4.131533
mouse 4.1978264
spider 4.26968
In [8]:
print_closest_words(glove['nurse'])
doctor 3.818501
nurses 3.915796
dentist 4.050089
midwife 4.1329455
physician 4.172601
In [9]:
print_closest_words(glove['computer'])
computers 3.1775317
software 3.7114491
technology 4.4087934
hardware 4.5274196
pc 4.581429
In [10]:
print_closest_words(glove['white'])
black 3.1696725
brown 3.606763
gray 3.863705
red 4.141746
green 4.151832

We could also look at which words are closest to the midpoints of two words:

In [11]:
print_closest_words((glove['happy'] + glove['sad']) / 2)
happy 2.2137814
sorry 3.2755055
awful 3.473031
glad 3.502146
remember 3.5351613

Analogies

One surprising aspect of GloVe vectors is that directions in the embedding space can be meaningful. The structure of the GloVe vectors is such that analogy-like relationships like the following tend to hold:

$$ king - man + woman \approx queen $$

In [12]:
print_closest_words(glove['king'] - glove['man'] + glove['woman'])
queen 4.0810785
monarch 4.6429076
throne 4.905501
elizabeth 4.9215584
prince 4.9811473

We get reasonable answers like "queen", "throne", and "elizabeth".

We can likewise flip the analogy around:

In [13]:
print_closest_words(glove['queen'] - glove['woman'] + glove['man'])
king 4.0810785
prince 5.034758
royal 5.0694494
majesty 5.189214
crown 5.2737184

Or, we can try other related analogies along the gender axis:

In [14]:
print_closest_words(glove['king'] - glove['prince'] + glove['princess'])
princess 4.092166
king 4.5588813
elizabeth 5.3060274
sister 5.3555794
monarch 5.445486
In [15]:
print_closest_words(glove['uncle'] - glove['man'] + glove['woman'])
grandmother 3.1846805
niece 3.1966188
uncle 3.3640678
mother 3.4163876
daughter 3.478153
In [16]:
print_closest_words(glove['grandmother'] - glove['mother'] + glove['father'])
uncle 2.5486362
father 2.6408308
grandmother 3.0479467
brother 3.1756728
nephew 3.2189744
In [17]:
print_closest_words(glove['old'] - glove['young'] + glove['father'])
old 5.0858293
grandfather 5.7471604
son 5.7511573
grandmother 5.998024
mother 6.014745

Bias in Word Vectors

Machine learning models have an air of "fairness" about them, since models make decisions without human intervention. However, models can and do learn whatever bias is present in the training data!

GloVe vectors seem innocuous enough: they are just representations of words in some embedding space. Even so, we'll show that the structure of the GloVe vectors encodes the everyday biases present in the texts that they are trained on.

We'll start with an example analogy:

$$ doctor - man + woman \approx ?? $$

Let's use GloVe vectors to find the answer to the above analogy:

In [18]:
print_closest_words(glove['doctor'] - glove['man'] + glove['woman'])
nurse 4.2283154
physician 4.7054324
woman 4.8734255
dentist 4.969891
pregnant 5.0148487

The $doctor - man + woman \approx nurse$ analogy is very concerning. Just to verify, the same pattern does not appear if we flip the gender terms:

In [19]:
print_closest_words(glove['doctor'] - glove['woman'] + glove['man'])
man 4.8998694
dr. 5.0585294
brother 5.144743
physician 5.152549
taken 5.2571893

We see similar types of gender bias with other professions.

In [20]:
print_closest_words(glove['programmer'] - glove['man'] + glove['woman'])
cosmetologist 4.8504696
salesclerk 4.9466257
psychotherapist 5.0096955
adoptee 5.0135107
hairdresser 5.015074

None of these words are even related to programming! In contrast, if we flip the gender terms, we get very different results:

In [21]:
print_closest_words(glove['programmer'] - glove['woman'] + glove['man'])
programmers 5.2138195
setup 5.2186975
mechanic 5.461222
hacker 5.520135
animators 5.539158

Here are the results for "engineer":

In [22]:
print_closest_words(glove['engineer'] - glove['man'] + glove['woman'])
technician 4.691209
educator 5.208781
contractor 5.2372375
surgeon 5.2675548
pioneer 5.2756505
In [23]:
print_closest_words(glove['engineer'] - glove['woman'] + glove['man'])
mechanic 5.4588532
engineers 5.5377874
master 5.6934342
technician 5.819147
architect 5.878545