# Word2Vec and GloVe Vectors

Last time, we saw how autoencoders are used to learn a latent
**embedding space**: an alternative, low-dimensional representation
of a set of data with some appealing properties:
for example, we saw that interpolating in the latent space
is a way of generating new examples. In particular,
interpolation in the latent space generates more compelling
examples than, say, interpolating in the raw pixel space.

The idea of learning an alternative representation/features/*embeddings* of data
is a prevalent one in machine learning. You already saw how we used features
computed by AlexNet as a component of a model. Good representations will
make downstream tasks (like generating new data, clustering, computing distances)
perform much better.

With autoencoders, we were able to learn a representation of MNIST digits.
In lab 4, we use an autoencoder to learn a representation of a census record.
In both cases, we used a model that looks like this:

- **Encoder**: data -> embedding
- **Decoder**: embedding -> data

This type of architecture works well for certain types of data (e.g. images)
that are easy to generate, and whose meaning is encoded in the input data
representation (e.g. the pixels). Such architectures can be, and have been, used to
learn embeddings for things like faces, books, and even molecules!

But what if we want to train an embedding on words? Words are different
from images or even molecules, in that the meaning of a word is not represented
by the letters that make up the word (the same way that the meaning
of an image is represented by the pixels that make up the image). Instead,
the meaning of words comes from how they are used in conjunction with other
words.

## word2vec models

A word2vec model learns embeddings of words using the following architecture:

- **Encoder**: word -> embedding
- **Decoder**: embedding -> nearby words (context)

Specific word2vec models differ in which "nearby words" the decoder
predicts: is it the 3 context words that appeared *before*
the input word? Is it the 3 words that appeared *after*? Or is it a combination
of the two words that appeared before and the two words that appeared after
the input word?
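
To make the choice of "nearby words" concrete, here is a minimal sketch of how (input word, context word) training pairs might be extracted from a sentence. The function name `context_pairs` and its `before`/`after` parameters are illustrative assumptions, not part of any word2vec library.

```python
def context_pairs(tokens, before=2, after=2):
    """Return (input word, context word) pairs for every position in `tokens`.

    `before` and `after` set how many neighbours on each side count as context.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - before)               # clip the window at the start
        hi = min(len(tokens), i + after + 1)  # clip the window at the end
        for j in range(lo, hi):
            if j != i:                        # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
both = context_pairs(sentence, before=1, after=1)         # one word on each side
only_before = context_pairs(sentence, before=2, after=0)  # two preceding words only
print(both[:4])
print(only_before[:4])
```

Changing `before` and `after` changes which pairs the decoder would be asked to predict, which is the kind of difference that separates one word2vec variant from another.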

These models are trained using a large corpus of text: for example the whole
of Wikipedia or a large collection of news articles. We won't train our
own word2vec models in this course, so we won't talk about the many considerations
involved in training a word2vec model.

Instead, we will use a set of pre-trained word embeddings. These are embeddings
that someone else took the time and computational power to train.
One of the most commonly used sets of pre-trained word embeddings is the **GloVe embeddings**.

GloVe is closely related to word2vec: it also learns word embeddings from how words
co-occur in a large corpus, though its training objective differs. Again, the specifics
of the algorithm and its training will be beyond the scope of this course.
You should think of **GloVe embeddings** similarly to pre-trained AlexNet weights.
More information about GloVe is available here: https://nlp.stanford.edu/projects/glove/


Unlike AlexNet, there are several variations of GloVe embeddings. They
differ in the corpus used to train the embedding, and the *size* of the embeddings.

## GloVe Embeddings

To load pre-trained GloVe embeddings, we'll use a package called `torchtext`.
The package `torchtext` contains other useful tools for working with text
that we will see later in the course. The documentation for torchtext
GloVe vectors is available at: https://torchtext.readthedocs.io/en/latest/vocab.html#glove

We'll begin by loading a set of GloVe embeddings. The first time you run the code
below, Python will download a large file (862MB) containing the pre-trained embeddings.

In [1]:
import torch
import torchtext

# The first time you run this, it will download a large file (862MB)
glove = torchtext.vocab.GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5
                              dim=50)   # embedding size = 50

Let's look at what the embedding of the word "cat" looks like:

In [4]:
glove['cat']

tensor([ 0.4528, -0.5011, -0.5371, -0.0157,  0.2219,  0.5460, -0.6730, -0.6891,
         0.6349, -0.1973,  0.3368,  0.7735,  0.9009,  0.3849,  0.3837,  0.2657,
        -0.0806,  0.6109, -1.2894, -0.2231, -0.6158,  0.2170,  0.3561,  0.4450,
         0.6089, -1.1633, -1.1579,  0.3612,  0.1047, -0.7832,  1.4352,  0.1863,
        -0.2611,  0.8328, -0.2312,  0.3248,  0.1449, -0.4455,  0.3350, -0.9595,
        -0.0975,  0.4814, -0.4335,  0.6945,  0.9104, -0.2817,  0.4164, -1.2609,
         0.7128,  0.2378])

It is a torch tensor with shape `(50,)`. It is difficult to determine what each
number in this embedding means, if anything. However, we know that there is structure
in this embedding space. That is, distances in this embedding space are meaningful.

## Measuring Distance

To explore the structure of the embedding space, it is necessary to introduce
a notion of *distance*. You are probably already familiar with the notion
of the **Euclidean distance**. The Euclidean distance of two vectors $x = [x_1, x_2, \ldots, x_n]$ and
$y = [y_1, y_2, \ldots, y_n]$ is just the 2-norm of their difference $x - y$. We can compute
the Euclidean distance between $x$ and $y$:

$$ \|x - y\|_2 = \sqrt{\sum_i (x_i - y_i)^2} $$
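
As a small worked example of the formula, the sketch below computes the distance between two made-up 3-dimensional vectors by hand and checks it against the standard library's `math.dist` (available in Python 3.8 and later). The vectors are arbitrary and chosen only so the arithmetic is easy to follow.

```python
import math

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# The coordinate differences are (-3, -4, 0), so the sum of squares is 9 + 16 + 0 = 25.
by_formula = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
by_stdlib = math.dist(x, y)  # same quantity, computed by the standard library

print(by_formula, by_stdlib)
```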

The PyTorch function `torch.norm` computes the 2-norm of a vector for us, so we 
can compute the Euclidean distance between two vectors like this:

In [5]:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)

tensor(1.8846)

In [6]:
torch.norm(glove['good'] - glove['bad'])

tensor(3.3189)

In [7]:
torch.norm(glove['good'] - glove['water'])

tensor(5.3390)

In [8]:
torch.norm(glove['good'] - glove['well'])

tensor(2.7703)

In [11]:
torch.norm(glove['good'] - glove['perfect'])

tensor(2.8834)

In [12]:
torch.norm(glove['good'] - glove['bravo'])

tensor(6.2940)

An alternative way to compare two vectors is the **cosine similarity**.
The cosine similarity is the cosine of the *angle* between two vectors,
and has the property that it only considers the *direction* of the
vectors, not their magnitudes. (We'll use this property next class.)

In [14]:
x = torch.tensor([1., 1., 1.]).unsqueeze(0)
y = torch.tensor([2., 2., -2.]).unsqueeze(0)
torch.cosine_similarity(x, y) # 2 / (sqrt(3) * sqrt(12)) = 1/3

tensor([0.3333])
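
The value above can be checked by hand with the usual definition $\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}$. The following is a minimal plain-Python sketch (the helper `cosine_similarity` here is our own, not the PyTorch function) that reproduces the result and also illustrates that rescaling a vector leaves the similarity unchanged.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [1.0, 1.0, 1.0]
y = [2.0, 2.0, -2.0]

sim = cosine_similarity(x, y)                       # dot = 2, norms = sqrt(3) and sqrt(12)
scaled = cosine_similarity([10 * a for a in x], y)  # stretch x by a factor of 10

print(round(sim, 4), round(scaled, 4))
```

Both numbers agree because multiplying `x` by a positive constant changes its length but not its direction.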

The cosine similarity is a *similarity* measure rather than a *distance* measure:
The larger the similarity,
the "closer" the word embeddings are to each other.

In [15]:
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0))

tensor([0.9218])

In [16]:
torch.cosine_similarity(glove['good'].unsqueeze(0), 
                        glove['bad'].unsqueeze(0))

tensor([0.7965])

In [17]:
torch.cosine_similarity(glove['good'].unsqueeze(0), 
                        glove['well'].unsqueeze(0))

tensor([0.8511])

In [18]:
torch.cosine_similarity(glove['good'].unsqueeze(0), 
                        glove['perfect'].unsqueeze(0))

tensor([0.8376])

In [53]:
torch.cosine_similarity(glove['good'].unsqueeze(0), 
                        glove['bravo'].unsqueeze(0))

tensor([0.1991])

`torch.cosine_similarity` compares vectors along `dim=1` by default, so each 1-D embedding
is first reshaped with `unsqueeze(0)` into a batch containing a single vector:

In [23]:
x = glove['good']
print(x.shape) # [50]
y = x.unsqueeze(0) # [1, 50]
print(y.shape)

torch.Size([50])
torch.Size([1, 50])


## Word Similarity

Now that we have a notion of distance in our embedding space, we can talk
about words that are "close" to each other in the embedding space.
For now, let's use Euclidean distances to look at how close various words
are to the word "cat".

In [25]:
word = 'cat'
other = ['pet', 'dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w]) # euclidean distance
    print(w, float(dist))

pet 3.039675712585449
dog 1.8846031427383423
bike 5.048375129699707
kitten 3.5068609714508057
puppy 3.0644655227661133
kite 4.210376262664795
computer 6.030652046203613
neuron 6.228669166564941


In fact, we can look through our entire vocabulary for words that are closest
to a point in the embedding space -- for example, we can look for words
that are closest to another word like "cat".

In [26]:
def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)     # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort by distance
    # skip lst[0]: when vec is a word's own embedding, the closest match is that word
    for idx, difference in lst[1:n+1]:                         # take the next n
        print(glove.itos[idx], difference)

print_closest_words(glove["cat"], n=10)

dog 1.8846031
rabbit 2.4572797
monkey 2.8102052
cats 2.8972247
rat 2.9455352
beast 2.9878407
monster 3.0022194
pet 3.0396757
snake 3.0617998
puppy 3.0644655


In [27]:
print_closest_words(glove['nurse'])

doctor 3.1274529
dentist 3.1306612
nurses 3.26872
pediatrician 3.3212206
counselor 3.3987114


In [28]:
print_closest_words(glove['computer'])

computers 2.4362664
software 2.926823
technology 3.190351
electronic 3.5067408
computing 3.5999784


In [29]:
print_closest_words(glove['elizabeth'])

margaret 2.007497
mary 2.270394
anne 2.300691
catherine 2.6155548
katherine 2.722239


In [31]:
print_closest_words(glove['michael'])

peter 2.922138
moore 2.9317658
david 2.9446106
steven 2.9881783
murphy 3.018417


In [32]:
print_closest_words(glove['bravo'])

marlon 3.796622
dwayne 3.8805976
coco 3.9080122
hooper 3.9350462
lara 4.029812


We could also look at which words are closest to the midpoint of two words:

In [33]:
print_closest_words((glove['happy'] + glove['sad']) / 2)

happy 1.9199749
feels 2.3604643
sorry 2.4984782
hardly 2.52593
imagine 2.5652788


In [34]:
print_closest_words((glove['lake'] + glove['building']) / 2)

surrounding 3.0698414
nearby 3.1112068
bridge 3.1585503
along 3.1610188
shore 3.1618817


In [35]:
print_closest_words((glove['bravo'] + glove['michael']) / 2)

farrell 2.8013926
anderson 2.850686
jacobs 2.8537047
boyle 2.8578227
slater 2.865489


In [36]:
print_closest_words((glove['one'] + glove['ten']) / 2)

ten 1.5737569
only 1.8805301
three 2.0309954
five 2.0468998
four 2.1125531


## Analogies

One surprising aspect of GloVe vectors is that the *directions* in the
embedding space can be meaningful. The structure of the GloVe vectors is such
that certain analogy-like relationships like this tend to hold:

$$ king - man + woman \approx queen $$

In [37]:
print_closest_words(glove['king'] - glove['man'] + glove['woman'])

queen 2.8391209
prince 3.6610038
elizabeth 3.7152522
daughter 3.8317878
widow 3.8493774


We get reasonable answers like "queen", "prince", and "elizabeth"
(the name of a well-known queen).
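
To see why vector arithmetic can produce analogies at all, consider a toy 2-D space in which one axis loosely stands for "royalty" and the other for "gender". This is a sketch under strong simplifying assumptions: the vectors are made up (they are not GloVe values), and the helper `analogy` is hypothetical. Unlike `print_closest_words`, it excludes the three input words by name, which is a choice made here for clarity.

```python
import math

# Made-up 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender". NOT GloVe values.
toy = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.5, 0.5],
}

def analogy(a, b, c, emb):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(emb[w], target))

forward = analogy("king", "man", "woman", toy)    # king - man + woman
backward = analogy("queen", "woman", "man", toy)  # queen - woman + man
print(forward, backward)
```

Subtracting "man" and adding "woman" moves a point along the toy gender axis while leaving the royalty axis untouched, which is the geometric picture behind the analogy.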

We can likewise flip the analogy around:

In [38]:
print_closest_words(glove['queen'] - glove['woman'] + glove['man'])

king 2.8391209
prince 3.2508988
crown 3.4485192
knight 3.5587437
coronation 3.6198905


Or, try different but related analogies along the gender axis:

In [39]:
print_closest_words(glove['king'] - glove['prince'] + glove['princess'])

queen 3.1845968
king 3.9103293
bride 4.285721
lady 4.299571
sister 4.421178


In [40]:
print_closest_words(glove['uncle'] - glove['man'] + glove['woman'])

grandmother 2.323353
aunt 2.3527892
granddaughter 2.3615322
daughter 2.4039288
uncle 2.6026237


In [41]:
print_closest_words(glove['grandmother'] - glove['mother'] + glove['father'])

uncle 2.0784423
father 2.0912483
grandson 2.2965577
nephew 2.353551
elder 2.4274695


In [42]:
print_closest_words(glove['old'] - glove['young'] + glove['father'])

father 4.0326614
son 4.4065413
grandfather 4.51851
grandson 4.722089
daughter 4.786716


We can move an embedding towards the direction of "goodness" or "badness":

In [43]:
print_closest_words(glove['good'] - glove['bad'] + glove['programmer'])

versatile 4.3815613
creative 4.569001
entrepreneur 4.6343737
enables 4.717773
intelligent 4.7349977


In [45]:
print_closest_words(glove['bad'] - glove['good'] + glove['programmer'])

hacker 3.8383653
glitch 4.003873
originator 4.041952
hack 4.047719
serial 4.2250676


## Bias in Word Vectors

Machine learning models have an air of "fairness" about them, since models
make decisions without human intervention. However, models can and do learn
whatever bias is present in the training data!

GloVe vectors seem innocuous enough: they are just representations of
words in some embedding space. Even so, we'll show that the structure
of the GloVe vectors encodes the everyday biases present in the texts
that they are trained on.

We'll start with an example analogy:

$$doctor - man + woman \approx ??$$

Let's use GloVe vectors to find the answer to the above analogy:

In [46]:
print_closest_words(glove['doctor'] - glove['man'] + glove['woman'])

nurse 3.1355345
pregnant 3.7805371
child 3.78347
woman 3.8643107
mother 3.922231


The $doctor - man + woman \approx nurse$ analogy is very concerning.
Just to verify, the same result does not appear if we flip the gender terms:

In [47]:
print_closest_words(glove['doctor'] - glove['woman'] + glove['man'])

man 3.9335632
colleague 3.975502
himself 3.9847782
brother 3.9997008
another 4.029071


We see similar types of gender bias with other professions.

In [48]:
print_closest_words(glove['programmer'] - glove['man'] + glove['woman'])

prodigy 3.6688528
psychotherapist 3.8069527
therapist 3.8087194
introduces 3.9064546
swedish-born 4.1178856


Beyond the first result, none of the other words are even related to
programming! In contrast, if we flip the gender terms, we get very
different results:

In [49]:
print_closest_words(glove['programmer'] - glove['woman'] + glove['man'])

setup 4.002241
innovator 4.0661883
programmers 4.1729574
hacker 4.2256656
genius 4.3644104


Here are the results for "engineer":

In [50]:
print_closest_words(glove['engineer'] - glove['man'] + glove['woman'])

technician 3.6926973
mechanic 3.9212747
pioneer 4.1543956
pioneering 4.1880875
educator 4.2264576


In [51]:
print_closest_words(glove['engineer'] - glove['woman'] + glove['man'])

builder 4.3523865
mechanic 4.402976
engineers 4.477985
worked 4.5281315
replacing 4.600204
