Last time, we saw how autoencoders can be used to learn a latent embedding space: an alternative, low-dimensional representation of a set of data with some appealing properties. For example, we saw that interpolating in the latent space is a way of generating new examples, and that interpolation in the latent space produces more compelling examples than, say, interpolating in the raw pixel space.
The idea of learning alternative representations (also called features or embeddings) of data is a prevalent one in machine learning. You already saw how we used features computed by AlexNet as a component of a model. Good representations make downstream tasks (like generating new data, clustering, or computing distances) perform much better.
With autoencoders, we were able to learn a representation of MNIST digits. In lab 4, we use an autoencoder to learn a representation of a census record. In both cases, we used a model that looks like this:
This type of architecture works well for certain types of data (e.g. images) that are easy to generate, and whose meaning is encoded in the input data representation (e.g. the pixels). Such architectures can and have been used to learn embeddings for things like faces, books, and even molecules!
But what if we want to train an embedding on words? Words are different from images or even molecules, in that the meaning of a word is not represented by the letters that make up the word (the same way that the meaning of an image is represented by the pixels that make up the image). Instead, the meaning of a word comes from how it is used in conjunction with other words.
A word2vec model learns embeddings of words using the following architecture:
Specific word2vec models differ in which "nearby words" are predicted by the decoder: is it the 3 context words that appeared before the input word? The 3 words that appeared after? Or a combination of the two words before and the two words after the input word?
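To make this encoder/decoder idea concrete, here is a minimal sketch of a skip-gram-style word2vec model in PyTorch. The vocabulary size and embedding dimension below are placeholder values, and real word2vec training uses additional tricks (such as negative sampling) that we omit here:

import torch
import torch.nn as nn

class SkipGramSketch(nn.Module):
    """Sketch of a word2vec-style model: embed a word, then predict nearby words."""
    def __init__(self, vocab_size=10000, embedding_dim=50):  # placeholder sizes
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)  # "encoder": word index -> embedding
        self.decoder = nn.Linear(embedding_dim, vocab_size)       # "decoder": embedding -> scores over nearby words

    def forward(self, word_idx):
        emb = self.embedding(word_idx)   # shape: (batch_size, embedding_dim)
        return self.decoder(emb)         # shape: (batch_size, vocab_size)

model = SkipGramSketch()
scores = model(torch.tensor([42]))  # scores for which words appear near word #42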
These models are trained using a large corpus of text: for example the whole of Wikipedia or a large collection of news articles. We won't train our own word2vec models in this course, so we won't talk about the many considerations involved in training a word2vec model.
Instead, we will use a set of pre-trained word embeddings: embeddings that someone else took the time and computational power to train. One of the most commonly used sets of pre-trained word embeddings is the GloVe embeddings.
GloVe is a variation of a word2vec model. Again, the specifics of the algorithm and its training are beyond the scope of this course. You should think of GloVe embeddings similarly to pre-trained AlexNet weights. More information about GloVe is available here: https://nlp.stanford.edu/projects/glove/
Unlike AlexNet, there are several variations of GloVe embeddings. They differ in the corpus used to train the embedding, and the size of the embeddings.
To load pre-trained GloVe embeddings, we'll use a package called torchtext. The torchtext package contains other useful tools for working with text that we will see later in the course. The documentation for torchtext's GloVe vectors is available at: https://torchtext.readthedocs.io/en/latest/vocab.html#glove
We'll begin by loading a set of GloVe embeddings. The first time you run the code below, Python will download a large file (862MB) containing the pre-trained embeddings.
import torch
import torchtext
# The first time you run this will download a ~823MB file
glove = torchtext.vocab.GloVe(name="6B",  # trained on Wikipedia 2014 + Gigaword 5 (6B tokens)
                              dim=50)     # embedding size = 50
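Once the download finishes, we can take a quick look at what was loaded: glove.vectors holds one embedding per word in the vocabulary, and glove.itos maps row indices back to word strings (we'll use both again below). The "6B" vectors are also available with dim=100, 200, or 300.

print(glove.vectors.shape)   # (vocab_size, 50): one 50-dimensional vector per word
print(len(glove.itos))       # number of words in the vocabulary
print(glove.itos[:10])       # the first ten words in the vocabulary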
Let's look at what the embedding of the word "cat" looks like:
glove['cat']
It is a torch tensor with dimension (50,). It is difficult to determine what each number in this embedding means, if anything. However, we know that there is structure in this embedding space: distances in this embedding space are meaningful.
To explore the structure of the embedding space, we need to introduce a notion of distance. You are probably already familiar with the Euclidean distance. The Euclidean distance between two vectors $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ is just the 2-norm of their difference $x - y$, that is, $\sqrt{\sum_i (x_i - y_i)^2}$.
The PyTorch function torch.norm computes the 2-norm of a vector for us, so we can compute the Euclidean distance between two vectors like this:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)
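As a sanity check, we can compute the same distance directly from the formula above; it should match the value returned by torch.norm:

torch.sqrt(torch.sum((y - x) ** 2))  # same value as torch.norm(y - x)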
An alternative measure of distance is the cosine similarity. The cosine similarity measures the angle between two vectors, and has the property that it only considers the direction of the vectors, not their magnitudes. (We'll use this property next class.)
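Concretely, the cosine similarity of two vectors $x$ and $y$ is the cosine of the angle between them:

$$ \text{cosine\_similarity}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}} $$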
x = torch.tensor([1., 1., 1.]).unsqueeze(0)
y = torch.tensor([2., 2., 2.]).unsqueeze(0)
torch.cosine_similarity(x, y) # should be one
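For contrast, the Euclidean distance between these two vectors is not zero, even though their cosine similarity is exactly 1: they point in the same direction but have different magnitudes.

torch.norm(y - x)  # sqrt(3), since y - x = [1., 1., 1.]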
The cosine similarity is a similarity measure rather than a distance measure: The larger the similarity, the "closer" the word embeddings are to each other.
x = glove['cat']
y = glove['dog']
torch.cosine_similarity(x.unsqueeze(0),  # cosine_similarity expects batched inputs,
                        y.unsqueeze(0))  # so we add a batch dimension of size 1
Now that we have a notion of distance in our embedding space, we can talk about words that are "close" to each other in the embedding space. For now, let's use Euclidean distances to look at how close various words are to the word "cat".
word = 'cat'
other = ['dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w])  # euclidean distance to "cat"
    print(w, float(dist))
In fact, we can look through our entire vocabulary for words that are closest to a point in the embedding space -- for example, we can look for words that are closest to another word like "cat".
def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)              # compute distances to all words
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1])  # sort by distance
    for idx, difference in lst[1:n+1]:                          # skip the closest match (typically the input word) and take the next n
        print(glove.itos[idx], difference)
print_closest_words(glove["cat"], n=10)
print_closest_words(glove['nurse'])
print_closest_words(glove['computer'])
We could also look at which words are closest to the midpoints of two words:
print_closest_words((glove['happy'] + glove['sad']) / 2)
One surprising aspect of GloVe vectors is that the directions in the embedding space can be meaningful. The structure of the GloVe vectors is such that analogy-like relationships like this tend to hold:
$$ king - man + woman \approx queen $$
print_closest_words(glove['king'] - glove['man'] + glove['woman'])
We get reasonable answers like "queen", "throne" and the name of our current queen.
We can likewise flip the analogy around:
print_closest_words(glove['queen'] - glove['woman'] + glove['man'])
Or try different but related analogies along the gender and age axes:
print_closest_words(glove['king'] - glove['prince'] + glove['princess'])
print_closest_words(glove['uncle'] - glove['man'] + glove['woman'])
print_closest_words(glove['grandmother'] - glove['mother'] + glove['father'])
print_closest_words(glove['old'] - glove['young'] + glove['father'])
Machine learning models have an air of "fairness" about them, since models make decisions without human intervention. However, models can and do learn whatever bias is present in the training data!
GloVe vectors seem innocuous enough: they are just representations of words in some embedding space. Even so, we'll show that the structure of the GloVe vectors encodes the everyday biases present in the texts that they are trained on.
We'll start with an example analogy:
$$ doctor - man + woman \approx ?? $$
Let's use GloVe vectors to find the answer to the above analogy:
print_closest_words(glove['doctor'] - glove['man'] + glove['woman'])
The $doctor - man + woman \approx nurse$ analogy is very concerning. Just to verify, the same result does not appear if we flip the gender terms:
print_closest_words(glove['doctor'] - glove['woman'] + glove['man'])
We see similar types of gender bias with other professions.
print_closest_words(glove['programmer'] - glove['man'] + glove['woman'])
Beyond the first result, none of the other words are even related to programming! In contrast, if we flip the gender terms, we get very different results:
print_closest_words(glove['programmer'] - glove['woman'] + glove['man'])
Here are the results for "engineer":
print_closest_words(glove['engineer'] - glove['man'] + glove['woman'])
print_closest_words(glove['engineer'] - glove['woman'] + glove['man'])