SML310 Mini-Project 2: Vector Embeddings and the Distributional Hypothesis

Overview

One type of question encountered in tests such as the SAT, GRE, and TOEFL is the “Synonym Question,” where students are asked to pick a synonym of a word out of a list of alternatives. For example:

1. vexed (Answer: (a) annoyed)
(a) annoyed
(b) amused
(c) frightened
(d) excited

For this assignment, you will explore using vector embeddings of words to solve synonym questions. You will compute a vector embedding for each word, compute the distances between the embeddings you obtain (using several distance metrics), and then guess the synonym by finding the word whose vector embedding is closest to the vector embedding of the query word.

Word embeddings

Each word will correspond to an embedding. A word embedding is an n-dimensional vector. There are many ways to compute word embeddings. Generally, a good word embedding scheme has the property that words that are similar in meaning have embeddings that are a small distance apart.

The Distributional Hypothesis in linguistics posits that words that appear in similar contexts will have similar meanings. We will use the Distributional Hypothesis to construct our word embeddings.

Given a text with n distinct words denoted by \( (w_1, w_2, \ldots, w_n) \) and a word \(w\), let \(desc_w\) be the semantic descriptor vector of \(w\) computed using the text. \(desc_w\) is an n-dimensional vector whose i-th coordinate is the number of sentences in which both \(w\) and \(w_i\) occur. (For efficiency’s sake, we will not actually store the zeros that correspond to words which don’t co-occur with \(w\); that is, we will use sparse vectors.)

For example, suppose we are given the following text (the opening of Notes from the Underground by Fyodor Dostoyevsky, translated by Constance Garnett):

I am a sick man. I am a spiteful man. I am an unattractive man. I believe my liver is diseased. However, I know nothing at all about my disease, and do not know for certain what ails me.

The word “man” appears only in the first three sentences. Its embedding, represented as a sparse vector, would be:

{"i": 3, "am": 3, "a": 2, "sick": 1, "spiteful": 1, "an": 1, "unattractive": 1}

The word “liver” occurs only in the fourth sentence, so its embedding is:

{"i": 1, "believe": 1, "my": 1, "is": 1, "diseased": 1}

The input

You will use out-of-copyright novels from Project Gutenberg to compute word embeddings. We recommend that you start with War and Peace by Leo Tolstoy and Swann’s Way by Marcel Proust:

http://www.gutenberg.org/cache/epub/7178/pg7178.txt (Swann’s Way)
http://www.gutenberg.org/cache/epub/2600/pg2600.txt (War and Peace)

The file test.txt contains sample synonym questions in the format

query correct_answer option1 option2
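
Assuming each question occupies one line in that format, a simple parser might look like the following sketch (the function name is an illustrative choice, not part of any required interface):

def read_test_questions(filename):
    # Each line: the query word, the correct answer, then the
    # remaining options. The full list of choices consists of every
    # token after the query, including the correct answer.
    questions = []
    with open(filename) as f:
        for line in f:
            tokens = line.split()
            if len(tokens) >= 3:
                questions.append((tokens[0], tokens[1], tokens[1:]))
    return questions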

Tasks

Implementation: Word Embeddings (30%)

Implement a function that computes word embeddings from input text. In your report, indicate how the TA can obtain the word embedding for a word of interest, given text files containing large amounts of text.

Word embeddings: Data Pre-Processing (10%)

When implementing the system to compute word embeddings, you should have made choices about data pre-processing. In your report, state what choices you made, and briefly justify them. (For example, did you convert all the texts to lower-case?)

Word embeddings: Exploratory Analysis (10%)

Select several word embeddings that you obtained, and display them in your report (in whole or in part – think of a good way to display and compare word embeddings). Attempt to find at least one interesting pattern, and attempt to explain the pattern. Include the function calls that the TA should make to reproduce your results.

Implementation: Synonym task (30%)

Implement a function that takes in the texts of several novels and a file such as test.txt, uses the cosine distance metric to guess the synonyms, and returns the performance of the system on test.txt.
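
For reference, the cosine similarity between vectors \(u\) and \(v\) is \( \frac{u \cdot v}{\|u\| \, \|v\|} \), and the cosine distance is one minus that. Here is a sketch of how the similarity might be computed over the sparse dictionary representation (missing keys are implicit zeros; the function name and the handling of missing words are illustrative choices):

import math

def cosine_similarity(vec1, vec2):
    # Only keys present in both vectors contribute to the dot product
    dot = sum(vec1[k] * vec2[k] for k in vec1 if k in vec2)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a design choice for words absent from the corpus
    return dot / (norm1 * norm2)

Guessing a synonym then amounts to picking the choice whose descriptor maximizes this similarity (equivalently, minimizes the cosine distance); how you break ties and handle words absent from the corpus is up to you, but state your choices in the report.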

Exploratory analysis: Synonym task (20%)

Write Python code to compare the performance of the system when it uses the cosine distance, the Euclidean distance, and the Euclidean distance between the normalized versions of the embeddings, for different corpus sizes. In your report, display a graph that summarizes your findings. Include the function calls that the TA should make to reproduce your results.
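
The other two metrics can be computed over the same sparse representation. A sketch, with illustrative names:

import math

def euclidean_distance(vec1, vec2):
    # Keys absent from a vector are treated as zeros
    keys = set(vec1) | set(vec2)
    return math.sqrt(sum((vec1.get(k, 0) - vec2.get(k, 0)) ** 2
                         for k in keys))

def normalized(vec):
    # Scale a sparse vector to unit Euclidean length
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm > 0 else dict(vec)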

What to submit

Please submit all of your Python code, as well as a report in PDF format. Your report should address every one of the tasks above. Your report should be reproducible: in the report, include the function calls that the TA should make to get the outputs that you are showing. (You do not need to include helper functions in your report.)

Hints

A more detailed handout is here, and a substantial amount of code is supplied here. Unless you have little background in programming, I encourage you not to look at those at all and to complete this mini-project from scratch. However, do not hesitate to use the handout code and the detailed instructions if you are stuck.