Recommending Movies with Graphs

About the Mini-Project

This mini-project is to be completed individually. The hope is that every student in the class will gain experience with more advanced PyTorch programming.

You must use PyTorch when developing the code for this mini-project. You may not use packages such as PyTorch Geometric (pyg): the goal is for you to get used to implementing new models in PyTorch.

Plagiarism warning

Your submissions will be checked for plagiarism. You may discuss general issues with your classmates, but you may not share your code, or look at other people's code. Please avoid uploading your code publicly to GitHub (and similar services).

What to submit

You will submit a Jupyter notebook, with all your code. The results you report need to be reproducible – every figure/number you report should be computed by the code in the Jupyter notebook you submit.

The Dataset

You will be working with the ml-latest-small.zip version of the MovieLens dataset. You can load the relevant ratings using

import pandas as pd

ratings = pd.read_csv("http://www.cs.toronto.edu/~guerzhoy/324/movielens/ratings.csv")
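The raw ID values (movieId in particular) are not contiguous, so it is convenient to map them to integer indices before building embeddings. A minimal sketch, assuming the column names user_idx and movie_idx (these names are illustrative, not prescribed):

# Map raw IDs to contiguous integer indices 0..n_users-1 and 0..n_movies-1.
ratings["user_idx"] = ratings["userId"].astype("category").cat.codes
ratings["movie_idx"] = ratings["movieId"].astype("category").cat.codes

n_users = ratings["user_idx"].nunique()
n_movies = ratings["movie_idx"].nunique()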

Training, test, and validation sets

You should split the dataset using timestamps. A common approach is to use the older ratings as the training set and the newer ratings as the validation and test sets. That way, you get a realistic estimate of how accurate your algorithm would be on new data.
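For instance, a minimal sketch of such a split (the 80/10/10 proportions and variable names are illustrative assumptions, not requirements):

# Sort the ratings chronologically, then take the oldest ratings as training data
# and the most recent ratings as validation/test data.
ratings_sorted = ratings.sort_values("timestamp")

n_total = len(ratings_sorted)
train_end = int(0.8 * n_total)   # the 80/10/10 split here is an illustrative choice
valid_end = int(0.9 * n_total)

train_df = ratings_sorted.iloc[:train_end]
valid_df = ratings_sorted.iloc[train_end:valid_end]
test_df = ratings_sorted.iloc[valid_end:]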

The Task

You will implement and train vector embeddings for both the users and the movies. You will then use the embeddings to recommend movies to users, by recommending to user \(u\) the movies \(m\) for which \(z_u\cdot z_m\) is highest.

Directly learning vector embeddings for users and movies

Learn \(d\)-dimensional embeddings \(z_u\) and \(z_m\) by maximizing

\[\sum_{u\in users}\left[200\times\sum_{m\in N(u)}\log\sigma(z_u\cdot z_m) - \sum_{m\not\in N(u)} \log\sigma(z_u\cdot z_m)\right]\]

(We upweight the positive examples to make the training converge faster.)

Here, \(N(u)\) is the set of movies to which the user gave a 5-star rating.

Report Recall@150 for the training and test sets, for the task of predicting which movies a given user will give 5 stars to.
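One possible way to compute Recall@150 is sketched below; it assumes you have split the user and movie embeddings into two matrices, and that A_test[u, m] == 1 exactly when user u gave movie m 5 stars in the test set. The function and variable names are illustrative, not prescribed.

import torch

def recall_at_k(Z_user, Z_movie, A_test, k=150):
    # Z_user: (n_users, d), Z_movie: (n_movies, d), A_test: (n_users, n_movies) 0/1 float tensor
    scores = Z_user @ Z_movie.T                   # z_u . z_m for every user/movie pair
    topk = scores.topk(k, dim=1).indices          # the k highest-scoring movies per user
    hits = torch.gather(A_test, 1, topk).sum(1)   # how many of those k movies are true 5-star movies
    relevant = A_test.sum(1)                      # number of 5-star movies per user in the test set
    mask = relevant > 0                           # skip users with no 5-star ratings in the test set
    return (hits[mask] / relevant[mask]).mean().item()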

Learning embeddings using Node2Vec

Implement a learning procedure for Node2Vec embeddings.
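As a starting point, here is a minimal sketch of the biased second-order random walk at the core of Node2Vec. The adjacency representation (a dict mapping each node to a set of its neighbours, with movie node IDs offset so they do not collide with user IDs), the function name, and the default values of p and q are assumptions for illustration.

import random

def node2vec_walk(adj, start, walk_length, p=1.0, q=1.0):
    # One biased Node2Vec walk on the bipartite user-movie graph.
    # p is the return parameter and q the in-out parameter, as in the Node2Vec paper.
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbours = list(adj[cur])
        if not neighbours:
            break
        if len(walk) == 1:
            walk.append(random.choice(neighbours))
            continue
        prev = walk[-2]
        # Unnormalized transition weights: 1/p to return to the previous node,
        # 1 for neighbours of the previous node, 1/q for nodes farther away.
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)
            elif nxt in adj[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(random.choices(neighbours, weights=weights, k=1)[0])
    return walk

Pairs of nodes that co-occur within a small window on such walks can then serve as positive examples for an objective analogous to the one above.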

Report Recall@150 for the training and test sets, for the task of predicting which movies a given user will give 5 stars to.

Report what you did to improve the performance of Node2Vec over the initial parameters you tried. You only need to demonstrate progress in improving Node2Vec; you do not need to beat the performance of the directly learned embeddings.

Guidelines and hints

Using nn.Embedding

You can define the embeddings like this:

import torch
import torch.nn as nn

n = n_users + n_movies
d = 10
device = 'cuda'
embedding = nn.Embedding(n, d).to(device)

You can then pass embedding.parameters() to an optimizer.
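For example, assuming the embedding defined above (the choice of Adam and the learning rate are placeholders, not requirements):

import torch

optimizer = torch.optim.Adam(embedding.parameters(), lr=0.01)

# A typical training step then looks like:
#   optimizer.zero_grad()
#   loss.backward()    # loss computed from the embeddings, e.g. summed over users
#   optimizer.step()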

Fast cost function computation

When performing the learning, you may use a for-loop over users, but not over movies. To compute the cost function, you must use matrix multiplication, torch.sum, and indexing obtained from the adjacency matrix via np.where/torch.where. Not following this guideline will result in a 20% deduction from your mark.

For example, if you have an adjacency matrix A (where an edge represents a 5-star rating), you can obtain the indices of the movies a user rated with 5 stars using np.where(A[u,:]==1).
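As an illustration only, here is one possible vectorized computation of a single user's term in the objective above (negated so it can be minimized with a standard optimizer). The indexing convention (user u is embedding row u, movie m is embedding row n_users + m) and the variable names are assumptions, not part of the assignment.

import torch
import torch.nn.functional as F

def user_term(u, embedding, A, n_users, device='cuda'):
    # A is an (n_users x n_movies) 0/1 torch tensor on `device`; A[u, m] == 1 iff user u gave movie m 5 stars.
    z_u = embedding(torch.tensor(u, device=device))                  # user u's embedding, shape (d,)
    movie_rows = torch.arange(A.shape[1], device=device) + n_users   # embedding rows of all movies
    z_movies = embedding(movie_rows)                                 # shape (n_movies, d)
    log_sig = F.logsigmoid(z_movies @ z_u)                           # numerically stable log sigma(z_u . z_m)

    pos = torch.where(A[u, :] == 1)[0]   # movies the user rated 5 stars
    neg = torch.where(A[u, :] == 0)[0]   # all other movies
    objective = 200 * torch.sum(log_sig[pos]) - torch.sum(log_sig[neg])
    return -objective                    # negate, since the assignment's objective is maximized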

What to submit

Submit an .ipynb file that contains all the code needed to reproduce all the results in your report.

Marking scheme:

  • Implementation of directly learning embeddings: 30%
  • Implementation of learning embeddings using Node2Vec: 35%
  • Tuning the parameters of Node2Vec: 15%
  • Evaluation using Recall@150: 15%
  • Report readability: 5%