This mini-project is to be completed individually. The hope is that every student in the class will gain experience with more advanced PyTorch programming.
You must use PyTorch when developing the code for this mini-project. You cannot use packages such as pyg – the goal is for you to get used to implementing new models in PyTorch.
Your submissions will be checked for plagiarism. You may discuss general issues with your classmates, but you may not share your code or look at other people's code. Please avoid uploading your code publicly to GitHub (and similar services).
You will submit a Jupyter notebook, with all your code. The results you report need to be reproducible – every figure/number you report should be computed by the code in the Jupyter notebook you submit.
You will be working with the ml-latest-small.zip version of the MovieLens dataset. You can load the relevant dataset using
import pandas as pd
ratings = pd.read_csv("http://www.cs.toronto.edu/~guerzhoy/324/movielens/ratings.csv")
You should split the dataset using timestamps. A common thing to do is to use the older data as the training set and the newer data as the test/validation sets. That way, you get a good estimate for whether your algorithm would be accurate on new data.
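A timestamp-based split can be sketched as below. This is only an illustration on a small synthetic DataFrame with the same columns as ratings.csv (replace it with the real data loaded above); the 80/20 split ratio is an arbitrary choice, not a requirement of the assignment.

```python
import pandas as pd

# Small synthetic stand-in for the MovieLens ratings DataFrame
# (same columns as ratings.csv; replace with the real data).
ratings = pd.DataFrame({
    "userId":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId":   [10, 11, 10, 12, 11, 13, 12, 14, 13, 15],
    "rating":    [5.0, 4.0, 5.0, 3.0, 5.0, 5.0, 2.0, 5.0, 4.0, 5.0],
    "timestamp": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
})

# Older 80% of ratings -> training set, newest 20% -> test set.
cutoff = ratings["timestamp"].quantile(0.8)
train = ratings[ratings["timestamp"] <= cutoff]
test = ratings[ratings["timestamp"] > cutoff]
```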
You will implement and train vector embeddings for both the users and the movies. You will then use the embeddings in order to recommend movies to users, by recommending the movies \(m\) to user \(u\) for which \(z_u\cdot z_m\) is highest.
Learn \(d\)-dimensional embeddings \(z_u\) and \(z_m\) by maximizing
\[\sum_{u\in users}\left[200\times\sum_{m\in N(u)}\log\sigma(z_u\cdot z_m) - \sum_{m\not\in N(u)} \log\sigma(z_u\cdot z_m)\right]\]
(We upweight the positive examples to make the training converge faster.)
Here, \(N(u)\) is the set of movies for which the user gave 5-star ratings.
Report Recall@150
for the training and test sets, for the task of predicting which movies a given user will give 5 stars to.
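Recall@k here means, for each user, the fraction of the movies they actually gave 5 stars to that appear among the k movies with the highest scores, averaged over users. A minimal sketch follows; the function name, signature, and data layout are my own assumptions, not part of the assignment.

```python
import torch

def recall_at_k(scores, positives, k=150):
    """Fraction of a user's positive movies that appear among the
    top-k movies ranked by score, averaged over users.

    scores:    (n_users, n_movies) tensor of z_u . z_m values
    positives: list of 1-D tensors; positives[u] holds the indices
               of the movies user u gave 5 stars to
    """
    topk = scores.topk(k, dim=1).indices  # (n_users, k)
    recalls = []
    for u, pos in enumerate(positives):
        if len(pos) == 0:
            continue  # skip users with no positive movies
        hits = torch.isin(pos, topk[u]).sum().item()
        recalls.append(hits / len(pos))
    return sum(recalls) / len(recalls)
```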
Implement a learning procedure for Node2Vec embeddings.
Report Recall@150
for the training and test sets, for the task of predicting which movies a given user will give 5 stars to.
Report what you did in order to improve the performance of Node2Vec over the initial parameters you tried. You only need to demonstrate progress in improving Node2Vec; you do not need to beat the performance of the embeddings learned directly.
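The sampling step of Node2Vec is a second-order biased random walk controlled by a return parameter p and an in-out parameter q. One possible sketch, assuming the graph is stored as a dict of neighbour sets (the defaults p = q = 1 reduce to a uniform random walk and are just a starting point):

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0):
    """One second-order biased random walk (the node2vec sampling rule).

    adj:  dict mapping each node to the set of its neighbours
    p, q: return and in-out parameters
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(adj[cur])
        if not nbrs:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        # Unnormalized transition weights: 1/p to return to prev,
        # 1 for neighbours of prev, 1/q for nodes farther away.
        weights = [1.0 / p if x == prev
                   else 1.0 if x in adj[prev]
                   else 1.0 / q
                   for x in nbrs]
        walk.append(random.choices(nbrs, weights=weights)[0])
    return walk
```

The walks then play the role of "sentences": node pairs that co-occur within a window are treated as positive examples when training the embeddings.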
You can define the embeddings using nn.Embedding like this:
import torch
import torch.nn as nn
n = n_users + n_movies
d = 10
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedding = nn.Embedding(n, d).to(device)
You can then optimize embedding.parameters().
When performing the learning, you can use a for-loop to iterate over users, but not over movies. You must use matrix multiplication, torch.sum, and indexing into the adjacency matrix using np.where/torch.where in order to compute the cost function. Not following this guideline will result in a 20% deduction from your mark.
For example, if you have an adjacency matrix A (where an edge represents a 5-star rating), you can obtain the indices of the movies a user rated with 5 stars using np.where(A[u,:]==1)[0].
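Putting these pieces together, one user's term of the objective above can be computed without looping over movies. The sketch below uses random sizes and a random adjacency matrix purely for illustration; the layout (users first, then movies, in one embedding table) is an assumption, not a requirement.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes, purely for illustration.
n_users, n_movies, d = 4, 6, 10
embedding = nn.Embedding(n_users + n_movies, d)

# Adjacency: A[u, m] == 1 iff user u gave movie m 5 stars (random here).
A = (torch.rand(n_users, n_movies) > 0.7).float()

u = 0
z_u = embedding.weight[u]            # user u's embedding
Z_m = embedding.weight[n_users:]     # all movie embeddings, (n_movies, d)
scores = Z_m @ z_u                   # z_u . z_m for every movie at once
log_sig = torch.log(torch.sigmoid(scores))

pos = torch.where(A[u] == 1)[0]      # indices of movies in N(u)
neg = torch.where(A[u] == 0)[0]      # indices of movies not in N(u)

# Per-user term of the objective, with positives upweighted by 200.
objective_u = 200 * log_sig[pos].sum() - log_sig[neg].sum()
```

Summing objective_u over users (the allowed for-loop) gives the full objective, which you maximize, e.g. by minimizing its negation with a standard optimizer.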
Submit an ipynb file that contains all the code needed to reproduce all the results in your report.
Marking scheme:
Recall@150: 15%