This mini-project is to be completed individually. The hope is that every student in the class will gain experience with more advanced PyTorch programming.

You must use `PyTorch` when developing the code for this mini-project. You cannot use packages such as `pyg` – the goal is for you to get used to implementing new models in PyTorch.

Your submissions will be checked for plagiarism. You may discuss general issues with your classmates, but you may not share your code or look at other people's code. Please avoid uploading your code publicly to GitHub (or similar services).

You will submit a Jupyter notebook, with all your code. The results you report need to be reproducible – every figure/number you report should be computed by the code in the Jupyter notebook you submit.

You will be working with the `ml-latest-small.zip` version of the MovieLens dataset. You can load the relevant dataset using

`import pandas as pd`

`ratings = pd.read_csv("http://www.cs.toronto.edu/~guerzhoy/324/movielens/ratings.csv")`

You should split the dataset using timestamps. A common thing to do is to use the older data as the training set and the newer data as the test/validation sets. That way, you get a good estimate for whether your algorithm would be accurate on new data.
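A time-based split can be sketched as follows. The 80/10/10 fractions are an assumption, not a requirement, and the tiny DataFrame here is a stand-in; in the assignment you would use the DataFrame loaded from `ratings.csv`.

```python
import pandas as pd

# Stand-in for the real MovieLens ratings (columns match ratings.csv)
ratings = pd.DataFrame({
    "userId":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId":   [10, 20, 10, 30, 20, 40, 30, 10, 40, 20],
    "rating":    [5., 4., 5., 3., 5., 5., 2., 5., 4., 5.],
    "timestamp": [100, 200, 150, 300, 250, 400, 350, 500, 450, 550],
})

# Sort chronologically, then take the oldest data for training and the
# newest data for validation/testing
ratings = ratings.sort_values("timestamp").reset_index(drop=True)
n = len(ratings)
train = ratings.iloc[: int(0.8 * n)]              # oldest 80%: training
val   = ratings.iloc[int(0.8 * n): int(0.9 * n)]  # next 10%: validation
test  = ratings.iloc[int(0.9 * n):]               # newest 10%: test
```

Because the split is chronological, every training rating precedes every test rating, which is what makes the test-set performance a good estimate of performance on future data.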

You will implement and train vector embeddings for both the users and the movies. You will then use the embeddings to recommend movies to users, by recommending to user \(u\) the movies \(m\) for which \(z_u\cdot z_m\) is highest.

Learn \(d\)-dimensional embeddings \(z_u\) and \(z_m\) by maximizing

\[\sum_{u\in users}\left[200\times\sum_{m\in N(u)}\log\sigma(z_u\cdot z_m) - \sum_{m\not\in N(u)} \log\sigma(z_u\cdot z_m)\right]\]

(We upweight the positive examples to make the training converge faster.)

Here, \(N(u)\) is the set of movies for which the user gave 5-star ratings.

Report `Recall@150` for the training and test sets, for the task of predicting which movies a given user will give 5 stars to.
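`Recall@K` can be computed from the matrix of scores \(z_u\cdot z_m\). A minimal sketch (with K = 2 on a toy example; the assignment uses K = 150) — `scores` holds the dot product for every user/movie pair, and `relevant` marks the movies each user actually rated 5 stars in the split being evaluated:

```python
import torch

def recall_at_k(scores, relevant, k):
    # Top-k movie indices per user, ranked by predicted score
    topk = scores.topk(k, dim=1).indices
    # How many of the top-k are actually relevant, per user
    hits = relevant.gather(1, topk).sum(dim=1).float()
    # Divide by each user's number of relevant movies (avoid division by zero)
    denom = relevant.sum(dim=1).clamp(min=1).float()
    return (hits / denom).mean().item()

# Toy check: user 0's relevant movies are ranked first, user 1's are not
scores = torch.tensor([[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]])
relevant = torch.tensor([[True, True, False], [True, True, False]])
print(recall_at_k(scores, relevant, k=2))  # → 0.75
```

For the test set, you may want to exclude movies the user already rated in the training set before taking the top 150; this sketch skips that step.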

Implement a learning procedure for Node2Vec embeddings.

Report `Recall@150` for the training and test sets, for the task of predicting which movies a given user will give 5 stars to.

Report what you did in order to improve the performance of Node2Vec over the initial parameters that you tried. You only need to demonstrate progress in improving Node2Vec; you do not need to beat the performance of the embeddings learned directly.
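One piece of Node2Vec is generating biased random walks, controlled by the return parameter \(p\) and the in-out parameter \(q\). A sketch of a single walk, under the assumption that the graph is stored as an adjacency-list dict (in the assignment, the graph is bipartite: users connected to the movies they rated 5 stars):

```python
import random

def node2vec_walk(graph, start, length, p=1.0, q=1.0):
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = graph[cur]
        if not nbrs:
            break
        if len(walk) == 1:
            # First step: no previous node, so sample uniformly
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        # Unnormalized transition weights from the Node2Vec paper:
        # 1/p to return to prev, 1 for neighbors of prev, 1/q otherwise
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)
            elif x in graph[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(random.choices(nbrs, weights=weights)[0])
    return walk

# Toy graph (adjacency lists); in the assignment, build this from ratings
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = node2vec_walk(graph, start=0, length=5, p=0.5, q=2.0)
```

The walks then play the role of "sentences": node pairs that co-occur within a window are treated as positive examples when training the embeddings.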

You can define the embeddings using `nn.Embedding`, like this:

`n = n_users + n_movies`

`d = 10`

`device = 'cuda'`

`embedding = nn.Embedding(n, d).to(device)`

You can then optimize `embedding.parameters()`.

When performing the learning, you can use a for-loop over users, but not over movies. You must compute the cost function using matrix multiplication, `torch.sum`, and information obtained from the adjacency matrix using `np.where`/`torch.where`. Not following this guideline will result in a deduction of 20% from your mark.

For example, if you have an adjacency matrix `A` (where an edge represents a 5-star rating), you can obtain the indices of the movies a user rated with 5 stars using `np.where(A[u,:]==1)`.
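Putting these pieces together, one user's contribution to the objective can be computed with a single matrix multiplication and no loop over movies. A sketch with made-up sizes: `A_u` is user \(u\)'s 0/1 row of the adjacency matrix, `z_u` is that user's embedding, and `Z_m` holds all movie embeddings; the weight of 200 on positive examples follows the objective above.

```python
import torch

torch.manual_seed(0)
n_movies, d = 6, 4
A_u = torch.tensor([1., 0., 1., 0., 0., 1.])   # movies user u rated 5 stars
z_u = torch.randn(d)                           # user u's embedding
Z_m = torch.randn(n_movies, d)                 # all movie embeddings

# z_u . z_m for every movie at once (one matrix-vector multiplication)
scores = Z_m @ z_u

# Indices of positive (5-star) and negative movies via torch.where
pos = torch.where(A_u == 1)[0]
neg = torch.where(A_u == 0)[0]

# User u's term of the objective to be maximized
objective_u = 200 * torch.sum(torch.log(torch.sigmoid(scores[pos]))) \
              - torch.sum(torch.log(torch.sigmoid(scores[neg])))
```

Since PyTorch optimizers minimize, you would pass the negation of the summed objective to `backward()`.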

Submit an `ipynb` file that contains all the code needed to reproduce all the results in your report.

Marking scheme:

- Implementation of directly learning embeddings: 30%
- Implementation of learning embeddings using Node2Vec: 35%
- Tuning the parameters of Node2Vec: 15%
- Evaluation using `Recall@150`: 15%
- Report readability: 5%