{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CSC321H5 Project 1. Music Millenium Classification\n",
"\n",
"**Deadline**: Thursday, Jan. 30, by 9pm\n",
"\n",
"**Submission**: Submit a PDF export of the completed notebook. \n",
"\n",
"**Late Submission**: Please see the syllabus for the late submission criteria.\n",
"\n",
"To celebrate the start of a new decade, we will build models to predict which\n",
"**century** a piece of music was released. We will be using the \"YearPredictionMSD Data Set\"\n",
"based on the Million Song Dataset. The data is available to download from the UCI \n",
"Machine Learning Repository. Here are some links about the data:\n",
"\n",
"- https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd\n",
"- http://millionsongdataset.com/pages/tasks-demos/#yearrecognition\n",
"\n",
"## Question 1. Data\n",
"\n",
"Start by setting up a Google Colab notebook in which to do your work.\n",
"If you are working with a partner, you might find this link helpful:\n",
"\n",
"- https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb\n",
"\n",
"The recommended way to work together is pair coding, where you and your partner are sitting together and writing code together."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that your notebook is set up, we can load the data into the notebook. The code below provides\n",
"two ways of loading the data: directly from the internet, or through mounting Google Drive.\n",
"The first method is easier but slower, and the second method is a bit involved at first, but\n",
"can save you time later on. You will need to mount Google Drive for later assignments, so we recommend\n",
"figuring how to do that now.\n",
"\n",
"Here are some resources to help you get started:\n",
"\n",
"- http.://colab.research.google.com/notebooks/io.ipynb"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"load_from_drive = False\n",
"\n",
"if not load_from_drive:\n",
" csv_path = \"http://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip\"\n",
"else:\n",
" from google.colab import drive\n",
" drive.mount('/content/gdrive')\n",
" csv_path = '/content/drive/My Drive/YearPredictionMSD.txt.zip' # TODO - UPDATE ME!\n",
"\n",
"t_label = [\"year\"]\n",
"x_labels = [\"var%d\" % i for i in range(1, 91)]\n",
"df = pandas.read_csv(csv_path, names=t_label + x_labels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that the data is loaded to your Colab notebook, you should be able to display the Pandas\n",
"DataFrame `df` as a table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To set up our data for classification, we'll use the \"year\" field to represent\n",
"whether a song was released in the 20-th century. In our case `df[\"year\"]` will be 1 if\n",
"the year was released after 2000, and 0 otherwise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[\"year\"] = df[\"year\"].map(lambda x: int(x > 2000))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a) -- 2 pts\n",
"\n",
"The data set description text asks us to respect the below train/test split to\n",
"avoid the \"producer effect\". That is, we want to make sure that no song from a single artist\n",
"ends up in both the training and test set.\n",
"\n",
"Explain why it would be problematic to have\n",
"some songs from an artist in the training set, and other songs from the same artist in the\n",
"test set. (Hint: Remember that we want our test accuracy to predict how well the model\n",
"will perform in practice on a song it hasn't learned about.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_train = df[:463715]\n",
"df_test = df[463715:]\n",
"\n",
"# conver to numpy\n",
"train_xs = df_train[x_labels].to_numpy()\n",
"train_ts = df_train[t_label].to_numpy()\n",
"test_xs = df_test[x_labels].to_numpy()\n",
"test_ts = df_test[t_label].to_numpy()\n",
"\n",
"# Write your explanation here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b) -- 1 pts\n",
"\n",
"It can be beneficial to **normalize** the columns, so that each column (feature)\n",
"has the *same* mean and standard deviation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"feature_means = df_train.mean()[1:].to_numpy() # the [1:] removes the mean of the \"year\" field\n",
"feature_stds = df_train.std()[1:].to_numpy()\n",
"\n",
"train_norm_xs = (train_xs - feature_means) / feature_stds\n",
"test_norm_xs = (test_xs - feature_means) / feature_stds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how in our code, we normalized the test set using the *training data means and standard deviations*.\n",
"This is *not* a bug.\n",
"\n",
"Explain why it would be improper to compute and use test set means\n",
"and standard deviations. (Hint: Remember what we want to use the test accuracy to measure.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your explanation here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (c) -- 1 pts\n",
"\n",
"Finally, we'll move some of the data in our training set into a validation set.\n",
"\n",
"Explain why we should limit how many times we use the test set, and that we should use the validation\n",
"set during the model building process."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# shuffle the training set\n",
"reindex = np.random.permutation(len(train_xs))\n",
"train_xs = train_xs[reindex]\n",
"train_norm_xs = train_norm_xs[reindex]\n",
"train_ts = train_ts[reindex]\n",
"\n",
"# use the first 50000 elements of `train_xs` as the validation set\n",
"train_xs, val_xs = train_xs[50000:], train_xs[:50000]\n",
"train_norm_xs, val_norm_xs = train_norm_xs[50000:], train_norm_xs[:50000]\n",
"train_ts, val_ts = train_ts[50000:], train_ts[:50000]\n",
"\n",
"# Write your explanation here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2. Classification\n",
"\n",
"We will first build a *classification* model to perform decade classification.\n",
"These helper functions are written for you. All other code that you write in this\n",
"section should be vectorized whenever possible, and you will be penalized for \n",
"not vectorizing your code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def sigmoid(z):\n",
" return 1 / (1 + np.exp(-z))\n",
" \n",
"def cross_entropy(t, y):\n",
" return -t * np.log(y) - (1 - t) * np.log(1 - y)\n",
"\n",
"def cost(y, t):\n",
" return np.mean(cross_entropy(t, y))\n",
"\n",
"def get_accuracy(y, t):\n",
" acc = 0\n",
" N = 0\n",
" for i in range(len(y)):\n",
" N += 1\n",
" if (y[i] >= 0.5 and t[i] == 1) or (y[i] < 0.5 and t[i] == 0):\n",
" acc += 1\n",
" return acc / N"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a) -- 2 pts\n",
"\n",
"Write a function `pred` that computes the prediction `y` based on weights `w` and bias `b`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pred(w, b, X):\n",
" \"\"\"\n",
" Returns the prediction `y` of the target based on the weights `w` and scalar bias `b`.\n",
"\n",
" Preconditions: np.shape(w) == (90,)\n",
" type(b) == float\n",
" np.shape(X) = (N, 90) for some N\n",
"\n",
" >>> pred(np.zeros(90), 1, np.ones([2, 90]))\n",
" array([0.73105858, 0.73105858]) # It's okay if your output differs in the last decimals\n",
" \"\"\"\n",
" # Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b) -- 3 pts\n",
"\n",
"Write a function `derivative_cost` that computes and returns the gradients \n",
"$\\frac{\\partial\\mathcal{E}}{\\partial {\\bf w}}$ and\n",
"$\\frac{\\partial\\mathcal{E}}{\\partial b}$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def derivative_cost(X, y, t):\n",
" \"\"\"\n",
" Returns a tuple containing the gradients dEdw and dEdb.\n",
"\n",
" Precondition: np.shape(X) == (N, 90) for some N\n",
" np.shape(y) == (N,)\n",
" np.shape(t) == (N,)\n",
"\n",
" Postcondition: np.shape(dEdw) = (90,)\n",
" type(dEdb) = float\n",
" \"\"\"\n",
" # Your code goes here\n",
" # return np.zeros(90), 0."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (c) -- 2 pts\n",
"\n",
"We can check that our derivative is implemented correctly using the finite difference rule. In 1D, the\n",
"finite difference rule tells us that for small $h$, we should have\n",
"\n",
"$$\\frac{f(x+h) - f(x)}{h} \\approx f'(x)$$\n",
"\n",
"Prove to yourself (and your TA) that $\\frac{\\partial\\mathcal{E}}{\\partial b}$ is implement correctly\n",
"by comparing the result from `derivative_cost` with the value of `(pred(w, b + h, X) - pred(w, b, X)) / h`.\n",
"Justify your choice of `w`, `b`, and `X`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (d) -- 2 pts\n",
"\n",
"Prove to yourself (and your TA) that $\\frac{\\partial\\mathcal{E}}{\\partial {\\bf w}}$ is implement correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code goes here. You might find this below code helpful: but it's\n",
"# up to you to figure out how/why, and how to modify the code\n",
"h = 0.000001\n",
"H = np.zeros(90)\n",
"H[0] = h"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (e) -- 4 pts\n",
"\n",
"Now that you have a gradient function that works, we can actually run gradient descent! Complete\n",
"the following code that will run stochastic: gradient descent training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def run_gradient_descent(w0, b0, alpha=0.1, batch_size=100, max_iters=100):\n",
" \"\"\"Return the values of (w, b) after running gradient descent for max_iters.\n",
" We use:\n",
" - train_norm_xs and train_ts as the training set\n",
" - val_norm_xs and val_ts as the test set\n",
" - alpha as the learning rate\n",
" - (w0, b0) as the initial values of (w, b)\n",
"\n",
" Precondition: np.shape(w0) == (90,)\n",
" type(b0) == float\n",
" \n",
" Postcondition: np.shape(w) == (90,)\n",
" type(b) == float\n",
" \"\"\"\n",
" w = w0\n",
" b = b0\n",
" iter = 0\n",
"\n",
" while iter < max_iters:\n",
" # shuffle the training set (there is code above for how to do this)\n",
" \n",
" for i in range(0, len(train_norm_xs), batch_size): # iterate over each minibatch\n",
" # minibatch that we are working with:\n",
" X = train_norm_xs[i:(i + batch_size)]\n",
" t = train_ts[i:(i + batch_size), 0]\n",
"\n",
" # since len(train_norm_xs) does not divide batch_size evenly, we will skip over\n",
" # the \"last\" minibatch\n",
" if np.shape(X)[0] != batch_size:\n",
" continue\n",
"\n",
" # compute the prediction\n",
" # y = ...\n",
"\n",
" # update w and b\n",
" # dw, db = ...\n",
" # w = ...\n",
" # b = ...\n",
"\n",
" # increment the iteration count\n",
" iter += 1\n",
"\n",
" # compute and plot the *validation* loss and accuracy\n",
" #if (iter % 10 == 0):\n",
" #train_cost = ...\n",
" #val_y = ...\n",
" #val_cost = ...\n",
" #val_acc = ...\n",
" #print(\"Iter %d. [Val Acc %.0f%%, Loss %f] [Train Loss %f]\" % (\n",
" # iter, val_acc * 100, val_cost, train_cost))\n",
"\n",
" if iter >= max_iters:\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (f) -- 2 pts\n",
"\n",
"Call `run_gradient_descent` with the weights and biases all initialized to zero.\n",
"Show that if `alpha` is too small, then convergence is slow.\n",
"Also, show that if `alpha` is too large, then we do not converge at all!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"w0 = np.zeros(90)\n",
"b0 = 0.\n",
"# Write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (g) -- 2 pts\n",
"\n",
"Find the optimial value of ${\\bf w}$ and $b$ using your code. Explain how you chose\n",
"the learning rate $\\alpha$ and the batch size."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"w0 = np.zeros(90)\n",
"b0 = 0.\n",
"# Write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (h) -- 4 pts\n",
"\n",
"Using the values of `w` and `b` from part (g), compute your training accuracy, validation accuracy,\n",
"and test accuracy. Are there any differences between those three values? If so, why?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (i) -- 4 pts\n",
"\n",
"Writing a classifier like this is instructive, and helps you understand what happens when\n",
"we train a model. However, in practice, we rarely write model building and training code\n",
"from scratch. Instead, we typically use one of the well-tested libraries available in a package.\n",
"\n",
"Use `sklearn.linear_model.LogisticRegression` to build a linear classifier, and make predictions about the test set. Start by reading the\n",
"[API documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n",
"\n",
"Compute the training, validation and test accuracy of this model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#import sklearn.linear_model\n",
"#model = sklearn.linear_model.LogisticRegression()\n",
"#model.fit ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3. Nearest Neighbour\n",
"\n",
"We will compare the nearest neighbour model with the model we built in the earlier parts.\n",
"\n",
"To make predictions for a new data point using k-nearest neighbour, we will need to:\n",
"\n",
"1. Compute the distance from this new data point to every element in the training set\n",
"2. Select the top $k$ closest neighbour in the training set\n",
"3. Find the most common label among those neighbours\n",
"\n",
"We'll use the validation test to select $k$. That is, we'll select the $k$ that gives the highest\n",
"validation accuracy.\n",
"\n",
"Since we have a fairly large data set, computing the distance between a point in the validation\n",
"set and all points in the training set will require more RAM than Google Colab provides.\n",
"To make the comptuations tractable, we will:\n",
"\n",
"1. Use only a subset of the training set (only the first 100,000 elements)\n",
"2. Use only a subset of the validation set (only the first 1000 elements)\n",
"3. We will use the **cosine similarity** rather than Euclidean distance. We will also pre-scale\n",
" each element in training set and the validation set to be a unit vector, so that computing\n",
" the cosine similarity is equivalent to computing the dot product. To see this, recall that \n",
" $$cos(\\theta) = \\frac{v \\cdot w}{||v|| ||w||}$$. But if both ||v|| and ||w|| are zero, then\n",
" only the dot product remains."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we'll need to take the first 100000 element of `train_norm_xs`\n",
"# and scale each of its rows to be unit length\n",
"xs = train_norm_xs[:100000]\n",
"# compute the norms:\n",
"norms = np.linalg.norm(xs, axis=1)\n",
"# divide the xs by the norms. Because of numpy's broadcasting rules, we need to\n",
"# transpose the matrices a couple of times:\n",
"# https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html\n",
"xs = (xs.T / norms).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a) -- 1 pt\n",
"\n",
"Create a numpy matrix `val_xs` that contains the first 1000 elements of `val_norm_xs`, scaled\n",
"so that each of its rows is unit length. Follow the code above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# val_xs = ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b) -- 1 pt\n",
"\n",
"Our goal now is to compute the validation accuracy for a choice of $k$. This will\n",
"require computing the distance between each song in the training set and each\n",
"song in the validation set.\n",
"\n",
"This is actually quite straightforward, and can be done using one matrix\n",
"computation operation!\n",
"\n",
"Compute all the distances between elements of `xs` and those of `val_xs`\n",
"using a single call to `np.dot`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#val_distances = np.dot ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (c) -- 3 pt\n",
"\n",
"Now that we have the distance pairs, we can use the matrix `val_distances`\n",
"to find the set of neighbours for each point in our validation set and \n",
"\n",
"Find the validation accuracy assuming that we use $k = 10$. You may\n",
"use the below helper function if you want, and the `get_accuracy` helper\n",
"from the last section.\n",
"\n",
"You might also find it helpful to do parts (c) and (d) together."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_nearest_neighbours(i, k=10):\n",
" \"\"Return the indices of the top k-element of `xs` that are closests to\n",
" element `i` of the validation set `val_xs`.\n",
" \"\"\n",
" # sort the element of the training set by distance to the i-th\n",
" # element of val_xs\n",
" neighbours = sorted(enumerate(val_distances[:, i]),\n",
" key=lambda r: r[1],\n",
" reverse=True)\n",
" # obtain the top k closest index and return it\n",
" neighbour_indices = [index for (index, dist) in neighbours[:k]]\n",
" return neighbour_indices \n",
"\n",
"def get_train_ts(indices):\n",
" \"\"\"Return the labels of the corresponding elements in the training set `xs`.\n",
" Note that `xs` is the first 100,000 elements of `train_xs`, so we can\n",
" simply index `train_ts`.\n",
" \"\"\"\n",
" return train_ts[indices]\n",
"\n",
"# Write your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (d) -- 2 pts\n",
"\n",
"Compute the validation accuracy for $k = 50, 100, and 1000$.\n",
"Which $k$ provides the best results? In other words, which kNN model would you deploy?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code and solution here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (e) -- 4 pt\n",
"\n",
"Compute the test accuracy for the $k$ that you chose in the previous part.\n",
"Use only a sample of 1000 elements from the test set to keep the problem tractable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code and solution here"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}