{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CSC321 Tutorial 6: Optimization and Convolutional Neural Networks\n",
    "\n",
    "In lecture 5, we talk about different issues that may arise when \n",
    "training an artificial neural network. Today, we'll explore some of these\n",
    "issues, and explore different ways that we can optimize a neural network's\n",
    "cost function.\n",
    "\n",
    "In lecture 6, we will cover convolutional neural networks. Since this is\n",
    "the last tutorial before reading week, we will also train some CNN's today.\n",
    "If you are in the Tuesday lecture section, don't worry! Think of CNN's as a\n",
    "neural network with a slightly different architecture, or that the weights\n",
    "are \"wired\" differently. These weights (parameters) still can be optimized\n",
    "via gradient descent, and we will still use the back-propagation algorithm.\n",
    "\n",
    "Please note that because there is stochasticity in the way we initalize the\n",
    "neural network weights, so we will get different results (final training/validation\n",
    "accuracies) if we run the initialization + training multiple times.\n",
    "You will need to run some of the provided code multiple times to make\n",
    "a conclusion about what optimization methods work well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import math\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "We'll use the MNIST data set, the same data set that we introduced in\n",
    "Tutorial 4.\n",
    "The MNIST dataset contains black and white, hand-written (numerical) digits\n",
    "that are 28x28 pixels large.\n",
    "As in tutorial 4, we'll only use the first 2500 images in the MNIST dataset.\n",
    "The first time you run this code, we will download the MNIST dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchvision import datasets, transforms\n",
    "\n",
    "mnist_train = datasets.MNIST('data',\n",
    "                             train=True,\n",
    "                             download=True,\n",
    "                             transform=transforms.ToTensor())\n",
    "mnist_train = list(mnist_train)[:2500]\n",
    "\n",
    "mnist_train, mnist_val = mnist_train[:2000], mnist_train[2000:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Models in PyTorch\n",
    "\n",
    "We'll work with two models: a MLP and a convolutional neural network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multi-layer perceptron\n",
    "class MLP(nn.Module):\n",
    "    def __init__(self, num_hidden):\n",
    "        super(MLP, self).__init__()\n",
    "        self.layer1 = nn.Linear(28 * 28, num_hidden)\n",
    "        self.layer2 = nn.Linear(num_hidden, 10)\n",
    "        self.num_hidden = num_hidden\n",
    "    def forward(self, img):\n",
    "        flattened = img.view(-1, 28 * 28) # flatten the image\n",
    "        activation1 = self.layer1(flattened)\n",
    "        activation1 = torch.relu(activation1)\n",
    "        activation2 = self.layer2(activation1)\n",
    "        return activation2\n",
    "\n",
    "# You should understand this after Lecture 6\n",
    "class CNN(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(CNN, self).__init__()\n",
    "        self.conv1 = nn.Conv2d(in_channels=1,\n",
    "                               out_channels=4,\n",
    "                               kernel_size=3,\n",
    "                               padding=1)\n",
    "        self.pool = nn.MaxPool2d(2, 2)\n",
    "        self.conv2 = nn.Conv2d(in_channels=4,\n",
    "                               out_channels=8,\n",
    "                               kernel_size=3,\n",
    "                               padding=1)\n",
    "        self.fc = nn.Linear(8 * 7 * 7, 10)\n",
    "    def forward(self, x):\n",
    "        x = self.pool(torch.relu(self.conv1(x)))\n",
    "        x = self.pool(torch.relu(self.conv2(x)))\n",
    "        x = x.view(-1, 8 * 7 * 7)\n",
    "        return self.fc(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to gauge the \"complexity\" or the \"capacity\" of the\n",
    "neural network is by looking at the number of parameters that it\n",
    "has."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_num_parameters(model, name=\"model\"):\n",
    "    print(\"Number of parameters in %s\" % name,\n",
    "          sum(p.numel() for p in model.parameters()))\n",
    "\n",
    "print_num_parameters(MLP(200), \"MLP(200)\")\n",
    "print_num_parameters(CNN(), \"the CNN\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the neural network\n",
    "\n",
    "We'll use a fairly configurable training training function that computes\n",
    "both training and validation accuracy in each iteration. This is more"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train(model, data, batch_size=64, weight_decay=0.0,\n",
    "          optimizer=\"sgd\", learning_rate=0.1, momentum=0.9,\n",
    "          data_shuffle=True, num_epochs=10):\n",
    "    # training data\n",
    "    train_loader = torch.utils.data.DataLoader(data,\n",
    "                                               batch_size=batch_size,\n",
    "                                               shuffle=data_shuffle)\n",
    "    # loss function\n",
    "    criterion = nn.CrossEntropyLoss()\n",
    "    # optimizer\n",
    "    assert optimizer in (\"sgd\", \"adam\")\n",
    "    if optimizer == \"sgd\":\n",
    "        optimizer = optim.SGD(model.parameters(),\n",
    "                              lr=learning_rate,\n",
    "                              momentum=momentum,\n",
    "                              weight_decay=weight_decay)\n",
    "    else:\n",
    "        optimizer = optim.Adam(model.parameters(),\n",
    "                               lr=learning_rate,\n",
    "                               weight_decay=weight_decay)\n",
    "    # track learning curve\n",
    "    iters, losses, train_acc, val_acc = [], [], [], []\n",
    "    # training\n",
    "    n = 0 # the number of iterations (for plotting)\n",
    "    for epoch in range(num_epochs):\n",
    "        for imgs, labels in iter(train_loader):\n",
    "            if imgs.size()[0] < batch_size:\n",
    "                continue\n",
    "\n",
    "            model.train() # annotate model for training\n",
    "            out = model(imgs)\n",
    "            loss = criterion(out, labels)\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "            optimizer.zero_grad()\n",
    "\n",
    "            # save the current training information\n",
    "            iters.append(n)\n",
    "            losses.append(float(loss)/batch_size)             # compute *average* loss\n",
    "            train_acc.append(get_accuracy(model, train=True)) # compute training accuracy \n",
    "            val_acc.append(get_accuracy(model, train=False))  # compute validation accuracy\n",
    "            n += 1\n",
    "\n",
    "    # plotting\n",
    "    plt.title(\"Learning Curve\")\n",
    "    plt.plot(iters, losses, label=\"Train\")\n",
    "    plt.xlabel(\"Iterations\")\n",
    "    plt.ylabel(\"Loss\")\n",
    "    plt.show()\n",
    "\n",
    "    plt.title(\"Learning Curve\")\n",
    "    plt.plot(iters, train_acc, label=\"Train\")\n",
    "    plt.plot(iters, val_acc, label=\"Validation\")\n",
    "    plt.xlabel(\"Iterations\")\n",
    "    plt.ylabel(\"Training Accuracy\")\n",
    "    plt.legend(loc='best')\n",
    "    plt.show()\n",
    "\n",
    "    print(\"Final Training Accuracy: {}\".format(train_acc[-1]))\n",
    "    print(\"Final Validation Accuracy: {}\".format(val_acc[-1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And of course, we need the `get_accuracy` helper function. To turn the probabilities\n",
    "into a discrete prediction, we will take the digit with the highest probability.\n",
    "Because of the way softmax is computed, the digit with the highest probability is\n",
    "the same as the digit with the (pre-activation) output value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_accuracy(model, train=False):\n",
    "    if train:\n",
    "        data = torch.utils.data.DataLoader(mnist_train, batch_size=4096)\n",
    "    else:\n",
    "        data = torch.utils.data.DataLoader(mnist_val, batch_size=1024)\n",
    "\n",
    "    model.eval() # annotate model for evaluation\n",
    "    correct = 0\n",
    "    total = 0\n",
    "    for imgs, labels in data:\n",
    "        output = model(imgs) # We don't need to run torch.softmax\n",
    "        pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability\n",
    "        correct += pred.eq(labels.view_as(pred)).sum().item()\n",
    "        total += imgs.shape[0]\n",
    "    return correct / total"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what the training curve of a multi-layer perceptron looks like.\n",
    "This code will take a couple minutes to run..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(50)\n",
    "train(model, mnist_train, learning_rate=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MLP Hidden Unit Size\n",
    "\n",
    "The first thing we'll explore is the hidden unit size. If we increase the number\n",
    "of hidden units in a MLP, we'll increase its parameters counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print_num_parameters(MLP(50), \"MLP with 50 hidden units\")\n",
    "print_num_parameters(MLP(100), \"MLP with 100 hidden units\")\n",
    "print_num_parameters(MLP(200), \"MLP with 200 hidden units\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With more hidden units, our model has more \"capacity\", and can learn\n",
    "more intricate patterns in the training data. Our training accuracy will\n",
    "therefore be higher. However, the computation time for training and\n",
    "using these networks will also increase.\n",
    "\n",
    "Adding more parameters tend to widen the gap between training and validation\n",
    "accuracy. As we add too many parameters, we could overfit. However, we won't\n",
    "show that here since the computations will take a long time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(100)\n",
    "train(model, mnist_train, learning_rate=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A smaller network will train faster, but may have worse training accuracy.\n",
    "Bear in mind that since the neural networks initialization is random, k/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(30)\n",
    "train(model, mnist_train, learning_rate=0.1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interlude: shuffling the dataset\n",
    "\n",
    "What if don't off `data_shuffle`? That is, what if we use the **same mini-batches**\n",
    "across all of our epochs? Can you explain what's going on in this learning\n",
    "curve?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(30)\n",
    "train(model, mnist_train, learning_rate=0.1, data_shuffle=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Conv Net\n",
    "\n",
    "The learning curve for the convolutional network looks similar. This network\n",
    "is a lot more compact with much fewer parameters. The computation time is\n",
    "a bit longer than training MLPs, but we get fairly good results.\n",
    "(The learning rate of 0.1 looks a little high for this CNN, based on the noisiness\n",
    "of the learning curves.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = CNN()\n",
    "print_num_parameters(model, \"the CNN\")\n",
    "train(model, mnist_train, batch_size=64, optimizer=\"sgd\", learning_rate=0.1,\n",
    "      momentum=0., num_epochs=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Momentum\n",
    "\n",
    "We'll mainly experiment with the `MLP(30)` model, since it trains the fastest.\n",
    "We'll measure how quickly the model trains by looking at how far we get in \n",
    "first 3 epochs of training. Here's how far our model gets without using momenutm,\n",
    "with a learning rate of 0.1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(30)\n",
    "train(model, mnist_train, learning_rate=0.1, momentum=0., num_epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With a well-tuned learning-rate and momentum parameter, our training can go faster.\n",
    "(Note: We had to try a few settings before finding one that worked well, and encourage\n",
    "you to try different combinations of the learning rate and momentum. For example,\n",
    "learning rate of 0.1 and momentum of"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(30)\n",
    "train(model, mnist_train, learning_rate=0.05, momentum=0.9, num_epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The optimizer Adam works well and is the most popular optimizer nowadays.\n",
    "Adam typically requires a smaller learning rate: start at 0.001, then increase/decrease\n",
    "as you see fit. For this example, 0.005 works well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(30)\n",
    "train(model, mnist_train, optimizer=\"adam\", learning_rate=0.005, num_epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Convnets can also be trained using SGD with momentum or with Adam. In particular, our\n",
    "CNN generalizes very well. (Since our validation accuracy is about equal to our\n",
    "training accuracy, we can afford to increase the model capacity if we want to.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = CNN()\n",
    "train(model, mnist_train, optimizer=\"adam\", learning_rate=0.005, num_epochs=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment to run\n",
    "# train(CNN(), mnist_train, learning_rate=0.1, momentum=0.9, num_epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batch Normalization\n",
    "\n",
    "Batch normalization speeds up training significantly!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MLPBN(nn.Module):\n",
    "    def __init__(self, num_hidden):\n",
    "        super(MLPBN, self).__init__()\n",
    "        self.layer1 = nn.Linear(28 * 28, num_hidden)\n",
    "        self.bn = nn.BatchNorm1d(num_hidden)\n",
    "        self.layer2 = nn.Linear(num_hidden, 10)\n",
    "        self.num_hidden = num_hidden\n",
    "    def forward(self, img):\n",
    "        flattened = img.view(-1, 28 * 28) # flatten the image\n",
    "        activation1 = self.layer1(flattened)\n",
    "        activation1 = torch.relu(activation1)\n",
    "        activation1 = self.bn(activation1)\n",
    "        activation2 = self.layer2(activation1)\n",
    "        return activation2\n",
    "\n",
    "mlp_bn = MLPBN(30)\n",
    "train(mlp_bn, mnist_train, optimizer=\"adam\", learning_rate=0.005, num_epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a debate as to whether the batch-normalization should be applied\n",
    "*before* or *after* the activation. The original batch normalization paper\n",
    "applied the normalization before the ReLU activation, but applying normalization\n",
    "*after* the ReLU performs better in practice. \n",
    "\n",
    "I (Lisa) believe the reason to be as follows:\n",
    "\n",
    "1. If we apply normalization before ReLU, then we are effectively ignoring the\n",
    "   bias parameter of those units, since those unit's activations gets centered\n",
    "   anyways.\n",
    "2. If we apply normalization after ReLU, we will have both positive and negative\n",
    "   information being passed to the next layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MLPBNBeforeReLu(nn.Module):\n",
    "    def __init__(self, num_hidden):\n",
    "        super(MLPBNBeforeReLu, self).__init__()\n",
    "        self.layer1 = nn.Linear(28 * 28, num_hidden)\n",
    "        self.bn = nn.BatchNorm1d(num_hidden)\n",
    "        self.layer2 = nn.Linear(num_hidden, 10)\n",
    "        self.num_hidden = num_hidden\n",
    "    def forward(self, img):\n",
    "        flattened = img.view(-1, 28 * 28) # flatten the image\n",
    "        activation1 = self.layer1(flattened)\n",
    "        activation1 = self.bn(activation1)\n",
    "        activation1 = torch.relu(activation1)\n",
    "        activation2 = self.layer2(activation1)\n",
    "        return activation2\n",
    "    \n",
    "train(MLPBNBeforeReLu(30), mnist_train, optimizer=\"adam\", learning_rate=0.005, num_epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Batch normalization can be used in CNNs too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CNNBN(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(CNNBN, self).__init__()\n",
    "        self.conv1 = nn.Conv2d(in_channels=1,\n",
    "                               out_channels=4,\n",
    "                               kernel_size=3,\n",
    "                               padding=1)\n",
    "        self.bn1 = nn.BatchNorm2d(4) # num out channels\n",
    "        self.pool = nn.MaxPool2d(2, 2)\n",
    "        self.conv2 = nn.Conv2d(in_channels=4,\n",
    "                               out_channels=8,\n",
    "                               kernel_size=3,\n",
    "                               padding=1)\n",
    "        self.bn2 = nn.BatchNorm2d(8) # num out channels\n",
    "        self.fc = nn.Linear(8 * 7 * 7, 10)\n",
    "    def forward(self, x):\n",
    "        x = self.bn1(torch.relu(self.conv1(x)))\n",
    "        x = self.pool(x)\n",
    "        x = self.bn2(torch.relu(self.conv2(x)))\n",
    "        x = self.pool(x)\n",
    "        x = x.view(-1, 8 * 7 * 7)\n",
    "        return self.fc(x)\n",
    "\n",
    "train(CNNBN(), mnist_train, optimizer=\"adam\", learning_rate=0.005, num_epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Weight Initialization\n",
    "\n",
    "If we initialize weights to zeros, our neural network will be stuck in a\n",
    "saddle point. Since we are using stochastic gradient descent, we will see\n",
    "only noise in the training curve and no progress."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP(30)\n",
    "for p in model.parameters():\n",
    "    nn.init.zeros_(p)\n",
    "train(model, mnist_train, optimizer=\"adam\", learning_rate=0.005, num_epochs=3)"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}