{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural Network Training\n",
    "\n",
    "In this chapter, we'll talk about neural network training, plotting the \n",
    "training curve, and tuning hyper-parameters.\n",
    "Let's start with the neural network training code that we wrote\n",
    "previously:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim\n",
    "from torchvision import datasets, transforms\n",
    "import matplotlib.pyplot as plt # for plotting\n",
    "\n",
    "torch.manual_seed(1) # set the random seed\n",
    "\n",
    "class Pigeon(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(Pigeon, self).__init__()\n",
    "        self.layer1 = nn.Linear(28 * 28, 30)\n",
    "        self.layer2 = nn.Linear(30, 1)\n",
    "    def forward(self, img):\n",
    "        flattened = img.view(-1, 28 * 28)\n",
    "        activation1 = self.layer1(flattened)\n",
    "        activation1 = F.relu(activation1)\n",
    "        activation2 = self.layer2(activation1)\n",
    "        return activation2\n",
    "\n",
    "# load the data\n",
    "mnist_train = datasets.MNIST('data', train=True, download=True)\n",
    "mnist_train = list(mnist_train)[:2000]\n",
    "img_to_tensor = transforms.ToTensor()\n",
    "\n",
    "# create a new model, initialize random parameters\n",
    "pigeon = Pigeon()\n",
    "\n",
    "# loss and optimizer\n",
    "criterion = nn.BCEWithLogitsLoss()\n",
    "optimizer = optim.SGD(pigeon.parameters(), lr=0.005, momentum=0.9)\n",
    "\n",
    "# training\n",
    "for (image, label) in mnist_train[:1000]:\n",
    "    # actual ground truth: is the digit less than 3?\n",
    "    # (as_tensor works whether torchvision returns the label as an int or as a tensor)\n",
    "    actual = torch.as_tensor(label < 3).reshape([1, 1]).type(torch.FloatTensor)\n",
    "    # prediction\n",
    "    out = pigeon(img_to_tensor(image))\n",
    "    # update the parameters based on the loss\n",
    "    loss = criterion(out, actual) # compute the loss\n",
    "    loss.backward()               # compute updates for each parameter\n",
    "    optimizer.step()              # make the updates for each parameter\n",
    "    optimizer.zero_grad()         # a clean up step for PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batching\n",
    "\n",
    "Let's break down the training code a little more. At each iteration\n",
    "of the main training loop, we:\n",
    "\n",
    "- use our network to make the predictions for **one image**\n",
    "- compute the loss for that **one image**\n",
    "- take a \"step\" to optimize the loss of that **one image**\n",
    "\n",
    "In a biological setting, it makes sense to ask real pigeons to reason\n",
    "about one image at a time. However, with an artificial neural network,\n",
    "we may want to use more than one image at one time. That way, we can\n",
    "compute the *average* loss across a **mini-batch** of *multiple* images, and take a step\n",
    "to optimize the *average* loss. The average loss across multiple training\n",
    "inputs is going to be less \"noisy\" than the loss for a single input, and is\n",
    "less likely to provide \"bad information\" because of a \"bad\" input.\n",
    "\n",
    "So, we are going to do the following at each iteration:\n",
    "\n",
    "- we use our network to make the predictions for **$n$ images**\n",
    "- we compute the *average* loss for those **$n$ images**\n",
    "- we take a \"step\" to optimize the *average* loss of those **$n$ images**\n",
    "\n",
    "The number $n$ is called the **batch size**.\n",
    "\n",
    "In one extreme, we can set $n = 1$, as we have done above. However, having\n",
    "such a small batch size means that we might be optimizing a very different\n",
    "loss function in each iteration. The \"steps\" that we make might cause our\n",
    "parameters to change in different directions. Training might therefore take\n",
    "longer because of the noisiness.\n",
    "\n",
    "In the other extreme, we can set $n$ to be the size of our training set.\n",
    "We would be computing the *average* loss for our *entire* training set at every\n",
    "iteration. When we have a small training set, this strategy might be feasible.\n",
    "When we have a large training set, computing the predictions and loss for every\n",
    "training input becomes expensive. Besides, the average loss might not change very\n",
    "much as we keep increasing our batch size.\n",
    "\n",
    "The actual batch size that we choose depends on many things. We want our batch\n",
    "size to be large enough to not be too \"noisy\", but not so large as to make each\n",
    "iteration too expensive to run.\n",
    "\n",
    "People often choose batch sizes of the form $n=2^k$ so that it is easy to halve\n",
    "or double the batch size. We'll choose a batch size of 32 and train the network\n",
    "again.\n",
    "\n",
    "First, we'll use some `PyTorch` helpers to make it easy to sample 32 images at once:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mnist_train = datasets.MNIST('data', train=True,\n",
    "                             transform=img_to_tensor) # apply img_to_tensor to every image\n",
    "train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that items in `mnist_train` look like this:"
   ]
  },
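  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "img, label = mnist_train[0]  # index the dataset directly instead of building a list\n",
    "print(img.shape)\n",
    "print(label)\n",
    "print(torch.as_tensor(label).shape)  # the label is a scalar, so its shape is empty"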
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each image is a **tensor** of dimension `1 x 28 x 28`. **Tensors** are n-dimensional\n",
    "analogs of matrices. You can think of these as n-dimensional arrays.\n",
    "\n",
    "Let's take a look at what `train_loader` does for us:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_loader_iter = iter(train_loader)\n",
    "imgs, labels = next(train_loader_iter)\n",
    "print(imgs.shape)\n",
    "print(labels)\n",
    "print(labels.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variable `imgs` now has 32 images instead of just one.\n",
    "\n",
    "By design, neural networks are highly parallel, and heavily use matrix operations.\n",
    "(Remember how to compute activations of a neuron based on the previous layer's neurons?\n",
    "The math looked like this: $f(\\sum_i w_i x_i)$. The input to the non-linearity $f$\n",
    "is really just the dot product of two vectors!)\n",
    "\n",
    "Our new training code actually looks almost identical to our old training code,\n",
    "again because PyTorch does a very good job parallelizing operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for n, (imgs, labels) in enumerate(train_loader):\n",
    "    # stop after 10 iterations\n",
    "    if n >= 10:\n",
    "        break\n",
    "    # actual ground truth: is the digit less than 3?\n",
    "    actual = (labels < 3).reshape([32, 1]).type(torch.FloatTensor)\n",
    "    # prediction\n",
    "    out = pigeon(imgs)\n",
    "    # update the parameters based on the loss\n",
    "    loss = criterion(out, actual) # compute the average loss over the mini-batch\n",
    "    loss.backward()               # compute updates for each parameter\n",
    "    optimizer.step()              # make the updates for each parameter\n",
    "    optimizer.zero_grad()         # a clean up step for PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the code above, `loss` is the *average* loss across the examples in the mini-batch,\n",
    "because `nn.BCEWithLogitsLoss` averages by default (`reduction='mean'`). The total (summed)\n",
    "loss is the same thing up to multiplication by a constant (the batch size).\n",
    "\n",
    "## Iterations and Epochs\n",
    "\n",
    "We've been using the term *iteration* to mean an *iteration* of the for loop. However,\n",
    "this term has meaning in neural network training aside from the structure of the code.\n",
    "An **iteration** in neural network training is **one parameter update step**. That is,\n",
    "in each *iteration*, each parameter is updated once.\n",
    "\n",
    "In our earlier training code at the top of this section, we trained our neural network\n",
    "for 1000 iterations with a batch size of 1.\n",
    "In our more recent training code, we trained for 10 iterations. We used a batch size of 32,\n",
    "so the actual number of training images we used is 320.\n",
    "\n",
    "The way `train_loader` works is that it groups the training data into **mini-batches**\n",
    "with the appropriate batch size. By default the data is taken in order; passing\n",
    "`shuffle=True` to `DataLoader` groups it randomly instead. Each data point belongs to\n",
    "only one mini-batch. When there are no more mini-batches left, the loop terminates.\n",
    "\n",
    "In general, we may wish to train the network for longer. We may wish to use each training data\n",
    "point more than once. In other words, we may wish to train a neural network for more than\n",
    "**one epoch**. An **epoch** is one pass through the training data, in which every training\n",
    "data point is used once to update the parameters. So far, we haven't even trained on a full epoch of the MNIST\n",
    "data.\n",
    "\n",
    "Both **epochs** and **iterations** are units of measurement for the amount of neural\n",
    "network training. If you know the size of your *training set* and the *batch size*, you\n",
    "can easily convert between the two.\n",
    "\n",
    "We want our neural networks to train quickly. We want the highest accuracy or lowest error\n",
    "with the fewest number of *iterations* or *epochs*, so that we can save time and electricity.\n",
    "We saw that our choice of *batch size* can affect how quickly our neural network\n",
    "can achieve a certain accuracy.\n",
    "\n",
    "## Learning Rate\n",
    "\n",
    "The optimizer settings can also affect the speed of neural network training.\n",
    "In particular, all optimizers have a setting called the **learning rate**, which\n",
    "controls the *size* of each parameter update step we take."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = optim.SGD(pigeon.parameters(), # the parameters that this optimizer optimizes\n",
    "                      lr=0.005,            # the learning rate\n",
    "                      momentum=0.9)        # other optimizer settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we double the learning rate (to `lr=0.01`), then we will take a step twice the size\n",
    "in each iteration -- or each time we call `optimizer.step()`.\n",
    "\n",
    "A learning rate too small would mean that the network parameters don't change very much in\n",
    "each iteration. As a result, the network could take a long time to train.\n",
    "\n",
    "A learning rate too large can mean that we change our parameters a lot, possibly in\n",
    "different directions in different iterations. We might even *overshoot* when moving the\n",
    "parameters in the appropriate direction, going past an optimal point.\n",
    "\n",
    "An appropriate learning rate depends on the batch size, the problem, the particular optimizer\n",
    "used (`optim.SGD` vs a different optimizer), and the stage of training. A large batch size\n",
    "will afford us a larger learning rate, and a smaller batch size requires a smaller learning\n",
    "rate. Practitioners also often *reduce* the learning rate as training progresses, and as we\n",
    "approach a good set of parameter values.\n",
    "\n",
    "## Tracking Training using a Training Curve \n",
    "\n",
    "How do we know when to stop training? How do we know what learning rate and what batch sizes\n",
    "are appropriate? Those are very important and practical questions to answer when training a \n",
    "neural network. We answer those questions by plotting a **training curve**.\n",
    "\n",
    "A **training curve** is a chart that shows:\n",
    "\n",
    "- The **iterations** or **epochs** on the x-axis\n",
    "- The **loss** or **accuracy** on the y-axis.\n",
    "\n",
    "The idea is to track how the loss or accuracy changes as training progresses.\n",
    "\n",
    "Let's plot a training curve for training a new `Pigeon` network on the\n",
    "first 1024 training images. We'll use a batch size of 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# only use the first 1024 images for training\n",
    "mnist_data = datasets.MNIST('data', train=True, transform=img_to_tensor)\n",
    "mnist_data = list(mnist_data)\n",
    "mnist_train = mnist_data[:1024]\n",
    "\n",
    "# create a new network with random weights, and optimizer for the network\n",
    "pigeon = Pigeon()\n",
    "train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=1)\n",
    "optimizer = optim.SGD(pigeon.parameters(), lr=0.005, momentum=0.9)\n",
    "\n",
    "iters  = [] # save the iteration counts here for plotting\n",
    "losses = [] # save the avg loss here for plotting\n",
    "\n",
    "# training\n",
    "for n, (imgs, labels) in enumerate(train_loader):\n",
    "    actual = (labels < 3).reshape([1, 1]).type(torch.FloatTensor)\n",
    "    out = pigeon(imgs)\n",
    "    loss = criterion(out, actual) # compute the loss\n",
    "    loss.backward()               # compute updates for each parameter\n",
    "    optimizer.step()              # make the updates for each parameter\n",
    "    optimizer.zero_grad()         # a clean up step for PyTorch\n",
    "\n",
    "    # save the current training information\n",
    "    iters.append(n)\n",
    "    losses.append(float(loss))\n",
    "\n",
    "# plotting\n",
    "plt.plot(iters, losses)\n",
    "plt.title(\"Training Curve (batch_size=1, lr=0.005)\")\n",
    "plt.xlabel(\"Iterations\")\n",
    "plt.ylabel(\"Loss\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first thing that you might notice is that the loss is very noisy.\n",
    "Some people choose to plot a running average of the loss to remove some\n",
    "of the noise, but I will ask you to just squint for now.\n",
    "\n",
    "The loss tends to improve very quickly near the beginning of training.\n",
    "To be able to better see the loss improvement towards the middle and end\n",
    "of the epoch, we can omit the first few iterations from our training curve\n",
    "(for example, by plotting `iters[50:]` against `losses[50:]`).\n",
    "\n",
    "Let's see how the training curve changes as we change the *batch size*\n",
    "and the *learning rate*.\n",
    "We will still plot one epoch of training with 1024 images, so that the\n",
    "comparison with the earlier plots is fair.\n",
    "\n",
    "Since we'll be varying the batch size and learning rate, we'll write\n",
    "a function that plots the training curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_training_curve(batch_size=1, lr=0.005):\n",
    "    \"\"\"\n",
    "    Plots the training curve on one epoch of training of the\n",
    "    Pigeon network trained using the first 1024 images in\n",
    "    the MNIST dataset.\n",
    "    \"\"\"\n",
    "\n",
    "    pigeon = Pigeon()\n",
    "    train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size)\n",
    "    optimizer = optim.SGD(pigeon.parameters(), lr=lr, momentum=0.9)\n",
    "\n",
    "    iters  = []\n",
    "    losses = []\n",
    "\n",
    "    # training\n",
    "    for n, (imgs, labels) in enumerate(train_loader):\n",
    "        actual = (labels < 3).reshape([batch_size, 1]) \\\n",
    "                            .type(torch.FloatTensor)\n",
    "        out = pigeon(imgs)\n",
    "        loss = criterion(out, actual) # compute the average loss over the mini-batch\n",
    "        loss.backward()               # compute updates for each parameter\n",
    "        optimizer.step()              # make the updates for each parameter\n",
    "        optimizer.zero_grad()         # a clean up step for PyTorch\n",
    "\n",
    "        # save the current training information\n",
    "        iters.append(n)\n",
    "        losses.append(float(loss)) # already the *average* loss (reduction='mean')\n",
    "\n",
    "    # plotting\n",
    "    plt.plot(iters, losses)\n",
    "    plt.title(\"Training Curve (batch_size={}, lr={})\".format(batch_size, lr))\n",
    "    plt.xlabel(\"Iterations\")\n",
    "    plt.ylabel(\"Loss\")\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Increasing the Batch Size\n",
    "\n",
    "First, let's try a batch size of 32 and the same learning rate as before:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_training_curve(batch_size=32, lr=0.005)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loss curve is a lot less noisy than before, and you can see that the\n",
    "network is still improving (the loss is still on a decline). Despite the\n",
    "general downward trend, the training loss can increase from time to time.\n",
    "Recall that in each iteration, we are computing the loss on a *different*\n",
    "mini-batch of training data.\n",
    "\n",
    "### Increasing the Learning Rate\n",
    "\n",
    "Since we increased the batch size, we might be able to get away with a higher\n",
    "learning rate. Let's try."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_training_curve(batch_size=32, lr=0.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The neural network trains much more quickly, and achieves a lower minimum loss.\n",
    "\n",
    "### Decreasing the Learning Rate\n",
    "\n",
    "For comparison, here's what happens when we decrease the learning rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_training_curve(batch_size=32, lr=0.001)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The network trains a little more slowly, but the curve is less noisy.\n",
    "\n",
    "### Decreasing the Learning Rate with Batch Size 1\n",
    "\n",
    "If we keep the learning rate low, but go back to a batch size of 1, the noise\n",
    "is also reduced slightly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_training_curve(batch_size=1, lr=0.001)"
   ]
  },
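  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The batch-size-1 curves above are still noisy. As mentioned earlier, some people plot a\n",
    "running average of the loss instead of the raw values. Here is a minimal sketch of that idea\n",
    "in plain Python. The helper name `running_average` and the window size are illustrative\n",
    "assumptions, not part of the training code above, and the loss values are made up:\n",
    "\n",
    "```python\n",
    "def running_average(values, window=20):\n",
    "    # entry i of the result is the mean of the last `window` values up to index i\n",
    "    smoothed = []\n",
    "    total = 0.0\n",
    "    for i, v in enumerate(values):\n",
    "        total += v\n",
    "        if i >= window:\n",
    "            total -= values[i - window]  # drop the value that left the window\n",
    "        count = min(i + 1, window)       # fewer values are available at the start\n",
    "        smoothed.append(total / count)\n",
    "    return smoothed\n",
    "\n",
    "# toy example with made-up loss values (not real training output)\n",
    "toy_losses = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]\n",
    "print(running_average(toy_losses, window=2))\n",
    "```\n",
    "\n",
    "With a list of recorded losses, a smoothed curve could then be drawn with something like\n",
    "`plt.plot(iters, running_average(losses))`. A larger window gives a smoother curve, but\n",
    "reacts more slowly to real changes in the loss."
   ]
  },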
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting Error or Accuracy\n",
    "\n",
    "Instead of plotting the loss on the y-axis, we could also choose\n",
    "to plot the training accuracy or training error. The downside is that\n",
    "we would have to compute the training accuracy, which is extra work\n",
    "that we don't normally do during training.\n",
    "\n",
    "We also have to make a choice:\n",
    "do we compute the accuracy across the *minibatch*, the *entire*\n",
    "training set, or some subset of the training set? To be fair, we could also plot\n",
    "the loss for the entire *training set* instead of for the *minibatch* as well -- but\n",
    "accuracy and error rates tend to be more \"discrete\" than loss values, requiring\n",
    "a larger sample to be meaningful.\n",
    "\n",
    "We could also choose to compute the training accuracy/error every few iterations,\n",
    "as opposed to after every iteration. We would still be able to see the\n",
    "general trend, but without as much cost.\n",
    "\n",
    "For our purposes, let's first write a general helper function that computes\n",
    "the accuracy of a model across some dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get a PyTorch tensor of the entire training set, for computing accuracy\n",
    "tmp_loader = torch.utils.data.DataLoader(mnist_train, batch_size=1024)\n",
    "all_train_imgs, all_train_labels = next(iter(tmp_loader))\n",
    "all_train_actual = (all_train_labels < 3).reshape([1024, 1]).type(torch.FloatTensor)\n",
    "\n",
    "def get_accuracy(model, input=all_train_imgs, actual=all_train_actual):\n",
    "    \"\"\"\n",
    "    Return the accuracy of the model on the input data and actual ground truth.\n",
    "    \"\"\"\n",
    "    prob = torch.sigmoid(model(input))\n",
    "    pred = (prob > 0.5).type(torch.FloatTensor)\n",
    "    correct = (pred == actual).type(torch.FloatTensor)\n",
    "    return float(torch.mean(correct))"
   ]
  },
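  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the thresholding in `get_accuracy` concrete, here is a minimal plain-Python sketch\n",
    "(no PyTorch) with made-up logits. The helper names `accuracy_from_logits` and\n",
    "`bce_from_logits` are hypothetical. The sketch also illustrates why accuracy is more\n",
    "discrete than loss: two sets of logits can give the same accuracy but different losses.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def sigmoid(z):\n",
    "    return 1.0 / (1.0 + math.exp(-z))\n",
    "\n",
    "def accuracy_from_logits(logits, targets):\n",
    "    # predict 1 when the probability exceeds 0.5 (equivalently, when the logit is positive)\n",
    "    preds = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]\n",
    "    correct = [1.0 if p == t else 0.0 for p, t in zip(preds, targets)]\n",
    "    return sum(correct) / len(correct)\n",
    "\n",
    "def bce_from_logits(logits, targets):\n",
    "    # average binary cross-entropy, computed from the predicted probabilities\n",
    "    losses = []\n",
    "    for z, t in zip(logits, targets):\n",
    "        p = sigmoid(z)\n",
    "        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))\n",
    "    return sum(losses) / len(losses)\n",
    "\n",
    "targets   = [1.0, 0.0, 1.0, 0.0]\n",
    "unsure    = [0.1, -0.1, 0.1, 0.1]  # made-up logits: barely on the right side, one mistake\n",
    "confident = [3.0, -3.0, 3.0, 0.1]  # same predictions, but more confident when correct\n",
    "\n",
    "print(accuracy_from_logits(unsure, targets), accuracy_from_logits(confident, targets))\n",
    "print(bce_from_logits(unsure, targets) > bce_from_logits(confident, targets))\n",
    "```\n",
    "\n",
    "Both sets of logits classify three of the four examples correctly, so their accuracy is\n",
    "identical, while the more confident set has the lower loss."
   ]
  },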
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's revise the `plot_training_curve` function to plot two curves:\n",
    "one that plots the loss function, and another that plots the accuracy\n",
    "across the *entire* training set at every iteration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_training_curve_with_acc(batch_size=1, lr=0.005):\n",
    "    \"\"\"\n",
    "    Plots the training curve on one epoch of training\n",
    "    of the Pigeon network trained using the first 1024 images in\n",
    "    the MNIST dataset.\n",
    "    \"\"\"\n",
    "\n",
    "    pigeon = Pigeon()\n",
    "    train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size)\n",
    "    optimizer = optim.SGD(pigeon.parameters(), lr=lr, momentum=0.9)\n",
    "\n",
    "    iters  = []\n",
    "    losses = []\n",
    "    train_acc = []\n",
    "\n",
    "    # training\n",
    "    for n, (imgs, labels) in enumerate(train_loader):\n",
    "        actual = (labels < 3).reshape([batch_size, 1]).type(torch.FloatTensor)\n",
    "        out = pigeon(imgs)\n",
    "        loss = criterion(out, actual) # compute the average loss over the mini-batch\n",
    "        loss.backward()               # compute updates for each parameter\n",
    "        optimizer.step()              # make the updates for each parameter\n",
    "        optimizer.zero_grad()         # a clean up step for PyTorch\n",
    "\n",
    "        # save the current training information\n",
    "        iters.append(n)\n",
    "        losses.append(float(loss)) # already the *average* loss (reduction='mean')\n",
    "        train_acc.append(get_accuracy(pigeon)) # compute training accuracy\n",
    "\n",
    "    # plotting\n",
    "    plt.plot(iters, losses)\n",
    "    plt.title(\"Training Curve (batch_size={}, lr={})\".format(batch_size, lr))\n",
    "    plt.xlabel(\"Iterations\")\n",
    "    plt.ylabel(\"Loss\")\n",
    "    plt.show()\n",
    "    plt.plot(iters, train_acc)\n",
    "    plt.title(\"Training Curve (batch_size={}, lr={})\".format(batch_size, lr))\n",
    "    plt.xlabel(\"Iterations\")\n",
    "    plt.ylabel(\"Training Accuracy\")\n",
    "    plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's our best model so far:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_training_curve_with_acc(batch_size=32, lr=0.01)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training accuracy curve tells an interesting story. First, the accuracy improves\n",
    "fairly quickly. Then, the accuracy flattens as the loss improves. (This is possible\n",
    "because the loss looks at the continuous *probabilities* that the network produces, rather\n",
    "than the discrete *predictions*.) Finally, towards the end of the epoch, the training\n",
    "accuracy improves again.\n",
    "\n",
    "## Validation Accuracy\n",
    "\n",
    "Towards the end of last week, we discussed how the training accuracy (and, by extension,\n",
    "the training loss) is not a realistic estimate of how well the network will perform on\n",
    "new data. We run into this exact problem with our training curve. At each iteration of\n",
    "training, we know how the network will perform on the data it is trained on, but not on\n",
    "new data. It is difficult to tell whether the network learned to solve the problem at hand,\n",
    "or it simply memorized the training data.\n",
    "\n",
    "We proposed to set aside a **test set** at the end of training to determine how our model\n",
    "is doing. We could do something similar here: we use a set of data *not used for training*\n",
    "to track how well our training is going.\n",
    "\n",
    "However, we don't want to use our test set for this purpose. The reason is that we may\n",
    "make some decisions about our neural network architecture, training, and settings, using\n",
    "the training curve. We don't want to \"contaminate\" our test set by using it to make any\n",
    "decision -- otherwise the test set would no longer help us estimate how well our network\n",
    "will perform on unseen data.\n",
    "\n",
    "We will therefore use a separate dataset called the **validation set**, separate from both\n",
    "the *training set* and the *test set*, to see how well our model is doing in each iteration.\n",
    "The validation set will help us make decisions about the appropriate batch size, learning rate,\n",
    "and the values of other relevant settings. The split between training, validation, and test set \n",
    "is usually 60% training, 20% validation, 20% test. Other splits like 70/15/15, 80/10/10, 50/25/25\n",
    "are also reasonable, depending on how much data is available.\n",
    "\n",
    "We can plot the validation accuracy during training, like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get a PyTorch tensor of the entire validation set, for computing accuracy\n",
    "mnist_val = mnist_data[1024:2048] # our choice of validation set\n",
    "tmp_loader = torch.utils.data.DataLoader(mnist_val, batch_size=1024)\n",
    "all_val_imgs, all_val_labels = next(iter(tmp_loader))\n",
    "all_val_actual = (all_val_labels < 3).reshape([1024, 1]).type(torch.FloatTensor)\n",
    "\n",
    "def plot_training_curve_with_val(model, batch_size=1, lr=0.005, num_epochs=1):\n",
    "    \"\"\"\n",
    "    Plot the training curve on num_epochs of training\n",
    "    of the model trained using the first 1024 images in\n",
    "    the MNIST dataset.\n",
    "\n",
    "    Return the trained model.\n",
    "    \"\"\"\n",
    "    train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size)\n",
    "    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)\n",
    "\n",
    "    iters  = []\n",
    "    losses = []\n",
    "    train_acc = []\n",
    "    val_acc = []\n",
    "\n",
    "    # training\n",
    "    n = 0 # the number of iterations\n",
    "    for epoch in range(num_epochs):\n",
    "        for imgs, labels in iter(train_loader):\n",
    "            actual = (labels < 3).reshape([batch_size, 1]).type(torch.FloatTensor)\n",
    "            out = model(imgs)\n",
    "            loss = criterion(out, actual) # compute the average loss over the mini-batch\n",
    "            loss.backward()               # compute updates for each parameter\n",
    "            optimizer.step()              # make the updates for each parameter\n",
    "            optimizer.zero_grad()         # a clean up step for PyTorch\n",
    "\n",
    "            # save the current training information\n",
    "            iters.append(n)\n",
    "            losses.append(float(loss))            # already the *average* loss\n",
    "            train_acc.append(get_accuracy(model)) # compute training accuracy\n",
    "            val_acc.append(get_accuracy(model,    # compute validation accuracy\n",
    "                                        all_val_imgs,\n",
    "                                        all_val_actual))\n",
    "            # increment the iteration number\n",
    "            n += 1\n",
    "\n",
    "    # plotting\n",
    "    plt.title(\"Training Curve (batch_size={}, lr={})\".format(batch_size, lr))\n",
    "    plt.plot(iters, losses, label=\"Train\")\n",
    "    plt.xlabel(\"Iterations\")\n",
    "    plt.ylabel(\"Loss\")\n",
    "    plt.show()\n",
    "\n",
    "    plt.title(\"Training Curve (batch_size={}, lr={})\".format(batch_size, lr))\n",
    "    plt.plot(iters, train_acc, label=\"Train\")\n",
    "    plt.plot(iters, val_acc, label=\"Validation\")\n",
    "    plt.xlabel(\"Iterations\")\n",
    "    plt.ylabel(\"Accuracy\")\n",
    "    plt.legend(loc='best')\n",
    "    plt.show()\n",
    "\n",
    "    print(\"Final Training Accuracy: {}\".format(train_acc[-1]))\n",
    "    print(\"Final Validation Accuracy: {}\".format(val_acc[-1]))\n",
    "\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the training curve for `batch_size=32, lr=0.005`, for 15 epochs of training.\n",
    "Recall that since our batch size is 32, and our training set has size 1024, each\n",
    "epoch corresponds to `1024 / 32 = 32` iterations of training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pigeon = plot_training_curve_with_val(Pigeon(),\n",
    "                                      batch_size=32,\n",
    "                                      lr=0.005,\n",
    "                                      num_epochs=15)"
   ]
  },
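  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conversion between epochs and iterations used above can be sketched in plain Python.\n",
    "The helper names `iterations_per_epoch` and `epochs_to_iterations` are hypothetical.\n",
    "Rounding up assumes that a final, possibly smaller, mini-batch is kept, which is the\n",
    "`DataLoader` default (`drop_last=False`):\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def iterations_per_epoch(num_train, batch_size):\n",
    "    # the last mini-batch may be smaller than batch_size, so round up\n",
    "    return math.ceil(num_train / batch_size)\n",
    "\n",
    "def epochs_to_iterations(num_epochs, num_train, batch_size):\n",
    "    return num_epochs * iterations_per_epoch(num_train, batch_size)\n",
    "\n",
    "print(iterations_per_epoch(1024, 32))      # 32\n",
    "print(epochs_to_iterations(15, 1024, 32))  # 480\n",
    "```\n",
    "\n",
    "So the 15 epochs of training above correspond to 480 parameter update steps."
   ]
  },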
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyper-parameters\n",
    "\n",
    "There are many other neural network settings that affect how well our network\n",
    "will perform, including:\n",
    "\n",
    "- The size of the network:\n",
    "    - number of layers\n",
    "    - number of neurons in each layer\n",
    "- Choice of activation function\n",
    "- Other optimization settings\n",
    "\n",
    "These settings are called **hyper-parameters**. They are distinct from **parameters**\n",
    "in that they are often *discrete*, and cannot be optimized using an optimizer.\n",
    "\n",
    "There are many, many hyper-parameters. In theory, we have an infinite choice of neural\n",
    "network models to choose from. In practice, we begin by focusing on a few key hyper-parameters\n",
    "that we believe would make the most difference in our model. Then, when we have a model\n",
    "that we think is fairly good, we could run a **hyper-parameter search**: we search through\n",
    "the possible hyper-parameters, training a model for each potential hyper-parameter choice, and\n",
    "see which one performs the best.\n",
    "\n",
    "The way we choose hyper-parameters is using the corresponding trained model's validation\n",
    "accuracy. Again, we don't want to use the test accuracy, because we don't want to make\n",
    "any model decisions using test data.\n",
    "\n",
    "### Choosing Models\n",
    "\n",
    "Let's try tuning the **number of hidden neurons** in our `Pigeon` model.\n",
    "First, we're going to change our `Pigeon` class to accept a parameter called\n",
    "`num_hidden`, which we will use instead of 30."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Pigeon(nn.Module):\n",
    "    def __init__(self, num_hidden):\n",
    "        super(Pigeon, self).__init__()\n",
    "        self.layer1 = nn.Linear(28 * 28, num_hidden)\n",
    "        self.layer2 = nn.Linear(num_hidden, 1)\n",
    "    def forward(self, img):\n",
    "        flattened = img.view(-1, 28 * 28)\n",
    "        activation1 = self.layer1(flattened)\n",
    "        activation1 = F.relu(activation1)\n",
    "        activation2 = self.layer2(activation1)\n",
    "        return activation2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's try training a larger pigeon (with double the hidden units), and\n",
    "a smaller pigeon (with half the hidden units), and look at their training curves.\n",
    "\n",
    "Here's the training curve for the larger pigeon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "larger_pigeon = plot_training_curve_with_val(Pigeon(60),\n",
    "                                             batch_size=32,\n",
    "                                             lr=0.005,\n",
    "                                             num_epochs=15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the training curve for the smaller pigeon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "smaller_pigeon = plot_training_curve_with_val(Pigeon(15),\n",
    "                                              batch_size=32,\n",
    "                                              lr=0.005,\n",
    "                                              num_epochs=15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, our initial choice of 30 hidden units looks quite reasonable.\n",
    "\n",
    "## Checkpointing\n",
    "\n",
    "Normally, we will train our neural network for not just one epoch, but many.\n",
    "Neural network training typically takes a long time, sometimes days, weeks, or even months.\n",
    "Our training code should therefore be robust to interruptions. That is, we should write\n",
    "our training code so that we can save and re-load weights.\n",
    "\n",
    "It is good to **checkpoint** training progress by saving the neural network parameter values\n",
    "and training curve data to disk, once every few epochs. The frequency of checkpointing depends\n",
    "on many factors, but I recommend checkpointing every 10-30 minutes for large projects, and every\n",
    "few minutes for smaller ones.\n",
    "\n",
    "Another advantage of checkpointing is that we now have one extra hyper-parameter we can\n",
    "tune for free: the **epoch** number! You may not wish to choose neural network parameter values\n",
    "at the end of training, and might opt to choose the parameter values at a different epoch of\n",
    "training.\n",
    "\n",
    "One reason you might opt to do so is to prevent **over-fitting**. If your training loss \n",
    "is decreasing (as training progresses), but your validation loss stays the same, then your\n",
    "network is beginning to learn idiosyncrasies of the training set that do not generalize.\n",
    "Most often, we choose the *earliest* epoch with the *lowest validation loss or error*."
   ]
  }
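  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rule of choosing the earliest epoch with the lowest validation loss can be sketched in\n",
    "plain Python. This assumes, hypothetically, that one validation loss per epoch was recorded\n",
    "in a list called `val_losses`, and that a checkpoint was saved at each of those epochs.\n",
    "The helper name `best_epoch` and the numbers are made up for illustration:\n",
    "\n",
    "```python\n",
    "def best_epoch(val_losses):\n",
    "    # index of the earliest epoch that attains the lowest validation loss\n",
    "    # (min returns the first minimal element when there are ties)\n",
    "    return min(range(len(val_losses)), key=lambda i: val_losses[i])\n",
    "\n",
    "# made-up validation losses, one per epoch (not real results)\n",
    "val_losses = [0.70, 0.52, 0.41, 0.38, 0.38, 0.40, 0.45]\n",
    "print(best_epoch(val_losses))  # 3\n",
    "```\n",
    "\n",
    "We would then re-load the checkpoint saved at that epoch, rather than the parameter\n",
    "values from the end of training."
   ]
  }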
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}