{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Neural Network Terminology\n",
    "\n",
    "Last week, we introduced a lot of neural network terminology. Let's \n",
    "review some of these terms, and add some more useful terms to our vocabulary.\n",
    "\n",
    "Most of the figures here are from\n",
    "http://cs231n.github.io/neural-networks-1/\n",
    "\n",
    "## The Artificial Neurons\n",
    "\n",
    "Neurons pass information from one to another using action potentials.\n",
    "They connect with one another at **synapses**, which are junctions\n",
    "between one neuron's axon and another's dendrite.\n",
    "Information flows from:\n",
    "\n",
    "- the **dendrite**\n",
    "- to the **cell body**\n",
    "- through the **axon**\n",
    "- to a **synapse** connecting the axon to the dendrite of the next neuron.\n",
    "\n",
    "<img src=\"imgs/neuron.png\" width=400px />\n",
    "\n",
    "The biological neuron is very complicated, and has many biochemical\n",
    "elements that are difficult to simulate. Instead, we use a simplified\n",
    "model of the neuron that only models the information flow.\n",
    "\n",
    "<img src=\"imgs/neuron_model.jpeg\" width=400px />\n",
    "\n",
    "A neuron receives the activation values $x_i$ from neurons\n",
    "connected to its dendrites. The neuron needs to combine those activations.\n",
    "However, the connections between neurons can be weak or strong.\n",
    "The strength is represented by $w_i$, and the amount of information\n",
    "\"received\" by the neuron is the product $w_i x_i$.\n",
    "The total contributions are\n",
    "combined together, along with a bias $b$, forming the sum $b + \\sum_i w_i x_i$.\n",
    "This value\n",
    "is then passed to a non-linear activation function $f$ to produce\n",
    "the output $f(b+\\sum_i w_i x_i)$. This is the activation of our neuron,\n",
    "and is the value passed to other neurons connected to its axon.\n",
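    "\n",
    "As a small worked example with made-up numbers (chosen only for illustration),\n",
    "take two inputs $x = (1, 2)$, weights $w = (0.5, -1)$, and bias $b = 0.25$:\n",
    "\n",
    "$$b + \\sum_i w_i x_i = 0.25 + (0.5)(1) + (-1)(2) = -1.25$$\n",
    "\n",
    "The neuron's activation would then be $f(-1.25)$, whatever $f$ we choose.\n",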
    "\n",
    "## The Activation Function\n",
    "\n",
    "We glossed over the idea of the **activation function** last time. The biological\n",
    "neuron's output is certainly *not* a linear combination of its inputs, so \n",
    "a nonlinear activation function is well motivated.\n",
    "More importantly, if we do not have any kind of non-linearity, then we will only be able\n",
    "to learn linear relationships between input and output: a composition of linear\n",
    "functions is still a linear function. Nonlinear activation functions are a requirement\n",
    "if we want to perform interesting nonlinear tasks.\n",
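    "\n",
    "To see why, here is a sketch for two layers with no activation function.\n",
    "The symbols $W_1, b_1, W_2, b_2$ (weight matrices and bias vectors of the\n",
    "two layers) are notation assumed for this illustration:\n",
    "\n",
    "$$W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$$\n",
    "\n",
    "The right-hand side is a single linear layer with weights $W_2 W_1$ and\n",
    "bias $W_2 b_1 + b_2$, so stacking layers without a non-linearity adds no expressive power.\n",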
    "\n",
    "So, what non-linear activation functions do we choose?\n",
    "\n",
    "Recall our code from last time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "from torchvision import datasets, transforms\n",
    "import matplotlib.pyplot as plt # for plotting\n",
    "\n",
    "torch.manual_seed(1) # set the random seed\n",
    "\n",
    "class Pigeon(nn.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.layer1 = nn.Linear(28 * 28, 30) # 784 input pixels -> 30 hidden units\n",
    "        self.layer2 = nn.Linear(30, 1)       # 30 hidden units -> 1 output\n",
    "    def forward(self, img):\n",
    "        flattened = img.view(-1, 28 * 28)    # flatten each image into a row of 784 values\n",
    "        activation1 = self.layer1(flattened)\n",
    "        activation1 = F.relu(activation1)    # non-linearity for the hidden layer\n",
    "        activation2 = self.layer2(activation1)\n",
    "        return activation2                   # raw output; sigmoid is applied by the caller\n",
    "\n",
    "pigeon = Pigeon()\n",
    "\n",
    "# load the data\n",
    "mnist_train = datasets.MNIST('data', train=True, download=True)\n",
    "mnist_train = list(mnist_train)[:2000]\n",
    "img_to_tensor = transforms.ToTensor()\n",
    "\n",
    "# make predictions for the first 10 images in mnist_train\n",
    "for image, label in mnist_train[:10]:\n",
    "    print(torch.sigmoid(pigeon(img_to_tensor(image))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model `pigeon` is actually a two-layer neural network.\n",
    "Activations in each layer are stored together:\n",
    "for example, `activation1` in the `Pigeon.forward` method is a\n",
    "PyTorch tensor holding `30` values per input image (shape `[1, 30]` for a single image).\n",
    "We used two different activation functions for the two layers:\n",
    "We used the **rectifier function** (`F.relu`) for the outputs of\n",
    "the first layer, and the **sigmoid function** (`torch.sigmoid`)\n",
    "for the output (yes, singular) of the second layer.\n",
    "\n",
    "### Rectifier function\n",
    "\n",
    "A rectifier (or a linear rectifier) looks like this:\n",
    "\n",
    "<img src=\"imgs/relu.jpeg\" width=400px />\n",
    "\n",
    "The function is linear (it returns its input unchanged) when its input is above zero, and is equal to zero otherwise.\n",
    "An artificial neuron unit that uses the rectifier function as its non-linearity\n",
    "is called a **rectified linear unit (ReLU)**.\n",
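    "\n",
    "Written as a formula (using $z$, a symbol assumed here, for the value passed in):\n",
    "\n",
    "$$\\mathrm{ReLU}(z) = \\max(0, z)$$\n",
    "\n",
    "For example, $\\mathrm{ReLU}(-1.25) = 0$ and $\\mathrm{ReLU}(2) = 2$.\n",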
    "\n",
    "Most machine learning practitioners nowadays (early 2019) use ReLUs for\n",
    "intermediate layers of a neural network. The mathematics\n",
    "of the ReLU is extremely simple, so networks with ReLUs are\n",
    "easier to optimize than those using *sigmoid activation*.\n",
    "\n",
    "### Sigmoid function\n",
    "\n",
    "A sigmoid function, denoted $\\sigma$, looks like this:\n",
    "\n",
    "<img src=\"imgs/sigmoid.jpeg\" width=400px />\n",
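    "\n",
    "The curve above is presumably the standard logistic sigmoid, which is also\n",
    "what `torch.sigmoid` computes. With $z$ (a symbol assumed here) as the input:\n",
    "\n",
    "$$\\sigma(z) = \\frac{1}{1 + e^{-z}}$$\n",
    "\n",
    "For example, $\\sigma(0) = 0.5$; large positive $z$ gives values near 1,\n",
    "and large negative $z$ gives values near 0.\n",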
    "\n",
    "The sigmoid function has a tilted \"S\" shape, and \n",
    "its output is always between 0 and 1. \n",
    "In fact, outputs of sigmoid functions are interpretable as probabilities! \n",
    "Practitioners often use sigmoid functions to turn a real number output\n",
    "into a probability. This is exactly what we have done above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "out = pigeon(img_to_tensor(image)) # `out` is an arbitrary floating-point number\n",
    "prob = torch.sigmoid(out)          # `prob` is between 0 and 1\n",
    "print(\"Output:\", out, \"Probability:\", prob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tanh function\n",
    "\n",
    "We haven't used the tanh activation function yet in this course.\n",
    "The tanh function is a variation of the sigmoid function. The output\n",
    "of the tanh function is always between -1 and 1 (instead of 0 and 1).\n",
    "\n",
    "<img src=\"imgs/tanh.jpeg\" width=400px />\n",
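    "\n",
    "One way to see the relationship, assuming the logistic sigmoid\n",
    "$\\sigma(z) = 1 / (1 + e^{-z})$, is the identity:\n",
    "\n",
    "$$\\tanh(z) = \\frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\\sigma(2z) - 1$$\n",
    "\n",
    "So tanh is a sigmoid that has been stretched and shifted to cover $(-1, 1)$.\n",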
    "\n",
    "We probably won't use the tanh activation function in this course,\n",
    "but it is an alternative to the ReLU activation.\n",
    "\n",
    "## Parameters\n",
    "\n",
    "The **parameters** of a network are the numbers that can be tuned\n",
    "to train the network. The parameters include the **weights**\n",
    "and **biases**. \n",
    "We can count the number of parameters in our model `pigeon`:\n",
    "\n",
    "- **In the first layer**, there are `28*28` input neurons, and `30` hidden neurons.\n",
    "  This means that there are `28*28*30` weights (one for each input-hidden neuron pair),\n",
    "  and `30` biases.\n",
    "- **In the second layer**, there are `30` hidden neurons, and `1` output neuron.\n",
    "  This means that there are `30*1` weights, and `1` bias.\n",
    "\n",
    "So the total number of parameters of the model `pigeon` is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(28 * 28 * 30) + 30 + (30 * 1) + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We often use the term **weights** synonymously with the term *parameters*,\n",
    "to denote both the weights and the biases of a neural network.\n",
    "This is because a bias can be thought of as a **weight** from a neuron that always\n",
    "outputs an activation of 1. So by introducing a variable $x_0 = 1$ and setting $w_0 = b$, we can rewrite\n",
    "$f(b + \\sum_{i=1}^{M} w_i x_i)$ as $f(\\sum_{i=0}^{M} w_i x_i)$.\n",
    "\n",
    "A larger neural network typically has higher **capacity**, meaning that it is\n",
    "capable of solving more complex problems. However, larger capacity neural networks\n",
    "take more resources to train. Not only do we need the memory to store the additional\n",
    "weights, we also need more time and processing power to compute updates for those\n",
    "weights during training.\n",
    "\n",
    "## Neural Network Architecture\n",
    "\n",
    "An **architecture** describes the neurons and their connectivity in the network.\n",
    "The connectivity of biological neurons is highly complex, with each neuron\n",
    "connected to tens of thousands of other neurons! Artificial neural networks are\n",
    "not as well-connected. Moreover, the connection patterns of artificial neurons\n",
    "tend to be simpler. Thus far, we have introduced a **feed-forward, fully-connected\n",
    "network**, also known as a *multi-layer perceptron*.\n",
    "\n",
    "Normally, when we count the **number of layers** in a neural network,\n",
    "we do not count the *input layer*. This way, the number of layers equals\n",
    "the number of sets of weights and biases.\n",
    "\n",
    "For example, this is a two-layer neural network:\n",
    "\n",
    "<img src=\"imgs/neural_net.jpeg\" width=400px />\n",
    "\n",
    "And this is a three-layer neural network:\n",
    "\n",
    "<img src=\"imgs/neural_net2.jpeg\" width=400px />\n",
    "\n",
    "## Training\n",
    "\n",
    "Recall that we **train** a neural network to adjust its parameters.\n",
    "\n",
    "A **loss function** $L(\\text{actual}, \\text{predicted})$\n",
    "computes how \"bad\" a prediction was, compared to the\n",
    "ground truth value for the input.\n",
    "A large loss means that the network's prediction differs substantially from the ground truth,\n",
    "whereas a small loss means that the network's prediction is similar to the\n",
    "ground truth.\n",
    "\n",
    "As mentioned last time, the loss function transforms a problem of finding\n",
    "good weights to perform a task, into an **optimization problem**: finding\n",
    "the weights that minimize the loss function (or the average value\n",
    "of the loss function across some training data). In each iteration,\n",
    "we are taking one step towards solving the optimization problem:\n",
    "\n",
    "$$\\min_{\\text{weights}} L(\\text{actual}, \\text{predicted})$$\n",
    "\n",
    "where the prediction depends on the weights.\n",
    "\n",
    "The transformation of \n",
    "learning problems into optimization problems will be a recurring theme\n",
    "as you study more machine learning models.\n",
    "\n",
    "An **optimizer** determines, based on the **loss function**,\n",
    "how -- and how much -- each parameter should change. \n",
    "The optimizer solves the **credit assignment problem**: how do we assign\n",
    "credit (blame) to the parameters when the network performs poorly?\n",
    "\n",
    "The solution to the credit assignment problem is **gradient descent**, \n",
    "which we will not talk about in this course. You should know that\n",
    "gradient descent uses the **gradient** of the loss function to\n",
    "compute changes to each parameter. This places restrictions\n",
    "on the kind of loss functions and activation functions we can use.\n",
    "We need to be able to take gradients of the loss function and the\n",
    "activation function with respect to the parameters.\n",
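    "\n",
    "As a rough sketch, the basic gradient descent update for a single weight $w$ is\n",
    "shown below. The step size $\\alpha$ (often called the learning rate) is a symbol\n",
    "assumed here, and practical optimizers often refine this rule:\n",
    "\n",
    "$$w \\leftarrow w - \\alpha \\frac{\\partial L}{\\partial w}$$\n",
    "\n",
    "If the gradient cannot be computed, this update is undefined, which is why\n",
    "the loss and activation functions need to be differentiable (at least almost everywhere).\n",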
    "\n",
    "We'll have a more thorough discussion about neural network training next time.\n",
    "\n",
    "## Datasets\n",
    "\n",
    "A set of **labelled data** is data whose desired predictions are known.\n",
    "Recall that last time, we used a portion of our labelled data for training,\n",
    "and a different portion of our data for testing. In general, these two portions\n",
    "of data are called the **training set** and the **test set**.\n",
    "\n",
    "We use the **training set** to train the network: to compare our\n",
    "network's predictions against the ground truth, and make adjustments\n",
    "to the weights of the network.\n",
    "\n",
    "We use the **test set** to get a more accurate assessment of how well\n",
    "our network might do on new data that it has never seen before. If a network\n",
    "performs well on the training set, but poorly on the test set, then the\n",
    "network has **overfit** to the training set.\n",
    "\n",
    "For standard data sets like MNIST, there are standard train/test splits\n",
    "that researchers and practitioners share. Although we have been using\n",
    "1000 images from `mnist_train` as our \"test set\", there is a standard MNIST\n",
    "test set that we can access like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mnist_test = datasets.MNIST('data', train=False, download=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reason people try to use the same train/test split is that different\n",
    "splits can sometimes have some impact on network performance. (Some test sets\n",
    "might be \"easier\" than others.)"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}