{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Convolutional Neural Networks\n", "\n", "So far, we have built neural networks out of feed-forward, fully connected layers.\n", "While fully connected layers are useful, they also have undesirable properties.\n", "\n", "Specifically, fully connected layers require a lot of connections, and thus\n", "many more weights than our problem might need. Suppose\n", "we are trying to determine whether a greyscale image of size $200 \\times 200$ contains\n", "a cat. Our input layer would have $200 \\times 200 = 40000$ units:\n", "one for each pixel.\n", "A fully connected layer between the input and the first hidden layer with, say,\n", "500 hidden units will require at least $40000 \\times 500 =$ 20 million weights!\n", "\n", "The large number of weights means several things. First, computing\n", "predictions will take a long time. Second,\n", "our network will have very high capacity, and will be prone to overfitting.\n", "We will need a large number of training examples, or be very aggressive\n", "in preventing overfitting (we'll discuss some techniques in a later lecture).\n", "\n", "There are other undesirable properties as well.\n", "What happens if our input image is shifted by one pixel to the left?\n", "Since the content of the image is virtually unchanged, we would like\n", "our prediction to change very little as well.\n", "However, each pixel is now being multiplied by an entirely different\n", "neural network weight. We could get a completely different prediction.\n", "In short, a fully connected layer does not explicitly take the 2D geometry\n", "of the image into consideration.\n", "\n", "In this chapter, we will introduce the **convolutional neural network** to\n", "address many of these issues.\n", "\n", "## Locally Connected Layers\n", "\n", "In a 2-dimensional image, there is a notion of proximity. 
If you want to know\n", "what object is at a particular pixel, you can usually find out by looking at\n", "other pixels nearby. (At the very least, nearby pixels will be more informative than\n", "pixels further away.) In fact, many features that the human eye\n", "can easily detect are **local features**. For example, we can detect\n", "edges, textures, and even shapes using pixel intensities in a small region\n", "of an image.\n", "\n", "If we want a neural network to detect these kinds of local features, we can use\n", "a **locally connected layer**, like this:\n", "\n", " \n", "\n", "Each unit in the (first) hidden layer detects patterns in a *small portion* of the input\n", "image, rather than the *entire* image. This way, we have fewer connections (and therefore weights)\n", "between the input and the first hidden layer. Note that now, the hidden layers will also\n", "have a geometry to them, with the top-left corner of the hidden layers being computed from\n", "the top-left region of the original input image.\n", "\n", "There is actually evidence that the (biological) neural connectivity in\n", "an animal's visual cortex works similarly.\n", "That is, neurons in the visual cortex detect features that occur\n", "in a small region of the visual field (their receptive field).\n", "Neurons close to the retina detect simple patterns like edges.\n", "Neurons that receive information from these simple cells detect more complicated patterns,\n", "like textures and blobs.\n", "Neurons in even higher layers detect even more complicated patterns, like entire hands\n", "or faces.\n", "\n", " \n", "\n", "## Weight Sharing\n", "\n", "Besides restricting ourselves to only **local connections**,\n", "there is one other optimization we can make: if we want to detect a feature\n", "(say, a horizontal edge), we can use the *same* detector\n", "on the bottom-left corner of an image and\n", "on the top-right corner of the image.\n", "That is, if we know how to detect a local feature in one region of 
the image,\n", "then we know how to detect that feature in all other regions of the image.\n", "In neural networks, \"knowing how to detect a local feature\" means \n", "having the appropriate weights and biases connecting the input neurons to\n", "some hidden neuron.\n", "\n", "We can therefore **reuse** the same weights everywhere else in the image.\n", "This is the idea behind **weight sharing**: we will share\n", "the **same parameters** across different **locations** in an image.\n", "\n", " \n", " \n", "\n", "## Convolutional Arithmetic (forward pass computation)\n", "\n", "Let's look at the *forward pass computation* of a convolutional neural network layer.\n", "That is, let's pretend to be PyTorch and compute the output of a convolutional layer,\n", "given some input.\n", "\n", "The light blue grid (middle) is the *input* that we are given. You can imagine\n", "that this blue grid represents a 5 pixel by 5 pixel greyscale image.\n", "\n", "The grey grid (left) contains the *parameters* of this neural network layer.\n", "This grey grid is also known as a **convolutional kernel**, **convolutional filter**,\n", "or just **kernel** or **filter**. In this case, the **kernel size** or \n", "**filter size** is $3 \\times 3$.\n", "\n", "
\n", " \n", " \n", "
\n", "\n", "To compute the output, we superimpose the kernel on a region of the image.\n", "Let's start at the top left, in the dark blue region. The small numbers in the\n", "bottom right corner of each grid element correspond to the numbers in the kernel.\n", "To compute the output at the corresponding location (top left), we \"dot\" the\n", "pixel intensities in the square region with the kernel. That is, we perform\n", "the computation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(3 * 0 + 3 * 1 + 2 * 2) + (0 * 2 + 0 * 2 + 1 * 0) + (3 * 0 + 1 * 1 + 2 * 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The green grid (right) contains the *output* of this convolution layer.\n", "This output is also called an **output feature map**. The terms **feature**\n", "and **activation** are interchangeable. The output value on the top left\n", "of the green grid is consistent with the value we obtained by hand in Python.\n", "\n", "To compute the next activation value (say, one to the right of the previous output),\n", "we will shift the superimposed kernel over by one pixel:\n", "\n", " \n", "\n", "The dark blue region is moved to the right by one pixel. 
We again dot\n", "the pixel intensities in this region with the kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(3 * 0 + 2 * 1 + 1 * 2) + (0 * 2 + 1 * 2 + 3 * 0) + (1 * 0 + 2 * 1 + 2 * 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the pattern continues...\n", "\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(2 * 0 + 1 * 1 + 0 * 2) + (1 * 2 + 3 * 2 + 1 * 0) + (2 * 0 + 2 * 1 + 3 * 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One part of the computation that is missing in this picture is the\n", "addition of a **bias** term, which we will discuss later.\n", "\n", "## Filters in Computer Vision\n", "\n", "Are filters actually useful? Can we use filters to detect a variety of useful features?\n", "The answer is yes!\n", "In fact, people have used the idea of convolutional filters in computer vision even before\n", "the rise of machine learning. They hand-coded filters that can detect simple features,\n", "for example to detect edges in various orientations:\n", "\n", " \n", " \n", "\n", "## Convolutions with Multiple Input/Output Channels\n", "\n", "There are two more things we might want out of the convolution operation.\n", "\n", "First, what if we were working with a colour image instead of a greyscale image?\n", "In that case, the kernel will be a **3-dimensional tensor** instead of a 2-dimensional one.\n", "This kernel will move through the input features just like before, and we \"dot\" the \n", "pixel intensities with the kernel at each region, exactly like before:\n", "\n", " \n", "\n", "In case of an RGB image, the size of the new dimension (aside from the width/height of the image)\n", "is 3. 
This \"size of the 3rd dimension\"\n", "is called the **number of input channels** or **number of input feature maps**.\n", "In the above image example, the number of input channels is 3, and we have a $3 \\times 3 \\times 3$ kernel.\n", "\n", "Second, what if we want to detect multiple features? For example, we may wish to\n", "detect both horizontal edges and vertical edges, or any other learned features.\n", "We would want to learn **many** convolutional filters on the same input. That is,\n", "we would want to make the same computation above using different kernels, like this:\n", "\n", " \n", "\n", "Each circle on the right of the image represents the output of a different kernel dotted\n", "with the highlighted region of the input. So, the output feature is also a 3-dimensional tensor.\n", "The size of the new dimension\n", "is called the **number of output channels** or **number of output feature maps**.\n", "In the picture above, there are 5 output channels.\n", "\n", "## The Convolutional Layers in PyTorch\n", "\n", "Finally, let's create convolutional layers in PyTorch!\n", "\n", "Recall that in PyTorch, we can create a fully connected layer\n", "between successive layers like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import torch.nn as nn\n", "\n", "# fully connected layer between a lower layer of size 100, and\n", "# a higher layer of size 30\n", "fc = nn.Linear(100, 30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've applied a layer like this as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "x = torch.randn(100) # create a tensor of shape [100]\n", "y = fc(x) # apply the fully connected layer fc to x\n", "y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be examining the .shape attribute a lot today to help\n", "us ensure that we truly understand 
what is going on. If we understood,\n", "for example, what the layer fc does, we should have been able to predict\n", "the shape of y before running the above code.\n", "\n", "In PyTorch, we can create a convolutional layer using nn.Conv2d:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conv = nn.Conv2d(in_channels=3, # number of input channels\n", " out_channels=7, # number of output channels\n", " kernel_size=5) # size of the kernel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The conv layer expects as input a tensor in the format \"NCHW\", meaning that\n", "the dimensions of the tensor should follow the order:\n", "\n", "* batch size\n", "* channel\n", "* height\n", "* width\n", "\n", "For example, we can emulate a batch of 32 colour images, each of size 128x128, like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = torch.randn(32, 3, 128, 128)\n", "y = conv(x)\n", "y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output tensor is also in the \"NCHW\" format. We still have 32 images, and 7 channels\n", "(consistent with out_channels of conv), and of size 124x124. If we added the appropriate\n", "padding to conv, namely padding = kernel_size // 2, then our output width and height should\n", "be consistent with the input width and height:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conv2 = nn.Conv2d(in_channels=3,\n", " out_channels=7,\n", " kernel_size=5,\n", " padding=2)\n", "\n", "x = torch.randn(32, 3, 128, 128)\n", "y = conv2(x)\n", "y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To further illustrate the formatting, let's apply the (random, untrained) convolution conv2 to\n", "a real image. 
First, we load the image:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "img = plt.imread(\"imgs/dog_mochi.png\")[:, :, :3]\n", "plt.imshow(img)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we convert the image into a PyTorch tensor of the appropriate shape." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = torch.from_numpy(img) # turn img into a PyTorch tensor\n", "print(x.shape)\n", "x = x.permute(2,0,1) # move the channel dimension to the beginning\n", "print(x.shape)\n", "x = x.reshape([1, 3, 350, 210]) # add a dimension for batching\n", "print(x.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even when our batch size is 1, we still need the first dimension so that the input\n", "follows the \"NCHW\" format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = conv2(x) # apply the convolution\n", "y = y.detach().numpy() # convert the result into numpy\n", "y = y[0] # remove the dimension for batching\n", "\n", "# normalize the result to [0, 1] for plotting\n", "y_max = np.max(y)\n", "y_min = np.min(y)\n", "img_after_conv = (y - y_min) / (y_max - y_min)\n", "img_after_conv.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the 7 channels one by one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(14,4))\n", "for i in range(7):\n", "    plt.subplot(1, 7, i+1)\n", "    plt.imshow(img_after_conv[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we were to run a neural network, these would be the unit outputs (prior to applying the activation function).\n", "\n", "## Parameters of a Convolutional Layer\n", "\n", "Recall that the trainable parameters of a fully connected\n", "layer include the network 
weights and biases. There is one weight\n", "for each connection, and one bias for each output unit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fc = nn.Linear(100, 30)\n", "fc_params = list(fc.parameters())  # [weights, biases]\n", "print(\"len(fc_params)\", len(fc_params))\n", "print(\"Weights:\", fc_params[0].shape)\n", "print(\"Biases:\", fc_params[1].shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a convolutional layer, the trainable parameters include\n", "the **convolutional kernels** (filters) and also a set of **biases**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conv2 = nn.Conv2d(in_channels=3,\n", "                  out_channels=7,\n", "                  kernel_size=5,\n", "                  padding=2)\n", "conv_params = list(conv2.parameters())  # [filters, biases]\n", "print(\"len(conv_params):\", len(conv_params))\n", "print(\"Filters:\", conv_params[0].shape)\n", "print(\"Biases:\", conv_params[1].shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is one bias for each output channel. Each bias is added to *every* element\n", "in that output channel. 
Note that the bias computation was not shown in the above\n", "figures, and is often omitted in other texts describing convolution arithmetic.\n", "Nevertheless, the biases are there.\n", "\n", "## Pooling Layers\n", "\n", "A pooling layer can be created like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pool = nn.MaxPool2d(kernel_size=2, stride=2)\n", "y = conv2(x)\n", "z = pool(y)\n", "z.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usually, the kernel size and the stride length will be equal.\n", "\n", "The pooling layer has no trainable parameters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(pool.parameters())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Convolutional Networks in PyTorch\n", "\n", "In assignment 2, we created the following network. We can understand this network now!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch.nn.functional as F  # provides F.relu used in forward()\n", "\n", "class LargeNet(nn.Module):\n", "    def __init__(self):\n", "        super(LargeNet, self).__init__()\n", "        self.name = \"large\"\n", "        self.conv1 = nn.Conv2d(3, 5, 5)\n", "        self.pool = nn.MaxPool2d(2, 2)\n", "        self.conv2 = nn.Conv2d(5, 10, 5)\n", "        self.fc1 = nn.Linear(10 * 5 * 5, 32)\n", "        self.fc2 = nn.Linear(32, 1)\n", "\n", "    def forward(self, x):\n", "        x = self.pool(F.relu(self.conv1(x)))\n", "        x = self.pool(F.relu(self.conv2(x)))\n", "        x = x.view(-1, 10 * 5 * 5)\n", "        x = F.relu(self.fc1(x))\n", "        x = self.fc2(x)\n", "        x = x.squeeze(1) # Flatten to [batch_size]\n", "        return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This network has **two** convolutional layers: conv1 and conv2.\n", "\n", "- The first convolutional layer conv1 requires an input with 3 channels,\n", "  outputs 5 channels, and has a kernel size of 5x5. 
We are not adding any zero-padding.\n", "- The second convolutional layer conv2 requires an input with 5 channels,\n", "  outputs 10 channels, and has a kernel size of (again) 5x5. We are not adding any zero-padding.\n", "\n", "In the forward function we see that the convolution operations are always\n", "followed by the usual ReLU activation function, and a pooling operation.\n", "The pooling operation used is max pooling with a kernel size and stride of 2, so each\n", "pooling operation halves the width and height of the feature maps.\n", "\n", "Because we are not adding any zero padding, a $32 \\times 32$ input leaves us with 10 * 5 * 5 hidden units\n", "after the second convolution and pooling step. These units are then passed to two fully connected\n", "layers, with the usual ReLU activation in between.\n", "\n", "Notice that the number of channels **grew** in later convolutional layers! However,\n", "the number of hidden units in each layer is still reduced because of the pooling operation:\n", "\n", "* Initial Image Size: $3 \\times 32 \\times 32 = 3072$\n", "* After conv1: $5 \\times 28 \\times 28$\n", "* After Pooling: $5 \\times 14 \\times 14 = 980$\n", "* After conv2: $10 \\times 10 \\times 10$\n", "* After Pooling: $10 \\times 5 \\times 5 = 250$\n", "* After fc1: $32$\n", "* After fc2: $1$\n", "\n", "This pattern of **doubling the number of channels with every pooling / strided convolution**\n", "is common in modern convolutional architectures. It is used to avoid losing too much information in\n", "a single reduction in resolution.\n", "\n", "## AlexNet in PyTorch\n", "\n", "Convolutional networks are very commonly used, meaning that there are often alternatives to\n", "training convolutional networks from scratch. 
In particular, researchers often release both\n", "the architecture and **the weights** of the networks they train.\n", "\n", "As an example, let's look at the AlexNet model, whose trained weights are included in torchvision.\n", "AlexNet was trained to classify images into one of many categories.\n", "AlexNet can be imported as shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torchvision.models\n", "\n", "alexNet = torchvision.models.alexnet(pretrained=True)\n", "alexNet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the AlexNet model is split into two parts. There is a component that computes\n", "\"features\" using convolutions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alexNet.features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a component that classifies the image based on the computed features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alexNet.classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AlexNet Features\n", "\n", "The first network can be used independently of the second. Specifically, it can be used\n", "to compute a set of **features** that can be used later on. This idea of using neural\n", "network activation *features* to represent images is an extremely important one, so it\n", "is important to understand the idea now.\n", "\n", "If we take our image x from earlier and pass it through the alexNet.features network,\n", "we get some numbers like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = alexNet.features(x)\n", "features.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The set of numbers in features is another way of representing our image x. 
Recall that\n", "our initial image x was also represented as a tensor, also a set of numbers representing\n", "pixel intensity. Geometrically speaking, we are using points in a high-dimensional space to\n", "represent the images. In our pixel representation, the axes in this high-dimensional space\n", "were different pixels. In our features representation, the axes are not as easily\n", "interpretable.\n", "\n", "But we will want to work with the features representation, because this representation\n", "makes classification easier. This representation organizes images in a more \"useful\" and\n", "\"semantic\" way than pixels.\n", "\n", "Let me be more specific:\n", "this set of features was trained on image classification. It turns out that\n", "**these features can be useful for performing other image-related tasks as well!**\n", "That is, if we want to perform an image classification task of our own (for example,\n", "classifying cancer biopsies, which is nothing like what AlexNet was trained to do),\n", "we might compute these AlexNet features, and then train a small model on top of those\n", "features. We replace the classifier portion of AlexNet, but keep its features\n", "portion intact.\n", "\n", "Somehow, through being trained on one type of image classification problem, AlexNet\n", "learned something general about representing images for the purposes of other\n", "classification tasks.\n", "\n", "### AlexNet First Convolutions\n", "\n", "Since we have a trained model, we might as well visualize outputs of a trained convolution,\n", "to contrast with the untrained convolution we visualized earlier.\n", "\n", "Here is the first convolution of AlexNet, applied to our image of Mochi." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alexNetConv = alexNet.features[0]  # first layer of the features network\n", "y = alexNetConv(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is a $1 \\times 64 \\times 86 \\times 51$ tensor."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = y.detach().numpy()\n", "y = (y - y.min()) / (y.max() - y.min())\n", "y.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can visualize each channel independently." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10,10))\n", "for i in range(64):\n", " plt.subplot(8, 8, i+1)\n", " plt.imshow(y[0, i])" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }