{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Convolutional Neural Networks \n",
"\n",
"So far, we have used feed-forward neural networks with fully connected layers.\n",
"While fully connected layers are useful, they are not always what we want.\n",
"\n",
"Specifically, fully connected layers require a lot of connections. Suppose\n",
"we are trying to determine whether an image of size $200 \\times 200$ contains\n",
"a cat. Our input layer would need to have $200 \\times 200 = 40000$ units,\n",
"one for each pixel.\n",
"A fully connected layer between the input and the first hidden layer with, say,\n",
" 500 hidden units will require a whopping $40000 \\times 500 =$ 20 million connections!\n",
"\n",
"The large number of connections means several things. For one, computing\n",
"predictions will require a lot of processing time. For another, we will\n",
"have a large number of *weights*, making our network very high capacity.\n",
"The high capacity of the network means that we will need a large number of training\n",
"examples to avoid overfitting.\n",
"\n",
"There is also one other issue. What happens if our image is shifted a little,\n",
"say one pixel to the left? Even though the change is minor from our point of view,\n",
"the intensity at each pixel location could change drastically. We can\n",
"get a completely different prediction from our network.\n",
"\n",
"Since there are many well-written resources on convolutional neural networks,\n",
"these notes will be terser than usual.\n",
"\n",
"## Locally Connected Layers\n",
"\n",
"What we will do is look for **local** features. For example, features like\n",
"edges, textures, and other patterns depend only on pixel intensities in a small\n",
"region. Here is an example of a **locally connected layer**:\n",
"\n",
"\n",
"\n",
"Each unit in the (first) hidden layer detects patterns in a small portion of the input\n",
"image. This way, we have fewer connections (and therefore weights) between\n",
"the input and the first hidden layer.\n",
"\n",
"There is biological evidence that the (biological) neural connectivity in the visual\n",
"cortex works the same way, where neurons detect features that occur in a small region\n",
"of our receptive field. Neurons close to the retina detect simple patterns like\n",
"edges. Neurons that receive information from these simple cells detect more complicated\n",
"patterns. Neurons in even higher layers detect even more complicated patterns.\n",
"\n",
"\n",
"\n",
"## Weight Sharing\n",
"\n",
"If we know how to detect a local feature in one region of the image -- say, an edge in a certain orientation --\n",
"then we know how to detect that feature in other regions of the image.\n",
"This is the idea behind **weight sharing**: we will share the **same parameters** across different **locations**\n",
"in an image.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"## Filters in Computer Vision\n",
"\n",
"People have used the idea of convolutional filters in computer vision even before\n",
"the rise of machine learning. They hand-coded filters that can detect simple features,\n",
"for example edges in various orientations:\n",
"\n",
"\n",
"\n",
"\n",
"## The Convolutional Layer\n",
"\n",
"Recall that in PyTorch, we can create a fully-connected layer\n",
"between successive layers like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch.nn as nn\n",
"\n",
"# fully-connected layer between a lower layer of size 100, and \n",
"# a higher layer of size 30\n",
"fc = nn.Linear(100, 30)"
]
},
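{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check the connection-count arithmetic from the introduction directly. The helper below is a quick sketch (not part of any library) that counts the trainable parameters of a layer:\n",
"\n",
"```python\n",
"import torch.nn as nn\n",
"\n",
"def count_params(layer):\n",
"    # total number of trainable parameters (weights + biases)\n",
"    return sum(p.numel() for p in layer.parameters())\n",
"\n",
"count_params(nn.Linear(100, 30))     # 100*30 weights + 30 biases = 3030\n",
"count_params(nn.Linear(40000, 500))  # 40000*500 + 500 = 20,000,500\n",
"```"
]
},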
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've applied a layer like this as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"x = torch.randn(100) # create a tensor of shape [100]\n",
"y = fc(x) # apply the fully conected layer `fc` to x\n",
"y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In PyTorch, we can create a convolutional layer using `nn.Conv2d`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conv = nn.Conv2d(in_channels=3, # number of channels in the input (lower layer)\n",
" out_channels=7, # number of channels in the output (next layer)\n",
" kernel_size=5) # size of the kernel or receiptive field"
]
},
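{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before looking at the input format, we can sanity-check the weight-sharing idea from earlier. Because a convolution applies the same kernel at every location, shifting the input simply shifts the output (away from the borders). A small sketch with a one-channel convolution:\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"conv1ch = nn.Conv2d(1, 1, kernel_size=3, bias=False)\n",
"x = torch.randn(1, 1, 8, 8)                  # \"NCHW\" format\n",
"x_shifted = torch.roll(x, shifts=1, dims=3)  # shift one pixel to the right\n",
"\n",
"y = conv1ch(x)\n",
"y_shifted = conv1ch(x_shifted)\n",
"# interior columns of the output shift by the same amount\n",
"torch.allclose(y[..., 1:5], y_shifted[..., 2:6], atol=1e-6)\n",
"```"
]
},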
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `conv` layer expects as input a tensor in the format \"NCHW\", meaning that\n",
"the dimensions of the tensor should follow the order:\n",
"\n",
"* batch size\n",
"* channel\n",
"* height\n",
"* width\n",
"\n",
"For example, we can emulate a batch of 32 colour images, each of size 128x128, like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = torch.randn(32, 3, 128, 128)\n",
"y = conv(x)\n",
"y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output tensor is also in the \"NCHW\" format. We still have 32 images, and 7 channels\n",
"(consistent with `out_channels` of `conv`), and of size 124x124. If we added the appropriate\n",
"padding to `conv`, namely `padding = kernel_size // 2`, then our output width and height should\n",
"be consistent with the input width and height:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conv2 = nn.Conv2d(in_channels=3,\n",
" out_channels=7,\n",
" kernel_size=5,\n",
" padding=2)\n",
"\n",
"x = torch.randn(32, 3, 128, 128)\n",
"y = conv2(x)\n",
"y.shape"
]
},
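{
"cell_type": "markdown",
"metadata": {},
"source": [
"The spatial sizes here follow the standard convolution output-size formula: with input size $n$, kernel size $k$, padding $p$, and stride $s$, the output size is $\\lfloor (n + 2p - k)/s \\rfloor + 1$. A quick helper (an illustration, not a PyTorch function) confirms the shapes above:\n",
"\n",
"```python\n",
"def conv_output_size(n, k, p=0, s=1):\n",
"    # standard convolution output-size formula\n",
"    return (n + 2 * p - k) // s + 1\n",
"\n",
"conv_output_size(128, 5)       # 124, matching conv\n",
"conv_output_size(128, 5, p=2)  # 128, matching conv2\n",
"```"
]
},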
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To further illustrate the formatting, let's apply the (random, untrained) convolution `conv2` to\n",
"a real image. First, we load the image:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"img = plt.imread(\"imgs/dog_mochi.png\")[:, :, :3]\n",
"plt.imshow(img)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we convert the image into a PyTorch tensor of the appropriate shape."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = torch.from_numpy(img) # turn img into a PyTorch tensor\n",
"print(x.shape)\n",
"x = x.permute(2,0,1) # move the channel dimension to the beginning\n",
"print(x.shape)\n",
"x = x.reshape([1, 3, 350, 210]) # add a dimension for batching\n",
"print(x.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even when our batch size is 1, we still need the first dimension so that the input\n",
"follows the \"NCHW\" format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y = conv2(x) # apply the convolution\n",
"y = y.detach().numpy() # convert the result into numpy\n",
"y = y[0] # remove the dimension for batching\n",
"\n",
"# normalize the result to [0, 1] for plotting\n",
"y_max = np.max(y)\n",
"y_min = np.min(y)\n",
"img_after_conv = y - y_min / (y_max - y_min)\n",
"img_after_conv.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot the 7 channels one by one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(14,4))\n",
"for i in range(7):\n",
" plt.subplot(1, 7, i+1)\n",
" plt.imshow(img_after_conv[i])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we were to run a neural network, these would be the unit outputs (prior to applying the activation function).\n",
"\n",
"## Pooling Layers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pool = nn.MaxPool2d(2, 2)\n",
"y = conv2(x)\n",
"z = pool(y)\n",
"z.shape"
]
},
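{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `z.shape` above should show the width and height of `y` halved, with the batch and channel dimensions untouched. To see exactly what max pooling computes, here is a tiny worked example (a sketch on a `1x1x4x4` tensor):\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"pool = nn.MaxPool2d(2, 2)\n",
"t = torch.tensor([[[[1., 2., 5., 6.],\n",
"                    [3., 4., 7., 8.],\n",
"                    [0., 1., 2., 1.],\n",
"                    [1., 0., 1., 3.]]]])\n",
"pool(t)  # each non-overlapping 2x2 block is replaced by its maximum\n",
"```\n",
"\n",
"The four `2x2` blocks here have maxima 4, 8, 1, and 3."
]
},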
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convolutional Networks in PyTorch\n",
"\n",
"In assignment 2, we created the following network. We can understand this network now!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class LargeNet(nn.Module):\n",
" def __init__(self):\n",
" super(LargeNet, self).__init__()\n",
" self.name = \"large\"\n",
" self.conv1 = nn.Conv2d(3, 5, 5)\n",
" self.pool = nn.MaxPool2d(2, 2)\n",
" self.conv2 = nn.Conv2d(5, 10, 5)\n",
" self.fc1 = nn.Linear(10 * 5 * 5, 32)\n",
" self.fc2 = nn.Linear(32, 1)\n",
"\n",
" def forward(self, x):\n",
" x = self.pool(F.relu(self.conv1(x)))\n",
" x = self.pool(F.relu(self.conv2(x)))\n",
" x = x.view(-1, 10 * 5 * 5)\n",
" x = F.relu(self.fc1(x))\n",
" x = self.fc2(x)\n",
" x = x.squeeze(1) # Flatten to [batch_size]\n",
" return x"
]
},
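{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before walking through the architecture, we can trace tensor shapes through its convolutional part (a sketch assuming a `3x32x32` input image, as in the assignment):\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"\n",
"conv1 = nn.Conv2d(3, 5, 5)\n",
"pool = nn.MaxPool2d(2, 2)\n",
"conv2 = nn.Conv2d(5, 10, 5)\n",
"\n",
"x = torch.randn(1, 3, 32, 32)\n",
"x = pool(F.relu(conv1(x)))  # conv: [1, 5, 28, 28], pool: [1, 5, 14, 14]\n",
"x = pool(F.relu(conv2(x)))  # conv: [1, 10, 10, 10], pool: [1, 10, 5, 5]\n",
"x.shape\n",
"```"
]
},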
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This network has **two** convolutional layers: `conv1` and `conv2`.\n",
"\n",
"- The first convolutional layer `conv1` requires an input with 3 channels,\n",
" outputs 5 channels, and has a kernel size of `5x5`. We are not adding any zero-padding.\n",
"- The second convolutional layer `conv1` requires an input with 5 channels,\n",
" outputs 10 channels, and has a kernel size of (again) `5x5`. We are not adding any zero-padding.\n",
"\n",
"In the `forward` function we see that the convolution operations are always \n",
"followed by the usual ReLU activation function, and a pooling operation.\n",
"The pooling operation used is max pooling, so each pooling operation\n",
"reduces the width and height of the neurons in the layer by half.\n",
"\n",
"Because we are not adding any zero padding, we end up with `10 * 5 * 5` hidden units\n",
"after the second convolutional layer. These units are then passed to two fully-connected\n",
"layers, with the usual ReLU activation in between.\n",
"\n",
"Notice that the number of channels **grew** in later convolutional layers! However,\n",
"the number of hidden units in each layer is still reduced because of the pooling operation:\n",
"\n",
"* Initial Image Size: $3 \\times 32 \\times 32 = 3072$\n",
"* After `conv1`: $5 \\times 28 \\times 28$\n",
"* After Pooling: $5 \\times 14 \\times 14 = 980$\n",
"* After `conv2`: $10 \\times 10 \\times 10$\n",
"* After Pooling: $10 \\times 5 \\times 5 = 250$\n",
"* After `fc1`: $32$\n",
"* After `fc2`: $1$\n",
"\n",
"This pattern of **doubling the number of channels with every pooling / strided convolution**\n",
"is common in convolutional architectures. It is used to avoid loss of too much information within\n",
"a single convolution.\n",
"\n",
"## AlexNet in PyTorch\n",
"\n",
"Convolutional networks are very commonly used, meaning that there are often alternatives to\n",
"training convolutional networks from scratch. In particular, researchers often release both\n",
"the architecture and **the weights** of the networks they train.\n",
"\n",
"As an example, let's look at the AlexNet model, whose trained weights are included in `torchvision`.\n",
"AlexNet was trained to classify images into one of many categories.\n",
"The AlexNet can be imported like below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torchvision.models\n",
"\n",
"alexNet = torchvision.models.alexnet(pretrained=True)\n",
"alexNet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the AlexNet model is split into two parts. There is a component that computes\n",
"\"features\" using convolutions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alexNet.features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is also a component that classifies the image based on the computed features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alexNet.classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### AlexNet Features\n",
"\n",
"The first network can be used independently of the second. Specifically, it can be used\n",
"to compute a set of **features** that can be used later on. This idea of using neural\n",
"network activation *features* to represent images is an extremely important one, so it\n",
"is important to understand the idea now.\n",
"\n",
"If we take our image `x` from earlier and apply it to the `alexNet.features` network,\n",
"we get some numbers like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = alexNet.features(x)\n",
"features.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The set of numbers in `features` is another way of representing our image `x`. Recall that\n",
"our initial image `x` was also represented as a tensor, also a set of numbers representing\n",
"pixel intensity. Geometrically speaking, we are using points in a high-dimensional space to\n",
"represent the images. In our pixel representation, the axes in this high-dimensional space\n",
"were different pixels. In our `features` representation, the axes are not as easily\n",
"interpretable.\n",
"\n",
"But we will want to work with the `features` representation, because this representation\n",
"makes classification easier. This representation organizes images in a more \"useful\" and\n",
"\"semantic\" way than pixels.\n",
"\n",
"Let me be more specific:\n",
"this set of `features` was trained on image classification. It turns out that\n",
"**these features can be useful for performing other image-related tasks as well!**\n",
"That is, if we want to perform an image classification task of our own (for example,\n",
"classifying cancer biopsies, which is nothing like what AlexNet was trained to do),\n",
"we might compute these AlexNet features, and then train a small model on top of those\n",
"features. We replace the `classifier` portion of `AlexNet`, but keep its `features`\n",
"portion intact.\n",
"\n",
"Somehow, through being trained on one type of image classification problem, AlexNet \n",
"learned something general about representing images.\n",
"\n",
"\n",
"### AlexNet First Convolutions\n",
"\n",
"Since we have a trained model, we might as well visualize outputs of a trained convolution,\n",
"to contrast with the untrained convolution we visualized earlier.\n",
"\n",
"Here is the first convolution of AlexNet, applied to our image of Mochi."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alexNetConv = alexNet.features[0]\n",
"y = alexNetConv(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output is a $1 \\times 64 \\times 86 \\times 51$ tensor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y = y.detach().numpy()\n",
"y = (y - y.min()) / (y.max() - y.min())\n",
"y.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can visualize each channel independently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(10,10))\n",
"for i in range(64):\n",
" plt.subplot(8, 8, i+1)\n",
" plt.imshow(y[0, i])"
]
}
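,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A Transfer Learning Sketch\n",
"\n",
"To make the earlier transfer-learning idea concrete, here is a minimal sketch of a new classifier head that could replace `alexNet.classifier` on top of the frozen features. The head architecture and the two output classes are illustrative choices, not part of AlexNet:\n",
"\n",
"```python\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"# a small trainable head; 256 is the channel depth of the\n",
"# alexNet.features output, and the 2 classes are illustrative\n",
"head = nn.Sequential(\n",
"    nn.AdaptiveAvgPool2d((6, 6)),\n",
"    nn.Flatten(),\n",
"    nn.Linear(256 * 6 * 6, 2),\n",
")\n",
"\n",
"# in practice: freeze the feature extractor and train only the head, e.g.\n",
"#   for p in alexNet.features.parameters():\n",
"#       p.requires_grad = False\n",
"#   logits = head(alexNet.features(x))\n",
"\n",
"# shape check with a stand-in feature map of the right depth\n",
"head(torch.randn(1, 256, 6, 6)).shape\n",
"```"
]
}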
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}