{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Regularization\n",
"\n",
"In the last few weeks we discussed the idea of **overfitting**, where a neural network model learns\n",
"the quirks of the training data, rather than information that generalizes to the task\n",
"at hand. We also briefly discussed the idea of **underfitting**, but not in as much depth.\n",
"\n",
"The reason is that nowadays, practitioners tend to avoid underfitting altogether by opting for more\n",
"powerful models. Since computation is (relatively) cheap and overfitting is much easier to detect,\n",
"it is more straightforward to build a high-capacity model and use known techniques to prevent\n",
"overfitting. As a result, detecting and preventing overfitting is a more important problem\n",
"in practice than detecting underfitting.\n",
"\n",
"We've already discussed several strategies for preventing overfitting:\n",
"\n",
"- Use a larger training set\n",
"- Use a smaller network\n",
"- Weight-sharing (as in convolutional neural networks)\n",
"- **Early stopping**\n",
"\n",
"Some of these are more practical than others. For example, collecting a larger training\n",
"set may be expensive or infeasible. Using a smaller network means restarting training\n",
"from scratch, discarding what we already know about appropriate hyperparameters and\n",
"weights.\n",
"\n",
"**Early stopping** was introduced in assignment 2, where we did not use the final trained\n",
"weights as our \"final\" model. Instead, we used a model (a set of weights) from an earlier\n",
"iteration of training that had not yet overfit.\n",
"\n",
"These are only some of the techniques for preventing overfitting. We'll discuss more techniques today,\n",
"including:\n",
"\n",
"- Data Augmentation\n",
"- Data Normalization\n",
"- Model Averaging\n",
"- Dropout\n",
"\n",
"We will use the MNIST digit recognition problem as a running example. Since we are studying overfitting,\n",
"I will artificially reduce the number of training examples to 20."
]
},
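{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick refresher, early stopping can be sketched as follows: track the validation accuracy as\n",
"training progresses, and keep a copy of the weights from the best iteration seen so far. This is a\n",
"minimal sketch, not the assignment 2 code; the helpers `train_step` and `get_val_acc` are\n",
"hypothetical placeholders for one optimization step and a validation-accuracy computation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import copy\n",
"\n",
"def train_with_early_stopping(model, get_val_acc, num_iters, train_step):\n",
"    # keep a copy of the weights from the iteration with the best validation accuracy\n",
"    best_acc = 0.0\n",
"    best_weights = copy.deepcopy(model.state_dict())\n",
"    for n in range(num_iters):\n",
"        train_step(model)         # one optimization step (placeholder)\n",
"        acc = get_val_acc(model)  # validation accuracy (placeholder)\n",
"        if acc > best_acc:\n",
"            best_acc = acc\n",
"            best_weights = copy.deepcopy(model.state_dict())\n",
"    model.load_state_dict(best_weights)  # restore the best weights\n",
"    return best_acc"
]
},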
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
"import matplotlib.pyplot as plt\n",
"from torchvision import datasets, transforms\n",
"\n",
"# for reproducibility\n",
"torch.manual_seed(1)\n",
"\n",
"mnist_data = datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor())\n",
"mnist_data = list(mnist_data)\n",
"mnist_train = mnist_data[:20] # 20 training images\n",
"mnist_val = mnist_data[100:5100] # 5000 validation images"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also use the `MNISTClassifier` from last week as our base model:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class MNISTClassifier(nn.Module):\n",
" def __init__(self):\n",
" super(MNISTClassifier, self).__init__()\n",
" self.layer1 = nn.Linear(28 * 28, 50)\n",
" self.layer2 = nn.Linear(50, 20)\n",
" self.layer3 = nn.Linear(20, 10)\n",
" def forward(self, img):\n",
" flattened = img.view(-1, 28 * 28)\n",
" activation1 = F.relu(self.layer1(flattened))\n",
" activation2 = F.relu(self.layer2(activation1))\n",
" output = self.layer3(activation2)\n",
" return output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And of course, our training code, with minor modifications that we will explain as we go along."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def train(model, train, valid, batch_size=20, num_iters=1, learn_rate=0.01, weight_decay=0):\n",
" train_loader = torch.utils.data.DataLoader(train,\n",
" batch_size=batch_size,\n",
" shuffle=True) # shuffle after every epoch\n",
" criterion = nn.CrossEntropyLoss()\n",
" optimizer = optim.SGD(model.parameters(), lr=learn_rate, momentum=0.9, weight_decay=weight_decay)\n",
"\n",
" iters, losses, train_acc, val_acc = [], [], [], []\n",
"\n",
" # training\n",
" n = 0 # the number of iterations\n",
"    while n < num_iters:\n",
"        for imgs, labels in iter(train_loader):\n",
"            if n >= num_iters:\n",
"                break # stop mid-epoch once we reach num_iters\n",
"            model.train()\n",
"            out = model(imgs)             # forward pass\n",
"            loss = criterion(out, labels) # compute the average loss over the batch\n",
"            loss.backward()               # backward pass (compute parameter gradients)\n",
"            optimizer.step()              # update each parameter\n",
"            optimizer.zero_grad()         # reset the gradients for the next iteration\n",
"\n",
"            # save the current training information\n",
"            if n % 10 == 9:\n",
"                iters.append(n)\n",
"                losses.append(float(loss)) # CrossEntropyLoss already averages over the batch\n",
"                train_acc.append(get_accuracy(model, train)) # compute training accuracy\n",
"                val_acc.append(get_accuracy(model, valid))   # compute validation accuracy\n",
"            n += 1\n",
"\n",
" # plotting\n",
" plt.figure(figsize=(10,4))\n",
" plt.subplot(1,2,1)\n",
" plt.title(\"Training Curve\")\n",
" plt.plot(iters, losses, label=\"Train\")\n",
" plt.xlabel(\"Iterations\")\n",
" plt.ylabel(\"Loss\")\n",
"\n",
" plt.subplot(1,2,2)\n",
" plt.title(\"Training Curve\")\n",
" plt.plot(iters, train_acc, label=\"Train\")\n",
" plt.plot(iters, val_acc, label=\"Validation\")\n",
" plt.xlabel(\"Iterations\")\n",
"    plt.ylabel(\"Accuracy\")\n",
" plt.legend(loc='best')\n",
" plt.show()\n",
"\n",
" print(\"Final Training Accuracy: {}\".format(train_acc[-1]))\n",
" print(\"Final Validation Accuracy: {}\".format(val_acc[-1]))\n",
"\n",
"\n",
"def get_accuracy(model, data):\n",
"    correct = 0\n",
"    total = 0\n",
"    model.eval()\n",
"    with torch.no_grad(): # no need to track gradients during evaluation\n",
"        for imgs, labels in torch.utils.data.DataLoader(data, batch_size=64):\n",
"            output = model(imgs)                  # we don't need to run F.softmax\n",
"            pred = output.max(1, keepdim=True)[1] # get the index of the maximum logit\n",
"            correct += pred.eq(labels.view_as(pred)).sum().item()\n",
"            total += imgs.shape[0]\n",
"    return correct / total"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Without any intervention, our model gets to about 52-53% accuracy on the validation set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = MNISTClassifier()\n",
"train(model, mnist_train, mnist_val, num_iters=500)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Augmentation\n",
"\n",
"While it is often expensive to gather more data, we can make alterations to our existing\n",
"dataset, and treat each altered example as a new training data point. Common ways of\n",
"obtaining new (image) data include:\n",
"\n",
"- Flipping each image horizontally or vertically (won't work for digit recognition, but might for other tasks)\n",
"- Shifting each pixel a little to the left or right\n",
"- Rotating the images a little\n",
"- Adding noise to the image\n",
"\n",
"... or even a combination of the above. For demonstration purposes, let's randomly\n",
"rotate our digits a little to get new training samples.\n",
"\n",
"Here are the 20 images in our training set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def show20(data):\n",
" plt.figure(figsize=(10,2))\n",
" for n, (img, label) in enumerate(data):\n",
" if n >= 20:\n",
" break\n",
" plt.subplot(2, 10, n+1)\n",
" plt.imshow(img)\n",
"\n",
"mnist_imgs = datasets.MNIST('data', train=True, download=True)\n",
"show20(mnist_imgs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the 20 images in our training set, each rotated randomly, by up to 25 degrees."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mnist_new = datasets.MNIST('data', train=True, download=True, transform=transforms.RandomRotation(25))\n",
"show20(mnist_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we apply the transformation again, we can get images with different rotations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mnist_new = datasets.MNIST('data', train=True, download=True, transform=transforms.RandomRotation(25))\n",
"show20(mnist_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can augment our data set by, say, randomly rotating each training data point 100 times:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_transform = transforms.Compose([\n",
"    transforms.RandomRotation(25),\n",
"    transforms.ToTensor(),\n",
"])\n",
"\n",
"# the random rotation is applied each time an image is accessed, so iterating\n",
"# over the same 20 images 100 times gives 2000 distinct training samples\n",
"mnist_new = datasets.MNIST('data', train=True, download=True, transform=my_transform)\n",
"\n",
"augmented_train_data = []\n",
"for i in range(100):\n",
"    for j, item in enumerate(mnist_new):\n",
"        if j >= 20:\n",
"            break\n",
"        augmented_train_data.append(item)\n",
"\n",
"len(augmented_train_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We obtain a better validation accuracy after training on our expanded dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = MNISTClassifier()\n",
"train(model, augmented_train_data, mnist_val, num_iters=500)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Normalization\n",
"\n",
"Another common practice is to normalize the data so that the mean and standard deviation are constant\n",
"across each channel. For example, in your assignment 2 code, we used the following transform:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))"
]
},
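{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concretely, `transforms.Normalize(mean, std)` computes `(x - mean) / std` for every pixel in a\n",
"channel. A plain-Python sketch of the arithmetic (not the torchvision implementation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def normalize(pixels, mean, std):\n",
"    # transforms.Normalize computes (x - mean) / std per channel\n",
"    return [(x - mean) / std for x in pixels]\n",
"\n",
"# with mean=0.5 and std=0.5, intensities in [0, 1] are mapped to [-1, 1]\n",
"normalize([0.0, 0.5, 1.0], 0.5, 0.5)  # -> [-1.0, 0.0, 1.0]"
]
},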
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This transform subtracts the mean 0.5 from each channel and divides by the standard deviation 0.5,\n",
"mapping pixel intensities from $[0, 1]$ to $[-1, 1]$.\n",
"\n",
"## Weight Decay\n",
"\n",
"A more interesting technique for preventing overfitting is **weight decay**.\n",
"The idea is to **penalize large weights**. We want to avoid large weights, because a large weight\n",
"means that the prediction relies heavily on the content of one pixel, or on one unit. Intuitively,\n",
"it does not make sense for the classification of an image to depend heavily on the\n",
"content of one pixel, or even a few pixels.\n",
"\n",
"Mathematically, we penalize large weights by adding an extra term to the loss function,\n",
"scaled by a small coefficient $\\lambda$. The term can take one of the following forms:\n",
"\n",
"- $L^1$ regularization: $\\sum_k |w_k|$\n",
"    - This term encourages weights to be exactly 0 (sparsity)\n",
"- $L^2$ regularization: $\\sum_k w_k^2$\n",
"    - This term shrinks each weight towards 0 in proportion to its size\n",
"- Combination of $L^1$ and $L^2$ regularization: add a term $\\sum_k (|w_k| + w_k^2)$ to the loss function\n",
"\n",
"In PyTorch, weight decay can also be done automatically inside an optimizer. The parameter `weight_decay`\n",
"of `optim.SGD` and most other optimizers uses $L^2$ regularization for weight decay. The value of the\n",
"`weight_decay` parameter is another tunable hyperparameter."
]
},
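{
"cell_type": "markdown",
"metadata": {},
"source": [
"The penalty terms themselves are just sums over the weights. A small sketch in plain Python, with\n",
"`lam` standing in for the tunable regularization strength $\\lambda$; in a PyTorch model, the same\n",
"sums would run over `model.parameters()`, e.g. `sum(w.abs().sum() for w in model.parameters())`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def l1_penalty(weights, lam):\n",
"    # lam * sum_k |w_k|\n",
"    return lam * sum(abs(w) for w in weights)\n",
"\n",
"def l2_penalty(weights, lam):\n",
"    # lam * sum_k w_k^2\n",
"    return lam * sum(w * w for w in weights)\n",
"\n",
"l1_penalty([1.0, -2.0, 3.0], 0.5)  # -> 3.0\n",
"l2_penalty([1.0, -2.0, 3.0], 0.5)  # -> 7.0"
]
},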
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = MNISTClassifier()\n",
"train(model, mnist_train, mnist_val, num_iters=500, weight_decay=0.001)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Averaging and Dropout\n",
"\n",
"Yet another way to prevent overfitting is to build **many** models, then average\n",
"their predictions at test time. Each model might have a different set of\n",
"initial weights.\n",
"\n",
"We won't show an example of model averaging here. Instead, we will show another \n",
"idea that sounds drastically different on the surface.\n",
"\n",
"This idea is called **dropout**: in each training iteration, we randomly \"drop out\" (\"zero out\",\n",
"or \"remove\") a portion of the neurons.\n",
"\n",
"![](imgs/dropout.png)\n",
"\n",
"In different iterations of training, we drop out a different set of neurons.\n",
"\n",
"This technique prevents weights from becoming overly dependent on\n",
"each other, for example, one weight growing unnecessarily large to compensate for\n",
"another unnecessarily large weight with the opposite sign. Weights are encouraged\n",
"to be \"more independent\" of one another.\n",
"\n",
"At test time, though, we do not drop out any neurons; instead, we use\n",
"the entire set of weights. This means that the training-time and test-time behaviours\n",
"of a dropout layer are *different*. In the code for the functions `train` and `get_accuracy`,\n",
"we use `model.train()` and `model.eval()` to flag whether we want the model's training behaviour\n",
"or its test-time behaviour. (PyTorch's `nn.Dropout` scales the surviving activations by\n",
"$1/(1-p)$ during training, so no rescaling is needed at test time.)\n",
"\n",
"While unintuitive, using all connections at test time is a form\n",
"of model averaging! We are effectively averaging over many different networks\n",
"with various connectivity structures."
]
},
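{
"cell_type": "markdown",
"metadata": {},
"source": [
"The train/eval difference can be seen directly on an `nn.Dropout` layer. In training mode, PyTorch\n",
"zeroes each entry with probability $p$ and scales the survivors by $1/(1-p)$; in eval mode the\n",
"layer is the identity. A small demonstration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"import torch.nn as nn\n",
"\n",
"dropout = nn.Dropout(0.5)\n",
"x = torch.ones(10)\n",
"\n",
"dropout.train()\n",
"print(dropout(x))  # random: each entry is either 0.0 or 2.0 (scaled by 1/(1-p))\n",
"\n",
"dropout.eval()\n",
"print(dropout(x))  # identity: all entries stay 1.0"
]
},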
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class MNISTClassifierWithDropout(nn.Module):\n",
" def __init__(self):\n",
" super(MNISTClassifierWithDropout, self).__init__()\n",
" self.layer1 = nn.Linear(28 * 28, 50)\n",
" self.layer2 = nn.Linear(50, 20)\n",
" self.layer3 = nn.Linear(20, 10)\n",
"        self.dropout1 = nn.Dropout(0.2) # dropout layer that zeroes 20% of its inputs\n",
" self.dropout2 = nn.Dropout(0.2)\n",
" self.dropout3 = nn.Dropout(0.2)\n",
" def forward(self, img):\n",
" flattened = img.view(-1, 28 * 28)\n",
" activation1 = F.relu(self.layer1(self.dropout1(flattened)))\n",
" activation2 = F.relu(self.layer2(self.dropout2(activation1)))\n",
" output = self.layer3(self.dropout3(activation2))\n",
" return output\n",
"\n",
"model = MNISTClassifierWithDropout()\n",
"train(model, mnist_train, mnist_val, num_iters=500)"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}