CSC321 Winter 2015: Introduction to Neural Networks

Lecture notes

Here are some notes to supplement the Coursera videos. Slides from the in-class meetings can be found in the calendar. Thanks to Tijmen Tieleman for the original version of these notes.

Lecture A

Why do we need machine learning? and What are neural networks?
- These videos introduce the motivation and general philosophy of ML.
- Don’t worry if you don’t understand all of the technicalities of e.g. the story about speech recognition. Try to get the big picture of the story.
- An important point is that some things that feel easy to us, like vision, are hard for software, and vice versa (chess).
Some simple models of neurons
- This video introduces some basic neuron types. It shows the formalization of the concepts (connection, activity, etc) into math.
A simple example of learning
- The most important part of this video is the visualization. Visualization of neural networks is difficult but important.
Three types of learning
- Pay extra attention to supervised learning and its mathematical definition, because that’s what we’re doing for the first half of the course.

Lecture B

Types of neural network architectures
- Pay extra attention to feedforward networks, because that’s what we’ll be doing for the first half of the course.
Perceptrons: The first generation of neural networks
- Keep in mind the analogy with neurons and synapses.
- Think about which parts are learned and which aren’t, and ask yourself why, even if you don’t find an answer.
- Try to fully understand why the bias can be implemented as a special input unit.
- Synonyms: neuron; unit; feature.
  - “neuron” emphasizes the analogy with real brains.
  - “unit” emphasizes that it’s one component of a large network.
  - “feature” emphasizes that it represents (implements) a feature detector that’s looking at the input and will turn on iff the sought feature is present in the input.
- Synonyms: a unit’s value; a unit’s activation; a unit’s output. Note that a unit’s input is something else.
  - “value” emphasizes that we can think of it as a variable, or a function of the input.
  - “activation” emphasizes that the unit may be responding or not, or to an extent; it’s most appropriate for logistic units, and it might emphasize the analogy with real brains.
  - “output” emphasizes that it’s different from the input.
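On the point about the bias being implementable as a special input unit: here is a minimal sketch of that idea. The weights, the bias value, and the convention that the unit turns on when its total input is at least zero are all assumptions chosen for illustration, not values from the videos.

```python
def perceptron_output(weights, bias, inputs):
    """Binary threshold unit with an explicit bias term."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0  # assumed convention: on when total input >= 0


def perceptron_output_bias_as_input(extended_weights, inputs):
    """Same unit, but the bias is the weight on an extra input that is always 1."""
    extended_inputs = [1.0] + list(inputs)
    total = sum(w * x for w, x in zip(extended_weights, extended_inputs))
    return 1 if total >= 0 else 0


# Hypothetical weights, bias and input cases.
weights, bias = [0.5, -1.0], 0.25
cases = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

explicit = [perceptron_output(weights, bias, x) for x in cases]
folded = [perceptron_output_bias_as_input([bias] + weights, x) for x in cases]
print(explicit, folded)  # the two lists agree on every case
```

Because the two versions agree, the learning rule never needs a separate case for the bias: it is just one more weight.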
A geometrical view of perceptrons
- If you’re not too experienced with geometry and its math, then this is going to challenge your imagination. Take your time. After you understand this video, the other two will be easier than this one.
- It’s about high-dimensional spaces. A few basic facts about those:
  - A point (a.k.a. location) and an arrow from the origin to that point are often used interchangeably. Either one can be called a location or a vector.
  - A hyperplane is the high-dimensional equivalent of a plane in 3-D. In 2-D, it’s a line.
  - The slides that show an image of “weight space” use a 2-D weight space, so that it’s easy to draw. The same ideas apply in high-D.
  - The “scalar product” between two vectors is what you get when you multiply them element-wise and then add up those products. It’s also known as “inner product”. The scalar product between two vectors that have an angle of less than 90 degrees between them is positive. For more than 90 degrees it’s negative.
- If you’re not that sure about the story of this video after watching it, watch it again. Understanding it is a prerequisite for the next video.
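The facts about the scalar product can be checked numerically. This is a small sketch with made-up 2-D vectors; the same functions work in any number of dimensions.

```python
import math


def scalar_product(u, v):
    """Multiply element-wise, then add up the products."""
    return sum(a * b for a, b in zip(u, v))


def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    lengths = math.sqrt(scalar_product(u, u)) * math.sqrt(scalar_product(v, v))
    cosine = scalar_product(u, v) / lengths
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Hypothetical vectors, chosen only for illustration.
w = [1.0, 2.0]
acute = [2.0, 1.0]      # less than 90 degrees away from w
right = [2.0, -1.0]     # exactly 90 degrees away from w
obtuse = [-2.0, 0.5]    # more than 90 degrees away from w

for v in (acute, right, obtuse):
    print(scalar_product(w, v), angle_degrees(w, v))
```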
Why the learning works
- Here, using the geometrical interpretation, a proof is presented of why the perceptron learning algorithm works.
- The details are not all spelled out. After watching the video, try to tell the story to someone else (or to a wall) in your own words, if possible with more details. That’s the best way to study anyway.
What perceptrons can’t do
- This story motivates the need for more powerful networks.
- These ideas will be important in future lectures, when we’re working on moving beyond these limitations.
- Synonyms: “input case”; “training case”; “training example”; “training point”; and sometimes even “input” (that’s definitely wrong though).
  - “input case” and “input” emphasize that this is given to the neural network, instead of being demanded of the network (like the answer to a test case).
  - “input” is ambiguous, because more often, “input” is short for “input neuron”.
  - “training case” is the most commonly used and is quite generic.
  - “training example” emphasizes the analogy with human learning: we learn from examples.
  - “training point” emphasizes that it’s a location in a high-dimensional space.

Lecture C

Learning the weights of a linear neuron
- This video introduces lots of new ideas, and is a big prerequisite for understanding the rest of Lecture C (and in fact the rest of the course).
- This video introduces a different type of output neuron.
- Again, we have a proof of convergence, but it’s a different proof. It doesn’t require the existence of a perfect weight vector.
- “residual error” really means “error” or “residual”: it’s the amount by which we got the answer wrong.
- A very central concept is introduced without being made very explicit: we use derivatives for learning, i.e. for making the weights better. Try to understand why those concepts are indeed very related.
- “on-line” learning means that we change the weights after every training example that we see, and we typically cycle through the collection of available training examples.
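To make "derivatives for learning" and "on-line" concrete, here is a minimal sketch of on-line learning for a linear neuron with the error E = 1/2 * (t-y)^2. The training cases, the learning rate, and the number of passes are assumptions chosen so that the example converges; they are not from the video.

```python
def online_delta_rule(cases, n_weights, learning_rate=0.05, n_epochs=200):
    """Change the weights after every training case, cycling through the cases."""
    weights = [0.0] * n_weights
    for _ in range(n_epochs):
        for inputs, target in cases:
            y = sum(w * x for w, x in zip(weights, inputs))
            residual = target - y
            # dE/dw_i = -(t - y) * x_i, so we step in the opposite direction.
            weights = [w + learning_rate * residual * x
                       for w, x in zip(weights, inputs)]
    return weights


# Hypothetical training cases generated by the "true" weights [2.0, -1.0].
cases = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
         ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
learned = online_delta_rule(cases, n_weights=2)
print(learned)  # close to [2.0, -1.0]
```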
The error surface for a linear neuron
- A lot of geometry again, much like in the videos on perceptrons. These types of analysis are the best tool that we have for understanding what a learning rule is doing. This is not easy.
- In the image, we use two weights, and two training cases. These numbers need not have been the same, so it’s not like one weight is connected to one training case, and the other weight is connected to the other training case.
Learning the weights of a logistic output neuron
- This one is easier than the other two: it has far fewer new concepts.
- Think about what’s different from the case with linear neurons, and what’s the same.
- The error function is still E = 1/2 * (y-t)^2
- Notice how after Geoff explained what the derivative is for a logistic unit, he considers the job to be done. That’s because the learning rule is always simply some learning rate multiplied by the derivative.
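Here is a sketch of that derivative for a single logistic neuron with E = 1/2 * (y-t)^2, checked against a finite-difference estimate and then used in one learning step. The weights, inputs, target, and learning rate are made-up values.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def error(weights, inputs, target):
    y = logistic(sum(w * x for w, x in zip(weights, inputs)))
    return 0.5 * (y - target) ** 2


def error_gradient(weights, inputs, target):
    y = logistic(sum(w * x for w, x in zip(weights, inputs)))
    # Chain rule: dE/dy = (y - t), dy/dz = y * (1 - y), dz/dw_i = x_i.
    return [(y - target) * y * (1.0 - y) * x for x in inputs]


# Hypothetical values.
weights, inputs, target = [0.3, -0.2], [1.0, 2.0], 1.0
analytic = error_gradient(weights, inputs, target)

# Finite-difference check of each partial derivative.
eps = 1e-6
numeric = []
for i in range(len(weights)):
    up = list(weights)
    up[i] += eps
    down = list(weights)
    down[i] -= eps
    numeric.append((error(up, inputs, target) - error(down, inputs, target)) / (2 * eps))

# The learning rule: a learning rate multiplied by the derivative.
learning_rate = 0.5
new_weights = [w - learning_rate * g for w, g in zip(weights, analytic)]
print(error(weights, inputs, target), error(new_weights, inputs, target))
```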
Synonyms: “loss (function)”; “error (function)”; “objective (function) (value)”.
- “loss” emphasizes that we’re minimizing it, without saying much about what the meaning of the number is.
- “error” emphasizes that it’s the extent to which the network gets things wrong.
- “objective function” is very generic. This is the only one where it’s not clear whether we’re minimizing or maximizing it.
The backpropagation algorithm
- Here, we start using hidden layers. To train them, we need the backpropagation algorithm.
- Hidden layers, and this algorithm, are very important in this course. If there’s any confusion about this, it’s worth resolving soon.
- The story of training by perturbations serves mostly as motivation for using backprop, and is not as central as the rest of the video.
- This computation, just like the forward propagation, can be vectorized across multiple units in every layer, and multiple training cases.
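As an illustration of that vectorization, here is a sketch of the forward and backward pass for a small network written with NumPy matrices, with one training case per row. The architecture (logistic hidden units, linear outputs, no biases), the sizes, and the random data are assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_in, n_hid, n_out = 5, 3, 4, 2
X = rng.normal(size=(n_cases, n_in))     # inputs, one training case per row
T = rng.normal(size=(n_cases, n_out))    # targets
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid, n_out))


def forward(W1, W2, X):
    H = 1.0 / (1.0 + np.exp(-(X @ W1)))  # logistic hidden units, all cases at once
    Y = H @ W2                           # linear output units
    return H, Y


def total_error(W1, W2, X, T):
    _, Y = forward(W1, W2, X)
    return 0.5 * np.sum((Y - T) ** 2)


def backprop(W1, W2, X, T):
    H, Y = forward(W1, W2, X)
    dY = Y - T                   # dE/dY for every output unit and every case
    dW2 = H.T @ dY               # the matrix product sums over training cases
    dH = dY @ W2.T               # send the error derivatives back to the hidden layer
    dZ1 = dH * H * (1.0 - H)     # through the logistic nonlinearity
    dW1 = X.T @ dZ1
    return dW1, dW2


dW1, dW2 = backprop(W1, W2, X, T)

# Finite-difference check on a single weight.
eps = 1e-6
W1_up = W1.copy()
W1_up[0, 0] += eps
W1_down = W1.copy()
W1_down[0, 0] -= eps
numeric = (total_error(W1_up, W2, X, T) - total_error(W1_down, W2, X, T)) / (2 * eps)
print(numeric, dW1[0, 0])
```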
Using the derivatives computed by backpropagation
- Here, two topics (optimization and regularization) are introduced, to be further explored later on in the course.

Lecture D

Another diversion: The softmax output function
- This is not really a diversion: it’s a crucial ingredient of language models, and many other neural networks.
- We’ve seen binary threshold output neurons and logistic output neurons. This video presents a third type. This one only makes sense if we have multiple output neurons.
- The first “problem with squared error” is a problem that shows up when we’re combining the squared error loss function with logistic output units. The logistic has small gradients, if the input is very positive or very negative.
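A short sketch of both points: a softmax over several output units, and the tiny gradient that squared error gives a logistic unit whose input is very negative. The input values are arbitrary; subtracting the maximum inside the softmax is a standard numerical precaution and does not change the result.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtracting the max avoids overflow
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error_grad_wrt_input(z, target):
    """dE/dz for E = 1/2 * (y-t)^2 with y = logistic(z)."""
    y = logistic(z)
    return (y - target) * y * (1.0 - y)


probs = softmax([2.0, 1.0, -1.0])  # outputs are positive and sum to 1

# Target is 1 in both cases. In the second case the output is almost 0,
# i.e. badly wrong, and yet the gradient is tiny.
grad_moderate = squared_error_grad_wrt_input(0.0, 1.0)
grad_saturated = squared_error_grad_wrt_input(-10.0, 1.0)
print(probs, grad_moderate, grad_saturated)
```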
Neuro-probabilistic language models
- This is the first of several applications of neural networks that we’ll be studying in some detail in this course.
Synonyms: word embedding; word feature vector; word encoding.
- All of these describe the learned collection of numbers that is used to represent a word.
- “embedding” emphasizes that it’s a location in a high-dimensional space: it’s where the words are embedded in that space. When we check to see which words are close to each other, we’re thinking about that embedding.
- “feature vector” emphasizes that it’s a vector instead of a scalar, and that it’s componential, i.e. composed of multiple feature values.
- “encoding” is very generic and doesn’t emphasize anything specific.

Lecture E

Overview of mini-batch gradient descent
- Now we’re going to discuss numerical optimization: how best to adjust the weights and biases, using the gradient information from the backprop algorithm.
- This video elaborates on the most standard neural net optimization algorithm (mini-batch gradient descent), which we’ve seen before.
A bag of tricks for mini-batch gradient descent
- Part 1 is about transforming the data to make learning easier.
  - At 1:10, there’s a comment about random weights and scaling. The “it” in that comment is the average size of the input to the unit.
  - At 1:15, the “good principle”: what he means is INVERSELY proportional.
  - At 4:38, Geoff says that the hyperbolic tangent is twice the logistic minus one. This is not true, but it’s almost true. As an exercise, find out what’s missing in that equation.
  - At 5:08, Geoffrey suggests that with a hyperbolic tangent unit, it’s more difficult to sweep things under the rug than with a logistic unit. I don’t understand his comment, so if you don’t either, don’t worry. This comment is not essential in this course: we’re never using hyperbolic tangents in this course.
- Part 2 is about changing the stochastic gradient descent algorithm in sophisticated ways. We’ll look into these four methods in more detail, later on in the course.
Jargon: “stochastic gradient descent” is mini-batch or online gradient descent.
- The term emphasizes that it’s not full-batch gradient descent.
- “stochastic” means that it involves randomness. However, this algorithm typically does not involve randomness.
- However, it would be truly stochastic if we randomly picked 100 training cases from the entire training set, every time we need the next mini-batch.
- We call traditional “stochastic gradient descent” stochastic because it is, in effect, very similar to that truly stochastic version.
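The two ways of producing mini-batches can be sketched as generators of training case indices. The function names and the batch size are made up for this illustration.

```python
import random


def cycling_minibatches(n_cases, batch_size):
    """The usual version: walk through the training set in order, then start over."""
    start = 0
    while True:
        yield [(start + k) % n_cases for k in range(batch_size)]
        start = (start + batch_size) % n_cases


def sampled_minibatches(n_cases, batch_size, seed=0):
    """The truly stochastic version: a fresh random pick for every mini-batch."""
    rng = random.Random(seed)
    while True:
        yield rng.sample(range(n_cases), batch_size)


cycling = cycling_minibatches(n_cases=10, batch_size=4)
first, second, third = next(cycling), next(cycling), next(cycling)

sampled = sampled_minibatches(n_cases=10, batch_size=4)
random_batch = next(sampled)
print(first, second, third, random_batch)
```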
Jargon: a “running average” is a weighted average over the recent past, where the most recent past is weighted most heavily.
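One common way to compute such a running average is an exponentially decaying one, sketched here. The decay value is an assumption; the videos may use a different constant.

```python
def running_average(values, decay=0.9):
    """Exponentially decaying average: older values fade by a factor `decay` per step."""
    average = values[0]
    history = [average]
    for v in values[1:]:
        average = decay * average + (1.0 - decay) * v
        history.append(average)
    return history


# The same spike of 10 counts for more when it is recent than when it is old.
recent_spike = running_average([0.0, 0.0, 0.0, 10.0], decay=0.5)
old_spike = running_average([10.0, 0.0, 0.0, 0.0], decay=0.5)
print(recent_spike[-1], old_spike[-1])
```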
The momentum method
- Now we’re going to take a more thorough look at some of the tricks suggested in video 6b.
- The biggest challenge in this video is to think of the error surface as a mountain landscape. If you can do that, and you understand the analogy well, this video will be easy.
  - You may have to go back to video 3b, which introduces the error surface.
  - Important concepts in this analogy: “ravine”, “a low point on the surface”, “oscillations”, “reaching a low altitude”, “rolling ball”, “velocity”.
  - All of those have meaning on the “mountain landscape” side of the analogy, as well as on the “neural network learning” side of the analogy.
  - The meaning of “velocity” in the “neural network learning” side of the analogy is the main idea of the momentum method.
- Vocabulary: the word “momentum” can be used with three different meanings, so it’s easy to get confused.
  - It can mean the momentum method for neural network learning, i.e. the idea that’s introduced in this video. This is the most appropriate meaning of the word.
  - It can mean the viscosity constant (typically 0.9), sometimes called alpha, which is used to reduce the velocity.
  - It can mean the velocity. This is not a common meaning of the word.
- Note that one may equivalently choose to include the learning rate in the calculation of the update from the velocity, instead of in the calculation of the velocity.
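The last note can be checked with a small experiment. This sketch runs the momentum method on a made-up ravine-shaped error surface, once with the learning rate inside the velocity calculation and once inside the weight update. The error surface, learning rate, and number of steps are assumptions, and the equivalence relies on the learning rate staying constant.

```python
def error(w):
    # A hypothetical ravine: steep along w[0], shallow along w[1].
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)


def gradient(w):
    return [100.0 * w[0], w[1]]


def momentum_lr_in_velocity(w, learning_rate, alpha, n_steps):
    v = [0.0, 0.0]
    for _ in range(n_steps):
        g = gradient(w)
        v = [alpha * vi - learning_rate * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


def momentum_lr_in_update(w, learning_rate, alpha, n_steps):
    v = [0.0, 0.0]
    for _ in range(n_steps):
        g = gradient(w)
        v = [alpha * vi - gi for vi, gi in zip(v, g)]
        w = [wi + learning_rate * vi for wi, vi in zip(w, v)]
    return w


start = [1.0, 1.0]
w_a = momentum_lr_in_velocity(start, learning_rate=0.01, alpha=0.9, n_steps=50)
w_b = momentum_lr_in_update(start, learning_rate=0.01, alpha=0.9, n_steps=50)
print(w_a, w_b)  # the same weights, up to rounding
```

In the second version the velocity is simply the first version's velocity divided by the learning rate.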

Lecture F

Modeling sequences: A brief overview
- This video talks about some advanced material that will make a lot more sense after you complete the course: it introduces some generative models for unsupervised learning (see Lecture A), namely Linear Dynamical Systems and Hidden Markov Models. These are neural networks, but they’re very different in nature from the deterministic feedforward networks that we’ve been studying so far. For now, don’t worry if those two models feel rather mysterious.
- However, Recurrent Neural Networks are the next topic of the course, so make sure that you understand them.
Training RNNs with back propagation
- After watching the video, think about how such a system can be used to implement the brain of a robot as it’s producing a sentence of text, one letter at a time. What would be input; what would be output; what would be the training signal; which units at which time slices would represent the input & output?
A toy example of training an RNN
- Clarification at 3:33: there are two input units. Do you understand what each of those two is used for?
- The hidden units, in this example, as in most neural networks, are logistic. That’s why it’s somewhat reasonable to talk about binary states: those are the extreme states.
Why it is difficult to train an RNN
- This is all about backpropagation with logistic hidden units. If necessary, review video 3d and the example that we studied in class.
- Remember that Geoffrey explained in class how the backward pass is like an extra long linear network? That’s the first slide of this video.
- Echo State Networks: At 6:36, “oscillator” describes the behavior of a hidden unit (i.e. the activity of the hidden unit oscillates), just like we often use the word “feature” to functionally describe a hidden unit.
- Echo State Networks: like when we were studying perceptrons, the crucial question here is what’s learned and what’s not learned. ESNs are like perceptrons with randomly created inputs.
- At 7:42: the idea is good initialization with subsequent learning (using backprop’s gradients and stochastic gradient descent with momentum as the optimizer).
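The view of the backward pass as an extra long linear network can be illustrated numerically: the gradient is multiplied by the same (transposed) weight matrix at every time step, so it shrinks or grows exponentially. The weight matrix, the number of time steps, and the use of one fixed slope for every unit are simplifying assumptions.

```python
import numpy as np


def backpropagated_gradient_norm(W, n_steps, slope=1.0):
    """Length of an error derivative vector after n_steps of the backward pass.

    `slope` stands in for the derivative of the hidden unit nonlinearity,
    which is at most 0.25 for a logistic unit.
    """
    grad = np.ones(W.shape[0])
    for _ in range(n_steps):
        grad = slope * (W.T @ grad)
    return float(np.linalg.norm(grad))


# Hypothetical recurrent weights: the same matrix at two different scales.
base = np.array([[0.0, 1.0], [1.0, 0.0]])
vanishing = backpropagated_gradient_norm(0.5 * base, n_steps=50)
exploding = backpropagated_gradient_norm(1.5 * base, n_steps=50)
with_logistic = backpropagated_gradient_norm(1.5 * base, n_steps=50, slope=0.25)
print(vanishing, exploding, with_logistic)
```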
Long-term Short-term-memory
- This video is about a solution to the vanishing or exploding gradient problem. Make sure that you understand that problem first, because otherwise this video won’t make much sense.
- The material in this video is quite advanced.
- In the diagram of the memory cell, there’s a somewhat new type of connection: a multiplicative connection.
  - It’s shown as a triangle.
  - It can be thought of as a connection of which the strength is not a learned parameter, but is instead determined by the rest of the neural network, and is therefore probably different for different training cases.
    - This is the interpretation that Geoffrey uses when he explains backpropagation through time through such a memory cell.
  - That triangle can, alternatively, be thought of as a multiplicative unit: it receives input from two different places, it multiplies those two numbers, and it sends the product somewhere else as its output.
    - Which two of the three lines indicate input and which one indicates output is not shown in the diagram, but is explained.
- In Geoffrey’s explanation of row 4 of the video, “the most active character” means the character that the net, at this time, considers most likely to be the next character in the character string, based on what the pen is doing.

Lecture G

Why object recognition is difficult
- We’re switching to a different application of neural networks: computer vision, i.e. having a computer really understand what an image is showing.
- This video explains why it is difficult for a computer to go from an image (i.e. the color and intensity for each pixel in the image) to an understanding of what it’s an image of.
- Some of this discussion is about images of 2-dimensional objects (writing on paper), but most of it is about photographs of 3-D real-world scenes.
- Make sure that you understand the last slide:
  - It explains how switching age and weight is like an object moving over to a different part of the image (to different pixels).
  - These two might sound like very different situations, but the analogy is in fact quite good: they’re not really very different.
  - Understanding this is a prerequisite, especially for the next video.
Achieving viewpoint invariance
- “invariant” means, literally, that it doesn’t vary: it doesn’t change as a result of a change of viewpoint.
  - This means that if the neuron for the feature detector is fairly active (say it’s a logistic neuron and it has a value close to 1) for one input image, then if we give the neural network an image of that same scene from a somewhat different viewpoint, that same neuron will still be fairly active. Its activity is invariant under viewpoint changes.
  - “invariant” is a matter of degrees: there’s very little that’s completely invariant, or that has no invariance at all, but some things are more invariant than others.
- The invariant features are things like “there’s a red circle somewhere in the image”, and the neuron for that feature detector should somehow learn to turn on when there is indeed a red circle in the input, and turn off if there isn’t.
- Try to come up with examples of features that are largely invariant under viewpoint changes, and examples of features that don’t have that property.
Convolutional nets for digit recognition
- Like many of the stories which we tell with the application of recognizing handwritten digits, this one, too, is applicable to a great variety of vision tasks.
  - It’s just that handwritten digit recognition is a standard example for neural networks.
- Convolutional nets are still very much used.
- The slide “Backpropagation with weight constraints” can be confusing. Here are some clarifications. (note that not every researcher uses the same definitions)
  - Error Backpropagation (a.k.a. “backpropagation” or “backprop”) is an algorithm that cleverly uses the chain rule to calculate gradients for neural networks. It doesn’t really care about weight constraints.
  - What does care about weight constraints is the optimizer: the system that, bit by bit, changes the weights & biases of the network to reduce the error, and that uses the gradient (obtained by backprop) to figure out in which direction to change the weights.
  - The gradient for two weights will typically not be the same, even if they’re two weights that we’d like to keep equal.
  - The optimizer can keep the “tied” weights the same in at least two ways.
  - One way is to use the sum of the gradients of the various “instances” of the tied weights as if it were the gradient for each of the instances. That’s what the video describes.
  - Another way is to use the mean instead of the sum.
  - Both methods have their advantages.
  - The main point of this is that it’s not the gradients that change if we have convolution; what changes is what we do with the gradients.
  - Another interpretation is to say that there really aren’t two (or more) weights that we’re trying to keep equal, but that there’s really only one parameter that shows up in two (or more) places in the network.
    - That’s the more mathematical interpretation.
    - It favours using the sum of gradients instead of the mean (you can try to figure out why, if you’re feeling mathematical).
    - This interpretation is also closer to what typically happens in the computer program that runs the convolutional neural net.
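Here is a tiny numerical sketch of the "one parameter in two places" interpretation, using a made-up linear unit in which the same parameter is used as two weights. The derivative of the error with respect to the shared parameter, estimated by finite differences, is compared with the sum of the gradients of the two instances.

```python
def error(w_a, w_b, x1, x2, target):
    y = w_a * x1 + w_b * x2
    return 0.5 * (y - target) ** 2


def instance_gradients(w_a, w_b, x1, x2, target):
    """Gradients as backprop computes them, treating the two weights separately."""
    y = w_a * x1 + w_b * x2
    return (y - target) * x1, (y - target) * x2


# Hypothetical values; the single parameter w is used in both places.
w, x1, x2, target = 0.4, 1.0, 3.0, 2.0
g_a, g_b = instance_gradients(w, w, x1, x2, target)
summed = g_a + g_b

# Finite-difference derivative with respect to the one shared parameter.
eps = 1e-6
numeric = (error(w + eps, w + eps, x1, x2, target)
           - error(w - eps, w - eps, x1, x2, target)) / (2 * eps)
print(g_a, g_b, summed, numeric)
```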
Convolutional nets for object recognition
- This video is more a collection of interesting success stories than a thorough introduction to new concepts. Sit back and enjoy.
Terminology: “backpropagation” is often used as the name for the combination of two systems:
- System 1: the error backpropagation system that computes gradients.
- System 2: the gradient descent system that uses those gradients to gradually improve the weights and biases of a neural network.
- Most researchers, including Geoffrey, usually mean this combination, when they say “backpropagation”.

Lecture J

Overview of ways to improve generalization
- In the discussion of overfitting, we assume that the bottleneck of our ability to do machine learning is the amount of data that we have; not the amount of training time or computer power that we have.
Limiting the size of the weights
- There is some math in this video. It’s not complicated math. You should make sure to understand it.
Using noise as a regularizer
- First slide
  - This slide serves to show that noise is not a crazy idea.
  - The penalty strength can be thought of as being sigma i squared, or twice that (to compensate for the 1/2 in the weight decay cost function), but that detail is not important here.
- Second slide (the math slide)
  - I don’t entirely like the explanation of this slide, but the formulas are correct.
  - The reason why the middle term is zero is that all of the epsilons have mean zero.
  - You may notice that the result is not exactly like the L2 penalty of the previous video: the factor 1/2 is missing. Or equivalently, the strength of the penalty is not sigma i squared, but twice that. The main point, however, is that this noise is equivalent to an L2 penalty.
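The result of the math slide can be checked by simulation. This sketch assumes a linear neuron with Gaussian noise added to its inputs and uses the squared error without the factor 1/2, so the expected extra error should be the sum of (sigma i squared times w i squared). The weights, noise levels, and training case are made up.

```python
import random

# Hypothetical linear neuron and training case.
weights = [1.0, -2.0]
sigmas = [0.5, 0.3]  # standard deviation of the noise added to each input
inputs, target = [1.0, 1.0], 0.0

clean_output = sum(w * x for w, x in zip(weights, inputs))
clean_error = (clean_output - target) ** 2
penalty = sum((w * s) ** 2 for w, s in zip(weights, sigmas))  # sum of sigma_i^2 * w_i^2

# Average the squared error over many noisy versions of the same input.
rng = random.Random(0)
n_samples = 200_000
total = 0.0
for _ in range(n_samples):
    noisy_inputs = [x + rng.gauss(0.0, s) for x, s in zip(inputs, sigmas)]
    y = sum(w * x for w, x in zip(weights, noisy_inputs))
    total += (y - target) ** 2
noisy_error = total / n_samples

print(noisy_error, clean_error + penalty)  # approximately equal
```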
Jargon: overfitting, underfitting, generalization, and regularization
- Overfitting can be thought of as the model being too confident about what the data is like: more confident than would be justified, given the limited amount of training data that it was trained on.
- If an alien from outer space took one look at a street full of cars (each car being a training case), and it so happened that there were only two Volkswagens there, one dark red and one dark blue, then the alien might conclude “all Volkswagens on Earth are of dark colours.” That would be overfitting.
- If, on the other hand, the alien were so reluctant to draw conclusions that it even failed to conclude that cars typically have four wheels, then that would be underfitting.
- We seek the middle way, where we don’t draw more than a few unjustified conclusions, but we do draw most of the conclusions that really are justified.
- Regularization means forcing the model to draw fewer conclusions, thus limiting overfitting. If we overdo it, we end up underfitting.
Jargon: “generalization” typically means the successful avoidance of both overfitting and underfitting. Since overfitting is harder to avoid, “generalization” often simply means the absence of (severe) overfitting.
- The “accidental regularities” that training data contains are often complicated patterns. However, NNs can learn complicated patterns quite well.
Jargon: “capacity” is learning capacity. It’s the amount of potential (artificial) brain power in a model, and it mostly depends on the number of learned parameters (weights & biases).
Introduction to the full Bayesian approach
- The next two videos are not easy. There’s a lot of math, and not everything is explained in great detail. However, they provide invaluable insights into all regularization techniques. Don’t rush through them; take your time.
- The full Bayesian approach is the ultimate in regularization. The gold standard. However, it takes so much computation time, that we always look for approximations to it.
- The terms “prior”, “likelihood term”, and “posterior” are explained in a more mathematical way at the end of the video, so if you’re confused, just keep in mind that a mathematical explanation follows.
- For the coin example, try not to get confused about the difference between “p” (the probability of seeing heads) and “P” (the abbreviation for “probability”).
- Jargon: “maximum likelihood” means maximizing the likelihood term, without regard to any prior that there may be.
- At 8:22 there’s a slightly incorrect statement in the explanation, though not in the slide. The mean is not at .53 (although it is very close to that). What’s really at .53 is the mode, a.k.a. the peak, a.k.a. the most likely value.
- The Bayesian approach is to average the network’s predictions, at test time, where “average” means that we use network parameters according to the posterior distribution over parameter settings given the training data. Essentially, we’re averaging the predictions from many predictors: each possible parameter setting is a predictor, and the weight for that weighted average is the posterior probability of that parameter setting.
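The difference between the mode and the mean of the posterior in the coin example can be seen with a small grid computation. The counts used here (53 heads and 47 tails) and the uniform prior are assumptions, chosen because they put the peak at .53; check them against the video.

```python
# Assumed data: 53 heads and 47 tails, with a uniform prior over p.
heads, tails = 53, 47

n_grid = 1000
grid = [i / n_grid for i in range(n_grid + 1)]  # candidate values of p
# With a uniform prior, the posterior is proportional to the likelihood term.
unnormalized = [p ** heads * (1.0 - p) ** tails for p in grid]
normalizer = sum(unnormalized)
posterior = [u / normalizer for u in unnormalized]

mode = grid[posterior.index(max(posterior))]         # the peak
mean = sum(p * q for p, q in zip(grid, posterior))   # the posterior mean
print(mode, mean)  # the mean is slightly below the mode
```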
The Bayesian interpretation of weight decay
- In this video, we use Bayesian thinking (which is widely accepted as very reasonable) to justify weight decay (which may sound like an arbitrary hack).
- Maximum A Posteriori (MAP) learning means looking for that setting of the network parameters that has greatest posterior probability given the data.
- As such it’s somewhat different from the simpler “Maximum Likelihood” learning, where we look for the setting of the parameters that has the greatest likelihood term: there, we don’t have a prior over parameter settings, so it’s not very Bayesian at all. Slide 1 introduces Maximum Likelihood learning. Try to understand well what that has to do with the Bayesian “likelihood term”, before going on to the next slide.
- The reason why we use Gaussians for our likelihood and prior is that that makes the math simple, and fortunately it’s not an insane choice to make. However, it is somewhat arbitrary.
- 10:15: Don’t worry about the absence of the factor 1/2 in the weight decay strength. It doesn’t change the story in any essential way.
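Here is a numerical sketch of the link between MAP learning and weight decay, for a model with a single weight w and output w * x. It assumes a Gaussian likelihood with standard deviation sigma_d and a zero-mean Gaussian prior with standard deviation sigma_w. These names and all the numbers are made up for the illustration. Up to a constant and a scale factor, the negative log posterior is the squared error plus a weight penalty with strength sigma_d^2 / sigma_w^2.

```python
import math


def neg_log_gaussian(value, mean, sigma):
    return (0.5 * math.log(2.0 * math.pi * sigma ** 2)
            + (value - mean) ** 2 / (2.0 * sigma ** 2))


def neg_log_posterior_up_to_constant(w, cases, sigma_d, sigma_w):
    likelihood_term = sum(neg_log_gaussian(t, w * x, sigma_d) for x, t in cases)
    prior_term = neg_log_gaussian(w, 0.0, sigma_w)
    return likelihood_term + prior_term  # omits log p(data), which does not depend on w


def weight_decay_cost(w, cases, sigma_d, sigma_w):
    squared_error = sum((t - w * x) ** 2 for x, t in cases)
    return squared_error + (sigma_d ** 2 / sigma_w ** 2) * w ** 2


# Hypothetical training cases (x, t) and standard deviations.
cases = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2)]
sigma_d, sigma_w = 0.5, 2.0

# Compare two weight settings: the two criteria differ only by a fixed scale,
# so they always prefer the same setting.
w1, w2 = 0.8, 1.1
lhs = (neg_log_posterior_up_to_constant(w1, cases, sigma_d, sigma_w)
       - neg_log_posterior_up_to_constant(w2, cases, sigma_d, sigma_w))
rhs = (weight_decay_cost(w1, cases, sigma_d, sigma_w)
       - weight_decay_cost(w2, cases, sigma_d, sigma_w)) / (2.0 * sigma_d ** 2)
print(lhs, rhs)
```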

Lecture K

Why it helps to combine models
- This video is about a very different (and more powerful) method of preventing overfitting.
- There’s, again, a lot of math, although it’s less difficult than in Lecture J. Be sure to understand the formulas before moving on.
- We’re going to combine many models, by using the average of their predictions, at test time.
- 5:38: There’s a mistake in the explanation of why that term disappears.
  - The mistake is that -2(t-ybar) is not a random variable, so it makes no sense to talk about its variance, mean, correlations, etc.
  - The real reason why the term disappears is simply that the right half of the term, i.e. the average over i of (yi - ybar), is zero, because ybar is the mean of the yi values.
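The decomposition behind this can be verified with a few made-up predictions: the average squared error of the individual models equals the squared error of the averaged prediction plus the variance of the predictions, and the cross term is zero.

```python
# Hypothetical predictions y_i from four models, and a target t.
predictions = [2.0, 3.5, 1.5, 4.0]
target = 3.0

n = len(predictions)
ybar = sum(predictions) / n

average_individual_error = sum((target - y) ** 2 for y in predictions) / n
error_of_average = (target - ybar) ** 2
variance = sum((y - ybar) ** 2 for y in predictions) / n
cross_term = -2.0 * (target - ybar) * sum(y - ybar for y in predictions) / n

print(average_individual_error, error_of_average + variance, cross_term)
```

The more the models disagree (the larger the variance), the more we gain by averaging them.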
Mixtures of Experts (optional)
- This is a different way of combining multiple models.
- “Nearest neighbor” is a very simple regression method that’s not a neural network.
- 7:22: The formula is confusing.
  - The idea is a weighted average of squared errors (weighted by those probabilities p_i).
  - That can be written as a weighted expectation, with weights p_i, of (t-y_i)^2; or as a sum of p_i * (t-y_i)^2. The formula on the slide mixes those two notations.
  - On the next slide it’s written correctly.
- 10:03: This formula is not trivial to find, but if you differentiate and simplify, you will find it.
The idea of full Bayesian learning (optional)
- In this video you learn what exactly we want to do with that difficult-to-compute posterior distribution.
- This video shows an ideal method, which is so time-consuming that we can never do it for normal-size neural networks. This is a theory video.
- We average the predictions from many weight vectors on test data, with averaging weights coming from the posterior over weight vectors given the training data.
  - That sounds simple and is indeed, in a sense, what happens.
  - However, there’s more to be said about what this “averaging” entails.
  - The Bayesian approach is all about probabilities, so the idea of producing a single number as output has no place in the Bayesian approach.
  - Instead, the output is a distribution, indicating how likely the net considers every possible output value to be.
  - In Lecture J we introduced the idea that the scalar output from a network really is the mean of such a predictive distribution. We need that idea again here.
    - That is what Geoffrey means at 6:37. “Adding noise to the output” is a way of saying that the output is simply the centre of a predictive distribution.
  - What’s averaged is those distributions: the predictive distribution of the Bayesian approach is the weighted mean of all those Gaussian predictive distributions of the various weight vectors.
    - By the way, the result of this averaging of many such Gaussian distributions is not a Gaussian distribution.
  - However, if we’re only interested in the mean of the predictive distribution (which would not be very Bayesian in spirit), then we can simply average the outputs of the networks to get that mean. You can mathematically verify this for yourself.
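The last two points can be checked on a toy case with just two weight vectors. The posterior weights, network outputs, and output noise level below are made up. The sketch evaluates the averaged predictive distribution on a grid, confirms that its mean is the weighted average of the network outputs, and shows that its density at that mean is lower than elsewhere, which could not happen for a Gaussian.

```python
import math


def gaussian_density(x, mean, sigma):
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical posterior weights and outputs of two networks.
posterior_weights = [0.3, 0.7]
network_outputs = [-2.0, 2.0]
output_noise_sigma = 0.5


def predictive_density(x):
    """Weighted mean of the Gaussian predictive distributions."""
    return sum(p * gaussian_density(x, y, output_noise_sigma)
               for p, y in zip(posterior_weights, network_outputs))


# Simply averaging the network outputs.
averaged_output = sum(p * y for p, y in zip(posterior_weights, network_outputs))

# Mean of the averaged distribution, by numerical integration on a grid.
step = 0.01
grid = [-8.0 + step * i for i in range(1601)]
grid_mean = sum(x * predictive_density(x) for x in grid) * step

print(averaged_output, grid_mean)
print(predictive_density(averaged_output), predictive_density(2.0))
```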
Making full Bayesian learning practical (optional)
- Maximum Likelihood is the least Bayesian.
- Maximum A Posteriori (i.e. using weight decay) is slightly more Bayesian.
- This video introduces a feasible method that’s even closer to the Bayesian ideal. However, it’s necessarily still an approximation.
- 4:22: “save the weights” means recording the current weight vector as a sampled weight vector.
Dropout (optional)
- This is not Bayesian. This is a specific way of adding noise (that idea was introduced in general in Lecture J). It’s a recent discovery and it works very, very well.
- Dropout can be viewed in different ways:
  - One way to view this method is that we add noise.
  - Another more complicated way, which is introduced first in the video, is about weight sharing and different models.
  - That second way to view it serves as the explanation of why adding noise works so well.
- The first slide in other words: a mixture of models involves taking the arithmetic mean (a.k.a. “the mean”) of the outputs, while a product of models involves taking the geometric mean of the outputs, which is a different kind of mean.
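The two kinds of mean give different answers, as this sketch shows for two made-up models that each output a probability distribution over two classes. Renormalizing after the geometric mean is an added step in this sketch, needed because geometric means of probabilities do not generally sum to 1.

```python
import math


def mixture(distributions):
    """Arithmetic mean of the models' output probabilities, class by class."""
    n = len(distributions)
    return [sum(ps) / n for ps in zip(*distributions)]


def product(distributions):
    """Geometric mean of the output probabilities, class by class, renormalized."""
    n = len(distributions)
    geometric = [math.prod(ps) ** (1.0 / n) for ps in zip(*distributions)]
    total = sum(geometric)
    return [g / total for g in geometric]


# Hypothetical outputs of two models.
model_a = [0.8, 0.2]
model_b = [0.5, 0.5]

arithmetic_result = mixture([model_a, model_b])
geometric_result = product([model_a, model_b])
print(arithmetic_result, geometric_result)
```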