CSC321 Winter 2014 - Lecture notes



These are Tijmen's comments on Geoff's lecture videos.

January 9 and January 14
  • Lecture 1e: Three types of learning
    • Pay extra attention to supervised learning and its mathematical definition, because that's what we're doing for the first half of the course.
  • Lecture 2a: Types of neural network architectures
    • Pay extra attention to feedforward networks, because that's what we'll be doing for the first half of the course.
  • Lecture 2b: Perceptrons: The first generation of neural networks
    • Keep in mind the analogy with neurons and synapses.
    • Think about which parts are learned and which aren't, and ask yourself why, even if you don't find an answer.
    • Try to fully understand why the bias can be implemented as a special input unit.
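    • A worked check of the bias trick (my notation, not taken from the video): a unit computes z = b + sum_i w_i * x_i. If we append one extra input x_0 that is always 1 and give it the weight w_0 = b, then z = sum_{i=0..n} w_i * x_i, so the bias has become an ordinary weight on a constant input, and the same learning rule that adjusts the other weights adjusts it too.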
  • Synonyms: neuron; unit; feature.
    • "neuron" emphasizes the analogy with real brains.
    • "unit" emphasizes that it's one component of a large network.
    • "feature" emphasizes that it represents (implements) a feature detector that's looking at the input and will turn on iff the sought feature is present in the input.
  • Synonyms: a unit's value; a unit's activation; a unit's output. Note that a unit's input is something else.
    • "value" emphasizes that we can think of it as a variable, or a function of the input.
    • "activation" emphasizes that the unit may be responding or not, or to an extent; it's most appropriate for logistic units, and it might emphasize the analogy with real brains.
    • "output" emphasizes that it's different from the input.
January 16
  • Lecture 2c: A geometrical view of perceptrons
    • If you're not too experienced with geometry and its math, then this is going to challenge your imagination. Take your time. After you understand this video, the other two will be easier than this one.
    • It's about high-dimensional spaces. A few basic facts about those:
      • A point (a.k.a. location) and an arrow from the origin to that point, are often used interchangeably. It can be called a location or a vector.
      • A hyperplane is the high-dimensional equivalent of a plane in 3-D. In 2-D, it's a line.
      • The slides that show an image of "weight space" use a 2-D weight space, so that it's easy to draw. The same ideas apply in high-D.
      • The "scalar product" between two vectors is what you get when you multiply them element-wise and then add up those products. It's also known as "inner product". The scalar product between two vectors that have an angle of less than 90 degrees between them is positive. For more than 90 degrees it's negative.
    • If you're not that sure about the story of this video after watching it, watch it again. Understanding it is a prerequisite for the next video.
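    • A small numeric example of the scalar product (my own numbers, just to make the sign rule concrete): for w = (2, 1) and x = (1, 3), the scalar product is 2*1 + 1*3 = 5, which is positive, and indeed the angle between the two vectors is less than 90 degrees. For w = (2, 1) and x = (-1, 1), it is 2*(-1) + 1*1 = -1, which is negative, and the angle is more than 90 degrees.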
  • Lecture 2d: Why the learning works
    • Here, using the geometrical interpretation, a proof is presented of why the perceptron learning algorithm works.
    • The details are not all spelled out. After watching the video, try to tell the story to someone else (or to a wall) in your own words, if possible with more details. That's the best way to study anyway.
  • Lecture 2e: What perceptrons can't do
    • This story motivates the need for more powerful networks.
    • These ideas will be important in future lectures, when we're working on moving beyond these limitations.
  • Synonyms: "input case"; "training case"; "training example"; "training point"; and sometimes even "input" (that's definitely wrong though).
    • "input case" and "input" emphasize that this is given to the neural network, instead of being demanded of the network (like the answer to a test case).
      • "input" is ambiguous, because more often, "input" is short for "input neuron".
    • "training case" is the most commonly used and is quite generic.
    • "training example" emphasizes the analogy with human learning: we learn from examples.
    • "training point" emphasizes that it's a location in a high-dimensional space.
January 21
  • Lecture 3a: Learning the weights of a linear neuron
    • This video introduces lots of new ideas, and is a big prerequisite for understanding the other two videos (and in fact the rest of the course).
    • This video introduces a different type of output neuron.
    • Again, we have a proof of convergence, but it's a different proof. It doesn't require the existence of a perfect weight vector.
    • "residual error" really means "error" or "residual": it's the amount by which we got the answer wrong.
    • A very central concept is introduced without being made very explicit: we use derivatives for learning, i.e. for making the weights better. Try to understand why those concepts are indeed very related.
    • "on-line" learning means that we change the weights after every training example that we see, and we typically cycle through the collection of available training examples.
  • Lecture 3b: The error surface for a linear neuron
    • A lot of geometry again, much like in video 2c about perceptrons. These types of analysis are the best tool that we have for understanding what a learning rule is doing. This is not easy.
    • In the image, we use two weights and two training cases. Those two numbers just happen to be equal; it's not that one weight corresponds to one training case and the other weight to the other training case.
  • Lecture 3c: Learning the weights of a logistic output neuron
    • This one is easier than the other two: it has far fewer new concepts.
    • Think about what's different from the case with linear neurons, and what's the same.
    • The error function is still E = 1/2 * (y-t)^2
    • Notice how after Geoff explained what the derivative is for a logistic unit, he considers the job to be done. That's because the learning rule is always simply some learning rate multiplied by the derivative.
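    • The derivative chain for this case, written out (standard calculus, not copied from the slides): with z = b + sum_i w_i * x_i, y = logistic(z), and E = 1/2 * (y-t)^2, we get dE/dy = (y-t), dy/dz = y * (1-y), and dz/dw_i = x_i, so dE/dw_i = (y-t) * y * (1-y) * x_i. The learning rule is then delta w_i = -learning_rate * dE/dw_i.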
  • Synonyms: "loss (function)"; "error (function)"; "objective (function) (value)".
    • "loss" emphasizes that we're minimizing it, without saying much about what the meaning of the number is.
    • "error" emphasizes that it's the extent to which the network gets things wrong.
    • "objective function" is very generic. This is the only one where it's not clear whether we're minimizing or maximizing it.
January 23
  • Lecture 3d: The backpropagation algorithm
    • Here, we start using hidden layers. To train them, we need the backpropagation algorithm.
    • Hidden layers, and this algorithm, are very important in this course. If there's any confusion about this, it's worth resolving soon.
    • The story of training by perturbations serves mostly as motivation for using backprop, and is not as central as the rest of the video.
    • This computation, just like the forward propagation, can be vectorized across multiple units in every layer, and multiple training cases.
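    • A minimal vectorized sketch of that computation (one hidden layer, logistic hidden units, linear outputs, squared error; my own variable names, biases omitted):

        import numpy as np

        def logistic(z):
            return 1.0 / (1.0 + np.exp(-z))

        def forward_backward(X, T, W1, W2):
            # X: (cases, inputs), T: (cases, outputs), W1: (inputs, hidden), W2: (hidden, outputs)
            H = logistic(X @ W1)            # hidden activities, all cases at once
            Y = H @ W2                      # linear output units
            E = 0.5 * np.sum((Y - T) ** 2)  # total squared error

            dE_dY = Y - T                   # derivative at the outputs
            dE_dW2 = H.T @ dE_dY            # gradient for hidden-to-output weights
            dE_dH = dE_dY @ W2.T            # backpropagate to the hidden activities
            dE_dZ1 = dE_dH * H * (1 - H)    # through the logistic nonlinearity
            dE_dW1 = X.T @ dE_dZ1           # gradient for input-to-hidden weights
            return E, dE_dW1, dE_dW2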
  • Lecture 3e: Using the derivatives computed by backpropagation
    • Here, two topics (optimization and regularization) are introduced, to be further explored later on in the course.
January 28
  • Lecture 4a: Learning to predict the next word
    • Now that we have the basic method for creating hidden layers (backprop), we're going to see what can be achieved with them. We start to ask how the network learns to use its hidden units, with a toy application to family trees and a real application to language modeling.
    • This material forms the basis of assignment 1.
    • This video introduces distributed representations. It's not actually about predicting words, but it's building up to that.
    • It does a great job of looking inside the brain of a neural network. That's important, but not always easy to do.
  • Lecture 4b: A brief diversion into cognitive science
    • This video is part of the course, i.e. it's not optional, despite what Geoff says in the beginning of the video.
    • This video gives a high-level interpretation of what's going on in the family tree network.
    • This video contrasts two types of inference:
      • Conscious inference, based on relational knowledge.
      • Unconscious inference, based on distributed representations.
  • Lecture 4c: Another diversion: The softmax output function
    • This is not really a diversion: it's a crucial ingredient of language models, and many other neural networks.
    • We've seen binary threshold output neurons and logistic output neurons. This video presents a third type. This one only makes sense if we have multiple output neurons.
    • The first "problem with squared error" is a problem that shows up when we're combining the squared error loss function with logistic output units. The logistic has small gradients, if the input is very positive or very negative.
  • Written material: The math of softmax units
    • This goes over softmax units in more detail, including derivatives and detailed derivations.
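    • For reference, the two formulas at the heart of that material (standard definitions, in the course's plain notation): the softmax output is y_i = exp(z_i) / sum_j exp(z_j), the cross-entropy cost is C = - sum_j t_j * log(y_j), and combining them (for targets that sum to 1) gives the pleasantly simple derivative dC/dz_i = y_i - t_i.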
  • Lecture 4d: Neuro-probabilistic language models
    • This is the first of several applications of neural networks that we'll be studying in some detail, in this course.
  • Synonyms: word embedding; word feature vector; word encoding.
    • All of these describe the learned collection of numbers that is used to represent a word.
    • "embedding" emphasizes that it's a location in a high-dimensional space: it's where the words are embedded in that space. When we check to see which words are close to each other, we're thinking about that embedding.
    • "feature vector" emphasizes that it's a vector instead of a scalar, and that it's componential, i.e. composed of multiple feature values.
    • "encoding" is very generic and doesn't emphasize anything specific.
January 30
  • Lecture 4e: Ways to deal with the large number of possible outputs
    • Way 1: a serial architecture, based on trying candidate next words, using feature vectors (like in the family example). This means fewer parameters, but still a lot of work.
    • Way 2: using a binary tree.
    • Way 3: Collobert & Weston's search for good feature vectors for words, without trying to predict the next word in a sentence.
    • Displaying learned feature vectors. Pretty picture!
February 4
  • Lecture 5a: Why object recognition is difficult
    • We're switching to a different application of neural networks: computer vision, i.e. having a computer really understand what an image is showing.
    • This video explains why it is difficult for a computer to go from an image (i.e. the color and intensity for each pixel in the image) to an understanding of what it's an image of.
    • Some of this discussion is about images of 2-dimensional objects (writing on paper), but most of it is about photographs of 3-D real-world scenes.
    • Make sure that you understand the last slide:
      • It explains how switching age and weight is like an object moving over to a different part of the image (to different pixels).
      • These two might sound like very different situations, but the analogy is in fact quite good: they're not really very different.
      • Understanding this is prerequisite for especially the next video.
  • Lecture 5b: Achieving viewpoint invariance
    • "invariant" means, literally, that it doesn't vary: it doesn't change as a result of a change of viewpoint.
      • This means that if the neuron for the feature detector is fairly active (say it's a logistic neuron and it has a value close to 1) for one input image, then if we give the neural network an image of that same scene from a somewhat different viewpoint, that same neuron will still be fairly active. Its activity is invariant under viewpoint changes.
      • "invariant" is a matter of degrees: there's very little that's completely invariant, or that has no invariance at all, but some things are more invariant than others.
    • The invariant features are things like "there's a red circle somewhere in the image", and the neuron for that feature detector should somehow learn to turn on when there is indeed a red circle in the input, and turn off if there isn't.
    • Try to come up with examples of features that are largely invariant under viewpoint changes, and examples of features that don't have that property.
  • Lecture 5c: Convolutional nets for digit recognition
    • Like many of the stories which we tell with the application of recognizing handwritten digits, this one, too, is applicable to a great variety of vision tasks.
      • It's just that handwritten digit recognition is a standard example for neural networks.
    • Convolutional nets are still very much used.
    • The slide "Backpropagation with weight constraints" can be confusing. Here are some clarifications. (note that not every researcher uses the same definitions)
      • Error Backpropagation (a.k.a. "backpropagation" or "backprop") is an algorithm that cleverly uses the chain rule to calculate gradients for neural networks. It doesn't really care about weight constraints.
      • What does care about weight constraints is the optimizer: the system that, bit by bit, changes the weights & biases of the network to reduce the error, and that uses the gradient (obtained by backprop) to figure out in which direction to change the weights.
      • The gradient for two weights will typically not be the same, even if they're two weights that we'd like to keep equal.
      • The optimizer can keep the "tied" weights the same in at least two ways.
      • One way is to use the sum of the gradients of the various "instances" of the tied weights as if it were the gradient for each of the instances. That's what the video describes. (A small sketch of this bookkeeping appears at the end of these clarifications.)
      • Another way is to use the mean instead of the sum.
      • Both methods have their advantages.
      • The main point of this is that it's not the gradients that change if we have convolution; what changes is what we do with the gradients.
      • Another interpretation is to say that there really aren't two (or more) weights that we're trying to keep equal, but that there's really only one parameter that shows up in two (or more) places in the network.
        • That's the more mathematical interpretation.
        • It favours using the sum of gradients instead of the mean (you can try to figure out why, if you're feeling mathematical).
        • This interpretation is also closer to what typically happens in the computer program that runs the convolutional neural net.
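      • A tiny sketch of that bookkeeping (my own made-up numbers and names):

          # w_a and w_b are two "instances" of the same tied weight;
          # backprop gives a separate gradient for each instance.
          grad_a, grad_b = 0.3, -0.1
          shared_grad = grad_a + grad_b        # the "sum" option described in the video
          # shared_grad = (grad_a + grad_b)/2  # the "mean" option instead

          learning_rate, w_a, w_b = 0.01, 0.5, 0.5   # the instances start out equal...
          w_a -= learning_rate * shared_grad         # ...receive the same update...
          w_b -= learning_rate * shared_grad         # ...and therefore stay equal.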
  • Lecture 5d: Convolutional nets for object recognition
    • This video is more a collection of interesting success stories than a thorough introduction to new concepts. Sit back and enjoy.
  • Terminology: "backpropagation" is often used as the name for the combination of two systems:
    • System 1: the error backpropagation system that computes gradients.
    • System 2: the gradient descent system that uses those gradients to gradually improve the weights and biases of a neural network.
    • Most researchers, including Geoffrey, usually mean this combination, when they say "backpropagation".
February 6
  • Lecture 6a: Overview of mini-batch gradient descent
    • Now we're going to discuss numerical optimization: how best to adjust the weights and biases, using the gradient information from the backprop algorithm.
    • This video elaborates on the most standard neural net optimization algorithm (mini-batch gradient descent), which we've seen before.
    • We're elaborating on some issues introduced in video 3e.
  • Lecture 6b: A bag of tricks for mini-batch gradient descent
    • Part 1 is about transforming the data to make learning easier.
      • At 1:10, there's a comment about random weights and scaling. The "it" in that comment is the average size of the input to the unit.
      • At 1:15, the "good principle": what he means is INVERSELY proportional.
      • At 4:38, Geoff says that the hyperbolic tangent is twice the logistic minus one. This is not true, but it's almost true. As an exercise, find out what's missing in that equation.
      • At 5:08, Geoffrey suggests that with a hyperbolic tangent unit, it's more difficult to sweep things under the rug than with a logistic unit. I don't understand his comment, so if you don't either, don't worry. This comment is not essential in this course: we're never using hyperbolic tangents in this course.
    • Part 2 is about changing the stochastic gradient descent algorithm in sophisticated ways. We'll look into these four methods in more detail, later on in the course.
  • Jargon: "stochastic gradient descent" is mini-batch or online gradient descent.
    • The term emphasizes that it's not full-batch gradient descent.
    • "stochastic" means that it involves randomness. However, this algorithm typically does not involve randomness.
    • However, it would be truly stochastic if we would randomly pick 100 training cases from the entire training set, every time we need the next mini-batch.
    • We call traditional "stochastic gradient descent" stochastic because it is, in effect, very similar to that truly stochastic version.
  • Jargon: a "running average" is a weighted average over the recent past, where the most recent past is weighted most heavily.
February 11
  • Lecture 6c: The momentum method
    • Now we're going to take a more thorough look at some of the tricks suggested in video 6b.
    • The biggest challenge in this video is to think of the error surface as a mountain landscape. If you can do that, and you understand the analogy well, this video will be easy.
      • You may have to go back to video 3b, which introduces the error surface.
      • Important concepts in this analogy: "ravine", "a low point on the surface", "oscillations", "reaching a low altitude", "rolling ball", "velocity".
      • All of those have meaning on the "mountain landscape" side of the analogy, as well as on the "neural network learning" side of the analogy.
      • The meaning of "velocity" in the "neural network learning" side of the analogy is the main idea of the momentum method.
    • Vocabulary: the word "momentum" can be used with three different meanings, so it's easy to get confused.
      • It can mean the momentum method for neural network learning, i.e. the idea that's introduced in this video. This is the most appropriate meaning of the word.
      • It can mean the viscosity constant (typically 0.9), sometimes called alpha, which is used to reduce the velocity.
      • It can mean the velocity. This is not a common meaning of the word.
    • Note that one may equivalently choose to include the learning rate in the calculation of the update from the velocity, instead of in the calculation of the velocity.
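    • The two equivalent forms, written out (standard notation; alpha is the viscosity constant, epsilon the learning rate):
        velocity = alpha * velocity - epsilon * gradient;  weight change = velocity
      or, keeping the learning rate out of the velocity:
        velocity = alpha * velocity - gradient;  weight change = epsilon * velocity
      Starting from zero velocity and using the same alpha, both produce exactly the same sequence of weights.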
  • Lecture 6d: Adaptive learning rates for each connection
    • This is really "for each parameter", i.e. biases as well as connection strengths.
    • Vocabulary: a "gain" is a multiplier.
    • This video introduces a basic idea (see the video title), with a simple implementation.
      • In the next video, we'll see a more sophisticated implementation.
    • You might get the impression from this video that the details of how best to use such methods are not universally agreed on. That's true. It's research in progress.
  • Lecture 6e: Rmsprop: Divide the gradient by a running average of its recent magnitude
    • This is another method that treats every weight separately.
    • rprop uses the method of video 6d, and in addition it only looks at the sign of the gradient.
    • Make sure to understand how momentum is like using a (weighted) average of past gradients.
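    • The rmsprop update written out (as I understand the video; the 0.9 and 0.1 are the decay weights used there, and the small constant inside the square root is my addition, to avoid dividing by zero):
        mean_square = 0.9 * mean_square + 0.1 * gradient^2      (a running average of the squared gradient, kept separately for every weight)
        weight = weight - learning_rate * gradient / sqrt(mean_square + tiny)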
  • Synonyms: "moving average", "running average", "decaying average".
    • All of these describe the same method of getting a weighted average of past observations, where recent observations are weighted more heavily than older ones.
      • That method is shown in video 6e at 5:04. (there, it's a running average of the square of the gradient)
    • "moving average" and "running average" are fairly generic. "running average" is the most commonly used phrase.
    • "decaying average" emphasizes the method that's used to compute it: there's a decay factor in there, like the alpha in the momentum method.
February 13
  • Lecture 7a: Modeling sequences: A brief overview
    • This video talks about some advanced material that will make a lot more sense after you complete the course: it introduces some generative models for unsupervised learning (see video 1e), namely Linear Dynamical Systems and Hidden Markov Models. These are neural networks, but they're very different in nature from the deterministic feedforward networks that we've been studying so far. For now, don't worry if those two models feel rather mysterious.
    • However, Recurrent Neural Networks are the next topic of the course, so make sure that you understand them.
  • Lecture 7b: Training RNNs with back propagation
    • Most important prerequisites to perhaps review: videos 3d and 5c (about backprop with weight sharing).
    • After watching the video, think about how such a system can be used to implement the brain of a robot as it's producing a sentence of text, one letter at a time. What would be input; what would be output; what would be the training signal; which units at which time slices would represent the input & output?
February 25
  • Lecture 7c: A toy example of training an RNN
    • Clarification at 3:33: there are two input units. Do you understand what each of those two is used for?
    • The hidden units, in this example, as in most neural networks, are logistic. That's why it's somewhat reasonable to talk about binary states: those are the extreme states.
  • Lecture 7d: Why it is difficult to train an RNN
    • This is all about backpropagation with logistic hidden units. If necessary, review video 3d and the example that we studied in class.
    • Remember that Geoffrey explained in class how the backward pass is like an extra long linear network? That's the first slide of this video.
    • Echo State Networks: At 6:36, "oscillator" describes the behavior of a hidden unit (i.e. the activity of the hidden unit oscillates), just like we often use the word "feature" to functionally describe a hidden unit.
    • Echo State Networks: like when we were studying perceptrons, the crucial question here is what's learned and what's not learned. ESNs are like perceptrons with randomly created inputs.
    • At 7:42: the idea is good initialization with subsequent learning (using backprop's gradients and stochastic gradient descent with momentum as the optimizer).
  • Lecture 7e: Long Short-Term Memory
    • This video is about a solution to the vanishing or exploding gradient problem. Make sure that you understand that problem first, because otherwise this video won't make much sense.
    • The material in this video is quite advanced.
    • In the diagram of the memory cell, there's a somewhat new type of connection: a multiplicative connection.
      • It's shown as a triangle.
      • It can be thought of as a connection of which the strength is not a learned parameter, but is instead determined by the rest of the neural network, and is therefore probably different for different training cases.
        • This is the interpretation that Geoffrey uses when he explains backpropagation through time through such a memory cell.
      • That triangle can, alternatively, be thought of as a multiplicative unit: it receives input from two different places, it multiplies those two numbers, and it sends the product somewhere else as its output.
        • Which two of the three lines indicate input and which one indicates output is not shown in the diagram, but is explained.
    • In Geoffrey's explanation of row 4 of the video, "the most active character" means the character that the net, at this time, considers most likely to be the next character in the character string, based on what the pen is doing.
February 27
  • Lecture 9a: Overview of ways to improve generalization
    • In the discussion of overfitting, we assume that the bottleneck of our ability to do machine learning is the amount of data that we have; not the amount of training time or computer power that we have.
  • Lecture 9b: Limiting the size of the weights
    • There is some math in this video. It's not complicated math. You should make sure to understand it.
  • Lecture 9c: Using noise as a regularizer
    • First slide
      • This slide serves to show that noise is not a crazy idea.
      • The penalty strength can be thought of as being sigma i squared, or twice that (to compensate for the 1/2 in the weight decay cost function), but that detail is not important here.
    • Second slide (the math slide)
      • I don't entirely like the explanation of this slide, but the formulas are correct.
      • The reason why the middle term is zero is that all of the epsilons have mean zero.
      • You may notice that the result is not exactly like the L2 penalty of the previous video: the factor 1/2 is missing. Or equivalently, the strength of the penalty is not sigma i squared, but twice that. The main point, however, is that this noise is equivalent to an L2 penalty.
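      • The expansion on that slide in compact form: for a linear output y = sum_i w_i * x_i, with independent zero-mean Gaussian noise eps_i of variance sigma_i^2 added to each input, the expected squared error becomes
          E[ (y + sum_i w_i * eps_i - t)^2 ] = (y - t)^2 + sum_i w_i^2 * sigma_i^2.
        The cross term vanishes because the eps_i have mean zero, and the extra term acts exactly like an L2 penalty on the weights (up to the factor-of-two bookkeeping discussed above).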
  • Jargon: overfitting, underfitting, generalization, and regularization
    • Overfitting can be thought of as the model being too confident about what the data is like: more confident than would be justified, given the limited amount of training data that it was trained on.
    • If an alien from outer space took one look at a street full of cars (each car being a training case), and it so happened that there were only two Volkswagens there, one dark red and one dark blue, then the alien might conclude "all Volkswagens on Earth are of dark colours." That would be overfitting.
    • If, on the other hand, the alien were so reluctant to draw conclusions that it even failed to conclude that cars typically have four wheels, then that would be underfitting.
    • We seek the middle way, where we don't draw more than a few unjustified conclusions, but we do draw most of the conclusions that really are justified.
    • Regularization means forcing the model to draw fewer conclusions, thus limiting overfitting. If we overdo it, we end up underfitting.
    • Jargon: "generalization" typically means the successful avoidance of both overfitting and underfitting. Since overfitting is harder to avoid, "generalization" often simply means the absence of (severe) overfitting.
    • The "accidental regularities" that training data contains are often complicated patterns. However, NNs can learn complicated patterns quite well.
  • Jargon: "capacity" is learning capacity. It's the amount of potential (artificial) brain power in a model, and it mostly depends on the number of learned parameters (weights & biases).
March 4
  • Lecture 9d: Introduction to the full Bayesian approach
    • Videos 9d and 9e are not easy. There's a lot of math, and not everything is explained in great detail. However, they provide invaluable insights into all regularization techniques. Don't rush through them; take your time.
    • The full Bayesian approach is the ultimate in regularization. The gold standard. However, it takes so much computation time, that we always look for approximations to it.
    • The terms "prior", "likelihood term", and "posterior" are explained in a more mathematical way at the end of the video, so if you're confused, just keep in mind that a mathematical explanation follows.
    • For the coin example, try not to get confused about the difference between "p" (the probability of seeing heads) and "P" (the abbreviation for "probability").
    • Jargon: "maximum likelihood" means maximizing the likelihood term, without regard to any prior that there may be.
    • At 8:22 there's a slightly incorrect statement in the explanation, though not in the slide. The mean is not at .53 (although it is very close to that). What's really at .53 is the mode, a.k.a. the peak, a.k.a. the most likely value.
    • The Bayesian approach is to average the network's predictions, at test time, where "average" means that we use network parameters according to the posterior distribution over parameter settings given the training data. Essentially, we're averaging the predictions from many predictors: each possible parameter setting is a predictor, and the weight for that weighted average is the posterior probability of that parameter setting.
  • Lecture 9e: The Bayesian interpretation of weight decay
    • In this video, we use Bayesian thinking (which is widely accepted as very reasonable) to justify weight decay (which may sound like an arbitrary hack).
    • Maximum A Posteriori (MAP) learning means looking for that setting of the network parameters that has greatest posterior probability given the data.
    • As such it's somewhat different from the simpler "Maximum Likelihood" learning, where we look for the setting of the parameters that has the greatest likelihood term: there, we don't have a prior over parameter settings, so it's not very Bayesian at all. Slide 1 introduces Maximum Likelihood learning. Try to understand well what that has to do with the Bayesian "likelihood term", before going on to the next slide.
    • The reason why we use Gaussians for our likelihood and prior is that that makes the math simple, and fortunately it's not an insane choice to make. However, it is somewhat arbitrary.
    • 10:15: Don't worry about the absence of the factor 1/2 in the weight decay strength. It doesn't change the story in any essential way.
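    • The argument of the video, compressed into one line (Gaussian likelihood with variance sigma_D^2, Gaussian prior on the weights with variance sigma_W^2):
        - log p(W | data) = 1/(2*sigma_D^2) * sum_c (y_c - t_c)^2 + 1/(2*sigma_W^2) * sum_i w_i^2 + constant
      Minimizing this is squared-error training plus weight decay, with the ratio sigma_D^2 / sigma_W^2 playing the role of the weight decay strength.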
  • Lecture 10a: Why it helps to combine models
    • This video is about a very different (and more powerful) method of preventing overfitting.
    • There's, again, a lot of math, although it's less difficult than in videos 9d and 9e. Be sure to understand the formulas before moving on.
    • We're going to combine many models, by using the average of their predictions, at test time.
    • 5:38: There's a mistake in the explanation of why that term disappears.
      • The mistake is that -2(t-ybar) is not a random variable, so it makes no sense to talk about its variance, mean, correlations, etc.
      • The real reason why the term disappears is simply that the right half of the term, i.e. the average over i of (y_i - ybar), is zero, because ybar is the mean of the y_i values.
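      • Written out, with ybar the mean of the y_i (standard algebra, not taken from the slide): mean_i[ (t - y_i)^2 ] = (t - ybar)^2 + mean_i[ (y_i - ybar)^2 ] - 2 * (t - ybar) * mean_i[ y_i - ybar ], and the last term is zero precisely because mean_i[ y_i - ybar ] = 0.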
  • Lecture 10b: Mixtures of Experts
    • This is a different way of combining multiple models.
    • "Nearest neighbor" is a very simple regression method that's not a neural network.
    • 7:22: The formula is confusing.
      • The idea is a weighted average of squared errors (weighted by those probabilities p_i).
      • That can be written as a weighted expectation, with weights p_i, of (t-y_i)^2; or as a sum of p_i * (t-y_i)^2. The formula on the slide mixes those two notations.
      • On the next slide it's written correctly.
    • 10:03: This formula is not trivial to find, but if you differentiate and simplify, you will find it.
March 6
  • Lecture 10c: The idea of full Bayesian learning
    • In this video you learn what exactly we want to do with that difficult-to-compute posterior distribution.
    • This video shows an ideal method, which is so time-consuming that we can never do it for normal-size neural networks. This is a theory video.
    • We average the predictions from many weight vectors on test data, with averaging weights coming from the posterior over weight vectors given the training data.
      • That sounds simple and is indeed, in a sense, what happens.
      • However, there's more to be said about what this "averaging" entails.
      • The Bayesian approach is all about probabilities, so the idea of producing a single number as output has no place in the Bayesian approach.
      • Instead, the output is a distribution, indicating how likely the net considers every possible output value to be.
      • In video 9e we introduced the idea that the scalar output from a network really is the mean of such a predictive distribution. We need that idea again here.
        • That is what Geoffrey means at 6:37. "Adding noise to the output" is a way of saying that the output is simply the centre of a predictive distribution.
      • What's averaged is those distributions: the predictive distribution of the Bayesian approach is the weighted mean of all those Gaussian predictive distributions of the various weight vectors.
        • By the way, the result of this averaging of many such Gaussian distributions is not a Gaussian distribution.
      • However, if we're only interested in the mean of the predictive distribution (which would not be very Bayesian in spirit), then we can simply average the outputs of the networks to get that mean. You can mathematically verify this for yourself.
  • Lecture 10d: Making full Bayesian learning practical
    • Maximum Likelihood is the least Bayesian.
    • Maximum A Posteriori (i.e. using weight decay) is slightly more Bayesian.
    • This video introduces a feasible method that's even closer to the Bayesian ideal. However, it's necessarily still an approximation.
    • 4:22: "save the weights" means recording the current weight vector as a sampled weight vector.
  • Lecture 10e: Dropout
    • This is not Bayesian. This is a specific way of adding noise (that idea was introduced in general in video 9c). It's a recent discovery and it works very, very well.
    • Dropout can be viewed in different ways:
      • One way to view this method is that we add noise.
      • Another more complicated way, which is introduced first in the video, is about weight sharing and different models.
      • That second way to view it serves as the explanation of why adding noise works so well.
    • The first slide in other words: a mixture of models involves taking the arithmetic mean (a.k.a. "the mean") of the outputs, while a product of models involves taking the geometric mean of the outputs, which is a different kind of mean.
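    • A minimal sketch of the mechanics of dropout (my own code; it scales the activities down at test time, which has the same effect as halving the outgoing weights as described in the video):

        import numpy as np

        rng = np.random.default_rng(0)

        def hidden_with_dropout(h, training, p_drop=0.5):
            # h: vector of hidden activities for one case
            if training:
                mask = rng.random(h.shape) >= p_drop  # keep each hidden unit with probability 0.5
                return h * mask                       # dropped units contribute nothing
            else:
                return h * (1 - p_drop)               # test time: use all units, scaled down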
March 13
  • Lecture 11a: Hopfield Nets
    • Now, we leave behind the feedforward deterministic networks that are trained with backpropagation gradients. We're going to see quite a variety of different neural networks now.
      • These networks do not have output units.
      • These networks have units that can only be in states 0 and 1.
      • These networks do not have units of which the state is simply a function of the state of other units.
      • These networks are, instead, governed by an "energy function".
    • Best way to really understand Hopfield networks: Go through the example of the Hopfield network finding a low energy state, by yourself. Better yet, think of different weights, and do the exercise with those.
    • Typically, we'll use Hopfield networks where the units have state 0 or 1; not -1 or 1.
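    • If you want to check your hand calculation, here is a tiny sketch of that exercise (my own made-up weights, binary states 0/1 as in the course):

        import numpy as np

        W = np.array([[ 0., -4.,  3.],
                      [-4.,  0.,  2.],   # symmetric weights, zero diagonal
                      [ 3.,  2.,  0.]])
        b = np.array([0.5, -1.0, 0.0])   # biases
        s = np.array([1., 1., 0.])       # a starting configuration of binary states

        def energy(s):
            return -0.5 * s @ W @ s - b @ s   # the 0.5 avoids counting each pair of units twice

        for sweep in range(10):               # settle by updating one unit at a time
            for i in range(len(s)):
                s[i] = 1.0 if W[i] @ s + b[i] > 0 else 0.0   # a threshold update never raises the energy
            print(sweep, s, energy(s))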
  • Lecture 11b: Dealing with spurious minima
    • The last in-video question is not easy. Try to understand how the perceptron learning procedure is used in a Hopfield net; it's not very thoroughly explained.
  • Lecture 11c: Hopfield nets with hidden units
    • This video introduces some sophisticated concepts, and is not entirely easy.
    • An "excitatory connection" is a connection of which the weight is positive. "inhibitory", likewise, means a negative weight.
    • We look for an energy minimum, "given the state of the visible units". That means that we look for a low energy configuration, and we'll consider only configurations in which the visible units are in the state that's specified by the data. So we're only going to consider flipping the states of the hidden units.
    • Be sure to really understand the last two sentences that Geoffrey speaks in this video.
March 18
  • Lecture 11d: Using stochastic units to improve search
    • We're still working with a mountain landscape analogy.
      • This time, however, it's not an analogy for parameter space, but for state space.
      • A particle is, therefore, not a weight vector, but a configuration.
      • What's the same is that we're, in a way, looking for low points in the landscape.
    • We're also using the physics analogy of systems that can be in different states, each with their own energy, and subject to a temperature.
      • This analogy is introduced in slide 2.
      • This is the analogy that originally inspired Hopfield networks.
      • The idea is that at a high temperature, the system is more inclined to transition into configurations with high energy, even though it still prefers low energy.
    • 3:25: "the amount of noise" means the extent to which the decisions are random.
    • 4:20: If T really were 0, we'd have division by zero, which is not good. What we really mean here is "as T gets really, really small (but still positive)".
      • For mathematicians: it's the limit as T goes to zero from above.
    • Thermal equilibrium, and this whole random process of exploring states, is much like the exploration of weight vectors that we can use in Bayesian methods. It's called a Markov Chain, in both cases.
  • Lecture 11e: How a Boltzmann machine models data
    • Now, we're making a generative model of binary vectors. In contrast, mixtures of Gaussians are a generative model of real-valued vectors.
    • 4:38: Try to understand how a mixture of Gaussians is also a causal generative model.
    • 4:58: A Boltzmann Machine is an energy-based generative model.
    • 5:50: Notice how this is the same as the earlier definition of energy. What's new is that it's mentioning visible and hidden units separately, instead of treating all units the same way.
  • Lecture 12a: Boltzmann machine learning
    • 6:50: Clarification: The energy is linear in the weights, but quadratic in the states. What matters for this argument is just that it's linear in the weights.
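    • To make that concrete: in E = - sum_{i<j} w_ij * s_i * s_j - sum_i b_i * s_i, each weight w_ij appears only once and to the first power (linear in the weights), while the states appear in products of two (quadratic in the states). So dE/dw_ij = -s_i * s_j, which is the quantity that the learning rule in this video is built around.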
March 20
  • Lecture 12c: Restricted Boltzmann Machines
    • 3:02. Here, a "particle" is a configuration. These particles are moving around the configuration space, which, when considered with the energy function, is our mountain landscape.
    • 4:58. It's called a reconstruction because it's based on the visible vector at t=0 (via the hidden vector at t=0). It will, typically, be quite similar to the visible vector at t=0.
    • A "fantasy" configuration is one drawn from the model distribution by running a Markov Chain for a long time.
      • The word "fantasy" is chosen as part of the analogy of a Boltzmann Machine vs. a brain that learned several memories.
  • Lecture 12d: An example of RBM learning
    • This is not an easy video. Prerequisite is a rather extensive understanding of what an RBM does. Be sure to understand video 12c quite well before proceeding with 12d.
    • Prerequisite for this video is that you understand the "reconstruction" concept of the previous video.
    • The first slide is about an RBM, but uses many of the same phrases that we previously used to talk about deterministic feedforward networks.
      • The hidden units are described as feature detectors, or "features" for short.
      • The weights are shown as arrows, even though a Boltzmann Machine has undirected connections.
      • That's because calculating the probability of the hidden units turning on, given the state of the visible units, is exactly like calculating the real-valued state of a logistic hidden unit, in a deterministic feedforward network.
        • However, in a Boltzmann Machine, that number is then treated as a probability of turning on, and an actual state of 1 or 0 is chosen, randomly, based on that probability.
      • We'll make further use of that similarity next week.
    • 2:30. That procedure for changing energies, that was just explained, is a repeat (in different words) of the Contrastive Divergence story of the previous video. If you didn't fully realize that, then review.
  • Lecture 13a: The ups and downs of back propagation
    • 6:15: Support Vector Machines are a popular method for supervised learning: for learning a mapping from input to output, as we have been doing with neural networks during the first half of the course.
March 25
  • Lecture 13b: Belief Nets
    • 7:43. For this slide, keep in mind Boltzmann Machines. There, too, we have hidden units and visible units, and it's all probabilistic. BMs and SBNs have more in common than they have differences.
    • 9:16. Nowadays, "Graphical Models" are sometimes considered as a special category of neural networks, but in the history that's described here, they were considered to be very different types of systems.
March 27
  • Lecture 13c: Learning sigmoid belief nets
    • It would be good to read the first part of "The math of Sigmoid Belief Nets" before watching this video.
    • 4:39. The second part of "The math of Sigmoid Belief Nets" mathematically derives this formula. Read it after finishing this video.
    • 7:04. Actually, those numbers aren't quite correct, although they're not very far off. The take-home message, however, is correct: p(0,1) and p(1,0) are large, while the other two are small.
    • 7:33. Here's "explaining away" rephrased in a few more ways:
      • If the house jumps, everybody starts wondering what might have caused that. Was there an earthquake? Did a truck hit the house? We're not at all sure.
      • When the wind then carries, through the open window, the voice of an upset truck driver bemoaning his bad luck, we know that a truck hit the house. That finding "explains away" the possibility that there might have been an earthquake: all of a sudden, we no longer suspect that there might have been an earthquake, even though we haven't consulted the seismological office.
      • In other words: as soon as we learn something about one possible cause (truck hits house), we can make an inference about other possible causes (earthquake).
  • Lecture 13d: The wake-sleep algorithm
    • 4:38. Another way to say this is that the multiple units behave independently: the probability of unit 2 turning on has nothing to do with whether or not unit 1 turned on.
    • 5:30. The green weights are the weights of the Sigmoid Belief Net.
    • An "unbiased sample" from some distribution is a sample that's really drawn from that distribution. A "biased sample" is a sample that's not quite from the intended distribution.
    • We don't really do maximum likelihood learning. We just use the maximum likelihood learning rule, while substituting "a sample from the posterior" by "a sample from the approximate posterior". The only "maximum likelihood" part of it is that the formula for going from that sample to delta w is the same.
April 1
  • Lecture 15a: From PCA to autoencoders
    • Remember how, in assignment 4, we use unsupervised learning to obtain a different representation of each data case? PCA is another example of that, but for PCA, there's even greater emphasis on obtaining that different representation.
    • Chapter 15 is about unsupervised learning using deterministic feedforward networks.
      • By contrast, the first part of the course was about supervised learning using deterministic feedforward networks, and the second part was about unsupervised learning using very different types of networks.
    • 0:26. A linear manifold is a hyperplane.
    • 1:25. A curved manifold is no longer a hyperplane. One might say it's a bent hyperplane, but really, "hyperplane" means that it's not bent.
    • 1:37. "N-dimensional data" means that the data has N components and is therefore handled in a neural network by N input units.
    • 1:58. Here, that "lower-dimensional subspace" is yet another synonym for "linear manifold" and "hyperplane".
    • 2:46 and 3:53. Geoffrey means the squared reconstruction error.
    • 4:43. Here, for the first time, we have a deterministic feedforward network with lots of output units that are not a softmax group.
    • An "autoencoder" is a neural network that learns to encode data in such a way that the original can be approximately reconstructed.
  • Lecture 15b: Deep autoencoders
    • 2:51. "Gentle backprop" means training with a small learning rate for not too long, i.e. not changing the weights a lot.
  • Lecture 15c: Deep autoencoders for document retrieval
    • "Latent semantic analysis" and "Deep Learning" sound pretty good as phrases... there's definitely a marketing component in choosing such names :)
    • 1:14. The application for the method in this video is this: "given one document (called the query document), find other documents similar to it in this giant collection of documents."
    • 2:04. Some of the text on this slide is still hidden, hence for example the count of 1 for "reduce".
    • 3:09. This slide is a bit of a technicality, not very central to the story. If you feel confused, postpone focusing on this one until you've understood the others well.
    • 6:49. Remember t-SNE?
April 3
  • Lecture 15d: Semantic Hashing
    • We're continuing our attempts to find documents (or images), in some huge given pile, that are similar to a single given document (or image).
    • Last time, we focused on making the search produce truly similar documents. This time, we focus on simply making the search fast (while still good).
    • This video is one of the few times when machine learning goes hand in hand very well with intrinsically discrete computations (the use of bits, in this case).
    • We'll still use a deep autoencoder.
    • This video is an example of using noise as a regularizer (see video 9c).
    • Crucial in this story is the notion that units of the middle layer, the "bottleneck", are trying to convey as much information as possible in their states to base the reconstruction on.
      • Clearly, the more information their states contain, the better the reconstruction can potentially be.
  • Lecture 15e: Learning binary codes for image retrieval
    • It is essential that you understand video 15d before you try 15e.
    • 7:13. Don't worry if you don't understand that last comment.
  • Lecture 15f: Shallow autoencoders for pre-training
    • This video is quite separate from the others of chapter 15.
