Bayesian Neural Networks

Bayesian inference allows us to learn a probability distribution over possible neural networks. We can approximately solve inference with a simple modification to standard neural network tools. The resulting algorithm mitigates overfitting, enables learning from small datasets, and tells us how uncertain our predictions are.

What's wrong with neural networks?

You may have heard deep neural networks described as powerful function approximators. Their power is due to the extreme flexibility of having many model parameters (the weights and biases) whose values can be learned from data via gradient-based optimization. Because they are good at approximating functions (input-output relationships) when lots of data are available, neural networks are well-suited to artificial intelligence tasks like speech recognition and image classification.

But the extreme flexibility of neural networks has a downside: they are particularly vulnerable to overfitting. Overfitting happens when the learning algorithm does such a good job of tuning the model parameters for performance on the training set—by optimizing its objective function—that the performance on new examples suffers. Deep neural networks have a ton of parameters (typically millions in modern models), which essentially guarantees eventual overfitting because the learning algorithm can always do just a little bit better on the training set by tweaking some of the many knobs available to it. The flexibility of neural networks during training time actually makes them brittle at test time. This might sound surprising at first, but let's look at the training procedure both mathematically and graphically (for a toy problem) to build some intuition around why deep neural networks overfit.

In standard neural network training we want to learn an input-to-output mapping $y \approx f(x, w)$ via a network $f$ with weights $w$. We use a dataset of labeled examples $D = \{x_i, y_i\}$ to minimize a loss function $L(D, w)$ with respect to the weights $w$:

$$ \newcommand{\niceblue}[1]{\textcolor{#0074D9}{#1}} \newcommand{\nicered}[1]{\textcolor{#FF4136}{#1}} \newcommand{\nicegreen}[1]{\textcolor{#2ECC40}{#1}} \newcommand{\niceorange}[1]{\textcolor{#FFA500}{#1}} \newcommand{\nicepurple}[1]{\textcolor{#B10DC9}{#1}} $$ $$ \newcommand{\w}{\niceblue{w}} \newcommand{\y}{\nicered{y}} \newcommand{\x}{\nicegreen{x}} \newcommand{\D}{\niceorange{D}} \newcommand{\mylambda}{\nicepurple{\lambda}} $$
$$L(\textcolor{#FFA500}{D},\textcolor{#0074D9}{w}) \coloneqq \sum_{\textcolor{#2ECC40}{x}_i,\textcolor{#FF4136}{y}_i \in D} (\textcolor{#FF4136}{y}_i - f(\textcolor{#2ECC40}{x}_i,\textcolor{#0074D9}{w}))^2 + \textcolor{#B10DC9}{\lambda} \sum_d \textcolor{#0074D9}{w}_d^2$$
Loss of our model on dataset $\textcolor{#FFA500}{D}$
is squared error between the target label $\textcolor{#FF4136}{y}_i$ and output of network with input $\textcolor{#2ECC40}{x}_i$ and weights $\textcolor{#0074D9}{w}$
plus a regularization term, whose importance is set by hyperparameter $\textcolor{#B10DC9}{\lambda}$, that penalizes large weight values $\textcolor{#0074D9}{w}_d$.
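To make the objective concrete, here is a minimal NumPy sketch of this regularized squared-error loss for a hypothetical two-layer network; the architecture, dataset, and value of $\lambda$ are illustrative choices rather than anything prescribed above.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x, w):
        """A tiny MLP: 1 input -> 16 tanh hidden units -> 1 output."""
        W1, b1, W2, b2 = w
        h = np.tanh(x @ W1 + b1)
        return h @ W2 + b2

    def loss(D, w, lam=1e-2):
        """Squared error on the dataset plus lambda times the sum of squared parameters."""
        x, y = D
        sq_err = np.sum((y - f(x, w)) ** 2)
        l2 = sum(np.sum(p ** 2) for p in w)   # for simplicity, biases are penalized too
        return sq_err + lam * l2

    # A toy 1-D regression dataset and randomly initialized parameters.
    x = rng.uniform(-2, 2, size=(50, 1))
    y = np.sin(2 * x) + 0.1 * rng.normal(size=(50, 1))
    w = [rng.normal(size=(1, 16)), np.zeros(16), rng.normal(size=(16, 1)), np.zeros(1)]
    print(loss((x, y), w))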

This turns out to be equivalent to maximum a posteriori (MAP) training, in other words, maximizing the joint log probability of the data $D$ and weights $w$:

$$\log p(\textcolor{#FFA500}{D},{\textcolor{#0074D9}{w}}) \coloneqq \sum_{\textcolor{#2ECC40}{x}_i,\textcolor{#FF4136}{ y}_i \in D} \log \mathcal{N}(\textcolor{#FF4136}{y}_i | f(\textcolor{#2ECC40}{x}_i,\textcolor{#0074D9}{w}),I) + \sum_d \log \mathcal{N}(\textcolor{#0074D9}{w}_d| 0, \tfrac{1}{\sqrt{\textcolor{#B10DC9}{\lambda}}})$$
Log-likelihood of dataset $\textcolor{#FFA500}{D}$ under model
is the log-probability of target label $\textcolor{#FF4136}{y}_i$ under a Gaussian distribution $\mathcal{N}$ with mean given by the network output $f(\textcolor{#2ECC40}{x}_i, \textcolor{#0074D9}{w})$ and unit variance,
plus our prior belief about the weights, given by the log-probability of $\textcolor{#0074D9}{w}_d$ under a Gaussian $\mathcal{N}$ with mean $0$ and hyperparameter-controlled standard deviation $\frac{1}{\sqrt{\textcolor{#B10DC9}{\lambda}}}$.

We train on the training dataset $D_\text{train}$, but evaluate on the test dataset $D_\text{test}$. As we see in the demo below, the training objective $\log p(D_\text{train}, w)$ is different from the test objective $\log p(D_\text{test}, w)$. Overfitting happens when we choose network weights $w$ that do well on $\log p(D_\text{train}, w)$ but poorly on $\log p(D_\text{test}, w)$. As we add more layers to the network, the problem gets worse.

Luckily, Bayesian neural networks address overfitting by modeling uncertainty in the weights. Plus they can be trained using standard neural net tools using an algorithm called stochastic variational inference, which we cover at the end of this tutorial. But first let's explore why being Bayesian helps with overfitting.

Training log likelihood $\log p({\textcolor{#FFA500}{D_\text{train}}},\textcolor{#0074D9}{w_1, w_2})$
The neural network objective $L$ is a function of both the weights and the dataset. Here we show $L_{\text{train}}$ and $L_{\text{test}}$ as a function of only two of the neuron weights, with the remaining weights fixed to pre-trained values. Redder is better. Use your mouse to change the values of the weights for these two neurons. Notice how the weight values affect both the training and test objectives (left, right) and how the network predictions change (below).
Test log likelihood $\log p({\textcolor{#FFA500}{D_\text{test}}},\textcolor{#0074D9}{w_1, w_2})$
Prediction function $\hat y_w(x)$. Red dots are training data, green are test.
Deep
Shallow
Linear

When it comes to describing training, the last figure is somewhat deceptive, in that the background heatmap appears to be fixed. It only appears so because we fixed all the weights and biases except the first two; only two weights are being adjusted here. If we freed all the weights, the colorful background you see would be constantly changing.

Training log likelihood $\log p({\textcolor{#FFA500}{D_\text{train}}},\textcolor{#0074D9}{w_1, w_2})$
Observe how, in actual training, the gradient updates move the weights across a constantly changing landscape.
Test log likelihood $\log p({\textcolor{#FFA500}{D_\text{test}}},\textcolor{#0074D9}{w_1, w_2})$
Validation Set
Training Set
Training log likelihood $\log p({\textcolor{#FFA500}{D_\text{train}}},\textcolor{#0074D9}{w_1, w_2})$
The same discrepancy happens in classification tasks. Notice how the contours differ.
Test log likelihood $\log p({\textcolor{#FFA500}{D_\text{test}}},\textcolor{#0074D9}{w_1, w_2})$

Bayes to the rescue

The problem of overfitting is certainly not unique to neural networks. But because the flexibility of neural networks makes them particularly susceptible, researchers and practitioners have come up with many extensions to the standard learning algorithm (early stopping, weight decay, and dropout, just to name a few) to reduce overfitting.

This tutorial focuses on Bayesian inference, a powerful framework that not only helps with overfitting, but also tells us how uncertain our model is about its parameters. Before, we used the training data $D_{\text{train}}$ plus a gradient-based optimizer to tune the weights $w$ so as to optimize the training objective. In Bayesian inference, instead of learning a single setting of the parameter values, we seek to compute $p(w | D_{\text{train}})$, the conditional distribution of the weights given the training data. $p(w|D_{\text{train}})$ is called the posterior distribution, or often just the posterior for short.

Exact Bayesian inference

Bayes' rule tells us how to compute the posterior: $$ p(w | D) = \dfrac{p(D|w)p(w)}{p(D)} = \dfrac{p(D|w)p(w)}{\int_{w'} p(D|w')p(w') dw'} . $$ Computing the posterior in this way is sometimes called exact inference, simply to distinguish it from the approximations we will later find to be more practical. Exact inference necessitates specifying both the prior $p(w)$ and the likelihood $p(D | w)$. While the posterior distribution is easily expressed, it is typically expensive to compute due to the pesky integral over all possible values of $w$. For most neural networks we can't evaluate this integral analytically. We could instead approximate it numerically, but this is only practical for small neural networks: $w$ represents all the weights and biases, so it becomes very high dimensional for deep networks.

So we will need to address this computational issue, which we do in the next section. In the meantime, let's consider another upside of using the posterior distribution over weights $p(w|D)$ instead of the estimate $w^*$ from standard neural network training. In particular, knowing the posterior allows us to do probabilistic prediction by expressing a distribution over predicted output $\hat y$ as a function of the new input $x$, $$ p(\hat y(x)| D) = \int_{w} p(\hat y(x)| w) p(w | D) dw = \mathbb{E}_{p(w|D)}[p(\hat y(x)|w)], $$ which we call the predictive distribution. Notice that we can express this distribution as the expectation of the single network likelihood under the posterior $p(w|D)$. This yields a nice interpretation of the predictive distribution as an infinite ensemble of networks , with each network's contribution to the overall prediction weighted by the posterior likelihood of its weights given the training data. Also, it implies that we can approximate this infinite ensemble using a finite number of Monte Carlo samples from the posterior. Approximating the predictive distribution with only a single Monte Carlo sample is equivalent to using a single network $\hat y_w(x)$ with weights $w$ chosen at random from $p(w|D)$; in a slight abuse of terminology, we can think of this process as sampling networks from the posterior.
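As a sketch of this ensemble view, the following hypothetical helper averages the predictions of a set of posterior weight samples (however they were obtained); it summarizes the ensemble by its mean and spread rather than the full predictive density.

    import numpy as np

    def predictive_mean_and_std(x, posterior_samples, f):
        """Average the per-network predictions f(x, w_i) over posterior weight samples w_i,
        returning the ensemble mean and its spread at each input point."""
        preds = np.stack([f(x, w) for w in posterior_samples])  # shape: (num_samples, ...)
        return preds.mean(axis=0), preds.std(axis=0)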

Modeling uncertainty

But before we move on, let's briefly consider the implications of Bayesian inference, which replaces the point estimate of the weights $w^*$ and its corresponding prediction function $\hat y_{w^*}(x)$ from standard training with the inferred distributions $p(w|D)$ and $p(\hat y_{w}(x)|D)$. An immediate benefit of specifying distributions over the model parameters and predictions is that we can quantify our uncertainty about these quantities, e.g., by computing their variance. This is especially relevant when learning from small datasets; standard neural net training will overfit for the reasons discussed above, whereas Bayesian inference infers the distribution over model parameters that best explains the available data, and that distribution typically has high uncertainty when data are scarce. In the limit of dataset size far exceeding the number of neurons, the inferred distributions sharpen and begin to resemble the solutions from standard training; for modern networks we don't typically reach this limit in practice, so having a notion of model uncertainty is helpful.

Approximate inference

We have seen computationally difficult integrals arise in two expressions during the Bayesian setup: the marginal likelihood $$p(D) = \int_{w} p(D|w)p(w)\, dw$$ and the predictive distribution $$p(\hat y(x)|D) = \int_{w} p(\hat y(x)|w)\, p(w|D)\, dw.$$ While the second expression is more important for prediction, evaluating it exactly implicitly involves the first via the posterior.

At first glance, the glaring difficulty of computing these objects might suggest that Bayesian neural networks will be difficult to implement in practice. But as we will see, it is possible to approximately compute one or both of these objects via clever algorithms, which can be broadly categorized as either sampling-based or variational.

Sampling methods approximately compute $p(\hat y(x)|D) = \mathbb{E}_{p(w|D)}[p(\hat y(x)|w)]$ by generating a finite set of network weights $\{w_1, \ldots, w_N\}$ whose empirical distribution matches $p(w|D)$ in the limit of large $N$. Loosely speaking, the algorithmic design challenge is to relatively quickly produce a modest number ($N$ not too big) of network samples that yield a decent approximation of $p(\hat y(x)| D)$.

Variational methods instead directly model the posterior $p(w|D)$ using a parameterized distribution $q_\phi(w)$, called the approximate posterior, then iteratively improve the approximation by solving a suitable optimization problem. We will show that such an optimization problem can be derived, so the remaining design challenges are to specify $q_\phi$ and an optimizer. Thankfully, we can use off-the-shelf optimizers from standard neural network training!

While both sampling and variational methods are of practical and historical import, we emphasize that the latter approach admits a learning algorithm, called stochastic variational inference, that closely resembles standard neural network training. We devote a section to this algorithm to underscore that being Bayesian in neural network learning is well worth the modest cost in engineering overhead.

TODO: make the point (perhaps in a footnote) that variational methods still typically evaluate the posterior predictive via a Monte Carlo estimate, so they are not sampling-free.

Sampling methods

Monte Carlo integration

Sampling methods rely on Monte Carlo integration: the use of a finite set of random samples to approximate an expected value. In particular, they generate a set of neural networks asymptotically distributed according to $p(w|D)$ in order to approximate $p(\hat y(x)|D)$ as the empirical expectation of single-network predictions $p(\hat y(x)|w)$ under the sampled networks. Monte Carlo integration is widely applicable, but it isn't always practical. We can see this by deriving a naive approximation to one of our previously identified difficult integrals: the marginal data likelihood $p(D)$. $$ p(D) = \int_{w} p(D|w)p(w) dw = \mathbb{E}_{p(w)} \left[ p(D|w) \right] \approx \frac{1}{N} \sum_{i=1}^{N} p(D|w_i). $$ Here the network weight samples $\{w_1, \ldots, w_N\}$ are drawn independently from the prior $p(w)$.
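A minimal sketch of this naive estimator, assuming stand-in functions `log_likelihood(D, w)` and `sample_prior()` for a concrete model, might look like the following (working in log space for numerical stability):

    import numpy as np

    def log_marginal_likelihood(D, log_likelihood, sample_prior, n_samples=1000):
        """log p(D) ~= log[(1/N) * sum_i p(D | w_i)] with w_i drawn from the prior p(w)."""
        log_liks = np.array([log_likelihood(D, sample_prior()) for _ in range(n_samples)])
        m = log_liks.max()
        # Log-sum-exp for numerical stability; the estimate has enormous variance
        # when w is high dimensional, which is exactly the problem discussed below.
        return m + np.log(np.mean(np.exp(log_liks - m)))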

Given a good estimate for $p(D)$, we could again use Monte Carlo integration with network samples $\{w_i\}$ from the prior to approximate the predictive posterior: $$\begin{aligned} p(\hat y(x)|D) &= \mathbb{E}_{p(w|D)} \left[ p(\hat y(x)|w) \right] = \mathbb{E}_{p(w)} \left[ p(\hat y(x)|w) \frac{p(D|w)}{p(D)} \right]\\ &\approx \frac{1}{N} \sum_{i=1}^{N} p(\hat y(x)|w_i) \frac{p(D|w_i)}{p(D)} \quad \text{where} \quad w_i \sim p(w). \end{aligned}$$ If we could instead sample directly from the posterior, the estimate would simplify to $$\begin{aligned} p(\hat y(x)|D) &= \mathbb{E}_{p(w|D)} \left[ p(\hat y(x)|w) \right]\\ &\approx \frac{1}{N} \sum_{i=1}^{N} p(\hat y(x)|w_i) \quad \text{where} \quad w_i \sim p(w|D). \end{aligned}$$ The law of large numbers assures us that both Monte Carlo estimates will eventually converge to the true quantities given a sufficient number of samples, but in this naive setup $N$ will need to be huge due to the large dimensionality of $w$.

Markov Chain Monte Carlo

We saw that Monte Carlo estimates based on naively sampling network weights from the prior won't work. We can instead sacrifice our assumption of sample independence, using a Markov chain to produce a sequence of dependent samples. This is formalized by a proposal distribution $q(w'|w_t)$ over choices of the next sample in the sequence, conditioned on the previous choice. The Metropolis algorithm uses this proposal distribution to generate a set of samples that are asymptotically distributed according to $p(w|D)$. We use the Markov chain to generate candidate samples and then stochastically accept each candidate with probability $\min(1, a)$, where the acceptance ratio (here we assume a symmetric proposal distribution to simplify the expression of $a$, but this assumption is not necessary) is defined as $$ a = \frac{p(D, w')}{p(D, w_t)} .$$ The acceptance ratio is also equivalent to the ratio of posterior probabilities under the proposed and most recent samples, $\frac{p(w'|D)}{p(w_t| D)}$. This is because $p(D, w) = p(w|D)p(D)$ is equal to the posterior up to a multiplicative constant $p(D)$ that is the same for all $w$; here we've leveraged the fact that we can compute $p(D, w)$ but not $p(w|D)$. A simple choice of proposal distribution is a Normal distribution centered at the current sample, $ q(w'|w_t) = \mathcal{N}(w_t, \sigma^2) $. If we were to accept every sample from this proposal distribution, we would be performing a random walk in parameter space with step size $\sigma$. The accept/reject step of the Metropolis algorithm mitigates random steps into regions of low probability mass. We've introduced a new hyperparameter $\sigma$ that will need tuning: generally speaking, we want to accept new samples often in order to converge quickly, and the acceptance rate is sensitive to the value of $\sigma$. While MCMC-based methods scale better to high-dimensional problems because they focus on regions with relatively high probability mass, we still don't have any guarantees about their speed of convergence.
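Here is a minimal sketch of the Metropolis algorithm with this Gaussian random-walk proposal; `log_joint(w)` is a stand-in for $\log p(D, w)$ of a concrete model, and the step size and number of steps are illustrative.

    import numpy as np

    def metropolis(log_joint, w_init, n_steps=10_000, sigma=0.05, seed=0):
        """Random-walk Metropolis sampler targeting p(w | D) via the joint log p(D, w)."""
        rng = np.random.default_rng(seed)
        w = np.asarray(w_init, dtype=float)
        log_p = log_joint(w)
        samples = []
        for _ in range(n_steps):
            w_prop = w + sigma * rng.normal(size=w.shape)   # symmetric proposal q(w'|w_t)
            log_p_prop = log_joint(w_prop)
            # Accept with probability min(1, p(D, w') / p(D, w_t)), computed in log space.
            if np.log(rng.uniform()) < log_p_prop - log_p:
                w, log_p = w_prop, log_p_prop
            samples.append(w.copy())
        return np.array(samples)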

TODO: segue to the demo

Gaussian Random Walk Proposal on Training Data

The demo above plots the sequence of functions we get when sampling network weights from the posterior using two sampling algorithms: one with the Metropolis proposal and one with Hamiltonian dynamics. It shows the prediction function for a sequence of networks sampled in this way, as well as the approximate predictive distribution computed as the running average of the sampled networks. This running average, shaded in red, corresponds to the Monte Carlo estimate of the predictive distribution $p(\hat y(x)| D)$. Even though we cannot evaluate the posterior itself, producing samples from it is feasible with algorithms like these, and we can aggregate a finite number of such samples to approximate the predictive distribution.

TODO: briefly describe the wealth of other MCMC options that seek to design a better proposal distribution, cite Neal, MacKay, etc.

Variational inference

TODO: contrast sampling-based inference with gradient based inference, especially with their ability to scale.

Sampling-based inference is good enough for many applications, especially when the parameter space is of manageable size and an asymptotically exact solution is worth the computational cost. As network and dataset sizes grow, however, so grows the cost of estimating the posterior via sampling. In these cases we can instead solve a distinct but closely related inference problem that admits a learning algorithm very similar to the one we used for standard neural network training; we'll pose the problem now and discuss the learning algorithm in the next section.

Modeling the approximate posterior

Generally speaking, when faced with an intractable distribution like the posterior $p(w|D)$, we can appeal to variational methods. These first define a parameterized and tractable stand-in distribution, here called the approximate posterior $q_\phi(w)$ (we could alternatively make the approximate posterior conditionally dependent on specific inputs, $q_\phi(w|x)$, as in the variational autoencoder's encoder network, but this is not necessary in the classification task we consider here since there is no need to reconstruct the inputs), and then tune the parameters $\phi$ so that $q_\phi$ better approximates the intractable distribution. Developing a variational method for approximate inference requires two steps: first, formalizing a notion of similarity between two probability distributions, and second, writing down a tractable optimization problem that corresponds to maximizing this notion of similarity. The second step, maximizing the similarity of the approximate and true posteriors, might seem difficult since we can't even evaluate the true posterior. But we will see that our choice of objective in the first step helps us in this regard.

The Kullback-Leibler divergence

Mean μ = 0.0
Standard Deviation = 1.0
$D_{KL}(q||p)$
$D_{KL}(p||q)$
$D_{JS}(p, q)$
In variational methods we fit a tractable parameterized distribution $q$ (e.g., the single Gaussian in red) to a complicated (sometimes intractable) distribution $p$ (e.g., the mixture of Gaussians in black) by minimizing a divergence between the two distributions. The shaded contour represents $q \log \frac{q}{p}$, and the diameter and color of the circle represent its integral, the KL-divergence $d_{KL}(q||p)$. How should we select the parameter values of $q$ if we want to minimize $d_{KL}(q||p)$? What about if we want to minimize one of the alternative measures, the reverse KL-divergence or the Jensen-Shannon distance?

We use a quantity from information theory called the Kullback-Leibler divergence (or KL-divergence) to quantify the dissimilarity between two distributions, expressed as $$\begin{aligned} \quad d_{KL}[q_\phi(w)||p(w|D)] &\coloneqq \int_{w} q_\phi(w) \log \frac{q_\phi(w)}{p(w|D)} \\ &= \mathbb{E}_{q_\phi(w)}[\log q_\phi(w) - \log p(w|D)]. \end{aligned}$$ So our optimization problem should be something like "minimize the KL-divergence from the approximate posterior to the true posterior", which we will formalize shortly. But before proceeding, it's worthwhile to consider some properties of the KL-divergence using the demo above, which fixes a $p$ that we'd like to approximate and shows how the parameters of a Gaussian $q$ (its mean and standard deviation) affect $d_{KL}(q||p)$.

By definition the KL-divergence is not symmetric, $d_{KL}(q||p) \neq d_{KL}(p||q)$, which is why it's called a divergence instead of a distance. We choose the $q||p$ direction for computational reasons: since the expectation is taken over $q$, we can estimate it using Monte Carlo samples from $q$. But we should note the implications of this choice of direction; we can see in the demo that in the $q||p$ direction there is no price paid for failing to model mass of $p$ in regions where $q$ assigns no mass. Conversely, $q$ must be 0 wherever $p$ is zero, otherwise $d_{KL}(q||p)$ blows up. In the demo you can also see how the parameters of $q$ affect two alternative measures of dissimilarity: the reverse KL-divergence $d_{KL}(p||q)$ and the Jensen-Shannon distance, defined as $d_{JS}(p, q) = 0.5\,d_{KL}(p||0.5p+0.5q) + 0.5\,d_{KL}(q||0.5p+0.5q)$.
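To build further intuition, here is a small sketch that Monte Carlo estimates $d_{KL}(q||p) = \mathbb{E}_q[\log q(w) - \log p(w)]$ for a one-dimensional example; the bimodal target $p$ is an illustrative stand-in for the mixture in the demo.

    import numpy as np

    def log_normal(w, mu, sigma):
        return -0.5 * ((w - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

    def log_mixture(w):
        # Hypothetical bimodal target p: an equal mixture of N(-2, 0.5) and N(2, 0.5).
        comps = np.stack([log_normal(w, -2.0, 0.5), log_normal(w, 2.0, 0.5)])
        m = comps.max(axis=0)
        return m + np.log(0.5 * np.exp(comps - m).sum(axis=0))

    def kl_q_p(mu, sigma, n=100_000, seed=0):
        w = np.random.default_rng(seed).normal(mu, sigma, size=n)   # samples from q
        return np.mean(log_normal(w, mu, sigma) - log_mixture(w))

    print(kl_q_p(mu=2.0, sigma=0.5))   # q locked onto one mode of p: small divergence
    print(kl_q_p(mu=0.0, sigma=2.0))   # broad q covering the low-density gap: much larger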

Approximate inference as optimization

Having acquired some intuition about the KL-divergence---our notion of dissimilarity between the approximate and true posteriors---we now derive an optimization problem that minimizes this quantity. Regrettably, we can't directly optimize $d_{KL}(q_\phi(w)||p(w|D))$ due to the familiar frustration of not being able to compute the posterior. Fortunately, we can derive a tractable and closely related quantity that is equal to the KL-divergence up to an additive constant and is thus suitable as a surrogate objective. $$\begin{aligned} d_{KL} (q_\phi(w)||p(w|D)) &= \mathbb{E}_{q_\phi(w)} \left[ \log q_\phi(w) \right] - \mathbb{E}_{q_\phi(w)} \left[ \log p(w|D) \right] \\ &= \mathbb{E}_{q_\phi(w)} \left[ \log q_\phi(w) \right] - \mathbb{E}_{q_\phi(w)} \left[ \log p(w, D) - \log p(D) \right] \\ &= \mathbb{E}_{q_\phi(w)} \left[ \log q_\phi(w) - \log p(w, D)\right] + \log p(D) \end{aligned}$$ Here $p(D)$ represents the marginal data likelihood; we know this quantity to be intractable to compute since it was the denominator in Bayes' rule. But notably this term does not depend on the parameters of $q_\phi$, which means it can be ignored while optimizing with respect to $\phi$. We've encapsulated all of the computational intractability in the $\log p(D)$ term; negating the remaining terms and taking the expectation over the data distribution $D$ yields a tractable and suitable surrogate objective called the evidence lower bound (ELBO): $$\begin{aligned} \tilde L(\phi) &\coloneqq \mathbb{E}_{D, q_\phi(w)} \left[ \log p(w, D) - \log q_\phi(w) \right] \\ &= \mathbb{E}_{x, y \sim D} \left[ \mathbb{E}_{w \sim q_{\phi}(w)} \left[ \log p(\hat y(x) = y|w) + \log p(w) - \log q_{\phi}(w) \right] \right] \end{aligned}$$ Since maximizing the ELBO minimizes the KL-divergence (the two differ only by a sign and the constant $\log p(D)$), variational inference corresponds to optimizing the ELBO with respect to the variational parameters: $$\underset{\phi}{\text{maximize}} \quad \tilde L(\phi)$$

The probability density function of $q_\phi(w)$, which depends on its parameters $\phi$ in addition to $w$, is a design choice that we will need to specify.
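Anticipating the diagonal-Gaussian choice of $q_\phi$ discussed below, here is a minimal sketch of a single-sample Monte Carlo estimate of the inner ELBO term for one $(x, y)$ pair; `log_likelihood` and `log_prior` are stand-ins for $\log p(\hat y(x) = y|w)$ and $\log p(w)$ of a concrete model.

    import numpy as np

    def log_diag_gaussian(w, mu, sigma):
        return np.sum(-0.5 * ((w - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

    def elbo_estimate(x, y, mu, sigma, log_likelihood, log_prior, seed=0):
        """Single-sample estimate of E_q[log p(yhat(x)=y | w) + log p(w) - log q(w)]."""
        w = np.random.default_rng(seed).normal(mu, sigma)   # one sample w ~ q_phi(w)
        return (log_likelihood(x, y, w)
                + log_prior(w)
                - log_diag_gaussian(w, mu, sigma))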

Optimizing with gradients

Variational inference re-frames the computation of an integral (the marginal likelihood from exact inference) as the optimization of its lower bound. Notably, this implies that we can now use tools from the optimization literature to approximately solve our inference problem. This includes optimizing with minibatch (a.k.a. "stochastic") gradients, like we did with standard neural network training. Thus, with a slight tweak, we can adapt stochastic gradient ascent to variational inference in an algorithm called Stochastic Variational Inference.

As discussed above, an important step in designing a variational inference algorithm is to specify the parameterized variational distribution over the weights, $q_\phi(w)$. In order to explicate the learning algorithm we discuss a simple choice of variational distribution: a diagonal-covariance Gaussian. However, the practical importance of appropriately specifying $q$ should be emphasized; ideally we want $q$ to be sufficiently expressive to model the true posterior $p(w|D)$, which may not be true of the diagonal-covariance Gaussian for most interesting problems.

TODO: opportunity to cite papers like normalizing flows here

Specifying $q$ as a diagonal-covariance Gaussian implies that all of the network weights $w$ are jointly distributed according to the multivariate Gaussian $q_{\mu, \sigma}(w) = \mathcal{N}(\mu, \sigma^2 I)$, which has log probability $\log q_{\mu, \sigma}(w) = -\frac{1}{2}(w-\mu)^T(\sigma^2 I)^{-1}(w - \mu)$ up to an additive normalization constant. Note that we have chosen a notation that makes explicit the variational parameters, comprising all the means and standard deviations: $\phi = (\mu, \sigma)$. Due to the structure of the covariance matrix, we can equivalently think of each network weight $w_i$ as a random variable distributed according to its own Gaussian distribution with mean $\mu_i$ and standard deviation $\sigma_i$.
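A quick numerical sanity check of this factorization, using SciPy's Gaussian densities on a hypothetical five-weight example:

    import numpy as np
    from scipy.stats import multivariate_normal, norm

    rng = np.random.default_rng(0)
    mu, sigma = rng.normal(size=5), rng.uniform(0.5, 1.5, size=5)   # hypothetical 5 weights
    w = rng.normal(size=5)

    joint = multivariate_normal(mean=mu, cov=np.diag(sigma ** 2)).logpdf(w)
    per_weight = np.sum(norm(loc=mu, scale=sigma).logpdf(w))
    print(np.allclose(joint, per_weight))   # True: the diagonal Gaussian factorizes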

In order to derive a gradient-based learning algorithm, let's compare the variational inference objective function $\tilde L(\phi)$ (with our choice of $q_\phi$) to the MAP objective function $L(w)$ from standard neural network training. The MAP objective is $$ L(w) = \mathbb{E}_{x, y \sim D} \left[ \log p(\hat y(x) = y|w) + \log p(w) \right], $$ while the VI objective is $$ \tilde L(\mu, \sigma) = \mathbb{E}_{x, y \sim D} \left[ \mathbb{E}_{w \sim q_{\mu, \sigma}} \left[ \log p(\hat y(x) = y|w) + \log p(w) - \log q_{\mu, \sigma}(w) \right] \right]. $$

We want to optimize the VI objective with gradients, which necessitates computing $\nabla_\mu \tilde L(\mu, \sigma)$ and $\nabla_\sigma \tilde L(\mu, \sigma)$, the gradients of the VI objective with respect to the variational parameters $(\mu, \sigma)$. Note that whereas both the MAP and VI objectives involve an expectation over the data distribution, the VI objective additionally requires an expectation over the weight distribution. We can use the standard stochastic gradient ascent machinery, which we discuss next, to handle the expectation over data $\mathbb{E}_{x, y \sim D}$ using minibatches. The trouble is with the second expectation $\mathbb{E}_{w \sim q_{\mu, \sigma}}$, since the distribution $q_{\mu, \sigma}$ depends on the parameters $\mu$ and $\sigma$ that we want to optimize.

Stochastic gradient ascent

Let's revisit a detail we skipped over earlier: how to compute gradients of the single-network objective with respect to $w$. The quantity we want to compute is $\nabla_w L(w)$. We can show that the gradient of the average over training examples is equivalent to the average over the $N$ training examples of the individual gradients: $$ \begin{aligned} \nabla_w L(w) &= \nabla_w \mathbb{E}_{x,y \sim D} \left[ \log p(\hat y(x_i) = y_i|w) + \log p(w) \right] \\ &= \nabla_w \frac{1}{N} \sum_{i=1}^{N} \left[ \log p(\hat y(x_i) = y_i|w) + \log p(w) \right] \\ &= \underbrace{\frac{1}{N} \sum_{i=1}^{N} \nabla_w \left[ \log p(\hat y(x_i) = y_i|w) + \log p(w) \right]}_{g_{\text{batch}}(w)}. \end{aligned}$$ We call this computation of the exact gradient $g_{\text{batch}}(w)$ the batch gradient because it involves averaging over the entire batch of training data at once. We used the independence of the data from the weights to move the gradient operator $\nabla_w$ inside the expectation over training data $\mathbb{E}_{x,y}$. Each $\nabla_w$ inside the sum can be computed efficiently with backpropagation, and the overall average is parallelizable. The algorithm that iteratively updates the weights according to $w \leftarrow w + \alpha\, g_{\text{batch}}(w)$, for a learning rate $\alpha$, is thus called batch gradient ascent.

Whereas batch gradient ascent requires iterating over the entire training set for every update to $w$, we might reasonably expect that the gradient $\nabla_w L(w)$ can be estimated by averaging over a smaller subset of training examples. By dividing the training data into $M$ equally sized partitions called "minibatches" (so $N_m = N/M$ examples each), we can compute this approximate gradient by averaging over the $N_m$ samples of the $m$-th minibatch: $$ \nabla_w L(w) \approx \underbrace{\frac{M}{N} \sum_{i=1}^{N_m} \nabla_w \left[ \log p(\hat y(x_i) = y_i|w) + \log p(w) \right]}_{g_{\text{SGA}}(w)}$$ Estimating the gradient in this way is called stochastic gradient ascent (SGA). It speeds up training by roughly a factor of $M$, so long as the minibatches are large enough that $g_{\text{SGA}}(w)$ is a reasonably good approximation to $g_{\text{batch}}(w)$. Interestingly, stochastic gradient updates provide some additional benefits, including mitigating overfitting.
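As a concrete sketch, here is stochastic gradient ascent on $\log p(D, w)$ for a toy Bayesian linear regression, where the per-example gradient is available in closed form; the model, prior precision, learning rate, and minibatch count are all illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, lam, alpha, M = 1000, 3, 1e-2, 1e-2, 10   # data size, dim, prior precision, step size, minibatches
    X = rng.normal(size=(N, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
    w = np.zeros(d)

    for epoch in range(100):
        for Xb, yb in zip(np.array_split(X, M), np.array_split(y, M)):
            # g_SGA: (M/N) times the minibatch sum of per-example gradients of
            # log N(y_i | x_i w, 1) + log p(w), following the expression above.
            g = (M / N) * (Xb.T @ (yb - Xb @ w) - len(Xb) * lam * w)
            w = w + alpha * g

    print(w)   # approaches the regularized least-squares solution, near [1.0, -2.0, 0.5]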

TODO: cite papers on the theory of SGD here. also maybe talk more about the role of gradient noise as a regularizer, since we will use that concept later on

The reparameterization gradient estimator

Now let's turn our attention to the gradient-based optimization of the variational objective $\tilde L(\mu, \sigma)$. $$ \tilde L(\mu, \sigma) = \mathbb{E}_{x, y \sim D} \left[ \mathbb{E}_{w \sim q_{\mu, \sigma}} \left[ \log p(\hat y(x) = y|w) + \log p(w) - \log q_{\mu, \sigma}(w) \right] \right] $$ We will need to compute the gradients of the objective with respect to the parameters, $\nabla_\mu \tilde L(\mu, \sigma)$ and $\nabla_\sigma \tilde L(\mu, \sigma)$. We saw earlier how the independence of the data from the weights lets us move the gradient operator $\nabla$ inside the outer-most expectation. $$ \nabla_{\mu, \sigma} \tilde L(\mu, \sigma) = \frac{1}{N} \sum_{i=1}^{N} \left[\nabla_{\mu, \sigma} \mathbb{E}_{w \sim q_{\mu, \sigma}} \left[ \log p(\hat y(x_i) = y_i|w) + \log p(w) - \log q_{\mu, \sigma}(w) \right] \right] $$ Unfortunately, that's as far as we can move $\nabla$, since the inner expectation is over a distribution that depends on the parameters $\mu$ and $\sigma$ that we're trying to optimize.

TODO: more explication here

$$ w \sim \mathcal{N}(\mu, \sigma^2 I) $$ This can be reparameterized as $w(\mu, \sigma, \epsilon) = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Putting this all together (including the minibatch trick from before) yields the SVI gradient estimator: $$ \nabla_{\mu, \sigma} \tilde L(\mu, \sigma) \approx \underbrace{\frac{M}{N} \sum_{i=1}^{N_m} \frac{1}{S} \sum_{j=1}^{S} \nabla_{\mu, \sigma} \left[ \log p(\hat y(x_i) = y_i|w(\mu, \sigma, \epsilon_j)) + \log p(w(\mu, \sigma, \epsilon_j)) - \log q_{\mu, \sigma}(w(\mu, \sigma, \epsilon_j)) \right]}_{g_{\text{SVI}}(\mu, \sigma)}. $$ Notice that we've introduced a new hyperparameter $S$, the number of samples from $p(\epsilon)$ we draw to Monte Carlo estimate the inner expectation. We can think of this as the number of sampled networks we consider per training step. Perhaps surprisingly, we can often get away with a small value of $S$ at training time ($S=1$ is commonly used). At test time, however, we can increase the number of samples used to estimate the posterior predictive distribution.
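To see the estimator in isolation, here is a one-dimensional sketch where the role of the per-example log joint is played by a simple quadratic $h(w)$, so the exact gradients are known and the reparameterization estimates can be checked against them.

    import numpy as np

    def h_grad(w):
        return -w   # derivative of the stand-in objective h(w) = -0.5 * w**2

    def reparam_grads(mu, sigma, S=10_000, seed=0):
        """Estimate d/dmu and d/dsigma of E_{w ~ N(mu, sigma^2)}[h(w)] via w = mu + sigma * eps."""
        eps = np.random.default_rng(seed).normal(size=S)
        gw = h_grad(mu + sigma * eps)          # dh/dw at each sampled w
        return gw.mean(), (gw * eps).mean()    # chain rule: dw/dmu = 1, dw/dsigma = eps

    # For this h the exact gradients are (-mu, -sigma), so the estimates should be
    # close to (-1.0, -0.5) here.
    print(reparam_grads(mu=1.0, sigma=0.5))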

TODO: list the exact assumptions of the reparameterization estimator and mention what we can do if the variational distribution over weights is discrete (cite Luisoz work on binary nnets)

Stochastic variational inference

TODO: put it all together and segue to the algo box

        \begin{algorithmic}
        \PROCEDURE{SGA}{$x, y, w$}
            \WHILE {training not converged}
                \STATE $g = \text{BACKPROP}(x, y, w)$
                \STATE $w = w + \alpha g$
            \ENDWHILE
            \ENDPROCEDURE
        \end{algorithmic}
    
        \begin{algorithmic}
        \PROCEDURE{SVI}{$x, y, w, {\color{blue}\sigma}$}
            \WHILE {training not converged}
                \STATE ${\color{blue} \epsilon \sim \mathcal{N}(0, I)}$
                \STATE $g = \text{BACKPROP}(x, y, {\color{blue}w+\epsilon\sigma})$
                \STATE $w = w + \alpha g$
                \STATE ${\color{blue}\sigma = \sigma + \alpha\epsilon g}$
            \ENDWHILE
            \ENDPROCEDURE
        \end{algorithmic}
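The following sketch mirrors the SVI pseudocode above on the toy Bayesian linear regression from the earlier SGA sketch, with $S = 1$ and BACKPROP replaced by a closed-form gradient; unlike the box, it also includes the prior and the $-\log q$ entropy term (whose gradient with respect to $\sigma$ is $1/\sigma$). All specific values are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, lam, alpha = 1000, 3, 1e-2, 1e-5
    X = rng.normal(size=(N, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

    mu, sigma = np.zeros(d), 0.1 * np.ones(d)
    for step in range(20_000):
        idx = rng.integers(0, N, size=100)                  # sample a minibatch
        Xb, yb = X[idx], y[idx]
        eps = rng.normal(size=d)                            # eps ~ N(0, I)
        w = mu + sigma * eps                                # reparameterized sample, S = 1
        # Gradient of log p(D, w) w.r.t. w: minibatch likelihood term rescaled to the
        # full dataset, plus a Gaussian prior with total precision N * lam.
        g = (N / len(Xb)) * Xb.T @ (yb - Xb @ w) - N * lam * w
        mu = mu + alpha * g                                 # mean update (as in the box)
        sigma = sigma + alpha * (g * eps + 1.0 / sigma)     # scale update + entropy gradient
        sigma = np.maximum(sigma, 1e-3)                     # keep sigma positive

    print(mu)      # close to the posterior mean, near [1.0, -2.0, 0.5]
    print(sigma)   # shrinks toward the posterior scale, roughly 1 / sqrt(N) here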
    

TODO: segue into the figure and algo box

Variational Inference
Now we fit a distribution over the weights to approximate the posterior.
What if you sampled from the learned distribution?
Training Loss
The loss surface for the variational network. The training loss is plotted on the left, the test loss on the right.
Test Loss

HERE ARE SOME IDEAS FOR EXTRA STUFF TO TALK ABOUT, SPACE PERMITTING

Automatic Differentiation Variational Inference

The reparameterization trick; ADVI; Bayes by Backprop

Connection to Dropout

Dropout can be seen as an approximation to stochastic variational inference, where we add weights that can turn off each unit entirely, and integrate out these weights.

Connection to Ensembles


Connection to Gradient Noise


Conclusion

TODO: OUTRO To recap, we started with a standard maximum-likelihood optimization procedure, which overfits sometimes. To fix this, we argued that we should instead integrate over all possible parameter values. However, because this was too expensive to do exactly, we turned this integration problem into an optimization problem, with extra noise. Why not just add noise to SGA in the first place, and skip all the math? [citations, e.g. dropout, and] The answer is that this theory tells us how to tune the amount of noise on a per-parameter basis, automatically. Also: Overfitting, small data, and uncertainty.

Acknowledgments

TODO: Write down our acknowledgments.

Many of our demos rely on Karpathy's ConvNetJS library, including the example code provided therein.

Author contribution

TODO: List the author contributions.