A. In what way is the Bayesian approach similar to model averaging? In what way is it different?

Answer notes: It is similar in that we combine the outputs of many different models to arrive at a final output. It is different in that the combination is a weighted average: each model (i.e. each setting of the weights) contributes in proportion to its posterior probability given the training data, rather than contributing equally.

B. Let's talk about Bayesian learning. When we add the right kind of partially random weight updates during "training", the weights never settle anywhere: they keep moving around the whole time. That is what gives us samples from the posterior distribution. Suppose we record such a sample after every 10,000 iterations. Even though the weights themselves do not converge, something else does converge. What converges, and what does it converge to?

Answer notes: The distribution of the sampled weight vectors converges to the posterior distribution over the weights. As a consequence, the averaged output of the sampled networks converges as well.

C. Here's a learning task. There is one input unit, and the input values are real-valued, drawn from the uniform distribution between 1.0 and 5.0. There are two logistic hidden units and one logistic output unit. The target output is 0 when the input is greater than 4.0 or less than 2.0, and 1 when the input is between 2.0 and 4.0.

1. Hand-design weights that will perform this task very well.
2. How well would that network work in the presence of 50% dropout on the hidden units?
3. What weights would work better in the presence of dropout?
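The distinction in part A can be made concrete with a toy sketch. The three predictions and the posterior probabilities below are invented numbers for illustration only:

```python
import numpy as np

# Hypothetical outputs of three trained models on the same test case,
# and assumed posterior probabilities of their weight settings (sum to 1).
preds = np.array([0.9, 0.4, 0.2])
posterior = np.array([0.7, 0.2, 0.1])

# Plain model averaging: every model counts equally.
simple_average = preds.mean()

# Bayesian averaging: each model is weighted by its posterior probability.
bayesian_average = np.dot(posterior, preds)

print(simple_average, bayesian_average)
```

Here the Bayesian average (0.73) is pulled toward the prediction of the model with the highest posterior probability, whereas the plain average (0.5) treats all three models alike.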
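The mechanism in part B can be sketched with Langevin-style updates on a toy one-dimensional problem. The "posterior over the weight" is taken to be a standard normal (an assumption, chosen so the target distribution is known in closed form); the step size and iteration counts are likewise illustrative:

```python
import numpy as np

def grad_log_posterior(w):
    # Gradient of log N(0, 1): the "gradient of the log posterior" for this toy.
    return -w

rng = np.random.default_rng(0)
eps = 0.1            # step size
w = 5.0              # deliberately poor starting weight
samples = []
for step in range(200_000):
    # Gradient step plus the right amount of noise: w never settles anywhere.
    w += 0.5 * eps * grad_log_posterior(w) + np.sqrt(eps) * rng.standard_normal()
    if step > 10_000 and step % 10 == 0:   # discard burn-in, then thin
        samples.append(w)

samples = np.array(samples)
print("sample mean:", samples.mean())   # close to 0, the posterior mean
print("sample std: ", samples.std())    # close to 1, the posterior std
```

The individual weight keeps wandering forever, but the distribution of the recorded samples settles down to (approximately) the posterior, which is exactly the convergence the question is after.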
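One workable hand design for part C.1 (an illustrative choice, not the only valid answer) uses a hidden unit that switches on near x = 2, a second that switches on near x = 4, and an output computing roughly h1 AND NOT h2. The sketch below also probes part C.2 by enumerating the four equally likely dropout masks:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden unit 1: nearly a step function at x = 2; hidden unit 2: at x = 4.
w1, b1 = 20.0, -40.0
w2, b2 = 20.0, -80.0
# Output is high only when h1 is on and h2 is off (h1 AND NOT h2).
v1, v2, c = 20.0, -20.0, -10.0

def net(x, m1=1.0, m2=1.0):
    """Forward pass; m1 and m2 are dropout masks on the hidden units."""
    h1 = m1 * logistic(w1 * x + b1)
    h2 = m2 * logistic(w2 * x + b2)
    return logistic(v1 * h1 + v2 * h2 + c)

xs = np.linspace(1.0, 5.0, 401)
targets = ((xs > 2.0) & (xs < 4.0)).astype(float)

# Without dropout the design is essentially perfect.
acc_full = np.mean((net(xs) > 0.5) == targets)
print("accuracy, no dropout:", acc_full)

# With 50% dropout on the hidden units, each mask is equally likely.
for m1, m2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    acc = np.mean((net(xs, m1, m2) > 0.5) == targets)
    print(f"mask (h1={m1}, h2={m2}): accuracy {acc:.2f}")
```

Dropping h2 alone makes the output fire for every x > 2 (wrong on (4, 5]); dropping h1 alone, or both, silences the output everywhere (wrong on the whole bump region). Averaged over the four masks, accuracy lands around 0.69, which is the point of C.2: this design only works when both hidden units are present at once.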