Suppose we have a Boltzmann Machine that has already been trained, so we are no longer changing the weights and biases; we just want to talk about the probabilities as they stand after training. Say we are interested in two configurations, A and B. Suppose the goodness of configuration A is 18 and the goodness of configuration B is 15. Because the probability of a configuration is proportional to the exponential of its goodness, and the goodness of A is 3 more than that of B, configuration A has the greater probability.

What is P(A) / P(B)?

P(A) / P(B) = exp(G(A) - G(B)) = exp(3)

Now we are told that the system has chosen a configuration X from its distribution P() over configurations (remember that it is more likely to have chosen configurations with greater goodness than configurations with smaller goodness). We are also told that the chosen configuration was either A or B, i.e. X=A or X=B. With that knowledge, the probability that it was A is P(X=A | X=A or X=B), and the probability that it was B is P(X=B | X=A or X=B). Clearly, those two must sum to 1.

What is the probability that configuration A was chosen, i.e. what is P(X=A | X=A or X=B)? And how does that depend on the difference in goodness between the two configurations, i.e. what if that difference were something other than 3?

Dividing numerator and denominator by P(B):

P(X=A | X=A or X=B)
  = P(A) / (P(A) + P(B))
  = (P(A)/P(B)) / (P(A)/P(B) + 1)
  = exp(3) / (exp(3) + 1)
  = 1 / (1 + exp(-3))
  = logistic(3)

In general, for any goodness difference:

P(X=A | X=A or X=B) = logistic(G(A) - G(B))

Example: A is the state 1,1,1 and B is the state 0,1,1.

Discussion questions:

- For Hopfield networks, we do "unlearning" by letting the system settle to an energy minimum and unlearning there. How is that similar to Boltzmann Machine learning? How is it different?
- In what sense are the hidden units of a Restricted Boltzmann Machine "feature detectors"? In what sense are they not?
- Why does having P(h) make the positive phase exact and tractable? Don't we need to sample any longer?
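The derivation above can be checked numerically. This is a minimal sketch: `logistic` and `p_choose_a` are illustrative helper names, not part of any library, and the goodness values 18 and 15 come from the worked example.

```python
import math

def logistic(x):
    """The standard logistic function, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def p_choose_a(goodness_a, goodness_b):
    """P(X=A | X=A or X=B) for a trained Boltzmann Machine,
    where P(config) is proportional to exp(goodness)."""
    return logistic(goodness_a - goodness_b)

# Worked example: G(A) = 18, G(B) = 15, so the difference is 3.
ratio = math.exp(18 - 15)      # P(A) / P(B) = exp(3)
p_a = p_choose_a(18, 15)       # logistic(3), roughly 0.9526

# The two forms in the derivation agree:
# exp(3) / (exp(3) + 1) == 1 / (1 + exp(-3))
assert abs(p_a - ratio / (ratio + 1.0)) < 1e-12
```

Note that only the goodness difference matters: shifting both goodnesses by the same constant leaves `p_choose_a` unchanged, which is why the answer depends only on G(A) - G(B).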