Bayesian Inference and Generative Classifiers
  
  
   Preliminaries
  
  
   The training data
  
  
   First, we introduce some notation. We have a dataset
   
    $D$
   
   . For example,
   
    $D$
   
   might be a dataset like
  
  
   
    $$D = \{ (\text{'homework'}, \text{'dog'}, spam), (\text{'cheap'}, \text{'college'}, spam), (\text{'tuition'}, \text{'university'}, \neg spam),...\}.$$
   
  
  
   Or the dataset might look like
  
  
   
    $$D = \{(1, 2), (3, 4), ...\}.$$
   
  
  
   In the context of supervised learning, we assume that for new data, we’ll observe some part of the data, and would have to predict some other part of the data. We’ll refer to data we observe as the
   
    $x$
   
   ‘s, and we’ll refer to data we don’t observe as the
   
    $y$
   
   ‘s. In general, for supervised learning, we’ll write the data set as
  
  
   
    $$D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), ...\}.$$
   
  
  
   In the spam examples, the
   
    $y$
   
   ‘s would be the
   
    $spam$
   
   or
   
    $\neg spam$
   
   parts of the data points.
  
  
   
    $\theta$
   
   : a vector that describes how the data works
  
  
   If we look at the data, we might form a hypothesis about how it works. We started out by thinking about that hypothesis as relating the
   
    $x$
   
   ‘s and the
   
    $y$
   
   ‘s:
  
  
   
    $$h_\theta (x) \approx y.$$
   
  
  
   For example, in the case of linear regression,
   
    $h_\theta (x) = \theta^T x.$
   
   and for neural networks
   
    $h_\theta (x) = net_\theta (x).$
   
   The idea is that we settle on some form of
   
    $h$
   
   or
   
    $net$
   
   and assume that there is
   
    some
   
   
    $\theta$
   
   that will work.
  
  
   In the first lecture, we said that we want to find a
   
    $\theta$
   
   that works the best, in some sense. For example, we might like to find a
   
    $\theta$
   
   that minimizes the the squared differences between
   
    $h_\theta(x^{(i)})$
   
   and
   
    $y^{(i)}$
   
   .
  
  
   Here is another view that we can take. Instead of simply saying that
   
    $h_\theta (x)$
   
   approximates
   
    $y$
   
   , we can imagine that
   
    $y$
   
   is randomly generated, according to what
   
    $x$
   
   and
   
    $\theta$
   
   are. In the case of linear regression, an assumption can be that
   
    $y$
   
   is generated according to
  
  
   
    $$y \sim N(\theta^T x, \sigma^2).$$
   
  
  
   Here,
   
    $\theta$
   
   still parametrizes a hypothesis about how the data works. Specifically, for a specific
   
    $\theta$
   
   , if we believe it, we’re saying that we believe the
   
    $y$
   
   ‘s were generated from the
   
    $x$
   
   using, for example,
   
    $y \sim N(\theta^T x, \sigma^2).$
   
   In other words, we are now defining
   
    $P_\theta (y|x)$
   
   similarly to how we previously defined
   
    $h_\theta (x)$
   
   .
  
  
   Maximum Likelihood and Maximum A-Posteriori
  
  
   Likelihood
  
  
   How to figure out what’s the best
   
    $\theta$
   
   ? One approach is to find the
   
    $\theta$
   
   for which the data is as plausible as possible. In other words, we want to find a
   
    $\theta$
   
   such that the probability of the data is as large as possible given
   
    $\theta$
   
   . That is, we want the
   
    $\theta$
   
   such that
   
    $P_\theta (y|x)$
   
   is as large as possible.
  
  
   Why does this make sense? because we’re trying to figure out what
   
    $\theta$
   
   is. If the data looks like it was generated using
   
    $y \sim P_\hat{\theta}(y|x)$
   
   , that’s a good reason to think that
   
    $\theta = \hat{\theta}$
   
   . We proved previously that if we assume a Gaussian likelihood, finding the theta that maximizes
   
    $\sum_i \log P_\theta (y^{(i)}|x^{(i)})$
   
   for
   
    $P_\theta (y^{(i)}|x^{(i)}) = N(\theta^T x; \sigma^2)$
   
   is the same as as minimizing the cost function
   
    $\sum_i (h_\theta (x^{(i)})-y^{(i)})^2$
   
   .
  
  
   We can write
   
    $P_\theta(y|x)$
   
   as
   
    $P(y|x, \theta)$
   
   instead. One way to look at it is to say that that’s just a change of notation. It says that we’re computing the probability of
   
    $y$
   
   differently depending on what
   
    $\theta$
   
   is. But another way of looking at it is that now, we’re thinking of
   
    $\theta$
   
   as just another kind of data. We just don’t happen to observe it.
  
  
   One last thing: there is a bit of awkwardness going on with the
   
    $x$
   
   s. We sometimes (for example, when doing linear regression using maximum likelihood) want to say that the
   
    $x$
   
   ‘s are fixed, so that
   
    $P(x=x_\text{observed}) = 1$
   
   . In that case,
   
    $P(y, x|\theta) = P(y|\theta, x)P(x) = P(y|\theta, x)$
   
   .
  
  
   In general, the likelihood of the training set
   
    $D$
   
   will be
  
  
   
    $$P(D|\theta).$$
   
  
  
   We will commonly assume that the training cases are all independent of each other, which means that we can write
  
  
   
    $$P(D|\theta) = \Pi_i P( (x^{(i)}, y^{(i)}) | \theta).$$
   
  
  
   Putting it all together: learning with Maximum Likelihood
  
  
   Learning with Maximum Likelihood means finding a
   
    $\theta_{ML} = \hat{\theta}$
   
   such that
   
    $P(D|\hat{\theta})$
   
   is maximized. In other words,
  
  
   
    $$\theta_{ML} = argmax_\theta P(D|\theta).$$
   
  
  
   We can then know the distirbution for the output for new data
   
    $P((y, x)|\hat{\theta})$
   
   . When
   
    $x$
   
   is fixed, the distribution of the output
   
    $y$
   
   is
   
    $P(y|\hat{\theta}, x)$
   
   . We can take
   
    $E[y|\hat{\theta}, x]$
   
   in order to predict a
   
    $y$
   
   given a new
   
    $x$
   
   . Note that, in the context of linear regression, this is equivalent to predicting
   
    $y\approx \theta^T x$
   
   .
  
  
   Posterior probability of
   
    $\theta$
   
  
  
   Perhaps we have a
   
    prior
   
   distribution on
   
    $\theta$
   
   . That means that, even if the data looks like it was generated using a particular
   
    $\theta'$
   
   , we might not believe this, since our prior belief in that
   
    $\theta'$
   
   is that it’s unlikely (i.e.,
   
    $P(\theta=\theta')$
   
   is small). That should influence what we should think the most likely
   
    $\theta$
   
   is after observing the data. We can formalize this by computing the
   
    posterior probability
   
   over
   
    $\theta$
   
   :
  
  
   
    $$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}  \propto P(D|\theta)P(\theta).$$
   
  
  
   (We can ignore the denominator since it does not depend on
   
    $\theta$
   
   .
  
  
   The larger
   
    $P(\theta' = \theta|D)$
   
   , the more we believe that the true
   
    $\theta$
   
   (the one that generated the data D) is equal to
   
    $\theta'$
   
   .
  
  
   Aside: the unicorn example
  
  
   Suppose that our
   
    $D$
   
   is a photo, and our
   
    $\theta$
   
   describes the scene. The
   
    $\theta$
   
   would contain information about the objects in the photo and their appearance.
   
    $P(D|\theta)$
   
   would be high if the pixels in the photo
   
    $D$
   
   match the information encoded in
   
    $\theta$
   
   , and low otherwise.
  
  
   (How would a
   
    $\theta$
   
   encode information about the contents of an image? Maybe
   
    $\theta_1$
   
   is 1 iff the image contains a horse,
   
    $\theta_2$
   
   is 1 iff the image contains a dog, …,
   
    $\theta_{1000}$
   
   is 1 iff the image contains a unicorn,
   
    $\theta_{100500}$
   
   is 1 iff the image contains an object that’s blue, …)
  
  
   What if the photo
   
    $D$
   
   appears to contain a unicorn? Then
   
    $P(D|\theta')$
   
   would be high for a
   
    $\theta'$
   
   that encodes information about the image like “a white unicorn at location (100, 20) in the image”.
  
  
   However, it might be that we don’t believe that genuine photos of unicorns exist, so our
   
    $P(\theta'|D)$
   
   should be 0. It will be, if we think that
   
    $P(\theta')=0$
   
   since
   
    $$P(\theta|D)  \propto P(D|\theta)P(\theta).$$
   
  
  
   Maximum A-Posteriori inference
  
  
   If we can obtain
   
    $P(\theta|D)$
   
   , it’s natural to want to find the
   
    most
   
   probably
   
    $\theta$
   
   , given the data that we’ve observed. That’s exactly what Maximum A-Posteriori (MAP) inference does:
  
  
   
    $$\theta_{MAP} = argmax_\theta P(\theta|D).$$
   
  
  
   We can use
   
    $\theta_{MAP}$
   
   for predicting new outputs the same way we used
   
    $\theta_{ML}$
   
   .
  
  
   Generative models for data: discrete case
  
  
   Suppose we know are performing some kind test for a cancer. The test can come back positive or negative.
  
  
   Suppose we know that:
   
   
    The test comes back positive in 98% of cases where the cancer is present
    
   
   The test comes back negative in 97% of cases where there is no cancer
   
   * 0.008 of the population has cancer (and the cancer screening is random; that is, it’s not that we’re testing people who are already sick)
  
  
   That means that we have a
   
    generative model
   
   for the data: we can generate “fake data” that looks just like real data by:
  
  
   - 
    Generating a patient
    
     - 
      Make the patient have cancer with
      
       $P=0.008$
      
     
- 
      If the patient was assigned cancer, return the test as positive with prob. 98%
     
- 
      Else return the test as positive with prob. 3%
     
 
   In the terms of the previous section, we can encode information how to generate new data in
   
    $\theta = (0.008, .98, .03)$
   
   .
  
  
   Using Bayes’ rule, we can now infer whether the patient has cancer if the test came back positive by first extracting information from
   
    $\theta$
   
   :
  
  
   
    $P(cancer) = 0.008$
   
   
   
    $P(+|cancer) = 0.98$
   
   
   
    $P(-|\neg cancer) = 0.97$
   
  
  
   and then using Bayes’ Rule and the Law of Total Probability:
  
  
   
    $$P(cancer|+) = \frac{P(+|cancer)P(cancer)}{P(+)} = \frac{P(+|cancer)P(cancer)}{P(+|cancer)P(cancer)+P(+|\neg cancer)P(\neg cancer)}.$$
   
  
  
   Learning a Generative Model
  
  
   In order to learn the generative model, we need to estimate the following:
  
  
   
    $P(cancer)$
   
   
   
    $P(+|cancer)$
   
   
   
    $P(-|\neg cancer)$
   
  
  
   We can do this using the training set:
  
  
   
    $P(cancer) \approx \frac{count(cancer)}{N}$
   
   
   
    $P(+|cancer) \approx \frac{count(cancer, +)}{count(cancer)}$
   
   
   
    $P(+|\neg cancer) \approx \frac{count(\neg cancer, +)}{count(\neg cancer)}$
   
  
  
   You can show that those are the Maximum Likelihood estimates in the same way that you would do that for ML estimates for the Bernoulli model
  
  
   Generative models for data: continous case
  
  
   Suppose that the test outputs are actual real numbers. Suppose the measurement is
   
    $t$
   
   . The generative model in this case will be
  
  
   
    $$t\sim N(t|\mu_{diagnosis}, \sigma _{diagnosis}^{2})$$
   
  
  
   (With the
   
    $\theta$
   
   for the model being
   
    $(\mu_{cancer}, \mu_{\neg cancer}, \sigma_{cancer}, ...)$
   
   .)
  
  
   We can generate new data similarly to what we did before.
  
  
   The probability that a person has cancer will now be
  
  
   
    $$P(cancer|t) =  \frac{P(t|cancer)P(cancer)}{P(t|cancer)P(cancer)+P(t|\neg cancer)P(\neg cancer)}.$$
   
  
  
   Learning a Generative Model
  
  
   The parameters
   
    $\theta = (\mu_{cancer}, \mu_{\neg cancer}, \sigma_{cancer}, ...)$
   
   can be learned using Maximum Likelihood, similarly to how they were learned in the case of linear regression.
  
  
   Classification of new instances
  
  
   So what’s
   
    $P(cancer|t, D)$
   
   ? (Assuming we don’t know the true
   
    $\theta$
   
   ). That is, what’s the probability that the patient has cancer, given the training data
   
    $D$
   
   and the test result
   
    $t$
   
   ?
  
  
   It’s not
   
    $P_{\theta_{MAP}}(cancer|t)$
   
   . Let’s do this carefully:
  
  
   
    $$P(cancer|t, D) = \sum_{\theta \in \Theta}P(cancer|t, theta)P(\theta| D) = \sum_{\theta \in \Theta}P_{\theta}(cancer|t)P(\theta| D).$$
   
  
  
   That is we are considering all the possible
   
    $\theta$
   
   s, compute the probability according to each of them, and weight them by how much we believe them to be the true
   
    $\theta$
   
   s.
  
  
   However,
   
    $P_{\theta_{MAP}}(cancer|t)$
   
   is not a horrible estimate here (it
   
    is
   
   the largest contributor to the sum above.)