A PROBABILITY ESTIMATION PROBLEM WITH BINARY DATA

To illustrate the mixture model software, I first generated a dataset
with binary data.  Each case was composed of ten binary attributes.
The distribution of these binary vectors was a mixture of four
component distributions, in each of which the ten attributes were
independent.  The four components each had probability 0.25 of being
chosen to generate a case.  The probabilities for each component for
each binary attributes were as shown below:

          1    2    3    4    5    6    7    8    9   10

     1   0.1  0.2  0.2  0.2  0.2  0.8  0.8  0.8  0.8  0.7
     2   0.1  0.8  0.8  0.8  0.8  0.2  0.2  0.2  0.2  0.7
     3   0.9  0.2  0.2  0.8  0.8  0.2  0.2  0.8  0.8  0.7
     4   0.9  0.8  0.8  0.2  0.2  0.8  0.8  0.2  0.2  0.7

Each row gives the probabilities of each of the attributes being 1 for
one component of the mixture.  The columns are for the ten binary
attributes in each case.  The vectors generated in this way can be
seen as coming from one of four "patterns": 0000011111, 0111100001,
1001100111, and 1110011001, but with each bit of the chosen pattern
having a small probability of being switched (ranging from 0.1 to 0.3)
in any particular case.

I generated 1000 cases from this distribution, which are stored in the
file 'xbdata'.  The first 500 are used for training (the others are not
used at all at the moment).


A finite mixture model for the binary data.

We will first see what happens when we model this data as a mixture of
four components, which is the true number of components for this
problem.  Each component of the mixture will give each target
attribute a certain probability of being 1 rather than 0.  These
probabilities are determined from the "offset" parameters for each
vtarget, by means of the "logistic" function:

   Pr(target=1)  =  1 / (1 + exp(-offset))

The offset parameters for each attribute for each component are given
Gaussian prior distributions, with means and standard deviations that
are hyperparameters.  These hyperparameters could be fixed, but here
we will make them variable, separately for each target attribute, but
with each of them linked to a top-level hyperparameter, common to all
targets.

We set up this mixture model with the 'mix-spec' command, as follows:

    > mix-spec xblog.4 0 10 4 / x1 0.05:0.5:0.2 10

Here, "xblog.4" is the name of the log file used for this run, which
is created by the 'mix-spec' command.  The arguments of 'mix-spec'
that follow the name of the log file are the number of input variables
(always 0 for mixture models at present), the number of target
attributes (10 for this problem), and the number of mixture components
(set to the true value of 4 in the command above).  The specifications
for priors are given following these arguments, after a "/" separator.

The first prior specification gives the concentration parameter of the
Dirichlet distribution for the mixing proportions for the components.
If this specification starts with "x", this value is automatically
divided by the number of components, which is the scaling required for
the model to reach a sensible limit as the number of components goes
to infinity.  The specification of "x1" above mildly favours unequal
mixing proportions (recall that the true proportions are equal).

The second argument after the "/" specifies the prior for the standard
deviations of the "offsets" for components (which determine the
probabilities for the binary target attributes).  In the hierarchical
scheme specified here (described in general in prior.doc), there is a
top-level standard deviation hyperparameter, given a prior in which
the corresponding "precision" (standard deviation to the power -2) has
a gamma distribution with mean 0.05 and shape parameter 0.5.  There is
a lower-level standard deviation hyperparameter for each target; for
each of these, the corresponding precision has a gamma distribution
with mean given by the higher-level precision, and shape parameter of
0.2.  Both these priors are rather vague (but the shape parameter of
0.2 is vaguer than 0.5), so these hyperparameters can adapt to the
structure of the data.

The last argument of 'mix-spec' is the standard deviation for the
means of the offsets for each of the targets, here set to 10.
Currently, this standard deviation for the mean offset is the same for
all targets, and is fixed in the 'mix-spec' command (it cannot be a
variable hyperparameter).

We can look at these specifications by calling 'mix-spec' with just
the log file as an argument:

    > mix-spec xblog.4

    Number of inputs:      0
    Number of targets:     10

    Number of components:  4

    Dirichlet concentration parameter: x1.000
    Prior for SD hyperparameters:       0.050:0.50:0.20
    Prior for mean component offsets:   10.000

After the 'mix-spec' command, we need to specify that the targets are
binary using the 'model-spec' command, as follows:

    > model-spec xblog.4 binary

We can then specify where the data comes from using 'data-spec':

    > data-spec xblog.4 0 10 2 / xbdata@1:500 . 

The number of inputs is here specified to be 0 (as it must be at
present for mixture models), and the number of targets is specified to
be 10.  These must match the values given to 'mix-spec'.  We also
specify that the targets must be the integers 0 or 1 putting in a
following argument of 2 (meaning that there are 2 possible values,
which start at 0).  After a slash, we say where the inputs and targets
come from.  Note that we have to say where the inputs come from even
though there are 0 of them, but fortunately, we can then say that the
targets come from the same place by just using the "." argument.  In
the specification above, the data comes from the first 500 lines of
the file 'xbdata', with one case per line.  See data-spec.doc for more
details.

Finally, we need to specify how Markov chain sampling is to be done
for this model.  At present, none of the standard Markov chain methods
are allowed for mixture models, only the specialized procedures
documented in mix-mc.doc.  Each of these procedures updates one of the
three parts of the state - the values of the hyperparameters, the
values of the parameters for the various mixture components, and the
values of the indicator variables saying which mixture component is
currently associated with each training case.  The values in each of
these parts of the state can be updated by Gibbs sampling, using the
corresponding procedure.  The following call of 'mc-spec' sets this
up:

    > mc-spec xblog.4 repeat 20 gibbs-indicators gibbs-params gibbs-hypers

Here, the three Gibbs sampling operations are repeated 20 times each
iteration, just to cut down on the number of states saved in the log
file.

We can now run the Markov chain simulation for 100 interations with the
following command:

    > mix-mc xblog.4 100

The simulation starts with a state in which all the training cases are
associated with the same component, whose parameters are set to their
means, as specified by hyperparameter values that are also set to
their prior means.  The 'mix-gen' program could be used to initialize
things diffferently (see mix-gen.doc).

The above run takes about one minute on our SGI machine.  It can be
run in the background, in which case progress can be monitored with
the 'mix-display' and 'mix-plt' programs.  For example, after a few
seconds, we might see the following:

    > mix-display xblog.4

    MIXTURE MODEL IN FILE "xblog.4" WITH INDEX 17

    HYPERPARAMETERS

    Standard deviations for component offsets:

        0.198:     4.056    0.718    1.579    1.702    1.830
                   1.548    2.504    1.957    4.120    0.184

    Means for component offsets:

                  +1.816   +0.022   +0.182   +0.037   +0.500
                  +1.322   -0.018   +1.375   +2.802   +0.990


    PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE

       1: 0.310   -2.451   +0.738   +1.342   +1.093   +0.605
                  -0.960   -1.146   -0.441   -0.798   +0.903

       2: 0.250   -2.121   -1.530   -1.999   -0.917   -1.393
                  +1.727   +1.862   +1.603   +3.118   +1.101

       3: 0.238   +3.950   +0.915   +1.372   -1.960   -0.940
                  +1.307   +1.793   -1.268   -2.102   +1.035

       4: 0.202   +2.710   -0.978   -1.015   +1.900   +1.915
                  -1.802   -2.903   +1.268   +1.452   +0.827

This displays the state at the last iteration computed so far (here
17).  The hyperparameter values are shown first, with top-level
hyperparameters on the left, before the colon.  The lower-level
hyperparameters follow, ordered by the target attribute that they
pertain to.  Values for the component parameters follow, sorted by the
fraction of training cases that they are currently associated with.
These fractions are the first numbers after 1:, 2:, etc.; note that
the actual mixing probabilities are not explicitly represented, but
are instead always integrated over.  Any components that are not
associated with any training case are omitted from the output, and are
not explicitly represented in the state.  The other numbers shown for
each component are the "offset" parameters for each target attribute.

In this example, the simulation appears to have learned the essence of
the distribution by the iteration shown above.  Recall that a positive
offset corresponds to the probability of a 1 being greater than 1/2, a
negative offset to the probability of a 1 being less than 1/2.  The
four components shown above can thus be seen to correspond to the four
patterns used to generate the cases, as described earlier.  Note that
the last of the offset standard deviation hyperparameters (for the
last target attribute) is quite small.  This indicates that the model
has "learned" that the probabilities for the last target are the same
for all components, and hence can be modeled most efficiently at the
hyperparameter level, by the mean for that offset, rather than
separately for each component.

The indicators of which components are associated with each training
case can also be examined with 'mix-display', and of course iterations
other than the last can be viewed.  See mix-display.doc for details.

One can also look at the progress of the simulation using 'mix-plt'.
For example, the following will produce a time-plot of the cumulative
proportions of the training cases associated with the four components
(as sorted by decreasing frequency):

    > mix-plt t C1C2C3C4 xblog.4 | plot

Here, 'plot' is some suitable plotting program.  One can also just let
the output of 'mix-plt' go to standard output, and then examine the
numbers manually.  Other quantities can also be plotted, as described
in mix-quantities.doc.

As well as examining quantities with 'mix-display' and 'mix-plt', one
can also produce a sample of cases from the predictive distribution
using 'mix-cases'.  Unfortunately, these are about the only things one
can do with the results at the moment.  There are no programs that
find predictive probabilities, fill in missing values in partial
cases, or do any of the other useful things that are quite possible
with these models, but which haven't been implemented.  See
mix-extensions.doc for a wish list of such facilities.


A countably infinite mixture model for the binary data.

Even though the true distribution for this example is a mixture of
four components, good results can nevertheless be obtained using a
mixture with a countably infinite number of components.  The prior for
mixture proportions used with such a model is designed so that a few
of the infinite number of components have substantial probability, so
the model does not result in an "overfitted" solution, in which every
training case is "explained" as coming from a different component.

We specify a countably infinite mixture model by simply omitting the
argument to 'mix-spec' that gives the number of components.  For this
example, we change the 'mix-spec' command used above to the following:

    > mix-spec xblog.inf 0 10 / x1 0.05:0.5:0.2 10

The Dirichlet prior specification for mixing proportions of "x1" is
the same as for the mixture model with four components.  The "x"
specifies scaling downward with the number of components, which
produces a sensible limit as the number of components goes to
infinity, as here.  The "x" is therefore required for infinite
mixtures.

The other arguments of 'mix-spec' are as described for the finite
mixture model.  The 'model-spec' and 'data-spec' commands used are
also the same:

    > model-spec xblog.inf binary
    > data-spec xblog.inf 0 10 2 / xbdata@1:500 . 

We must change the 'mc-spec' command, however, since it is impossible
to do Gibbs sampling for the component indicators associated with
training cases - since their number is infinite, we can't compute all
the required conditional probabilities.  However, we can instead use a
specialized Metropolis-Hastings update, in which a new component to go
with a training case is proposed with probability determined by the
frequencies with which components are associated with other training
cases.  The proposal is accepted or rejected based on the resulting
change in likelihood.  The process can then be repeated any desired
number of times, to produce an approximation to Gibbs sampling.  With
this change, we can use the following 'mc-spec' command for this model:

    > mc-spec xblog.inf repeat 20 met-indicators 10 gibbs-params gibbs-hypers

This is the same as for the finite mixture model, except that 10
repetitions of the Metropolis-Hastings update for the indicators are
specified using 'met-indicators 10', in place of 'gibbs-indicators'.

As before, we can now run the simulation with a command such as:

    > mix-mc xblog.inf 100

We can examine the states with 'mix-display'.  For example, once the
'mix-mc' command completes, the state should be something like the
following:

    > mix-display xblog.inf

    MIXTURE MODEL IN FILE "xblog.inf" WITH INDEX 100
    
    HYPERPARAMETERS
    
    Standard deviations for component offsets:
    
        0.370:     2.876    0.821    1.766    3.086    2.065
                   2.173    4.980    1.545    1.297    0.233
    
    Means for component offsets:
    
                  -1.920   -0.036   -2.914   -1.415   -0.403
                  +0.485   -0.998   +1.149   -0.085   +1.097
    
    
    PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE
    
       1: 0.322   -1.435   +0.889   +1.286   +1.482   +0.942
                  -1.522   -0.981   -0.437   -0.847   +1.161
    
       2: 0.250   +1.802   +0.937   +2.023   -2.592   -0.999
                  +1.967   +1.648   -1.689   -1.428   +0.834
    
       3: 0.224   -1.730   -1.553   -2.590   -1.424   -1.497
                  +2.323   +1.774   +2.095   +2.811   +1.261
    
       4: 0.096   +1.214   -0.703   -1.050   +2.644   +1.141
                  -0.786   -7.497   +0.354   +0.257   +1.109
    
       5: 0.088   +3.275   -1.625   -2.087   +1.559   +2.855
                  -2.353   -5.211   +2.406   +1.126   +0.711
    
       6: 0.006   -1.316   +0.242   -2.282   -0.770   -2.892
                  -0.076   -6.411   +0.558   -1.696   +0.992
    
       7: 0.004   -2.567   +0.123   -5.843   -0.176   -0.657
                  -1.567   -8.938   +1.448   -1.590   +1.458
    
       8: 0.004   -4.741   +0.354   -3.319   +2.514   -0.216
                  -2.402   -2.348   +2.054   -0.146   +1.306
    
       9: 0.004   -3.308   -0.998   -1.107   -2.161   -0.568
                  +2.066   -1.865   +1.931   +0.477   +1.473
    
      10: 0.002   -0.049   -0.706   -6.785   -7.624   -0.853
                  +0.533   -3.103   +2.073   +0.377   +1.293

The output is similar to that seen for the mixture model with four
components, except that ten components are associated with at least
one training case at iteration 100, as shown above.  (The infinite
number of remaining components are not explicitly represented, and are
not displayed by 'mix-display'.)  Many of the then components are
associated with very few training cases, however (as seen from the
fractions of the training set after the component number).  This is to
be expected when the true distribution can be expressed using only
four components.  The number of such low-frequency components will
vary from iteration to iteration, as will other aspects of the state,
in accordance with the variability in the posterior distribution.