### Approximate Learning of Large-Scale Graphical Models: Theory and Applications

**Organizers:**

- Ruslan Salakhutdinov , CSAIL, MIT, rsalakhu at mit dot edu
- Amir Globerson , The Hebrew University of Jerusalem, gamir at cs dot huji dot ac dot il
- David Sontag , CSAIL, MIT, dsontag at mit dot edu

**Motivation and Topics:**

Undirected graphical models provide a powerful framework for representing dependency structure between random variables. Learning the parameters of undirected models plays a crucial role in solving key problems in many machine learning applications, including natural language processing, visual object recognition, speech perception, information retrieval, computational biology, and many others.

Learning in undirected graphical models of large treewidth is difficult because of the hard inference problems it induces: computing the partition function for maximum likelihood learning, or finding the MAP assignment for margin-based loss functions. Over the last decade, there has been considerable progress in developing algorithms for approximating the partition function and the MAP assignment, both via variational approaches (e.g., belief propagation) and sampling algorithms (e.g., MCMC). More recently, researchers have begun to apply these methods to learning large, densely-connected undirected graphical models that may contain millions of parameters. A notable example is the learning of Deep Belief Networks and Deep Boltzmann Machines, which employ MCMC-based strategies to greedily learn deep hierarchical models.
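To make the MCMC-based learning strategy concrete, the standard workhorse here is contrastive divergence (CD-1), which sidesteps the intractable partition-function gradient by running a single step of block Gibbs sampling from the data. The following is a minimal NumPy sketch of one CD-1 update for a binary RBM; the toy sizes and learning rate are illustrative choices, not drawn from any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    v0: batch of visible vectors (n, n_vis); W: (n_vis, n_hid) weights;
    b, c: visible and hidden biases. Returns updated parameters.
    """
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of block Gibbs sampling.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    n = v0.shape[0]
    # Difference of data and one-step model statistics approximates
    # the log-likelihood gradient without ever computing Z.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy usage: 6 visible units, 3 hidden units, random binary data.
n_vis, n_hid = 6, 3
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b = np.zeros(n_vis)
c = np.zeros(n_hid)
data = (rng.random((20, n_vis)) < 0.5).astype(float)
for _ in range(100):
    W, b, c = cd1_step(data, W, b, c)
```

In greedy deep learning, an RBM trained this way provides the first layer, its hidden activations become the "data" for the next RBM, and so on up the stack.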

The goal of this workshop is to assess the current state of the field and explore new directions in both theoretical foundations and empirical applications. In particular, we shall be interested in discussing the following topics:

- **State of the field:** What are the existing methods, and what is the relationship between them? Which problems can be solved with existing algorithms, and which cannot?
- **The use of approximate inference in learning:** There are many algorithms for approximate inference, and in principle any of them can be plugged into a learning algorithm. What are the relative merits of one approximation over another (e.g., an MCMC approximation vs. a variational one)? Are there effective combined strategies?
- **Learning with latent variables:** Graphical models with latent (hidden) variables often possess more expressive power than models with only observed variables. However, introducing hidden variables makes learning far more difficult. Can we develop better optimization and approximation techniques that would allow us to learn the parameters of such models more efficiently?
- **Learning in models with deep architectures:** Recently, there has been notable progress in learning deep probabilistic models, including Deep Belief Networks and Deep Boltzmann Machines, which contain many layers of hidden variables and millions of parameters. The success of these models relies heavily on greedy layer-by-layer unsupervised learning of a densely-connected undirected model called a Restricted Boltzmann Machine (RBM). Can we develop more efficient and accurate learning algorithms for RBMs and deep multilayer generative models? How can learning be extended to the semi-supervised setting and made more robust to highly ambiguous or missing inputs? What theoretical guarantees can be obtained for such greedy learning schemes?
- **Scalability and success in real-world applications:** How well do existing approximate learning algorithms scale to large-scale problems in computer vision, bioinformatics, natural language processing, and information retrieval? How well do these algorithms perform when applied to modeling high-dimensional real-world distributions (e.g., the distribution of natural images)?
- **Theoretical foundations:** What guarantees can be given for approximate learning algorithms (e.g., accuracy of the learned parameters relative to the best possible, or asymptotic guarantees such as almost-sure convergence to the maximum likelihood estimator)? What are the trade-offs between running time and accuracy?
- **Loss functions:** In the supervised learning setting, two popular loss functions are the log-loss (e.g., in conditional random fields) and margin-based loss (e.g., in maximum margin Markov networks). In intractable models these approaches result in rather different approximation schemes, since the former requires estimating the partition function whereas the latter requires only MAP estimates. What can be said about the differences between these schemes? When is one model more appropriate than the other? Can margin-based models be applied in the unsupervised case?
- **Structure vs. accuracy:** Which graph structures are more amenable to approximation, and why? How can structure learning be combined with approximate learning to yield models that are both descriptive and learnable with good accuracy?
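The contrast in the loss-functions topic, that the log-loss gradient needs the partition function while the margin-based (sub)gradient needs only a loss-augmented MAP assignment, can be seen on a model small enough to enumerate exactly. The three-variable model and feature map below are invented purely for illustration.

```python
import itertools
import numpy as np

# Tiny pairwise model over 3 binary variables: score(y; w) = w . phi(y).
def phi(y):
    y1, y2, y3 = y
    # Unary indicators plus agreement features on the two edges.
    return np.array([y1, y2, y3, float(y1 == y2), float(y2 == y3)])

states = list(itertools.product([0, 1], repeat=3))  # all 8 joint states

def log_loss_grad(w, y_obs):
    # Gradient of -log p(y_obs): requires the partition function,
    # here computed by brute-force enumeration of every joint state.
    scores = np.array([w @ phi(y) for y in states])
    p = np.exp(scores - scores.max())
    p /= p.sum()
    expected = sum(pi * phi(y) for pi, y in zip(p, states))
    return expected - phi(y_obs)

def margin_grad(w, y_obs):
    # Structured-hinge subgradient: requires only a single
    # loss-augmented MAP assignment, never the partition function.
    def aug_score(y):
        hamming = sum(a != b for a, b in zip(y, y_obs))
        return w @ phi(y) + hamming
    y_hat = max(states, key=aug_score)
    return phi(y_hat) - phi(y_obs)

w = np.zeros(5)
y_obs = (1, 1, 0)
print(log_loss_grad(w, y_obs))
print(margin_grad(w, y_obs))
```

With large treewidth, the enumeration in `log_loss_grad` is exactly what becomes intractable and must be replaced by a variational or MCMC estimate of the partition function, whereas `margin_grad` only needs an (approximate) MAP solver.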

This workshop aims to build on the success of the earlier NIPS workshop "Approximate inference: how far have we come?", but with a specific emphasis on the learning aspect. Through a series of invited talks and a panel discussion, the workshop is intended to bring together machine learning researchers working on approximate inference and learning, assess the current state of the field, discuss key challenges, and identify promising directions for future investigation.