In a recent article, El Karoui et al. study the distribution of robust regression estimators in the regime in which the number of parameters p is of the same order as the number of samples n. Using numerical simulations and ‘highly plausible’ heuristic arguments, they unveil a striking new phenomenon. Namely, the regression coefficients contain an extra Gaussian noise component that is not explained by classical concepts such as the Fisher information matrix. We show here that that this phenomenon can be characterized rigorously techniques that were developed by the authors to analyze the Lasso estimator under high-dimensional asymptotics. We introduce an approximate message passing (AMP) algorithm to compute M-estimators and deploy state evolution to evaluate the operating characteristics of AMP and so also M-estimates. Our analysis clarifies that the ‘extra Gaussian noise’ encountered in this problem is fundamentally similar to phenomena already studied for regularized least squares in the setting n<p. (Based on joint work with David Donoho.)

In this talk, I will introduce a new computational paradigm where probabilistic reasoning is achieved by solving a small number of randomly perturbed instances of a combinatorial optimization problem (MAP inference) closely related to the maximum-likelihood decoding problem. Exploiting properties of universal hashing, the new approach provides formal guarantees on the accuracy of the results. Based on the theory of low-density parity check codes, I will introduce several approaches to solve, approximate, and relax the underlying combinatorial optimization problem, leading to a new family of inference algorithms that outperforms traditional variational and sampling methods in a range of domains, in particular those involving combinations of probabilistic and deterministic constraints. This ability to handle both probabilistic and deterministic constraints creates new opportunities for including prior knowledge into probabilistic models. As an example, I will discuss a scientific data analysis application in the emerging area of Computational Sustainability where we greatly improved the quality of the results by incorporating prior background knowledge of the physics of the system into the model.

Follow the perturbed leader and other such randomized online learning algorithms have enjoyed success especially due to their computational efficiency and simple form. However the specific problems for which such perturbation based algorithms have been previously provided and their analysis have been rather case specific and fail to shed light on the general idea behind why such algorithms should work in general. In this talk I shall describe generalized conditions under which one can develop randomized algorithms based on perturbations for online learning problems. Under this general condition I shall describe a generic recipe for developing randomized algorithms for online algorithms. We shall see how the follow the perturbed leader algorithm for online linear optimization is in fact a special case of an algorithm that can be developed from the generic recipe. Through the methodology we shall also see how whenever online learning and statistical learning rates match one can expect to design perturbation based algorithms. Going well beyond online linear optimization we shall also see how one can derive perturbation based algorithms from the recipe for several new online learning problems like online matrix completion, time series prediction models, online link classification etc.

In this talk, we explore feature noising schemes such as dropout training. We show that under a local approximation, feature noising is equivalent to a regularizer that favors predictions which are both (i) confident under the model and (ii) stable under feature noise. We show how our methods can be extended to structured prediction and semi-supervised learning, yielding improved accuracies on several NLP tasks. We conclude with some preliminary analysis of the generalization properties of dropout training. Joint work with Stefan Wager, Sida Wang, Mengqiu Wang, Chris Manning.

The PAC-Bayesian theorem governs the generalization error of a model-perturbation predictor — a predictor based on a stochastic draw of model parameters. Although model perturbation is central to PAC-Bayesian theory, PAC-Bayesian theory does not use model perturbation directly to model conditional density. A PAC-Bayesian consistency theorem typically has the perturbation size go to zero as the amount of training data goes to infinity even if the true conditional distribution of label given input is broadly distributed. This talk will compare and contrast the two motivations for model perturbation — generalizations bounds and conditional density modeling — and consider the extent to which these motivations reinforce each other.

We address the problem of efficiently obtaining pixel-accurate boundaries for objects in images using an interactive framework and MAP inference. For this purpose we develop new bounds to the entropy function that are based on the expected amount of perturbations to the potential functions that do not change MAP decisions. By reasoning about the boundary of MAP decisions we actively select regions for further refinement in a multi-scale manner that increase the annotation certainty at minimal cost. We empirically show that our framework achieves accurate annotations much faster than traditional tools.