# OPT 2010: 3rd International Workshop on Optimization for Machine Learning

## Accepted Papers

**Information-theoretic lower bounds on the oracle complexity of sparse convex optimization**

*Alekh Agarwal, Peter Bartlett, Pradeep Ravikumar, Martin Wainwright*

Relative to the large literature on upper bounds on complexity of convex optimization, lesser attention has been paid to the fundamental hardness of these problems. Recent years have seen a surge in optimization methods tailored to sparse optimization problems. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation, when the objective is optimized at a sparse vector in a high dimensional space. Our result is matched by an appropriately tuned method of mirror descent, establishing the minimiax optimality of the result.

**An efficient algorithm for sparse PCA**

*Yunlong He, Renato Monteiro, Haesun Park*

Sparse principal component analysis(PCA) imposes extra constraints or penalty terms to the standard PCA to achieve sparsity. In this paper, we introduce an efficient algorithm for finding a single sparse principal component (PC) with a specified cardinality. The algorithm consists of two stages. In the first stage, it identifies an active index set with a desired cardinality corresponding to the nonzero entries of the PC. In the second one, it uses the power iteration method to find the best direction with respect to the active index set. Experiments on randomly generated data and real-world datasets show that our algorithm is very fast, especially on large and sparse data sets, while the numerical quality of the solution is comparable to the state-of-art algorithm.

**Hierarchical Classification via Orthogonal Transfer**

*Dengyong Zhou, Lin Xiao, Mingrui Wu*

We consider multiclass classification problems in which the set of labels are organized hierarchically as a category tree, and the examples are classified recursively from the root to the leaves. We propose a hierarchical support-vector-machine that encourages the classifiers at each node of the tree to be different from the classifiers at its ancestors. More specifically, we introduce regularizations that force the normal vector of the classifying hyperplane at each node of the tree to be orthogonal to those at its ancestors as much as possible. We establish sufficient conditions under which such an objective is a convex function of the normal vectors. We also present an efficient dual-averaging method for solving the resulting nonsmooth convex optimization problem. We evaluate the method on a number of real-world text categorization tasks and obtain state-of-the-art performance.

**An Incremental Subgradient Algorithm for Approximate MAP Estimation in Graphical Models**

*Jeremy Jancsary, Gerald Matz, Harald Trost*

We present an incremental subgradient algorithm for approximate computation of maximum-a-posteriori (MAP) states in cyclic graphical models. Its most striking property is its immense simplicity: each iteration requires only the solution of a sequence of trivial optimization problems. The algorithm can be equally understood as a degenerated dual decomposition scheme or as minimization of a degenerated tree-reweighted upper bound and assumes a form that is reminiscent of message-passing. Despite (or due to) its conceptual simplicity, it is equipped with important theoretical guarantees and exposes strong empirical performance.

**Augmenting Dual Decomposition for MAP Inference**

*Andre Martins, Mario Figueiredo, Noah Smith, Pedro Aguiar, Eric Xing*

In this paper, we propose combining augmented Lagrangian optimization with the dual decomposition method to obtain a fast algorithm for approximate MAP (maximum a posteriori) inference on factor graphs. We also show how the proposed algorithm can efficiently handle problems with (possibly global) structural constraints. The experimental results reported testify for the state-of-the-art performance of the proposed approach.

**Statistical Linearization for Value Function Approximation in Reinforcement Learning**

*Matthieu Geist*

Reinforcement learning (RL) is a machine learning answer to the optimal control problem. It consists in learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. An important RL subtopic is to approximate this function when the system is too large for an exact representation. This paper presents statistical-linearization-based approaches to estimate such functions. Compared to more classical approaches, this allows considering nonlinear parameterizations as well as the Bellman optimality operator, which induces some differentiability problems. Moreover, the statistical point of view adopted here allows considering colored observation noise models instead of the classical white one; in RL, this can provide useful.

**Gradient-type methods for primal SVM model selection**

*Gregory Moore, Charles Bergeron, Kristin Bennett*

Selection of multiple SVM model hyperparameters by cross-validation can be expressed as a bilevel optimization problem. Explicit methods optimize the model parameters and hyperparameters simultaneously. In implicit methods, the model parameters are considered to be implicit functions of the hyperparameters. Recently, gradient-based methods have emerged as an efficient way to optimize the hyperparameters. In this work, we examine both implicit and explicit model selection algorithms for linear SVM-type machine learning models expressed as nonsmooth unconstrained optimization problems. A key point is that the underlying optimization problems are nonsmooth and nonconvex, so set-valued subgradients must be used. Nonsmooth nonconvex optimization techniques can lead to scalable model selection algorithms but appropriate choice and use of subgradients is essential for good performance. A new nonconvex implicit bundle method is developed and compared computationally to recent nonsmooth implicit and explicit gradient methods and grid search. All of the gradient methods out perform grid search. The subgradients calculated using the simple implicit method may not yield good directions, leading to algorithm failures. Smaller datasets can benefit from the implicit bundle algorithms that have specialized strategies for calculating effective subgradients. The well-founded explicit method consistently provides robust solutions for all size problems but at greater computational expense for larger problems. Problems with massive training data can be frequently solved with the simplified implicit subgradient algorithm.

**An Optimization Based Framework for Dynamic Batch Mode Active Learning**

*Shayok Chakraborty, Vineeth Balasubramanian, Sethuraman Panchanathan*

Active learning techniques have gained popularity in reducing human effort to annotate data instances for inducing a classifier. When faced with large quantities of unlabeled data, such algorithms automatically select the salient and representative samples for manual annotation. Batch mode active learning schemes have been recently proposed to select a batch of data instances simultaneously, rather than updating the classifier after every single query. While numerical optimization strategies seem a natural choice to address this problem (by selecting a batch of points to ensure that a given objective criterion is optimized), many of the proposed approaches are based on greedy heuristics. Also, all the existing work on batch mode active learning assume that the batch size is given as an input to the problem. In this work, we propose a novel optimization based strategy to dynamically decide the batch size as well as the specific points to be queried, based on the particular data stream in question. Our results on the widely used VidTIMIT and the MBGC biometric datasets corroborate the efficacy of the framework to adaptively identify the batch size and the particular data points to be selected for manual annotation, in any batch mode active learning application.