Journal ArticleDOI

Dual coordinate descent methods for logistic regression and maximum entropy models

01 Oct 2011 - Machine Learning (Springer US) - Vol. 85, Iss. 1, pp. 41-75
TL;DR: This paper applies coordinate descent methods to solve the dual form of logistic regression and maximum entropy, and shows that many details are different from the situation in linear SVM.
Abstract: Most optimization methods for logistic regression or maximum entropy solve the primal problem. They include iterative scaling, coordinate descent, quasi-Newton, and truncated Newton methods. Less effort has been made to solve the dual problem. In contrast, for linear support vector machines (SVM), dual methods have been shown to be very effective. In this paper, we apply coordinate descent methods to solve the dual form of logistic regression and maximum entropy. Interestingly, many details are different from the situation in linear SVM. We carefully study the theoretical convergence as well as numerical issues. The proposed method is shown to be faster than most state-of-the-art methods for training logistic regression and maximum entropy.
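To make the dual approach concrete, here is a minimal NumPy sketch (not the authors' implementation) of coordinate descent on the dual of L2-regularized logistic regression: each dual variable is refined by a few clipped Newton steps on its one-variable sub-problem while the primal vector w is maintained incrementally. The function name dual_cd_logreg, the C/2 initialization, and the fixed iteration counts are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dual_cd_logreg(X, y, C=1.0, outer_iters=50, newton_iters=20, eps=1e-12):
    """Coordinate descent on the dual of L2-regularized logistic regression:
        min_a  0.5 * a'Qa + sum_i [a_i log a_i + (C - a_i) log(C - a_i)]
        s.t.   0 < a_i < C,  with  Q_ij = y_i y_j x_i'x_j.
    The primal solution is recovered as w = sum_i a_i y_i x_i."""
    n, _ = X.shape
    alpha = np.full(n, 0.5 * C)            # start strictly inside (0, C)
    w = X.T @ (alpha * y)                  # maintain w = sum_i a_i y_i x_i
    Qii = np.einsum("ij,ij->i", X, X)      # Q_ii = ||x_i||^2

    for _ in range(outer_iters):
        for i in np.random.permutation(n):
            # One-variable sub-problem in z = alpha_i (constants dropped):
            #   g(z) = 0.5*Qii*z^2 + b*z + z*log z + (C - z)*log(C - z)
            b = y[i] * (X[i] @ w) - Qii[i] * alpha[i]
            z = alpha[i]
            for _ in range(newton_iters):          # clipped Newton on g'(z) = 0
                g1 = Qii[i] * z + b + np.log(z / (C - z))
                if abs(g1) < 1e-8:
                    break
                g2 = Qii[i] + C / (z * (C - z))    # g''(z) > 0
                z = np.clip(z - g1 / g2, eps, C - eps)
            w += (z - alpha[i]) * y[i] * X[i]
            alpha[i] = z
    return w

# Toy usage on synthetic data; labels must be +1/-1.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.1)
w = dual_cd_logreg(X, y)
```

The paper's actual algorithm uses a more carefully safeguarded modified Newton step; the hard clipping to (eps, C - eps) above is only a simplification that keeps the logarithmic terms finite.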


Citations
Journal Article
TL;DR: LIBLINEAR is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines and provides easy-to-use command-line tools and library calls for users and developers.
Abstract: LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
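For readers who want to try the dual logistic regression solver without the native LIBLINEAR command-line tools, scikit-learn's 'liblinear' solver wraps the same library; a minimal usage sketch, with an arbitrary synthetic dataset and default parameters chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# dual=True selects the dual formulation of L2-regularized logistic regression,
# which is only supported by the liblinear back end.
clf = LogisticRegression(penalty="l2", dual=True, solver="liblinear", C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```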

7,848 citations


Cites methods from "Dual coordinate descent methods for..."

  • ...L2-regularized Logistic Regression (Solving Dual) See Yu et al. (2011) for details of a dual coordinate descent method....


  • ...See Yu et al. (2011) for details of a dual coordinate descent method....


  • ...Appendix I. L2-regularized Logistic Regression (Solving Dual) See Yu et al. (2011) for details of a dual coordinate descent method....


Proceedings ArticleDOI
10 Dec 2012
TL;DR: Coordinate descent based methods are shown to have a more efficient update rule than ALS and faster, more stable convergence than SGD; the proposed CCD++ algorithm is empirically shown to be much faster than both in multi-core and distributed settings.
Abstract: Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
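As a rough illustration of the rank-one coordinate descent idea described above, here is a dense NumPy sketch in the spirit of CCD++ (not the authors' code): each rank-one factor pair is refined by closed-form single-variable ridge updates against the current residual over observed entries. The function name, random initialization, and iteration counts are assumptions for illustration; the real implementation works on sparse data structures and parallelizes the updates.

```python
import numpy as np

def ccd_pp_sketch(A, mask, k=10, lam=0.1, outer_iters=10, inner_iters=3):
    """Dense sketch of rank-one coordinate descent for matrix factorization
    with missing values:
        min_{W,H} sum_{(i,j) observed} (A_ij - W_i . H_j)^2
                  + lam * (||W||_F^2 + ||H||_F^2)
    `mask` is 1 for observed entries and 0 otherwise."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((m, k)) * 0.1
    H = rng.standard_normal((n, k)) * 0.1
    R = mask * (A - W @ H.T)                    # residual on observed entries

    for _ in range(outer_iters):
        for t in range(k):                      # one rank-one factor at a time
            u, v = W[:, t].copy(), H[:, t].copy()
            Rt = R + mask * np.outer(u, v)      # add factor t back into residual
            for _ in range(inner_iters):
                # closed-form single-variable (ridge) updates
                u = (Rt * mask) @ v / (lam + mask @ (v * v))
                v = (Rt * mask).T @ u / (lam + mask.T @ (u * u))
            R = Rt - mask * np.outer(u, v)
            W[:, t], H[:, t] = u, v
    return W, H

# Toy usage with 30% of entries observed.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 40))
mask = (rng.random((50, 40)) < 0.3).astype(float)
W, H = ccd_pp_sketch(A, mask, k=5)
```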

308 citations


Cites methods from "Dual coordinate descent methods for..."

  • ...Recently it has been successfully applied to various large-scale problems such as linear SVMs [14], maximum entropy models [15], NMF problems [9], [10], and sparse inverse covariance estimation...


Journal ArticleDOI
03 Apr 2012
TL;DR: A comprehensive survey is given of recent developments in linear classification, covering efficient optimization methods for constructing linear classifiers and their application to large-scale problems.
Abstract: Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has been shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and applied them to some large-scale applications. In this paper, we give a comprehensive survey on the recent development of this active research area.

303 citations


Cites methods from "Dual coordinate descent methods for..."

  • ...The work in [29] follows [64] to apply a two-level coordinate descent method, but uses a different method in the second level to decide variables for update....


  • ...One example is a coordinate descent method (LIBLINEAR [29])....


Journal Article
TL;DR: Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.
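The non-differentiability mentioned above is typically handled inside each coordinate update by a soft-thresholding step. The sketch below shows that mechanism on an L1-regularized least-squares (Lasso) objective rather than the classification losses studied in the cited paper; the function name, data, and iteration counts are illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam=0.1, iters=100):
    """Coordinate descent for  min_w 0.5*||y - Xw||^2 + lam*||w||_1.
    Each coordinate's one-dimensional sub-problem is solved exactly by
    soft-thresholding, which handles the non-differentiable |w_j| term."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                        # residual, kept up to date
    col_sq = (X * X).sum(axis=0)         # precomputed ||x_j||^2
    for _ in range(iters):
        for j in range(d):
            rho = X[:, j] @ r + col_sq[j] * w[j]      # correlation with partial residual
            w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (w[j] - w_new)             # incremental residual update
            w[j] = w_new
    return w

# Toy usage: sparse ground truth, small noise.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 30))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(200)
w = lasso_cd(X, y, lam=5.0)
```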

273 citations


Cites background from "Dual coordinate descent methods for..."

  • ...Yu et al. (2010) proposed a quasi-Newton approach to solve non-smooth convex optimization problems....


Proceedings Article
01 Jan 2013
TL;DR: This study presents a novel analysis using a real-world data set of user-generated online reviews, in which it identifies the student, course, platform, and university characteristics that affect student retention and estimates their relative effect.
Abstract: Massive Open Online Courses (MOOCs) have experienced rapid expansion and gained significant popularity among students and educators. Despite the broad acceptance of MOOCs, there is still a long way to go in terms of satisfying students' needs, as witnessed by the extremely high drop-out rates. Working toward improving MOOCs, we employ the Grounded Theory Method (GTM) in a quantitative study and explore this new phenomenon. In particular, we present a novel analysis using a real-world data set with user-generated online reviews, where we both identify the student, course, platform, and university characteristics that affect student retention and estimate their relative effect. In the conducted analysis, we integrate econometric, text mining, opinion mining, and machine learning techniques, building both explanatory and predictive models, toward a more complete analysis. This study also provides actionable insights for MOOCs and education in general, and contributes new findings to the related literature.

257 citations


Cites result from "Dual coordinate descent methods for..."

  • ...Other models that were tested and yielded similar results include decision trees (Breiman et al. 1984), one vs. one multiclass strategy with support vector classification (Chang and Lin 2011), and one vs. all multiclass strategy with logistic regression (Yu et al. 2011) and ridge regression....


References
Book
01 Jan 1995

12,671 citations


Additional excerpts

  • ...1 by Bertsekas (1999), which gives the convergence of coordinate descent methods for the following problem: min D(α) subject to α ∈ A_1 × · · · × A_l, (75)...


Journal ArticleDOI
TL;DR: Numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir and better able to use additional storage to accelerate convergence; its convergence properties are also studied, and global convergence is proved on uniformly convex problems.
Abstract: We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.
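As a usage-level illustration (not the authors' original code), SciPy's L-BFGS-B routine implements the same limited-memory quasi-Newton idea; the sketch below minimizes an L2-regularized logistic loss with it, where the maxcor option plays the role of the limited-memory parameter. The data and the objective are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

def f_and_grad(w, X, y, C):
    """L2-regularized logistic loss and its gradient (labels are +1/-1)."""
    z = y * (X @ w)
    loss = 0.5 * w @ w + C * np.sum(np.logaddexp(0.0, -z))
    sigma = 1.0 / (1.0 + np.exp(z))              # = d/dz of -log(1 + e^{-z})... magnitude term
    grad = w - C * X.T @ (y * sigma)
    return loss, grad

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(rng.standard_normal(200))

# jac=True tells SciPy the callable returns (value, gradient) together.
res = minimize(f_and_grad, x0=np.zeros(5), args=(X, y, 1.0),
               jac=True, method="L-BFGS-B", options={"maxcor": 10})
print(res.x)
```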

7,004 citations

Journal ArticleDOI
TL;DR: Decomposition implementations for two "all-together" multiclass SVM methods are given, and it is shown that, for large problems, methods that consider all data at once generally need fewer support vectors.
Abstract: Support vector machines (SVMs) were originally designed for binary classification. How to effectively extend them to multiclass classification is still an ongoing research issue. Several methods have been proposed where typically we construct a multiclass classifier by combining several binary classifiers. Some authors also proposed methods that consider all classes at once. As it is computationally more expensive to solve multiclass problems, comparisons of these methods using large-scale problems have not been seriously conducted. Especially for methods solving multiclass SVM in one step, a much larger optimization problem is required, so up to now experiments have been limited to small data sets. In this paper we give decomposition implementations for two such "all-together" methods. We then compare their performance with three methods based on binary classifications: "one-against-all," "one-against-one," and directed acyclic graph SVM (DAGSVM). Our experiments indicate that the "one-against-one" and DAG methods are more suitable for practical use than the other methods. Results also show that, for large problems, methods that consider all data at once generally need fewer support vectors.
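The binary-combination strategies compared here ("one-against-one" and "one-against-all") are available as generic wrappers in scikit-learn; a minimal sketch, assuming the iris data merely as a stand-in dataset (DAGSVM and the "all-together" formulations are not provided by that library):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# "one-against-one": one binary SVM per pair of classes.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
# "one-against-all": one binary SVM per class vs. the rest.
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

print(ovo.score(X, y), ovr.score(X, y))
```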

6,562 citations


"Dual coordinate descent methods for..." refers methods in this paper

  • ...Therefore, we follow Memisevic (2006) and earlier SVM works (Crammer and Singer 2000; Hsu and Lin 2002; Keerthi et al. 2008) to consider variables associated with an xi as a block....


Proceedings ArticleDOI
08 Feb 1999
TL;DR: An edited volume on support vector learning, with chapters covering theory, implementations, applications (such as time-series prediction and pairwise classification), and extensions of the SVM algorithm.
Abstract: Introduction to support vector learning roadmap. Part 1 Theory: three remarks on the support vector method of function estimation, Vladimir Vapnik; generalization performance of support vector machines and other pattern classifiers, Peter Bartlett and John Shawe-Taylor; Bayesian voting schemes and large margin classifiers, Nello Cristianini and John Shawe-Taylor; support vector machines, reproducing kernel Hilbert spaces, and randomized GACV, Grace Wahba; geometry and invariance in kernel based methods, Christopher J.C. Burges; on the annealed VC entropy for margin classifiers - a statistical mechanics study, Manfred Opper; entropy numbers, operators and support vector kernels, Robert C. Williamson et al. Part 2 Implementations: solving the quadratic programming problem arising in support vector classification, Linda Kaufman; making large-scale support vector machine learning practical, Thorsten Joachims; fast training of support vector machines using sequential minimal optimization, John C. Platt. Part 3 Applications: support vector machines for dynamic reconstruction of a chaotic system, Davide Mattera and Simon Haykin; using support vector machines for time series prediction, Klaus-Robert Muller et al; pairwise classification and support vector machines, Ulrich Kressel. Part 4 Extensions of the algorithm: reducing the run-time complexity in support vector machines, Edgar E. Osuna and Federico Girosi; support vector regression with ANOVA decomposition kernels, Mark O. Stitson et al; support vector density estimation, Jason Weston et al; combining support vector and mathematical programming methods for classification, Bernhard Scholkopf et al.

5,506 citations

Posted ContentDOI
TL;DR: SVM light, as discussed by the authors, is an implementation of an SVM learner that addresses large learning tasks with many training examples, making large-scale SVM learning more practical.
Abstract: Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to be considered in designing an SVM learner. In particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SVM light is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SVM light V2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains.
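For reference, the quadratic program referred to in this abstract is the standard soft-margin SVM dual; a sketch of it in LaTeX notation, with the usual symbols l (number of training examples), C (regularization parameter), and K (kernel function):

```latex
% Bound constraints on each alpha_i plus a single linear equality constraint.
\begin{aligned}
\max_{\alpha}\;\; & \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l}\sum_{j=1}^{l}
    \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \\
\text{subject to}\;\; & 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \\
& \sum_{i=1}^{l} y_i \alpha_i = 0 .
\end{aligned}
```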

4,145 citations