Journal ArticleDOI

Dual coordinate descent methods for logistic regression and maximum entropy models

01 Oct 2011 - Machine Learning (Springer US) - Vol. 85, Iss. 1, pp. 41-75
TL;DR: This paper applies coordinate descent methods to solve the dual form of logistic regression and maximum entropy, and shows that many details are different from the situation in linear SVM.
Abstract: Most optimization methods for logistic regression or maximum entropy solve the primal problem. They include iterative scaling, coordinate descent, quasi-Newton, and truncated Newton methods. Less effort has been made to solve the dual problem. In contrast, for linear support vector machines (SVM), dual methods have been shown to be very effective. In this paper, we apply coordinate descent methods to solve the dual form of logistic regression and maximum entropy. Interestingly, many details are different from the situation in linear SVM. We carefully study the theoretical convergence as well as numerical issues. The proposed method is shown to be faster than most state-of-the-art methods for training logistic regression and maximum entropy.
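To make the dual approach concrete, here is a minimal NumPy sketch (not the authors' implementation) of coordinate descent on the dual of L2-regularized logistic regression: each dual variable is refined by a few clipped Newton steps on its one-variable sub-problem while the primal vector w is maintained incrementally. The function name dual_cd_logreg, the C/2 initialization, and the fixed iteration counts are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def dual_cd_logreg(X, y, C=1.0, outer_iters=50, newton_iters=20, eps=1e-12):
    """Coordinate descent on the dual of L2-regularized logistic regression:
        min_a  0.5 * a'Qa + sum_i [a_i log a_i + (C - a_i) log(C - a_i)]
        s.t.   0 < a_i < C,  with  Q_ij = y_i y_j x_i'x_j.
    The primal solution is recovered as w = sum_i a_i y_i x_i."""
    n, _ = X.shape
    alpha = np.full(n, 0.5 * C)            # start strictly inside (0, C)
    w = X.T @ (alpha * y)                  # maintain w = sum_i a_i y_i x_i
    Qii = np.einsum("ij,ij->i", X, X)      # Q_ii = ||x_i||^2

    for _ in range(outer_iters):
        for i in np.random.permutation(n):
            # One-variable sub-problem in z = alpha_i (constants dropped):
            #   g(z) = 0.5*Qii*z^2 + b*z + z*log z + (C - z)*log(C - z)
            b = y[i] * (X[i] @ w) - Qii[i] * alpha[i]
            z = alpha[i]
            for _ in range(newton_iters):          # clipped Newton on g'(z) = 0
                g1 = Qii[i] * z + b + np.log(z / (C - z))
                if abs(g1) < 1e-8:
                    break
                g2 = Qii[i] + C / (z * (C - z))    # g''(z) > 0
                z = np.clip(z - g1 / g2, eps, C - eps)
            w += (z - alpha[i]) * y[i] * X[i]
            alpha[i] = z
    return w

# Toy usage on synthetic data; labels must be +1/-1.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.1)
w = dual_cd_logreg(X, y)
```

The paper's actual algorithm uses a more carefully safeguarded modified Newton step; the hard clipping to (eps, C - eps) above is only a simplification that keeps the logarithmic terms finite.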


Citations
Journal Article
TL;DR: LIBLINEAR is an open source library for large-scale linear classification that supports logistic regression and linear support vector machines and provides easy-to-use command-line tools and library calls for users and developers.
Abstract: LIBLINEAR is an open source library for large-scale linear classification. It supports logistic regression and linear support vector machines. We provide easy-to-use command-line tools and library calls for users and developers. Comprehensive documents are available for both beginners and advanced users. Experiments demonstrate that LIBLINEAR is very efficient on large sparse data sets.
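For readers who want to try the dual logistic regression solver without the native LIBLINEAR command-line tools, scikit-learn's 'liblinear' solver wraps the same library; a minimal usage sketch, with an arbitrary synthetic dataset and default parameters chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# dual=True selects the dual formulation of L2-regularized logistic regression,
# which is only supported by the liblinear back end.
clf = LogisticRegression(penalty="l2", dual=True, solver="liblinear", C=1.0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```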

7,848 citations


Cites methods from "Dual coordinate descent methods for..."

  • ...L2-regularized Logistic Regression (Solving Dual) See Yu et al. (2011) for details of a dual coordinate descent method....


  • ...See Yu et al. (2011) for details of a dual coordinate descent method....


  • ...Appendix I. L2-regularized Logistic Regression (Solving Dual) See Yu et al. (2011) for details of a dual coordinate descent method....


Proceedings ArticleDOI
10 Dec 2012
TL;DR: Coordinate descent based methods are shown to have a more efficient update rule than ALS and faster, more stable convergence than SGD; the proposed CCD++ algorithm is empirically shown to be much faster than both in multi-core and distributed settings.
Abstract: Matrix factorization, when the matrix has missing values, has become one of the leading techniques for recommender systems. To handle web-scale datasets with millions of users and billions of ratings, scalability becomes an important issue. Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) are two popular approaches to compute matrix factorization. There has been a recent flurry of activity to parallelize these algorithms. However, due to the cubic time complexity in the target rank, ALS is not scalable to large-scale datasets. On the other hand, SGD conducts efficient updates but usually suffers from slow convergence that is sensitive to the parameters. Coordinate descent, a classical optimization approach, has been used for many other large-scale problems, but its application to matrix factorization for recommender systems has not been explored thoroughly. In this paper, we show that coordinate descent based methods have a more efficient update rule compared to ALS, and are faster and have more stable convergence than SGD. We study different update sequences and propose the CCD++ algorithm, which updates rank-one factors one by one. In addition, CCD++ can be easily parallelized on both multi-core and distributed systems. We empirically show that CCD++ is much faster than ALS and SGD in both settings. As an example, on a synthetic dataset with 2 billion ratings, CCD++ is 4 times faster than both SGD and ALS using a distributed system with 20 machines.
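As a rough illustration of the rank-one coordinate descent idea described above, here is a dense NumPy sketch in the spirit of CCD++ (not the authors' code): each rank-one factor pair is refined by closed-form single-variable ridge updates against the current residual over observed entries. The function name, random initialization, and iteration counts are assumptions for illustration; the real implementation works on sparse data structures and parallelizes the updates.

```python
import numpy as np

def ccd_pp_sketch(A, mask, k=10, lam=0.1, outer_iters=10, inner_iters=3):
    """Dense sketch of rank-one coordinate descent for matrix factorization
    with missing values:
        min_{W,H} sum_{(i,j) observed} (A_ij - W_i . H_j)^2
                  + lam * (||W||_F^2 + ||H||_F^2)
    `mask` is 1 for observed entries and 0 otherwise."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((m, k)) * 0.1
    H = rng.standard_normal((n, k)) * 0.1
    R = mask * (A - W @ H.T)                    # residual on observed entries

    for _ in range(outer_iters):
        for t in range(k):                      # one rank-one factor at a time
            u, v = W[:, t].copy(), H[:, t].copy()
            Rt = R + mask * np.outer(u, v)      # add factor t back into residual
            for _ in range(inner_iters):
                # closed-form single-variable (ridge) updates
                u = (Rt * mask) @ v / (lam + mask @ (v * v))
                v = (Rt * mask).T @ u / (lam + mask.T @ (u * u))
            R = Rt - mask * np.outer(u, v)
            W[:, t], H[:, t] = u, v
    return W, H

# Toy usage with 30% of entries observed.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 40))
mask = (rng.random((50, 40)) < 0.3).astype(float)
W, H = ccd_pp_sketch(A, mask, k=5)
```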

308 citations


Cites methods from "Dual coordinate descent methods for..."

  • ...Recently it has been successfully applied to various large-scale problems such as linear SVMs [14], maximum entropy models [15], NMF problems [9], [10], and sparse inverse covariance estimation...


Journal ArticleDOI
03 Apr 2012
TL;DR: A comprehensive survey is given of recent developments in linear classification, covering efficient optimization methods for constructing linear classifiers and their application to large-scale problems.
Abstract: Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has been shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and applied them to some large-scale applications. In this paper, we give a comprehensive survey on the recent development of this active research area.

303 citations


Cites methods from "Dual coordinate descent methods for..."

  • ...The work in [29] follows [64] to apply a two-level coordinate descent method, but uses a different method in the second level to decide variables for update....


  • ...One example is a coordinate descent method (LIBLINEAR [29])....


Journal Article
TL;DR: Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.
Abstract: Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.
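The non-differentiability mentioned above is typically handled inside each coordinate update by a soft-thresholding step. The sketch below shows that mechanism on an L1-regularized least-squares (Lasso) objective rather than the classification losses studied in the cited paper; the function name, data, and iteration counts are illustrative assumptions.

```python
import numpy as np

def lasso_cd(X, y, lam=0.1, iters=100):
    """Coordinate descent for  min_w 0.5*||y - Xw||^2 + lam*||w||_1.
    Each coordinate's one-dimensional sub-problem is solved exactly by
    soft-thresholding, which handles the non-differentiable |w_j| term."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                        # residual, kept up to date
    col_sq = (X * X).sum(axis=0)         # precomputed ||x_j||^2
    for _ in range(iters):
        for j in range(d):
            rho = X[:, j] @ r + col_sq[j] * w[j]      # correlation with partial residual
            w_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (w[j] - w_new)             # incremental residual update
            w[j] = w_new
    return w

# Toy usage: sparse ground truth, small noise.
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 30))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(200)
w = lasso_cd(X, y, lam=5.0)
```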

273 citations


Cites background from "Dual coordinate descent methods for..."

  • ...Yu et al. (2010) proposed a quasi-Newton approach to solve non-smooth convex optimization problems....


Proceedings Article
01 Jan 2013
TL;DR: This study presents a novel analysis using a real-world data set of user-generated online reviews, in which it identifies the student, course, platform, and university characteristics that affect student retention and estimates their relative effect.
Abstract: Massive Open Online Courses (MOOCs) have experienced rapid expansion and gained significant popularity among students and educators. Despite the broad acceptance of MOOCs, there is still a long way to go in terms of satisfying students' needs, as witnessed by the extremely high drop-out rates. Working toward improving MOOCs, we employ the Grounded Theory Method (GTM) in a quantitative study and explore this new phenomenon. In particular, we present a novel analysis using a real-world data set with user-generated online reviews, where we both identify the student, course, platform, and university characteristics that affect student retention and estimate their relative effect. In the conducted analysis, we integrate econometric, text mining, opinion mining, and machine learning techniques, building both explanatory and predictive models, toward a more complete analysis. This study also provides actionable insights for MOOCs and education in general, and contributes new findings to the related literature.

257 citations


Cites result from "Dual coordinate descent methods for..."

  • ...Other models that were tested and yielded similar results include decision trees (Breiman et al. 1984), one vs. one multiclass strategy with support vector classification (Chang and Lin 2011), and one vs. all multiclass strategy with logistic regression (Yu et al. 2011) and ridge regression....


References
Book
01 Jan 1995

12,671 citations


Additional excerpts

  • ...1 by Bertsekas (1999), which gives the convergence of coordinate descent methods for the following problem: min D(α) subject to α ∈ A_1 × · · · × A_l, (75)...


Journal ArticleDOI
TL;DR: Numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir and better able to use additional storage to accelerate convergence; its convergence properties are also studied, and global convergence is proved on uniformly convex problems.
Abstract: We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.
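As a usage-level illustration (not the authors' original code), SciPy's L-BFGS-B routine implements the same limited-memory quasi-Newton idea; the sketch below minimizes an L2-regularized logistic loss with it, where the maxcor option plays the role of the limited-memory parameter. The data and the objective are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

def f_and_grad(w, X, y, C):
    """L2-regularized logistic loss and its gradient (labels are +1/-1)."""
    z = y * (X @ w)
    loss = 0.5 * w @ w + C * np.sum(np.logaddexp(0.0, -z))
    sigma = 1.0 / (1.0 + np.exp(z))              # = d/dz of -log(1 + e^{-z})... magnitude term
    grad = w - C * X.T @ (y * sigma)
    return loss, grad

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(rng.standard_normal(200))

# jac=True tells SciPy the callable returns (value, gradient) together.
res = minimize(f_and_grad, x0=np.zeros(5), args=(X, y, 1.0),
               jac=True, method="L-BFGS-B", options={"maxcor": 10})
print(res.x)
```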

7,004 citations

Journal ArticleDOI
TL;DR: Decomposition implementations for two "all-together" multiclass SVM methods are given, and it is shown that, for large problems, methods that consider all data at once generally need fewer support vectors.
Abstract: Support vector machines (SVMs) were originally designed for binary classification. How to effectively extend them to multiclass classification is still an ongoing research issue. Several methods have been proposed where typically we construct a multiclass classifier by combining several binary classifiers. Some authors also proposed methods that consider all classes at once. As it is computationally more expensive to solve multiclass problems, comparisons of these methods using large-scale problems have not been seriously conducted. Especially for methods solving multiclass SVM in one step, a much larger optimization problem is required, so up to now experiments have been limited to small data sets. In this paper we give decomposition implementations for two such "all-together" methods. We then compare their performance with three methods based on binary classifications: "one-against-all," "one-against-one," and directed acyclic graph SVM (DAGSVM). Our experiments indicate that the "one-against-one" and DAG methods are more suitable for practical use than the other methods. Results also show that, for large problems, methods that consider all data at once generally need fewer support vectors.
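The binary-combination strategies compared here ("one-against-one" and "one-against-all") are available as generic wrappers in scikit-learn; a minimal sketch, assuming the iris data merely as a stand-in dataset (DAGSVM and the "all-together" formulations are not provided by that library):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# "one-against-one": one binary SVM per pair of classes.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
# "one-against-all": one binary SVM per class vs. the rest.
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

print(ovo.score(X, y), ovr.score(X, y))
```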

6,562 citations


"Dual coordinate descent methods for..." refers methods in this paper

  • ...Therefore, we follow Memisevic (2006) and earlier SVM works (Crammer and Singer 2000; Hsu and Lin 2002; Keerthi et al. 2008) to consider variables associated with an xi as a block....


Proceedings ArticleDOI
08 Feb 1999
TL;DR: An edited volume on support vector learning, with chapters covering theory, implementations, applications (such as time-series prediction and pairwise classification), and extensions of the SVM algorithm.
Abstract: Introduction to support vector learning roadmap. Part 1 Theory: three remarks on the support vector method of function estimation, Vladimir Vapnik; generalization performance of support vector machines and other pattern classifiers, Peter Bartlett and John Shawe-Taylor; Bayesian voting schemes and large margin classifiers, Nello Cristianini and John Shawe-Taylor; support vector machines, reproducing kernel Hilbert spaces, and randomized GACV, Grace Wahba; geometry and invariance in kernel based methods, Christopher J.C. Burges; on the annealed VC entropy for margin classifiers - a statistical mechanics study, Manfred Opper; entropy numbers, operators and support vector kernels, Robert C. Williamson et al. Part 2 Implementations: solving the quadratic programming problem arising in support vector classification, Linda Kaufman; making large-scale support vector machine learning practical, Thorsten Joachims; fast training of support vector machines using sequential minimal optimization, John C. Platt. Part 3 Applications: support vector machines for dynamic reconstruction of a chaotic system, Davide Mattera and Simon Haykin; using support vector machines for time series prediction, Klaus-Robert Muller et al; pairwise classification and support vector machines, Ulrich Kressel. Part 4 Extensions of the algorithm: reducing the run-time complexity in support vector machines, Edgar E. Osuna and Federico Girosi; support vector regression with ANOVA decomposition kernels, Mark O. Stitson et al; support vector density estimation, Jason Weston et al; combining support vector and mathematical programming methods for classification, Bernhard Scholkopf et al.

5,506 citations

Posted ContentDOI
TL;DR: SVM light, as discussed by the authors, is an implementation of an SVM learner that addresses large learning tasks with many training examples, making large-scale SVM learning more practical.
Abstract: Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to be considered in designing an SVM learner. In particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SVM light is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SVM light V2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains.
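For reference, the quadratic program referred to in this abstract is the standard soft-margin SVM dual; a sketch of it in LaTeX notation, with the usual symbols l (number of training examples), C (regularization parameter), and K (kernel function):

```latex
% Bound constraints on each alpha_i plus a single linear equality constraint.
\begin{aligned}
\max_{\alpha}\;\; & \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{l}\sum_{j=1}^{l}
    \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \\
\text{subject to}\;\; & 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \\
& \sum_{i=1}^{l} y_i \alpha_i = 0 .
\end{aligned}
```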

4,145 citations