A dual coordinate descent method for large-scale linear SVM
Frequently Asked Questions (19)
Q2. What is the common method for training a nonlinear SVM?
Because the kernel matrix is dense and cannot be stored in computer memory, decomposition methods solve a sub-problem of only a few variables at each iteration.
Q3. What are the common loss functions for SVM?
Two common loss functions are $\max(1 - y_i w^T x_i,\ 0)$ and $\max(1 - y_i w^T x_i,\ 0)^2$ (2). The former is called L1-SVM, while the latter is L2-SVM.
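The difference is easy to see in code. Below is a minimal NumPy sketch (dense data and the name `hinge_losses` are illustrative assumptions; the paper itself works with sparse instances):

```python
import numpy as np

def hinge_losses(w, X, y):
    """Per-instance losses of eq. (2) for labels y in {-1, +1}."""
    margins = np.maximum(1.0 - y * (X @ w), 0.0)
    return margins, margins ** 2  # L1-SVM loss, L2-SVM loss
```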
Q4. What is the cost of calculating $w^T x_i$?
For a linear SVM, the authors can define $w = \sum_{j=1}^{l} y_j \alpha_j x_j$ (11), so (10) becomes $\nabla_i f(\alpha) = y_i w^T x_i - 1 + D_{ii} \alpha_i$ (12). To evaluate (12), the main cost is $O(\bar{n})$ for calculating $w^T x_i$.
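A sketch of this $O(\bar{n})$ evaluation, assuming $x_i$ is stored as parallel index/value arrays (a hypothetical sparse layout, not the paper's code):

```python
def gradient_i(w, x_idx, x_val, y_i, alpha_i, D_ii):
    """Evaluate (12); only the nonzeros of x_i are touched, so the
    dot product w^T x_i costs O(n-bar)."""
    wTx = sum(w[j] * v for j, v in zip(x_idx, x_val))
    return y_i * wTx - 1.0 + D_ii * alpha_i
```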
Q5. What is the optimality condition of bound-constrained problems?
From the optimality condition of bound-constrained problems, $\alpha$ is optimal for (4) if and only if the projected gradient is zero, i.e., $\nabla_i^P f(\alpha) = 0$ for all $i$, where $\nabla_i^P f(\alpha)$ equals $\nabla_i f(\alpha)$ if $0 < \alpha_i < U$, $\min(0, \nabla_i f(\alpha))$ if $\alpha_i = 0$, and $\max(0, \nabla_i f(\alpha))$ if $\alpha_i = U$. Algorithm 2 (coordinate descent with randomly selecting one instance at a time) is given $\alpha$ and the corresponding $w = \sum_i y_i \alpha_i x_i$.
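The case analysis above translates directly into a small helper (a sketch; the function name is illustrative):

```python
def projected_gradient(grad_i, alpha_i, U):
    """Projected gradient for the box 0 <= alpha_i <= U; alpha is
    optimal for (4) exactly when this is 0 for every i."""
    if alpha_i == 0.0:
        return min(0.0, grad_i)  # can only increase alpha_i
    if alpha_i == U:
        return max(0.0, grad_i)  # can only decrease alpha_i
    return grad_i                # interior point: plain gradient
```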
Q6. How fast is the solution for L1- and L2-SVM?
The reference solutions of L1- and L2-SVM are obtained by running DCDL1 and DCDL2, respectively, until the duality gaps are less than $10^{-6}$.
Q7. What is the advantage of using the whole gradient?
As using the whole gradient implies fewer iterations (i.e., faster convergence, because one can always update the variable that most violates optimality), one should take advantage of it.
Q8. What is the cost of calculating the ith row of the kernel matrix?
If $\bar{n}$ is the average number of nonzero elements per instance, and $O(\bar{n})$ is needed for each kernel evaluation, then calculating the $i$th row of the kernel matrix takes $O(l\bar{n})$.
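For a linear kernel and data in SciPy's CSR format (an assumption for this sketch), that row is just $l$ sparse dot products:

```python
def kernel_row_linear(X, i):
    """i-th row of the linear kernel matrix K = X X^T, where X is an
    l-by-n scipy.sparse.csr_matrix: l dot products at O(n-bar) each,
    O(l * n-bar) in total."""
    return (X @ X[i].T).toarray().ravel()
```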
Q9. What is the cost of calculating w?
To evaluate $\nabla_i f(\alpha^{k,i})$, the authors use $\nabla_i f(\alpha) = (\bar{Q}\alpha)_i - 1 = \sum_{j=1}^{l} \bar{Q}_{ij} \alpha_j - 1$ (10). $\bar{Q}$ may be too large to be stored, so one calculates $\bar{Q}$'s $i$th row when doing (10).
Q10. What is the cost of maintaining the whole gradient?
After updating $\alpha_i^{k,i}$ to $\alpha_i^{k,i+1}$, the authors obtain $\bar{Q}$'s $i$th column (the same as the $i$th row due to the symmetry of $\bar{Q}$) and calculate the new whole gradient: $\nabla f(\alpha^{k,i+1}) = \nabla f(\alpha^{k,i}) + \bar{Q}_{:,i}(\alpha_i^{k,i+1} - \alpha_i^{k,i})$ (18), where $\bar{Q}_{:,i}$ is the $i$th column of $\bar{Q}$.
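Once the column is available, the maintenance step (18) is a single $O(l)$ axpy-style update (a sketch with illustrative names; `grad` and `Qbar_col_i` are NumPy arrays):

```python
def update_full_gradient(grad, Qbar_col_i, alpha_i_new, alpha_i_old):
    """Eq. (18): shift the whole gradient by the change in alpha_i
    times Q-bar's i-th column."""
    grad += (alpha_i_new - alpha_i_old) * Qbar_col_i
    return grad
```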
Q11. What is the proof of the convergence of the two algorithms?
The convergence rate is at least linear: there exist $0 < \mu < 1$ and an iteration $k_0$ such that $f(\alpha^{k+1}) - f(\alpha^*) \le \mu\,(f(\alpha^k) - f(\alpha^*))$ for all $k \ge k_0$ (14). The proof is in Appendix 7.1.
Q12. What is the way to solve a linear SVM problem?
For linear SVM, in solving the smaller problem (15), the authors still have the vector $w = \sum_{i \in A} y_i \alpha_i x_i + \sum_{i \in \bar{A}} y_i \alpha_i x_i$, though only the first part $\sum_{i \in A} y_i \alpha_i x_i$ is updated.
Q13. How do the authors update $\alpha^{k,i}$ to $\alpha^{k,i+1}$?
For updating $\alpha^{k,i}$ to $\alpha^{k,i+1}$, the authors solve the following one-variable sub-problem: $\min_d f(\alpha^{k,i} + d e_i)$ subject to $0 \le \alpha_i^k + d \le U$ (5), where $e_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$.
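Since $f$ is quadratic along $e_i$, (5) has a closed-form solution: move to the unconstrained minimizer $\alpha_i - \nabla_i f(\alpha)/\bar{Q}_{ii}$ and clip to the box $[0, U]$, which is the paper's update (9). A sketch:

```python
def solve_one_variable(alpha_i, grad_i, Qbar_ii, U):
    """Closed-form minimizer of the sub-problem (5), clipped to
    [0, U]; assumes Qbar_ii > 0 (true whenever x_i is nonzero or
    D_ii > 0)."""
    return min(max(alpha_i - grad_i / Qbar_ii, 0.0), U)
```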
Q14. What is the convergent point of $\alpha^{k,i}$?
In their iterative procedure, what the authors have are $\nabla_i f(\alpha^{k,i})$, $i = 1, \ldots, l$. Hence, at the $(k-1)$st iteration, the authors obtain $M^{k-1} \equiv \max_j \nabla_j^P f(\alpha^{k-1,j})$ and $m^{k-1} \equiv \min_j \nabla_j^P f(\alpha^{k-1,j})$.
Q15. What is the simplest way to solve a convex problem?
Their algorithm uses (12) to compute $\nabla_i f(\alpha^{k,i})$, checks the optimality of the sub-problem (5) by (7), updates $\alpha_i$ by (9), and then maintains $w$ by (13).
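Combining these steps gives a compact sketch of the main loop for L1-SVM (so $D_{ii} = 0$ and $U = C$). Dense NumPy arrays and all names here are illustrative assumptions; the paper's implementation additionally exploits sparsity and shrinking:

```python
import numpy as np

def dcd_l1_svm(X, y, C, max_iter=100, tol=1e-3, seed=0):
    """Dual coordinate descent sketch for L1-SVM; X is a dense (l, n)
    array with no all-zero rows, y is in {-1, +1}."""
    rng = np.random.default_rng(seed)
    l, n = X.shape
    alpha, w = np.zeros(l), np.zeros(n)          # w = sum_i y_i alpha_i x_i
    Qbar_diag = np.einsum('ij,ij->i', X, X)      # Q-bar_ii = x_i^T x_i
    for _ in range(max_iter):
        violation = 0.0
        for i in rng.permutation(l):
            G = y[i] * (w @ X[i]) - 1.0          # gradient, eq. (12)
            if alpha[i] == 0.0:                  # projected gradient, cf. (7)
                PG = min(G, 0.0)
            elif alpha[i] == C:
                PG = max(G, 0.0)
            else:
                PG = G
            violation = max(violation, abs(PG))
            if PG != 0.0:
                old = alpha[i]
                alpha[i] = min(max(old - G / Qbar_diag[i], 0.0), C)  # eq. (9)
                w += (alpha[i] - old) * y[i] * X[i]                  # eq. (13)
        if violation < tol:                      # stop once no variable
            break                                # violates optimality much
    return w, alpha
```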
Q16. What is the way to solve a problem with $n \ll l$?
In practice, one thus should try from a small $C$. Moreover, if $n \ll l$ and $C$ is too large, then their DCDL2 is slower than TRON or PCD (see problem a9a in Table 2, where the accuracy does not change after $C \ge 0.25$).
Q17. What is the reason for the slow convergence of DCDL1?
As discussed in Section 4.2, the learning rate of stochastic gradient descent may be the cause, but for DCDL1 the authors solve the sub-problems exactly to obtain the step size for updating $w$.
Q18. What is the convergent point of $\alpha_i^{k,i}$?
Then at each inner step of the $k$th iteration, before updating $\alpha_i^{k,i}$ to $\alpha_i^{k,i+1}$, this element is shrunken if one of the following two conditions holds: $\alpha_i^{k,i} = 0$ and $\nabla_i f(\alpha^{k,i}) > \bar{M}^{k-1}$, or $\alpha_i^{k,i} = U$ and $\nabla_i f(\alpha^{k,i}) < \bar{m}^{k-1}$, where $\bar{M}^{k-1}$ and $\bar{m}^{k-1}$ are thresholds derived from $M^{k-1}$ and $m^{k-1}$ of the previous outer iteration.
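A sketch of that test (names and threshold arguments are illustrative; $\bar{M}^{k-1}$ and $\bar{m}^{k-1}$ come from the previous outer iteration, as in the answer to Q14):

```python
def should_shrink(alpha_i, grad_i, U, M_bar, m_bar):
    """True if the i-th variable sits at a bound and its gradient
    indicates it will stay there, so it can be dropped from the
    working sub-problem."""
    return (alpha_i == 0.0 and grad_i > M_bar) or \
           (alpha_i == U and grad_i < m_bar)
```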
Q19. What are the main considerations needed in solving the single-variable sub-problem?
As the primal L2-SVM is differentiable but not twice differentiable, certain considerations are needed in solving the single-variable sub-problem.