Uncovering shared structures in multiclass classification
References
Boyd, S., & Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Crammer, K., & Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2001.
Zhang, J., Marszałek, M., Lazebnik, S., & Schmid, C. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 2007.
Argyriou, A., Evgeniou, T., & Pontil, M. Multi-task feature learning. NIPS, 2006.
Srebro, N., Rennie, J. D. M., & Jaakkola, T. Maximum-margin matrix factorization. NIPS, 2004.
Frequently Asked Questions (12)
Q2. How many sets were used to select the optimal value of C?
The data was partitioned into three sets: 1,000 examples were used for training, 500 were held out to select the optimal value of C, and 500 were used as a test set.
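A minimal sketch of this protocol, assuming a linear multiclass SVM from scikit-learn as a stand-in for the authors' learner; the 1000/500/500 split follows the text, while the synthetic data and the grid of C values are purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 2000 examples, split 1000/500/500 as in the text.
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=20, n_classes=4, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_val, y_val = X[1000:1500], y[1000:1500]
X_test, y_test = X[1500:], y[1500:]

# Choose the trade-off parameter C on the held-out validation set.
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LinearSVC(C=C).fit(X_train, y_train)
    acc = clf.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

# Report final accuracy on the untouched test set.
final = LinearSVC(C=best_C).fit(X_train, y_train)
print(f"best C = {best_C}, test accuracy = {final.score(X_test, y_test):.3f}")
```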
Q3. What is the main reason for the success of large-margin methods?
Much of the success of large-margin methods stems from the fact that one does not need access to the feature representation itself, but only to the inner products between feature vectors, specified by a kernel function k(x, x′).
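To illustrate, the Gram matrix below is the only object a kernelized learner needs; the explicit feature map never appears. The RBF kernel and its bandwidth are arbitrary stand-ins, not choices made in the paper:

```python
import numpy as np

def rbf_kernel(X, Xp, gamma=0.5):
    """Gram matrix K[i, j] = k(x_i, x'_j) = exp(-gamma * ||x_i - x'_j||^2)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Xp**2, axis=1)[None, :]
                - 2.0 * X @ Xp.T)
    return np.exp(-gamma * sq_dists)

X = np.random.randn(5, 3)
K = rbf_kernel(X, X)   # 5x5 matrix of inner products in feature space
```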
Q4. What is the general family of Eq. 16?
Their results on dualization, kernelization, and representation of the learned latent feature space apply also to the multi-task setting studied by Argyriou et al., as well as to the general family of Eq. (16).
Q5. How do the authors obtain a smooth approximation to the trace-norm?
In order to obtain a smooth approximation to the trace-norm, the authors replace the non-smooth absolute value with a smooth function $g$ defined as
$$g(\gamma) = \begin{cases} \dfrac{\gamma^2}{2r} + \dfrac{r}{2} & |\gamma| \le r \\ |\gamma| & \text{otherwise.} \end{cases}$$
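A direct transcription of g in NumPy, applied to the singular values to give a smooth surrogate for the trace-norm; the smoothing radius r is a free parameter here, not a value fixed by the paper:

```python
import numpy as np

def g(gamma, r):
    """Huber-style smoothing of |gamma|: quadratic inside [-r, r], |gamma| outside."""
    gamma = np.asarray(gamma, dtype=float)
    return np.where(np.abs(gamma) <= r,
                    gamma**2 / (2.0 * r) + r / 2.0,
                    np.abs(gamma))

def smooth_trace_norm(W, r=0.1):
    """Smooth approximation to ||W||_* (the sum of the singular values of W)."""
    return g(np.linalg.svd(W, compute_uv=False), r).sum()
```

Note that g is continuous and differentiable at |γ| = r: both branches evaluate to r there, with matching slope.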
Q6. What is the loss function for a classifier?
The loss function suggested by Crammer et al. is the maximal hinge loss over all comparisons between the correct class and an incorrect class:
$$\ell(W; (x, y)) = \max_{y' \neq y} \left[ 1 + W_{y'}^{t} \cdot x - W_{y}^{t} \cdot x \right]_{+} \qquad (3)$$
where $[z]_+ = \max(0, z)$.
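A sketch of this loss for a single example, assuming W holds one weight column per class as in Eq. (3):

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """Crammer-Singer loss: max over y' != y of [1 + W_{y'}^t x - W_y^t x]_+."""
    scores = W.T @ x                    # one score per class
    margins = 1.0 + scores - scores[y]  # margin violation vs. the true class
    margins[y] = 0.0                    # exclude the y' == y comparison
    return max(0.0, margins.max())
```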
Q7. What is the spectral norm of XQ?
By applying standard Lagrange duality, the authors deduce that the dual of Eq. (9) is given by the following optimization problem, which may also be written as a semi-definite program:
$$\begin{aligned} \max \;& \sum_i (-Q_{i y_i}) \\ \text{s.t.}\;& Q_{ij} \ge 0 \quad \forall i,\; j \neq y_i \\ & (-Q_{i y_i}) = \sum_{j \neq y_i} Q_{ij} \le c \quad \forall i \\ & \|XQ\|_2 \le 1 \end{aligned}$$
where $Q \in \mathbb{R}^{n \times k}$ denotes the dual Lagrange variable and $\|XQ\|_2$ is the spectral norm of $XQ$ (i.e., the maximal singular value of this matrix).
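The spectral-norm constraint is easy to check numerically. The sketch below assumes X ∈ R^{d×n} stacks the training examples as columns and Q is the n×k dual variable (the shapes are hypothetical); the two computations of the norm agree:

```python
import numpy as np

d, n, k = 10, 50, 4
X = np.random.randn(d, n)   # training examples as columns
Q = np.random.randn(n, k)   # dual Lagrange variable

# Spectral norm = largest singular value of XQ; two equivalent computations.
s_max = np.linalg.svd(X @ Q, compute_uv=False)[0]
assert np.isclose(s_max, np.linalg.norm(X @ Q, 2))
```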
Q8. How can the optimization problem of Eq. (9) be solved?
The optimization problem of Eq. (9) can be formulated as a semi-definite program (SDP), and off-the-shelf SDP solvers can be used to recover the optimal W.
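For small problems, a trace-norm-regularized objective of this kind can be posed directly in CVXPY, which hands it to an off-the-shelf conic solver. This is only a sketch of such a primal, using a sum of hinge terms rather than the paper's max, and is not the authors' exact SDP formulation:

```python
import cvxpy as cp
import numpy as np

n, d, k = 100, 20, 5                 # examples, features, classes (toy sizes)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
y = rng.integers(k, size=n)
Y = np.eye(k)[y]                     # one-hot labels, n x k
c = 1.0                              # loss / regularization trade-off

W = cp.Variable((d, k))
scores = X @ W                                                   # n x k scores
true_scores = cp.reshape(cp.sum(cp.multiply(scores, Y), axis=1), (n, 1))
margins = 1.0 + scores - true_scores @ np.ones((1, k))
# Hinge over wrong-class comparisons only (masking out the true class).
hinge = cp.sum(cp.pos(cp.multiply(margins, 1.0 - Y)))

prob = cp.Problem(cp.Minimize(cp.norm(W, "nuc") + c * hinge))
prob.solve()
print("optimal objective:", prob.value)
```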
Q9. What is the optimum of Eq. (9)?
Corollary 1 ensures that the optimum of Eq. (9) is of the form W = Xα, so the authors can substitute W = Xα into Eq. (14) and minimize over α.
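This representer-style substitution means predictions depend on the training data only through inner products. A small numeric check of the identity, with hypothetical shapes (X stores training points as columns, α is the n×k coefficient matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, m = 8, 30, 4, 5
X = rng.standard_normal((d, n))     # training points as columns
Xt = rng.standard_normal((d, m))    # test points as columns
alpha = rng.standard_normal((n, k)) # coefficients from the kernelized problem

W = X @ alpha                       # W = X alpha, the form given by Corollary 1
scores_primal = Xt.T @ W            # m x k scores via explicit W
scores_kernel = (Xt.T @ X) @ alpha  # same scores via kernel values K(test, train)
assert np.allclose(scores_primal, scores_kernel)
```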
Q10. What is the purpose of this paper?
Although this paper studies multi-class learning, the technical contributions, including the optimization approach, the study of the dual problem, and the kernelization, apply equally well to the multi-task formulation of Eq. (11).
Q11. What is the reason the trace-norm is nondifferentiable?
Although the singular values are non-negative, the absolute value in Eq. (8) highlights why the trace-norm is non-differentiable: when a singular value reaches zero, the corresponding singular vector can abruptly change direction.
Q12. What is the effect of the learning rule for multi-class learning?
The authors studied a learning rule for multi-class learning in which the magnitude of a factorization of the weight matrix is regularized, rather than the magnitude of the weights themselves.
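Behind this rule is the variational characterization of the trace-norm, $\|W\|_* = \min_{W = UV^t} \frac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right)$, so penalizing the magnitude of the factors is equivalent to trace-norm regularization of W. A quick numeric check, using the SVD-balanced factorization that attains the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# Balanced factorization W = U V^t built from the SVD W = U_ diag(s) Vt.
U_, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_ * np.sqrt(s)        # scale each left singular vector by sqrt(s_j)
V = Vt.T * np.sqrt(s)      # scale each right singular vector by sqrt(s_j)

trace_norm = s.sum()
factor_cost = 0.5 * (np.linalg.norm(U, "fro")**2 + np.linalg.norm(V, "fro")**2)
assert np.allclose(W, U @ V.T)
assert np.allclose(trace_norm, factor_cost)
```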