Uncovering shared structures in multiclass classification

Abstract
We suggest a method for multi-class learning with many classes by simultaneously learning shared characteristics common to the classes, and predictors for the classes in terms of these characteristics. We cast this as a convex optimization problem, using trace-norm regularization, study gradient-based optimization both for the linear case and the kernelized setting, and show how this approach can yield improved classification accuracy.
1. Introduction
In this paper we address the question of how to utilize hid-
den structure in order to improve multiclass classification
accuracy. Our goal is to provide a mechanism for learning
the underlying characteristics that are shared between the
target classes, and to demonstrate the benefit of extracting
common characteristics. We build upon the powerful no-
tion of large margin linear classifiers, and specifically focus
on the recent extensions to multiclass settings (Crammer &
Singer, 2001).
The challenge of accurate classification of an instance into
one of a large number of target classes surfaces in many do-
mains, such as object recognition, face identification, tex-
tual topic classification, and phoneme recognition. In many
of these domains it is natural to assume that even though
there are a large number of classes (e.g. different people
in a face recognition task), classes are related and build
on some underlying common characteristics. For exam-
ple, many different mammals share characteristics such as
a striped texture or an elongated snout, and people’s faces
can be identified based on underlying characteristics such
as gender, being Caucasian, or having red hair. Recover-
ing the true underlying characteristics of a domain can sig-
nificantly reduce the effective complexity of the multiclass
problem and by that transfer knowledge between related
classes.
Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.
The obvious question that arises is how to select the feature
mapping appropriate for a given task. One method to
resolve this need is by manually designing a domain-specific
kernel (e.g. Shpigelman et al., 2002). When the
route of manual kernel design is not feasible one can at-
tempt to learn a data specific feature mapping (Crammer
et al., 2002). In practice, researchers often simply test sev-
eral of the standard kernels in order to assess which attains
better performance on a validation set. These approaches,
however, fail to provide a clear mechanism for utilizing the
existence of structures in selecting the appropriate feature
mapping. We would therefore like to find an efficient way
to learn feature mappings that capture the underlying struc-
ture of a given set of classes.
The observation that learning a hidden representation of
some shared characteristics can facilitate learning has a
long history in multiclass learning (e.g. Dekel et al.
(2004)). This notion is often termed learning-to-learn or
interclass transfer (Thrun, 1996). While some approaches
assume some information on the shared characteristics is
provided to the learner in advance (Fink & Levi, 2004; Fink
et al., 2006), others rely on various learning heuristics in
order to extract the shared features (Torralba et al., 2004).
Simultaneously learning the underlying structure between the classes and the class models is a challenging optimization task. Many of the heuristic approaches stated above aim at extracting powerful non-linear hidden characteristics. However, this goal often entails non-convex optimization tasks, prone to local minima problems. In contrast, we will focus on modeling the shared characteristics as linear transformations of the input space. Thus, our model will postulate a linear mapping of shared features, followed by a multiclass linear classifier. We will show that such models can be efficiently learned in a convex optimization scheme and that they can significantly improve the accuracy of multiclass linear classifiers, despite the fact that they are restricted to simple linear mappings of the instance space.
The rest of this paper is organized as follows. We begin by introducing our learning setting, motivating our approach and formulating the suggested learning rule (Section 2). By studying the dual of the resulting optimization problem, we show, in Section 3, how to “kernelize” our learning rule. Then, in Section 4, we discuss the learning rule in the context of learning a latent feature representation. In Section 5 we derive an optimization scheme and in Section 7 demonstrate our approach on picture classification and handwritten letter recognition tasks.
2. Formulation
The goal of multiclass classification is to learn a mapping $H : \mathcal{X} \to \mathcal{Y}$ from instances in $\mathcal{X}$ to labels in $\mathcal{Y} = \{1, \ldots, k\}$. We consider linear classifiers over $\mathcal{X} = \mathbb{R}^n$, parametrized by a weight vector $W_y \in \mathbb{R}^n$ for each class $y \in \mathcal{Y}$, and which take the form:
$$H_W(x) = \arg\max_{y \in \mathcal{Y}} W_y^t \cdot x . \tag{1}$$
We wish to learn the weights from a set of $m$ labeled training examples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, which we summarize in a matrix $X \in \mathbb{R}^{n \times m}$ whose columns are given by $x_i$.
Inspired by the large margin approach for classification, Crammer and Singer (2001) suggest learning the weights by minimizing a trade-off between an average empirical loss (to be discussed shortly) and a regularizer of the form:
$$\sum_y \|W_y\|^2 = \|W\|_F^2 \tag{2}$$
where $\|W\|_F$ is the Frobenius norm of the matrix $W$ whose columns are the vectors $W_y$. The loss function suggested by Crammer et al. is the maximal hinge loss over all comparisons between the correct class and an incorrect class:
$$\ell(W; (x, y)) = \max_{y' \neq y} \left[ 1 + W_{y'}^t \cdot x - W_y^t \cdot x \right]_+ \tag{3}$$
where $[z]_+ = \max(0, z)$. For a trade-off parameter $C$, the weights are then given by the following learning rule:
$$\min_W \; \frac{1}{2} \|W\|_F^2 + C \sum_{i=1}^m \ell(W; (x_i, y_i)) . \tag{4}$$
For a binary classification problem, $\mathcal{Y} = \{1, 2\}$, this formulation reduces to the familiar Support Vector Machine (SVM) formulation (with $W_1 = -W_2 = \frac{1}{2} w_{\mathrm{svm}}$ at the optimum, and $C$ appropriately scaled). For a larger number of classes, the formulation generalizes SVMs by requiring a margin between every pair of classes, and penalizing, for each training example, the amount by which the margin constraint is violated. Similarly to SVMs, the optimization problem Eq. (4) is convex, and by introducing a “slack variable” for each example, it can be written as a quadratic program. Crammer et al. discuss practical optimization approaches.
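As a concrete reading of Eqs. (1), (3) and (4), the following is a minimal NumPy sketch. The function names and the 0-based class labels are assumptions made for illustration; they do not come from the paper, which indexes classes from 1.

```python
import numpy as np

def predict(W, X):
    """Eq. (1): W is n x k (one column per class), X is n x m (one column per example)."""
    return np.argmax(W.T @ X, axis=0)

def multiclass_hinge(W, x, y):
    """Eq. (3) for a single example x with 0-based label y."""
    scores = W.T @ x
    violations = 1.0 + scores - scores[y]   # 1 + W_{y'}.x - W_y.x for every class y'
    violations[y] = 0.0                      # skip y' = y; the zero also acts as the [.]_+ floor
    return float(violations.max())

def frobenius_objective(W, X, labels, C):
    """Eq. (4): half the squared Frobenius norm plus C times the summed hinge loss."""
    loss = sum(multiclass_hinge(W, X[:, i], labels[i]) for i in range(X.shape[1]))
    return 0.5 * np.sum(W ** 2) + C * loss
```

The loss is zero only when the correct class beats every other class by a margin of at least one, which is the pairwise margin requirement described above.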
Recall that our goal is to learn $W$ better by modeling characteristics shared among multiple classes. We restrict ourselves to modeling each common characteristic $r$ as a linear function $F_r^t x$ of the input vectors $x$. The activation of each class $y$ is then taken to be a linear function $G_y^t (F^t x)$ of the vector $F^t x$ of common characteristics, instead of a linear function of the input vectors. Formally, our model substitutes the weight matrix $W \in \mathbb{R}^{n \times k}$ with the product $W = FG$ of a weight matrix $F \in \mathbb{R}^{n \times t}$, whose columns define the $t$ common characteristics, and $G \in \mathbb{R}^{t \times k}$, whose columns predict the classes based on the common characteristics:
$$H_{G,F}(x) = \arg\max_{y \in \mathcal{Y}} G_y^t \cdot (F^t x) = \arg\max_{y \in \mathcal{Y}} (FG)_y^t \cdot x . \tag{5}$$
It should be emphasized that if F and G are not constrained
in any way, the hypothesis space defined by Eq. (1) and
by Eq. (5) is identical, since any linear transformations in-
duced by applying F and then G can always be attained by
a single linear transformation W . We aim to show that nev-
ertheless, regularizing the decomposition F G, as we dis-
cuss shortly, instead of the Frobenius norm of the weight
matrix W , can yield a significant generalization advantage.
When the common characteristics $F$ are known, we can replace the input instances $x_i$ with the vectors $F^t x_i$ and revert back to our original formulation Eq. (4), with the matrix $G$ taking the role of the weight matrix. Each characteristic $r$ is now a feature $(F^t x_i)_r$ in this transformed problem. The challenge we address in this paper is that of simultaneously learning the common characteristics (or latent features) $F$ and the class weights $G$.
In order for the regularizer $\|G\|_F$ to be meaningful, we must also control the magnitude of $F$, suggesting regularizing, in addition to $\|G\|_F$, also $\sum_r \|F_r\|^2 = \|F\|_F^2$, yielding the learning rule:
$$\min_{F,G} \; \frac{1}{2}\|F\|_F^2 + \frac{1}{2}\|G\|_F^2 + C \sum_{i=1}^m \ell(FG; (x_i, y_i)) . \tag{6}$$
The norm of each $F_r$ determines how “easy” it is for class predictors to use this characteristic: increasing the norm $\|F_r\|$ allows smaller values of $G_{yr}$ to yield the same prediction, making it “cheaper” to use the characteristic. It is thus beneficial for useful characteristics to have high norm. But generalization ability is ensured by limiting the overall norm of characteristics. It is important to note that, as we are accustomed to in large-margin methods, we do not have to also limit the number of characteristics $t$. We are relying here on the norm of $F$ and $G$ for regularization, rather than their dimensionality.
The optimization objective of Eq. (6) is non-convex, and involves matrices of unbounded dimensionality. However, instead of explicitly learning $F, G$, the optimization problem Eq. (6) can also be written directly as a convex learning rule for $W$. Following Srebro et al. (2005), we consider the trace-norm of a matrix $W$:
$$\|W\|_\Sigma = \min_{FG = W} \frac{1}{2} \left( \|F\|_F^2 + \|G\|_F^2 \right) \tag{7}$$

The trace-norm is a convex function of $W$, and can be characterized as the sum of its singular values $\gamma_i$ (Boyd & Vandenberghe, 2004):
$$\|W\|_\Sigma = \sum_i |\gamma_i| . \tag{8}$$
Using Eq. (7), we can rewrite Eq. (6) as:
$$\min_W \; \|W\|_\Sigma + C \sum_{i=1}^m \ell(W; (x_i, y_i)) . \tag{9}$$
Furthermore, following Fazel et al. (2001) and Srebro et al.
(2005), the optimization problem Eq. (9) can be formulated
as a semi-definite program (SDP).
To summarize, we saw how learning to classify based on
shared characteristics yields a learning rule in which the
Frobenius-norm regularization is replaced with a trace-
norm regularization.
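The link between the factorized regularizer of Eq. (7) and the singular-value characterization of Eq. (8) can be checked numerically. The sketch below is illustrative only: it assumes the standard balanced factorization built from the SVD ($F = U\sqrt{D}$, $G = \sqrt{D}V$), and the function names are hypothetical.

```python
import numpy as np

def trace_norm(W):
    """Eq. (8): sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def balanced_factorization(W):
    """Split W = U diag(s) Vt into F = U sqrt(D) and G = sqrt(D) Vt, so that F @ G = W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    F = U * np.sqrt(s)               # scales the columns of U
    G = np.sqrt(s)[:, None] * Vt     # scales the rows of Vt
    return F, G

def factorization_cost(F, G):
    """The quantity minimized in Eq. (7)."""
    return 0.5 * (np.sum(F ** 2) + np.sum(G ** 2))

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))
F, G = balanced_factorization(W)
```

For this factorization the cost in Eq. (7) equals the trace-norm, while an unbalanced rescaling such as `(2 * F, G / 2)` reproduces the same `W` at a strictly higher cost.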
3. Dualization and Kernelization
So far, we assumed we have direct access to the feature representation $x$. However, much of the success of large-margin methods stems from the fact that one does not need access to the feature representation itself, but only to the inner product between feature vectors, specified by a kernel function $k(x, x')$. In order to obtain a kernelized form of trace-norm regularized multi-class learning, we first briefly describe the dual of Eq. (9), and how the optimum $W$ can be obtained from the dual optimum.
By applying standard Lagrange duality we deduce that the dual of Eq. (9) is given by the following optimization problem, which may also be written as a semi-definite program:
$$\begin{aligned}
\max_Q \quad & \sum_i (-Q_{i y_i}) \\
\text{s.t.} \quad & \forall i,\, j \neq y_i: \; Q_{ij} \ge 0 \\
& \forall i: \; (-Q_{i y_i}) = \sum_{j \neq y_i} Q_{ij} \le c \\
& \|XQ\|_2 \le 1
\end{aligned}$$
where $Q \in \mathbb{R}^{m \times k}$ denotes the dual Lagrange variable and $\|XQ\|_2$ is the spectral norm of $XQ$ (i.e. the maximal singular value of this matrix). The spectral norm constraint can be equivalently specified as $\|(XQ)^t (XQ)\|_2 = \|Q^t (X^t X) Q\|_2 \le 1$. This form is particularly interesting, since it allows us to write the dual in terms of the Gram matrix $K = X^t X$ instead of the feature representation $X$ explicitly:
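$$\begin{aligned}
\max_Q \quad & \sum_i (-Q_{i y_i}) \\
\text{s.t.} \quad & \forall i,\, j \neq y_i: \; Q_{ij} \ge 0 \\
& \forall i: \; (-Q_{i y_i}) = \sum_{j \neq y_i} Q_{ij} \le c \\
& \|Q^t K Q\|_2 \le 1
\end{aligned} \tag{10}$$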
Although Eq. (10) is not a semi-definite program, it is a
convex problem on Q that involves a semi-definite con-
straint (the spectral-norm constraint) on a matrix whose
size is independent of the size of the training set, and only
depends on the number of classes k.
The following Representer Theorem describes the optimum weight matrix $W$ in terms of the dual optimum $Q$, and allows the use of the kernel mechanism for prediction.
Theorem 1 Let $Q$ be the optimum of Eq. (10) and $V$ be the matrix of eigenvectors of $Q^t K Q$. Then for some diagonal $D \in \mathbb{R}^{k \times k}$, the matrix $W = X (Q V^t D V)$ is an optimum of Eq. (9), with $\|W\|_\Sigma = \sum_r |D_{rr}|$.
Proof Using complementary slackness and following arguments similar to those of Srebro et al. (2005), it can be shown that $XQ$ and the optimum $W$ of Eq. (9) share the same singular vectors. That is, if $XQ = USV$ is the singular value decomposition of $XQ$, then $W = UDV$ for some diagonal matrix $D$. Furthermore $D_{rr} = 0$ whenever $S_{rr} \neq 1$, i.e. $SD = D$. Note also that the right singular vectors $V$ of $XQ = USV$ are precisely the eigenvectors of $(XQ)^t (XQ) = Q^t X^t X Q = Q^t K Q$. We can now express $W$ as follows: First note that $W = UDV$. Since $D = SD$ we may express $W$ as $USDV$. Since $V V^t = I$ we may further expand this expression to $USV V^t DV$. Finally, replacing $USV$ with $XQ$ we obtain $X (Q V^t D V)$.
Corollary 1 There exists $\alpha \in \mathbb{R}^{m \times k}$ s.t. $W = X\alpha$ is an optimum of Eq. (9).
The situation is perhaps not as pleasing as for standard SVMs, where the weight vector can be explicitly represented in terms of the dual optimum solution. Here, even after obtaining the dual optimum $Q$, we still need to recover the diagonal matrix $D$. However, substituting $W = X Q V^t D V$ into Eq. (9), the first term becomes $\sum_r |D_{rr}|$, while the second is piecewise linear in $K Q V^t D V$. We therefore obtain a linear program (LP) in the $k$ unknown entries on the diagonal of $D$, which can be easily solved to recover $D$, and hence $W$. It is important to stress that the number of variables of this LP depends only on the number of classes, and not on the size of the data set, and that the entire procedure (solving Eq. (10), extracting $V$ and recovering $D$) uses only the Gram matrix $K$ and does not require direct access to the explicit feature vectors $X$.
Even if the dual is not directly tackled, the representation
of the optimum W guaranteed by Thm. 1 can be used to
solve the primal Eq. (9) using the Gram matrix K instead
of the feature vectors X, as we discuss in Section 5.

4. Learning a Latent Feature Representation
As alluded to above, learning $F$ can be thought of as learning a latent feature space $F^t X$, which is useful for prediction. Since $F$ is learned jointly over all classes, it effectively transfers knowledge between the classes. Low-norm decompositions were previously discussed in these terms by Srebro et al. (2005). More recently, Argyriou et al. (2007) studied a formulation equivalent to using the trace-norm explicitly for transfer learning between multiple tasks: consider $k$ binary classification tasks, and use $W_j$ as a linear predictor for the $j$th task. Using an SVM to learn each class independently corresponds to the learning rule:
$$\min_W \sum_j \left( \frac{1}{2}\|W_j\|^2 + C\,\ell_j(W_j) \right) = \min_W \; \frac{1}{2}\|W\|_F^2 + C \sum_j \ell_j(W_j)$$
where $\ell_j(W_j)$ is the total (hinge) loss of $W_j$ on the training examples for task $j$. Replacing the Frobenius norm with the trace norm:
$$\min_W \; \|W\|_\Sigma + C \sum_j \ell_j(W_j) \tag{11}$$
corresponds to learning a feature representation $\phi(x) = F^t x$ that allows good, low-norm prediction for all $k$ tasks, where the linear predictor for task $j$, in this feature space, is given by $V_j$. After such a feature representation is learned, a new task can be learned directly using the feature vectors $F^t x$ with standard SVM machinery, taking advantage of the transferred knowledge from the other, previously learned, tasks.
In the multi-class setting, the predictors $W_y$ are never independent, as even in the standard Frobenius norm formulation Eq. (4), the loss couples together the predictors for the different classes. However, the between-class transfer afforded by implicitly learning shared characteristics is much stronger. As will be demonstrated later, such transfer is particularly important if only a small number of examples are available from some class of interest.
Although this paper studies multi-class learning, the tech-
nical contributions, including the optimization approach,
study of the dual problem, and kernelization, apply equally
well also to the multi-task formulation Eq. (11).
Figure 1. Left: The smoothed absolute value function $g$ (axes labeled "singular value" and "norm"). Smaller values of $r$ translate to a sharper function and a better estimate of the absolute values. Right: The binary version of the log-loss in comparison with the binary hinge-loss (axes labeled "margin" and "loss"). Larger values of $\lambda$ increase the accuracy of the log-loss approximation.
It is interesting to note that we can learn a feature representation $\phi(x) = F^t x$ even when we are not given the feature representation $X$ explicitly, but only a kernel $k$ from which we can obtain the Gram matrix $K = X^t X$. In this situation we do not have access to $X$, nor can we obtain $F$ explicitly. As discussed above, what we can obtain is a matrix $\alpha$ such that $W = X\alpha$ is an optimum of Eq. (9). Let $W = UDV$ be the singular value decomposition of $W$ (which we cannot calculate, since we do not have access to $X$). We have that $F = U\sqrt{D}$ is an optimum of Eq. (6). What we can calculate is the singular value decomposition of $\alpha^t K \alpha = \alpha^t X^t X \alpha = W^t W = V^t D^2 V$, and thus obtain $D$ and $V$ (but not $U$). Now, note that
$$D^{-1/2} V \alpha^t K = D^{-1/2} V (\alpha^t X^t) X = D^{-1/2} V W^t X = D^{-1/2} V V^t D U^t X = D^{1/2} U^t X = F^t X ,$$
providing us with an explicit representation of the learned feature space that we can calculate from $K$ and $\alpha$ alone.
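The recovery of $F^t X$ from $K$ and $\alpha$ can be sketched in a few lines of NumPy. This is an illustration under two assumptions that are not stated in the paper: $W = X\alpha$ has full column rank (so $D$ is invertible), and the eigendecomposition of the symmetric matrix $\alpha^t K \alpha$ is used in place of an SVD. The function name is hypothetical.

```python
import numpy as np

def latent_features_from_kernel(K, alpha):
    """Return D^{-1/2} V alpha^t K (that is, F^t X) using only K and alpha."""
    M = alpha.T @ K @ alpha             # equals W^t W = V^t D^2 V
    evals, Q = np.linalg.eigh(M)        # M = Q diag(evals) Q^t, so V = Q^t and D = sqrt(evals)
    # Assumption: every eigenvalue is strictly positive (W has full column rank).
    return (evals ** -0.25)[:, None] * (Q.T @ alpha.T @ K)
```

Because the eigenvectors are only determined up to sign and ordering, the result matches an explicitly computed $F^t X$ up to an orthogonal transformation, so a fair comparison is between the Gram matrices of the two feature representations.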
In either case, we should note that the optimum of Eq. (6) is not unique, and so the learned feature space is also not unique: if $F, G$ is an optimum of Eq. (6), then $(FR), (R^t G)$ is also an optimum, for any unitary matrix $R$ with $R R^t = I$. Instead of learning the explicit feature representation $\phi(x) = F^t x$, we can therefore think of trace-norm regularization as learning the implied kernel $k_\phi(x', x) = \langle F^t x', F^t x \rangle$. Even when $F$ is rotated (and reflected) by $R$, the learned kernel $k_\phi$ is unaffected.
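The invariance of the product $FG$, of the regularizer, and of the implied kernel under such a rotation is easy to confirm numerically. The following sketch uses random matrices of arbitrary, made-up sizes; `implied_kernel` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, k, m = 5, 3, 4, 7
F = rng.standard_normal((n, t))
G = rng.standard_normal((t, k))
X = rng.standard_normal((n, m))

# A random orthogonal R (R R^t = I), taken from the QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((t, t)))

def implied_kernel(F, X):
    """Gram matrix of the features F^t x over the columns of X."""
    Phi = F.T @ X
    return Phi.T @ Phi

FR, RtG = F @ R, R.T @ G
```

Here `FR @ RtG` reproduces `F @ G`, both factors keep their Frobenius norms, and `implied_kernel(FR, X)` agrees with `implied_kernel(F, X)`.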
5. Optimization
The optimization problem Eq. (9) can be formulated as a semi-definite program (SDP) and off-the-shelf SDP solvers can be used to recover the optimal $W$. However, such off-the-shelf solvers based on interior point methods scale very poorly with the size of the problem and typically cannot handle problems with more than several hundred dimensions, classes and training points. Moreover, the ability of interior point methods to obtain very accurate solutions to Eq. (9) is not particularly important in a machine learning application, as the objective based on the training data is just a stochastic approximation of our true interest in generalization ability, and so obtaining a very precise solution to this approximation does not typically yield significant improvements in classification accuracy. Instead, we choose to optimize Eq. (9) using simple but powerful gradient-based methods.
5.1. Gradient based optimization
The optimization problem Eq. (9) is non-differentiable and so not immediately amenable to gradient-based optimization. In order to perform the optimization, we consider a smoothed approximation to Eq. (9).
We begin by replacing the trace-norm with a smooth proxy. Eq. (8) characterizes the trace-norm as the sum of the singular values of $W$. Although the singular values are non-negative, the absolute value in Eq. (8) emphasizes the reason the trace-norm is non-differentiable when a singular value is zero and a singular vector abruptly changes direction. In order to obtain a smooth approximation to the trace-norm, we replace the non-smooth absolute value with a smooth function $g$ defined as
$$g(\gamma) = \begin{cases} \dfrac{\gamma^2}{2r} + \dfrac{r}{2} & |\gamma| \le r \\ |\gamma| & \text{otherwise.} \end{cases}$$
Here $r$ is some predefined cutoff point. Fig. 1 illustrates the function $g$ and the effect of the parameter $r$. We can easily see that $g$ is continuously differentiable, and that $\forall x : g(x) - |x| \le \frac{r}{2}$. Our smoothed proxy for the trace norm is:
$$\|W\|_S = \sum_i g(\gamma_i) \tag{12}$$
where $\gamma_i$ are the singular values of $W$. Its gradient can be calculated as:
$$\frac{\partial \|W\|_S}{\partial W} = U g'(D) V \tag{13}$$
where $W = UDV$ is the SVD of $W$ and $g'(D)$ is an element-wise computation of the derivative $g'$ of $g$ on the diagonal of $D$.
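The smoothed trace-norm of Eq. (12) and its gradient in Eq. (13) translate directly into NumPy. The sketch below is a plausible rendering rather than the authors' implementation; the function names are illustrative, and NumPy's `svd` returns the factor called $V$ in the text as `Vt`.

```python
import numpy as np

def g(gamma, r):
    """Smoothed absolute value: quadratic on [-r, r], |gamma| elsewhere."""
    gamma = np.asarray(gamma, dtype=float)
    return np.where(np.abs(gamma) <= r, gamma ** 2 / (2 * r) + r / 2, np.abs(gamma))

def g_prime(gamma, r):
    """Derivative of g: gamma / r inside the cutoff, sign(gamma) outside."""
    gamma = np.asarray(gamma, dtype=float)
    return np.where(np.abs(gamma) <= r, gamma / r, np.sign(gamma))

def smoothed_trace_norm(W, r):
    """Eq. (12): sum of g over the singular values of W."""
    return g(np.linalg.svd(W, compute_uv=False), r).sum()

def smoothed_trace_norm_grad(W, r):
    """Eq. (13): U g'(D) V, with W = U diag(s) Vt."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * g_prime(s, r)) @ Vt
```

A central finite-difference comparison on a small random matrix is a simple way to sanity-check the analytic gradient before handing it to an optimizer.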
We now turn our attention to the non-differentiable multiclass hinge-loss of Eq. (3). Since neither the hinge $[\cdot]_+$ nor the max operators are differentiable, we employ an adaptation of the log-loss for the multiclass setting (Dekel et al., 2003), with a sharpness parameter $\lambda$ (written $\gamma$ in Fig. 2 and its discussion), inspired by Zhang and Oles (2001):
$$\ell_S(W; (x_i, y_i)) = \frac{1}{\lambda} \log\left( 1 + \sum_{r \neq y_i} e^{\lambda \cdot (1 + W_r \cdot x_i - W_{y_i} \cdot x_i)} \right) .$$
This is a convex and continuously differentiable function of $W$ which approaches the multiclass hinge-loss as $\lambda \to \infty$ (Fig. 1). In summary, instead of Eq. (9) we consider the following optimization problem:
$$\min_W \; \|W\|_S + C \sum_{i=1}^m \ell_S(W; (x_i, y_i)) \tag{14}$$
which is a convex and continuously differentiable function.
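To see how the smoothed loss tracks the hinge loss of Eq. (3) as $\lambda$ grows, one can evaluate both on a single example. The sketch below assumes 0-based class labels and uses hypothetical function names; the log-sum-exp is evaluated with `np.logaddexp` to avoid overflow for large $\lambda$.

```python
import numpy as np

def hinge_loss(W, x, y):
    """Multiclass hinge loss of Eq. (3) for one example."""
    scores = W.T @ x
    others = np.delete(scores, y)
    return max(0.0, float(np.max(1.0 + others - scores[y])))

def smoothed_loss(W, x, y, lam):
    """Smoothed log-loss with sharpness lam for one example."""
    scores = W.T @ x
    a = lam * (1.0 + np.delete(scores, y) - scores[y])
    # log(1 + sum_r exp(a_r)) is the log-sum-exp of [a_1, ..., 0], computed stably.
    return float(np.logaddexp.reduce(np.append(a, 0.0))) / lam
```

Since a log-sum-exp over $k$ terms exceeds the maximum of those terms by at most $\log k$, the smoothed loss lies between the hinge loss and the hinge loss plus $\log(k)/\lambda$, which makes the convergence rate explicit.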
Fig. 2 shows how optimization of the smoothed objective Eq. (14) approximately optimizes Eq. (9). We generated 160 training instances with 16 classes and 16-dimensional feature vectors using a weight matrix that is the product of two random 16 × 4 matrices. For each value of $\gamma$, and a fixed $r = 0.01$, we compared the weight matrix $W$ recovered using conjugate gradient descent on Eq. (14) to the optimizer of Eq. (9) found using an interior point SDP solver. The figure plots the value of the original (non-smooth) objective of both solutions. For large values of $\gamma$, the smoothed optimization solves the original problem to within very good accuracy.
Figure 2. The values of the original (non-smooth) optimization objective Eq. (9) for minima of the smoothed objective Eq. (14) as a function of the smoothing parameter $\gamma$ (solid) compared to the true optimum of Eq. (9) (dotted). Axes labeled "γ" and "Objective".
5.2. Kernelized gradient optimization
We now turn to devising a gradient-based optimization approach appropriate when only the Gram matrix $K = X^t X$ is available, but not the feature vectors $X$ themselves. Corollary 1 assures us that the optimum of Eq. (9) is of the form $X\alpha$, and so we can substitute $W = X\alpha$ into Eq. (14) and minimize over $\alpha$. To do so using gradient methods, we need to be able to compute both the smoothed objective and its derivative from $K$ and $\alpha$ alone, without reference to $X$ explicitly.
We first tackle the smoothed trace norm of $X\alpha$: Let $X\alpha = UDV$ denote the SVD of $X\alpha$; then the SVD of $\alpha^t K \alpha$ is given by $V^t D^2 V$. We can thus recover $D$ from the SVD of $\alpha^t K \alpha$, and use Eq. (12) to calculate $\|X\alpha\|_S$.
In order to compute the gradient of $\|X\alpha\|_S$ with respect to $\alpha$, we calculate:
$$\frac{\partial \|X\alpha\|_S}{\partial \alpha} = X^t \frac{\partial \|X\alpha\|_S}{\partial (X\alpha)} = X^t U g'(D) V$$
inserting $D (V V^t) D^{-1} = D I D^{-1} = I$:
$$= X^t U (D V V^t D^{-1}) g'(D) V = X^t (UDV) V^t D^{-1} g'(D) V$$
and since $X\alpha = UDV$:
$$= X^t (X\alpha) V^t D^{-1} g'(D) V = K \alpha V^t D^{-1} g'(D) V \tag{15}$$
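Eq. (15) can be turned into a short routine that touches only $K$ and $\alpha$. The sketch below is one possible rendering, not the authors' code: it assumes the feature dimension is at least $k$ (so $X\alpha$ has $k$ singular values), uses an eigendecomposition of $\alpha^t K \alpha$, and relies on the identity $g'(d)/d = 1/\max(d, r)$ for $d \ge 0$ to avoid dividing by small singular values. The function name is hypothetical.

```python
import numpy as np

def kernel_smoothed_trace_norm(K, alpha, r):
    """Value and gradient (w.r.t. alpha) of ||X alpha||_S, computed from K = X^t X alone."""
    M = alpha.T @ K @ alpha                    # equals V^t D^2 V
    evals, Q = np.linalg.eigh(M)               # M = Q diag(evals) Q^t, so V = Q^t
    d = np.sqrt(np.clip(evals, 0.0, None))     # singular values of X alpha
    value = np.where(d <= r, d ** 2 / (2 * r) + r / 2, d).sum()
    # For d >= 0, g'(d) / d equals 1 / max(d, r): 1/r inside the cutoff, 1/d outside.
    ratio = 1.0 / np.maximum(d, r)
    grad = K @ alpha @ (Q * ratio) @ Q.T       # K alpha V^t D^{-1} g'(D) V, as in Eq. (15)
    return value, grad
```

When explicit features are available, the output can be compared against $X^t U g'(D) V$ computed from the SVD of $X\alpha$; the two agree because $V^t D^{-1} g'(D) V$ does not depend on the choice of eigenbasis.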

References
Convex Optimization. Book.
On the algorithmic implementation of multiclass kernel-based vector machines. Journal article.
Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. Journal article.
Multi-Task Feature Learning. Proceedings article.
Maximum-Margin Matrix Factorization. Proceedings article.