How many sets were used to select the optimal value of C?

The data was partitioned into three sets: 1000 examples were used as a training set, 500 were held out and used to select the optimal value of C, and 500 were used as a test set.

What is the general family of Eq. 16?

Their results on dualization, kernelization, and representation of the learned latent feature space also apply to the multi-task setting studied by Argyriou et al., as well as to the general family of Eq. (16).

What is the way to solve the trace-norm?

In order to obtain a smooth approximation to the trace-norm, the authors replace the non-smooth absolute value with a smooth function $g$ defined as $g(\gamma) = \frac{\gamma^2}{2r} + \frac{r}{2}$ if $|\gamma| \le r$, and $g(\gamma) = |\gamma|$ otherwise.

What is the spectral norm of XQ?

By applying standard Lagrange duality the authors deduce that the dual of Eq. (9) is given by the following optimization problem, which may also be written as a semi-definite program: $\max_Q \sum_i (-Q_{i y_i})$ s.t. $\forall_{i,\, j \neq y_i}\; Q_{ij} \ge 0$, $\forall_i\; (-Q_{i y_i}) = \sum_{j \neq y_i} Q_{ij} \le c$, and $\|XQ\|_2 \le 1$, where $Q \in \mathbb{R}^{m \times k}$ denotes the dual Lagrange variable and $\|XQ\|_2$ is the spectral norm of $XQ$ (i.e. the maximal singular value of this matrix).

What is the optimum of Eq. (6)?

The optimization problem Eq. (9) can be formulated as a semi-definite program (SDP) and off-the-shelf SDP solvers can be used to recover the optimal W .

What is the optimum of Eq. (9)?

Corollary 1 assures us that the optimum of Eq. (9) is of the form Xα, and so the authors can substitute W = Xα into Eq. (14) and minimize over α.

What is the effect of the learning rule for multi-class learning?

The authors studied a learning rule for multi-class learning in which the magnitude of the factorization of the weight matrix is regularized, rather than the magnitude of the weights themselves.

(Open Access) Uncovering shared structures in multiclass classification (2007) | Yonatan Amit

Q: What are the contributions mentioned in the paper "Uncovering shared structures in multiclass classification" ?

The authors cast this as a convex optimization problem, using trace-norm regularization, study gradient-based optimization both for the linear case and the kernelized setting, and show how this approach can yield improved classification accuracy.

Q: What is the main reason for the success of large-margin methods?

Much of the success of large-margin methods stems from the fact that one does not need access to the feature representation itself, but only to the inner product between feature vectors, specified by a kernel function k(x,x′).

Q: What is the purpose of this paper?

Although this paper studies multi-class learning, the technical contributions, including the optimization approach, study of the dual problem, and kernelization, apply equally well also to the multi-task formulation Eq. (11).

Uncovering Shared Structures in Multiclass Classiﬁcation

Abstract

We suggest a method for multi-class learning

with many classes by simultaneously learning

shared characteristics common to the classes, and

predictors for the classes in terms of these char-

acteristics. We cast this as a convex optimization

problem, using trace-norm regularization, study

gradient-based optimization both for the linear

case and the kernelized setting, and show how

this approach can yield improved classiﬁcation

accuracy.

1. Introduction

In this paper we address the question of how to utilize hid-

den structure in order to improve multiclass classiﬁcation

accuracy. Our goal is to provide a mechanism for learning

the underlying characteristics that are shared between the

target classes, and to demonstrate the beneﬁt of extracting

common characteristics. We build upon the powerful no-

tion of large margin linear classiﬁers, and speciﬁcally focus

on the recent extensions to multiclass settings (Crammer &

Singer, 2001).

The challenge of accurate classiﬁcation of an instance into

one of a large number of target classes surfaces in many do-

mains, such as object recognition, face identiﬁcation, tex-

tual topic classiﬁcation, and phoneme recognition. In many

of these domains it is natural to assume that even though

there are a large number of classes (e.g. different people

in a face recognition task), classes are related and build

on some underlying common characteristics. For exam-

ple, many different mammals share characteristics such as

a striped texture or an elongated snout, and people’s faces

can be identiﬁed based on underlying characteristics such

as gender, being Caucasian, or having red hair. Recover-

ing the true underlying characteristics of a domain can sig-

niﬁcantly reduce the effective complexity of the multiclass

problem and by that transfer knowledge between related

classes.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

The obvious question that arises is how to select the feature mapping appropriate for a given task. One method to

resolve this need is by manually designing a domain spe-

ciﬁc kernel (e.g. (Shpigelman et al., 2002)). When the

route of manual kernel design is not feasible one can at-

tempt to learn a data speciﬁc feature mapping (Crammer

et al., 2002). In practice, researchers often simply test sev-

eral of the standard kernels in order to assess which attains

better performance on a validation set. These approaches,

however, fail to provide a clear mechanism for utilizing the

existence of structures in selecting the appropriate feature

mapping. We would therefore like to ﬁnd an efﬁcient way

to learn feature mappings that capture the underlying struc-

ture of a given set of classes.

The observation that learning a hidden representation of

some shared characteristics can facilitate learning has a

long history in multiclass learning (e.g. Dekel et al.

(2004)). This notion is often termed learning-to-learn or

interclass transfer (Thrun, 1996). While some approaches

assume some information on the shared characteristics is

provided to the learner in advance (Fink & Levi, 2004; Fink

et al., 2006), others rely on various learning heuristics in

order to extract the shared features (Torralba et al., 2004).

Simultaneously learning the underlying structure between

the classes and the class models is a challenging optimiza-

tion task. Many of the heuristic approaches stated above

aim at extracting powerful non-linear hidden characteris-

tics. However, this goal often entails non-convex optimiza-

tion tasks, prone to local minima problems. In contrast, we

will focus on modeling the shared characteristics, as linear

transformations of the input space. Thus, our model will

postulate a linear mapping of shared features, followed by a multiclass linear classifier. We will show that such models can be efficiently learned in a convex optimization scheme and that they can significantly improve the accuracy of multiclass linear classifiers, despite the fact that they are restricted to simple linear mappings of the instance space.

The rest of this paper is organized as follows. We begin by

introducing our learning setting, motivating our approach

and formulating the suggested learning rule (Sec. 2). By

studying the dual of the resulting optimization problem, we

show, in Section 3, how to “kernelize” our learning rule.

Then, in Section 4, we discuss the learning rule in the context of learning a latent feature representation. In Section 5 we derive an optimization scheme and in Section 7 demonstrate our approach on picture classification and handwritten letter recognition tasks.

2. Formulation

The goal of multiclass classification is to learn a mapping $H : \mathcal{X} \to \mathcal{Y}$ from instances in $\mathcal{X}$ to labels in $\mathcal{Y} = \{1, \ldots, k\}$. We consider linear classifiers over $\mathcal{X} = \mathbb{R}^n$ parametrized by a weight vector $W_y \in \mathbb{R}^n$ for each class $y \in \mathcal{Y}$, and which take the form:

$$H_W(x) = \arg\max_{y \in \mathcal{Y}} \; W_y \cdot x . \tag{1}$$

We wish to learn the weights from a set of $m$ labeled training examples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, which we summarize in a matrix $X \in \mathbb{R}^{n \times m}$ whose columns are given by $x_i$.

Inspired by the large margin approach for classification, Crammer and Singer (2001) suggest learning the weights by minimizing a trade-off between an average empirical loss (to be discussed shortly) and a regularizer of the form:

$$\sum_{y} \|W_y\|^2 = \|W\|_F^2 \tag{2}$$

where $\|W\|_F$ is the Frobenius norm of the matrix $W$ whose columns are the vectors $W_y$. The loss function suggested by Crammer et al. is the maximal hinge loss over all comparisons between the correct class and an incorrect class:

$$\ell(W; (x, y)) = \max_{y' \neq y} \left[ 1 + W_{y'} \cdot x - W_y \cdot x \right]_+ \tag{3}$$

where $[z]_+ = \max(0, z)$. For a trade-off parameter $C$, the weights are then given by the following learning rule:

$$\min_W \; \tfrac{1}{2} \|W\|_F^2 + C \sum_{i=1}^{m} \ell(W; (x_i, y_i)) . \tag{4}$$

For a binary classification problem, $\mathcal{Y} = \{1, 2\}$, this formulation reduces to the familiar Support Vector Machine (SVM) formulation (with $W_1 = -W_2 = \tfrac{1}{2} w_{\mathrm{svm}}$ at the optimum, and $C$ appropriately scaled). For a larger number of classes, the formulation generalizes SVMs by requiring a margin between every pair of classes, and penalizing, for each training example, the amount by which the margin constraint is violated. Similarly to SVMs, the optimization problem Eq. (4) is convex, and by introducing a “slack variable” for each example, it can be written as a quadratic program. Crammer et al. discuss practical optimization approaches.
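The prediction rule of Eq. (1), the loss of Eq. (3) and the objective of Eq. (4) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, class labels are 0-indexed rather than 1-indexed, and the factor 1/2 on the regularizer is an assumed scaling.

```python
import numpy as np

def predict(W, x):
    """Eq. (1): return the class whose weight vector W_y scores x highest."""
    return int(np.argmax(W.T @ x))

def multiclass_hinge(W, x, y):
    """Eq. (3): maximal hinge loss over all incorrect classes y' != y."""
    scores = W.T @ x                     # one score per class, shape (k,)
    margins = 1.0 + scores - scores[y]   # 1 + W_{y'}.x - W_y.x for every y'
    margins[y] = -np.inf                 # exclude the correct class itself
    return max(0.0, float(np.max(margins)))

def frobenius_objective(W, X, labels, C):
    """Eq. (4), assuming a 1/2 factor on the squared Frobenius norm.

    X holds one training example per column, as in the text."""
    total_loss = sum(multiclass_hinge(W, X[:, i], labels[i])
                     for i in range(X.shape[1]))
    return 0.5 * float(np.sum(W ** 2)) + C * total_loss
```

With `W` set to the identity, an example that lies exactly on its own class axis has zero loss, while shrinking it towards the origin makes the margin constraint active.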

Recall that our goal is to learn $W$ better by modeling characteristics shared among multiple classes. We restrict ourselves to modelling each common characteristic $r$ as a linear function $F_r^\top x$ of the input vectors $x$. The activation of each class $y$ is then taken to be a linear function $G_y \cdot (F^\top x)$ of the vector $F^\top x$ of common characteristics, instead of a linear function of the input vectors. Formally, our model substitutes the weight matrix $W \in \mathbb{R}^{n \times k}$ with the product $W = FG$ of a weight matrix $F \in \mathbb{R}^{n \times t}$, whose columns define the $t$ common characteristics, and $G \in \mathbb{R}^{t \times k}$, whose columns predict the classes based on the common characteristics:

$$H_{G,F}(x) = \arg\max_{y \in \mathcal{Y}} \; G_y \cdot (F^\top x) = \arg\max_{y \in \mathcal{Y}} \; (FG)_y \cdot x . \tag{5}$$

It should be emphasized that if F and G are not constrained

in any way, the hypothesis space deﬁned by Eq. (1) and

by Eq. (5) is identical, since any linear transformations in-

duced by applying F and then G can always be attained by

a single linear transformation W . We aim to show that nev-

ertheless, regularizing the decomposition F G, as we dis-

cuss shortly, instead of the Frobenius norm of the weight

matrix W , can yield a signiﬁcant generalization advantage.

When the common characteristics $F$ are known, we can replace the input instances $x_i$ with the vectors $F^\top x_i$ and revert back to our original formulation Eq. (4), with the matrix $G$ taking the role of the weight matrix. Each characteristic $r$ is now a feature $(F^\top x_i)_r$ in this transformed problem. The challenge we address in this paper is that of simultaneously learning the common characteristics (or latent features) $F$ and the class weights $G$.

In order for the regularizer $\|G\|_F$ to be meaningful, we must also control the magnitude of $F$, suggesting regularizing, in addition to $\|G\|_F$, also $\sum_r \|F_r\|^2 = \|F\|_F^2$, yielding the learning rule:

$$\min_{F,G} \; \tfrac{1}{2} \|F\|_F^2 + \tfrac{1}{2} \|G\|_F^2 + C \sum_{i=1}^{m} \ell(FG; (x_i, y_i)) . \tag{6}$$

The norm of each $F_r$ determines how “easy” it is for class predictors to use this characteristic: increasing the norm $\|F_r\|$ allows smaller values of $G_{ry}$ to yield the same prediction, making it “cheaper” to use the characteristic. It is thus beneficial for useful characteristics to have high norm. But generalization ability is ensured by limiting the overall norm of characteristics. It is important to note that, as we are accustomed to in large-margin methods, we do not have to also limit the number of characteristics $t$. We are relying here on the norm of $F$ and $G$ for regularization, rather than their dimensionality.

The optimization objective of Eq. (6) is non-convex, and involves matrices of unbounded dimensionality. However, instead of explicitly learning $F, G$, the optimization problem Eq. (6) can also be written directly as a convex learning rule for $W$. Following Srebro et al. (2005), we consider the trace-norm of a matrix $W$:

$$\|W\|_\Sigma = \min_{FG = W} \tfrac{1}{2} \left( \|F\|_F^2 + \|G\|_F^2 \right) \tag{7}$$

The trace-norm is a convex function of $W$, and can be characterized as the sum of its singular values $\gamma_i$ (Boyd & Vandenberghe, 2004):

$$\|W\|_\Sigma = \sum_i |\gamma_i| . \tag{8}$$

Using Eq. (7), we can rewrite Eq. (6) as:

$$\min_W \; \|W\|_\Sigma + C \sum_{i=1}^{m} \ell(W; (x_i, y_i)) . \tag{9}$$

Furthermore, following Fazel et al. (2001) and Srebro et al. (2005), the optimization problem Eq. (9) can be formulated as a semi-definite program (SDP).

To summarize, we saw how learning to classify based on

shared characteristics yields a learning rule in which the

Frobenius-norm regularization is replaced with a trace-

norm regularization.
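The two characterizations of the trace-norm, as a sum of singular values and as a minimum over factorizations with a 1/2 factor on the squared Frobenius norms, can be checked numerically. The sketch below is an illustration on random data, not the authors' code; the helper names are hypothetical. It builds the balanced factorization $F = U\sqrt{D}$, $G = \sqrt{D}V$ from the SVD and compares its cost with the sum of singular values.

```python
import numpy as np

def trace_norm(W):
    """Sum of the singular values of W."""
    return float(np.sum(np.linalg.svd(W, compute_uv=False)))

def balanced_factorization(W):
    """Return F = U sqrt(D) and G = sqrt(D) V, so that F @ G equals W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s)
    F = U * root               # scales column r of U by sqrt(s_r)
    G = root[:, None] * Vt     # scales row r of Vt by sqrt(s_r)
    return F, G

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))   # a rank-3 matrix
F, G = balanced_factorization(W)

# Cost of the balanced factorization; it should match the trace-norm.
cost = 0.5 * (np.sum(F ** 2) + np.sum(G ** 2))

# An unbalanced factorization of the same W; its cost should be larger.
cost_unbalanced = 0.5 * (np.sum((2.0 * F) ** 2) + np.sum((0.5 * G) ** 2))
```

Rescaling `F` up and `G` down leaves the product unchanged but raises the cost, which is why the minimum in the factorized form is attained at a balanced pair.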

3. Dualization and Kernelization

So far, we assumed we have direct access to the feature

representation $x$. However, much of the success of large-margin methods stems from the fact that one does not need access to the feature representation itself, but only to the inner product between feature vectors, specified by a kernel function $k(x, x')$. In order to obtain a kernelized form of

trace-norm regularized multi-class learning, we ﬁrst brieﬂy

describe the dual of Eq. (9), and how the optimum W can

be obtained from the dual optimum.

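By applying standard Lagrange duality we deduce that the dual of Eq. (9) is given by the following optimization problem, which may also be written as a semi-definite program:

$$\max_Q \; \sum_i (-Q_{i y_i}) \quad \text{s.t.} \quad \begin{array}{l} \forall_{i,\, j \neq y_i} \;\; Q_{ij} \ge 0 \\ \forall_i \;\; (-Q_{i y_i}) = \sum_{j \neq y_i} Q_{ij} \le c \\ \|XQ\|_2 \le 1 \end{array}$$

where $Q \in \mathbb{R}^{m \times k}$ denotes the dual Lagrange variable and $\|XQ\|_2$ is the spectral norm of $XQ$ (i.e. the maximal singular value of this matrix). The spectral norm constraint can be equivalently specified as $\|(XQ)^\top (XQ)\|_2 = \|Q^\top (X^\top X) Q\|_2 \le 1$. This form is particularly interesting, since it allows us to write the dual in terms of the Gram matrix $K = X^\top X$ instead of the feature representation $X$ explicitly:

$$\max_Q \; \sum_i (-Q_{i y_i}) \quad \text{s.t.} \quad \begin{array}{l} \forall_{i,\, j \neq y_i} \;\; Q_{ij} \ge 0 \\ \forall_i \;\; (-Q_{i y_i}) = \sum_{j \neq y_i} Q_{ij} \le c \\ \|Q^\top K Q\|_2 \le 1 \end{array} \tag{10}$$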

Although Eq. (10) is not a semi-deﬁnite program, it is a

convex problem on Q that involves a semi-deﬁnite con-

straint (the spectral-norm constraint) on a matrix whose

size is independent of the size of the training set, and only

depends on the number of classes k.
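The equivalence between the constraint on $XQ$ and the constraint on $Q^\top K Q$ rests on the identity $\|XQ\|_2^2 = \|Q^\top K Q\|_2$, so one is at most 1 exactly when the other is. A minimal numerical check on random data follows; the dimensions are arbitrary and $Q$ is assumed to have one row per training example so that the product $XQ$ is defined.

```python
import numpy as np

def spectral_norm(A):
    """Largest singular value of the matrix A."""
    return float(np.linalg.norm(A, 2))

rng = np.random.default_rng(0)
n, m, k = 7, 5, 3                    # feature dimension, examples, classes
X = rng.standard_normal((n, m))      # one example per column
Q = rng.standard_normal((m, k))      # stand-in for the dual variable
K = X.T @ X                          # Gram matrix, the only quantity a kernel provides

lhs = spectral_norm(X @ Q) ** 2      # requires the explicit features X
rhs = spectral_norm(Q.T @ K @ Q)     # requires only K, on a k-by-k matrix
```

The right-hand side works on a matrix whose size depends only on the number of classes, which is the property the paragraph above highlights.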

The following Representer Theorem describes the optimum weight matrix $W$ in terms of the dual optimum $Q$, and allows the use of the kernel mechanism for prediction.

Theorem 1 Let $Q$ be the optimum of Eq. (10) and $V$ be the matrix of eigenvectors of $Q^\top K Q$. Then for some diagonal $D \in \mathbb{R}^{k \times k}$, the matrix $W = X (Q V^\top D V)$ is an optimum of Eq. (9), with $\|W\|_\Sigma = \sum_r |D_{rr}|$.

Proof Using complementary slackness and following arguments similar to those of Srebro et al. (2005), it can be shown that $XQ$ and the optimum $W$ of Eq. (9) share the same singular vectors. That is, if $XQ = USV$ is the singular value decomposition of $XQ$, then $W = UDV$ for some diagonal matrix $D$. Furthermore, $D_{rr} = 0$ whenever $S_{rr} \neq 1$, i.e. $SD = D$. Note also that the right singular vectors $V$ of $XQ = USV$ are precisely the eigenvectors of $(XQ)^\top (XQ) = Q^\top X^\top X Q = Q^\top K Q$. We can now express $W$ as follows: First note that $W = UDV$. Since $D = SD$ we may express $W$ as $USDV$. Since $V V^\top = I$ we may further expand this expression to $U S V V^\top D V$. Finally, replacing $USV$ with $XQ$ we obtain $X (Q V^\top D V)$.

Corollary 1 There exists $\alpha \in \mathbb{R}^{m \times k}$ s.t. $W = X\alpha$ is an optimum of Eq. (9).

The situation is perhaps not as pleasing as for standard SVMs, where the weight vector can be explicitly represented in terms of the dual optimum solution. Here, even after obtaining the dual optimum $Q$, we still need to recover the diagonal matrix $D$. However, substituting $W = X Q V^\top D V$ into Eq. (9), the first term becomes $\sum_r |D_{rr}|$ while the second is piecewise linear in $K Q V^\top D V$. We

therefore obtain a linear program (LP) in the k unknown

entries on the diagonal of D, which can be easily solved to

recover D, and hence W . It is important to stress that the

number of variables of this LP depends only on the number

of classes, and not on the size of the data set, and that the

entire procedure (solving Eq. (10), extracting V and recov-

ering D) uses only the Gram matrix K and does not require

direct access to the explicit feature vectors X.

Even if the dual is not directly tackled, the representation

of the optimum W guaranteed by Thm. 1 can be used to

solve the primal Eq. (9) using the Gram matrix K instead

of the feature vectors X, as we discuss in Section 5.

4. Learning a Latent Feature Representation

As alluded to above, learning $F$ can be thought of as learning a latent feature space $F^\top X$, which is useful for prediction. Since $F$ is learned jointly over all classes, it effectively transfers knowledge between the classes. Low-norm decompositions were previously discussed in these terms by Srebro et al. (2005). More recently, Argyriou et al. (2007) studied a formulation equivalent to using the trace-norm explicitly for transfer learning between multiple tasks: consider $k$ binary classification tasks, and use $W_j$ as a linear predictor for the $j$th task. Using an SVM to learn each class independently corresponds to the learning rule:

$$\min_W \sum_j \left( \tfrac{1}{2} \|W_j\|^2 + C \ell_j(W_j) \right) = \min_W \; \tfrac{1}{2} \|W\|_F^2 + C \sum_j \ell_j(W_j)$$

where $\ell_j(W_j)$ is the total (hinge) loss of $W_j$ on the training examples for task $j$. Replacing the Frobenius norm with the trace norm:

$$\min_W \; \|W\|_\Sigma + C \sum_j \ell_j(W_j) \tag{11}$$

corresponds to learning a feature representation $\phi(x) = F^\top x$ that allows good, low-norm prediction for all $k$ tasks, where the linear predictor for task $j$, in this feature space, is given by $V_j$. After such a feature representation is learned, a new task can be learned directly using the feature vectors $F^\top x$ using standard SVM machinery, taking advantage of the transferred knowledge from the other, previously-learned, tasks.

In the multi-class setting, the predictors $W_y$ are never independent, as even in the standard Frobenius norm formulation Eq. (4), the loss couples together the predictors for the different classes. However, the between-class transfer afforded by implicitly learning shared characteristics is much stronger. As will be demonstrated later, such transfer is particularly important if only a small number of examples are available from some class of interest.

Although this paper studies multi-class learning, the tech-

nical contributions, including the optimization approach,

study of the dual problem, and kernelization, apply equally

well also to the multi-task formulation Eq. (11).

It is interesting to note that we can learn a feature representation $\phi(x) = F^\top x$ even when we are not given the feature representation $X$ explicitly, but only a kernel $k$ from which we can obtain the Gram matrix $K = X^\top X$. In this situation we do not have access to $X$, nor can we obtain $F$ explicitly. As discussed above, what we can obtain is a matrix $\alpha$ such that $W = X\alpha$ is an optimum of Eq. (9). Let $W = UDV$ be the singular value decomposition of $W$ (which we cannot calculate, since we do not have access to $X$). We have that $F = U\sqrt{D}$ is an optimum of Eq. (6). What we can calculate is the singular value decomposition of $\alpha^\top K \alpha = \alpha^\top X^\top X \alpha = W^\top W = V^\top D^2 V$, and thus obtain $D$ and $V$ (but not $U$). Now, note that $D^{-1/2} V \alpha^\top K = D^{-1/2} V (\alpha^\top X^\top) X = D^{-1/2} V W^\top X = D^{-1/2} V V^\top D U^\top X = D^{1/2} U^\top X = F^\top X$, providing us with an explicit representation of the learned feature space that we can calculate from $K$ and $\alpha$ alone.

Figure 1. Left: The smoothed absolute value function $g$. Smaller values of $r$ translate to a sharper function and a better estimate of the absolute values. Right: The binary version of the log-loss in comparison with the binary hinge-loss. Larger values of $\lambda$ increase the accuracy of the log-loss approximation. (Axes: norm against singular value on the left; loss against margin on the right.)

In either case, we should note that the optimum of Eq. (6) is not unique, and so the learned feature space is also not unique: if $F, G$ is an optimum of Eq. (6), then $(FR), (R^\top G)$ is also an optimum, for any unitary matrix $R$, $R R^\top = I$. Instead of learning the explicit feature representation $\phi(x) = F^\top x$, we can therefore think of trace-norm regularization as learning the implied kernel $k_F(x', x) = \langle F^\top x', F^\top x \rangle$. Even when $F$ is rotated (and reflected) by $R$, the learned kernel $k_F$ is unaffected.
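The recovery of the latent features from the Gram matrix can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the function name and the tolerance `tol` are hypothetical, the decomposition of $\alpha^\top K \alpha$ is computed with a symmetric eigendecomposition, and directions with (numerically) zero singular value are dropped so that $D^{-1/2}$ is defined.

```python
import numpy as np

def latent_features_from_gram(K, alpha, tol=1e-10):
    """Return D^{-1/2} V alpha^T K, i.e. F^T X, using only K and alpha."""
    M = alpha.T @ K @ alpha                # equals W^T W = V^T D^2 V
    evals, evecs = np.linalg.eigh(M)       # eigenvectors are the columns of evecs
    keep = evals > tol * evals.max()       # drop zero singular directions
    d = np.sqrt(evals[keep])               # nonzero singular values of W
    V = evecs[:, keep].T                   # rows of V are right singular vectors
    return (V / np.sqrt(d)[:, None]) @ alpha.T @ K

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))            # explicit features, used only for checking
alpha = rng.standard_normal((8, 3))
K = X.T @ X
Z = latent_features_from_gram(K, alpha)    # one latent feature vector per column
```

Because the eigenvectors are determined only up to sign and ordering, `Z` may differ from an explicitly computed $F^\top X$ by a rotation or reflection. The implied kernel `Z.T @ Z` is the quantity that should agree, in line with the remark on non-uniqueness above.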

5. Optimization

The optimization problem Eq. (9) can be formulated as a

semi-deﬁnite program (SDP) and off-the-shelf SDP solvers

can be used to recover the optimal W . However, such off-

the-shelf solvers based on interior point methods scale very

poorly with the size of the problem and typically cannot

handle problems with more than several hundred dimen-

sions, classes and training points. Moreover, the ability of

interior point methods to obtain very accurate solutions to

Eq. (9) is not particularly important in a machine learning

application as the objective based on the training data is just

a stochastic approximation of our true interest in general-

ization ability, and so obtaining a very precise solution to

this approximation does not typically yield signiﬁcant im-

provements in classiﬁcation accuracy. Instead, we choose

to optimize Eq. (9) using simple, but powerful gradient-

based methods.

5.1. Gradient based optimization

The optimization problem Eq. (9) is non-differentiable and so not immediately amenable to gradient-based optimization. In order to perform the optimization, we consider a smoothed approximation to Eq. (9).

We begin by replacing the trace-norm with a smooth proxy. Eq. (8) characterizes the trace-norm as the sum of the singular values of $W$. Although the singular values are non-negative, the absolute value in Eq. (8) emphasizes the reason the trace-norm is non-differentiable when a singular value is zero and a singular vector abruptly changes direction. In order to obtain a smooth approximation to the trace-norm, we replace the non-smooth absolute value with a smooth function $g$ defined as

$$g(\gamma) = \begin{cases} \dfrac{\gamma^2}{2r} + \dfrac{r}{2} & |\gamma| \le r \\ |\gamma| & \text{otherwise} \end{cases}$$

where $r$ is some predefined cutoff point. Fig. 1 illustrates the function $g$ and the effect of the parameter $r$. We can easily see that $g$ is continuously differentiable, and that $\forall x : \big| g(x) - |x| \big| \le \frac{r}{2}$. Our smoothed proxy for the trace norm is:

$$\|W\|_S = \sum_i g(\gamma_i) \tag{12}$$

where $\gamma_i$ are the singular values of $W$. Its gradient can be calculated as:

$$\frac{\partial \|W\|_S}{\partial W} = U g'(D) V \tag{13}$$

where $W = UDV$ is the SVD of $W$ and $g'(D)$ is an element-wise computation of the derivative $g'$ of $g$ on the diagonal of $D$.
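A minimal NumPy sketch of the smoothed trace-norm and its gradient, Eqs. (12) and (13), is given below. The function names are hypothetical, and the quadratic piece of $g$ is taken to apply for $|\gamma| \le r$, which is an assumption about the intended definition.

```python
import numpy as np

def g(gamma, r):
    """Smoothed absolute value: quadratic within [-r, r], |gamma| outside."""
    gamma = np.asarray(gamma, dtype=float)
    return np.where(np.abs(gamma) <= r,
                    gamma ** 2 / (2.0 * r) + r / 2.0,
                    np.abs(gamma))

def g_prime(gamma, r):
    """Derivative of g: gamma / r within [-r, r], sign(gamma) outside."""
    gamma = np.asarray(gamma, dtype=float)
    return np.where(np.abs(gamma) <= r, gamma / r, np.sign(gamma))

def smoothed_trace_norm(W, r):
    """Eq. (12): sum of g over the singular values of W."""
    return float(np.sum(g(np.linalg.svd(W, compute_uv=False), r)))

def smoothed_trace_norm_grad(W, r):
    """Eq. (13): U g'(D) V, where W = U D V is the (thin) SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * g_prime(s, r)) @ Vt
```

A finite-difference comparison on a small random matrix is a cheap way to confirm the gradient formula before plugging it into a conjugate gradient routine.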

We now turn our attention to the non-differentiable multiclass hinge-loss of Eq. (3). Since neither the hinge $[\cdot]_+$ nor the max operators are differentiable, we employ an adaptation of the log-loss for the multiclass setting (Dekel et al., 2003), with a parameter $\lambda$ controlling its sharpness (inspired by Zhang and Oles (2001)):

$$\ell_S(W; (x, y)) = \frac{1}{\lambda} \log \Big( 1 + \sum_{r \neq y} e^{\lambda \cdot (1 + W_r \cdot x - W_y \cdot x)} \Big)$$

This is a convex and continuously differentiable function of $W$ which approaches the multiclass hinge-loss as $\lambda \to \infty$ (Fig. 1). In summary, instead of Eq. (9) we consider the following optimization problem:

$$\min_W \; \|W\|_S + C \sum_{i=1}^{m} \ell_S(W; (x_i, y_i)) \tag{14}$$

which is a convex and continuously differentiable function.
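The relation between the smoothed loss and the hinge loss can be illustrated numerically. The sketch below assumes the $1/\lambda$ normalization that makes the log-loss approach the hinge loss as $\lambda$ grows, uses 0-indexed class labels, and evaluates the logarithm in a numerically stable way; the function names are hypothetical.

```python
import numpy as np

def multiclass_hinge(W, x, y):
    """Eq. (3): maximal hinge loss over the incorrect classes."""
    scores = W.T @ x
    margins = np.delete(1.0 + scores - scores[y], y)
    return max(0.0, float(np.max(margins)))

def smoothed_loss(W, x, y, lam):
    """Multiclass log-loss with sharpness lam (assumed 1/lam normalization)."""
    scores = W.T @ x
    z = lam * np.delete(1.0 + scores - scores[y], y)
    z = np.concatenate(([0.0], z))     # the leading 0 accounts for the "1 +" term
    zmax = np.max(z)                   # shift before exponentiating to avoid overflow
    return float((zmax + np.log(np.sum(np.exp(z - zmax)))) / lam)
```

With $k$ classes the smoothed loss lies between the hinge loss and the hinge loss plus $\log(k)/\lambda$, so the gap closes at rate $1/\lambda$.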

Fig. 2 shows how optimization of the smoothed objective Eq. (14) approximately optimizes Eq. (9). We generated 160 training instances with 16 classes and 16-dimensional feature vectors using a weight matrix that is the product of two random 16 × 4 matrices. For each value of $\lambda$, and a fixed $r = 0.01$, we compared the weight matrix $W$ recovered using conjugate gradient descent on Eq. (14) to the optimizer of Eq. (9) found using an interior point SDP solver. The figure plots the value of the original (non-smooth) objective of both solutions. For large values of $\lambda$, the smoothed optimization solves the original problem to within very good accuracy.

Figure 2. The values of the original (non-smooth) optimization objective Eq. (9) for minima of the smoothed objective Eq. (14) as a function of the smoothing parameter $\lambda$ (solid) compared to the true optimum of Eq. (9) (dotted).

5.2. Kernelized gradient optimization

We now turn to devising a gradient-based optimization approach appropriate when only the Gram matrix $K = X^\top X$ is available, but not the feature vectors $X$ themselves. Corollary 1 assures us that the optimum of Eq. (9) is of the form $X\alpha$, and so we can substitute $W = X\alpha$ into Eq. (14) and minimize over $\alpha$. To do so using gradient methods, we need to be able to compute both the smoothed objective and its derivative from $K$ and $\alpha$ alone, without reference to $X$ explicitly.

We first tackle the smoothed trace norm of $X\alpha$: Let $X\alpha = UDV$ denote the SVD of $X\alpha$; then the SVD of $\alpha^\top K \alpha$ is given by $V^\top D^2 V$. We can thus recover $D$ from the SVD of $\alpha^\top K \alpha$, and use Eq. (12) to calculate $\|X\alpha\|_S$.

In order to compute the gradient of $\|X\alpha\|_S$ with respect to $\alpha$, we calculate:

$$\frac{\partial \|X\alpha\|_S}{\partial \alpha} = X^\top \frac{\partial \|X\alpha\|_S}{\partial X\alpha} = X^\top U g'(D) V$$

inserting $D (V V^\top) D^{-1} = D I D^{-1} = I$:

$$= X^\top U (D V V^\top D^{-1}) g'(D) V = X^\top (UDV) V^\top D^{-1} g'(D) V$$

and since $X\alpha = UDV$:

$$= X^\top (X\alpha) V^\top D^{-1} g'(D) V = K \alpha V^\top D^{-1} g'(D) V \tag{15}$$


References

Convex Optimization

On the algorithmic implementation of multiclass kernel-based vector machines

Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study

Multi-Task Feature Learning

Maximum-Margin Matrix Factorization
