Support Vector Machine Learning for

Interdependent and Structured Output Spaces

Ioannis Tsochantaridis it@cs.brown.edu

Thomas Hofmann th@cs.brown.edu

Department of Computer Science, Brown University, Providence, RI 02912

Thorsten Joachims tj@cs.cornell.edu

Department of Computer Science, Cornell University, Ithaca, NY 14853

Yasemin Altun altun@cs.brown.edu

Department of Computer Science, Brown University, Providence, RI 02912

Abstract

Learning general functional dependencies is one of the main goals in machine learning. Recent progress in kernel-based methods has focused on designing flexible and powerful input representations. This paper addresses the complementary issue of problems involving complex outputs such as multiple dependent output variables and structured output spaces. We propose to generalize multiclass Support Vector Machine learning in a formulation that involves features extracted jointly from inputs and outputs. The resulting optimization problem is solved efficiently by a cutting plane algorithm that exploits the sparseness and structural decomposition of the problem. We demonstrate the versatility and effectiveness of our method on problems ranging from supervised grammar learning and named-entity recognition, to taxonomic text classification and sequence alignment.

1. Introduction

This paper deals with the general problem of learning a mapping from inputs x ∈ X to discrete outputs y ∈ Y based on a training sample of input-output pairs (x_1, y_1), ..., (x_n, y_n) ∈ X × Y drawn from some fixed but unknown probability distribution. Unlike the case of multiclass classification, where Y = {1, ..., k} with interchangeable, arbitrarily numbered labels, we consider structured output spaces Y. Elements y ∈ Y may be, for instance, sequences, strings, labeled trees, lattices, or graphs. Such problems arise in a variety of applications, ranging from multilabel classification and classification with class taxonomies, to label sequence learning, sequence alignment learning, and supervised grammar learning, to name just a few.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright by the authors.

We approach these problems by generalizing large margin methods, more specifically multi-class Support Vector Machines (SVMs) (Weston & Watkins, 1998; Crammer & Singer, 2001), to the broader problem of learning structured responses. The naive approach of treating each structure as a separate class is often intractable, since it leads to a multiclass problem with a very large number of classes. We overcome this problem by specifying discriminant functions that exploit the structure and dependencies within Y. In that respect, our approach follows the work of Collins (2002; 2004) on perceptron learning with a similar class of discriminant functions. However, the maximum margin algorithm we propose has advantages in terms of accuracy and tunability to specific loss functions. A similar philosophy of using kernel methods for learning general dependencies was pursued in Kernel Dependency Estimation (KDE) (Weston et al., 2003). Yet, the use of separate kernels for inputs and outputs and the use of kernel PCA with standard regression techniques significantly differs from our formulation, which is a more straightforward and natural generalization of multiclass SVMs.

2. Discriminants and Loss Functions

We are interested in the general problem of learning functions f : X → Y based on a training sample of input-output pairs. As an illustrating example, consider the case of natural language parsing, where the function f maps a given sentence x to a parse tree y. This is depicted graphically in Figure 1.

Figure 1. Illustration of natural language parsing model.

The approach we pursue is to learn a discriminant function F : X × Y → ℝ over input/output pairs from which we can derive a prediction by maximizing F over the response variable for a specific given input x. Hence, the general form of our hypotheses f is

f(x; w) = argmax_{y ∈ Y} F(x, y; w),   (1)

where w denotes a parameter vector. It might be useful to think of −F as a w-parameterized family of cost functions, which we try to design in such a way that the minimum of F(x, ·; w) is at the desired output y for inputs x of interest. Throughout this paper, we assume F to be linear in some combined feature representation of inputs and outputs Ψ(x, y),

F(x, y; w) = ⟨w, Ψ(x, y)⟩.   (2)

The specific form of Ψ depends on the nature of the problem and special cases will be discussed subsequently.
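As a minimal illustration of the prediction rule (1) with the linear discriminant (2), assume a toy output space small enough to enumerate; the feature map, labels, and weights below are hypothetical:

```python
import numpy as np

def predict(x, w, psi, Y):
    """Structured prediction: argmax over y of <w, Psi(x, y)>."""
    scores = [np.dot(w, psi(x, y)) for y in Y]
    return Y[int(np.argmax(scores))]

# Toy output space: strings over a two-letter alphabet.
Y = ["ab", "ba", "aa"]

def psi(x, y):
    # Hypothetical joint features: input value times output-symbol counts.
    return np.array([x * y.count("a"), x * y.count("b")], dtype=float)

w = np.array([1.0, -1.0])     # favors outputs rich in "a"
print(predict(2, w, psi, Y))  # "aa" maximizes the score
```

For structured spaces the argmax cannot be computed by enumeration as above; exploiting the structure of Y for this maximization is exactly the point of the later sections.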

Using again natural language parsing as an illustrative example, we can choose F such that we get a model that is isomorphic to a Probabilistic Context Free Grammar (PCFG). Each node in a parse tree y for a sentence x corresponds to a grammar rule g_j, which in turn has a score w_j. All valid parse trees y (i.e. trees with a designated start symbol S as the root and the words in the sentence x as the leaves) for a sentence x are scored by the sum of the w_j of their nodes. This score can thus be written as F(x, y; w) = ⟨w, Ψ(x, y)⟩, where Ψ(x, y) is a histogram vector counting how often each grammar rule g_j occurs in the tree y. f(x; w) can be efficiently computed by finding the structure y ∈ Y that maximizes F(x, y; w) via the CKY algorithm (see Manning and Schuetze (1999)).
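The rule-histogram feature map admits a direct sketch; the toy grammar and the flat encoding of a tree as the multiset of rules at its nodes are hypothetical simplifications (a real parser would score trees inside CKY):

```python
import numpy as np

# Hypothetical toy grammar: rules indexed j = 0..3.
RULES = ["S->NP VP", "NP->Det N", "VP->V NP", "N->dog"]

def psi(tree):
    """Histogram of grammar-rule uses in a parse tree.

    A tree is encoded simply as the list of rules at its nodes.
    """
    h = np.zeros(len(RULES))
    for rule in tree:
        h[RULES.index(rule)] += 1.0
    return h

w = np.array([0.5, 1.0, 0.2, 0.1])  # one score w_j per rule g_j
tree = ["S->NP VP", "NP->Det N", "NP->Det N", "VP->V NP", "N->dog"]
score = float(np.dot(w, psi(tree)))  # sum of w_j over the tree's nodes
print(score)  # 0.5 + 1.0 + 1.0 + 0.2 + 0.1 = 2.8
```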

Learning over structured output spaces Y inevitably involves loss functions other than the standard zero-one classification loss (cf. Weston et al. (2003)). For example, in natural language parsing, a parse tree that differs from the correct parse in a few nodes only should be treated differently from a parse tree that is radically different. Typically, the correctness of a predicted parse tree is measured by its F_1 score (see e.g. Johnson (1999)), the harmonic mean of precision and recall as calculated based on the overlap of nodes between the trees. We thus assume the availability of a bounded loss function ∆ : Y × Y → ℝ, where ∆(y, ŷ) quantifies the loss associated with a prediction ŷ if the true output value is y. If P(x, y) denotes the data generating distribution, then the goal is to find a function f within a given hypothesis class such that the risk

R_P(f) = ∫_{X×Y} ∆(y, f(x)) dP(x, y)   (3)

is minimized. We assume that P is unknown, but that a finite training set of pairs S = {(x_i, y_i) ∈ X × Y : i = 1, ..., n} generated i.i.d. according to P is given. The performance of a function f on the training sample S is described by the empirical risk R_S(f). For w-parameterized hypothesis classes, we will also write R_P(w) ≡ R_P(f(·; w)) and similarly for the empirical risk.
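For concreteness, the empirical risk R_S(f) is simply the average loss of f over the training sample; a minimal sketch, where the string-valued outputs and normalized Hamming loss are hypothetical choices:

```python
def empirical_risk(f, S, loss):
    """R_S(f): average loss of predictor f over training sample S."""
    return sum(loss(y, f(x)) for x, y in S) / len(S)

# Hypothetical setup: outputs are strings, loss = normalized Hamming distance.
def hamming(y, y_hat):
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)

S = [("x1", "aab"), ("x2", "abb")]
f = lambda x: "abb"                   # a constant (and clearly poor) predictor
print(empirical_risk(f, S, hamming))  # (1/3 + 0) / 2 = 1/6
```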

3. Margins and Margin Maximization

First, we consider the separable case in which there exists a function f parameterized by w such that the empirical risk is zero. If we assume that ∆(y, y′) > 0 for y ≠ y′ and ∆(y, y) = 0, then the condition of zero training error can be compactly written as a set of non-linear constraints

∀i : max_{y ∈ Y\y_i} {⟨w, Ψ(x_i, y)⟩} < ⟨w, Ψ(x_i, y_i)⟩.   (4)

Each nonlinear inequality in (4) can be equivalently replaced by |Y| − 1 linear inequalities, resulting in a total of n|Y| − n linear constraints,

∀i, ∀y ∈ Y\y_i : ⟨w, δΨ_i(y)⟩ > 0,   (5)

where we have defined the shorthand δΨ_i(y) ≡ Ψ(x_i, y_i) − Ψ(x_i, y).
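When Y is small enough to enumerate, the linear constraints (5) can be checked directly; the joint feature map and data below are hypothetical illustrations:

```python
import numpy as np

def delta_psi(psi, x_i, y_i, y):
    """delta Psi_i(y) = Psi(x_i, y_i) - Psi(x_i, y)."""
    return psi(x_i, y_i) - psi(x_i, y)

def separates(w, S, psi, Y):
    """True iff w satisfies <w, delta Psi_i(y)> > 0 for all i and y != y_i."""
    return all(
        np.dot(w, delta_psi(psi, x_i, y_i, y)) > 0
        for x_i, y_i in S for y in Y if y != y_i
    )

# Hypothetical multiclass-style joint features: Psi(x, y) places x in block y.
Y = [0, 1]
def psi(x, y):
    out = np.zeros(4)
    out[2 * y: 2 * y + 2] = x
    return out

S = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
print(separates(np.array([1.0, -1.0, -1.0, 1.0]), S, psi, Y))  # True
```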

If the set of inequalities in (5) is feasible, there will typically be more than one solution w*. To specify a unique solution, we propose to select the w with ‖w‖ ≤ 1 for which the score of the correct label y_i is uniformly most different from the closest runner-up ŷ_i(w) = argmax_{y ≠ y_i} ⟨w, Ψ(x_i, y)⟩. This generalizes the maximum-margin principle employed in SVMs (Vapnik, 1998) to the more general case considered in this paper. The resulting hard-margin optimization problem is

SVM_0:  min_w (1/2)‖w‖²   (6a)
∀i, ∀y ∈ Y\y_i : ⟨w, δΨ_i(y)⟩ ≥ 1.   (6b)

To allow errors in the training set, we introduce slack variables and propose to optimize a soft-margin criterion. While there are several ways of doing this, we follow Crammer and Singer (2001) and introduce one slack variable for every non-linear constraint (4), which will result in an upper bound on the empirical risk and offers some additional algorithmic advantages. Adding a penalty term that is linear in the slack variables to the objective results in the quadratic program

SVM_1:  min_{w,ξ} (1/2)‖w‖² + (C/n) Σ_{i=1}^{n} ξ_i,  s.t. ∀i : ξ_i ≥ 0,   (7a)
∀i, ∀y ∈ Y\y_i : ⟨w, δΨ_i(y)⟩ ≥ 1 − ξ_i.   (7b)

Alternatively, we can also penalize margin violations by a quadratic term (C/2n) Σ_i ξ_i², leading to an analogous optimization problem which we refer to as SVM_2. In both cases, C > 0 is a constant that controls the trade-off between training error minimization and margin maximization.

SVM_1 implicitly considers the zero-one classification loss. As argued above, this is inappropriate for problems like natural language parsing, where |Y| is large. We now propose two approaches that generalize the above formulations to the case of arbitrary loss functions ∆. Our first approach is to re-scale the slack variables according to the loss incurred in each of the linear constraints. Intuitively, violating a margin constraint involving a y ≠ y_i with high loss ∆(y_i, y) should be penalized more severely than a violation involving an output value with smaller loss. This can be accomplished by multiplying the violation by the loss, or equivalently, by scaling the slack variables with the inverse loss, which yields the problem

SVM_1^{∆s}:  min_{w,ξ} (1/2)‖w‖² + (C/n) Σ_{i=1}^{n} ξ_i,  s.t. ∀i : ξ_i ≥ 0,   (8)
∀i, ∀y ∈ Y\y_i : ⟨w, δΨ_i(y)⟩ ≥ 1 − ξ_i / ∆(y_i, y).   (9)

A justification for this formulation is given by the subsequent proposition (proof omitted).

Proposition 1. Denote by (w*, ξ*) the optimal solution to SVM_1^{∆s}. Then (1/n) Σ_{i=1}^{n} ξ*_i is an upper bound on the empirical risk R_S(w*).

The optimization problem SVM_2^{∆s} can be derived analogously, where ∆(y_i, y) is replaced by √∆(y_i, y) in order to obtain an upper bound on the empirical risk.

A second way to include loss functions is to re-scale the margin as proposed by Taskar et al. (2004) for the special case of the Hamming loss. The margin constraints in this setting take the following form:

∀i, ∀y ∈ Y\y_i : ⟨w, δΨ_i(y)⟩ ≥ ∆(y_i, y) − ξ_i.   (10)

This set of constraints yields an optimization problem SVM_1^{∆m} which also results in an upper bound on R_S(w*). In our opinion, a potential disadvantage of the margin scaling approach is that it may give significant weight to output values y ∈ Y that are not even close to being confusable with the target values y_i, because every increase in the loss increases the required margin.
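The difference between the two re-scalings can be made concrete by comparing the hinge-style violation each assigns to a single candidate output; the score gap ⟨w, δΨ_i(y)⟩ and loss value below are hypothetical numbers:

```python
def slack_rescaled_violation(score_diff, loss):
    """Violation implied by constraint (9): loss * max(0, 1 - score_diff)."""
    return loss * max(0.0, 1.0 - score_diff)

def margin_rescaled_violation(score_diff, loss):
    """Violation implied by constraint (10): max(0, loss - score_diff)."""
    return max(0.0, loss - score_diff)

# A far-away output (high loss) that already has a comfortable score gap:
score_diff, loss = 1.5, 4.0            # <w, delta Psi_i(y)> and Delta(y_i, y)
print(slack_rescaled_violation(score_diff, loss))   # 0.0: gap >= 1 suffices
print(margin_rescaled_violation(score_diff, loss))  # 2.5: still penalized
```

This mirrors the concern above: under margin re-scaling, the high-loss output keeps contributing slack even though it is far from being confused with the target.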

4. Support Vector Machine Learning

The key challenge in solving the QPs for the generalized SVM learning is the large number of margin constraints; more specifically, the total number of constraints is n|Y|. In many cases, |Y| may be extremely large, in particular if Y is a product space of some sort (e.g. in grammar learning, label sequence learning, etc.). This makes standard quadratic programming solvers unsuitable for this type of problem.

In the following, we propose an algorithm that exploits the special structure of the maximum-margin problem, so that only a much smaller subset of constraints needs to be explicitly examined. The algorithm is a generalization of the SVM algorithm for label sequence learning (Hofmann et al., 2002; Altun et al., 2003) and the algorithm for inverse sequence alignment (Joachims, 2003). We will show how to compute arbitrarily close approximations to all of the above SVM optimization problems in polynomial time for a large range of structures and loss functions. Since the algorithm operates on the dual program, we will first derive the Wolfe dual for the various soft margin formulations.

4.1. Dual Programs

We will denote by α_iy the Lagrange multiplier enforcing the margin constraint for label y ≠ y_i and example (x_i, y_i). Using standard Lagrangian duality techniques, one arrives at the following dual QP for the hard margin case SVM_0:

max_α Σ_{i, y≠y_i} α_iy − (1/2) Σ_{i, y≠y_i} Σ_{j, ȳ≠y_j} α_iy α_jȳ ⟨δΨ_i(y), δΨ_j(ȳ)⟩   (11a)
s.t. ∀i, ∀y ∈ Y\y_i : α_iy ≥ 0.   (11b)

A kernel K((x, y), (x′, y′)) can be used to replace the inner products, since inner products in δΨ can be easily expressed as inner products of the original Ψ-vectors.
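Concretely, expanding δΨ_i(y) = Ψ(x_i, y_i) − Ψ(x_i, y) by bilinearity turns each inner product in (11a) into four joint-kernel evaluations. A sketch, using a linear joint kernel induced by an explicit (hypothetical) feature map so that the identity can be verified:

```python
import numpy as np

def delta_inner(K, xi, yi, y, xj, yj, ybar):
    """<delta Psi_i(y), delta Psi_j(ybar)> via the joint kernel K."""
    return (K((xi, yi), (xj, yj)) - K((xi, yi), (xj, ybar))
            - K((xi, y), (xj, yj)) + K((xi, y), (xj, ybar)))

# Hypothetical explicit feature map and its induced linear joint kernel.
def psi(x, y):
    return np.array([x * y, x + y, float(y)])

K = lambda a, b: float(np.dot(psi(*a), psi(*b)))

# Sanity check: the kernel expansion matches the explicit difference vectors.
lhs = delta_inner(K, 1.0, 2, 3, 2.0, 1, 0)
rhs = float(np.dot(psi(1.0, 2) - psi(1.0, 3), psi(2.0, 1) - psi(2.0, 0)))
print(abs(lhs - rhs) < 1e-9)  # True
```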

For soft-margin optimization with slack re-scaling and linear penalties (SVM_1^{∆s}), additional box constraints

n Σ_{y≠y_i} α_iy / ∆(y_i, y) ≤ C,  ∀i   (12)

are added to the dual. Quadratic slack penalties (SVM_2^{∆s}) lead to the same dual as SVM_0 after altering the inner product to ⟨δΨ_i(y), δΨ_j(ȳ)⟩ + δ_ij n / (C √∆(y_i, y) √∆(y_j, ȳ)), where δ_ij = 1 if i = j, and 0 otherwise.

Finally, in the case of margin re-scaling, the loss function affects the linear part of the objective function, max_α Σ_{i,y} α_iy ∆(y_i, y) − Q(α) (where the quadratic part Q is unchanged from (11a)), and introduces standard box constraints n Σ_{y≠y_i} α_iy ≤ C.

4.2. Algorithm

The algorithm we propose aims at finding a small set of active constraints that ensures a sufficiently accurate solution. More precisely, it creates a nested sequence of successively tighter relaxations of the original problem using a cutting plane method. The latter is implemented as a variable selection approach in the dual formulation. We will show that this is a valid strategy, since there always exists a polynomially-sized subset of constraints so that the corresponding solution fulfills all constraints with a precision of at least ε. This means that the remaining, potentially exponentially many, constraints are guaranteed to be violated by no more than ε, without the need for explicitly adding them to the optimization problem.

We will base the optimization on the dual program formulation, which has two important advantages over the primal QP. First, it only depends on inner products in the joint feature space defined by Ψ, hence allowing the use of kernel functions. Second, the constraint matrix of the dual program (for the L_1-SVMs) supports a natural problem decomposition, since it is block diagonal, where each block corresponds to a specific training instance.

Pseudocode of the algorithm is depicted in Algorithm 1. The algorithm applies to all SVM formulations discussed above. The only difference is in the way the cost function gets set up in step 5. The algorithm maintains a working set S_i for each training example (x_i, y_i) to keep track of the selected constraints which define the current relaxation. Iterating through the training examples (x_i, y_i), the algorithm proceeds by

Algorithm 1 Algorithm for solving SVM_0 and the loss re-scaling formulations SVM_1^{∆s} and SVM_2^{∆s}

1: Input: (x_1, y_1), ..., (x_n, y_n), C, ε
2: S_i ← ∅ for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     set up cost function
       SVM_1^{∆s}: H(y) ≡ (1 − ⟨δΨ_i(y), w⟩) ∆(y_i, y)
       SVM_2^{∆s}: H(y) ≡ (1 − ⟨δΨ_i(y), w⟩) √∆(y_i, y)
       SVM_1^{∆m}: H(y) ≡ ∆(y_i, y) − ⟨δΨ_i(y), w⟩
       SVM_2^{∆m}: H(y) ≡ √∆(y_i, y) − ⟨δΨ_i(y), w⟩
       where w ≡ Σ_j Σ_{y′ ∈ S_j} α_jy′ δΨ_j(y′)
6:     compute ŷ = argmax_{y ∈ Y} H(y)
7:     compute ξ_i = max{0, max_{y ∈ S_i} H(y)}
8:     if H(ŷ) > ξ_i + ε then
9:       S_i ← S_i ∪ {ŷ}
10:      α_S ← optimize dual over S, S = ∪_i S_i
11:    end if
12:   end for
13: until no S_i has changed during iteration

finding the (potentially) "most violated" constraint, involving some output value ŷ (line 6). If the (appropriately scaled) margin violation of this constraint exceeds the current value of ξ_i by more than ε (line 8), the dual variable corresponding to ŷ is added to the working set (line 9). This variable selection process in the dual program corresponds to a successive strengthening of the primal problem by a cutting plane that cuts off the current primal solution from the feasible set. The chosen cutting plane corresponds to the constraint that determines the lowest feasible value for ξ_i. Once a constraint has been added, the solution is recomputed wrt. S (line 10). Alternatively, we have also devised a scheme where the optimization is restricted to S_i only, and where optimization over the full S is performed much less frequently. This can be beneficial due to the block diagonal structure of the optimization problems, which implies that variables α_jy with j ≠ i, y ∈ S_j can simply be "frozen" at their current values. Notice that all variables not included in their respective working set are implicitly treated as 0. The algorithm stops if no constraint is violated by more than ε. The presented algorithm is implemented and available¹ as part of SVMlight. Note that the SVM optimization problems from iteration to iteration differ only by a single constraint. We therefore restart the SVM optimizer from the current solution, which greatly reduces the runtime. A convenient property of both algorithms is that they have a very general and well-defined interface independent of the choice of Ψ and ∆. To apply the algorithm, it is sufficient to implement the feature mapping Ψ(x, y) (either explicitly or via a joint kernel function), the loss function ∆(y_i, y), as well as the maximization in step 6. All of those, in particular the constraint/cut selection method, are treated as black boxes. While the modeling of Ψ(x, y) and ∆(y_i, y) is more or less straightforward, solving the maximization problem for constraint selection typically requires exploiting the structure of Ψ.

¹ http://svmlight.joachims.org/
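As an illustrative sketch (not the SVMlight implementation), the working-set loop of Algorithm 1 can be run end-to-end on a toy problem. The sketch covers the hard-margin case SVM_0, where H(y) = 1 − ⟨δΨ_i(y), w⟩, and replaces the QP solver of line 10 with plain coordinate ascent on the restricted dual (11); the feature map and data are hypothetical:

```python
import numpy as np

def train_struct_svm0(S_train, psi, Y, eps=1e-3, max_outer=50):
    """Cutting-plane training for the hard-margin problem SVM_0.

    Maintains working sets of (i, y) constraints; the restricted dual (11)
    is re-solved by coordinate ascent, a simple stand-in for a QP solver.
    """
    dim = len(psi(*S_train[0]))
    work, alpha, dpsi = [], [], []      # constraint ids, duals, delta-Psi

    def w_of(al):
        return sum((a * f for a, f in zip(al, dpsi)), np.zeros(dim))

    def solve_restricted(sweeps=200):
        # Coordinate ascent using the closed-form single-variable update
        # of Lemma 1, clipped at alpha >= 0.
        J = np.array([[f @ g for g in dpsi] for f in dpsi])
        al = np.array(alpha)
        for _ in range(sweeps):
            for r in range(len(al)):
                if J[r, r] > 0:
                    al[r] = max(0.0, al[r] + (1.0 - J[r] @ al) / J[r, r])
        return list(al)

    for _ in range(max_outer):
        changed = False
        for i, (x, y_true) in enumerate(S_train):
            w = w_of(alpha)
            # line 6: most violated constraint, H(y) = 1 - <delta Psi_i(y), w>
            H = {y: 1.0 - w @ (psi(x, y_true) - psi(x, y))
                 for y in Y if y != y_true}
            y_hat = max(H, key=H.get)
            xi = max([0.0] + [H[y] for j, y in work if j == i])  # line 7
            if H[y_hat] > xi + eps:                              # line 8
                work.append((i, y_hat))                          # line 9
                dpsi.append(psi(x, y_true) - psi(x, y_hat))
                alpha.append(0.0)
                alpha = solve_restricted()                       # line 10
                changed = True
        if not changed:
            break
    return w_of(alpha)

# Hypothetical toy task: 3-class classification with block joint features.
Y = [0, 1, 2]
def psi(x, y):
    out = np.zeros(6)
    out[2 * y: 2 * y + 2] = x
    return out

S_train = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1),
           (np.array([-1.0, -1.0]), 2)]
w = train_struct_svm0(S_train, psi, Y)
pred = lambda x: max(Y, key=lambda y: w @ psi(x, y))
print([pred(x) for x, _ in S_train])  # recovers [0, 1, 2]
```

Here the argmax in line 6 is done by enumeration, which is exactly what a structured application would replace with, e.g., CKY or Viterbi decoding.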

4.3. Analysis

It is straightforward to show that the algorithm finds a solution that is close to optimal (e.g. for SVM_1^{∆s}, adding ε to each ξ_i yields a feasible point of the primal at most Cε from the optimum). However, it is not immediately obvious how fast the algorithm converges. We will show in the following that the algorithm converges in polynomial time for a large class of problems, despite a possibly exponential or infinite |Y|.

Let us begin with an elementary lemma that will be helpful for proving subsequent results. It quantifies how the dual objective changes if one optimizes over a single variable.

Lemma 1. Let J be a positive definite matrix and let us define a concave quadratic program

W(α) = −(1/2) αᵀJα + ⟨h, α⟩  s.t. α ≥ 0,

and assume α ≥ 0 is given with α_r = 0. Then maximizing W with respect to α_r while keeping all other components fixed will increase the objective by

(h_r − Σ_s α_s J_rs)² / (2 J_rr),

provided that h_r ≥ Σ_s α_s J_rs.

Proof. Denote by α[α_r ← β] the solution α with the r-th coefficient changed to β. Then

W(α[α_r ← β]) − W(α) = β (h_r − Σ_s α_s J_rs) − (β²/2) J_rr.

The difference is maximized for

β* = (h_r − Σ_s α_s J_rs) / J_rr.

Notice that β* ≥ 0, since h_r ≥ Σ_s α_s J_rs and J_rr > 0.
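Lemma 1 can be checked numerically; in the sketch below, J, h, and α are arbitrary illustrative choices satisfying the lemma's assumptions:

```python
import numpy as np

J = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
h = np.array([1.0, 2.0])

def W(a):
    return -0.5 * a @ J @ a + h @ a

a = np.array([0.8, 0.0])                 # alpha_r = 0 for r = 1
r = 1
beta_star = (h[r] - J[r] @ a) / J[r, r]  # optimal single-variable step
a_new = a.copy()
a_new[r] = beta_star

predicted_gain = (h[r] - J[r] @ a) ** 2 / (2 * J[r, r])
print(np.isclose(W(a_new) - W(a), predicted_gain))  # True
```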

Using this lemma, we can lower bound the improvement of the dual objective in step 10 of Algorithm 1. For brevity, let us focus on the case of SVM_2^{∆s}. Similar results can be derived also for the other variants.

Proposition 2. Define ∆_i = max_y ∆(y_i, y) and R_i = max_y ‖δΨ_i(y)‖. Then step 10 in Algorithm 1 improves the dual objective for SVM_2^{∆s} at least by (1/2) ε² (∆_i R_i² + n/C)⁻¹.

Proof. Using the notation in Algorithm 1, one can apply Lemma 1 with r = (i, ŷ) denoting the newly added constraint, h_r = 1, J_rr = ‖δΨ_i(ŷ)‖² + n/(C ∆(y_i, ŷ)), and

Σ_s α_s J_rs = ⟨w, δΨ_i(ŷ)⟩ + Σ_{y≠y_i} α_iy n / (C √∆(y_i, ŷ) √∆(y_i, y)).

Note that α_r = 0. Using the fact that Σ_{y≠y_i} n α_iy / (C √∆(y_i, y)) = ξ_i, Lemma 1 shows the following increase of the objective function when optimizing over α_r alone:

(1 − ⟨w, δΨ_i(ŷ)⟩ − Σ_{y≠y_i} α_iy n / (C √∆(y_i, ŷ) √∆(y_i, y)))² / (2 (‖δΨ_i(ŷ)‖² + n/(C ∆(y_i, ŷ)))) ≥ ε² / (2 (‖δΨ_i(ŷ)‖² ∆(y_i, ŷ) + n/C)).

The step follows from the fact that ξ_i ≥ 0 and √∆(y_i, ŷ) (1 − ⟨w, δΨ_i(ŷ)⟩) > ξ_i + ε, which is the condition of step 8. Replacing the quantities in the denominator by their upper limit proves the claim, since jointly optimizing over more variables than just α_r can only further increase the dual objective.

This leads to the following polynomial bound on the maximum size of S.

Theorem 1. With R̄ = max_i R_i, ∆̄ = max_i ∆_i and for a given ε > 0, Algorithm 1 for the SVM_2^{∆s} terminates after incrementally adding at most ε⁻² (C ∆̄² R̄² + n ∆̄) constraints to the working set S.

Proof. With S = ∅ the optimal value of the dual is 0. In each iteration a constraint (i, y) is added that is violated by at least ε, provided such a constraint exists. After solving the S-relaxed QP in step 10, the objective will increase by at least (1/2) ε² (∆̄ R̄² + n/C)⁻¹ according to Proposition 2. Hence after t constraints, the dual objective will be at least t times this amount. The result follows from the fact that the dual objective is upper bounded by the minimum of the primal, which in turn can be bounded by (1/2) C ∆̄.

Note that the number of constraints in S does not depend on |Y|. This is crucial, since |Y| is exponential or infinite for many interesting problems. For problems where step 6 can be computed in polynomial time, the overall algorithm has a runtime polynomial in n, R̄, ∆̄, 1/ε, since at least one constraint will be added while