
Support vector machine learning for interdependent and structured output spaces

TL;DR: This paper proposes to generalize multiclass Support Vector Machine learning in a formulation that involves features extracted jointly from inputs and outputs, and demonstrates the versatility and effectiveness of the method on problems ranging from supervised grammar learning and named-entity recognition, to taxonomic text classification and sequence alignment.
Abstract: Learning general functional dependencies is one of the main goals in machine learning. Recent progress in kernel-based methods has focused on designing flexible and powerful input representations. This paper addresses the complementary issue of problems involving complex outputs such as multiple dependent output variables and structured output spaces. We propose to generalize multiclass Support Vector Machine learning in a formulation that involves features extracted jointly from inputs and outputs. The resulting optimization problem is solved efficiently by a cutting plane algorithm that exploits the sparseness and structural decomposition of the problem. We demonstrate the versatility and effectiveness of our method on problems ranging from supervised grammar learning and named-entity recognition, to taxonomic text classification and sequence alignment.

Summary (4 min read)

1. Introduction

  • This paper deals with the general problem of learning a mapping from inputs x ∈ X to discrete outputs y ∈ Y.
  • The authors overcome this problem by specifying discriminant functions that exploit the structure and dependencies within Y.
  • The maximum margin algorithm the authors propose has advantages in terms of accuracy and tunability to specific loss functions.
  • A similar philosophy of using kernel methods for learning general dependencies was pursued in Kernel Dependency Estimation (KDE) (Weston et al., 2003).

2. Discriminants and Loss Functions

  • This score can thus be written as F (x,y;w) = 〈w,Ψ(x,y)〉, where Ψ(x,y) is a histogram vector counting how often each grammar rule gj occurs in the tree y. f(x;w) can be efficiently computed by finding the structure y ∈ Y that maximizes F (x,y;w) via the CKY algorithm (see Manning and Schuetze (1999)).
  • Learning over structured output spaces Y inevitably involves loss functions other than the standard zeroone classification loss (cf. Weston et al. (2003)).
  • In natural language parsing, a parse tree that differs from the correct parse in a few nodes only should be treated differently from a parse tree that is radically different.
  • Typically, the correctness of a predicted parse tree is measured by its F1 score (see e.g. Johnson (1999)), the harmonic mean of precision and recall as calculated based on the overlap of nodes between the trees.
  • For w-parameterized hypothesis classes, the authors will also write R_P(w) ≡ R_P(f(·;w)) and similarly for the empirical risk.

3. Margins and Margin Maximization

  • If the set of inequalities in (5) is feasible, there will typically be more than one solution w∗.
  • This generalizes the maximum-margin principle employed in SVMs (Vapnik, 1998) to the more general case considered in this paper.
  • While there are several ways of doing this, the authors follow Crammer and Singer (2001) and introduce one slack variable for every non-linear constraint (4), which will result in an upper bound on the empirical risk and offers some additional algorithmic advantages.
  • As argued above, this is inappropriate for problems like natural language parsing, where |Y| is large.
  • In their opinion, a potential disadvantage of the margin scaling approach is that it may give significant weight to output values y ∈ Y that are not even close to being confusable with the target values yi, because every increase in the loss increases the required margin.

4. Support Vector Machine Learning

  • The key challenge in solving the QPs for the generalized SVM learning is the large number of margin constraints; more specifically the total number of constraints is n|Y|.
  • This makes standard quadratic programming solvers unsuitable for this type of problem.
  • The algorithm is a generalization of the SVM algorithm for label sequence learning (Hofmann et al., 2002; Altun et al., 2003) and the algorithm for inverse sequence alignment (Joachims, 2003).
  • The authors will show how to compute arbitrarily close approximations to all of the above SVM optimization problems in polynomial time for a large range of structures and loss functions.
  • Since the algorithm operates on the dual program, the authors will first derive the Wolfe dual for the various soft margin formulations.

4.1. Dual Programs

  • The authors will denote by α_iy the Lagrange multiplier enforcing the margin constraint for label y ≠ y_i and example (x_i, y_i).
  • For soft-margin optimization with slack re-scaling and linear penalties (SVM^△s_1), additional box constraints n Σ_{y≠y_i} α_iy/△(y_i, y) ≤ C, ∀i (12) are added to the dual.
  • Finally, in the case of margin re-scaling, the loss function affects the linear part of the objective function, max_α Σ_{i,y} α_iy △(y_i, y) − Q(α) (where the quadratic part Q is unchanged from (11a)), and introduces standard box constraints n Σ_{y≠y_i} α_iy ≤ C.

4.2. Algorithm

  • The algorithm the authors propose aims at finding a small set of active constraints that ensures a sufficiently accurate solution.
  • The algorithm applies to all SVM formulations discussed above.
  • The algorithm maintains a working set Si for each training example (xi,yi) to keep track of the selected constraints which define the current relaxation.
  • This variable selection process in the dual program corresponds to a successive strengthening of the primal problem by a cutting plane that cuts off the current primal solution from the feasible set.
  • Notice that all variables not included in their respective working set are implicitly treated as 0.

4.3. Analysis

  • It is not immediately obvious how fast the algorithm converges.
  • The authors will show in the following that the algorithm converges in polynomial time for a large class of problems, despite a possibly exponential or infinite |Y|. Let us begin with an elementary Lemma that will be helpful for proving subsequent results.
  • Similar results can be derived also for the other variants.
  • This leads to the following polynomial bound on the maximum size of S (Theorem 1).
  • Hence after t constraints, the dual objective will be at least t times this amount.

5.1. Multiclass Classification

  • The authors' algorithm can implement the conventional winner-takes-all (WTA) multiclass classification (Crammer & Singer, 2001) as follows.
  • These discriminant functions can be equivalently represented in the proposed framework by defining a joint feature map as follows Ψ(x,y) ≡ Φ(x)⊗ Λc(y).
  • Here Λc refers to the orthogonal encoding of the label y and ⊗ is the tensor product which forms all products between coefficients of the two argument vectors.
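The tensor-product construction above can be sketched in a few lines. The feature vector Φ(x), the weights w, and the class count below are invented for the example; this is an illustration of the joint feature map, not the paper's implementation:

```python
import numpy as np

def joint_feature_map(phi_x, y, num_classes):
    """Psi(x, y) = Phi(x) (tensor) Lambda_c(y), where Lambda_c is the
    orthogonal (one-hot) encoding of the label y."""
    lam = np.zeros(num_classes)
    lam[y] = 1.0
    return np.kron(lam, phi_x)  # places Phi(x) into the block for class y

# Scoring F(x, y; w) = <w, Psi(x, y)> recovers one weight vector per class.
phi = np.array([1.0, 2.0])        # toy input features (assumed)
w = np.concatenate([[0.5, 0.0],   # class-0 block
                    [0.0, 1.0],   # class-1 block
                    [1.0, 1.0]])  # class-2 block
scores = [w @ joint_feature_map(phi, y, 3) for y in range(3)]
prediction = int(np.argmax(scores))  # winner-takes-all
```

Because Ψ stacks Φ(x) into the block selected by y, maximizing 〈w, Ψ(x,y)〉 over y is exactly the WTA rule over per-class weight vectors.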

5.2. Classification with Taxonomies

  • The first generalization the authors propose is to make use of more interesting output features Λ than the orthogonal representation Λc. Notice that 〈Λ(y),Λ(y′)〉 will count the number of common predecessors.
  • The authors have performed experiments using a document collection released by the World Intellectual Property Organization (WIPO), which uses the International Patent Classification (IPC) scheme.
  • The authors have furthermore subsampled the training data to investigate the effect of the training set size.
  • Document parsing, tokenization and term normalization have been performed with the MindServer retrieval engine (http://www.recommind.com).
  • The results are summarized in Table 1 and show that the proposed hierarchical SVM learning architecture improves performance over the standard multiclass SVM in terms of classification accuracy as well as in terms of the tree loss.
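The ancestor-counting inner product 〈Λ(y), Λ(y′)〉 can be illustrated with a hypothetical mini-taxonomy (the labels and tree below are invented for the example, not taken from the WIPO/IPC data):

```python
# Each label maps to its path of predecessors in the taxonomy (including
# itself); Lambda(y) is an indicator vector over all taxonomy nodes.
taxonomy = {
    "cat": ["root", "animal", "mammal", "cat"],
    "dog": ["root", "animal", "mammal", "dog"],
    "oak": ["root", "plant", "tree", "oak"],
}
nodes = sorted({n for path in taxonomy.values() for n in path})

def encode(label):
    path = set(taxonomy[label])
    return [1.0 if n in path else 0.0 for n in nodes]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# <Lambda(y), Lambda(y')> counts the common predecessors of y and y':
shared_cat_dog = dot(encode("cat"), encode("dog"))  # root, animal, mammal
shared_cat_oak = dot(encode("cat"), encode("oak"))  # root only
```

Labels that share many ancestors thus get similar representations, which is what lets the hierarchical SVM exploit the taxonomy.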

5.3. Label Sequence Learning

  • It subsumes problems like segmenting or annotating observation sequences and has widespread applications in optical character recognition, natural language processing, information extraction, and computational biology.
  • The label set in this corpus consists of non-name and the beginning and continuation of person names, organizations, locations and miscellaneous names, resulting in a total of |Σ| = 9 different labels.
  • In the setup followed in Altun et al. (2003), the joint feature map Ψ(x,y) is the histogram of state transition plus a set of features describing the emissions.
  • The results given in Table 2 for the zero-one loss compare the generative HMM with Conditional Random Fields (CRF) (Lafferty et al., 2001), Collins' perceptron and the SVM algorithm.
  • In addition, the SVM performs slightly better than the perceptron and CRFs, demonstrating the benefit of a large-margin approach.
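A minimal sketch of such a joint feature map for label sequences, with a toy token/label alphabet and a sparse weight dict (all names and values invented; the paper's actual feature set is richer):

```python
from collections import Counter

def sequence_features(tokens, labels):
    """Hypothetical joint feature map Psi(x, y): a histogram of
    label-to-label transitions plus (token, label) emission counts."""
    feats = Counter()
    for a, b in zip(labels, labels[1:]):
        feats[("trans", a, b)] += 1
    for tok, lab in zip(tokens, labels):
        feats[("emit", tok, lab)] += 1
    return feats

x = ["John", "lives", "in", "Paris"]
y = ["B-PER", "O", "O", "B-LOC"]
psi = sequence_features(x, y)
w = {("trans", "B-PER", "O"): 0.5, ("emit", "Paris", "B-LOC"): 2.0}  # toy weights
score = sum(w.get(f, 0.0) * c for f, c in psi.items())  # F(x, y; w)
```

Because F decomposes over adjacent label pairs, the argmax over label sequences can be computed by Viterbi-style dynamic programming rather than enumeration.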

5.4. Sequence Alignment

  • For a given pair of sequences x and z, alignment methods like the Smith-Waterman algorithm select the sequence of operations (e.g. insertion, substitution) â(x, z) = argmax_{a∈A} 〈w, Ψ(x, z, a)〉 that transforms x into z and that maximizes a linear objective function derived from the operation costs w. Ψ(x, z, a) is the histogram of alignment operations.
  • For each native sequence xi there is a most similar homologue sequence zi along with what is believed to be the (close to) optimal alignment ai.
  • The authors use the Smith-Waterman algorithm to implement the max_a operation.
  • The results are averaged over 10 train/test samples.
  • For the generative model, the authors report the results for δ = −0.2, which performs best on the test set.

5.5. Natural Language Parsing

  • The authors test the feasibility of their approach for learning a weighted context-free grammar on a subset of the Penn Treebank Wall Street Journal corpus.
  • All values of C between 10^−1 and 10^2 gave comparable results.
  • While the zero-one loss achieves better accuracy (i.e. predicting the complete tree correctly), the F1 score is only marginally better.
  • The authors conjecture that they can achieve further gains by incorporating more complex features into the grammar, which would be impossible or at best awkward to use in a generative PCFG model.
  • The re-scaling formulations lose time mostly on the argmax in line 6.

6. Conclusions

  • The authors formulated a Support Vector Method for supervised learning with structured and interdependent outputs.
  • It is based on a joint feature map over input/output pairs, which covers a large class of interesting models including weighted context-free grammars, hidden Markov models, and sequence alignment.
  • Furthermore, the approach is very flexible in its ability to handle application specific loss functions.
  • To solve the resulting optimization problems, the authors proposed a simple and general algorithm for which they prove convergence bounds.
  • Furthermore, the authors show that the generalization accuracy of their method is at least comparable or often exceeds conventional approaches for a wide range of problems.


Support Vector Machine Learning for
Interdependent and Structured Output Spaces
Ioannis Tsochantaridis it@cs.brown.edu
Thomas Hofmann th@cs.brown.edu
Department of Computer Science, Brown University, Providence, RI 02912
Thorsten Joachims tj@cs.cornell.edu
Department of Computer Science, Cornell University, Ithaca, NY 14853
Yasemin Altun altun@cs.brown.edu
Department of Computer Science, Brown University, Providence, RI 02912
Abstract
Learning general functional dependencies is
one of the main goals in machine learning.
Recent progress in kernel-based methods has
focused on designing flexible and powerful in-
put representations. This paper addresses
the complementary issue of problems involv-
ing complex outputs such as multiple depen-
dent output variables and structured output
spaces. We propose to generalize multiclass
Support Vector Machine learning in a formu-
lation that involves features extracted jointly
from inputs and outputs. The resulting op-
timization problem is solved efficiently by
a cutting plane algorithm that exploits the
sparseness and structural decomposition of
the problem. We demonstrate the versatility
and effectiveness of our method on problems
ranging from supervised grammar learning
and named-entity recognition, to taxonomic
text classification and sequence alignment.
1. Introduction
This paper deals with the general problem of learning a mapping from inputs x ∈ X to discrete outputs y ∈ Y based on a training sample of input-output pairs (x_1, y_1), ..., (x_n, y_n) ∈ X × Y drawn from some fixed but unknown probability distribution. Unlike the case of multiclass classification where Y = {1, ..., k} with interchangeable, arbitrarily numbered labels, we consider structured output spaces Y. Elements y ∈ Y may be, for instance, sequences, strings, labeled trees,
Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright by the authors.
lattices, or graphs. Such problems arise in a variety of
applications, ranging from multilabel classification and
classification with class taxonomies, to label sequence
learning, sequence alignment learning, and supervised
grammar learning, to name just a few.
We approach these problems by generalizing large
margin methods, more specifically multi-class Support
Vector Machines (SVMs) (Weston & Watkins, 1998;
Crammer & Singer, 2001), to the broader problem of
learning structured responses. The naive approach of
treating each structure as a separate class is often in-
tractable, since it leads to a multiclass problem with a
very large number of classes. We overcome this prob-
lem by specifying discriminant functions that exploit
the structure and dependencies within Y. In that re-
spect, our approach follows the work of Collins (2002;
2004) on perceptron learning with a similar class of
discriminant functions. However, the maximum mar-
gin algorithm we propose has advantages in terms of
accuracy and tunability to specific loss functions. A
similar philosophy of using kernel methods for learning
general dependencies was pursued in Kernel Depen-
dency Estimation (KDE) (Weston et al., 2003). Yet,
the use of separate kernels for inputs and outputs and
the use of kernel PCA with standard regression tech-
niques significantly differs from our formulation, which
is a more straightforward and natural generalization of
multiclass SVMs.
2. Discriminants and Loss Functions
We are interested in the general problem of learning
functions f : X→Ybased on a training sample of
input-output pairs. As an illustrating example, con-
sider the case of natural language parsing, where the
function f maps a given sentence x to a parse tree

Figure 1. Illustration of natural language parsing model.
y. This is depicted graphically in Figure 1. The approach we pursue is to learn a discriminant function F : X × Y → ℝ over input/output pairs from which we can derive a prediction by maximizing F over the response variable for a specific given input x. Hence, the general form of our hypotheses f is

f(x; w) = argmax_{y∈Y} F(x, y; w), (1)

where w denotes a parameter vector. It might be useful to think of −F as a w-parameterized family of cost functions, which we try to design in such a way that the minimum of −F(x, ·; w) is at the desired output y for inputs x of interest. Throughout this paper, we assume F to be linear in some combined feature representation of inputs and outputs Ψ(x, y),

F(x, y; w) = 〈w, Ψ(x, y)〉. (2)

The specific form of Ψ depends on the nature of the problem and special cases will be discussed subsequently.
Using again natural language parsing as an illustrative example, we can choose F such that we get a model that is isomorphic to a Probabilistic Context Free Grammar (PCFG). Each node in a parse tree y for a sentence x corresponds to a grammar rule g_j, which in turn has a score w_j. All valid parse trees y (i.e. trees with a designated start symbol S as the root and the words in the sentence x as the leaves) for a sentence x are scored by the sum of the w_j of their nodes. This score can thus be written as F(x, y; w) = 〈w, Ψ(x, y)〉, where Ψ(x, y) is a histogram vector counting how often each grammar rule g_j occurs in the tree y. f(x; w) can be efficiently computed by finding the structure y ∈ Y that maximizes F(x, y; w) via the CKY algorithm (see Manning and Schuetze (1999)).
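As a concrete sketch of this scoring scheme, the rule histogram Ψ and the linear score F can be computed as below. The tuple encoding of trees and the toy weights are assumptions made for illustration; the CKY argmax over trees is not shown:

```python
from collections import Counter

# A parse tree as nested tuples: (symbol, children...); leaves are words.
def rule_histogram(tree):
    """Psi(x, y): counts how often each grammar rule occurs in the tree."""
    hist = Counter()
    def walk(node):
        if isinstance(node, tuple):
            sym, *children = node
            rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
            hist[(sym, rhs)] += 1
            for c in children:
                walk(c)
    walk(tree)
    return hist

def score(tree, w):
    """F(x, y; w) = <w, Psi(x, y)> as a sparse dot product over rules."""
    return sum(w.get(rule, 0.0) * n for rule, n in rule_histogram(tree).items())

tree = ("S", ("NP", "dogs"), ("VP", "bark"))
w = {("S", ("NP", "VP")): 1.0, ("NP", ("dogs",)): 0.5, ("VP", ("bark",)): 0.25}
total = score(tree, w)
```

The score is the sum of the rule weights over all nodes, exactly the linear form F(x, y; w) = 〈w, Ψ(x, y)〉 above.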
Learning over structured output spaces Y inevitably involves loss functions other than the standard zero-one classification loss (cf. Weston et al. (2003)). For example, in natural language parsing, a parse tree that differs from the correct parse in a few nodes only should be treated differently from a parse tree that is radically different. Typically, the correctness of a predicted parse tree is measured by its F1 score (see e.g. Johnson (1999)), the harmonic mean of precision and recall as calculated based on the overlap of nodes between the trees. We thus assume the availability of a bounded loss function △ : Y × Y → ℝ, where △(y, ŷ) quantifies the loss associated with a prediction ŷ, if the true output value is y. If P(x, y) denotes the data generating distribution, then the goal is to find a function f within a given hypothesis class such that the risk

R_P(f) = ∫_{X×Y} △(y, f(x)) dP(x, y) (3)

is minimized. We assume that P is unknown, but that a finite training set of pairs S = {(x_i, y_i) ∈ X × Y : i = 1, ..., n} generated i.i.d. according to P is given. The performance of a function f on the training sample S is described by the empirical risk R_S(f). For w-parameterized hypothesis classes, we will also write R_P(w) ≡ R_P(f(·; w)) and similarly for the empirical risk.
3. Margins and Margin Maximization
First, we consider the separable case in which there exists a function f parameterized by w such that the empirical risk is zero. If we assume that △(y, y′) > 0 for y ≠ y′ and △(y, y) = 0, then the condition of zero training error can then be compactly written as a set of non-linear constraints

∀i: max_{y∈Y\y_i} {〈w, Ψ(x_i, y)〉} < 〈w, Ψ(x_i, y_i)〉. (4)

Each nonlinear inequality in (4) can be equivalently replaced by |Y| − 1 linear inequalities, resulting in a total of n|Y| − n linear constraints,

∀i, ∀y ∈ Y\y_i: 〈w, δΨ_i(y)〉 > 0, (5)

where we have defined the shorthand δΨ_i(y) ≡ Ψ(x_i, y_i) − Ψ(x_i, y).
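For a small, enumerable output set, the linear constraints (5) can be checked directly. The explicit Ψ and the toy data below are invented for illustration; in general |Y| is far too large to enumerate:

```python
import numpy as np

def delta_psi(psi, x_i, y_i, y):
    """delta-Psi_i(y) = Psi(x_i, y_i) - Psi(x_i, y)."""
    return psi(x_i, y_i) - psi(x_i, y)

def separation_holds(w, psi, data, outputs):
    """Check constraints (5): <w, delta-Psi_i(y)> > 0 for every training
    pair and every competing output y != y_i (toy enumeration of Y)."""
    return all(w @ delta_psi(psi, x_i, y_i, y) > 0
               for x_i, y_i in data
               for y in outputs if y != y_i)

def psi(x, y):  # toy multiclass map: Phi(x) placed in block y (assumed)
    out = np.zeros(6)
    out[2 * y:2 * y + 2] = x
    return out

data = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
w_good = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
ok = separation_holds(w_good, psi, data, [0, 1, 2])
bad = separation_holds(np.zeros(6), psi, data, [0, 1, 2])  # 0 > 0 fails
```

Here `w_good` scores the correct block highest for both examples, so all constraints hold; the zero vector violates every strict inequality.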
If the set of inequalities in (5) is feasible, there will typically be more than one solution w*. To specify a unique solution, we propose to select the w with ‖w‖ ≤ 1 for which the score of the correct label y_i is uniformly most different from the closest runner-up ŷ_i(w) = argmax_{y≠y_i} 〈w, Ψ(x_i, y)〉. This generalizes the maximum-margin principle employed in SVMs (Vapnik, 1998) to the more general case considered in this paper. The resulting hard-margin optimization problem is

SVM_0: min_w (1/2)‖w‖² (6a)
∀i, ∀y ∈ Y\y_i: 〈w, δΨ_i(y)〉 ≥ 1. (6b)
To allow errors in the training set, we introduce slack variables and propose to optimize a soft-margin criterion. While there are several ways of doing this, we follow Crammer and Singer (2001) and introduce one slack variable for every non-linear constraint (4), which will result in an upper bound on the empirical risk and offers some additional algorithmic advantages. Adding a penalty term that is linear in the slack variables to the objective results in the quadratic program

SVM_1: min_{w,ξ} (1/2)‖w‖² + (C/n) Σ_{i=1}^n ξ_i, s.t. ∀i, ξ_i ≥ 0 (7a)
∀i, ∀y ∈ Y\y_i: 〈w, δΨ_i(y)〉 ≥ 1 − ξ_i. (7b)

Alternatively, we can also penalize margin violations by a quadratic term (C/2n) Σ_i ξ_i², leading to an analogous optimization problem which we refer to as SVM_2. In both cases, C > 0 is a constant that controls the trade-off between training error minimization and margin maximization.
SVM_1 implicitly considers the zero-one classification loss. As argued above, this is inappropriate for problems like natural language parsing, where |Y| is large. We now propose two approaches that generalize the above formulations to the case of arbitrary loss functions △. Our first approach is to re-scale the slack variables according to the loss incurred in each of the linear constraints. Intuitively, violating a margin constraint involving a y ≠ y_i with high loss △(y_i, y) should be penalized more severely than a violation involving an output value with smaller loss. This can be accomplished by multiplying the violation by the loss, or equivalently, by scaling slack variables with the inverse loss, which yields the problem

SVM^△s_1: min_{w,ξ} (1/2)‖w‖² + (C/n) Σ_{i=1}^n ξ_i, s.t. ∀i, ξ_i ≥ 0 (8)
∀i, ∀y ∈ Y\y_i: 〈w, δΨ_i(y)〉 ≥ 1 − ξ_i/△(y_i, y). (9)
A justification for this formulation is given by the subsequent proposition (proof omitted).

Proposition 1. Denote by (w*, ξ*) the optimal solution to SVM^△s_1. Then (1/n) Σ_{i=1}^n ξ*_i is an upper bound on the empirical risk R_S(w*).

The optimization problem SVM^△s_2 can be derived analogously, where △(y_i, y) is replaced by √△(y_i, y) in order to obtain an upper bound on the empirical risk.
A second way to include loss functions is to re-scale the margin as proposed by Taskar et al. (2004) for the special case of the Hamming loss. The margin constraints in this setting take the following form:

∀i, ∀y ∈ Y\y_i: 〈w, δΨ_i(y)〉 ≥ △(y_i, y) − ξ_i (10)

This set of constraints yields an optimization problem SVM^△m_1 which also results in an upper bound on R_S(w*). In our opinion, a potential disadvantage of the margin scaling approach is that it may give significant weight to output values y ∈ Y that are not even close to being confusable with the target values y_i, because every increase in the loss increases the required margin.
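The difference between the two rescaling schemes can be made concrete by computing the minimum feasible slack each constraint implies for a fixed score margin m_i(y) = 〈w, δΨ_i(y)〉 (the margin/loss numbers below are toy values assumed for illustration):

```python
def slack_rescaling_slack(margin, loss):
    # constraint (9): margin >= 1 - xi / loss  =>  xi >= loss * (1 - margin)
    return max(0.0, loss * (1.0 - margin))

def margin_rescaling_slack(margin, loss):
    # constraint (10): margin >= loss - xi  =>  xi >= loss - margin
    return max(0.0, loss - margin)

# A far-away output (already large margin) with high loss: margin re-scaling
# still demands slack, slack re-scaling does not.
margin, loss = 4.0, 5.0           # toy values
xi_s = slack_rescaling_slack(margin, loss)   # 5 * (1 - 4) < 0 -> 0
xi_m = margin_rescaling_slack(margin, loss)  # 5 - 4 = 1
```

This mirrors the objection above: under margin re-scaling, even an output that is far from being confusable with y_i keeps contributing slack whenever its loss exceeds its score margin.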
4. Support Vector Machine Learning
The key challenge in solving the QPs for the gener-
alized SVM learning is the large number of margin
constraints; more specifically the total number of con-
straints is n|Y|. In many cases, |Y| may be extremely
large, in particular, if Y is a product space of some
sort (e.g. in grammar learning, label sequence learn-
ing, etc.). This makes standard quadratic program-
ming solvers unsuitable for this type of problem.
In the following, we propose an algorithm that exploits
the special structure of the maximum-margin problem,
so that only a much smaller subset of constraints needs
to be explicitly examined. The algorithm is a general-
ization of the SVM algorithm for label sequence learn-
ing (Hofmann et al., 2002; Altun et al., 2003) and the
algorithm for inverse sequence alignment (Joachims,
2003). We will show how to compute arbitrarily close
approximations to all of the above SVM optimization
problems in polynomial time for a large range of struc-
tures and loss functions. Since the algorithm operates
on the dual program, we will first derive the Wolfe dual
for the various soft margin formulations.
4.1. Dual Programs
We will denote by α_iy the Lagrange multiplier enforcing the margin constraint for label y ≠ y_i and example (x_i, y_i). Using standard Lagrangian duality techniques, one arrives at the following dual QP for the hard margin case SVM_0

max_α Σ_{i, y≠y_i} α_iy − (1/2) Σ_{i, y≠y_i} Σ_{j, ȳ≠y_j} α_iy α_jȳ 〈δΨ_i(y), δΨ_j(ȳ)〉 (11a)
s.t. ∀i, ∀y ∈ Y\y_i: α_iy ≥ 0. (11b)

A kernel K((x, y), (x′, y′)) can be used to replace the inner products, since inner products in δΨ can be easily expressed as inner products of the original Ψ-vectors.

For soft-margin optimization with slack re-scaling and linear penalties (SVM^△s_1), additional box constraints

n Σ_{y≠y_i} α_iy/△(y_i, y) ≤ C, ∀i (12)

are added to the dual. Quadratic slack penalties (SVM^△s_2) lead to the same dual as SVM_0 after altering the inner product to 〈δΨ_i(y), δΨ_j(ȳ)〉 + δ_ij n/(C √(△(y_i, y) △(y_j, ȳ))), where δ_ij = 1 if i = j, else 0. Finally, in the case of margin re-scaling, the loss function affects the linear part of the objective function, max_α Σ_{i,y} α_iy △(y_i, y) − Q(α) (where the quadratic part Q is unchanged from (11a)), and introduces standard box constraints n Σ_{y≠y_i} α_iy ≤ C.
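A small sketch of evaluating the dual objective (11a) from explicit δΨ vectors (the vectors and multipliers are toy values assumed for illustration); with a joint kernel, each inner product expands into four kernel evaluations:

```python
import numpy as np

def dual_objective(alphas, dpsis):
    """W(alpha) = sum_r alpha_r - 1/2 sum_{r,s} alpha_r alpha_s <dPsi_r, dPsi_s>,
    i.e. the hard-margin dual (11a) over the active multipliers."""
    lin = sum(alphas)
    quad = sum(a * b * (u @ v)
               for a, u in zip(alphas, dpsis)
               for b, v in zip(alphas, dpsis))
    return lin - 0.5 * quad

# With a joint kernel K, each inner product expands as
# <dPsi_i(y), dPsi_j(ybar)> = K((x_i,y_i),(x_j,y_j)) - K((x_i,y_i),(x_j,ybar))
#                           - K((x_i,y),(x_j,y_j)) + K((x_i,y),(x_j,ybar)).
dpsis = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]  # toy delta-Psi vectors
alphas = [0.5, 0.25]
val = dual_objective(alphas, dpsis)
```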
4.2. Algorithm
The algorithm we propose aims at finding a small set of active constraints that ensures a sufficiently accurate solution. More precisely, it creates a nested sequence of successively tighter relaxations of the original problem using a cutting plane method. The latter is implemented as a variable selection approach in the dual formulation. We will show that this is a valid strategy, since there always exists a polynomially-sized subset of constraints so that the corresponding solution fulfills all constraints with a precision of at least ε. This means, the remaining, potentially exponentially many constraints are guaranteed to be violated by no more than ε, without the need for explicitly adding them to the optimization problem.

We will base the optimization on the dual program formulation which has two important advantages over the primal QP. First, it only depends on inner products in the joint feature space defined by Ψ, hence allowing the use of kernel functions. Second, the constraint matrix of the dual program (for the L_1-SVMs) supports a natural problem decomposition, since it is block diagonal, where each block corresponds to a specific training instance.

Pseudocode of the algorithm is depicted in Algorithm 1. The algorithm applies to all SVM formulations discussed above. The only difference is in the way the cost function gets set up in step 5. The algorithm maintains a working set S_i for each training example (x_i, y_i) to keep track of the selected constraints which define the current relaxation. Iterating through the training examples (x_i, y_i), the algorithm proceeds by
Algorithm 1 Algorithm for solving SVM_0 and the loss re-scaling formulations SVM^△s_1 and SVM^△s_2

1: Input: (x_1, y_1), ..., (x_n, y_n), C, ε
2: S_i ← ∅ for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     set up cost function
         SVM^△s_1: H(y) ≡ (1 − 〈δΨ_i(y), w〉) △(y_i, y)
         SVM^△s_2: H(y) ≡ (1 − 〈δΨ_i(y), w〉) √△(y_i, y)
         SVM^△m_1: H(y) ≡ △(y_i, y) − 〈δΨ_i(y), w〉
         SVM^△m_2: H(y) ≡ √△(y_i, y) − 〈δΨ_i(y), w〉
       where w ≡ Σ_j Σ_{y′∈S_j} α_jy′ δΨ_j(y′).
6:     compute ŷ = argmax_{y∈Y} H(y)
7:     compute ξ_i = max{0, max_{y∈S_i} H(y)}
8:     if H(ŷ) > ξ_i + ε then
9:       S_i ← S_i ∪ {ŷ}
10:      α_S ← optimize dual over S, S = ∪_i S_i
11:    end if
12:  end for
13: until no S_i has changed during iteration
finding the (potentially) “most violated” constraint, involving some output value ŷ (line 6). If the (appropriately scaled) margin violation of this constraint exceeds the current value of ξ_i by more than ε (line 8), the dual variable corresponding to ŷ is added to the working set (line 9). This variable selection process in the dual program corresponds to a successive strengthening of the primal problem by a cutting plane that cuts off the current primal solution from the feasible set. The chosen cutting plane corresponds to the constraint that determines the lowest feasible value for ξ_i. Once a constraint has been added, the solution is recomputed wrt. S (line 10). Alternatively, we have also devised a scheme where the optimization is restricted to S_i only, and where optimization over the full S is performed much less frequently. This can be beneficial due to the block diagonal structure of the optimization problems, which implies that variables α_jy with j ≠ i, y ∈ S_j can simply be “frozen” at their current values. Notice that all variables not included in their respective working set are implicitly treated as 0. The algorithm stops, if no constraint is violated by more than ε. The presented algorithm is implemented and available¹ as part of SVM^light. Note that the SVM optimization problems from iteration to iteration differ only by a single constraint. We therefore restart the SVM optimizer from the current solution, which greatly reduces the runtime. A convenient property of both algorithms is that they have a very general and well-defined interface independent of the choice of Ψ and △. To apply the algorithm, it is sufficient to implement the feature mapping Ψ(x, y) (either explicit or via a joint kernel function), the loss function △(y_i, y), as well as the maximization in step 6. All of those, in particular the constraint/cut selection method, are treated as black boxes. While the modeling of Ψ(x, y) and △(y_i, y) is more or less straightforward, solving the maximization problem for constraint selection typically requires exploiting the structure of Ψ.

¹http://svmlight.joachims.org/
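The working-set loop of Algorithm 1 can be sketched for the margin re-scaling case on a small enumerable Y. This is a simplified illustration, not the paper's implementation: the restricted problem of line 10 is solved approximately in the primal by subgradient descent rather than as a dual QP, and all inputs are toy values:

```python
import numpy as np

def cutting_plane_svm(psis, losses, C=1.0, eps=1e-2, inner_steps=500, lr=0.1):
    """Sketch of Algorithm 1, margin re-scaling variant.
    psis[i]:   dict y -> delta-Psi_i(y) as np.array (small enumerable Y)
    losses[i]: dict y -> loss(y_i, y)"""
    n = len(psis)
    dim = len(next(iter(psis[0].values())))
    w = np.zeros(dim)
    S = [set() for _ in range(n)]

    def solve_restricted(w):
        # approx. min 1/2||w||^2 + C/n sum_i max_{y in S_i} max(0, loss - <w,dPsi>)
        for _ in range(inner_steps):
            grad = w.copy()
            for i in range(n):
                if S[i]:
                    y = max(S[i], key=lambda yy: losses[i][yy] - w @ psis[i][yy])
                    if losses[i][y] - w @ psis[i][y] > 0:
                        grad -= (C / n) * psis[i][y]
            w = w - lr * grad
        return w

    changed = True
    while changed:
        changed = False
        for i in range(n):
            # line 6: most violated constraint, H(y) = loss - <w, dPsi>
            y_hat = max(psis[i], key=lambda yy: losses[i][yy] - w @ psis[i][yy])
            xi = max([0.0] + [losses[i][y] - w @ psis[i][y] for y in S[i]])
            if losses[i][y_hat] - w @ psis[i][y_hat] > xi + eps:  # line 8
                S[i].add(y_hat)              # line 9
                w = solve_restricted(w)      # line 10 (approximate)
                changed = True
    return w, S

# Toy problem: two examples, one competing output each (invented values).
psis = [{1: np.array([1.0, 0.0])}, {0: np.array([0.0, 1.0])}]
losses = [{1: 1.0}, {0: 1.0}]
w, S = cutting_plane_svm(psis, losses)
```

On this toy instance the loop adds one constraint per example and stops once no working-set constraint is violated by more than ε, mirroring the termination test in line 13.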
4.3. Analysis
It is straightforward to show that the algorithm finds a solution that is close to optimal (e.g. for the SVM^△s_1, adding ε to each ξ_i yields a feasible point of the primal whose objective is at most Cε from the optimum). However, it is not immediately obvious how fast the algorithm converges. We will show in the following that the algorithm converges in polynomial time for a large class of problems, despite a possibly exponential or infinite |Y|.

Let us begin with an elementary Lemma that will be helpful for proving subsequent results. It quantifies how the dual objective changes, if one optimizes over a single variable.

Lemma 1. Let J be a positive definite matrix and let us define a concave quadratic program

W(α) = −(1/2) α′Jα + 〈h, α〉 s.t. α ≥ 0

and assume α ≥ 0 is given with α_r = 0. Then maximizing W with respect to α_r while keeping all other components fixed will increase the objective by

(h_r − Σ_s α_s J_rs)² / (2 J_rr),

provided that h_r ≥ Σ_s α_s J_rs.

Proof. Denote by α[α_r ← β] the solution α with the r-th coefficient changed to β, then

W(α[α_r ← β]) − W(α) = β (h_r − Σ_s α_s J_rs) − (β²/2) J_rr.

The difference is maximized for

β* = (h_r − Σ_s α_s J_rs) / J_rr.

Notice that β* ≥ 0, since h_r ≥ Σ_s α_s J_rs and J_rr > 0.
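Lemma 1 can be verified numerically on a small positive definite J (all values below are toy assumptions):

```python
import numpy as np

def W(alpha, J, h):
    """Concave QP objective W(alpha) = -1/2 alpha' J alpha + <h, alpha>."""
    return -0.5 * alpha @ J @ alpha + h @ alpha

J = np.array([[2.0, 0.5], [0.5, 1.0]])  # positive definite (toy)
h = np.array([1.0, 1.5])
alpha = np.array([0.3, 0.0])            # alpha_r = 0 for r = 1
r = 1

gap = h[r] - alpha @ J[r]               # h_r - sum_s alpha_s J_rs
beta = gap / J[r, r]                    # optimal new value of alpha_r
new = alpha.copy()
new[r] = beta
increase = W(new, J, h) - W(alpha, J, h)
predicted = gap ** 2 / (2 * J[r, r])    # the Lemma's claimed improvement
```

The measured increase of the objective matches the closed-form expression of the Lemma exactly.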
Using this Lemma, we can lower bound the improvement of the dual objective in step 10 of Algorithm 1. For brevity, let us focus on the case of SVM^△s_2. Similar results can be derived also for the other variants.

Proposition 2. Define △_i = max_y △(y_i, y) and R_i = max_y ‖δΨ_i(y)‖. Then step 10 in Algorithm 1 improves the dual objective for SVM^△s_2 at least by

(1/2) ε² (△_i R_i² + n/C)⁻¹.
Proof. Using the notation in Algorithm 1, one can apply Lemma 1 with r = (i, ŷ) denoting the newly added constraint, h_r = 1, J_rr = ‖δΨ_i(ŷ)‖² + n/(C △(y_i, ŷ)) and Σ_s α_s J_rs = 〈w, δΨ_i(ŷ)〉 + Σ_{y≠y_i} α_iy n/(C √(△(y_i, ŷ) △(y_i, y))). Note that α_r = 0. Using the fact that Σ_{y≠y_i} α_iy n/(C √△(y_i, y)) = ξ_i, Lemma 1 shows the following increase of the objective function when optimizing over α_r alone:

[1 − 〈w, δΨ_i(ŷ)〉 − Σ_{y≠y_i} α_iy n/(C √(△(y_i, ŷ) △(y_i, y)))]² / [2 (‖δΨ_i(ŷ)‖² + n/(C △(y_i, ŷ)))]
  ≥ ε² / [2 (‖δΨ_i(ŷ)‖² △(y_i, ŷ) + n/C)].

The step follows from the fact that ξ_i ≥ 0 and √△(y_i, ŷ) (1 − 〈w, δΨ_i(ŷ)〉) ≥ ξ_i + ε, which is the condition of step 8. Replacing the quantities in the denominator by their upper limit proves the claim, since jointly optimizing over more variables than just α_r can only further increase the dual objective.
This leads to the following polynomial bound on the maximum size of S.

Theorem 1. With R̄ = max_i R_i, △̄ = max_i △_i and for a given ε > 0, Algorithm 1 for the SVM^△s_2 terminates after incrementally adding at most ε⁻² (C △̄² R̄² + n △̄) constraints to the working set S.

Proof. With S = ∅ the optimal value of the dual is 0. In each iteration a constraint (i, y) is added that is violated by at least ε, provided such a constraint exists. After solving the S-relaxed QP in step 10, the objective will increase by at least (1/2) ε² (△̄ R̄² + n/C)⁻¹ according to Proposition 2. Hence after t constraints, the dual objective will be at least t times this amount. The result follows from the fact that the dual objective is upper bounded by the minimum of the primal, which in turn can be bounded by (1/2) C △̄.

Note that the number of constraints in S does not depend on |Y|. This is crucial, since |Y| is exponential or infinite for many interesting problems. For problems where step 6 can be computed in polynomial time, the overall algorithm has a runtime polynomial in n, R̄, △̄, 1/ε, since at least one constraint will be added while
