Batch and online learning algorithms for Nonconvex
Neyman-Pearson classification
GILLES GASSO (a), LITIS, INSA Rouen, France.
ARISTIDIS PAPAIOANNOU (b), MARINA SPIVAK (c), LÉON BOTTOU (d), NEC Labs America, Princeton.
(a) Gilles Gasso, gilles.gasso@insa-rouen.fr.
(b) Aristidis Papaioannou, aristidis.papaioannou@gmail.com, was also affiliated with École Polytechnique Fédérale de Lausanne, Switzerland. He is now with Google, Mountain View, CA.
(c) Marina Spivak, spivak.marina@gmail.com, was also affiliated with New York University. She is now at the University of Washington, WA.
(d) Léon Bottou, leon@bottou.org, is now with Microsoft, Bellevue, WA.
We describe and evaluate two algorithms for the Neyman-Pearson (NP) classification problem, which has recently been shown to be of particular importance for bipartite ranking problems. NP classification is a nonconvex problem involving a constraint on the false negative rate. We investigate a batch algorithm based on DC programming and a stochastic gradient method well suited for large-scale datasets. Empirical evidence illustrates the potential of the proposed methods.
Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Classifier Design and evaluation
General Terms: Algorithms
Additional Key Words and Phrases: Neyman-Pearson, Nonconvex SVM, DC algorithm, online
learning
1. INTRODUCTION
Consider a binary classification problem with patterns $x \in \mathcal{X}$ and classes $y = \pm 1$ obeying an unknown probability distribution $dP(x, y)$. The probabilities of non-detection $P_{nd}$ and of false alarm $P_{fa}$ measure the two kinds of errors made by a discriminant function $f$:
$$P_{nd}(f) = P\big(f(x) \le 0 \mid y = +1\big), \qquad P_{fa}(f) = P\big(f(x) > 0 \mid y = -1\big).$$
Statistical decision theory recognizes the need to associate different costs with these two types of errors. This leads us to search for a classifier $f$ that minimizes the Asymmetric Cost (AC) formulation:
$$\min_f \; C_+ P_{nd}(f) + C_- P_{fa}(f) . \qquad (1)$$
Although $C_+$ and $C_-$ have a meaningful interpretation, it is often very difficult to specify these costs in real situations such as medical diagnosis or fraud detection. There are also cases where such costs have no meaningful interpretation, for instance, as discussed in section 4, when one uses a classification framework to approach a false discovery problem.
In contrast, the Neyman-Pearson (NP) formulation,
$$\min_f \; P_{nd}(f) \quad \text{subject to} \quad P_{fa}(f) \le \rho \qquad (2)$$
requires only the specification of the maximal false alarm rate $\rho$ and can be meaningfully applied to false discovery problems.
It is well known that the optimal decision functions for both problems are obtained by thresholding the optimal ranking function
$$r^*(x) = P\big(y = +1 \mid x\big) , \qquad (3)$$
that is,
$$f^*_{AC} = r^*(x) - C_- / (C_+ + C_-) , \qquad f^*_{NP} = r^*(x) - \min\{\, r \ \text{such that} \ P_{fa}(r^* - r) \le \rho \,\} .$$
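For intuition, $f^*_{NP}$ amounts to thresholding the ranking function at an upper quantile of the negative-class scores. The following sketch (plain NumPy, with synthetic placeholder scores) picks the empirical threshold for a target false alarm rate $\rho$:

import numpy as np

def np_threshold(scores_neg, rho):
    # Smallest threshold t such that the empirical false alarm rate
    # P(score > t | y = -1) stays at or below rho: an upper quantile
    # of the negative-class scores.
    return np.quantile(scores_neg, 1.0 - rho)

# Toy illustration with synthetic ranking scores.
rng = np.random.default_rng(0)
scores_neg = rng.normal(0.3, 0.1, size=1000)   # scores r(x) for y = -1
scores_pos = rng.normal(0.7, 0.1, size=1000)   # scores r(x) for y = +1

t = np_threshold(scores_neg, rho=0.05)
print("threshold:", round(t, 3))
print("empirical P_fa:", np.mean(scores_neg > t))   # roughly 0.05
print("empirical P_nd:", np.mean(scores_pos <= t))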
Although this result suggests equivalent capabilities, it misses several important points. Firstly, when using a finite training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ we must work with the empirical counterparts of $P_{nd}$ and $P_{fa}$:
$$\tilde{P}_{nd}(f) = \frac{1}{n_+} \sum_{i \in D_+} \mathbb{I}_{\{f(x_i) \le 0\}} , \qquad \tilde{P}_{fa}(f) = \frac{1}{n_-} \sum_{i \in D_-} \mathbb{I}_{\{f(x_i) > 0\}}$$
where $D_+$ and $D_-$ represent respectively the sets of positives and negatives, with cardinalities $n_+$ and $n_-$. We must also choose the decision function $f$ within a restricted class $\mathcal{H}$ that is unlikely to contain the optimal decision function. This approach is supported by standard results in statistical learning theory [e.g. Vapnik 1998] and their extension to the Neyman-Pearson formulation [Scott and Nowak 2005].
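As a concrete illustration of these empirical quantities, the short NumPy sketch below computes $\tilde P_{nd}$ and $\tilde P_{fa}$ for a toy linear decision function; the data and parameter values are placeholders:

import numpy as np

def empirical_rates(f_values, y):
    # Empirical non-detection and false alarm rates of a decision function,
    # given its values f(x_i) and labels y_i in {-1, +1}.
    p_nd = np.mean(f_values[y == +1] <= 0)   # missed positives
    p_fa = np.mean(f_values[y == -1] > 0)    # negatives flagged as positives
    return p_nd, p_fa

# Toy usage with a linear decision function f(x) = <w, x> + b.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
w, b = np.array([1.0, 0.0]), -0.2
p_nd, p_fa = empirical_rates(X @ w + b, y)
print(f"P_nd = {p_nd:.3f}, P_fa = {p_fa:.3f}")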
Secondly, the empirical counterparts of problems (1) and (2) involve the 0–1 loss function $\mathbb{I}_{\{y f(x) \le 0\}}$, which is neither continuous nor convex. Replacing this 0–1 loss with the SVM hinge loss has been studied for both the Asymmetric Cost [Bach et al. 2006] and Neyman-Pearson [Davenport et al. 2010] formulations. This substitution introduces additional complexities. In particular, in order to hit the specified goals on $P_{nd}$ and $P_{fa}$, one must use asymmetric costs that are different from $C_+$ and $C_-$. Both works eventually rely on hyperparameter searches in the asymmetric cost space.
An alternative approach consists in first learning a scoring function that orders
input patterns like the optimal ranking function (3). Both problems (1) and (2)
are then reduced to the determination of a suitable threshold [e.g. Cortes and
Mohri 2004]. However, it is quite difficult to ensure that the ranking function is most accurate in the threshold area. Theoretical investigations of this problem conclude that Neyman-Pearson classification remains an important primitive for such focussed ranking algorithms [Clémençon and Vayatis 2007; 2009].
This contribution proposes two practical and efficient algorithms for NP classification using nonconvex but continuous and mostly differentiable loss functions. In particular, these algorithms are shown to work using asymmetric costs that maintain a clear relation with the specified goals. The first algorithm leverages modern
nonconvex optimization techniques [Tao and An 1998]. The second algorithm is a stochastic gradient algorithm suitable for very large datasets. Various experimental results illustrate their properties.

[Fig. 1. Approximations of the 0–1 loss: the ramp loss, the sigmoid loss, $\ell_{atan}(z) = 1/2 - \arctan(\beta z)/\pi$, and $\ell_{erf}(z) = 1/2 - \mathrm{erf}(\beta z)/2$. Definitions of the other cost functions are detailed in the text.]
2. EMPIRICAL RISK NP FORMULATION
Our approach consists in replacing the 0–1 loss in $\tilde P_{nd}$ and $\tilde P_{fa}$ by a continuous nonconvex approximation such as the sigmoid loss
$$\ell(z) = \frac{1}{1 + e^{\eta z}} \qquad (4)$$
or the ramp loss
$$\ell(z) = \max\Big\{ 0, \ \tfrac{1}{2\eta}(\eta - z) \Big\} - \max\Big\{ 0, \ -\tfrac{1}{2\eta}(\eta + z) \Big\} . \qquad (5)$$
The positive parameter $\eta$ determines how close the nonconvex costs are to the 0–1 loss, and those approximations tend toward the 0–1 loss as $\eta$ tends to 0. Figure 1 illustrates such approximations. The selection of an optimization algorithm usually dictates the choice of an approximation: the differentiable sigmoid loss lends itself to gradient descent, whereas the ramp loss is attractive with dual optimization algorithms.
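A small NumPy sketch of the two surrogate losses as written in (4) and (5); the values of $\eta$ below are arbitrary illustration choices:

import numpy as np

def sigmoid_loss(z, eta=4.0):
    # Smooth, decreasing surrogate of the 0-1 loss I(z <= 0), as in (4);
    # eta controls how sharply the loss transitions around z = 0.
    return 1.0 / (1.0 + np.exp(eta * z))

def ramp_loss(z, eta=0.5):
    # Difference of two convex piecewise-linear terms, as in (5):
    # equals 1 for z <= -eta, 0 for z >= eta, and is linear in between.
    return (np.maximum(0.0, (eta - z) / (2 * eta))
            - np.maximum(0.0, -(eta + z) / (2 * eta)))

z = np.linspace(-1.0, 1.0, 9)
print(np.round(sigmoid_loss(z), 3))
print(np.round(ramp_loss(z), 3))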
Following common practice, we also add a regularization term Ω(f) to control
the capacity of our classifiers. We therefore seek the solution of
$$\min_{f \in \mathcal{H}} \; \Omega(f) + C \, \hat{P}_{nd}(f) \quad \text{subject to} \quad \hat{P}_{fa}(f) \le \rho \qquad (6)$$
where $C \in \mathbb{R}_+$ is the regularization parameter and
$$\hat{P}_{nd}(f) = \frac{1}{n_+} \sum_{i \in D_+} \ell\big(y_i f(x_i)\big) , \qquad \hat{P}_{fa}(f) = \frac{1}{n_-} \sum_{i \in D_-} \ell\big(y_i f(x_i)\big) .$$
For instance, in the case of a Neyman-Pearson SVM (NP-SVM), the discriminant function is $f(x) = f_0(x) + b$ with $f_0$ taken from an RKHS $\mathcal{H}$ induced by a kernel $k(x, x')$, and the regularizer is $\Omega(f) = \|f_0\|_{\mathcal H}^2$.
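To make the smoothed formulation concrete, the sketch below evaluates the objective and constraint of (6) for a simple linear model $f(x) = \langle w, x \rangle + b$ with $\Omega(f) = \tfrac12 \|w\|^2$ and the sigmoid surrogate; the model, regularizer, and parameter values are illustrative assumptions rather than the kernel NP-SVM just described:

import numpy as np

def sigmoid_loss(z, eta=4.0):
    return 1.0 / (1.0 + np.exp(eta * z))

def smoothed_rates(w, b, X, y, eta=4.0):
    # Smoothed empirical rates hat{P}_nd and hat{P}_fa under the surrogate loss.
    z = y * (X @ w + b)                      # margins y_i * f(x_i)
    p_nd = np.mean(sigmoid_loss(z[y == +1], eta))
    p_fa = np.mean(sigmoid_loss(z[y == -1], eta))
    return p_nd, p_fa

def np_objective(w, b, X, y, C=10.0, rho=0.05, eta=4.0):
    # Objective and constraint value of the smoothed NP problem (6).
    p_nd, p_fa = smoothed_rates(w, b, X, y, eta)
    objective = 0.5 * np.dot(w, w) + C * p_nd
    constraint = p_fa - rho                  # feasible when <= 0
    return objective, constraint

# Toy usage with synthetic data and an arbitrary linear classifier.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)
obj, cons = np_objective(np.array([1.0, -1.0, 0.0]), 0.0, X, y)
print(f"objective = {obj:.3f}, constraint (P_fa - rho) = {cons:.3f}")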
The nonconvex optimization problem (6) comes with the usual caveats and benefits. We can only obtain a local minimum of (6). On the other hand, we can
obtain a local minimum that is better than the solution of any convex relaxation
of (6), simply by initializing the nonconvex search using the solution of the convex
relaxation.
2.1 Previous work
The NP classification problem has been extensively studied. Past methods can be roughly divided into two categories: generative and discriminative.
One of the earliest attempts [Streit 1990] uses a multi-layered neural network to estimate the class-conditional distributions as mixtures of Gaussians. The discriminant function is then inferred with a likelihood ratio test. In the same vein, recent methods [Kim et al. 2006] assume the class-conditional distributions are Gaussian with means $\mu_\pm$ and covariances $\Sigma_\pm$, and consider a linear classifier $f(x) = \langle w, x \rangle + b$. This amounts to solving (2) with the following definitions
$$\hat{P}_{nd} = \Phi\left( - \frac{b + w^\top \hat\mu_+}{\sqrt{w^\top \hat\Sigma_+ w}} \right) , \qquad \hat{P}_{fa} = \Phi\left( \frac{b + w^\top \hat\mu_-}{\sqrt{w^\top \hat\Sigma_- w}} \right) ,$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\hat\mu_\pm$ and $\hat\Sigma_\pm$ are empirical estimates. Since the Gaussian assumption proves too restrictive, some authors [Huang et al. 2006; Kim et al. 2006] replace the cumulative $\Phi$ by a Chebyshev bound $\Psi(u) = [u]_+^2 / (1 + [u]_+^2)$, with $[u]_+ = \max(0, u)$. The scheme is extended to nonlinear discrimination using the kernel trick. A third flavor of generative approach addresses the estimation of the class-conditional distributions by Parzen windows [Bounsiar et al. 2008]. These generative methods share the same drawbacks: (1) the final classifiers are derived from estimated distributions whose accuracy is questionable when the datasets are small, and (2) the kernel versions of these models lack sparsity because all the examples are involved in the model.
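For the Gaussian generative model above, the two error rates of a linear classifier have the closed forms given by $\Phi$. A minimal sketch using empirical moments estimated from synthetic data (SciPy's normal CDF stands in for $\Phi$; all data and parameters are placeholders):

import numpy as np
from scipy.stats import norm

def gaussian_np_rates(w, b, mu_pos, cov_pos, mu_neg, cov_neg):
    # Closed-form error rates of f(x) = <w, x> + b under Gaussian
    # class-conditional models N(mu_+, Sigma_+) and N(mu_-, Sigma_-).
    p_nd = norm.cdf(-(b + w @ mu_pos) / np.sqrt(w @ cov_pos @ w))
    p_fa = norm.cdf((b + w @ mu_neg) / np.sqrt(w @ cov_neg @ w))
    return p_nd, p_fa

# Empirical moments estimated from toy data.
rng = np.random.default_rng(3)
X_pos = rng.normal(loc=[1.0, 1.0], scale=0.8, size=(500, 2))
X_neg = rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(500, 2))
w, b = np.array([1.0, 1.0]), 0.0
p_nd, p_fa = gaussian_np_rates(
    w, b,
    X_pos.mean(axis=0), np.cov(X_pos, rowvar=False),
    X_neg.mean(axis=0), np.cov(X_neg, rowvar=False),
)
print(f"P_nd = {p_nd:.4f}, P_fa = {p_fa:.4f}")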
On the discriminative side, the Asymmetric Cost SVM [Bach et al. 2006; Davenport et al. 2010] introduces costs $C_+$ and $C_-$ in the SVM formulation. As mentioned before, even if the true asymmetric costs were known, the benefits of the convex loss, such as the guaranteed convergence to a global optimum, are balanced by the necessity of searching for different asymmetric costs to achieve the desired NP constraint [Bach et al. 2006]. SVMPerf [Joachims 2005] optimizes in polynomial time a convex upper bound of any performance measure computable from the confusion table. Since $\tilde P_{fa}$ and $\tilde P_{nd}$ can be computed from the confusion table, SVMPerf can address the Neyman-Pearson problem. Computing times grow very quickly with the number of examples $n$, typically with degree four for linear models, and worse for nonlinear models. Finally, most similar to our approach, [Mozer et al. 2002] also consider a sigmoid approximation of the 0–1 loss. Compared to this work, our contributions are three-fold: we extend the nonconvex NP idea to SVMs in order to benefit from off-the-shelf SVM solvers, we propose a stochastic approach to deal with large-scale datasets, and we extend the potential of our approaches to related problems such as q-value optimization (section 4).
Algorithm 1 Uzawa Algorithm
Set an initial value for $\lambda \ge 0$. Pick a small gain $\nu > 0$.
repeat
    $f \leftarrow \arg\min_{f \in \mathcal{H}} L(f, \lambda)$
    $\lambda \leftarrow \max\{\, 0, \ \lambda + \nu \, \nabla_\lambda L(f, \lambda) \,\}$
until convergence
3. SOLVING NON-CONVEX NP PROBLEMS
The algorithms discussed in this paper find a local minimum of (6) by searching for a local saddle point $(f, \lambda) \in \mathcal{H} \times \mathbb{R}_+$ of the related Lagrangian
$$L(f, \lambda) = \Omega(f) + C \, \hat{P}_{nd}(f) + \lambda \big( \hat{P}_{fa}(f) - \rho \big) . \qquad (7)$$
The appendix summarizes several results that apply to nonconvex optimization.
Local saddle points of the Lagrangian (7) are always feasible local minima of (6).
Conversely, assuming differentiability, the local minima of (6) are always critical
points of the Lagrangian.
3.1 Uzawa algorithm
The Uzawa algorithm [Arrow et al. 1958] is a simple iterative procedure for finding a saddle point of the Lagrangian (7). Each iteration of the algorithm first computes a minimum $f_\lambda$ of the Lagrangian for the current value of $\lambda$, and then performs a small gradient ascent step in $\lambda$ (Algorithm 1), with $\nabla_\lambda L = \hat{P}_{fa} - \rho$. The convergence of the Uzawa algorithm is not obvious because the function $\lambda \mapsto L(f_\lambda, \lambda)$ can easily contain discontinuities. However, a simple argument (see Theorem 3 in the appendix) shows that $\hat{P}_{fa}(f_\lambda)$ is a nonincreasing function of $\lambda$. Therefore the sign of the gradient $\nabla_\lambda L = \hat{P}_{fa} - \rho$ correctly indicates whether $\lambda$ is above or below its target value. In general we prefer using a multiplicative update $\lambda \leftarrow \lambda (1 + \nu \, \nabla_\lambda L)$ because it keeps $\lambda$ positive. This makes very little difference in practice: the key is to adjust $\lambda$ in very small increments, for instance using a very small gain $\nu$.
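The sketch below illustrates the Uzawa iteration with the multiplicative $\lambda$ update on a linear model with the sigmoid surrogate; the plain gradient-descent inner solver, step sizes, and data are illustrative stand-ins for the kernel and stochastic machinery of the following subsections:

import numpy as np

def sigmoid_loss(z, eta=4.0):
    return 1.0 / (1.0 + np.exp(eta * z))

def sigmoid_loss_grad(z, eta=4.0):
    # Derivative of 1 / (1 + exp(eta * z)) with respect to z.
    s = sigmoid_loss(z, eta)
    return -eta * s * (1.0 - s)

def inner_minimize(w, b, X, y, C, lam, eta=4.0, lr=0.05, steps=500):
    # Approximately minimize the Lagrangian (7) over (w, b) for fixed lambda,
    # with a linear model and Omega(f) = 0.5 * ||w||^2, by gradient descent.
    pos = (y == +1)
    for _ in range(steps):
        z = y * (X @ w + b)
        g = sigmoid_loss_grad(z, eta) * y               # d loss_i / d f(x_i)
        coef = np.where(pos, C / pos.sum(), lam / (~pos).sum())
        grad_f = coef * g                               # per-example weights
        w = w - lr * (w + X.T @ grad_f)
        b = b - lr * grad_f.sum()
    return w, b

def uzawa_np(X, y, C=10.0, rho=0.05, nu=0.1, outer=50, eta=4.0):
    w, b, lam = np.zeros(X.shape[1]), 0.0, 1.0
    for _ in range(outer):
        w, b = inner_minimize(w, b, X, y, C, lam, eta)
        p_fa = np.mean(sigmoid_loss(y[y == -1] * (X[y == -1] @ w + b), eta))
        lam = lam * (1.0 + nu * (p_fa - rho))           # multiplicative ascent step
    return w, b, lam

# Toy run on synthetic data.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(1.0, 1.0, (300, 2)), rng.normal(-1.0, 1.0, (300, 2))])
y = np.concatenate([np.ones(300), -np.ones(300)])
w, b, lam = uzawa_np(X, y, rho=0.05)
print("lambda:", round(lam, 3))
print("smoothed P_fa:", round(np.mean(sigmoid_loss(y[y == -1] * (X[y == -1] @ w + b))), 3))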
The two algorithms discussed in this paper are essentially derived from the Uzawa algorithm. They differ in the minimization step. The first algorithm uses the DC approach [Tao and An 1998], where the acronym DC stands for “Difference of Convex functions”, and is suitable for kernel machines. The second algorithm relies on a stochastic gradient approach [Tsypkin 1971; Andrieu et al. 2007] suitable for processing large datasets.
3.2 Batch learning of NP-SVM
The most difficult step in the Uzawa algorithm is the minimization of $L$ over $f$ for fixed $\lambda$. In the case of the SVM classifier, the Lagrangian (7) reads
$$L = \frac{1}{2} \|f\|_{\mathcal H}^2 + C_+ \sum_{i \in D_+} \ell\big(y_i f(x_i)\big) + C_- \sum_{i \in D_-} \ell\big(y_i f(x_i)\big) - \lambda \rho$$
where, from the definitions of $\hat{P}_{nd}$ and $\hat{P}_{fa}$, the per-class costs are $C_+ = C / n_+$ and $C_- = \lambda / n_-$.
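Since the ramp loss is a difference of two convex functions, the fixed-$\lambda$ minimization can be approached with a DC (CCCP-style) loop: linearize the concave part at the current solution, then solve the resulting convex subproblem. The sketch below illustrates this idea on a linear model with a plain subgradient inner solver; it is a simplified stand-in under our own naming, not the kernel NP-SVM solver of this section:

import numpy as np

def ramp_parts(z, eta=0.5):
    # Convex components of the ramp loss (5): ramp = l1 - l2.
    l1 = np.maximum(0.0, (eta - z) / (2 * eta))
    l2 = np.maximum(0.0, -(eta + z) / (2 * eta))
    return l1, l2

def dc_inner_solve(X, y, C, lam, eta=0.5, outer=10, lr=0.05, steps=300):
    # DC / CCCP loop for the fixed-lambda Lagrangian with the ramp loss,
    # on a linear model f(x) = <w, x> + b with Omega(f) = 0.5 * ||w||^2.
    n_pos, n_neg = np.sum(y == +1), np.sum(y == -1)
    c = np.where(y == +1, C / n_pos, lam / n_neg)       # per-example costs
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(outer):
        # Linearize the concave part -l2 at the current solution:
        # beta_i is a (sub)gradient of l2 at z_i = y_i * f(x_i).
        z = y * (X @ w + b)
        beta = np.where(z < -eta, -1.0 / (2 * eta), 0.0)
        # Subgradient descent on the convexified objective
        #   0.5 ||w||^2 + sum_i c_i [ l1(z_i) - beta_i * z_i ].
        for _ in range(steps):
            z = y * (X @ w + b)
            dl1 = np.where(z < eta, -1.0 / (2 * eta), 0.0)   # subgradient of l1
            g = c * (dl1 - beta) * y
            w = w - lr * (w + X.T @ g)
            b = b - lr * g.sum()
    return w, b

# Toy usage for one fixed value of the multiplier lambda.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(1.0, 1.0, (200, 2)), rng.normal(-1.0, 1.0, (200, 2))])
y = np.concatenate([np.ones(200), -np.ones(200)])
w, b = dc_inner_solve(X, y, C=10.0, lam=5.0)
l1, l2 = ramp_parts(y[y == -1] * (X[y == -1] @ w + b))
print("ramp-loss P_fa estimate:", round(np.mean(l1 - l2), 3))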
References
Vapnik, V. 1998. Statistical Learning Theory.
Storey, J. D. and Tibshirani, R. Statistical significance for genomewide studies.
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (Eds.). Advances in Kernel Methods: Support Vector Learning.
Platt, J. C. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning.