Batch and online learning algorithms for Nonconvex
Neyman-Pearson classification
GILLES GASSO (a), LITIS, INSA Rouen, France.
ARISTIDIS PAPAIOANNOU (b), MARINA SPIVAK (c), LÉON BOTTOU (d), NEC Labs America, Princeton.
(a) Gilles Gasso, gilles.gasso@insa-rouen.fr.
(b) Aristidis Papaioannou, aristidis.papaioannou@gmail.com, was also affiliated with École Polytechnique Fédérale de Lausanne, Switzerland. He is now with Google, Mountain View, CA.
(c) Marina Spivak, spivak.marina@gmail.com, was also affiliated with New York University. She is now at the University of Washington, WA.
(d) Léon Bottou, leon@bottou.org, is now with Microsoft, Bellevue, WA.
We describe and evaluate two algorithms for the Neyman-Pearson (NP) classification problem, which has recently been shown to be of particular importance for bipartite ranking problems. NP classification is a nonconvex problem involving a constraint on the false negative rate. We investigate a batch algorithm based on DC programming and a stochastic gradient method well suited for large-scale datasets. Empirical evidence illustrates the potential of the proposed methods.
Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Classifier Design and evaluation
General Terms: Algorithms
Additional Key Words and Phrases: Neyman-Pearson, Nonconvex SVM, DC algorithm, online
learning
1. INTRODUCTION
Consider a binary classification problem with patterns $x \in \mathcal{X}$ and classes $y = \pm 1$ obeying an unknown probability distribution $dP(x, y)$. The probabilities of non-detection $P_{nd}$ and of false alarm $P_{fa}$ measure the two kinds of errors made by a discriminant function $f$:
$$P_{nd}(f) = P\big(f(x) \le 0 \mid y = +1\big), \qquad P_{fa}(f) = P\big(f(x) > 0 \mid y = -1\big).$$
Statistical decision theory recognizes the need to associate different costs with these two types of errors. This leads us to search for a classifier $f$ that minimizes the Asymmetric Cost (AC) formulation:
$$\min_f \; C_+ P_{nd}(f) + C_- P_{fa}(f) . \qquad (1)$$
Although $C_+$ and $C_-$ have a meaningful interpretation, it is often very difficult to specify these costs in real situations such as medical diagnosis or fraud detection. There are also cases where such costs have no meaningful interpretation, for instance, as discussed in section 4, when one uses a classification framework to approach a false discovery problem.
In contrast, the Neyman-Pearson (NP) formulation,
$$\min_f \; P_{nd}(f) \quad \text{subject to} \quad P_{fa}(f) \le \rho \qquad (2)$$
requires only the specification of the maximal false alarm rate $\rho$ and can be meaningfully applied to false discovery problems.
It is well known that the optimal decision functions for both problems are obtained by thresholding the optimal ranking function
$$r^*(x) = P\big(y = +1 \mid x\big) , \qquad (3)$$
that is,
$$f^*_{AC} = r^*(x) - C_- / (C_+ + C_-) , \qquad f^*_{NP} = r^*(x) - \min\{\, r \ \text{such that} \ P_{fa}(r^* - r) \le \rho \,\} .$$
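For intuition, $f^*_{NP}$ amounts to thresholding the ranking function at an upper quantile of the negative-class scores. The following sketch (plain NumPy, with synthetic placeholder scores) picks the empirical threshold for a target false alarm rate $\rho$:

import numpy as np

def np_threshold(scores_neg, rho):
    # Smallest threshold t such that the empirical false alarm rate
    # P(score > t | y = -1) stays at or below rho: an upper quantile
    # of the negative-class scores.
    return np.quantile(scores_neg, 1.0 - rho)

# Toy illustration with synthetic ranking scores.
rng = np.random.default_rng(0)
scores_neg = rng.normal(0.3, 0.1, size=1000)   # scores r(x) for y = -1
scores_pos = rng.normal(0.7, 0.1, size=1000)   # scores r(x) for y = +1

t = np_threshold(scores_neg, rho=0.05)
print("threshold:", round(t, 3))
print("empirical P_fa:", np.mean(scores_neg > t))   # roughly 0.05
print("empirical P_nd:", np.mean(scores_pos <= t))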
Although this result suggests equivalent capabilities, it misses several important points. Firstly, when using a finite training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ we must work with the empirical counterparts of $P_{nd}$ and $P_{fa}$:
$$\tilde{P}_{nd}(f) = \frac{1}{n_+} \sum_{i \in D_+} \mathbb{I}_{\{f(x_i) \le 0\}} , \qquad \tilde{P}_{fa}(f) = \frac{1}{n_-} \sum_{i \in D_-} \mathbb{I}_{\{f(x_i) > 0\}}$$
where $D_+$ and $D_-$ represent respectively the sets of positives and negatives, with cardinalities $n_+$ and $n_-$. We must also choose the decision function $f$ within a restricted class $\mathcal{H}$ that is unlikely to contain the optimal decision function. This approach is supported by standard results in statistical learning theory [e.g. Vapnik 1998] and their extension to the Neyman-Pearson formulation [Scott and Nowak 2005].
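As a concrete illustration of these empirical quantities, the short NumPy sketch below computes $\tilde P_{nd}$ and $\tilde P_{fa}$ for a toy linear decision function; the data and parameter values are placeholders:

import numpy as np

def empirical_rates(f_values, y):
    # Empirical non-detection and false alarm rates of a decision function,
    # given its values f(x_i) and labels y_i in {-1, +1}.
    p_nd = np.mean(f_values[y == +1] <= 0)   # missed positives
    p_fa = np.mean(f_values[y == -1] > 0)    # negatives flagged as positives
    return p_nd, p_fa

# Toy usage with a linear decision function f(x) = <w, x> + b.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
w, b = np.array([1.0, 0.0]), -0.2
p_nd, p_fa = empirical_rates(X @ w + b, y)
print(f"P_nd = {p_nd:.3f}, P_fa = {p_fa:.3f}")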
Secondly, the empirical counterparts of problems (1) and (2) involve the 0–1 loss function $\mathbb{I}_{\{y f(x) \le 0\}}$, which is neither continuous nor convex. Replacing this 0–1 loss with the SVM hinge loss has been studied for both the Asymmetric Cost [Bach et al. 2006] and Neyman-Pearson [Davenport et al. 2010] formulations. This substitution introduces additional complexities. In particular, in order to hit the specified goals on $P_{nd}$ and $P_{fa}$, one must use asymmetric costs that are different from $C_+$ and $C_-$. Both works eventually rely on hyperparameter searches in the asymmetric cost space.
An alternative approach consists in first learning a scoring function that orders
input patterns like the optimal ranking function (3). Both problems (1) and (2)
are then reduced to the determination of a suitable threshold [e.g. Cortes and
Mohri 2004]. However, it is quite difficult to ensure that the ranking function is most accurate in the threshold area. Theoretical investigations of this problem conclude that Neyman-Pearson classification remains an important primitive for such focussed ranking algorithms [Clémençon and Vayatis 2007; 2009].
This contribution proposes two practical and efficient algorithms for NP classification using nonconvex but continuous and mostly differentiable loss functions. In particular, these algorithms are shown to work using asymmetric costs that maintain a clear relation with the specified goals. The first algorithm leverages modern
nonconvex optimization techniques [Tao and An 1998]. The second algorithm is a stochastic gradient algorithm suitable for very large datasets. Various experimental results illustrate their properties.

[Fig. 1. Approximations of the 0–1 loss: the ramp loss, the sigmoid loss, $\ell_{atan}(z) = 1/2 - \arctan(\beta z)/\pi$, and $\ell_{erf}(z) = 1/2 - \mathrm{erf}(\beta z)/2$. Definitions of the other cost functions are detailed in the text.]
2. EMPIRICAL RISK NP FORMULATION
Our approach consists in replacing the 0–1 loss in $\tilde P_{nd}$ and $\tilde P_{fa}$ by a continuous nonconvex approximation such as the sigmoid loss
$$\ell(z) = \frac{1}{1 + e^{\eta z}} \qquad (4)$$
or the ramp loss
$$\ell(z) = \max\Big\{ 0, \ \tfrac{1}{2\eta}(\eta - z) \Big\} - \max\Big\{ 0, \ -\tfrac{1}{2\eta}(\eta + z) \Big\} . \qquad (5)$$
The positive parameter $\eta$ determines how close the nonconvex costs are to the 0–1 loss, and those approximations tend toward the 0–1 loss as $\eta$ tends to 0. Figure 1 illustrates such approximations. The selection of an optimization algorithm usually dictates the choice of an approximation: the differentiable sigmoid loss lends itself to gradient descent, whereas the ramp loss is attractive with dual optimization algorithms.
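A small NumPy sketch of the two surrogate losses as written in (4) and (5); the values of $\eta$ below are arbitrary illustration choices:

import numpy as np

def sigmoid_loss(z, eta=4.0):
    # Smooth, decreasing surrogate of the 0-1 loss I(z <= 0), as in (4);
    # eta controls how sharply the loss transitions around z = 0.
    return 1.0 / (1.0 + np.exp(eta * z))

def ramp_loss(z, eta=0.5):
    # Difference of two convex piecewise-linear terms, as in (5):
    # equals 1 for z <= -eta, 0 for z >= eta, and is linear in between.
    return (np.maximum(0.0, (eta - z) / (2 * eta))
            - np.maximum(0.0, -(eta + z) / (2 * eta)))

z = np.linspace(-1.0, 1.0, 9)
print(np.round(sigmoid_loss(z), 3))
print(np.round(ramp_loss(z), 3))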
Following common practice, we also add a regularization term Ω(f) to control
the capacity of our classifiers. We therefore seek the solution of
$$\min_{f \in \mathcal{H}} \; \Omega(f) + C \, \hat{P}_{nd}(f) \quad \text{subject to} \quad \hat{P}_{fa}(f) \le \rho \qquad (6)$$
where $C \in \mathbb{R}_+$ is the regularization parameter and
$$\hat{P}_{nd}(f) = \frac{1}{n_+} \sum_{i \in D_+} \ell\big(y_i f(x_i)\big) , \qquad \hat{P}_{fa}(f) = \frac{1}{n_-} \sum_{i \in D_-} \ell\big(y_i f(x_i)\big) .$$
For instance, in the case of a Neyman-Pearson SVM (NP-SVM), the discriminant function is $f(x) = f_0(x) + b$ with $f_0$ taken from an RKHS $\mathcal{H}$ induced by a kernel $k(x, x')$, and the regularizer is $\Omega(f) = \|f_0\|_{\mathcal H}^2$.
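To make the smoothed formulation concrete, the sketch below evaluates the objective and constraint of (6) for a simple linear model $f(x) = \langle w, x \rangle + b$ with $\Omega(f) = \tfrac12 \|w\|^2$ and the sigmoid surrogate; the model, regularizer, and parameter values are illustrative assumptions rather than the kernel NP-SVM just described:

import numpy as np

def sigmoid_loss(z, eta=4.0):
    return 1.0 / (1.0 + np.exp(eta * z))

def smoothed_rates(w, b, X, y, eta=4.0):
    # Smoothed empirical rates hat{P}_nd and hat{P}_fa under the surrogate loss.
    z = y * (X @ w + b)                      # margins y_i * f(x_i)
    p_nd = np.mean(sigmoid_loss(z[y == +1], eta))
    p_fa = np.mean(sigmoid_loss(z[y == -1], eta))
    return p_nd, p_fa

def np_objective(w, b, X, y, C=10.0, rho=0.05, eta=4.0):
    # Objective and constraint value of the smoothed NP problem (6).
    p_nd, p_fa = smoothed_rates(w, b, X, y, eta)
    objective = 0.5 * np.dot(w, w) + C * p_nd
    constraint = p_fa - rho                  # feasible when <= 0
    return objective, constraint

# Toy usage with synthetic data and an arbitrary linear classifier.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)
obj, cons = np_objective(np.array([1.0, -1.0, 0.0]), 0.0, X, y)
print(f"objective = {obj:.3f}, constraint (P_fa - rho) = {cons:.3f}")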
The nonconvex optimization problem (6) comes with the usual caveats and benefits. We can only obtain a local minimum of (6). On the other hand, we can
obtain a local minimum that is better than the solution of any convex relaxation
of (6), simply by initializing the nonconvex search using the solution of the convex
relaxation.
2.1 Previous work
The NP classification problem has been extensively studied. Past methods can be roughly divided into two categories: generative and discriminative.
One of the earliest attempts [Streit 1990] uses a multi-layered neural network to estimate the class-conditional distributions as mixtures of Gaussians. The discriminant function is then inferred with a likelihood ratio test. In the same vein, recent methods [Kim et al. 2006] assume the class-conditional distributions are Gaussian with means $\mu_\pm$ and covariances $\Sigma_\pm$, and consider a linear classifier $f(x) = \langle w, x \rangle + b$. This amounts to solving (2) with the following definitions
$$\hat{P}_{nd} = \Phi\left( - \frac{b + w^\top \hat\mu_+}{\sqrt{w^\top \hat\Sigma_+ w}} \right) , \qquad \hat{P}_{fa} = \Phi\left( \frac{b + w^\top \hat\mu_-}{\sqrt{w^\top \hat\Sigma_- w}} \right) ,$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\hat\mu_\pm$ and $\hat\Sigma_\pm$ are empirical estimates. Since the Gaussian assumption proves too restrictive, some authors [Huang et al. 2006; Kim et al. 2006] replace the cumulative $\Phi$ by a Chebyshev bound $\Psi(u) = [u]_+^2 / (1 + [u]_+^2)$, with $[u]_+ = \max(0, u)$. The scheme is extended to nonlinear discrimination using the kernel trick. A third flavor of generative approach addresses the estimation of the class-conditional distributions by Parzen windows [Bounsiar et al. 2008]. These generative methods share the same drawbacks: (1) the final classifiers are derived from estimated distributions whose accuracy is questionable when the datasets are small, and (2) the kernel versions of these models lack sparsity because all the examples are involved in the model.
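For the Gaussian generative model above, the two error rates of a linear classifier have the closed forms given by $\Phi$. A minimal sketch using empirical moments estimated from synthetic data (SciPy's normal CDF stands in for $\Phi$; all data and parameters are placeholders):

import numpy as np
from scipy.stats import norm

def gaussian_np_rates(w, b, mu_pos, cov_pos, mu_neg, cov_neg):
    # Closed-form error rates of f(x) = <w, x> + b under Gaussian
    # class-conditional models N(mu_+, Sigma_+) and N(mu_-, Sigma_-).
    p_nd = norm.cdf(-(b + w @ mu_pos) / np.sqrt(w @ cov_pos @ w))
    p_fa = norm.cdf((b + w @ mu_neg) / np.sqrt(w @ cov_neg @ w))
    return p_nd, p_fa

# Empirical moments estimated from toy data.
rng = np.random.default_rng(3)
X_pos = rng.normal(loc=[1.0, 1.0], scale=0.8, size=(500, 2))
X_neg = rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(500, 2))
w, b = np.array([1.0, 1.0]), 0.0
p_nd, p_fa = gaussian_np_rates(
    w, b,
    X_pos.mean(axis=0), np.cov(X_pos, rowvar=False),
    X_neg.mean(axis=0), np.cov(X_neg, rowvar=False),
)
print(f"P_nd = {p_nd:.4f}, P_fa = {p_fa:.4f}")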
On the discriminative side, the Asymmetric Cost SVM [Bach et al. 2006; Davenport et al. 2010] introduces costs $C_+$ and $C_-$ in the SVM formulation. As mentioned before, even if the true asymmetric costs were known, the benefits of the convex loss, such as the guaranteed convergence to a global optimum, are balanced by the necessity of searching for different asymmetric costs to achieve the desired NP constraint [Bach et al. 2006]. SVMPerf [Joachims 2005] optimizes in polynomial time a convex upper bound of any performance measure computable from the confusion table. Since $\tilde P_{fa}$ and $\tilde P_{nd}$ can be computed from the confusion table, SVMPerf can address the Neyman-Pearson problem. Computing times grow very quickly with the number of examples $n$, typically with degree four for linear models, and worse for nonlinear models. Finally, most similar to our approach, [Mozer et al. 2002] also consider a sigmoid approximation of the 0–1 loss. Compared to this work, our contributions are three-fold: we extend the nonconvex NP idea to SVMs in order to benefit from off-the-shelf SVM solvers, we propose a stochastic approach to deal with large-scale datasets, and we extend the potential of our approaches to related problems such as q-value optimization (section 4).
Algorithm 1 Uzawa Algorithm
Set an initial value for $\lambda \ge 0$. Pick a small gain $\nu > 0$.
repeat
    $f \leftarrow \arg\min_{f \in \mathcal{H}} L(f, \lambda)$
    $\lambda \leftarrow \max\{\, 0, \ \lambda + \nu \, \nabla_\lambda L(f, \lambda) \,\}$
until convergence
3. SOLVING NON-CONVEX NP PROBLEMS
The algorithms discussed in this paper find a local minimum of (6) by searching for a local saddle point $(f, \lambda) \in \mathcal{H} \times \mathbb{R}_+$ of the related Lagrangian
$$L(f, \lambda) = \Omega(f) + C \, \hat{P}_{nd}(f) + \lambda \big( \hat{P}_{fa}(f) - \rho \big) . \qquad (7)$$
The appendix summarizes several results that apply to nonconvex optimization.
Local saddle points of the Lagrangian (7) are always feasible local minima of (6).
Conversely, assuming differentiability, the local minima of (6) are always critical
points of the Lagrangian.
3.1 Uzawa algorithm
The Uzawa algorithm [Arrow et al. 1958] is a simple iterative procedure for finding a saddle point of the Lagrangian (7). Each iteration of the algorithm first computes a minimum $f_\lambda$ of the Lagrangian for the current value of $\lambda$, and then performs a small gradient ascent step in $\lambda$ (Algorithm 1), with $\nabla_\lambda L = \hat{P}_{fa} - \rho$. The convergence of the Uzawa algorithm is not obvious because the function $\lambda \mapsto L(f_\lambda, \lambda)$ can easily contain discontinuities. However, a simple argument (see Theorem 3 in the appendix) shows that $\hat{P}_{fa}(f_\lambda)$ is a nonincreasing function of $\lambda$. Therefore the sign of the gradient $\nabla_\lambda L = \hat{P}_{fa} - \rho$ correctly indicates whether $\lambda$ is above or below its target value. In general we prefer using a multiplicative update $\lambda \leftarrow \lambda (1 + \nu \, \nabla_\lambda L)$ because it keeps $\lambda$ positive. This makes very little difference in practice: the key is to adjust $\lambda$ in very small increments, for instance using a very small gain $\nu$.
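The sketch below illustrates the Uzawa iteration with the multiplicative $\lambda$ update on a linear model with the sigmoid surrogate; the plain gradient-descent inner solver, step sizes, and data are illustrative stand-ins for the kernel and stochastic machinery of the following subsections:

import numpy as np

def sigmoid_loss(z, eta=4.0):
    return 1.0 / (1.0 + np.exp(eta * z))

def sigmoid_loss_grad(z, eta=4.0):
    # Derivative of 1 / (1 + exp(eta * z)) with respect to z.
    s = sigmoid_loss(z, eta)
    return -eta * s * (1.0 - s)

def inner_minimize(w, b, X, y, C, lam, eta=4.0, lr=0.05, steps=500):
    # Approximately minimize the Lagrangian (7) over (w, b) for fixed lambda,
    # with a linear model and Omega(f) = 0.5 * ||w||^2, by gradient descent.
    pos = (y == +1)
    for _ in range(steps):
        z = y * (X @ w + b)
        g = sigmoid_loss_grad(z, eta) * y               # d loss_i / d f(x_i)
        coef = np.where(pos, C / pos.sum(), lam / (~pos).sum())
        grad_f = coef * g                               # per-example weights
        w = w - lr * (w + X.T @ grad_f)
        b = b - lr * grad_f.sum()
    return w, b

def uzawa_np(X, y, C=10.0, rho=0.05, nu=0.1, outer=50, eta=4.0):
    w, b, lam = np.zeros(X.shape[1]), 0.0, 1.0
    for _ in range(outer):
        w, b = inner_minimize(w, b, X, y, C, lam, eta)
        p_fa = np.mean(sigmoid_loss(y[y == -1] * (X[y == -1] @ w + b), eta))
        lam = lam * (1.0 + nu * (p_fa - rho))           # multiplicative ascent step
    return w, b, lam

# Toy run on synthetic data.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(1.0, 1.0, (300, 2)), rng.normal(-1.0, 1.0, (300, 2))])
y = np.concatenate([np.ones(300), -np.ones(300)])
w, b, lam = uzawa_np(X, y, rho=0.05)
print("lambda:", round(lam, 3))
print("smoothed P_fa:", round(np.mean(sigmoid_loss(y[y == -1] * (X[y == -1] @ w + b))), 3))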
The two algorithms discussed in this paper are essentially derived from the Uzawa algorithm. They differ in the minimization step. The first algorithm uses the DC approach [Tao and An 1998], where the acronym DC stands for “Difference of Convex functions”, and is suitable for kernel machines. The second algorithm relies on a stochastic gradient approach [Tsypkin 1971; Andrieu et al. 2007] suitable for processing large datasets.
3.2 Batch learning of NP-SVM
The most difficult step in the Uzawa algorithm is the minimization of $L$ over $f$ for fixed $\lambda$. In the case of the SVM classifier, the Lagrangian (7) reads
$$L = \frac{1}{2} \|f\|_{\mathcal H}^2 + C_+ \sum_{i \in D_+} \ell\big(y_i f(x_i)\big) + C_- \sum_{i \in D_-} \ell\big(y_i f(x_i)\big) - \lambda \rho$$
where, from the definitions of $\hat{P}_{nd}$ and $\hat{P}_{fa}$, the per-class costs are $C_+ = C / n_+$ and $C_- = \lambda / n_-$.
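Since the ramp loss is a difference of two convex functions, the fixed-$\lambda$ minimization can be approached with a DC (CCCP-style) loop: linearize the concave part at the current solution, then solve the resulting convex subproblem. The sketch below illustrates this idea on a linear model with a plain subgradient inner solver; it is a simplified stand-in under our own naming, not the kernel NP-SVM solver of this section:

import numpy as np

def ramp_parts(z, eta=0.5):
    # Convex components of the ramp loss (5): ramp = l1 - l2.
    l1 = np.maximum(0.0, (eta - z) / (2 * eta))
    l2 = np.maximum(0.0, -(eta + z) / (2 * eta))
    return l1, l2

def dc_inner_solve(X, y, C, lam, eta=0.5, outer=10, lr=0.05, steps=300):
    # DC / CCCP loop for the fixed-lambda Lagrangian with the ramp loss,
    # on a linear model f(x) = <w, x> + b with Omega(f) = 0.5 * ||w||^2.
    n_pos, n_neg = np.sum(y == +1), np.sum(y == -1)
    c = np.where(y == +1, C / n_pos, lam / n_neg)       # per-example costs
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(outer):
        # Linearize the concave part -l2 at the current solution:
        # beta_i is a (sub)gradient of l2 at z_i = y_i * f(x_i).
        z = y * (X @ w + b)
        beta = np.where(z < -eta, -1.0 / (2 * eta), 0.0)
        # Subgradient descent on the convexified objective
        #   0.5 ||w||^2 + sum_i c_i [ l1(z_i) - beta_i * z_i ].
        for _ in range(steps):
            z = y * (X @ w + b)
            dl1 = np.where(z < eta, -1.0 / (2 * eta), 0.0)   # subgradient of l1
            g = c * (dl1 - beta) * y
            w = w - lr * (w + X.T @ g)
            b = b - lr * g.sum()
    return w, b

# Toy usage for one fixed value of the multiplier lambda.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(1.0, 1.0, (200, 2)), rng.normal(-1.0, 1.0, (200, 2))])
y = np.concatenate([np.ones(200), -np.ones(200)])
w, b = dc_inner_solve(X, y, C=10.0, lam=5.0)
l1, l2 = ramp_parts(y[y == -1] * (X[y == -1] @ w + b))
print("ramp-loss P_fa estimate:", round(np.mean(l1 - l2), 3))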
References
Vapnik, V. 1998. Statistical Learning Theory.
Storey, J. D. and Tibshirani, R. Statistical significance for genomewide studies.
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (Eds.). Advances in Kernel Methods: Support Vector Learning.
Platt, J. C. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning.