
Class Decomposition Via Clustering:
A New Framework For Low-Variance Classifiers
Ricardo Vilalta, Murali-Krishna Achari, and Christoph F. Eick
Department of Computer Science
University of Houston
Houston TX, 77204-3010, USA
{vilalta, amkchari, ceick}@cs.uh.edu
Abstract
In this paper we propose a pre-processing step to clas-
sification that applies a clustering algorithm to the training
set to discover local patterns in the attribute or input space.
We demonstrate how this knowledge can be exploited to en-
hance the predictive accuracy of simple classifiers. Our fo-
cus is mainly on classifiers characterized by high bias but
low variance (e.g., linear classifiers); these classifiers ex-
perience difficulty in delineating class boundaries over the
input space when a class distributes in complex ways. De-
composing classes into clusters makes the new class distri-
bution easier to approximate and provides a viable way to
reduce bias while limiting the growth in variance. Experi-
mental results on real-world domains show an advantage
in predictive accuracy when clustering is used as a pre-
processing step to classification.
1 INTRODUCTION
Classification and clustering stand as central techniques
in data analysis. Classification aims at deriving a predic-
tion model from labelled data. The model is intended to
capture correlations between the feature variables and the
target variable to predict the class label of new data objects.
Clustering is a useful tool in revealing patterns in unlabelled
data; the goal is to discover how data objects gather into nat-
ural groups. The work described in this paper explores how
classification algorithms can benefit from class density in-
formation that is obtained using clustering. These informa-
tion can be exploited to improve the quality of the decision
boundaries during classification and enhance the prediction
accuracy of simple classifiers. We demonstrate how using
classification and clustering techniques in conjunction ad-
dresses key issues in learning theory (e.g., locality vs ca-
pacity or bias vs variance) and provides an attractive new
family of classification models.
Our goal is to exploit the information derived from a
clustering algorithm to increase the complexity of sim-
ple classifiers characterized by low variance and high bias.
These algorithms, commonly referred to as model-based or
parametric-based, encompass a small class of approximat-
ing functions and exhibit limited flexibility in their deci-
sion boundaries. Examples include linear classifiers, prob-
abilistic classifiers based on the attribute-independence as-
sumption (e.g., Naive Bayes), and single logical rules. The
question we address is how to increase the complexity of
these classifiers to trade off bias for variance in an effective
manner. Since these models start off with simple represen-
tations, increasing their complexity is expected to improve
their generalization performance while still retaining their
ability to output models amenable to interpretation.
Our approach consists of increasing the degree of com-
plexity of the decision boundaries of a simple classifier by
augmenting the number of boundaries per class. The idea
is to transform the classification problem by decomposing
each class into clusters. By relabelling the examples cov-
ered by each cluster with a new class label, the simple
classifier generates an increased number of boundaries per
class, and is then armed to cope with complex distributions
where classes cover different regions of the input space. Not
every cluster is relabelled with a new class; our algorithm
explores the space of possible new class assignments in a
greedy manner, maximizing predictive accuracy. In sum-
mary, our approach comprises the following modules:
1. A pre-processing step to classification that consists of
clustering examples that belong to the same class. This
identifies regions of high class density.
2. A search for a configuration of class assignments over
the set of clusters that optimizes predictive accuracy.
This increases the number of decision boundaries per
class.

3. A function that maps the predicted class label of a test
example to one of the original classes. This transforms
the auxiliary set of new classes into the original set of
classes.
We test our methodology on twenty datasets from the
University of California at Irvine repository, using two sim-
ple classifiers: Naive Bayes and a Support Vector Machine
with a polynomial kernel of degree one. Results show a
significant increase in predictive accuracy when our class-
decomposition approach is applied (Section 6). To con-
clude, empirical results support our goal statement that pre-
identifying local patterns in the data through clustering is a
helpful tool in improving the performance of simple classi-
fiers.
The paper organization is described next. Section 2 in-
troduces background information and our problem state-
ment. Section 3 details our class decomposition approach
via clustering to improve the performance of simple classi-
fiers. Section 4 uses the VC dimension to understand the in-
crease in representational power gained with our approach.
Section 5 reviews related work. Section 6 reports an empir-
ical assessment of our approach. Finally, Section 7 states
our summary and future work.
2 PROBLEM STATEMENT
2.1 SIMPLE DISCRIMINANT FUNCTIONS
Let (X_1, X_2, ..., X_n) be an n-component vector-valued random variable, where each X_i represents an attribute or feature; the space of all possible attribute vectors is called the attribute or input space X. Let {y_1, y_2, ..., y_k} be the possible classes, categories, or states of nature; the space of all possible classes is called the output space Y. A classifier receives as input a set of training examples T = {(x, y)}, where x = (x_1, x_2, ..., x_n) is a vector or point in the input space (x_i is the value of attribute X_i) and y is a point in the output space. The outcome of the classifier is a function h (or hypothesis) mapping the input space to the output space, h : X → Y.
We consider the case where a classifier defines a discriminant function for each class, g_j(x), j = 1, 2, ..., k, and chooses the class corresponding to the discriminant function with the highest value (ties are broken arbitrarily):

h(x) = y_m  iff  g_m(x) ≥ g_j(x) for all j    (1)
Possibly the simplest case is that of a linear discriminant function, where the approximation is based on a linear model:

g_j(x) = w_0 + Σ_{i=1}^{n} w_i x_i    (2)

where each w_i, 0 ≤ i ≤ n, is a coefficient that must be learned by the classification algorithm.
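To make Eqs. (1) and (2) concrete, the short sketch below evaluates one linear discriminant per class and predicts the class whose discriminant is largest. It is a minimal numerical illustration, not code from the paper; the weight values are invented.

import numpy as np

# One weight vector per class: g_j(x) = w_0 + sum_i w_i * x_i  (Eq. 2).
# The weights below are arbitrary and purely illustrative.
W = np.array([[ 0.5,  1.0, -2.0],    # class 0: w_0, w_1, w_2
              [-0.3, -1.5,  0.8]])   # class 1: w_0, w_1, w_2

def predict(x):
    scores = W[:, 0] + W[:, 1:] @ x   # one discriminant value g_j(x) per class
    return int(np.argmax(scores))     # Eq. (1): choose the largest discriminant

print(predict(np.array([0.2, -0.7])))  # index of the winning class (here 0)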
We will also consider probabilistic classifiers where the discriminant functions are proportional to the posterior probabilities of a class given the input vector x, P(y_j|x). The classifier, also known as Naive Bayes, assumes feature independence given the class [7]:

g_j(x) = P(y_j) Π_{i=1}^{n} P(x_i|y_j)    (3)

where P(y_j) is the a priori probability of class y_j, and Π_{i=1}^{n} P(x_i|y_j) is a simple product approximation of P(x|y_j), called the likelihood or class-conditional probability.
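Eq. (3) can be evaluated just as directly. The sketch below assumes discrete attributes with pre-estimated priors and per-attribute likelihoods (the numbers are invented for illustration) and, as is common in practice, works with log-probabilities to avoid numerical underflow; the paper itself does not prescribe this detail.

import math

# Invented estimates for a two-class problem with two binary attributes.
prior = {"pos": 0.6, "neg": 0.4}                        # P(y_j)
likelihood = {                                          # P(x_i | y_j)
    "pos": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "neg": [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}

def g(y, x):
    # Eq. (3) in log form: log P(y_j) + sum_i log P(x_i | y_j)
    return math.log(prior[y]) + sum(math.log(likelihood[y][i][v])
                                    for i, v in enumerate(x))

x = (1, 0)
print(max(prior, key=lambda y: g(y, x)))   # class with the highest discriminant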
2.2 THE BIAS-VARIANCE TRADEOFF
Simple discriminant functions tend to output poor func-
tion approximations when the data distributes in complex
ways. Our goal is to increase the complexity of simple clas-
sifiers to obtain better function approximations. Since our
training set comprises a limited number of examples and we
do not know the form of the true target distribution, our goal
is inevitably subject to the bias-variance dilemma in statisti-
cal inference [9, 10]. The dilemma is based on the fact that
prediction error can be decomposed into a bias and a variance component¹; ideally we would like to have classifiers with low bias and low variance but these components are inversely related.
On the one hand, simple classifiers, commonly referred
to as model-based or parametric-based –and the subject of
our study–, encompass a small class of approximating func-
tions and exhibit limited flexibility on their decision bound-
aries. Their small repertoire of functions produces high
bias (since the best approximating function may lie far from
the target function) but low variance (since there is little
dependence on local irregularities in the data). Examples
include linear classifiers, probabilistic classifiers such as
Naive Bayes, and single logical rules.
On the other hand, increasing the complexity of the clas-
sifier reduces the bias but increases the variance. Complex
classifiers, also referred to as model free or parametric-free,
encompass a large class of approximating functions; they
exhibit flexible decision boundaries (low bias) but are sen-
sitive to small variations in the data (high variance). Exam-
ples include neural networks with a large number of hidden
units and k-nearest neighbor classifiers with small values
for k.
Our problem statement can be rephrased as follows: how
can we decrease the bias (i.e., increase the complexity) of
our simple classifiers without drastically increasing the vari-
ance component? Notice that our goal sets forth in a direction orthogonal to combination methods like bagging [5] and boosting [8], where the goal is to reduce the variance component in generalization error by voting on variants of the training data.

¹ A third component, the irreducible error or Bayes error, cannot be eliminated or traded off.

Figure 1. (a) A high-order polynomial improves the classification of a linear classifier at the expense of increased variance. (b) Increasing the number of linear discriminants guided by local patterns increases complexity with lower impact on variance.
2.3 INCREASING COMPLEXITY THROUGH
ADDITIONAL BOUNDARIES
Our solution is to exploit information about the distribu-
tion of examples through a pre-processing step that iden-
tifies natural clusters in data. As an illustration, Figure 1
shows a two-dimensional input space with two classes (positive + and negative −). The distribution of examples pre-
cludes a simple linear classifier attaining good performance
(Figure 1a, bold line). One way to increase the complex-
ity of the classifier is to enlarge the original space of linear
combinations to allow for more flexibility on the decision
boundaries, for example by adding higher order polynomi-
als (Figure 1a, dashed line). But this comes at the expense
of increased variance and possibly data overfitting.
Alternatively, one can retain the same space of linear
functions but increase the number of decision boundaries
per class (Figure 1b). This increases the complexity of the
classifier but with less impact on variance (Section 4). The
trick lies in identifying regions of high class density within subsets of examples of the same class, which we accomplish through clustering. The next sections provide a detailed description of our approach.
3 CLASS DECOMPOSITION VIA CLUSTERING
Our solution comprises three modules: 1) a decomposi-
tion of classes into clusters; 2) a search for an optimal class
assignment configuration; and 3) a function mapping pre-
dictions to the original set of class labels. We explain each
module in turn.
Algorithm 1: Mapping-Process
Input: clustering method C, dataset T
Output: new dataset T'

MAPPING-PROCESS(C, T)
(1)   Separate T into subsets {T_j},
(2)     where T_j = {(x, y) ∈ T | y = y_j}
(3)   foreach T_j
(4)     Apply clustering C on T_j
(5)     Let {c^j_p} be the set of clusters
(6)     foreach example e = (x, y_j)
(7)       Let p be the cluster index for x
(8)       Create example e' = (x, y'_j),
(9)         where y'_j = (y_j, p)
(10)      Add e' to T'
(11)    end
(12)  end
(13)  return T'

Figure 2. The process to transform dataset T into a new dataset T' using a clustering algorithm.
3.1 CLASS DECOMPOSITION
The first module pre-processes the training data by clus-
tering examples that belong to the same class as shown in
Algorithm 1 (Figure 2). We proceed by first separating
dataset T into sets of examples of the same class. That is, T is separated into different sets of examples T = {T_j}, where each T_j comprises all examples in T labelled with class y_j, T_j = {(x, y) ∈ T | y = y_j}.
For each set T_j we apply a clustering algorithm C to find sets of examples (i.e., clusters) grouped together according to some distance metric over the input space². Let {c^j_i} be the set of such clusters. We map the set of examples in T_j into a new set T'_j by renaming every class label to indicate not only the class but also the cluster to which each example belongs. One simple way to do this is by making each class label a pair (a, b), where the first element represents the original class and the second element represents the cluster that the example falls into. In that case, T'_j = {(x, y'_j)}, where y'_j = (y_j, i) whenever example x is assigned to cluster c^j_i.

² We consider a flat type of clustering (as opposed to hierarchical) where each object is assigned to exactly one cluster.

Figure 3. The mapping process relabels examples to encode both class and cluster.
An illustration of the transformation above is shown in
Figure 3. We assume a two-dimensional input space where
examples belong to either class positive (+) or negative (−).
Let’s suppose the clustering algorithm separates class pos-
itive into two clusters, while class negative is grouped into
one single cluster. The transformation relabels every exam-
ple to encode class and cluster label. As a result, dataset T' now has three different classes. Finally, the new dataset T' is simply the union of all sets of examples of the same class, relabelled according to the cluster to which each example belongs: T' = ∪_{j=1}^{k} T'_j.
In summary, the first module maps training set T into another dataset T' through a class-decomposition process. The mapping leaves the input space X intact but changes the output space Y into a (possibly) larger space Y' (i.e., |Y'| ≥ |Y|, where |·| denotes the cardinality of a space).
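For concreteness, below is a minimal Python sketch of this first module. It uses scikit-learn's KMeans as a stand-in for the clustering method C (the experiments reported for this work use EM clustering, so this choice is purely illustrative), assumes a fixed number of clusters per class, and encodes the pair (y_j, p) as a string label so the result can be fed to any off-the-shelf classifier.

import numpy as np
from sklearn.cluster import KMeans

def mapping_process(X, y, n_clusters=2, random_state=0):
    """Relabel each example with a (class, cluster) pair (Algorithm 1).

    Illustrative sketch only: KMeans stands in for the clustering method C,
    a fixed number of clusters per class is assumed, and the pair (y_j, p)
    is encoded as the string "y_j|p".
    """
    X, y = np.asarray(X), np.asarray(y)
    y_new = np.empty(len(y), dtype=object)
    for y_j in np.unique(y):                       # T_j = {(x, y) in T | y = y_j}
        idx = np.where(y == y_j)[0]
        k = min(n_clusters, len(idx))              # guard against tiny classes
        clusters = KMeans(n_clusters=k, n_init=10,
                          random_state=random_state).fit_predict(X[idx])
        for i, p in zip(idx, clusters):            # y'_j = (y_j, p)
            y_new[i] = f"{y_j}|{p}"
    return y_new                                   # labels of the new dataset T'

Training a simple classifier on (X, mapping_process(X, y)) then gives it one decision boundary per cluster rather than one per class, which is the effect sought in Section 2.3.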
3.2 A SEARCH FOR THE OPTIMAL CLASS
ASSIGNMENT
Increasing the number of classes according to the num-
ber of induced clusters does not always yield optimal per-
formance. As an illustration, Figure 4 shows a distribution
of examples where the positive class decomposes into three
clusters. Constructing a linear classifier separately on each
cluster generates decision boundaries that cause misclassi-
fications (top positive clusters in Fig. 4, bold lines). One
solution is to maintain the lower cluster while reverting part
of the decomposition process by merging the top clusters
into one cluster. This creates a decision boundary (Fig. 4,
dashed line) that allows separating the two classes without
errors.
Figure 4. An example where merging clusters can further improve predictive accuracy.

Our second module explores the space of possible ways to merge clusters derived from the first step. Following the same notation as before, a class label will be a pair (a, b), where the first element represents the original class label and the second element represents the cluster that the example falls into; but the difference now is that two or more clusters may correspond to the same second element (i.e., element b), which can be interpreted as having clusters merged into a single cluster. In Figure 4, for example, the class decomposition process (module 1) produces four new class labels: (+, 1), (+, 2), (+, 3), and (−, 1). If we observe an increase in predictive accuracy by merging the two top positive clusters into one positive cluster, module 2 would recommend a class assignment based on three labels only: (+, 1), (+, 2), and (−, 1), which is better suited for a simple classifier.
Our goal is to explore the space of possible ways to
merge clusters obtained during the first step, until we find
a configuration that maximizes predictive accuracy (over a
validation set different from the training set). The space of
possible configurations corresponds to the space of all sub-
sets of clusters, with each subset being assigned the same
cluster index (i.e., being assigned the same class label). Ob-
viously one cannot explore this space exhaustively. If class y_j is decomposed into n_j clusters, the number of different configurations has an upper bound of O(2^{n_j}). To avoid an exhaustive search we follow a heuristic greedy approach.
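For instance, a class decomposed into n_j = 10 clusters already admits on the order of 2^10 = 1024 candidate subsets, each of which would require retraining and re-evaluating the classifier.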
Figure 5, Algorithm 2, describes our approach. The
search starts by evaluating predictive accuracy assuming
each cluster is mapped to a separate index. Next we start
looking for pairs of clusters (e.g., {c^j_1, c^j_2}) and compute
predictive accuracy assuming the two clusters on each pair
are mapped to the same index. We then take those pairs
for which predictive accuracy increased and rank them ac-
cordingly. To enforce a mutually exclusive list of clusters
we prune every cluster pair where one cluster appears on
another pair with higher rank.
Next we construct 3-element cluster sets by adding sin-
gle clusters to the remaining 2-element cluster sets found

Algorithm 2: Merge-Clusters-Process
Input: initial dataset T'
Output: modified dataset T'

MERGE(T')
(1)   foreach class y_j
(2)     Let C_j = {c^j_i} be the set of clusters
(3)     Let L_1 = C_j
(4)     Let i = 2 be the search level
(5)     repeat
(6)       L_i ← form subsets of clusters of size i
(7)             by combining L_{i-1} with C_j
(8)       Evaluate and rank all new subsets
(9)       Prune lower-rank subsets with
(10)            duplicated clusters
(11)      i ← i + 1
(12)    until no accuracy improvement
(13)    T' ← change T' such that examples covered
(14)         by clusters within the same subset have the same
(15)         class label
(16)  end
(17)  return T'

Figure 5. Improving predictive accuracy by merging clusters of examples on each class.
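The listing above can be rendered in Python roughly as follows. This is a simplified, hypothetical sketch for a single class: the evaluate callable, which must retrain the simple classifier under a candidate merge and return accuracy on a validation set, is assumed to be supplied by the caller and is not shown, and repeated evaluations of the same candidate are not cached.

def greedy_merge(clusters, evaluate, baseline):
    """Level-wise greedy search over cluster subsets for one class (Algorithm 2).

    clusters : cluster indices of the class under analysis.
    evaluate : hypothetical callable; given a subset of clusters to be merged
               under one label, it returns validation accuracy.
    baseline : accuracy when every cluster keeps its own label.
    """
    level = [(c,) for c in clusters]               # L_1: singleton subsets
    accepted = []                                  # improving subsets, all levels
    while True:
        # Form subsets of size i+1 by combining the previous level with C_j,
        # keep those that improve accuracy, rank them, and prune any subset
        # that overlaps a higher-ranked one (mutually exclusive list).
        scored = []
        for subset in level:
            for c in clusters:
                if c not in subset:
                    cand = tuple(sorted(set(subset) | {c}))
                    acc = evaluate(cand)
                    if acc > baseline:
                        scored.append((acc, cand))
        if not scored:
            break                                  # no accuracy improvement
        scored.sort(reverse=True)
        used, level = set(), []
        for _, cand in scored:
            if not used.intersection(cand):
                level.append(cand)
                used.update(cand)
        accepted.extend(level)
    # Final pruning: drop lower-cardinality subsets overlapping a larger one.
    accepted.sort(key=len, reverse=True)
    final, used = [], set()
    for subset in accepted:
        if not used.intersection(subset):
            final.append(subset)
            used.update(subset)
    final.extend((c,) for c in clusters if c not in used)   # unmerged clusters
    return final                                   # each subset gets one class label

Under the accuracy improvements assumed in the worked example below, such a search returns the subsets {c_2, c_3, c_5}, {c_1, c_4}, and {c_6}.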
in the previous step, and evaluate their predictive accuracy
(now assuming that all three clusters are mapped to the same
index). We keep those for which predictive accuracy in-
creased and apply pruning as we described before. The al-
gorithm terminates when no new cluster sets of higher car-
dinality can be produced from the cluster sets in the previ-
ous iteration. Finally, we prune any lower cardinality cluster
sets that have a cluster in common (i.e., that overlap) with
a higher cardinality set. At that point we assign the clusters
on each subset the same index (i.e., the same class label).
As an illustration, assume class y_1 decomposes into six different clusters. Initially each cluster forms a unique set and is assigned a different index: {c^1_1}, {c^1_2}, ..., {c^1_6}. Now assume the following cluster pairs show improvement in predictive accuracy (ranked accordingly): {c^1_2, c^1_3}, {c^1_1, c^1_4}, {c^1_1, c^1_5}. The last cluster pair is eliminated since cluster c^1_1 appears on a higher-ranking pair. At the next level assume the following 3-element cluster set is obtained: {c^1_2, c^1_3, c^1_5}. If no more cluster sets are produced, then the last step simply prunes lower-cardinality cluster sets that have a cluster in common. The final configuration indicates how clusters are merged together. For example, {c^1_2, c^1_3, c^1_5}, {c^1_1, c^1_4}, {c^1_6} indicates that clusters two, three, and five are merged into a single cluster, the same holding true for clusters one and four; cluster six is not merged. The final training set divides class y_1 into three new categories.
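Applying a final configuration such as this one to the relabelled data amounts to mapping every cluster index to the index of the subset that contains it; a small sketch:

# Final configuration for class y_1 in the example above: clusters 2, 3 and 5
# merged, clusters 1 and 4 merged, cluster 6 left on its own.
configuration = [(2, 3, 5), (1, 4), (6,)]

# Map each original cluster index to the index of its merged subset.
merged_index = {c: s for s, subset in enumerate(configuration) for c in subset}

print(merged_index[3])   # -> 0: an example labelled (y_1, 3) becomes (y_1, 0)
print(merged_index[4])   # -> 1
print(merged_index[6])   # -> 2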
3.3 IMPROVING COMPUTATIONAL EFFICIENCY
Clearly, searching over the space of cluster subsets de-
mands excessive computational power. To ease the burden of repeatedly estimating predictive accuracy, we note that changing class assignments over the clusters of a particular class
does not affect the discriminant functions corresponding to
other classes. Therefore in estimating predictive accuracy
one can keep all discriminant functions fixed except for the
one corresponding to the class under analysis. This reduces
the computational cost of our approach by a factor propor-
tional to the number of classes.
3.4 CLASSIFICATION OF EXAMPLES
Our last module shows how to assess the performance
of the linear classifier over the extended output space. This
is necessary during the search over the space of subsets of
clusters (Section 3.2), and while estimating final predictive
accuracy.
During learning, the simple classifier is trained over dataset T', producing a hypothesis h' mapping points from the input space X to the new output space Y'. During classification, hypothesis h' will output a prediction consisting of a class label and a cluster label, h'(x) = (a, b). To know the actual prediction in the original output space Y we simply remove the cluster index. Essentially, we predict class label y_j whenever example x is assigned to any of the clusters (or subsets of clusters) of class y_j. As an illustration, assume the prediction of an example x is h'(x) = (−, 1); then our final prediction simply disregards the cluster index and assigns x to the negative class.
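Since each new class label carries the original class inside it, the mapping back to Y is a one-liner. A minimal sketch, assuming the "class|cluster" string encoding used in the Section 3.1 sketch:

def decode(prediction):
    """Map a prediction in Y' (encoded as "class|cluster") back to Y."""
    original_class, _cluster = prediction.rsplit("|", 1)
    return original_class

print(decode("-|1"))   # -> "-": the negative class; the cluster index is discarded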
Our class decomposition process aims at eliminating dis-
tributions unfavorable to simple classifiers where a class
spreads out into multiple regions. As each cluster (or group
of clusters) is transformed into a class of its own, each class
sits in a tight region and becomes easier to separate away
using simple decision boundaries.
4 COMPLEXITY AND THE VC DIMENSION
In this section we use a measure of complexity known
as the VC dimension to compare the increase in representa-
tional power gained by augmenting the number of decision
boundaries of a simple classifier (our approach) to the in-
crease gained by augmenting the flexibility of the decision
boundaries. Consider that a simple classifier has a small
class of functions φ from which to draw a hypothesis. If
we wish to make our class φ stronger we must increase
the representational power of its member functions. Re-
call, however, that adding too much representational power

References
V. N. Vapnik. The Nature of Statistical Learning Theory.
J. R. Quinlan. C4.5: Programs for Machine Learning.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Frequently Asked Questions (20)
Q1. How can a learning algorithm look for a good model?

In tradeoffs such as bias vs. variance or capacity vs. empirical risk, a learning algorithm can look for a good model by first trying large complexity steps, for example by increasing the degree of the class of polynomials.

In this paper the authors propose a pre-processing step to classification that applies a clustering algorithm to the training set to discover local patterns in the attribute or input space. The authors demonstrate how this knowledge can be exploited to enhance the predictive accuracy of simple classifiers. Decomposing classes into clusters makes the new class distribution easier to approximate and provides a viable way to reduce bias while limiting the growth in variance. 

Future work will look for ways to improve the computational efficiency of their approach (as suggested in Section 3.3). Future work will also address the feasibility of dynamically varying the growth rate of the complexity of the class of functions during model selection. Such a model can then be refined using smaller complexity steps by augmenting the number of classifiers per class, as suggested in their approach.

Their class decomposition process aims at eliminating distributions unfavorable to simple classifiers where a class spreads out into multiple regions. 

The model is intended to capture correlations between the feature variables and the target variable to predict the class label of new data objects. 

The clustering algorithm follows the Expectation Maximization (EM) technique [14]; it groups examples into clusters by modelling each cluster through a probability density function. 

The trick lies in identifying regions of high class density within subsets of examples of the same class, which the authors accomplish through clustering.

If no more cluster sets are produced,then the last step simply prunes lower cardinality cluster sets that have a cluster in common. 

This has the advantage of using all examples belonging to the same class for analysis, whereas in decision tree induction, the continuous partitioning of the data progressively lessens the statistical support of every decision rule, an effect known as the fragmentation problem. 

Implementations of the SVM, Naive Bayes, and EM-clustering are part of the WEKA machine-learning class library [18], set with default values. 

One limitation of their approach is the amount of CPU time necessary to find the best class-assignment configuration (Section 3.2). 

One way to increase the complexity of the classifier is to enlarge the original space of linear combinations to allow for more flexibility on the decision boundaries, for example by adding higher order polynomials (Figure 1a, dashed line). 

The authors map the set of examples in T_j into a new set T'_j by renaming every class label to indicate not only the class but also the cluster to which each example belongs.

The authors test their methodology on twenty datasets from the University of California at Irvine repository, using two simple classifiers: Naive Bayes and a Support Vector Machine with a polynomial kernel of degree one. 

The results above indicate that the complexity of a simple classifier, as measured by the VC dimension, grows at a slower rate with increased boundaries than with more flexible boundaries. 

The authors propose an approach to improve the accuracy of simple classifiers through a pre-processing step that applies a clustering algorithm over examples belonging to the same class. 

When clustering does not seem to improve performance, their approach simply reverts the effects of clustering, leaving the original dataset intact. 

Next the authors start looking for pairs of clusters (e.g., {c^j_1, c^j_2}) and compute predictive accuracy assuming the two clusters on each pair are mapped to the same index.

The authors consider the case where a classifier defines a discriminant function for each class, g_j(x), j = 1, 2, ..., k, and chooses the class corresponding to the discriminant function with the highest value (ties are broken arbitrarily): h(x) = y_m iff g_m(x) ≥ g_j(x) (Eq. 1). Possibly the simplest case is that of a linear discriminant function, where the approximation is based on a linear model: g_j(x) = w_0 + Σ_{i=1}^{n} w_i x_i (Eq. 2), where each w_i, 0 ≤ i ≤ n, is a coefficient that must be learned by the classification algorithm.

Here the authors provide evidence showing that their approach increases the representational power of φ in small steps to avoid a large increase in variance.