
Class Decomposition Via Clustering:
A New Framework For Low-Variance Classifiers
Ricardo Vilalta, Murali-Krishna Achari, and Christoph F. Eick
Department of Computer Science
University of Houston
Houston TX, 77204-3010, USA
{vilalta, amkchari, ceick}@cs.uh.edu
Abstract
In this paper we propose a pre-processing step to clas-
sification that applies a clustering algorithm to the training
set to discover local patterns in the attribute or input space.
We demonstrate how this knowledge can be exploited to en-
hance the predictive accuracy of simple classifiers. Our fo-
cus is mainly on classifiers characterized by high bias but
low variance (e.g., linear classifiers); these classifiers ex-
perience difficulty in delineating class boundaries over the
input space when a class distributes in complex ways. De-
composing classes into clusters makes the new class distri-
bution easier to approximate and provides a viable way to
reduce bias while limiting the growth in variance. Experi-
mental results on real-world domains show an advantage
in predictive accuracy when clustering is used as a pre-
processing step to classification.
1 INTRODUCTION
Classification and clustering stand as central techniques
in data analysis. Classification aims at deriving a predic-
tion model from labelled data. The model is intended to
capture correlations between the feature variables and the
target variable to predict the class label of new data objects.
Clustering is a useful tool in revealing patterns in unlabelled
data; the goal is to discover how data objects gather into nat-
ural groups. The work described in this paper explores how
classification algorithms can benefit from class density in-
formation that is obtained using clustering. These informa-
tion can be exploited to improve the quality of the decision
boundaries during classification and enhance the prediction
accuracy of simple classifiers. We demonstrate how using
classification and clustering techniques in conjunction ad-
dresses key issues in learning theory (e.g., locality vs ca-
pacity or bias vs variance) and provides an attractive new
family of classification models.
Our goal is to exploit the information derived from a
clustering algorithm to increase the complexity of sim-
ple classifiers characterized by low variance and high bias.
These algorithms, commonly referred to as model-based or
parametric-based, encompass a small class of approximat-
ing functions and exhibit limited flexibility in their deci-
sion boundaries. Examples include linear classifiers, prob-
abilistic classifiers based on the attribute-independence as-
sumption (e.g., Naive Bayes), and single logical rules. The
question we address is how to increase the complexity of
these classifiers to trade off bias for variance in an effective
manner. Since these models start off with simple represen-
tations, increasing their complexity is expected to improve
their generalization performance while still retaining their
ability to output models amenable to interpretation.
Our approach consists of increasing the degree of com-
plexity of the decision boundaries of a simple classifier by
augmenting the number of boundaries per class. The idea
is to transform the classification problem by decomposing
each class into clusters. By relabelling the examples cov-
ered by each cluster with a new class label, the simple
classifier generates an increased number of boundaries per
class, and is then armed to cope with complex distributions
where classes cover different regions of the input space. Not
every cluster is relabelled with a new class; our algorithm
explores the space of possible new class assignments in a
greedy manner, maximizing predictive accuracy. In sum-
mary, our approach comprises the following modules:
1. A pre-processing step to classification that consists of
clustering examples that belong to the same class. This
identifies regions of high class density.
2. A search for a configuration of class assignments over
the set of clusters that optimizes predictive accuracy.
This increases the number of decision boundaries per
class.

3. A function that maps the predicted class label of a test
example to one of the original classes. This transforms
the auxiliary set of new classes into the original set of
classes.
We test our methodology on twenty datasets from the
University of California at Irvine repository, using two sim-
ple classifiers: Naive Bayes and a Support Vector Machine
with a polynomial kernel of degree one. Results show a
significant increase in predictive accuracy when our class-
decomposition approach is applied (Section 6). To con-
clude, empirical results support our goal statement that pre-
identifying local patterns in the data through clustering is a
helpful tool in improving the performance of simple classi-
fiers.
The paper organization is described next. Section 2 in-
troduces background information and our problem state-
ment. Section 3 details our class decomposition approach
via clustering to improve the performance of simple classi-
fiers. Section 4 uses the VC dimension to understand the in-
crease in representational power gained with our approach.
Section 5 reviews related work. Section 6 reports an empir-
ical assessment of our approach. Finally, Section 7 states
our summary and future work.
2 PROBLEM STATEMENT
2.1 SIMPLE DISCRIMINANT FUNCTIONS
Let (X_1, X_2, ..., X_n) be an n-component vector-valued random variable, where each X_i represents an attribute or feature; the space of all possible attribute vectors is called the attribute or input space X. Let {y_1, y_2, ..., y_k} be the possible classes, categories, or states of nature; the space of all possible classes is called the output space Y. A classifier receives as input a set of training examples T = {(x, y)}, where x = (x_1, x_2, ..., x_n) is a vector or point in the input space (x_i is the value of attribute X_i) and y is a point in the output space. The outcome of the classifier is a function h (or hypothesis) mapping the input space to the output space, h : X → Y.
We consider the case where a classifier defines a discriminant function for each class, g_j(x), j = 1, 2, ..., k, and chooses the class corresponding to the discriminant function with the highest value (ties are broken arbitrarily):

h(x) = y_m  iff  g_m(x) ≥ g_j(x) for all j    (1)
Possibly the simplest case is that of a linear discriminant function, where the approximation is based on a linear model:

g_j(x) = w_0 + Σ_{i=1}^{n} w_i x_i    (2)

where each w_i, 0 ≤ i ≤ n, is a coefficient that must be learned by the classification algorithm.
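To make Eqs. (1) and (2) concrete, the short sketch below evaluates one linear discriminant per class and predicts the class whose discriminant is largest. It is a minimal numerical illustration, not code from the paper; the weight values are invented.

import numpy as np

# One weight vector per class: g_j(x) = w_0 + sum_i w_i * x_i  (Eq. 2).
# The weights below are arbitrary and purely illustrative.
W = np.array([[ 0.5,  1.0, -2.0],    # class 0: w_0, w_1, w_2
              [-0.3, -1.5,  0.8]])   # class 1: w_0, w_1, w_2

def predict(x):
    scores = W[:, 0] + W[:, 1:] @ x   # one discriminant value g_j(x) per class
    return int(np.argmax(scores))     # Eq. (1): choose the largest discriminant

print(predict(np.array([0.2, -0.7])))  # index of the winning class (here 0)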
We will also consider probabilistic classifiers where the discriminant functions are proportional to the posterior probabilities of a class given the input vector x, P(y_j|x). The classifier, also known as Naive Bayes, assumes feature independence given the class [7]:

g_j(x) = P(y_j) Π_{i=1}^{n} P(x_i|y_j)    (3)

where P(y_j) is the a priori probability of class y_j, and Π_{i=1}^{n} P(x_i|y_j) is a simple product approximation of P(x|y_j), called the likelihood or class-conditional probability.
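Eq. (3) can be evaluated just as directly. The sketch below assumes discrete attributes with pre-estimated priors and per-attribute likelihoods (the numbers are invented for illustration) and, as is common in practice, works with log-probabilities to avoid numerical underflow; the paper itself does not prescribe this detail.

import math

# Invented estimates for a two-class problem with two binary attributes.
prior = {"pos": 0.6, "neg": 0.4}                        # P(y_j)
likelihood = {                                          # P(x_i | y_j)
    "pos": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "neg": [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}

def g(y, x):
    # Eq. (3) in log form: log P(y_j) + sum_i log P(x_i | y_j)
    return math.log(prior[y]) + sum(math.log(likelihood[y][i][v])
                                    for i, v in enumerate(x))

x = (1, 0)
print(max(prior, key=lambda y: g(y, x)))   # class with the highest discriminant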
2.2 THE BIAS-VARIANCE TRADEOFF
Simple discriminant functions tend to output poor func-
tion approximations when the data distributes in complex
ways. Our goal is to increase the complexity of simple clas-
sifiers to obtain better function approximations. Since our
training set comprises a limited number of examples and we
do not know the form of the true target distribution, our goal
is inevitably subject to the bias-variance dilemma in statisti-
cal inference [9, 10]. The dilemma is based on the fact that
prediction error can be decomposed into a bias and a variance component¹; ideally we would like to have classifiers with low bias and low variance but these components are inversely related.
On the one hand, simple classifiers, commonly referred
to as model-based or parametric-based –and the subject of
our study–, encompass a small class of approximating func-
tions and exhibit limited flexibility on their decision bound-
aries. Their small repertoire of functions produces high
bias (since the best approximating function may lie far from
the target function) but low variance (since there is little
dependence on local irregularities in the data). Examples
include linear classifiers, probabilistic classifiers such as
Naive Bayes, and single logical rules.
On the other hand, increasing the complexity of the clas-
sifier reduces the bias but increases the variance. Complex
classifiers, also referred to as model free or parametric-free,
encompass a large class of approximating functions; they
exhibit flexible decision boundaries (low bias) but are sen-
sitive to small variations in the data (high variance). Exam-
ples include neural networks with a large number of hidden
units and k-nearest neighbor classifiers with small values
for k.
Our problem statement can be rephrased as follows: how
can we decrease the bias (i.e., increase the complexity) of
our simple classifiers without drastically increasing the vari-
ance component? Notice that our goal sets forth in a direction orthogonal to combination methods like bagging [5] and boosting [8], where the goal is to reduce the variance component in generalization error by voting on variants of the training data.

¹ A third component, the irreducible error or Bayes error, cannot be eliminated or traded off.

Figure 1. (a) A high-order polynomial improves the classification of a linear classifier at the expense of increased variance. (b) Increasing the number of linear discriminants guided by local patterns increases complexity with lower impact on variance.
2.3 INCREASING COMPLEXITY THROUGH
ADDITIONAL BOUNDARIES
Our solution is to exploit information about the distribu-
tion of examples through a pre-processing step that iden-
tifies natural clusters in data. As an illustration, Figure 1
shows a two-dimensional input space with two classes (positive + and negative −). The distribution of examples pre-
cludes a simple linear classifier attaining good performance
(Figure 1a, bold line). One way to increase the complex-
ity of the classifier is to enlarge the original space of linear
combinations to allow for more flexibility on the decision
boundaries, for example by adding higher order polynomi-
als (Figure 1a, dashed line). But this comes at the expense
of increased variance and possibly data overfitting.
Alternatively, one can retain the same space of linear
functions but increase the number of decision boundaries
per class (Figure 1b). This increases the complexity of the
classifier but with less impact on variance (Section 4). The
trick lies in identifying regions of high class density within subsets of examples of the same class, which we accomplish through clustering. The next sections provide a detailed description of our approach.
3 CLASS DECOMPOSITION VIA CLUSTERING
Our solution comprises three modules: 1) a decomposi-
tion of classes into clusters; 2) a search for an optimal class
assignment configuration; and 3) a function mapping pre-
dictions to the original set of class labels. We explain each
module in turn.
Algorithm 1: Mapping-Process
Input: clustering method C, dataset T
Output: new dataset T'

MAPPING-PROCESS(C, T)
(1)   Separate T into subsets {T_j},
(2)     where T_j = {(x, y) ∈ T | y = y_j}
(3)   foreach T_j
(4)     Apply clustering C on T_j
(5)     Let {c^j_p} be the set of clusters
(6)     foreach example e = (x, y_j)
(7)       Let p be the cluster index for x
(8)       Create example e' = (x, y'_j),
(9)         where y'_j = (y_j, p)
(10)      Add e' to T'
(11)    end
(12)  end
(13)  return T'

Figure 2. The process to transform dataset T into a new dataset T' using a clustering algorithm.
3.1 CLASS DECOMPOSITION
The first module pre-processes the training data by clus-
tering examples that belong to the same class as shown in
Algorithm 1 (Figure 2). We proceed by first separating
dataset T into sets of examples of the same class. That is, T is separated into different sets of examples T = {T_j}, where each T_j comprises all examples in T labelled with class y_j, T_j = {(x, y) ∈ T | y = y_j}.
For each set T_j we apply a clustering algorithm C to find sets of examples (i.e., clusters) grouped together according to some distance metric over the input space². Let {c^j_i} be the set of such clusters. We map the set of examples in T_j into a new set T'_j by renaming every class label to indicate not only the class but also the cluster to which each example belongs. One simple way to do this is by making each class label a pair (a, b), where the first element represents the original class and the second element represents the cluster that the example falls into. In that case, T'_j = {(x, y'_j)}, where y'_j = (y_j, i) whenever example x is assigned to cluster c^j_i.

² We consider a flat type of clustering (as opposed to hierarchical) where each object is assigned to exactly one cluster.

Figure 3. The mapping process relabels examples to encode both class and cluster.
An illustration of the transformation above is shown in
Figure 3. We assume a two-dimensional input space where
examples belong to either class positive (+) or negative (−).
Let’s suppose the clustering algorithm separates class pos-
itive into two clusters, while class negative is grouped into
one single cluster. The transformation relabels every exam-
ple to encode class and cluster label. As a result, dataset T' now has three different classes. Finally, the new dataset T' is simply the union of all sets of examples of the same class, relabelled according to the cluster to which each example belongs: T' = ∪_{j=1}^{k} T'_j.
In summary, the first module maps training set T into another dataset T' through a class-decomposition process. The mapping leaves the input space X intact but changes the output space Y into a (possibly) larger space Y' (i.e., |Y'| ≥ |Y|, where |·| denotes the cardinality of a space).
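For concreteness, below is a minimal Python sketch of this first module. It uses scikit-learn's KMeans as a stand-in for the clustering method C (the experiments reported for this work use EM clustering, so this choice is purely illustrative), assumes a fixed number of clusters per class, and encodes the pair (y_j, p) as a string label so the result can be fed to any off-the-shelf classifier.

import numpy as np
from sklearn.cluster import KMeans

def mapping_process(X, y, n_clusters=2, random_state=0):
    """Relabel each example with a (class, cluster) pair (Algorithm 1).

    Illustrative sketch only: KMeans stands in for the clustering method C,
    a fixed number of clusters per class is assumed, and the pair (y_j, p)
    is encoded as the string "y_j|p".
    """
    X, y = np.asarray(X), np.asarray(y)
    y_new = np.empty(len(y), dtype=object)
    for y_j in np.unique(y):                       # T_j = {(x, y) in T | y = y_j}
        idx = np.where(y == y_j)[0]
        k = min(n_clusters, len(idx))              # guard against tiny classes
        clusters = KMeans(n_clusters=k, n_init=10,
                          random_state=random_state).fit_predict(X[idx])
        for i, p in zip(idx, clusters):            # y'_j = (y_j, p)
            y_new[i] = f"{y_j}|{p}"
    return y_new                                   # labels of the new dataset T'

Training a simple classifier on (X, mapping_process(X, y)) then gives it one decision boundary per cluster rather than one per class, which is the effect sought in Section 2.3.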
3.2 A SEARCH FOR THE OPTIMAL CLASS
ASSIGNMENT
Increasing the number of classes according to the num-
ber of induced clusters does not always yield optimal per-
formance. As an illustration, Figure 4 shows a distribution
of examples where the positive class decomposes into three
clusters. Constructing a linear classifier separately on each
cluster generates decision boundaries that cause misclassi-
fications (top positive clusters in Fig. 4, bold lines). One
solution is to maintain the lower cluster while reverting part
of the decomposition process by merging the top clusters
into one cluster. This creates a decision boundary (Fig. 4,
dashed line) that allows separating the two classes without
errors.
Figure 4. An example where merging clusters can further improve predictive accuracy.

Our second module explores the space of possible ways to merge clusters derived from the first step. Following the same notation as before, a class label will be a pair (a, b), where the first element represents the original class label and the second element represents the cluster that the example falls into; but the difference now is that two or more clusters may correspond to the same second element (i.e., element b), which can be interpreted as having clusters merged into a single cluster. In Figure 4, for example, the class decomposition process (module 1) produces four new class labels: (+, 1), (+, 2), (+, 3), and (−, 1). If we observe an increase in predictive accuracy by merging the two top positive clusters into one positive cluster, module 2 would recommend a class assignment based on three labels only: (+, 1), (+, 2), and (−, 1), which is better suited for a simple classifier.
Our goal is to explore the space of possible ways to
merge clusters obtained during the first step, until we find
a configuration that maximizes predictive accuracy (over a
validation set different from the training set). The space of
possible configurations corresponds to the space of all sub-
sets of clusters, with each subset being assigned the same
cluster index (i.e., being assigned the same class label). Ob-
viously one cannot explore this space exhaustively. If class y_j is decomposed into n_j clusters, the number of different configurations has an upper bound of O(2^{n_j}). To avoid an exhaustive search we follow a heuristic greedy approach.
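For instance, a class decomposed into n_j = 10 clusters already admits on the order of 2^10 = 1024 candidate subsets, each of which would require retraining and re-evaluating the classifier.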
Figure 5, Algorithm 2, describes our approach. The
search starts by evaluating predictive accuracy assuming
each cluster is mapped to a separate index. Next we start
looking for pairs of clusters (e.g., {c^j_1, c^j_2}) and compute
predictive accuracy assuming the two clusters on each pair
are mapped to the same index. We then take those pairs
for which predictive accuracy increased and rank them ac-
cordingly. To enforce a mutually exclusive list of clusters
we prune every cluster pair where one cluster appears on
another pair with higher rank.
Next we construct 3-element cluster sets by adding sin-
gle clusters to the remaining 2-element cluster sets found

Algorithm 2: Merge-Clusters-Process
Input: initial dataset T'
Output: modified dataset T'

MERGE(T')
(1)   foreach class y_j
(2)     Let C_j = {c^j_i} be the set of clusters
(3)     Let L_1 = C_j
(4)     Let i = 2 be the search level
(5)     repeat
(6)       L_i ← form subsets of clusters of size i
(7)             by combining L_{i-1} with C_j
(8)       Evaluate and rank all new subsets
(9)       Prune lower-rank subsets with
(10)            duplicated clusters
(11)      i ← i + 1
(12)    until no accuracy improvement
(13)    T' ← change T' such that examples covered
(14)         by clusters within the same subset have the same
(15)         class label
(16)  end
(17)  return T'

Figure 5. Improving predictive accuracy by merging clusters of examples on each class.
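The listing above can be rendered in Python roughly as follows. This is a simplified, hypothetical sketch for a single class: the evaluate callable, which must retrain the simple classifier under a candidate merge and return accuracy on a validation set, is assumed to be supplied by the caller and is not shown, and repeated evaluations of the same candidate are not cached.

def greedy_merge(clusters, evaluate, baseline):
    """Level-wise greedy search over cluster subsets for one class (Algorithm 2).

    clusters : cluster indices of the class under analysis.
    evaluate : hypothetical callable; given a subset of clusters to be merged
               under one label, it returns validation accuracy.
    baseline : accuracy when every cluster keeps its own label.
    """
    level = [(c,) for c in clusters]               # L_1: singleton subsets
    accepted = []                                  # improving subsets, all levels
    while True:
        # Form subsets of size i+1 by combining the previous level with C_j,
        # keep those that improve accuracy, rank them, and prune any subset
        # that overlaps a higher-ranked one (mutually exclusive list).
        scored = []
        for subset in level:
            for c in clusters:
                if c not in subset:
                    cand = tuple(sorted(set(subset) | {c}))
                    acc = evaluate(cand)
                    if acc > baseline:
                        scored.append((acc, cand))
        if not scored:
            break                                  # no accuracy improvement
        scored.sort(reverse=True)
        used, level = set(), []
        for _, cand in scored:
            if not used.intersection(cand):
                level.append(cand)
                used.update(cand)
        accepted.extend(level)
    # Final pruning: drop lower-cardinality subsets overlapping a larger one.
    accepted.sort(key=len, reverse=True)
    final, used = [], set()
    for subset in accepted:
        if not used.intersection(subset):
            final.append(subset)
            used.update(subset)
    final.extend((c,) for c in clusters if c not in used)   # unmerged clusters
    return final                                   # each subset gets one class label

Under the accuracy improvements assumed in the worked example below, such a search returns the subsets {c_2, c_3, c_5}, {c_1, c_4}, and {c_6}.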
in the previous step, and evaluate their predictive accuracy
(now assuming that all three clusters are mapped to the same
index). We keep those for which predictive accuracy in-
creased and apply pruning as we described before. The al-
gorithm terminates when no new cluster sets of higher car-
dinality can be produced from the cluster sets in the previ-
ous iteration. Finally, we prune any lower cardinality cluster
sets that have a cluster in common (i.e., that overlap) with
a higher cardinality set. At that point we assign the clusters
on each subset the same index (i.e., the same class label).
As an illustration, assume class y_1 decomposes into six different clusters. Initially each cluster forms a unique set and is assigned a different index: {c^1_1}, {c^1_2}, ..., {c^1_6}. Now assume the following cluster pairs show improvement in predictive accuracy (ranked accordingly): {c^1_2, c^1_3}, {c^1_1, c^1_4}, {c^1_1, c^1_5}. The last cluster pair is eliminated since cluster c^1_1 appears on a higher-ranking pair. At the next level assume the following 3-element cluster set is obtained: {c^1_2, c^1_3, c^1_5}. If no more cluster sets are produced, then the last step simply prunes lower-cardinality cluster sets that have a cluster in common. The final configuration indicates how clusters are merged together. For example, {c^1_2, c^1_3, c^1_5}, {c^1_1, c^1_4}, {c^1_6} indicates that clusters two, three, and five are merged into a single cluster, the same holding true for clusters one and four; cluster six is not merged. The final training set divides class y_1 into three new categories.
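Applying a final configuration such as this one to the relabelled data amounts to mapping every cluster index to the index of the subset that contains it; a small sketch:

# Final configuration for class y_1 in the example above: clusters 2, 3 and 5
# merged, clusters 1 and 4 merged, cluster 6 left on its own.
configuration = [(2, 3, 5), (1, 4), (6,)]

# Map each original cluster index to the index of its merged subset.
merged_index = {c: s for s, subset in enumerate(configuration) for c in subset}

print(merged_index[3])   # -> 0: an example labelled (y_1, 3) becomes (y_1, 0)
print(merged_index[4])   # -> 1
print(merged_index[6])   # -> 2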
3.3 IMPROVING COMPUTATIONAL EFFICIENCY
Clearly, searching over the space of cluster subsets de-
mands excessive computational power. To ease the burden of repeatedly estimating predictive accuracy, we note that changing class assignments over the clusters of a particular class
does not affect the discriminant functions corresponding to
other classes. Therefore in estimating predictive accuracy
one can keep all discriminant functions fixed except for the
one corresponding to the class under analysis. This reduces
the computational cost of our approach by a factor propor-
tional to the number of classes.
3.4 CLASSIFICATION OF EXAMPLES
Our last module shows how to assess the performance
of the linear classifier over the extended output space. This
is necessary during the search over the space of subsets of
clusters (Section 3.2), and while estimating final predictive
accuracy.
During learning, the simple classifier is trained over dataset T', producing a hypothesis h' mapping points from the input space X to the new output space Y'. During classification, hypothesis h' will output a prediction consisting of a class label and a cluster label, h'(x) = (a, b). To know the actual prediction in the original output space Y we simply remove the cluster index. Essentially, we predict class label y_j whenever example x is assigned to any of the clusters (or subsets of clusters) of class y_j. As an illustration, assume the prediction of an example x is h'(x) = (−, 1); then our final prediction simply disregards the cluster index and assigns x to the negative class.
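Since each new class label carries the original class inside it, the mapping back to Y is a one-liner. A minimal sketch, assuming the "class|cluster" string encoding used in the Section 3.1 sketch:

def decode(prediction):
    """Map a prediction in Y' (encoded as "class|cluster") back to Y."""
    original_class, _cluster = prediction.rsplit("|", 1)
    return original_class

print(decode("-|1"))   # -> "-": the negative class; the cluster index is discarded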
Our class decomposition process aims at eliminating dis-
tributions unfavorable to simple classifiers where a class
spreads out into multiple regions. As each cluster (or group
of clusters) is transformed into a class of its own, each class
sits in a tight region and becomes easier to separate away
using simple decision boundaries.
4 COMPLEXITY AND THE VC DIMENSION
In this section we use a measure of complexity known
as the VC dimension to compare the increase in representa-
tional power gained by augmenting the number of decision
boundaries of a simple classifier (our approach) to the in-
crease gained by augmenting the flexibility of the decision
boundaries. Consider that a simple classifier has a small
class of functions φ from which to draw a hypothesis. If
we wish to make our class φ stronger we must increase
the representational power of its member functions. Re-
call, however, that adding too much representational power

References
V. N. Vapnik. The Nature of Statistical Learning Theory.
J. R. Quinlan. C4.5: Programs for Machine Learning.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Frequently Asked Questions (20)
Q1. How can a learning algorithm look for a good model?

In tradeoffs such as bias vs. variance or capacity vs. empirical risk, a learning algorithm can look for a good model by first trying large complexity steps, for example by increasing the degree of the class of polynomials.

In this paper the authors propose a pre-processing step to classification that applies a clustering algorithm to the training set to discover local patterns in the attribute or input space. The authors demonstrate how this knowledge can be exploited to enhance the predictive accuracy of simple classifiers. Decomposing classes into clusters makes the new class distribution easier to approximate and provides a viable way to reduce bias while limiting the growth in variance. 

Future work will look for ways to improve the computational efficiency of their approach (as suggested in Section 3.3). Future work will also address the feasibility of dynamically varying the growth rate of the complexity of the class of functions during model selection. Such a model can then be refined using smaller complexity steps by augmenting the number of classifiers per class, as suggested in their approach.

Their class decomposition process aims at eliminating distributions unfavorable to simple classifiers where a class spreads out into multiple regions. 

The model is intended to capture correlations between the feature variables and the target variable to predict the class label of new data objects. 

The clustering algorithm follows the Expectation Maximization (EM) technique [14]; it groups examples into clusters by modelling each cluster through a probability density function. 

The trick lies in identifying regions of high class density within subsets of examples of the same class, which the authors accomplish through clustering.

If no more cluster sets are produced,then the last step simply prunes lower cardinality cluster sets that have a cluster in common. 

This has the advantage of using all examples belonging to the same class for analysis, whereas in decision tree induction, the continuous partitioning of the data progressively lessens the statistical support of every decision rule, an effect known as the fragmentation problem. 

Implementations of the SVM, Naive Bayes, and EM-clustering are part of the WEKA machine-learning class library [18], set with default values. 

One limitation of their approach is the amount of CPU time necessary to find the best class-assignment configuration (Section 3.2). 

One way to increase the complexity of the classifier is to enlarge the original space of linear combinations to allow for more flexibility on the decision boundaries, for example by adding higher order polynomials (Figure 1a, dashed line). 

The authors map the set of examples in T_j into a new set T'_j by renaming every class label to indicate not only the class but also the cluster to which each example belongs.

The authors test their methodology on twenty datasets from the University of California at Irvine repository, using two simple classifiers: Naive Bayes and a Support Vector Machine with a polynomial kernel of degree one. 

The results above indicate that the complexity of a simple classifier, as measured by the VC dimension, grows at a slower rate with increased boundaries than with more flexible boundaries. 

The authors propose an approach to improve the accuracy of simple classifiers through a pre-processing step that applies a clustering algorithm over examples belonging to the same class. 

When clustering does not seem to improve performance, their approach simply reverts the effects of clustering, leaving the original dataset intact. 

Next the authors start looking for pairs of clusters (e.g., {c^j_1, c^j_2}) and compute predictive accuracy assuming the two clusters on each pair are mapped to the same index.

The authors consider the case where a classifier defines a discriminant function for each class, g_j(x), j = 1, 2, ..., k, and chooses the class corresponding to the discriminant function with the highest value (ties are broken arbitrarily): h(x) = y_m iff g_m(x) ≥ g_j(x) (Eq. 1). Possibly the simplest case is that of a linear discriminant function, where the approximation is based on a linear model: g_j(x) = w_0 + Σ_{i=1}^{n} w_i x_i (Eq. 2), where each w_i, 0 ≤ i ≤ n, is a coefficient that must be learned by the classification algorithm.

Here the authors provide evidence showing that their approach increases the representational power of φ in small steps to avoid a large increase in variance.