Ensembles of Nested Dichotomies for Multi-Class Problems
Eibe Frank eibe@cs.waikato.ac.nz
Department of Computer Science, University of Waikato, Hamilton, New Zealand
Stefan Kramer kramer@in.tum.de
Institut für Informatik, Technische Universität München, München, Germany
Abstract

Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multi-class classification task into a system of dichotomies and provide a statistically sound way of applying two-class learning algorithms to multi-class problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multi-class problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to error-correcting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good general-purpose method for applying binary classifiers to multi-class problems.
Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
1. Introduction
A system of nested dichotomies (Fox, 1997) is a binary tree that recursively splits a set of classes from a multi-class classification problem into smaller and smaller subsets. In statistics, nested dichotomies are a standard technique for tackling polytomous (i.e. multi-class) classification problems with logistic regression by fitting binary logistic models to the individual dichotomous (i.e. two-class) classification problems at the tree's internal nodes, and a hierarchical decomposition of classes has also been considered by authors in neighboring areas (Goodman, 2001; Bengio, 2002). However, nested dichotomies are only recommended if a "particular choice of dichotomies is substantively compelling" (Fox, 1997) based on domain knowledge. There are usually many possible tree structures that can be generated for a given set of classes, and in many practical applications, namely where the class is truly a nominal quantity and does not exhibit any structure, there is no a priori reason to prefer one particular tree structure over another. In that case it makes sense to assume that every hierarchy of nested dichotomies is equally likely and to use an ensemble of these hierarchies for prediction. This is the approach we propose and evaluate in this paper.

Using C4.5 and logistic regression as base learners, we show that ensembles of nested dichotomies produce more accurate classifications than applying these learners directly to multi-class problems. We also show that they compare favorably to three other popular techniques for converting a multi-class classification task into a set of binary classification problems: the simple "one-vs-rest" method, error-correcting output codes (Dietterich & Bakiri, 1995), and pairwise classification (Fürnkranz, 2002). More specifically, we show that ensembles of nested dichotomies produce more accurate classifiers than the one-vs-rest method for both C4.5 and logistic regression; that they are more accurate than pairwise classification in the case of C4.5, and comparable in the case of logistic regression; and that, compared to error-correcting output codes, nested dichotomies have a distinct edge if logistic regression is used, and are on par if C4.5 is employed. In addition, they have the nice property that they do not require any form of post-processing to return class probability estimates. They do have the drawback that they require the base learner to produce class probability estimates, but this is not a severe limitation given that most practical learning algorithms are able to do so or can be made to do so.

This paper is structured as follows. In Section 2 we describe more precisely how nested dichotomies work. In Section 3 we present the idea of using ensembles of nested dichotomies. In Section 4 this approach is evaluated and compared to other techniques for tackling multi-class problems. Related work is discussed in Section 5. Section 6 summarizes the main findings of the paper.
2. Nested Dichotomies
Nested dichotomies can be represented as binary trees that, at each node, divide the set of classes A associated with the node into two subsets B and C that are mutually exclusive and taken together contain all the classes in A. The root node of a nested dichotomy contains all the classes of the corresponding multi-class classification problem, and each leaf node contains a single class (i.e. for an n-class problem, there are n leaf nodes and n − 1 internal nodes). To build a classifier based on such a tree structure we do the following: at every internal node we store the instances pertaining to the classes associated with that node, and no other instances; then we group the classes pertaining to each node into two subsets, so that each subset holds the classes associated with exactly one of the node's two successor nodes; and finally we build binary classifiers for the resulting two-class problems. This process creates a tree structure with binary classifiers at the internal nodes.
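To make the construction concrete, here is a minimal structure-only sketch in Python (ours, not the paper's code): a nested dichotomy is written as nested tuples of class labels, and the function lists the two-class problem that the binary classifier at each internal node has to solve. Fitting actual models to these dichotomies is sketched in Section 3.

```python
# Minimal sketch (not from the paper): a nested dichotomy over classes
# {1, 2, 3, 4} as nested tuples, and the dichotomy at each internal node.

def leaves(tree):
    """Set of class labels below a (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def dichotomies(tree):
    """Yield (left classes, right classes) for every internal node."""
    if isinstance(tree, tuple):
        left, right = tree
        yield leaves(left), leaves(right)
        yield from dichotomies(left)
        yield from dichotomies(right)

# A tree that first separates {1,2} from {3,4}, then the single classes.
tree = ((1, 2), (3, 4))
for left, right in dichotomies(tree):
    print(sorted(left), "vs", sorted(right))
# [1, 2] vs [3, 4]
# [1] vs [2]
# [3] vs [4]
```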
We assume that the binary classifiers produce class probability estimates. For example, they could be logistic regression models. The question is how to combine the estimates from the individual two-class problems to obtain class probability estimates for the original multi-class problem. It turns out that the individual dichotomies are statistically independent because they are nested (Fox, 1997), enabling us to form multi-class probability estimates simply by multiplying together the probability estimates obtained from the two-class models. More specifically, let C_{i1} and C_{i2} be the two subsets of classes generated by a split of the set of classes C_i at internal node i of the tree (i.e. the subsets associated with the successor nodes), and let p(c ∈ C_{i1} | x, c ∈ C_i) and p(c ∈ C_{i2} | x, c ∈ C_i) be the conditional probability distribution estimated by the two-class model at node i for a given instance x. Then the estimated class probability distribution for the original multi-class problem is given by:

$$p(c = C \mid x) = \prod_{i=1}^{n-1} \big( I(c \in C_{i1}) \, p(c \in C_{i1} \mid x, c \in C_i) + I(c \in C_{i2}) \, p(c \in C_{i2} \mid x, c \in C_i) \big),$$

where I(.) is the indicator function, and the product is over all the internal nodes of the tree.
Note that not all nodes actually have to be examined to compute this probability for a particular class value: evaluating the path to the leaf associated with that class is sufficient. Let p(c ∈ C_{i1} | x, c ∈ C_i) and p(c ∈ C_{i2} | x, c ∈ C_i) be the labels of the edges connecting node i to the nodes associated with C_{i1} and C_{i2} respectively. Then computing p(c | x) amounts to finding the single path from the root to a leaf for which c is in the set of classes associated with each node along the path, and multiplying together the probability estimates encountered along the way.
Consider Figure 1, which shows two of the 15 possible nested dichotomies for a four-class classification problem. Using the tree in Figure 1a, the probability of class 4 for an instance x is given by

p(c = 4 | x) = p(c ∈ {3, 4} | x) × p(c ∈ {4} | x, c ∈ {3, 4}).

Based on the tree in Figure 1b we have

p(c = 4 | x) = p(c ∈ {2, 3, 4} | x) × p(c ∈ {3, 4} | x, c ∈ {2, 3, 4}) × p(c ∈ {4} | x, c ∈ {3, 4}).
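As a toy illustration with made-up two-class estimates (not values from the paper), the two path products can be evaluated directly:

```python
# Hypothetical two-class estimates for some instance x (made-up numbers).
# Tree (a): p(c=4|x) = p(c in {3,4} | x) * p(c in {4} | x, c in {3,4})
p_tree_a = 0.6 * 0.7                  # = 0.42
# Tree (b): p(c=4|x) = p(c in {2,3,4} | x) * p(c in {3,4} | x, c in {2,3,4})
#                        * p(c in {4} | x, c in {3,4})
p_tree_b = 0.8 * 0.7 * 0.7            # = 0.392
print(p_tree_a, p_tree_b)             # different trees, different estimates
```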
Both trees represent equally valid class probability estimators, like all other trees that can be generated for this problem. However, the estimates obtained from different trees will usually differ because they involve different two-class learning problems. If there is no a priori reason to prefer a particular nested dichotomy (e.g., because some classes are known to be related in some fashion), there is no reason to trust one of the estimates more than the others. Consequently it makes sense to treat all possible trees as equally likely and to form overall class probability estimates by averaging the estimates obtained from different trees. This is the approach we investigate in the rest of this paper.

[Figure 1: (a) a balanced tree whose root splits {1,2,3,4} into {1,2} and {3,4}, which are then split into the single classes; (b) a tree that first splits off {1} from {2,3,4}, then {2} from {3,4}, and finally separates {3} and {4}. Each edge is labelled with the corresponding conditional probability, e.g. p(c ∈ {3,4} | x) and p(c ∈ {4} | x, c ∈ {3,4}).]

Figure 1. Two different systems of nested dichotomies for a classification problem with four classes.
3. Ensembles of Nested Dichotomies
The number of possible trees for an n-class problem grows extremely quickly. It is given by the following recurrence relation:

$$T(n) = (2n - 3) \times T(n - 1),$$

where T(1) = 1, because there are 2(n − 1) − 1 = 2n − 3 distinct possibilities to add a new class to a tree for n − 1 classes (one for each node). Expanding the recurrence relation results in T(n) = (2n−3) × (2n−5) × ... × 3 × 1, and, using the double factorial, this can be written as T(n) = (2n−3)!!. For two classes we have T(2) = 1, for three T(3) = 3, for four T(4) = 15, and for five T(5) = 105.
The growth in the number of trees makes it impossible to generate them exhaustively in a brute-force manner even for problems with a moderate number of classes. This is the case even if we cache models for the individual two-class problems that are encountered when building each tree. (Note that different trees may exhibit some two-class problems that are identical.) There are (3^n − (2^{n+1} − 1))/2 possible two-class problems for an n-class dataset. The term 3^n arises because a class can be either in the first subset, in the second one, or absent; the term (2^{n+1} − 1) because we need to subtract all problems where either one of the two subsets is empty; and the factor 1/2 from the fact that the two resulting subsets can be swapped without any effect on the classifier. Hence there are 6 possible two-class problems for a problem with 3 classes, 25 for a problem with 4 classes, 90 for a problem with 5 classes, etc.
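Both counts are easy to check; a small Python verification (the function names are ours):

```python
# Quick check of the counting formulas above: T(n) = (2n-3)!! distinct trees
# and (3**n - (2**(n+1) - 1)) / 2 distinct two-class problems for n classes.
def num_trees(n):
    t = 1
    for m in range(3, n + 1):
        t *= 2 * m - 3
    return t

def num_two_class_problems(n):
    return (3 ** n - (2 ** (n + 1) - 1)) // 2

assert [num_trees(n) for n in (2, 3, 4, 5)] == [1, 3, 15, 105]
assert [num_two_class_problems(n) for n in (3, 4, 5)] == [6, 25, 90]
```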
Given these growth rates we chose to evaluate the performance of ensembles of randomly generated trees. (Of course, only the structure of each tree was generated randomly; we applied a standard learning scheme at each internal node of the randomly sampled trees.) More specifically, we took a random sample from the space of all distinct trees for a given n-class problem (for simplicity, based on sampling with replacement), and formed class probability estimates for a given instance x by averaging the estimates obtained from the individual ensemble members. Because of the uniform sampling process these averages form an unbiased estimate of the estimates that would have been obtained by building the complete ensemble of all possible distinct trees for the given n-class problem.
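The following sketch illustrates this scheme, with scikit-learn's LogisticRegression standing in for the paper's base learners (C4.5 and logistic regression in Weka); the class names, the node-wise random splitting used to sample a tree, and the omission of the model caching discussed above are assumptions of this sketch rather than details from the paper.

```python
# A minimal sketch of an ensemble of nested dichotomies (END).
import random
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


class NestedDichotomy:
    """One randomly structured system of nested dichotomies."""

    def __init__(self, base_estimator, rng):
        self.base, self.rng = base_estimator, rng

    def fit(self, X, y):
        self.tree_ = self._build(np.asarray(X), np.asarray(y), sorted(set(y)))
        return self

    def _build(self, X, y, classes):
        if len(classes) == 1:
            return {"classes": set(classes)}                 # leaf node
        shuffled = list(classes)
        self.rng.shuffle(shuffled)
        cut = self.rng.randint(1, len(shuffled) - 1)         # both subsets non-empty
        left, right = set(shuffled[:cut]), set(shuffled[cut:])
        z = np.isin(y, list(right)).astype(int)              # 0 = left subset, 1 = right
        model = clone(self.base).fit(X, z)                   # binary problem at this node
        return {"classes": left | right, "model": model,
                "children": (self._build(X[z == 0], y[z == 0], sorted(left)),
                             self._build(X[z == 1], y[z == 1], sorted(right)))}

    def prob(self, x, c, node=None):
        """p(c | x): multiply the two-class estimates along the path to class c."""
        node = self.tree_ if node is None else node
        if "model" not in node:                              # reached the leaf for c
            return 1.0
        p_left, p_right = node["model"].predict_proba(x.reshape(1, -1))[0]
        left_child, right_child = node["children"]
        if c in left_child["classes"]:
            return p_left * self.prob(x, c, left_child)
        return p_right * self.prob(x, c, right_child)


class EnsembleOfNestedDichotomies:
    """Average the estimates of n_members randomly sampled trees."""

    def __init__(self, n_members=20, base_estimator=None, seed=1):
        self.n_members = n_members
        self.base = base_estimator or LogisticRegression(max_iter=1000)
        self.rng = random.Random(seed)

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.members_ = [NestedDichotomy(self.base, self.rng).fit(X, y)
                         for _ in range(self.n_members)]
        return self

    def predict_proba_one(self, x):
        per_tree = np.array([[m.prob(np.asarray(x), c) for c in self.classes_]
                             for m in self.members_])
        return per_tree.mean(axis=0)

    def predict_one(self, x):
        return self.classes_[int(np.argmax(self.predict_proba_one(x)))]
```

For example, EnsembleOfNestedDichotomies(n_members=20).fit(X_train, y_train) would mirror the 20-member ensembles evaluated in Section 4.1.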
4. Empirical Comparison
We performed experiments with 21 multi-class datasets from the UCI repository (Blake & Merz, 1998), summarized in Table 1. Two learning schemes were employed: C4.5 and logistic regression, as implemented in Weka version 3.4.1 (Witten & Frank, 2000). We used these two because (a) they produce class probability estimates, (b) they inhabit opposite ends of the bias-variance spectrum, and (c) they can deal with multiple classes directly without having to convert a multi-class problem into a set of two-class problems (in the case of logistic regression, by optimizing the multinomial likelihood directly). The latter condition is important for testing whether any of the multi-class "wrapper" methods that we included in our experimental comparison can actually improve upon the performance of the learning schemes applied directly to the multi-class problems.

To compare the performance of the different learning schemes for each dataset, we estimated classification accuracy based on 50 runs of the stratified hold-out method, in each run using 66% of the data for training and the rest for testing.

Dataset         Instances   % Missing   Numeric atts   Nominal atts   Classes
anneal                898         0.0              6             32         6
arrhythmia            452         0.3            206             73        16
audiology             226         2.0              0             69        24
autos                 205         1.1             15             10         7
bal.-scale            625         0.0              4              0         3
ecoli                 336         0.0              7              0         8
glass                 214         0.0              9              0         7
hypothyroid          3772         6.0             23              6         4
iris                  150         0.0              4              0         3
letter              20000         0.0             16              0        26
lymph                 148         0.0              3             15         4
optdigits            5620         0.0             64              0        10
pendigits           10992         0.0             16              0        10
prim.-tumor           339         3.9              0             17        22
segment              2310         0.0             19              0         7
soybean               683         9.8              0             35        19
splice               3190         0.0              0             61         3
vehicle               846         0.0             18              0         4
vowel                 990         0.0             10              3        11
waveform             5000         0.0             40              0         3
zoo                   101         0.0              1             15         7

Table 1. Datasets used for the experiments.
We tested for significant differences in accuracy by using the corrected resampled t-test at the 5% significance level. This test has been shown to have Type I error at the significance level and low Type II error if used in conjunction with the hold-out method (Nadeau & Bengio, 2003).
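For reference, a hedged Python sketch of this test as it is commonly implemented for repeated hold-out results: k runs, d[i] the per-run accuracy difference between two schemes, and the variance inflated by the test/train ratio to account for overlapping training sets. The function name and the scipy dependency are assumptions of this sketch.

```python
# Corrected resampled t-test (Nadeau & Bengio, 2003), sketch.
import numpy as np
from scipy.stats import t as t_dist

def corrected_resampled_t_test(d, n_train, n_test):
    d = np.asarray(d, dtype=float)
    k = len(d)
    mean_d = d.mean()
    var_d = d.var(ddof=1)
    # Variance correction factor (1/k + n_test/n_train) accounts for the
    # overlap between training sets across the k runs.
    denom = np.sqrt((1.0 / k + n_test / n_train) * var_d)
    t_stat = mean_d / denom
    p_value = 2 * t_dist.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

# Example setup matching the paper: 50 runs with a 66%/34% split, so
# corrected_resampled_t_test(differences, n_train=0.66, n_test=0.34).
```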
In the first set of experiments, we compared ensembles of nested dichotomies (ENDs) with several other standard multi-class methods. In the second set we varied the number of ensemble members to see whether this has any impact on the performance of ENDs.
4.1. Comparison to other approaches for multi-class learning

In the first set of experiments we used ENDs consisting of 20 ensemble members (i.e. each classifier consisted of 20 trees of nested dichotomies) to compare to other multi-class schemes. As the experimental results in the next section will show, 20 ensemble members are often sufficient to get close to optimum performance. We used both C4.5 and logistic regression to build the ENDs. The same experiments were repeated for both standard C4.5 and polytomous logistic regression applied directly to the multi-class problems. In addition, the following other multi-class-to-binary conversion methods were compared with ENDs: one-vs-rest, pairwise classification, random error-correcting output codes, and exhaustive error-correcting output codes.
One-vs-rest creates n dichotomies for an n-class problem, in each case learning one of the n classes against all the other classes (i.e. there is one classifier for each class). At classification time, the class that gets the maximum probability from its corresponding classifier is predicted. Pairwise classification learns a classifier for each pair of classes, ignoring the instances pertaining to the other classes (i.e. there are n × (n − 1)/2 classifiers). A prediction is obtained by voting, where each classifier casts a vote for one of the two classes it was built from; the class with the maximum number of votes is predicted.
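For reference, both wrappers are available off the shelf, e.g. in scikit-learn; a minimal usage sketch (LogisticRegression is just an arbitrary probabilistic base learner here):

```python
# One-vs-rest and pairwise (one-vs-one) wrappers as provided by scikit-learn.
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

base = LogisticRegression(max_iter=1000)
ovr = OneVsRestClassifier(base)   # n binary classifiers, one per class
ovo = OneVsOneClassifier(base)    # n*(n-1)/2 classifiers, one per pair, combined by voting
# ovr.fit(X_train, y_train); ovo.fit(X_train, y_train)
```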
In error-correcting output codes (ECOCs), each class is assigned a binary code vector of length k; these code vectors make up the row vectors of a code matrix. The column vectors of the code matrix determine the set of k dichotomies to be learned. At prediction time, a vector of classifications is obtained by collecting the predictions from the k individual classifiers learned from the dichotomies. The original approach to ECOCs predicts the class whose corresponding row vector has minimum Hamming distance to the vector of 0/1 predictions obtained from the k classifiers (Dietterich & Bakiri, 1995). However, accuracy can be improved by using loss-based decoding (Allwein et al., 2000); in our case this means using the predicted class probabilities rather than the 0/1 predictions. For C4.5 we used the absolute error of the probability estimates as the loss function, and for logistic regression the negative log-likelihood. We verified that this indeed produced more accurate classifiers than the Hamming distance. (We also evaluated loss-based decoding for pairwise classification (Allwein et al., 2000) but did not observe a significant improvement over voting.)
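A minimal decoding sketch in Python, assuming a 0/1 code matrix with one row per class and per-dichotomy estimates P(bit = 1 | x); it shows Hamming decoding and the absolute-error variant of loss-based decoding used here for C4.5 (the general framework of Allwein et al. (2000) also covers other losses, e.g. the log-likelihood loss used for logistic regression):

```python
# ECOC decoding sketch: `codes` has one row per class and one column per
# learned dichotomy; p[j] is the estimated probability that bit j equals 1.
import numpy as np

def decode_hamming(codes, p):
    codes = np.asarray(codes)
    bits = (np.asarray(p) >= 0.5).astype(int)     # threshold the estimates
    distances = np.abs(codes - bits).sum(axis=1)
    return int(np.argmin(distances))              # index of predicted class row

def decode_absolute_error(codes, p):
    codes = np.asarray(codes)
    losses = np.abs(codes - np.asarray(p)).sum(axis=1)   # use probabilities directly
    return int(np.argmin(losses))
```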
Random error-correcting output codes (RECOCs) are based on the fact that random code vectors have good error-correcting properties. We used random code vectors of length k = 2 × n, where n is the number of classes. Code vectors consisting only of 0s or only of 1s were discarded. This results in a code matrix with row vectors of length 2 × n and column vectors of length n. Code matrices with column vectors exhibiting only 0s or only 1s were also discarded. In contrast to random codes, exhaustive error-correcting codes (EECOCs) are generated deterministically. They are maximum-length code vectors of length 2^{n−1} − 1, where the resulting dichotomies (i.e. column vectors) correspond to every possible n-bit configuration, excluding complements and vectors exhibiting only 0s or only 1s. We applied EECOCs to benchmark problems with up to 11 classes.
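A Python sketch of these two code-matrix constructions (resampling the whole random matrix until no row or column is trivial is a simplification of the discarding described above, not necessarily the paper's procedure):

```python
# Random (RECOC) and exhaustive (EECOC) code matrices for n classes:
# n rows (classes); 2*n columns for the random matrix, 2**(n-1) - 1 for
# the exhaustive one.
import itertools
import numpy as np

def random_code_matrix(n, rng=np.random.default_rng(1)):
    k = 2 * n
    while True:
        m = rng.integers(0, 2, size=(n, k))
        rows_ok = np.all((m.sum(axis=1) > 0) & (m.sum(axis=1) < k))
        cols_ok = np.all((m.sum(axis=0) > 0) & (m.sum(axis=0) < n))
        if rows_ok and cols_ok:                    # no all-0/all-1 row or column
            return m

def exhaustive_code_matrix(n):
    cols = []
    for bits in itertools.product([0, 1], repeat=n):
        col = np.array(bits)
        if col.sum() in (0, n):                    # skip all-0s and all-1s
            continue
        if any(np.array_equal(1 - col, c) for c in cols):   # skip complements
            continue
        cols.append(col)
    return np.column_stack(cols)                   # e.g. shape (4, 7) for n = 4
```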
Table 2 shows the results obtained for C4.5 and Table 3 those obtained for logistic regression (LR).

Dataset (#classes) END C4.5 1-vs-rest 1-vs-1 RECOCs EECOCs
anneal (6) 98.15±0.75 98.45±0.72 98.32±0.69 97.77±0.87 98.08±0.56 98.35±0.70
arrhythmia (16) 72.98±2.41 65.37±3.09 62.60±3.54 66.23±3.06 71.24±2.55
audiology (24) 78.23±3.70 77.91±3.19 65.66±6.48 77.03±4.11 80.49±3.68
autos (7) 73.77±3.73 73.20±5.56 71.15±5.10 65.62±6.20 69.18±6.34 75.09±5.06
balance-scale (3) 80.00±2.22 78.47±2.34 78.62±2.43 79.38±2.18 78.82±2.76 78.62±2.43
ecoli (8) 84.48±2.84 81.36±3.09 80.67±3.41 82.62±3.33 82.80±3.08 85.22±2.26
glass (7) 70.67±4.59 67.29±5.51 65.32±4.68 68.77±4.72 67.10±5.08 70.95±5.06
hypothyroid (4) 99.49±0.20 99.49±0.13 99.45±0.21 99.41±0.19 99.43±0.23 99.48±0.20
iris (3) 93.96±3.12 94.12±3.19 93.92±3.18 94.12±3.19 94.00±3.20 93.92±3.18
letter (26) 94.56±0.28 86.34±0.52 84.99±0.43 90.10±0.36 94.62±0.29
lymphography (4) 77.67±5.05 76.30±4.98 76.75±5.43 77.54±5.59 75.71±4.10 76.94±5.51
optdigits (10) 97.24±0.36 89.45±0.67 89.28±0.72 94.01±0.55 95.82±0.56 98.13±0.27
pendigits (10) 98.66±0.19 95.90±0.31 94.77±0.39 96.41±0.34 98.32±0.30 99.12±0.14
primary-tumor (22) 44.84±2.65 38.98±2.59 38.54±3.69 42.36±2.96 45.58±3.81
segment (7) 97.25±0.61 95.86±0.81 94.93±0.77 95.90±0.75 96.35±0.88 97.44±0.70
soybean (19) 93.75±1.29 88.75±2.14 89.41±1.79 92.26±1.53 93.43±1.45
splice (3) 94.30±0.87 93.34±0.89 94.16±0.80 94.27±0.69 92.46±2.33 94.16±0.80
vehicle (4) 73.48±1.97 71.27±2.15 70.30±2.56 70.27±2.30 70.03±3.34 72.86±2.22
vowel (11) 87.56±2.63 75.82±2.59 72.53±3.19 75.60±3.05 86.08±2.50 93.17±1.98
waveform (3) 78.61±1.76 75.00±0.98 72.49±1.18 75.80±1.02 72.86±1.04 72.49±1.18
zoo (7) 93.31±3.47 93.14±2.94 92.27±2.66 91.63±3.47 90.04±4.38 92.02±3.92
◦, • statistically significant improvement or degradation
Table 2. Comparison of different multi-class methods for C4.5.
Dataset (#classes) END LR 1-vs-rest 1-vs-1 RECOCs EECOCs
anneal (6) 99.33±0.59 98.93±0.78 99.12±0.69 99.10±0.68 98.97±0.76 99.17±0.65
arrhythmia (16) 58.48±3.18 52.76±4.06 48.91±3.60 60.84±3.11 48.80±3.84
audiology (24) 81.52±3.89 75.44±4.36 74.69±4.47 74.91±4.36 71.58±4.69
autos (7) 70.44±5.98 64.74±5.46 61.59±5.51 70.83±6.15 61.82±5.17 64.51±5.54
balance-scale (3) 87.23±1.15 88.78±1.19 87.11±1.27 89.25±1.26 87.74±1.58 87.11±1.27
ecoli (8) 85.73±2.52 84.57±2.59 85.28±2.44 84.28±2.72 84.74±3.23 85.77±2.54
glass (7) 64.19±5.45 63.06±5.09 63.06±5.54 62.29±5.59 60.92±4.35 62.59±4.56
hypothyroid (4) 96.82±0.53 96.66±0.42 95.28±0.43 97.40±0.40 95.02±0.78 95.40±0.39
iris (3) 95.73±2.79 95.25±3.32 95.49±2.78 95.80±2.96 91.61±8.23 95.37±2.96
letter (26) 76.12±0.72 77.21±0.34 72.17±0.41 84.14±0.34 47.83±2.39
lymphography (4) 77.95±5.57 77.12±6.16 76.97±5.41 78.50±6.21 76.42±5.82 76.54±5.71
optdigits (10) 97.00±0.37 93.17±0.58 94.28±0.56 96.96±0.33 92.58±0.97 95.12±0.44
pendigits (10) 95.42±0.58 95.47±0.34 93.53±0.40 97.57±0.28 84.94±2.28 89.63±0.54
primary-tumor (22) 44.05±3.13 35.56±3.79 39.10±3.72 38.25±3.86 45.28±3.10
segment (7) 94.40±0.73 95.28±0.59 92.05±0.67 95.67±0.64 89.72±2.55 92.53±0.68
soybean (19) 93.05±1.45 89.99±3.04 89.96±2.80 90.62±1.45 92.29±1.42
splice (3) 92.48±1.17 89.01±1.23 90.82±1.00 89.20±0.97 91.04±1.51 91.65±1.00
vehicle (4) 80.03±1.84 79.27±1.97 78.62±2.11 79.15±1.81 76.20±4.05 79.12±1.94
vowel (11) 80.12±3.08 78.09±2.99 65.23±2.62 88.42±1.86 40.38±4.87 51.83±2.56
waveform (3) 86.39±0.73 86.47±0.71 86.57±0.71 86.16±0.70 84.19±2.73 86.57±0.71
zoo (7) 95.25±3.26 90.23±6.85 92.19±5.40 94.73±3.21 91.93±4.81 93.34±5.05
◦, • statistically significant improvement or degradation
Table 3. Comparison of different multi-class methods for logistic regression.

References

Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.

Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco.