Ensembles of Nested Dichotomies for Multi-Class Problems
Eibe Frank eibe@cs.waikato.ac.nz
Department of Computer Science, University of Waikato, Hamilton, New Zealand
Stefan Kramer kramer@in.tum.de
Institut für Informatik, Technische Universität München, München, Germany
Abstract

Nested dichotomies are a standard statistical technique for tackling certain polytomous classification problems with logistic regression. They can be represented as binary trees that recursively split a multi-class classification task into a system of dichotomies and provide a statistically sound way of applying two-class learning algorithms to multi-class problems (assuming these algorithms generate class probability estimates). However, there are usually many candidate trees for a given problem and in the standard approach the choice of a particular tree is based on domain knowledge that may not be available in practice. An alternative is to treat every system of nested dichotomies as equally likely and to form an ensemble classifier based on this assumption. We show that this approach produces more accurate classifications than applying C4.5 and logistic regression directly to multi-class problems. Our results also show that ensembles of nested dichotomies produce more accurate classifiers than pairwise classification if both techniques are used with C4.5, and comparable results for logistic regression. Compared to error-correcting output codes, they are preferable if logistic regression is used, and comparable in the case of C4.5. An additional benefit is that they generate class probability estimates. Consequently they appear to be a good general-purpose method for applying binary classifiers to multi-class problems.
Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
1. Introduction
A system of nested dichotomies (Fox, 1997) is a binary tree that recursively splits a set of classes from a multi-class classification problem into smaller and smaller subsets. In statistics, nested dichotomies are a standard technique for tackling polytomous (i.e. multi-class) classification problems with logistic regression by fitting binary logistic models to the individual dichotomous (i.e. two-class) classification problems at the tree's internal nodes, and a hierarchical decomposition of classes has also been considered by authors in neighboring areas (Goodman, 2001; Bengio, 2002). However, nested dichotomies are only recommended if a "particular choice of dichotomies is substantively compelling" (Fox, 1997) based on domain knowledge. There are usually many possible tree structures that can be generated for a given set of classes, and in many practical applications, namely where the class is truly a nominal quantity and does not exhibit any structure, there is no a priori reason to prefer one particular tree structure over another. In that case it makes sense to assume that every hierarchy of nested dichotomies is equally likely and to use an ensemble of these hierarchies for prediction. This is the approach we propose and evaluate in this paper.

Using C4.5 and logistic regression as base learners, we show that ensembles of nested dichotomies produce more accurate classifications than applying these learners directly to multi-class problems. We also show that they compare favorably to three other popular techniques for converting a multi-class classification task into a set of binary classification problems: the simple "one-vs-rest" method, error-correcting output codes (Dietterich & Bakiri, 1995), and pairwise classification (Fürnkranz, 2002). More specifically, we show that ensembles of nested dichotomies produce more accurate classifiers than the one-vs-rest method for both C4.5 and logistic regression; that they are more accurate than pairwise classification in the case of C4.5, and comparable in the case of logistic regression; and that, compared to error-correcting output codes, nested dichotomies have a distinct edge if logistic regression is used, and are on par if C4.5 is employed. In addition, they have the nice property that they do not require any form of post-processing to return class probability estimates. They do have the drawback that they require the base learner to produce class probability estimates, but this is not a severe limitation given that most practical learning algorithms are able to do so or can be made to do so.

This paper is structured as follows. In Section 2 we describe more precisely how nested dichotomies work. In Section 3 we present the idea of using ensembles of nested dichotomies. In Section 4 this approach is evaluated and compared to other techniques for tackling multi-class problems. Related work is discussed in Section 5. Section 6 summarizes the main findings of the paper.
2. Nested Dichotomies
Nested dichotomies can be represented as binary trees that, at each node, divide the set of classes A associated with the node into two subsets B and C that are mutually exclusive and taken together contain all the classes in A. The root node of a nested dichotomy contains all the classes of the corresponding multi-class classification problem, and each leaf node contains a single class (i.e. for an n-class problem, there are n leaf nodes and n − 1 internal nodes). To build a classifier based on such a tree structure we do the following: at every internal node we store the instances pertaining to the classes associated with that node, and no other instances; then we group the classes pertaining to each node into two subsets, so that each subset holds the classes associated with exactly one of the node's two successor nodes; and finally we build binary classifiers for the resulting two-class problems. This process creates a tree structure with binary classifiers at the internal nodes.
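To make the construction concrete, here is a minimal structure-only sketch in Python (ours, not the paper's code): a nested dichotomy is written as nested tuples of class labels, and the function lists the two-class problem that the binary classifier at each internal node has to solve. Fitting actual models to these dichotomies is sketched in Section 3.

```python
# Minimal sketch (not from the paper): a nested dichotomy over classes
# {1, 2, 3, 4} as nested tuples, and the dichotomy at each internal node.

def leaves(tree):
    """Set of class labels below a (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def dichotomies(tree):
    """Yield (left classes, right classes) for every internal node."""
    if isinstance(tree, tuple):
        left, right = tree
        yield leaves(left), leaves(right)
        yield from dichotomies(left)
        yield from dichotomies(right)

# A tree that first separates {1,2} from {3,4}, then the single classes.
tree = ((1, 2), (3, 4))
for left, right in dichotomies(tree):
    print(sorted(left), "vs", sorted(right))
# [1, 2] vs [3, 4]
# [1] vs [2]
# [3] vs [4]
```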
We assume that the binary classifiers produce class probability estimates. For example, they could be logistic regression models. The question is how to combine the estimates from the individual two-class problems to obtain class probability estimates for the original multi-class problem. It turns out that the individual dichotomies are statistically independent because they are nested (Fox, 1997), enabling us to form multi-class probability estimates simply by multiplying together the probability estimates obtained from the two-class models. More specifically, let C_{i1} and C_{i2} be the two subsets of classes generated by a split of the set of classes C_i at internal node i of the tree (i.e. the subsets associated with the successor nodes), and let p(c ∈ C_{i1} | x, c ∈ C_i) and p(c ∈ C_{i2} | x, c ∈ C_i) be the conditional probability distribution estimated by the two-class model at node i for a given instance x. Then the estimated class probability distribution for the original multi-class problem is given by:

$$p(c = C \mid x) = \prod_{i=1}^{n-1} \big( I(c \in C_{i1}) \, p(c \in C_{i1} \mid x, c \in C_i) + I(c \in C_{i2}) \, p(c \in C_{i2} \mid x, c \in C_i) \big),$$

where I(.) is the indicator function, and the product is over all the internal nodes of the tree.
Note that not all nodes actually have to be examined to compute this probability for a particular class value: evaluating the path to the leaf associated with that class is sufficient. Let p(c ∈ C_{i1} | x, c ∈ C_i) and p(c ∈ C_{i2} | x, c ∈ C_i) be the labels of the edges connecting node i to the nodes associated with C_{i1} and C_{i2} respectively. Then computing p(c | x) amounts to finding the single path from the root to a leaf for which c is in the set of classes associated with each node along the path, and multiplying together the probability estimates encountered along the way.
Consider Figure 1, which shows two of the 15 possible nested dichotomies for a four-class classification problem. Using the tree in Figure 1a, the probability of class 4 for an instance x is given by

p(c = 4 | x) = p(c ∈ {3, 4} | x) × p(c ∈ {4} | x, c ∈ {3, 4}).

Based on the tree in Figure 1b we have

p(c = 4 | x) = p(c ∈ {2, 3, 4} | x) × p(c ∈ {3, 4} | x, c ∈ {2, 3, 4}) × p(c ∈ {4} | x, c ∈ {3, 4}).
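As a toy illustration with made-up two-class estimates (not values from the paper), the two path products can be evaluated directly:

```python
# Hypothetical two-class estimates for some instance x (made-up numbers).
# Tree (a): p(c=4|x) = p(c in {3,4} | x) * p(c in {4} | x, c in {3,4})
p_tree_a = 0.6 * 0.7                  # = 0.42
# Tree (b): p(c=4|x) = p(c in {2,3,4} | x) * p(c in {3,4} | x, c in {2,3,4})
#                        * p(c in {4} | x, c in {3,4})
p_tree_b = 0.8 * 0.7 * 0.7            # = 0.392
print(p_tree_a, p_tree_b)             # different trees, different estimates
```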
Both trees represent equally valid class probability estimators, like all other trees that can be generated for this problem. However, the estimates obtained from different trees will usually differ because they involve different two-class learning problems. If there is no a priori reason to prefer a particular nested dichotomy (e.g., because some classes are known to be related in some fashion), there is no reason to trust one of the estimates more than the others. Consequently it makes sense to treat all possible trees as equally likely and to form overall class probability estimates by averaging the estimates obtained from different trees. This is the approach we investigate in the rest of this paper.

[Figure 1: (a) a balanced tree whose root splits {1,2,3,4} into {1,2} and {3,4}, which are then split into the single classes; (b) a tree that first splits off {1} from {2,3,4}, then {2} from {3,4}, and finally separates {3} and {4}. Each edge is labelled with the corresponding conditional probability, e.g. p(c ∈ {3,4} | x) and p(c ∈ {4} | x, c ∈ {3,4}).]

Figure 1. Two different systems of nested dichotomies for a classification problem with four classes.
3. Ensembles of Nested Dichotomies
The number of possible trees for an n-class problem grows extremely quickly. It is given by the following recurrence relation:

$$T(n) = (2n - 3) \times T(n - 1),$$

where T(1) = 1, because there are 2(n − 1) − 1 = 2n − 3 distinct possibilities to add a new class to a tree for n − 1 classes (one for each node). Expanding the recurrence relation results in T(n) = (2n−3) × (2n−5) × ... × 3 × 1, and, using the double factorial, this can be written as T(n) = (2n−3)!!. For two classes we have T(2) = 1, for three T(3) = 3, for four T(4) = 15, and for five T(5) = 105.
The growth in the number of trees makes it impossible to generate them exhaustively in a brute-force manner even for problems with a moderate number of classes. This is the case even if we cache models for the individual two-class problems that are encountered when building each tree. (Note that different trees may exhibit some two-class problems that are identical.) There are (3^n − (2^{n+1} − 1))/2 possible two-class problems for an n-class dataset. The term 3^n arises because a class can be either in the first subset, in the second one, or absent; the term (2^{n+1} − 1) because we need to subtract all problems where either one of the two subsets is empty; and the factor 1/2 from the fact that the two resulting subsets can be swapped without any effect on the classifier. Hence there are 6 possible two-class problems for a problem with 3 classes, 25 for a problem with 4 classes, 90 for a problem with 5 classes, etc.
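Both counts are easy to check; a small Python verification (the function names are ours):

```python
# Quick check of the counting formulas above: T(n) = (2n-3)!! distinct trees
# and (3**n - (2**(n+1) - 1)) / 2 distinct two-class problems for n classes.
def num_trees(n):
    t = 1
    for m in range(3, n + 1):
        t *= 2 * m - 3
    return t

def num_two_class_problems(n):
    return (3 ** n - (2 ** (n + 1) - 1)) // 2

assert [num_trees(n) for n in (2, 3, 4, 5)] == [1, 3, 15, 105]
assert [num_two_class_problems(n) for n in (3, 4, 5)] == [6, 25, 90]
```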
Given these growth rates we chose to evaluate the performance of ensembles of randomly generated trees. (Of course, only the structure of each tree was generated randomly; we applied a standard learning scheme at each internal node of the randomly sampled trees.) More specifically, we took a random sample from the space of all distinct trees for a given n-class problem (for simplicity, based on sampling with replacement), and formed class probability estimates for a given instance x by averaging the estimates obtained from the individual ensemble members. Because of the uniform sampling process these averages form an unbiased estimate of the estimates that would have been obtained by building the complete ensemble of all possible distinct trees for the given n-class problem.
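The following sketch illustrates this scheme, with scikit-learn's LogisticRegression standing in for the paper's base learners (C4.5 and logistic regression in Weka); the class names, the node-wise random splitting used to sample a tree, and the omission of the model caching discussed above are assumptions of this sketch rather than details from the paper.

```python
# A minimal sketch of an ensemble of nested dichotomies (END).
import random
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


class NestedDichotomy:
    """One randomly structured system of nested dichotomies."""

    def __init__(self, base_estimator, rng):
        self.base, self.rng = base_estimator, rng

    def fit(self, X, y):
        self.tree_ = self._build(np.asarray(X), np.asarray(y), sorted(set(y)))
        return self

    def _build(self, X, y, classes):
        if len(classes) == 1:
            return {"classes": set(classes)}                 # leaf node
        shuffled = list(classes)
        self.rng.shuffle(shuffled)
        cut = self.rng.randint(1, len(shuffled) - 1)         # both subsets non-empty
        left, right = set(shuffled[:cut]), set(shuffled[cut:])
        z = np.isin(y, list(right)).astype(int)              # 0 = left subset, 1 = right
        model = clone(self.base).fit(X, z)                   # binary problem at this node
        return {"classes": left | right, "model": model,
                "children": (self._build(X[z == 0], y[z == 0], sorted(left)),
                             self._build(X[z == 1], y[z == 1], sorted(right)))}

    def prob(self, x, c, node=None):
        """p(c | x): multiply the two-class estimates along the path to class c."""
        node = self.tree_ if node is None else node
        if "model" not in node:                              # reached the leaf for c
            return 1.0
        p_left, p_right = node["model"].predict_proba(x.reshape(1, -1))[0]
        left_child, right_child = node["children"]
        if c in left_child["classes"]:
            return p_left * self.prob(x, c, left_child)
        return p_right * self.prob(x, c, right_child)


class EnsembleOfNestedDichotomies:
    """Average the estimates of n_members randomly sampled trees."""

    def __init__(self, n_members=20, base_estimator=None, seed=1):
        self.n_members = n_members
        self.base = base_estimator or LogisticRegression(max_iter=1000)
        self.rng = random.Random(seed)

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.members_ = [NestedDichotomy(self.base, self.rng).fit(X, y)
                         for _ in range(self.n_members)]
        return self

    def predict_proba_one(self, x):
        per_tree = np.array([[m.prob(np.asarray(x), c) for c in self.classes_]
                             for m in self.members_])
        return per_tree.mean(axis=0)

    def predict_one(self, x):
        return self.classes_[int(np.argmax(self.predict_proba_one(x)))]
```

For example, EnsembleOfNestedDichotomies(n_members=20).fit(X_train, y_train) would mirror the 20-member ensembles evaluated in Section 4.1.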
4. Empirical Comparison
We performed experiments with 21 multi-class datasets from the UCI repository (Blake & Merz, 1998), summarized in Table 1. Two learning schemes were employed: C4.5 and logistic regression, as implemented in Weka version 3.4.1 (Witten & Frank, 2000). We used these two because (a) they produce class probability estimates, (b) they inhabit opposite ends of the bias-variance spectrum, and (c) they can deal with multiple classes directly without having to convert a multi-class problem into a set of two-class problems (in the case of logistic regression, by optimizing the multinomial likelihood directly). The latter condition is important for testing whether any of the multi-class "wrapper" methods that we included in our experimental comparison can actually improve upon the performance of the learning schemes applied directly to the multi-class problems.

To compare the performance of the different learning schemes for each dataset, we estimated classification accuracy based on 50 runs of the stratified hold-out method, in each run using 66% of the data for training and the rest for testing.

Dataset         Instances   % Missing   Numeric atts   Nominal atts   Classes
anneal                898         0.0              6             32         6
arrhythmia            452         0.3            206             73        16
audiology             226         2.0              0             69        24
autos                 205         1.1             15             10         7
bal.-scale            625         0.0              4              0         3
ecoli                 336         0.0              7              0         8
glass                 214         0.0              9              0         7
hypothyroid          3772         6.0             23              6         4
iris                  150         0.0              4              0         3
letter              20000         0.0             16              0        26
lymph                 148         0.0              3             15         4
optdigits            5620         0.0             64              0        10
pendigits           10992         0.0             16              0        10
prim.-tumor           339         3.9              0             17        22
segment              2310         0.0             19              0         7
soybean               683         9.8              0             35        19
splice               3190         0.0              0             61         3
vehicle               846         0.0             18              0         4
vowel                 990         0.0             10              3        11
waveform             5000         0.0             40              0         3
zoo                   101         0.0              1             15         7

Table 1. Datasets used for the experiments.
We tested for significant differences in accuracy by using the corrected resampled t-test at the 5% significance level. This test has been shown to have Type I error at the significance level and low Type II error if used in conjunction with the hold-out method (Nadeau & Bengio, 2003).
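For reference, a hedged Python sketch of this test as it is commonly implemented for repeated hold-out results: k runs, d[i] the per-run accuracy difference between two schemes, and the variance inflated by the test/train ratio to account for overlapping training sets. The function name and the scipy dependency are assumptions of this sketch.

```python
# Corrected resampled t-test (Nadeau & Bengio, 2003), sketch.
import numpy as np
from scipy.stats import t as t_dist

def corrected_resampled_t_test(d, n_train, n_test):
    d = np.asarray(d, dtype=float)
    k = len(d)
    mean_d = d.mean()
    var_d = d.var(ddof=1)
    # Variance correction factor (1/k + n_test/n_train) accounts for the
    # overlap between training sets across the k runs.
    denom = np.sqrt((1.0 / k + n_test / n_train) * var_d)
    t_stat = mean_d / denom
    p_value = 2 * t_dist.sf(abs(t_stat), df=k - 1)
    return t_stat, p_value

# Example setup matching the paper: 50 runs with a 66%/34% split, so
# corrected_resampled_t_test(differences, n_train=0.66, n_test=0.34).
```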
In the first set of experiments, we compared ensembles of nested dichotomies (ENDs) with several other standard multi-class methods. In the second set we varied the number of ensemble members to see whether this has any impact on the performance of ENDs.
4.1. Comparison to other approaches for multi-class learning

In the first set of experiments we used ENDs consisting of 20 ensemble members (i.e. each classifier consisted of 20 trees of nested dichotomies) to compare to other multi-class schemes. As the experimental results in the next section will show, 20 ensemble members are often sufficient to get close to optimum performance. We used both C4.5 and logistic regression to build the ENDs. The same experiments were repeated for both standard C4.5 and polytomous logistic regression applied directly to the multi-class problems. In addition, the following other multi-class-to-binary conversion methods were compared with ENDs: one-vs-rest, pairwise classification, random error-correcting output codes, and exhaustive error-correcting output codes.
One-vs-rest creates n dichotomies for an n-class problem, in each case learning one of the n classes against all the other classes (i.e. there is one classifier for each class). At classification time, the class that gets the maximum probability from its corresponding classifier is predicted. Pairwise classification learns a classifier for each pair of classes, ignoring the instances pertaining to the other classes (i.e. there are n × (n − 1)/2 classifiers). A prediction is obtained by voting, where each classifier casts a vote for one of the two classes it was built from; the class with the maximum number of votes is predicted.
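For reference, both wrappers are available off the shelf, e.g. in scikit-learn; a minimal usage sketch (LogisticRegression is just an arbitrary probabilistic base learner here):

```python
# One-vs-rest and pairwise (one-vs-one) wrappers as provided by scikit-learn.
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

base = LogisticRegression(max_iter=1000)
ovr = OneVsRestClassifier(base)   # n binary classifiers, one per class
ovo = OneVsOneClassifier(base)    # n*(n-1)/2 classifiers, one per pair, combined by voting
# ovr.fit(X_train, y_train); ovo.fit(X_train, y_train)
```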
In error-correcting output codes (ECOCs), each class is assigned a binary code vector of length k; these code vectors make up the row vectors of a code matrix. The column vectors of the code matrix determine the set of k dichotomies to be learned. At prediction time, a vector of classifications is obtained by collecting the predictions from the k individual classifiers learned from the dichotomies. The original approach to ECOCs predicts the class whose corresponding row vector has minimum Hamming distance to the vector of 0/1 predictions obtained from the k classifiers (Dietterich & Bakiri, 1995). However, accuracy can be improved by using loss-based decoding (Allwein et al., 2000); in our case this means using the predicted class probabilities rather than the 0/1 predictions. For C4.5 we used the absolute error of the probability estimates as the loss function, and for logistic regression the negative log-likelihood. We verified that this indeed produced more accurate classifiers than the Hamming distance. (We also evaluated loss-based decoding for pairwise classification (Allwein et al., 2000) but did not observe a significant improvement over voting.)
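A minimal decoding sketch in Python, assuming a 0/1 code matrix with one row per class and per-dichotomy estimates P(bit = 1 | x); it shows Hamming decoding and the absolute-error variant of loss-based decoding used here for C4.5 (the general framework of Allwein et al. (2000) also covers other losses, e.g. the log-likelihood loss used for logistic regression):

```python
# ECOC decoding sketch: `codes` has one row per class and one column per
# learned dichotomy; p[j] is the estimated probability that bit j equals 1.
import numpy as np

def decode_hamming(codes, p):
    codes = np.asarray(codes)
    bits = (np.asarray(p) >= 0.5).astype(int)     # threshold the estimates
    distances = np.abs(codes - bits).sum(axis=1)
    return int(np.argmin(distances))              # index of predicted class row

def decode_absolute_error(codes, p):
    codes = np.asarray(codes)
    losses = np.abs(codes - np.asarray(p)).sum(axis=1)   # use probabilities directly
    return int(np.argmin(losses))
```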
Random error-correcting output codes (RECOCs) are based on the fact that random code vectors have good error-correcting properties. We used random code vectors of length k = 2 × n, where n is the number of classes. Code vectors consisting only of 0s or only of 1s were discarded. This results in a code matrix with row vectors of length 2 × n and column vectors of length n. Code matrices with column vectors exhibiting only 0s or only 1s were also discarded. In contrast to random codes, exhaustive error-correcting codes (EECOCs) are generated deterministically. They are maximum-length code vectors of length 2^{n−1} − 1, where the resulting dichotomies (i.e. column vectors) correspond to every possible n-bit configuration, excluding complements and vectors exhibiting only 0s or only 1s. We applied EECOCs to benchmark problems with up to 11 classes.
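A Python sketch of these two code-matrix constructions (resampling the whole random matrix until no row or column is trivial is a simplification of the discarding described above, not necessarily the paper's procedure):

```python
# Random (RECOC) and exhaustive (EECOC) code matrices for n classes:
# n rows (classes); 2*n columns for the random matrix, 2**(n-1) - 1 for
# the exhaustive one.
import itertools
import numpy as np

def random_code_matrix(n, rng=np.random.default_rng(1)):
    k = 2 * n
    while True:
        m = rng.integers(0, 2, size=(n, k))
        rows_ok = np.all((m.sum(axis=1) > 0) & (m.sum(axis=1) < k))
        cols_ok = np.all((m.sum(axis=0) > 0) & (m.sum(axis=0) < n))
        if rows_ok and cols_ok:                    # no all-0/all-1 row or column
            return m

def exhaustive_code_matrix(n):
    cols = []
    for bits in itertools.product([0, 1], repeat=n):
        col = np.array(bits)
        if col.sum() in (0, n):                    # skip all-0s and all-1s
            continue
        if any(np.array_equal(1 - col, c) for c in cols):   # skip complements
            continue
        cols.append(col)
    return np.column_stack(cols)                   # e.g. shape (4, 7) for n = 4
```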
Table 2 shows the results obtained for C4.5 and Table 3 those obtained for logistic regression (LR).

Dataset (#classes) END C4.5 1-vs-rest 1-vs-1 RECOCs EECOCs
anneal (6) 98.15±0.75 98.45±0.72 98.32±0.69 97.77±0.87 98.08±0.56 98.35±0.70
arrhythmia (16) 72.98±2.41 65.37±3.09 62.60±3.54 66.23±3.06 71.24±2.55
audiology (24) 78.23±3.70 77.91±3.19 65.66±6.48 77.03±4.11 80.49±3.68
autos (7) 73.77±3.73 73.20±5.56 71.15±5.10 65.62±6.20 69.18±6.34 75.09±5.06
balance-scale (3) 80.00±2.22 78.47±2.34 78.62±2.43 79.38±2.18 78.82±2.76 78.62±2.43
ecoli (8) 84.48±2.84 81.36±3.09 80.67±3.41 82.62±3.33 82.80±3.08 85.22±2.26
glass (7) 70.67±4.59 67.29±5.51 65.32±4.68 68.77±4.72 67.10±5.08 70.95±5.06
hypothyroid (4) 99.49±0.20 99.49±0.13 99.45±0.21 99.41±0.19 99.43±0.23 99.48±0.20
iris (3) 93.96±3.12 94.12±3.19 93.92±3.18 94.12±3.19 94.00±3.20 93.92±3.18
letter (26) 94.56±0.28 86.34±0.52 84.99±0.43 90.10±0.36 94.62±0.29
lymphography (4) 77.67±5.05 76.30±4.98 76.75±5.43 77.54±5.59 75.71±4.10 76.94±5.51
optdigits (10) 97.24±0.36 89.45±0.67 89.28±0.72 94.01±0.55 95.82±0.56 98.13±0.27
pendigits (10) 98.66±0.19 95.90±0.31 94.77±0.39 96.41±0.34 98.32±0.30 99.12±0.14
primary-tumor (22) 44.84±2.65 38.98±2.59 38.54±3.69 42.36±2.96 45.58±3.81
segment (7) 97.25±0.61 95.86±0.81 94.93±0.77 95.90±0.75 96.35±0.88 97.44±0.70
soybean (19) 93.75±1.29 88.75±2.14 89.41±1.79 92.26±1.53 93.43±1.45
splice (3) 94.30±0.87 93.34±0.89 94.16±0.80 94.27±0.69 92.46±2.33 94.16±0.80
vehicle (4) 73.48±1.97 71.27±2.15 70.30±2.56 70.27±2.30 70.03±3.34 72.86±2.22
vowel (11) 87.56±2.63 75.82±2.59 72.53±3.19 75.60±3.05 86.08±2.50 93.17±1.98
waveform (3) 78.61±1.76 75.00±0.98 72.49±1.18 75.80±1.02 72.86±1.04 72.49±1.18
zoo (7) 93.31±3.47 93.14±2.94 92.27±2.66 91.63±3.47 90.04±4.38 92.02±3.92
◦, • statistically significant improvement or degradation
Table 2. Comparison of different multi-class methods for C4.5.
Dataset (#classes) END LR 1-vs-rest 1-vs-1 RECOCs EECOCs
anneal (6) 99.33±0.59 98.93±0.78 99.12±0.69 99.10±0.68 98.97±0.76 99.17±0.65
arrhythmia (16) 58.48±3.18 52.76±4.06 48.91±3.60 60.84±3.11 48.80±3.84
audiology (24) 81.52±3.89 75.44±4.36 74.69±4.47 74.91±4.36 71.58±4.69
autos (7) 70.44±5.98 64.74±5.46 61.59±5.51 70.83±6.15 61.82±5.17 64.51±5.54
balance-scale (3) 87.23±1.15 88.78±1.19 87.11±1.27 89.25±1.26 87.74±1.58 87.11±1.27
ecoli (8) 85.73±2.52 84.57±2.59 85.28±2.44 84.28±2.72 84.74±3.23 85.77±2.54
glass (7) 64.19±5.45 63.06±5.09 63.06±5.54 62.29±5.59 60.92±4.35 62.59±4.56
hypothyroid (4) 96.82±0.53 96.66±0.42 95.28±0.43 97.40±0.40 95.02±0.78 95.40±0.39
iris (3) 95.73±2.79 95.25±3.32 95.49±2.78 95.80±2.96 91.61±8.23 95.37±2.96
letter (26) 76.12±0.72 77.21±0.34 72.17±0.41 84.14±0.34 47.83±2.39
lymphography (4) 77.95±5.57 77.12±6.16 76.97±5.41 78.50±6.21 76.42±5.82 76.54±5.71
optdigits (10) 97.00±0.37 93.17±0.58 94.28±0.56 96.96±0.33 92.58±0.97 95.12±0.44
pendigits (10) 95.42±0.58 95.47±0.34 93.53±0.40 97.57±0.28 84.94±2.28 89.63±0.54
primary-tumor (22) 44.05±3.13 35.56±3.79 39.10±3.72 38.25±3.86 45.28±3.10
segment (7) 94.40±0.73 95.28±0.59 92.05±0.67 95.67±0.64 89.72±2.55 92.53±0.68
soybean (19) 93.05±1.45 89.99±3.04 89.96±2.80 90.62±1.45 92.29±1.42
splice (3) 92.48±1.17 89.01±1.23 90.82±1.00 89.20±0.97 91.04±1.51 91.65±1.00
vehicle (4) 80.03±1.84 79.27±1.97 78.62±2.11 79.15±1.81 76.20±4.05 79.12±1.94
vowel (11) 80.12±3.08 78.09±2.99 65.23±2.62 88.42±1.86 40.38±4.87 51.83±2.56
waveform (3) 86.39±0.73 86.47±0.71 86.57±0.71 86.16±0.70 84.19±2.73 86.57±0.71
zoo (7) 95.25±3.26 90.23±6.85 92.19±5.40 94.73±3.21 91.93±4.81 93.34±5.05
◦, • statistically significant improvement or degradation
Table 3. Comparison of different multi-class methods for logistic regression.

References

Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.

Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco.