Proceedings Article

Experiments with a new boosting algorithm

Yoav Freund, Robert E. Schapire
03 Jul 1996, pp. 148-156
TL;DR: This paper describes experiments carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems, and compares boosting to Breiman's "bagging" method when used to aggregate various classifiers.
Abstract: In an earlier paper, we introduced a new "boosting" algorithm called AdaBoost which, theoretically, can be used to significantly reduce the error of any learning algorithm that consistently generates classifiers whose performance is a little better than random guessing. We also introduced the related notion of a "pseudo-loss" which is a method for forcing a learning algorithm of multi-label concepts to concentrate on the labels that are hardest to discriminate. In this paper, we describe experiments we carried out to assess how well AdaBoost, with and without pseudo-loss, performs on real learning problems. We performed two sets of experiments. The first set compared boosting to Breiman's "bagging" method when used to aggregate various classifiers (including decision trees and single attribute-value tests). We compared the performance of the two methods on a collection of machine-learning benchmarks. In the second set of experiments, we studied in more detail the performance of boosting using a nearest-neighbor classifier on an OCR problem.
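
For concreteness, here is a minimal sketch of the binary AdaBoost loop described above, using a one-attribute threshold test (a decision stump, in the spirit of the single attribute-value weak learners mentioned in the abstract) as the base classifier. It is a sketch under assumptions, not the paper's code: it uses the now-standard alpha = 0.5 * ln((1 - err) / err) weighting rather than the paper's beta_t notation, and the function names are illustrative.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weak learner: the single attribute/threshold/sign with the
    smallest weighted training error under example weights w."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, attribute, threshold, sign)

def adaboost(X, y, n_rounds=50):
    """Binary AdaBoost with labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start from the uniform distribution
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        if err >= 0.5:                   # weak learner no better than chance
            break
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] <= thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)   # up-weight the misclassified examples
        w /= w.sum()
        ensemble.append((alpha, (j, thr, sign)))
    return ensemble

def predict(ensemble, X):
    """Weighted-majority vote of the stumps."""
    score = np.zeros(len(X))
    for alpha, (j, thr, sign) in ensemble:
        score += alpha * np.where(X[:, j] <= thr, sign, -sign)
    return np.sign(score)
```

Boosting a stronger base learner (for example a full decision tree) amounts to replacing fit_stump; bagging, by contrast, would resample the training set uniformly instead of reweighting it.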


Citations
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 1996, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
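
As a rough illustration of the two randomization ingredients described above (a bootstrap replicate per tree and a random feature subset considered at each split), here is a hedged sketch; it delegates the per-split feature sampling to scikit-learn's DecisionTreeClassifier via its max_features option and is not Breiman's Forest-RI implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    """Illustrative forest: bootstrap sample per tree, random feature
    subset considered at each split, plurality vote at prediction time."""

    def __init__(self, n_trees=100, max_features="sqrt", seed=0):
        self.n_trees = n_trees
        self.max_features = max_features
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n = len(y)
        self.trees = []
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, size=n)        # bootstrap replicate
            tree = DecisionTreeClassifier(max_features=self.max_features)
            self.trees.append(tree.fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # plurality vote over the trees (assumes integer class labels)
        votes = np.stack([t.predict(X) for t in self.trees]).astype(int)
        return np.array([np.bincount(col).argmax() for col in votes.T])
```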

79,257 citations


Cites background or methods from "Experiments with a new boosting alg..."

  • ...But none of these three forests do as well as Adaboost (Freund & Schapire, 1996) or other algorithms that work by adaptive reweighting (arcing) of the training set (see Breiman, 1998b; Dietterich, 1998; Bauer & Kohavi, 1999)....


  • ...In its original version, Adaboost (Freund & Schapire, 1996) is a deterministic algorithm that selects the weights on the training set for input to the next classifier based on the misclassifications in the previous classifiers....


Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today and methods at the leading edge of contemporary research. The book:

  • Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
  • Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
  • Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface; the algorithms in the toolkit cover data pre-processing, classification, regression, clustering, association rules, and visualization.

20,196 citations

Journal ArticleDOI
TL;DR: A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
Abstract: Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent “boosting” paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such “TreeBoost” models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and Friedman, Hastie and Tibshirani are discussed.
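
For the least-squares case, the gradient-descent boosting paradigm reduces to repeatedly fitting a small regression tree to the current residuals (the negative gradient of squared loss) and adding a shrunken copy of it to the model. A minimal sketch under that reading, with scikit-learn's DecisionTreeRegressor standing in as the base learner and illustrative parameter names:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_gradient_boost(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    """Least-squares gradient boosting with regression-tree base learners."""
    f0 = y.mean()                         # F_0: constant minimizing squared loss
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residual = y - F                  # negative gradient of 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, learning_rate, trees

def gb_predict(model, X):
    f0, learning_rate, trees = model
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```

Roughly speaking, switching the residual line to the negative gradient of another loss (least absolute deviation, Huber, or the multiclass logistic likelihood) is how the corresponding algorithms in the paper differ.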

17,764 citations


Cites background or methods from "Experiments with a new boosting alg..."

  • ...Suppose that for a particular loss Ψ(y, F) and/or base learner h(x; a) the solution to (9) is difficult to obtain....


  • ...In machine learning, (9) (10) is called "boosting", where y ∈ {−1, 1} and Ψ(y, F) is either an exponential loss criterion e^{−yF} (Freund and Schapire 1996, Schapire and Singer 1998) or negative binomial log-likelihood (Friedman, Hastie, and...


  • ...and the approximation updated F_m(x) = F_{m−1}(x) + ρ_m h(x; a_m). Basically, instead of obtaining the solution under a smoothness constraint (9), the constraint is applied to the unconstrained (rough) solution {−g_m(x_i)}_{i=1}^N (7)....


  • ...Given the current approximation F_{m−1}(x) at the mth iteration, the function ρ_m h(x; a_m) (9) (10) is the best greedy step towards the minimizing solution F*(x) (1), under the constraint that the step "direction" h(x; a_m) be a member of the parameterized class of functions h(x; a)....


  • ...In the special case where y ∈ {−1, 1} and the loss function Ψ(y, F) depends on y and F only through their product, Ψ(y, F) = ψ(yF), the analogy of boosting (9) (10) to steepest-descent minimization has been noted in the machine learning literature (Breiman 1997a; Rätsch, Onoda, and Müller 1998)....


Journal ArticleDOI
01 Aug 1997
TL;DR: The model studied can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting, and it is shown that the multiplicative weight-update Littlestone–Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems.
Abstract: In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone–Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.
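
The resource-apportioning scheme in the first part of the abstract is a multiplicative weight-update (Hedge-style) rule. A minimal sketch, assuming losses scaled to [0, 1] and an illustrative beta parameter:

```python
import numpy as np

def hedge(loss_matrix, beta=0.9):
    """Multiplicative weight-update allocation over N options.

    loss_matrix[t, i] is the loss in [0, 1] of option i at trial t;
    returns the allocation (probability vector) used at each trial."""
    T, N = loss_matrix.shape
    w = np.ones(N)                          # equal initial weight on every option
    allocations = []
    for t in range(T):
        p = w / w.sum()                     # apportion resources proportionally
        allocations.append(p)
        w = w * beta ** loss_matrix[t]      # shrink weights of options that lost
    return np.array(allocations)
```

Roughly speaking, the boosting algorithm in the second part of the paper runs this kind of update over the training examples, with the weak learner's mistakes playing the role of losses.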

15,813 citations


Cites background or methods from "Experiments with a new boosting alg..."

  • ...Since it was first introduced, several successful experiments have been conducted using AdaBoost, including work by the authors [12], Drucker and Cortes [8], Jackson and Craven [16], Quinlan [21], and Breiman [3]....


  • ...Empirical tests [12] have shown that pseudo-loss is generally more successful when the weak learners use very restricted hypotheses....


Journal ArticleDOI
TL;DR: A new learning algorithm called ELM is proposed for single-hidden-layer feedforward neural networks (SLFNs); it randomly chooses hidden nodes, analytically determines the output weights of SLFNs, and tends to provide good generalization performance at extremely fast learning speed.
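
The TL;DR's two ingredients (randomly chosen, untrained hidden nodes and analytically determined output weights) can be sketched as follows; the tanh activation, Gaussian initialization, and function names are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, seed=0):
    """Random hidden layer; output weights solved by least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights, never trained
    b = rng.normal(size=n_hidden)                 # random biases
    H = np.tanh(X @ W + b)                        # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y                  # Moore-Penrose least-squares solution
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta
```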

10,217 citations


Cites methods from "Experiments with a new boosting alg..."

  • ...For this problem, as usually done in the literature [20,21,5,25], 75% and 25% of the samples are randomly chosen for training and testing at each trial, respectively....


  • ...57% with 20 nodes, which is obviously higher than all the results so far reported in the literature using various popular algorithms such as SVM [20], SAOCIF [21], Cascade-Correlation algorithm [21], bagging and boosting methods [5], C4....


References
Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.
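
The core step the book describes, scrutinizing the cases for the attribute test that best discriminates the classes, can be illustrated with plain information gain. This is only a sketch of that step: C4.5 itself uses the gain-ratio criterion and adds handling of numeric attributes, missing values, and pruning, none of which is shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution among a set of cases."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(cases, labels, attr):
    """Entropy reduction from splitting the cases on a nominal attribute.
    Each case is a dict mapping attribute names to values."""
    n = len(labels)
    by_value = {}
    for case, label in zip(cases, labels):
        by_value.setdefault(case[attr], []).append(label)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder

def best_attribute(cases, labels, attrs):
    """Greedy choice made at each node while growing the decision tree."""
    return max(attrs, key=lambda a: information_gain(cases, labels, a))
```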

21,674 citations

Journal ArticleDOI
01 Aug 1996
TL;DR: Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy.
Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.
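
A minimal sketch of the procedure described above: bootstrap replicates of the learning set, one base predictor fit per replicate, and a plurality vote when predicting a class (for a numerical outcome the vote would be replaced by an average). The helper names and the sklearn-style fit/predict interface are assumptions.

```python
import numpy as np

def bag_predictors(X, y, make_base_learner, n_replicates=50, seed=0):
    """Fit one copy of the base learner on each bootstrap replicate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_replicates):
        idx = rng.integers(0, n, size=n)   # sample n cases with replacement
        models.append(make_base_learner().fit(X[idx], y[idx]))
    return models

def bagged_vote(models, X):
    """Plurality vote over the aggregated predictors (integer class labels)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

For example, bag_predictors(X, y, lambda: DecisionTreeClassifier()) bags scikit-learn classification trees, the kind of unstable predictor highlighted in the abstract.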

16,118 citations

Journal ArticleDOI
01 Aug 1997
TL;DR: The model studied can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting, and it is shown that the multiplicative weight-update Littlestone–Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems.
Abstract: In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone–Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.

15,813 citations

01 Jan 1994
TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
Abstract: Algorithms for constructing decision trees are among the most well known and widely used of all machine learning methods. Among decision tree algorithms, J. Ross Quinlan's ID3 and its successor, C4.5, are probably the most popular in the machine learning community. These algorithms and variations on them have been the subject of numerous research papers since Quinlan introduced ID3. Until recently, most researchers looking for an introduction to decision trees turned to Quinlan's seminal 1986 Machine Learning journal article [Quinlan, 1986]. In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments. As such, this book will be a welcome addition to the library of many researchers and students.

8,046 citations

Book ChapterDOI
William W. Cohen
09 Jul 1995
TL;DR: This paper evaluates the recently-proposed rule learning algorithm IREP on a large and diverse collection of benchmark problems, and proposes a number of modifications resulting in an algorithm RIPPERk that is very competitive with C4.5 and C4.5rules with respect to error rates, but much more efficient on large samples.
Abstract: Many existing rule learning systems are computationally expensive on large noisy datasets. In this paper we evaluate the recently-proposed rule learning algorithm IREP on a large and diverse collection of benchmark problems. We show that while IREP is extremely efficient, it frequently gives error rates higher than those of C4.5 and C4.5rules. We then propose a number of modifications resulting in an algorithm RIPPERk that is very competitive with C4.5rules with respect to error rates, but much more efficient on large samples. RIPPERk obtains error rates lower than or equivalent to C4.5rules on 22 of 37 benchmark problems, scales nearly linearly with the number of training examples, and can efficiently process noisy datasets containing hundreds of thousands of examples.

4,081 citations


"Experiments with a new boosting alg..." refers methods in this paper

  • ...These include: (1) an algorithm that searches for very simple prediction rules which test on a single attribute (similar to Holte’s very simple classification rules [14]); (2) an algorithm that searches for a single good decision rule that tests on a conjunction of attribute tests (similar in flavor to the rule-formation part of Cohen’s RIPPER algorithm [3] and Fürnkranz and Widmer’s IREP algorithm [11]); and (3) Quinlan’s C4....
