
Showing papers on "AdaBoost published in 2000"


Journal ArticleDOI
TL;DR: This work shows that the seemingly mysterious phenomenon of boosting can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood, and develops more direct approximations that exhibit nearly identical results to boosting.
Abstract: Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.

6,598 citations
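For readers skimming the listing, the paper's central identity can be stated compactly; the notation below (F, f_m, labels y in {-1,+1}) is the standard two-class setting described in the abstract.

```latex
% Boosting fits an additive model by stagewise updates,
F(x) \;=\; \sum_{m=1}^{M} c_m\, f_m(x),
% and, for labels y \in \{-1,+1\}, AdaBoost's reweighting amounts to stagewise
% minimization of the exponential criterion
J(F) \;=\; \mathbb{E}\!\left[\, e^{-y F(x)} \,\right],
% whose population minimizer is half the log-odds,
F^{*}(x) \;=\; \tfrac{1}{2}\,\log\frac{P(y=1\mid x)}{P(y=-1\mid x)},
% so boosting can be read as approximate additive modeling on the logistic
% scale, with p(x) = 1 / (1 + e^{-2F(x)}).
```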


Book ChapterDOI
21 Jun 2000
TL;DR: Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.
Abstract: Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, Bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier. Some previous studies comparing ensemble methods are reviewed, and some new experiments are presented to uncover the reasons that AdaBoost does not overfit rapidly.

5,679 citations
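A minimal sketch of the (weighted) vote mentioned in the abstract; the function name and the toy example are illustrative only, not from the paper.

```python
# Minimal weighted majority vote over an ensemble's predictions.
import numpy as np

def weighted_vote(predictions, weights, n_classes):
    """predictions: (n_members, n_samples) integer class labels;
    weights: (n_members,) non-negative member weights."""
    votes = np.zeros((n_classes, predictions.shape[1]))
    for member_pred, w in zip(predictions, weights):
        for c in range(n_classes):
            votes[c] += w * (member_pred == c)
    return votes.argmax(axis=0)

# Example: three members with equal weights reduce to plain majority voting.
preds = np.array([[0, 1, 1], [0, 1, 0], [1, 1, 0]])
print(weighted_vote(preds, np.ones(3), n_classes=2))   # -> [0 1 0]
```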


Journal ArticleDOI
28 Jun 2000
TL;DR: A unified account of boosting and logistic regression is given in which each learning problem is cast in terms of optimization of Bregman distances, and a parameterized family of algorithms is described that includes both a sequential- and a parallel-update algorithm as special cases, thus showing how the sequential and parallel approaches can themselves be unified.
Abstract: We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms are iterative and can be divided into two types based on whether the parameters are updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms that includes both a sequential- and a parallel-update algorithm as special cases, thus showing how the sequential and parallel approaches can themselves be unified. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliary-function proof technique. As one of our sequential-update algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with the iterative scaling algorithm. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.

730 citations
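For reference, the standard AdaBoost round that the abstract identifies with one of the sequential-update algorithms; ε_t denotes the weighted error of weak hypothesis h_t under the current distribution D_t.

```latex
% One AdaBoost round: given weak hypothesis h_t with weighted error
% \epsilon_t under distribution D_t,
\alpha_t \;=\; \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad
D_{t+1}(i) \;\propto\; D_t(i)\, e^{-\alpha_t y_i h_t(x_i)},
% which is coordinate-wise (sequential) minimization of the exponential loss
% \sum_i \exp\bigl(-y_i \sum_t \alpha_t h_t(x_i)\bigr).
```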


Journal ArticleDOI
TL;DR: MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees that is able to combine AdaBoost's high bias and variance reduction with wagging's superior variance reduction.
Abstract: MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, MultiBoosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution.

729 citations
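A rough sketch of interleaving boosting with wagging-style weight restarts, under my own simplifying assumptions (scikit-learn stumps instead of C4.5, exponentially distributed restart weights, labels in {-1,+1}); this is not Webb's exact MultiBoosting procedure.

```python
# Sketch: AdaBoost whose instance weights are randomly restarted at fixed
# sub-committee boundaries, as a stand-in for wagging's weight perturbation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def multiboost_sketch(X, y, n_rounds=20, subcommittee=5, seed=0):
    """X: (n, d) array; y: numpy array with labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)
    members, alphas = [], []
    for t in range(n_rounds):
        if t > 0 and t % subcommittee == 0:
            w = rng.exponential(1.0, n)        # random restart of the weights
            w /= w.sum()
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err == 0 or err >= 0.5:             # weak-learning condition violated
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)         # usual AdaBoost reweighting
        w /= w.sum()
        members.append(stump)
        alphas.append(alpha)
    return members, alphas

def committee_predict(members, alphas, X):
    score = sum(a * m.predict(X) for m, a in zip(members, alphas))
    return np.sign(score)
```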


Journal ArticleDOI
TL;DR: Training methods based on sampling the training set and weighting the cost function are compared, and the results suggest that random resampling of the training data is not the main reason for the improvements brought by AdaBoost.
Abstract: Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm, AdaBoost, has been applied with great success to several benchmark machine learning problems using mainly decision trees as base classifiers. In this article we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to bagging, which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves about 1.4% error on a data set of on-line handwritten digits from more than 200 writers. A boosted multilayer network achieved 1.5% error on the UCI letters and 8.1% error on the UCI satellite data set, which is significantly better than boosted decision trees.

303 citations
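A small illustration (mine, not the authors') of the two ways of presenting the boosting distribution to a base learner that the paper compares: weighting the cost function versus random resampling.

```python
# Two ways of presenting the boosting distribution w to a base learner.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, 100)
w = rng.random(100); w /= w.sum()            # current boosting weights

# (a) Weighting the cost function: the learner minimizes a w-weighted loss.
#     With scikit-learn this is usually the sample_weight argument, e.g.
#     learner.fit(X, y, sample_weight=w)

# (b) Random resampling: draw a new training set of the same size in which
#     example i appears with probability proportional to w[i].
idx = rng.choice(len(y), size=len(y), replace=True, p=w)
X_resampled, y_resampled = X[idx], y[idx]
#     learner.fit(X_resampled, y_resampled)
```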


Proceedings Article
29 Jun 2000
TL;DR: In this paper, a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a margin-based binary learning algorithm is presented.
Abstract: We present a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a margin-based binary learning algorithm. The proposed framework unifies some of the most popular approaches in which each class is compared against all others, or in which all pairs of classes are compared to each other, or in which output codes with error-correcting properties are used. We propose a general method for combining the classifiers generated on the binary problems, and we prove a general empirical multiclass loss bound given the empirical loss of the individual binary learning algorithms. The scheme and the corresponding bounds apply to many popular classification learning algorithms including support-vector machines, AdaBoost, regression, logistic regression and decision-tree algorithms. We also give a multiclass generalization error analysis for general output codes with AdaBoost as the binary learner. Experimental results with SVM and AdaBoost show that our scheme provides a viable alternative to the most commonly used multiclass algorithms.

239 citations
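A hedged sketch of the reduction-to-binary framework in its simplest form: a one-vs-all code matrix, one binary learner per column, and Hamming-style decoding. The paper's loss-based decoding uses the binary margins rather than hard predictions; the learner choice and function names here are assumptions for illustration.

```python
# One-vs-all as a special case of a code matrix, with distance-based decoding.
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_code(n_classes):
    # rows = classes, columns = binary problems; entries in {-1, +1}
    return 2 * np.eye(n_classes) - 1

def fit_binary_problems(X, y, code):
    """y: numpy array of integer class labels 0..n_classes-1."""
    learners = []
    for col in code.T:
        y_bin = col[y]                       # relabel each class by its code bit
        learners.append(LogisticRegression().fit(X, y_bin))
    return learners

def decode(learners, code, X):
    # hard predictions of each binary learner, in {-1, +1}
    preds = np.column_stack([clf.predict(X) for clf in learners])
    # Hamming-style decoding: pick the class whose code row is closest
    dists = (preds[:, None, :] != code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)
```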


Proceedings Article
28 Jun 2000
TL;DR: A new boosting algorithm, MadaBoost, is proposed that can be cast in the statistical query learning model [Kea93] and is thus robust to random classification noise [AL88].
Abstract: We propose a new boosting algorithm that mends some of the problems that have been detected in the most successful boosting algorithm so far, AdaBoost, due to Freund and Schapire [FS97]. These problems are: (1) AdaBoost cannot be used in the boosting-by-filtering framework, and (2) AdaBoost does not seem to be noise resistant. In order to solve them, we propose a new boosting algorithm, MadaBoost, by modifying the weighting system of AdaBoost. We prove that one version of MadaBoost is in fact a boosting algorithm, and we show in detail how our algorithm can be used. We then prove that our new boosting algorithm can be cast in the statistical query learning model [Kea93] and is thus robust to random classification noise [AL88].

200 citations
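Schematically, and as I read the modification (consult the paper for the exact definition), MadaBoost changes AdaBoost's weighting by capping each example's weight at its initial probability:

```latex
% AdaBoost's (unnormalized) weight of example i after round t:
w_t(i) \;\propto\; D(i)\, e^{-y_i f_t(x_i)},
\qquad f_t(x) \;=\; \sum_{s \le t} \alpha_s h_s(x).
% MadaBoost's modification (as I understand it) caps each weight at its
% initial probability, so weights can shrink but never grow:
w_t^{\mathrm{Mada}}(i) \;\propto\; D(i)\,\min\!\bigl(1,\; e^{-y_i f_t(x_i)}\bigr).
```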


Journal ArticleDOI
TL;DR: A theorem bounding the generalization performance of convex combinations in terms of general cost functions of the margin is proved, in contrast to previous results, which were stated in terms of the particular cost function sgn(θ − margin).
Abstract: Recent theoretical results have shown that the generalization performance of thresholded convex combinations of base classifiers is greatly improved if the underlying convex combination has large margins on the training data (i.e., correct examples are classified well away from the decision boundary). Neural network algorithms and AdaBoost have been shown to implicitly maximize margins, thus providing some theoretical justification for their remarkably good generalization performance. In this paper we are concerned with maximizing the margin explicitly. In particular, we prove a theorem bounding the generalization performance of convex combinations in terms of general cost functions of the margin, in contrast to previous results, which were stated in terms of the particular cost function sgn(θ − margin). We then present a new algorithm, DOOM, for directly optimizing a piecewise-linear family of cost functions satisfying the conditions of the theorem. Experiments on several of the datasets in the UC Irvine database are presented in which AdaBoost was used to generate a set of base classifiers and then DOOM was used to find the optimal convex combination of those classifiers. In all but one case the convex combination generated by DOOM had lower test error than AdaBoost's combination. In many cases DOOM achieves these lower test errors by sacrificing training error, in the interests of reducing the new cost function. In our experiments the margin plots suggest that the size of the minimum margin is not the critical factor in determining generalization performance.

134 citations
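In schematic form (symbols and the exact complexity term are stated loosely here, not quoted from the paper): for a convex combination f = Σ_t w_t h_t with w_t ≥ 0 and Σ_t w_t = 1, the result bounds generalization error by an empirical average of a cost function of the margin.

```latex
% Margin of a labelled training example (x_i, y_i):
\mathrm{margin}(x_i, y_i) \;=\; y_i f(x_i) \;\in\; [-1, 1].
% Earlier results used the step-like cost \mathrm{sgn}(\theta - \mathrm{margin});
% the theorem here is, schematically, of the form
P\bigl(y f(x) \le 0\bigr)
\;\le\;
\frac{1}{m}\sum_{i=1}^{m} C\bigl(y_i f(x_i)\bigr)
\;+\; \varepsilon\bigl(C,\, m,\, \text{complexity of the base class}\bigr)
% for a family of cost functions C of the margin, which DOOM then optimizes
% directly over a piecewise-linear family.
```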


Book ChapterDOI
31 May 2000
TL;DR: It is pointed out that the boosting pruning problem is intractable even to approximate, and a margin-based theoretical heuristic is suggested for this problem.
Abstract: Boosting is a powerful method for improving the predictive accuracy of classifiers. The AdaBoost algorithm of Freund and Schapire has been successfully applied to many domains [2, 10, 12] and the combination of AdaBoost with the C4.5 decision tree algorithm has been called the best off-the-shelf learning algorithm in practice. Unfortunately, in some applications, the number of decision trees required by AdaBoost to achieve a reasonable accuracy is enormously large and hence very space-consuming. This problem was first studied by Margineantu and Dietterich [7], who proposed an empirical method called Kappa pruning to prune the boosting ensemble of decision trees. The Kappa method did this without sacrificing too much accuracy. In this work-in-progress we propose a potential improvement to the Kappa pruning method and also study the boosting pruning problem from a theoretical perspective. We point out that the boosting pruning problem is intractable even to approximate. Finally, we suggest a margin-based theoretical heuristic for this problem.

97 citations
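A hedged sketch of kappa-style ensemble pruning as I understand the idea credited to Margineantu and Dietterich above: measure pairwise agreement with Cohen's kappa on a pruning set and keep members from the most diverse (lowest-kappa) pairs. Function names and the greedy selection details are my own simplification.

```python
# Kappa-style pruning sketch: rank pairs of committee members by agreement
# (Cohen's kappa) and keep members from the least-agreeing pairs first.
import itertools
import numpy as np

def cohens_kappa(a, b, n_classes):
    """Agreement of two integer label vectors, corrected for chance."""
    confusion = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        confusion[i, j] += 1
    confusion /= len(a)
    observed = np.trace(confusion)
    expected = (confusion.sum(axis=1) * confusion.sum(axis=0)).sum()
    return (observed - expected) / (1 - expected + 1e-12)

def kappa_prune(member_preds, keep, n_classes):
    """member_preds: (n_members, n_samples) predictions on a pruning set."""
    pairs = sorted(
        itertools.combinations(range(len(member_preds)), 2),
        key=lambda p: cohens_kappa(member_preds[p[0]], member_preds[p[1]], n_classes),
    )
    selected = []
    for i, j in pairs:                       # lowest-agreement pairs first
        for m in (i, j):
            if m not in selected:
                selected.append(m)
        if len(selected) >= keep:
            break
    return selected[:keep]
```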


Proceedings Article
31 May 2000
TL;DR: Initial experiments show that the boosting approach surpasses Naive Bayes and Exemplar-based approaches, which represent state-of-the-art accuracy on supervised WSD, and compares favourably to the other benchmark algorithms.
Abstract: In this paper Schapire and Singer's AdaBoost.MH boosting algorithm is applied to the Word Sense Disambiguation (WSD) problem. Initial experiments on a set of 15 selected polysemous words show that the boosting approach surpasses Naive Bayes and Exemplar-based approaches, which represent state-of-the-art accuracy on supervised WSD. In order to make boosting practical for a real learning domain of thousands of words, several ways of accelerating the algorithm by reducing the feature space are studied. The best variant, which we call LazyBoosting, is tested on the largest sense-tagged corpus available containing 192,800 examples of the 191 most frequent and ambiguous English words. Again, boosting compares favourably to the other benchmark algorithms.

85 citations
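A hedged sketch of the feature-space reduction described above, under my own simplification: at each boosting round, only a random fraction of the attributes is examined when choosing the weak rule (here a decision stump, with labels in {-1,+1}).

```python
# LazyBoosting-flavoured acceleration: search for the best stump only over a
# random subset of the features at each boosting round.
import numpy as np

def best_stump_on_subset(X, y, w, feature_fraction, rng):
    """Pick the lowest weighted-error decision stump among a random sample of
    the candidate features. y: numpy array in {-1, +1}; w: boosting weights."""
    n_features = X.shape[1]
    n_candidates = max(1, int(feature_fraction * n_features))
    candidates = rng.choice(n_features, n_candidates, replace=False)
    best = None
    for f in candidates:
        for thr in np.unique(X[:, f]):
            for sign in (+1, -1):
                pred = np.where(X[:, f] > thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best   # (weighted error, feature, threshold, polarity)
```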


Proceedings ArticleDOI
06 Nov 2000
TL;DR: An improved boosting algorithm, called AdaBoost.MH^KR, and its application to text categorization are described; the algorithm is shown to be both more efficient to train and more effective than the original AdaBoost.MH^R algorithm.
Abstract: We describe an improved boosting algorithm, called AdaBoost.MH^KR, and its application to text categorization. Boosting is a method for supervised learning which has successfully been applied to many different domains and has proven to be one of the best performers in text categorization exercises so far. Boosting is based on the idea of relying on the collective judgment of a committee of classifiers that are trained sequentially. In training the i-th classifier, special emphasis is placed on the correct categorization of the training documents which have proven harder for the previously trained classifiers. AdaBoost.MH^KR is based on the idea of building, at every iteration of the learning phase, not a single classifier but a sub-committee of the K classifiers which, at that iteration, look the most promising. We report the results of systematic experimentation with this method performed on the standard Reuters-21578 benchmark. These experiments have shown that AdaBoost.MH^KR is both more efficient to train and more effective than the original AdaBoost.MH^R algorithm.
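The sub-committee idea, reduced to its simplest form (this is my paraphrase, not the exact AdaBoost.MH^KR selection and weighting rule): at each round, keep the K most promising weak hypotheses rather than only the single best one.

```python
# Keep the K best weak hypotheses of the current round as a sub-committee.
def top_k_hypotheses(candidates, weighted_error, k):
    """candidates: list of weak hypotheses; weighted_error: callable returning
    each candidate's error under the current boosting distribution."""
    ranked = sorted(candidates, key=weighted_error)
    return ranked[:k]          # this round's sub-committee
```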

Proceedings ArticleDOI
01 Jul 2000
TL;DR: A boosting-based learning method for text filtering is presented that uses naive Bayes classifiers as the weak learner, which allows the boosting algorithm to utilize term frequency information while maintaining a probabilistically accurate confidence ratio.
Abstract: Several machine learning algorithms have recently been used for text categorization and filtering. In particular, boosting methods such as AdaBoost have shown good performance applied to real text data. However, most existing boosting algorithms are based on classifiers that use binary-valued features. Thus, they do not fully make use of the weight information provided by standard term weighting methods. In this paper, we present a boosting-based learning method for text filtering that uses naive Bayes classifiers as a weak learner. The use of naive Bayes allows the boosting algorithm to utilize term frequency information while maintaining a probabilistically accurate confidence ratio. Applied to TREC-7 and TREC-8 filtering track documents, the proposed method obtained a significant improvement in LF1, LF2, F1 and F3 measures compared to the best results submitted by other TREC entries.
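A hedged sketch of the weak learner described above (not the paper's exact model): a multinomial naive Bayes classifier over raw term frequencies, trained under the current boosting distribution so that term-frequency information is retained.

```python
# Weighted naive Bayes as a boosting weak learner over term frequencies.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def weak_naive_bayes(X_counts, y, w):
    """X_counts: (n_docs, n_terms) term-frequency matrix; y: numpy array in
    {-1, +1}; w: current boosting distribution over the documents."""
    nb = MultinomialNB().fit(X_counts, y, sample_weight=w * len(y))
    err = w[nb.predict(X_counts) != y].sum()   # weighted training error
    return nb, err    # plug into the usual AdaBoost reweighting of w
```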

Proceedings Article
29 Jun 2000
TL;DR: Cascading is discussed, where there is a sequence of classifiers ordered in terms of increasing complexity and specificity such that early classifiers are simple and general whereas later ones are more complex and specific, being localized on patterns rejected by the previous classifiers.
Abstract: For building implementable and industry-valuable classification solutions, machine learning methods must focus not only on accuracy but also on computational and space complexity. We discuss a multistage method, namely cascading, where there is a sequence of classifiers ordered in terms of increasing complexity and specificity such that early classifiers are simple and general whereas later ones are more complex and specific, being localized on patterns rejected by the previous classifiers. We present the technique and its rationale and validate its use by comparing it with the individual classifiers as well as the widely accepted ensemble methods bagging and AdaBoost on eight data sets from the UCI repository. We see that cascading increases accuracy without a concomitant increase in complexity and cost.
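A minimal two-stage sketch of the cascading idea (the classifier interfaces and the fixed confidence threshold are my assumptions; the paper's rule for rejecting patterns may differ): a cheap classifier answers the confident cases and passes the rest to a more complex, more specific one.

```python
# Two-stage cascade: the simple classifier answers when confident, otherwise
# the rejected patterns are passed on to the complex classifier.
import numpy as np

def cascade_predict(simple_clf, complex_clf, X, confidence=0.9):
    proba = simple_clf.predict_proba(X)                 # assumes predict_proba
    out = simple_clf.classes_[proba.argmax(axis=1)]
    rejected = proba.max(axis=1) < confidence           # not confident enough
    if rejected.any():
        out[rejected] = complex_clf.predict(X[rejected])
    return out
```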

Book ChapterDOI
13 Sep 2000
TL;DR: The comparison results using several datasets from the UCI machine learning repository show that boosting and bagging with dynamic integration of classifiers often yield better accuracy than boosting and bagging with their original voting techniques.
Abstract: One approach in classification tasks is to use machine learning techniques to derive classifiers from learning instances. The co-operation of several base classifiers as a decision committee has succeeded in reducing classification error. The main current decision committee learning approaches, boosting and bagging, use resampling of the training set and can be used with different machine learning techniques which derive base classifiers. Boosting uses a kind of weighted voting and bagging uses equal-weight voting as the combining method. Neither takes into account the local aspects that the base classifiers may have inside the problem space. We have proposed a dynamic integration technique to be used with ensembles of classifiers. In this paper, the proposed dynamic integration technique is applied with AdaBoost and bagging. The comparison results using several datasets from the UCI machine learning repository show that boosting and bagging with dynamic integration of classifiers often yield better accuracy than boosting and bagging with their original voting techniques.
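A hedged sketch of dynamic integration in its simplest "dynamic selection" form, which is my reading of the general idea rather than the authors' exact procedure: estimate each committee member's error in the neighbourhood of the new instance (here via nearest neighbours on a validation set) and let the locally best member decide.

```python
# Dynamic selection: pick the committee member with the lowest local error
# around the new instance.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dynamic_select(members, X_val, y_val, x_new, k=15):
    """members: fitted classifiers; (X_val, y_val): validation data used to
    estimate local errors; x_new: a single instance as a 1-D numpy array."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_val)
    _, idx = nn.kneighbors(x_new.reshape(1, -1))
    neighbours = idx[0]
    local_errors = [
        np.mean(m.predict(X_val[neighbours]) != y_val[neighbours]) for m in members
    ]
    best = int(np.argmin(local_errors))      # locally most reliable member
    return members[best].predict(x_new.reshape(1, -1))[0]
```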

Proceedings Article
29 Jun 2000
TL;DR: It is proved that minimizing the soft margin error function (equivalent to solving an LP) directly optimizes a generalization error bound, and LPBoost can be used to solve any boosting LP by iteratively optimizing the dual classification costs in a restricted LP and dynamically generating weak learners to make new LP columns.
Abstract: We examine linear program (LP) approaches to boosting and demonstrate their efficient solution using LPBoost, a column generation simplex method. We prove that minimizing the soft margin error function (equivalent to solving an LP) directly optimizes a generalization error bound. LPBoost can be used to solve any boosting LP by iteratively optimizing the dual classification costs in a restricted LP and dynamically generating weak learners to make new LP columns. Unlike gradient boosting algorithms, LPBoost converges finitely to a global solution using well-defined stopping criteria. Computationally, LPBoost finds very sparse solutions as good as or better than those found by AdaBoost using comparable computation.
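The soft-margin boosting LP is roughly of the following form (C is a misclassification-cost parameter and a_j the weight on weak hypothesis h_j; the exact formulation in the paper may differ in details such as the normalization constraint):

```latex
% Soft-margin boosting LP (schematic):
\max_{\rho,\, a,\, \xi}\;\; \rho \;-\; C\sum_{i=1}^{m}\xi_i
\quad\text{s.t.}\quad
y_i \sum_{j} a_j\, h_j(x_i) \;\ge\; \rho - \xi_i,
\qquad
\sum_j a_j = 1,\;\; a_j \ge 0,\;\; \xi_i \ge 0.
% Column generation adds one weak hypothesis (one column a_j) at a time,
% chosen against the dual costs of the current restricted LP.
```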

Proceedings Article
28 Jun 2000
TL;DR: This paper examines master regression algorithms that leverage base regressors by iteratively calling them on modified samples, presents three gradient descent leveraging algorithms for regression, and proves AdaBoost-style bounds on their sample error using intuitive assumptions on the base learners.
Abstract: In this paper we examine master regression algorithms that leverage base regressors by iteratively calling them on modified samples. The most successful leveraging algorithm for classification is AdaBoost, an algorithm that requires only modest assumptions on the base learning method for its good theoretical bounds. We present three gradient descent leveraging algorithms for regression and prove AdaBoost-style bounds on their sample error using intuitive assumptions on the base learners. We derive bounds on the size of the master functions that lead to PAC-style bounds on the generalization error.
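A sketch of the general gradient-descent-leveraging flavour for regression under squared loss (fit each base regressor to the current residuals); the paper's three algorithms differ in how the modified samples and step sizes are chosen, and the tree base learner and fixed step here are my assumptions.

```python
# Leveraging base regressors by gradient descent on squared loss: each round
# fits a base regressor to the residuals (the negative gradient).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leverage_regressors(X, y, n_rounds=50, step=0.1):
    prediction = np.zeros(len(y))
    members = []
    for _ in range(n_rounds):
        residual = y - prediction            # negative gradient of 1/2 (y - F)^2
        base = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        prediction += step * base.predict(X)
        members.append(base)
    return members

def master_predict(members, X, step=0.1):
    return step * sum(m.predict(X) for m in members)
```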

Book ChapterDOI
18 Apr 2000
TL;DR: This paper presents an experimental evaluation of a boosting-based learning system that can be run efficiently over a large dataset, and provides experimental evidence that the method is as accurate as the equivalent algorithm that uses the whole dataset, but much faster.
Abstract: In this paper we present an experimental evaluation of a boosting-based learning system and show that it can be run efficiently over a large dataset. The system uses decision stumps as the base learner: single-attribute decision trees with only two terminal nodes. To select the best decision stump at each iteration we use an adaptive sampling method. As the boosting algorithm, we use a modification of AdaBoost that is suitable for combination with a base learner that does not use the whole dataset. We provide experimental evidence that our method is as accurate as the equivalent algorithm that uses the whole dataset, but much faster.
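A hedged sketch of selecting the stump from a subsample rather than the full dataset; the paper's adaptive sampling grows the sample until a statistical stopping condition is met, whereas here the sample size is simply fixed for illustration.

```python
# Draw a subsample according to the current boosting weights; the best
# decision stump is then searched on (Xs, ys) instead of the full data.
import numpy as np

def weighted_subsample(X, y, w, sample_size, rng):
    """w: current boosting weights; returns a weighted bootstrap subsample."""
    idx = rng.choice(len(y), size=sample_size, replace=True, p=w / w.sum())
    return X[idx], y[idx]
```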

Dissertation
01 Jan 2000
TL;DR: A theoretical framework is proposed to specify the relationship between distance measurement and class score; a novel combining method is suggested to reduce the effect of code word selection in non-optimum codes, and novel reconstruction frameworks are suggested to combine the component outputs.
Abstract: In the "decomposition/reconstruction" strategy, a complex problem is solved by 1) decomposing it into simpler sub-problems, 2) solving the sub-problems with simpler systems (sub-systems), and 3) combining the results of the sub-systems to solve the original problem. In a classification task we may have "label complexity", due to a high number of possible classes, "function complexity", meaning the existence of a complex input-output relationship, and "input complexity", due to the requirement of a huge feature set to represent patterns. Error Correcting Output Coding (ECOC) is a technique to reduce label complexity in which a multi-class problem is decomposed into a set of binary sub-problems, based on the sequence of "0"s and "1"s in the columns of a decomposition (code) matrix. A given pattern can then be assigned to the class having minimum distance to the results of the sub-problems. The lack of knowledge about the relationship between distance measurement and class score (such as posterior probabilities) has caused essential shortcomings in answering questions about the "source of effectiveness", "error analysis", "code selection", and "alternative reconstruction methods" in previous work. Proposing a theoretical framework in this thesis to specify this relationship, our main contributions are to: 1) explain the theoretical reasons for the code selection conditions; 2) suggest new conditions for code generation (equidistance codes) which minimise reconstruction error, and address a search technique for code selection; 3) provide an analysis showing the effect of different kinds of error on final performance; 4) suggest a novel combining method to reduce the effect of code word selection in non-optimum codes; 5) suggest novel reconstruction frameworks to combine the component outputs. Experiments on artificial and real benchmarks demonstrate significant improvements on multi-class problems when simple feed-forward neural networks are arranged according to the suggested framework. To address function complexity we consider AdaBoost, as a technique which can be fused with ECOC to overcome its shortcoming on binary problems, and to handle huge feature sets we suggest a multi-net structure with local back-propagation. To demonstrate these improvements on realistic problems, a face recognition application is considered. Key words: decomposition/reconstruction, reconstruction error, error correcting output codes, bias-variance decomposition.

Book ChapterDOI
05 Sep 2000
TL;DR: The proposed dynamic integration technique is evaluated with AdaBoost and Bagging, the decision committee approaches which have received extensive attention recently, and the results show that boosting and bagging often have significantly better accuracy with dynamic integration of classifiers than with simple voting.
Abstract: Decision committee learning has demonstrated spectacular success in reducing classification error from learned classifiers. These techniques develop a classifier in the form of a committee of subsidiary classifiers. The combination of outputs is usually performed by majority vote. Voting, however, has a shortcoming. It is unable to take into account local expertise. When a new instance is difficult to classify, then the average classifier will give a wrong prediction, and the majority vote will more probably result in a wrong prediction. Instead of voting, dynamic integration of classifiers can be used, which is based on the assumption that each committee member is best inside certain subareas of the whole feature space. In this paper, the proposed dynamic integration technique is evaluated with AdaBoost and Bagging, the decision committee approaches which have received extensive attention recently. The comparison results show that boosting and bagging have often significantly better accuracy with dynamic integration of classifiers than with simple voting.

Book ChapterDOI
13 Dec 2000
TL;DR: A boosting strategy is considered that optimises the generalisation bound obtained recently by Shawe-Taylor and Cristianini in terms of the two-norm of the slack variables, and the resulting algorithm achieves significant improvements over AdaBoost.
Abstract: The paper considers applying a boosting strategy to optimise the generalisation bound obtained recently by Shawe-Taylor and Cristianini [7] in terms of the two-norm of the slack variables. The formulation performs gradient descent over the quadratic loss function which is insensitive to points with a large margin. A novel feature of this algorithm is a principled adaptation of the size of the target margin. Experiments with text and UCI data show that the new algorithm improves the accuracy of boosting. DMarginBoost generally achieves significant improvements over AdaBoost.
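Schematically, the loss being descended is a quadratic function of the margin slacks relative to an adaptively chosen target margin θ (the notation is mine, following the abstract's description):

```latex
% Slack of example i with respect to the target margin \theta:
\xi_i \;=\; \max\bigl(0,\; \theta - y_i f(x_i)\bigr),
\qquad
L(f) \;=\; \sum_{i=1}^{m} \xi_i^{\,2} \;=\; \|\xi\|_2^{2},
% so gradient descent on L(f) ignores points whose margin already exceeds
% \theta, the bound is stated in terms of the two-norm of these slack
% variables, and the target \theta is adapted during boosting.
```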



Book ChapterDOI
17 Dec 2000
TL;DR: The proposed dynamic integration of classifiers is evaluated in combination with the well-known decision committee approaches AdaBoost and Bagging, and the results show that both boosting and bagging often produce significantly higher accuracy with dynamic integration than with voting.
Abstract: Decision committee learning has demonstrated outstanding success in reducing classification error with an ensemble of classifiers. In a way a decision committee is a classifier formed upon an ensemble of subsidiary classifiers. Voting, which is commonly used to produce the final decision of committees, has, however, a shortcoming. It is unable to take into account local expertise. When a new instance is difficult to classify, it easily happens that only a minority of the classifiers will succeed, and majority voting will quite probably result in a wrong classification. We suggest that dynamic integration of classifiers be used instead of majority voting in decision committees. Our method is based on the assumption that each classifier is best inside certain subareas of the whole domain. In this paper, the proposed dynamic integration is evaluated in combination with the well-known decision committee approaches AdaBoost and Bagging. The comparison results show that both boosting and bagging often produce significantly higher accuracy with dynamic integration than with voting.