Showing papers in "Machine Learning in 2000"

An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization

TL;DR: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.

Abstract: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
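A minimal sketch of the EM loop described in this abstract, assuming a multinomial naive Bayes model with Laplace smoothing. The function names, the toy documents, and the `unlabeled_weight` parameter (standing in for the weighting factor of extension (1)) are illustrative assumptions, not the authors' implementation.

```python
import math

def train_nb(docs, vocab, classes):
    """docs: list of (words, {class: weight}); returns (log priors, log word probabilities)."""
    class_mass = {c: 1.0 for c in classes}                     # Laplace-smoothed class counts
    word_mass = {c: {w: 1.0 for w in vocab} for c in classes}  # Laplace-smoothed word counts
    for words, weights in docs:
        for c, weight in weights.items():
            class_mass[c] += weight
            for w in words:
                word_mass[c][w] += weight
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_word = {}
    for c in classes:
        denom = sum(word_mass[c].values())
        log_word[c] = {w: math.log(word_mass[c][w] / denom) for w in vocab}
    return log_prior, log_word

def class_posterior(words, model, classes):
    """Normalized class probabilities for one document (unknown words are ignored)."""
    log_prior, log_word = model
    scores = {c: log_prior[c] + sum(log_word[c][w] for w in words if w in log_word[c])
              for c in classes}
    top = max(scores.values())
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def em_naive_bayes(labeled, unlabeled, unlabeled_weight=1.0, iterations=10):
    classes = sorted({y for _, y in labeled})
    vocab = sorted({w for words, _ in labeled for w in words}
                   | {w for words in unlabeled for w in words})
    hard = [(words, {y: 1.0}) for words, y in labeled]
    model = train_nb(hard, vocab, classes)          # initial classifier: labeled data only
    for _ in range(iterations):
        soft = []
        for words in unlabeled:                     # E-step: probabilistic labels
            post = class_posterior(words, model, classes)
            soft.append((words, {c: unlabeled_weight * p for c, p in post.items()}))
        model = train_nb(hard + soft, vocab, classes)  # M-step: retrain on all documents
    return model, classes

# Hypothetical toy data: "team" never appears in a labeled document.
labeled = [(["goal", "match"], "sport"), (["vote", "party"], "politics")]
unlabeled = [["goal", "team"], ["team", "match"], ["vote", "election"], ["election", "party"]]
model, classes = em_naive_bayes(labeled, unlabeled)
team_posterior = class_posterior(["team"], model, classes)
```

In this toy setting the word "team" is associated with the "sport" class only through its co-occurrence with labeled words in the unlabeled pool, which is the effect the abstract attributes to unlabeled data.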

3,123 citations

Journal Article•DOI•

Thomas G. Dietterich¹•Institutions (1)

Oregon State University¹

01 Aug 2000-Machine Learning

TL;DR: In this article, the authors compared the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5 and found that in situations with little or no classification noise, randomization is competitive with bagging but not as accurate as boosting.

Abstract: Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a “base” learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.
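A minimal sketch of bagging as characterized in this abstract: each ensemble member is trained on a bootstrap resample of the training data, and predictions are combined by majority vote. The one-feature threshold stump below is a hypothetical stand-in for C4.5, and the toy data set is an assumption made for illustration.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-feature threshold classifier (illustrative stand-in for C4.5)."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(data, n_models=25, seed=0):
    """Train n_models stumps on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
        models.append(train_stump(bootstrap))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Hypothetical toy problem: label is 1 when x > 5.
data = [(x, int(x > 5)) for x in range(11)]
ensemble = bagging(data)
```

Boosting and randomization differ in where the diversity comes from: boosting reweights the training examples between rounds, while randomization perturbs the internal split decisions of the base learner rather than the data.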

2,919 citations

Journal Article•DOI•

BoosTexter: A Boosting-based System for Text Categorization

Robert E. Schapire¹, Yoram Singer²•Institutions (2)

AT&T Labs¹, Hebrew University of Jerusalem²

Reduction Techniques for Instance-Based Learning Algorithms

TL;DR: This article describes BoosTexter, an implementation of a new and improved family of boosting algorithms that learn from examples to perform multiclass text and speech categorization.

Abstract: This work focuses on algorithms which learn from examples to perform multiclass text and speech categorization tasks. Our approach is based on a new and improved family of boosting algorithms. We describe in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks. We present results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks. We conclude by describing the application of our system to automatic call-type identification from unconstrained spoken customer responses.

2,108 citations

Journal Article•DOI•

D. Randall Wilson¹, Tony Martinez¹•Institutions (1)

Brigham Young University¹

A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms

TL;DR: Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest average generalization accuracy in these experiments, especially in the presence of uniform class noise.

Abstract: Instance-based learning algorithms are often faced with the problem of deciding which instances to store for use during generalization. Storing too many instances can result in large memory requirements and slow execution speed, and can cause an oversensitivity to noise. This paper has two main purposes. First, it provides a survey of existing algorithms used to reduce storage requirements in instance-based learning algorithms and other exemplar-based algorithms. Second, it proposes six additional reduction algorithms called DROP1–DROP5 and DEL (three of which were first described in Wilson & Martinez, 1997c, as RT1–RT3) that can be used to remove instances from the concept description. These algorithms and 10 algorithms from the survey are compared on 31 classification tasks. Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest average generalization accuracy in these experiments, especially in the presence of uniform class noise.

1,234 citations

Journal Article•DOI•

Tjen-Sien Lim¹, Wei-Yin Loh¹, Yu-Shan Shih²•Institutions (2)

University of Wisconsin-Madison¹, National Chung Cheng University²

MultiBoosting: A Technique for Combining Boosting and Wagging

TL;DR: Among decision tree algorithms with univariate splits, C4.5, IND-CART, and QUEST have the best combinations of error rate and speed, but C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST.

Abstract: Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. Classification accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based algorithm called POLYCLASS at the top, although it is not statistically significantly different from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is QUEST with linear splits, which ranks fourth and fifth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. POLYCLASS, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The QUEST and logistic regression algorithms are substantially faster. Among decision tree algorithms with univariate splits, C4.5, IND-CART, and QUEST have the best combinations of error rate and speed. But C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST.

1,201 citations

Journal Article•DOI•

Geoffrey I. Webb¹•Institutions (1)

Deakin University¹

01 Aug 2000-Machine Learning

TL;DR: MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees that is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction.

Abstract: MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, MultiBoosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution.

729 citations

Journal Article•DOI•

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms

Satinder Singh¹, Tommi S. Jaakkola², Michael L. Littman³, Csaba Szepesvári•Institutions (3)

AT&T Labs¹, Massachusetts Institute of Technology², Duke University³

Machine Learning for Information Extraction in Informal Domains

TL;DR: This paper examines the convergence of single-step on-policy RL algorithms for control with both decaying exploration and persistent exploration and provides examples of exploration strategies that result in convergence to both optimal values and optimal policies.

Abstract: An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

660 citations

Journal Article•DOI•

Dayne Freitag¹•Institutions (1)

Justsystem Pittsburgh Research Center¹

Lazy Learning of Bayesian Rules

TL;DR: This paper describes four learning approaches to information extraction in informal domains and a multistrategy approach that combines them, yielding performance competitive with or better than the best of them; the technique is modular and flexible and could find application in other machine learning problems.

Abstract: We consider the problem of learning to perform information extraction in domains where linguistic processing is problematic, such as Usenet posts, email, and finger plan files. In place of syntactic and semantic information, other sources of information can be used, such as term frequency, typography, formatting, and mark-up. We describe four learning approaches to this problem, each drawn from a different paradigm: a rote learner, a term-space learner based on Naive Bayes, an approach using grammatical induction, and a relational rule learner. Experiments on 14 information extraction problems defined over four diverse document collections demonstrate the effectiveness of these approaches. Finally, we describe a multistrategy approach which combines these learners and yields performance competitive with or better than the best of them. This technique is modular and flexible, and could find application in other machine learning problems.

410 citations

Journal Article•DOI•

Zijian Zheng¹, Geoffrey I. Webb¹•Institutions (1)

Deakin University¹

01 Oct 2000-Machine Learning

TL;DR: This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR, which can be justified by a variant of Bayes theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes.

Abstract: The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree, and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute inter-dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. This algorithm can be justified by a variant of Bayes theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of LBR are reasonable in a wide cross-section of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse in comparison to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier.

262 citations

Journal Article•DOI•

Randomizing Outputs to Increase Prediction Accuracy

Leo Breiman¹•Institutions (1)

University of California, Berkeley¹

LEARNABLE EVOLUTION MODEL: Evolutionary Processes Guided by Machine Learning

TL;DR: Two methods of randomizing outputs, output smearing and output flipping, are experimented with; both are shown to consistently do better than bagging.

Abstract: Bagging and boosting reduce error by changing both the inputs and outputs to form perturbed training sets, growing predictors on these perturbed training sets and combining them. An interesting question is whether it is possible to get comparable performance by perturbing the outputs alone. Two methods of randomizing outputs are experimented with. One is called output smearing and the other output flipping. Both are shown to consistently do better than bagging.

261 citations

Journal Article•DOI•

Ryszard S. Michalski¹•Institutions (1)

Polish Academy of Sciences¹

01 Jan 2000-Machine Learning

TL;DR: A remarkable property of LEM is that it is capable of quantum leaps (“insight jumps”) of the fitness function, unlike Darwinian-type evolution that typically proceeds through numerous slight improvements.

Abstract: A new class of evolutionary computation processes is presented, called Learnable Evolution Model or LEM. In contrast to Darwinian-type evolution that relies on mutation, recombination, and selection operators, LEM employs machine learning to generate new populations. Specifically, in Machine Learning mode, a learning system seeks reasons why certain individuals in a population (or a collection of past populations) are superior to others in performing a designated class of tasks. These reasons, expressed as inductive hypotheses, are used to generate new populations. A remarkable property of LEM is that it is capable of quantum leaps (“insight jumps”) of the fitness function, unlike Darwinian-type evolution that typically proceeds through numerous slight improvements. In our early experimental studies, LEM significantly outperformed evolutionary computation methods used in the experiments, sometimes achieving speed-ups of two or more orders of magnitude in terms of the number of evolutionary steps. LEM has a potential for a wide range of applications, in particular, in such domains as complex optimization or search problems, engineering design, drug design, evolvable hardware, software engineering, economics, data mining, and automatic programming.

Journal Article•DOI•

Multiple Comparisons in Induction Algorithms

David Jensen¹, Paul R. Cohen¹•Institutions (1)

University of Massachusetts Amherst¹

A Formalism for Relevance and Its Application in Feature Subset Selection

TL;DR: This work analyzes the statistical properties of MCPs and shows how failure to adjust for these properties leads to the pathologies of induction algorithms, including attribute selection errors, overfitting, and oversearching.

Abstract: A single mechanism is responsible for three pathologies of induction algorithms: attribute selection errors, overfitting, and oversearching. In each pathology, induction algorithms compare multiple items based on scores from an evaluation function and select the item with the maximum score. We call this a multiple comparison procedure (MCP). We analyze the statistical properties of MCPs and show how failure to adjust for these properties leads to the pathologies. We also discuss approaches that can control pathological behavior, including Bonferroni adjustment, randomization testing, and cross-validation.
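The effect of a multiple comparison procedure, and the Bonferroni adjustment mentioned in this abstract, can be sketched with a short calculation. The sketch assumes the comparisons are statistically independent, which is a simplification; the function names are illustrative.

```python
def familywise_error_rate(alpha, n_comparisons):
    """Probability of at least one false positive among independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_comparisons

def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_comparisons

# With 20 candidate items each tested at 0.05, a spurious "winner" is more likely than not.
unadjusted = familywise_error_rate(0.05, 20)
adjusted = familywise_error_rate(bonferroni_threshold(0.05, 20), 20)
```

Selecting the maximum-scoring item from many candidates behaves like the unadjusted case: the more items compared, the more likely the best score is due to chance.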

Journal Article•DOI•

David A. Bell¹, Hui Wang¹•Institutions (1)

Ulster University¹

01 Nov 2000-Machine Learning

TL;DR: A rigorous mathematical formalism is proposed for relevance, which is quantitative and normalized and applied to FSS, resulting in an improvement in prediction accuracy on 16 datasets, and a loss in accuracy on only 1 dataset.

Abstract: The notion of relevance is used in many technical fields. In the areas of machine learning and data mining, for example, relevance is frequently used as a measure in feature subset selection (FSS). In previous studies, the interpretation of relevance has varied and its connection to FSS has been loose. In this paper a rigorous mathematical formalism is proposed for relevance, which is quantitative and normalized. To apply the formalism in FSS, a characterization is proposed for FSS: preservation of learning information and minimization of joint entropy. Based on the characterization, a tight connection between relevance and FSS is established: maximizing the relevance of features to the decision attribute, and the relevance of the decision attribute to the features. This connection is then used to design an algorithm for FSS. The algorithm is linear in the number of instances and quadratic in the number of features. The algorithm is evaluated using 23 public datasets, resulting in an improvement in prediction accuracy on 16 datasets, and a loss in accuracy on only 1 dataset. This provides evidence that both the formalism and its connection to FSS are sound.

Journal Article•DOI•

Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web

Filippo Menczer¹, Richard K. Belew²•Institutions (2)

University of Iowa¹, University of California, San Diego²

Technical Note: Naive Bayes for Regression

TL;DR: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents that browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion.

Abstract: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents. These InfoSpiders browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion. Each agent adapts to the spatial and temporal regularities of its local context thanks to a combination of machine learning techniques inspired by ecological models: evolutionary adaptation with local selection, reinforcement learning and selective query expansion by internalization of environmental signals, and optional relevance feedback. We evaluate the feasibility and performance of these methods in three domains: a general class of artificial graph environments, a controlled subset of the Web, and (preliminarily) the full Web. Our results suggest that InfoSpiders could take advantage of the starting points provided by search engines, based on global word statistics, and then use linkage topology to guide their search on-line. We show how this approach can complement the current state of the art, especially with respect to the scalability challenge.

Journal Article•DOI•

Eibe Frank¹, Leonard Eric Trigg¹, Geoffrey Holmes¹, Ian H. Witten¹•Institutions (1)

University of Waikato¹

01 Oct 2000-Machine Learning

TL;DR: This paper shows how to apply the naive Bayes methodology to numeric prediction tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces “model trees” (decision trees with linear regression functions at the leaves).

Abstract: Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates. This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces “model trees”—decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes applied to regression problems by discretizing the target value performs similarly badly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.

Journal Article•DOI•

Selecting Examples for Partial Memory Learning

Marcus A. Maloof¹, Ryszard S. Michalski²•Institutions (2)

Georgetown University¹, Polish Academy of Sciences²

01 Oct 2000-Machine Learning

TL;DR: Experimental results suggest that the partial memory learner notably reduced memory requirements at the slight expense of predictive accuracy, and tracked concept drift as well as other learners designed for this task.

Abstract: This paper describes a method for selecting training examples for a partial memory learning system. The method selects extreme examples that lie at the boundaries of concept descriptions and uses these examples with new training examples to induce new concept descriptions. Forgetting mechanisms also may be active to remove examples from partial memory that are irrelevant or outdated for the learning task. Using an implementation of the method, we conducted a lesion study and a direct comparison to examine the effects of partial memory learning on predictive accuracy and on the number of training examples maintained during learning. These experiments involved the STAGGER Concepts, a synthetic problem, and two real-world problems: a blasting cap detection problem and a computer intrusion detection problem. Experimental results suggest that the partial memory learner notably reduced memory requirements at the slight expense of predictive accuracy, and tracked concept drift as well as other learners designed for this task.

Journal Article•DOI•

Learning to Play Chess Using Temporal Differences

Jonathan Baxter¹, Andrew Tridgell¹, Lex Weaver¹•Institutions (1)

Australian National University¹

Improved Generalization Through Explicit Optimization of Margins

TL;DR: TDLEAF(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with game-tree search, is presented, and the paper investigates whether it can yield better results in the domain of backgammon, where TD(λ) has previously yielded striking success.

Abstract: In this paper we present TDLEAF(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with game-tree search. We present some experiments in which our chess program “KnightCap” used TDLEAF(λ) to learn its evaluation function while playing on Internet chess servers. The main success we report is that KnightCap improved from a 1650 rating to a 2150 rating in just 308 games and 3 days of play. As a reference, a rating of 1650 corresponds to about level B human play (on a scale from E (1000) to A (1800)), while 2150 is human master level. We discuss some of the reasons for this success, principal among them being the use of on-line, rather than self-play. We also investigate whether TDLEAF(λ) can yield better results in the domain of backgammon, where TD(λ) has previously yielded striking success.

Journal Article•DOI•

Llew Mason¹, Peter L. Bartlett¹, Jonathan Baxter¹•Institutions (1)

Australian National University¹

Nonparametric Time Series Prediction Through Adaptive Model Selection

TL;DR: A theorem bounding the generalization performance of convex combinations in terms of general cost functions of the margin is proved, in contrast to previous results, which were stated in terms of the particular cost function sgn(θ − margin).

Abstract: Recent theoretical results have shown that the generalization performance of thresholded convex combinations of base classifiers is greatly improved if the underlying convex combination has large margins on the training data (i.e., correct examples are classified well away from the decision boundary). Neural network algorithms and AdaBoost have been shown to implicitly maximize margins, thus providing some theoretical justification for their remarkably good generalization performance. In this paper we are concerned with maximizing the margin explicitly. In particular, we prove a theorem bounding the generalization performance of convex combinations in terms of general cost functions of the margin, in contrast to previous results, which were stated in terms of the particular cost function sgn(θ − margin). We then present a new algorithm, DOOM, for directly optimizing a piecewise-linear family of cost functions satisfying the conditions of the theorem. Experiments on several of the datasets in the UC Irvine database are presented in which AdaBoost was used to generate a set of base classifiers and then DOOM was used to find the optimal convex combination of those classifiers. In all but one case the convex combination generated by DOOM had lower test error than AdaBoost's combination. In many cases DOOM achieves these lower test errors by sacrificing training error, in the interests of reducing the new cost function. In our experiments the margin plots suggest that the size of the minimum margin is not the critical factor in determining generalization performance.
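For readers unfamiliar with the term, the margin of a labeled example under a convex combination of base classifiers can be sketched as follows. This assumes the common binary convention with labels and base-classifier votes in {-1, +1}; the function name and the example numbers are illustrative, and the sketch does not implement DOOM itself.

```python
def margin(weights, votes, label):
    """Margin of one example: label times the weighted vote, with weights normalized to sum to 1.

    weights: non-negative coefficients of the base classifiers
    votes:   each base classifier's prediction for the example, in {-1, +1}
    label:   the true label, in {-1, +1}
    """
    total = sum(weights)
    return label * sum(w * v for w, v in zip(weights, votes)) / total

# Three base classifiers; the two heaviest vote +1 and the lightest votes -1.
m_correct = margin([0.5, 0.3, 0.2], [1, 1, -1], label=1)
m_wrong = margin([0.5, 0.3, 0.2], [1, 1, -1], label=-1)
```

The margin lies in [-1, 1]: it is positive when the thresholded combination classifies the example correctly, and its magnitude indicates how far the weighted vote is from the decision boundary.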

Journal Article•DOI•

Ron Meir¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 Apr 2000-Machine Learning

TL;DR: This work considers the problem of one-step ahead prediction for time series generated by an underlying stationary stochastic process obeying the condition of absolute regularity, and derives nonparametric rates of convergence through an extension of the method of structural risk minimization suggested by Vapnik.

Abstract: We consider the problem of one-step ahead prediction for time series generated by an underlying stationary stochastic process obeying the condition of absolute regularity, describing the mixing nature of the process. We make use of recent results from the theory of empirical processes, and adapt the uniform convergence framework of Vapnik and Chervonenkis to the problem of time series prediction, obtaining finite sample bounds. Furthermore, by allowing both the model complexity and memory size to be adaptively determined by the data, we derive nonparametric rates of convergence through an extension of the method of structural risk minimization suggested by Vapnik. All our results are derived for general L_p error measures, and apply to both exponentially and algebraically mixing processes.

Journal Article•DOI•

On-line Learning and the Metrical Task System Problem

Avrim Blum¹, Carl Burch¹•Institutions (1)

Carnegie Mellon University¹

01 Apr 2000-Machine Learning

TL;DR: An experimental comparison of how these algorithms perform on a process migration problem, a problem that combines aspects of both the experts-tracking and MTS formalisms, is presented.

Abstract: The problem of combining expert advice, studied extensively in the Computational Learning Theory literature, and the Metrical Task System (MTS) problem, studied extensively in the area of On-line Algorithms, contain a number of interesting similarities. In this paper we explore the relationship between these problems and show how algorithms designed for each can be used to achieve good bounds and new approaches for solving the other. Specific contributions of this paper include:

• An analysis of how two recent algorithms for the MTS problem can be applied to the problem of tracking the best expert in the “decision-theoretic” setting, providing good bounds and an approach of a much different flavor from the well-known multiplicative-update algorithms.
• An analysis showing how the standard randomized Weighted Majority (or Hedge) algorithm can be used for the problem of “combining on-line algorithms on-line”, giving much stronger guarantees than the results of Azar, Y., Broder, A., & Manasse, M. (1993). Proc ACM-SIAM Symposium on Discrete Algorithms (pp. 432–440) when the algorithms being combined occupy a state space of bounded diameter.
• A generalization of the above, showing how (a simplified version of) Herbster and Warmuth's weight-sharing algorithm can be applied to give a “finely competitive” bound for the uniform-space Metrical Task System problem.

We also give a new, simpler algorithm for tracking experts, which unfortunately does not carry over to the MTS problem. Finally, we present an experimental comparison of how these algorithms perform on a process migration problem, a problem that combines aspects of both the experts-tracking and MTS formalisms.
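A minimal sketch of the multiplicative-update scheme behind the randomized Weighted Majority (Hedge) algorithm named in this abstract, in the decision-theoretic setting with per-round losses in [0, 1]. The function name, the choice `beta=0.5`, and the toy loss sequence are assumptions for illustration; the sketch does not cover the paper's MTS-based algorithms or movement costs.

```python
def hedge(loss_rounds, beta=0.5):
    """Run Hedge over a sequence of loss vectors (one loss in [0, 1] per expert per round).

    Returns the learner's total expected loss and the final expert weights.
    """
    n_experts = len(loss_rounds[0])
    weights = [1.0] * n_experts
    expected_loss = 0.0
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]  # play expert i with probability w_i / sum(w)
        expected_loss += sum(p * loss for p, loss in zip(probs, losses))
        # Multiplicative update: an expert's weight shrinks by beta raised to its loss.
        weights = [w * beta ** loss for w, loss in zip(weights, losses)]
    return expected_loss, weights

# Two experts over three rounds: expert 0 is always right, expert 1 is always wrong.
total_loss, final_weights = hedge([[0, 1], [0, 1], [0, 1]])
```

Because the poorly performing expert's weight decays geometrically, the learner's per-round expected loss falls from 0.5 to 0.2 over the three rounds in this example.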

Journal Article•DOI•

Enlarging the Margins in Perceptron Decision Trees

Kristin P. Bennett¹, Nello Cristianini², John Shawe-Taylor², Donghui Wu¹•Institutions (2)

Rensselaer Polytechnic Institute¹, Royal Holloway, University of London²

01 Dec 2000-Machine Learning

TL;DR: For perceptron decision trees, this paper proves an upper bound on the generalization error that depends on both the size of the tree and the margin of the decision nodes, showing that quantities other than size can be relevant to reducing flexibility and combating overfitting.

Abstract: Capacity control in perceptron decision trees is typically performed by controlling their size. We prove that other quantities can be as relevant to reduce their flexibility and combat overfitting. In particular, we provide an upper bound on the generalization error which depends both on the size of the tree and on the margin of the decision nodes. So enlarging the margin in perceptron decision trees will reduce the upper bound on generalization error. Based on this analysis, we introduce three new algorithms, which can induce large margin perceptron decision trees. To assess the effect of the large margin bias, OC1 (Journal of Artificial Intelligence Research, 1994, 2, 1–32.) of Murthy, Kasif and Salzberg, a well-known system for inducing perceptron decision trees, is used as the baseline algorithm. An extensive experimental study on real world data showed that all three new algorithms perform better or at least not significantly worse than OC1 on almost every dataset with only one exception. OC1 performed worse than the best margin-based method on every dataset.

Journal Article•DOI•

Multistrategy Theory Revision: Induction and Abduction in INTHELEX

Floriana Esposito¹, Giovanni Semeraro¹, Nicola Fanizzi¹, Stefano Ferilli¹•Institutions (1)

University of Bari¹

01 Jan 2000-Machine Learning

TL;DR: This paper presents an integration of induction and abduction in INTHELEX, a prototypical incremental learning system that has been run on a standard dataset about family trees as well as in the domain of document classification to prove the effectiveness of such a multistrategy incremental learning system with respect to a classical batch algorithm.

Abstract: This paper presents an integration of induction and abduction in INTHELEX, a prototypical incremental learning system. The refinement operators perform theory revision in a search space whose structure is induced by a quasi-ordering, derived from Plotkin's θ-subsumption, compliant with the principle of Object Identity. A reduced complexity of the refinement is obtained, without a major loss in terms of expressiveness. These inductive operators have been proven ideal for this search space. Abduction supports the inductive operators in the completion of the incoming new observations. Experiments have been run on a standard dataset about family trees as well as in the domain of document classification to prove the effectiveness of such a multistrategy incremental learning system with respect to a classical batch algorithm.

Journal Article•DOI•

A Study of Reinforcement Learning in the Continuous Case by the Means of Viscosity Solutions

Rémi Munos¹•Institutions (1)

Carnegie Mellon University¹