scispace - formally typeset
Search or ask a question

Showing papers in "Machine Learning in 2000"


Journal ArticleDOI
TL;DR: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents, and presents two extensions to the algorithm that improve classification accuracy under these conditions.
Abstract: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.

3,123 citations


Journal ArticleDOI
TL;DR: In this article, the authors compared the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5 and found that in situations with little or no classification noise, randomization is competitive with bagging but not as accurate as boosting.
Abstract: Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a “base” learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.

2,919 citations


Journal ArticleDOI
TL;DR: In this article, a new and improved family of boosting algorithms is proposed for text categorization tasks, called BoosTexter, which learns from examples to perform multiclass text and speech categorization.
Abstract: This work focuses on algorithms which learn from examples to perform multiclass text and speech categorization tasks. Our approach is based on a new and improved family of boosting algorithms. We describe in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks. We present results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks. We conclude by describing the application of our system to automatic call-type identification from unconstrained spoken customer responses.

2,108 citations


Journal ArticleDOI
TL;DR: Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest average generalization accuracy in these experiments, especially in the presence of uniform class noise.
Abstract: Instance-based learning algorithms are often faced with the problem of deciding which instances to store for use during generalization. Storing too many instances can result in large memory requirements and slow execution speed, and can cause an oversensitivity to noise. This paper has two main purposes. First, it provides a survey of existing algorithms used to reduce storage requirements in instance-based learning algorithms and other exemplar-based algorithms. Second, it proposes six additional reduction algorithms called DROP1–DROP5 and DEL (three of which were first described in Wilson & Martinez, 1997c, as RT1–RT3) that can be used to remove instances from the concept description. These algorithms and 10 algorithms from the survey are compared on 31 classification tasks. Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest average generalization accuracy in these experiments, especially in the presence of uniform class noise.

1,234 citations


Journal ArticleDOI
TL;DR: Among decision tree algorithms with univariate splits, C4.5, IND-CART, and QUEST have the best combinations of error rate and speed, but C 4.5 tends to produce trees with twice as many leaves as those fromIND-Cart and QUEST.
Abstract: Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. Classification accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based, algorithm called POLYCLSSS at the top, although it is not statistically significantly different from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is QUEST with linear splits, which ranks fourth and fifth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. POLYCLASS, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The QUEST and logistic regression algorithms are substantially faster. Among decision tree algorithms with univariate splits, C4.5, IND-CART, and QUEST have the best combinations of error rate and speed. But C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST.

1,201 citations


Journal ArticleDOI
TL;DR: MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees that is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction.
Abstract: MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, MultiBoosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution.

729 citations


Journal ArticleDOI
TL;DR: This paper examines the convergence of single-step on-policy RL algorithms for control with both decaying exploration and persistent exploration and provides examples of exploration strategies that result in convergence to both optimal values and optimal policies.
Abstract: An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

660 citations


Journal ArticleDOI
TL;DR: A multistrategy approach which combines these learners and yields performance competitive with or better than the best of them is described, which is modular and flexible, and could find application in other machine learning problems.
Abstract: We consider the problem of learning to perform information extraction in domains where linguistic processing is problematic, such as Usenet posts, email, and finger plan files. In place of syntactic and semantic information, other sources of information can be used, such as term frequency, typography, formatting, and mark-up. We describe four learning approaches to this problem, each drawn from a different paradigm: a rote learner, a term-space learner based on Naive Bayes, an approach using grammatical induction, and a relational rule learner. Experiments on 14 information extraction problems defined over four diverse document collections demonstrate the effectiveness of these approaches. Finally, we describe a multistrategy approach which combines these learners and yields performance competitive with or better than the best of them. This technique is modular and flexible, and could find application in other machine learning problems.

410 citations


Journal ArticleDOI
TL;DR: This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR, which can be justified by a variant of Bayes theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes.
Abstract: The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. A number of approaches have sought to alleviate this problem. A Bayesian tree learning algorithm builds a decision tree, and generates a local naive Bayesian classifier at each leaf. The tests leading to a leaf can alleviate attribute inter-dependencies for the local naive Bayesian classifier. However, Bayesian tree learning still suffers from the small disjunct problem of tree learning. While inferred Bayesian trees demonstrate low average prediction error rates, there is reason to believe that error rates will be higher for those leaves with few training examples. This paper proposes the application of lazy learning techniques to Bayesian tree induction and presents the resulting lazy Bayesian rule learning algorithm, called LBR. This algorithm can be justified by a variant of Bayes theorem which supports a weaker conditional attribute independence assumption than is required by naive Bayes. For each test example, it builds a most appropriate rule with a local naive Bayesian classifier as its consequent. It is demonstrated that the computational requirements of LBR are reasonable in a wide cross-section of natural domains. Experiments with these domains show that, on average, this new algorithm obtains lower error rates significantly more often than the reverse in comparison to a naive Bayesian classifier, C4.5, a Bayesian tree learning algorithm, a constructive Bayesian classifier that eliminates attributes and constructs new attributes using Cartesian products of existing nominal attributes, and a lazy decision tree learning algorithm. It also outperforms, although the result is not statistically significant, a selective naive Bayesian classifier.

262 citations


Journal ArticleDOI
TL;DR: Two methods of randomizing outputs are experimented with, one is called output smearing and the other output flipping, which are shown to consistently do better than bagging.
Abstract: Bagging and boosting reduce error by changing both the inputs and outputs to form perturbed training sets, growing predictors on these perturbed training sets and combining them. An interesting question is whether it is possible to get comparable performance by perturbing the outputs alone. Two methods of randomizing outputs are experimented with. One is called output smearing and the other output flipping. Both are shown to consistently do better than bagging.

261 citations


Journal ArticleDOI
TL;DR: A remarkable property of LEM is that it is capable of quantum leaps (“insight jumps”) of the fitness function, unlike Darwinian-type evolution that typically proceeds through numerous slight improvements.
Abstract: A new class of evolutionary computation processes is presented, called Learnable Evolution Model or LEM. In contrast to Darwinian-type evolution that relies on mutation, recombination, and selection operators, LEM employs machine learning to generate new populations. Specifically, in Machine Learning mode, a learning system seeks reasons why certain individuals in a population (or a collection of past populations) are superior to others in performing a designated class of tasks. These reasons, expressed as inductive hypotheses, are used to generate new populations. A remarkable property of LEM is that it is capable of quantum leaps (“insight jumps”) of the fitness function, unlike Darwinian-type evolution that typically proceeds through numerous slight improvements. In our early experimental studies, LEM significantly outperformed evolutionary computation methods used in the experiments, sometimes achieving speed-ups of two or more orders of magnitude in terms of the number of evolutionary steps. LEM has a potential for a wide range of applications, in particular, in such domains as complex optimization or search problems, engineering design, drug design, evolvable hardware, software engineering, economics, data mining, and automatic programming.

Journal ArticleDOI
TL;DR: This work analyzes the statistical properties of MCPs and shows how failure to adjust for these properties leads to the pathologies of induction algorithms, including attribute selection errors, overfitting, and oversearching.
Abstract: A single mechanism is responsible for three pathologies of induction algorithms: attribute selection errors, overfitting, and oversearching. In each pathology, induction algorithms compare multiple items based on scores from an evaluation function and select the item with the maximum score. We call this a multiple comparison procedure (MCP). We analyze the statistical properties of MCPs and show how failure to adjust for these properties leads to the pathologies. We also discuss approaches that can control pathological behavior, including Bonferroni adjustment, randomization testing, and cross-validation.

Journal ArticleDOI
TL;DR: A rigorous mathematical formalism is proposed for relevance, which is quantitative and normalized and applied to FSS, resulting in an improvement in prediction accuracy on 16 datasets, and a loss in accuracy on only 1 dataset.
Abstract: The notion of relevance is used in many technical fields. In the areas of machine learning and data mining, for example, relevance is frequently used as a measure in feature subset selection (FSS). In previous studies, the interpretation of relevance has varied and its connection to FSS has been loose. In this paper a rigorous mathematical formalism is proposed for relevance, which is quantitative and normalized. To apply the formalism in FSS, a characterization is proposed for FSS: preservation of learning information and minimization of joint entropy. Based on the characterization, a tight connection between relevance and FSS is established: maximizing the relevance of features to the decision attribute, and the relevance of the decision attribute to the features. This connection is then used to design an algorithm for FSS. The algorithm is linear in the number of instances and quadratic in the number of features. The algorithm is evaluated using 23 public datasets, resulting in an improvement in prediction accuracy on 16 datasets, and a loss in accuracy on only 1 dataset. This provides evidence that both the formalism and its connection to FSS are sound.

Journal ArticleDOI
TL;DR: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents that browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion.
Abstract: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents. These InfoSpiders browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion. Each agent adapts to the spatial and temporal regularities of its local context thanks to a combination of machine learning techniques inspired by ecological models: evolutionary adaptation with local selection, reinforcement learning and selective query expansion by internalization of environmental signals, and optional relevance feedback. We evaluate the feasibility and performance of these methods in three domains: a general class of artificial graph environments, a controlled subset of the Web, and (preliminarly) the full Web. Our results suggest that InfoSpiders could take advantage of the starting points provided by search engines, based on global word statistics, and then use linkage topology to guide their search on-line. We show how this approach can complement the current state of the art, especially with respect to the scalability challenge.

Journal ArticleDOI
TL;DR: This paper shows how to apply the naive Bayes methodology to numeric prediction tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weightedlinear regression, and a method that produces “model trees”—decision trees with linear regression functions at the leaves.
Abstract: Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates. This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces “model trees”—decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes applied to regression problems by discretizing the target value performs similarly badly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.

Journal ArticleDOI
TL;DR: Experimental results suggest that the partial memory learner notably reduced memory requirements at the slight expense of predictive accuracy, and tracked concept drift as well as other learners designed for this task.
Abstract: This paper describes a method for selecting training examples for a partial memory learning system. The method selects extreme examples that lie at the boundaries of concept descriptions and uses these examples with new training examples to induce new concept descriptions. Forgetting mechanisms also may be active to remove examples from partial memory that are irrelevant or outdated for the learning task. Using an implementation of the method, we conducted a lesion study and a direct comparison to examine the effects of partial memory learning on predictive accuracy and on the number of training examples maintained during learning. These experiments involved the STAGGER Concepts, a synthetic problem, and two real-world problems: a blasting cap detection problem and a computer intrusion detection problem. Experimental results suggest that the partial memory learner notably reduced memory requirements at the slight expense of predictive accuracy, and tracked concept drift as well as other learners designed for this task.

Journal ArticleDOI
TL;DR: TDLEAF(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with game-tree search, is presented and it is investigated whether it can yield better results in the domain of backgammon, where TD( ε) has previously yielded striking success.
Abstract: In this paper we present TDLEAF(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with game-tree search. We present some experiments in which our chess program “KnightCap” used TDLEAF(λ) to learn its evaluation function while playing on Internet chess servers. The main success we report is that KnightCap improved from a 1650 rating to a 2150 rating in just 308 games and 3 days of play. As a reference, a rating of 1650 corresponds to about level B human play (on a scale from E (1000) to A (1800)), while 2150 is human master level. We discuss some of the reasons for this success, principle among them being the use of on-line, rather than self-play. We also investigate whether TDLEAF(λ) can yield better results in the domain of backgammon, where TD(λ) has previously yielded striking success.

Journal ArticleDOI
TL;DR: A theorem bounding the generalization performance of convex combinations in terms of general cost functions of the margin is proved, in contrast to previous results, which were stated in Terms of the particular cost function sgn(θ − margin).
Abstract: Recent theoretical results have shown that the generalization performance of thresholded convex combinations of base classifiers is greatly improved if the underlying convex combination has large margins on the training data (i.e., correct examples are classified well away from the decision boundary). Neural network algorithms and AdaBoost have been shown to implicitly maximize margins, thus providing some theoretical justification for their remarkably good generalization performance. In this paper we are concerned with maximizing the margin explicitly. In particular, we prove a theorem bounding the generalization performance of convex combinations in terms of general cost functions of the margin, in contrast to previous results, which were stated in terms of the particular cost function sgn(t − margin). We then present a new algorithm, DOOM, for directly optimizing a piecewise-linear family of cost functions satisfying the conditions of the theorem. Experiments on several of the datasets in the UC Irvine database are presented in which AdaBoost was used to generate a set of base classifiers and then DOOM was used to find the optimal convex combination of those classifiers. In all but one case the convex combination generated by DOOM had lower test error than AdaBoost's combination. In many cases DOOM achieves these lower test errors by sacrificing training error, in the interests of reducing the new cost function. In our experiments the margin plots suggest that the size of the minimum margin is not the critical factor in determining generalization performance.

Journal ArticleDOI
TL;DR: This work considers the problem of one-step ahead prediction for time series generated by an underlying stationary stochastic process obeying the condition of absolute regularity, and derives nonparametric rates of convergence through an extension of the method of structural risk minimization suggested by Vapnik.
Abstract: We consider the problem of one-step ahead prediction for time series generated by an underlying stationary stochastic process obeying the condition of absolute regularity, describing the mixing nature of process. We make use of recent results from the theory of empirical processes, and adapt the uniform convergence framework of Vapnik and Chervonenkis to the problem of time series prediction, obtaining finite sample bounds. Furthermore, by allowing both the model complexity and memory size to be adaptively determined by the data, we derive nonparametric rates of convergence through an extension of the method of structural risk minimization suggested by Vapnik. All our results are derived for general L error measures, and apply to both exponentially and algebraically mixing processes.

Journal ArticleDOI
TL;DR: An experimental comparison of how these algorithms perform on a process migration problem, a problem that combines aspects of both the experts-tracking and MTS formalisms, is presented.
Abstract: The problem of combining expert advice, studied extensively in the Computational Learning Theory literature, and the Metrical Task System (MTS) problem, studied extensively in the area of On-line Algorithms, contain a number of interesting similarities. In this paper we explore the relationship between these problems and show how algorithms designed for each can be used to achieve good bounds and new approaches for solving the other. Specific contributions of this paper include: • An analysis of how two recent algorithms for the MTS problem can be applied to the problem of tracking the best expert in the “decision-theoretic” setting, providing good bounds and an approach of a much different flavor from the well-known multiplicative-update algorithms. • An analysis showing how the standard randomized Weighted Majority (or Hedge) algorithm can be used for the problem of “combining on-line algorithms on-line”, giving much stronger guarantees than the results of Azar, Y., Broder, A., & Manasse, M. (1993). Proc ACM-SIAM Symposium on Discrete Algorithms (pp. 432–440) when the algorithms being combined occupy a state space of bounded diameter. • A generalization of the above, showing how (a simplified version of) Herbster and Warmuth's weight-sharing algorithm can be applied to give a “finely competitive” bound for the uniform-space Metrical Task System problem. We also give a new, simpler algorithm for tracking experts, which unfortunately does not carry over to the MTS problem. Finally, we present an experimental comparison of how these algorithms perform on a process migration problem, a problem that combines aspects of both the experts-tracking and MTS formalisms.

Journal ArticleDOI
TL;DR: It is proved that other quantities can be as relevant to reduce their flexibility and combat overfitting to provide an upper bound on the generalization error which depends both on the size of the tree and on the margin of the decision nodes.
Abstract: Capacity control in perceptron decision trees is typically performed by controlling their size. We prove that other quantities can be as relevant to reduce their flexibility and combat overfitting. In particular, we provide an upper bound on the generalization error which depends both on the size of the tree and on the margin of the decision nodes. So enlarging the margin in perceptron decision trees will reduce the upper bound on generalization error. Based on this analysis, we introduce three new algorithms, which can induce large margin perceptron decision trees. To assess the effect of the large margin bias, OC1 (Journal of Artificial Intelligence Research, 1994, 2, 1–32.) of Murthy, Kasif and Salzberg, a well-known system for inducing perceptron decision trees, is used as the baseline algorithm. An extensive experimental study on real world data showed that all three new algorithms perform better or at least not significantly worse than OC1 on almost every dataset with only one exception. OC1 performed worse than the best margin-based method on every dataset.

Journal ArticleDOI
TL;DR: This paper presents an integration of induction and abduction in INTHELEX, a prototypical incremental learning system that has been run on a standard dataset about family trees as well as in the domain of document classification to prove the effectiveness of such multistrategy incrementallearning system with respect to a classical batch algorithm.
Abstract: This paper presents an integration of induction and abduction in INTHELEX, a prototypical incremental learning system. The refinement operators perform theory revision in a search space whose structure is induced by a quasi-ordering, derived from Plotkin's t-subsumption, compliant with the principle of Object Identity. A reduced complexity of the refinement is obtained, without a major loss in terms of expressiveness. These inductive operators have been proven ideal for this search space. Abduction supports the inductive operators in the completion of the incoming new observations. Experiments have been run on a standard dataset about family trees as well as in the domain of document classification to prove the effectiveness of such multistrategy incremental learning system with respect to a classical batch algorithm.

Journal ArticleDOI
TL;DR: A general convergence theorem is derived for RL algorithms when one uses only “approximations” of the initial data, which can be used for model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics, and based on FE or FD discretization methods.
Abstract: This paper proposes a study of Reinforcement Learning (RL) for continuous state-space and time control problems, based on the theoretical framework of viscosity solutions (VSs). We use the method of dynamic programming (DP) which introduces the value function (VF), expectation of the best future cumulative reinforcement. In the continuous case, the value function satisfies a non-linear first (or second) order (depending on the deterministic or stochastic aspect of the process) differential equation called the Hamilton-Jacobi-Bellman (HJB) equation. It is well known that there exists an infinity of generalized solutions (differentiable almost everywhere) to this equation, other than the VF. We show that gradient-descent methods may converge to one of these generalized solutions, thus failing to find the optimal control. In order to solve the HJB equation, we use the powerful framework of viscosity solutions and state that there exists a unique viscosity solution to the HJB equation, which is the value function. Then, we use another main result of VSs (their stability when passing to the limit) to prove the convergence of numerical approximations schemes based on finite difference (FD) and finite element (FE) methods. These methods discretize, at some resolution, the HJB equation into a DP equation of a Markov Decision Process (MDP), which can be solved by DP methods (thanks to a “strong” contraction property) if all the initial data (the state dynamics and the reinforcement function) were perfectly known. However, in the RL approach, as we consider a system in interaction with some a priori (at least partially) unknown environment, which learns “from experience”, the initial data are not perfectly known but have to be approximated during learning. The main contribution of this work is to derive a general convergence theorem for RL algorithms when one uses only “approximations” (in a sense of satisfying some “weak” contraction property) of the initial data. This result can be used for model-based or model-free RL algorithms, with off-line or on-line updating methods, for deterministic or stochastic state dynamics (though this latter case is not described here), and based on FE or FD discretization methods. It is illustrated with several RL algorithms and one numerical simulation for the “Car on the Hill” problem.

Journal ArticleDOI
TL;DR: Using a deterministic analysis in a general metric space setting, this paper provides a technique for constructing a successful prediction algorithm, given a successful estimation algorithm, for the prediction of changing concepts.
Abstract: This paper examines learning problems in which the target function is allowed to change. The learner sees a sequence of random examples, labelled according to a sequence of functions, and must provide an accurate estimate of the target function sequence. We consider a variety of restrictions on how the target function is allowed to change, including infrequent but arbitrary changes, sequences that correspond to slow walks on a graph whose nodes are functions, and changes that are small on average, as measured by the probability of disagreements between consecutive functions. We first study estimation, in which the learner sees a batch of examples and is then required to give an accurate estimate of the function sequence. Our results provide bounds on the sample complexity and allowable drift rate for these problems. We also study prediction, in which the learner must produce online a hypothesis after each labelled example and the average misclassification probability over this hypothesis sequence should be small. Using a deterministic analysis in a general metric space setting, we provide a technique for constructing a successful prediction algorithm, given a successful estimation algorithm. This leads to sample complexity and drift rate bounds for the prediction of changing concepts.

Journal ArticleDOI
TL;DR: An explanation of the early, linearly-decreasing behavior of the learning curves and the bounds as well as a study of the asymptotic behaviors of the curves are presented.
Abstract: In this paper we introduce and illustrate non-trivial upper and lower bounds on the learning curves for one-dimensional Guassian Processes. The analysis is carried out emphasising the effects induced on the bounds by the smoothness of the random process described by the Modified Bessel and the Squared Exponential covariance functions. We present an explanation of the early, linearly-decreasing behavior of the learning curves and the bounds as well as a study of the asymptotic behavior of the curves. The effects of the noise level and the lengthscale on the tightness of the bounds are also discussed.

Journal ArticleDOI
TL;DR: This paper proposes a method, based on a Monte Carlo algorithm, to estimate on-line the likelihood that the current matching problem will exceed a given amount of computational resources, and demonstrates that phase transitions also appear in real-world learning problems, and that learners tend to generate inductive hypotheses lying exactly on the phase transition.
Abstract: One of the major limitations of relational learning is due to the complexity of verifying hypotheses on examples. In this paper we investigate this task in light of recent published results, which show that many hard problems exhibit a narrow “phase transition” with respect to some order parameter, coupled with a large increase in computational complexity. First we show that matching a class of artificially generated Horn clauses on ground instances presents a typical phase transition in solvability with respect to both the number of literals in the clause and the number of constants occurring in the instance to match. Then, we demonstrate that phase transitions also appear in real-world learning problems, and that learners tend to generate inductive hypotheses lying exactly on the phase transition. On the other hand, an extensive experimenting revealed that not every matching problem inside the phase transition region is intractable. However, unfortunately, identifying those that are feasible cannot be done solely on the basis of the order parameters. To face this problem, we propose a method, based on a Monte Carlo algorithm, to estimate on-line the likelihood that the current matching problem will exceed a given amount of computational resources. The impact of the above findings on relational learning is discussed.

Journal ArticleDOI
TL;DR: The representation formalism based on feature terms and the corresponding notions of subsumption and anti-unification are introduced and the results of INDIE are shown, showing that it's able to find hypotheses at a level comparable to the original ones.
Abstract: The aim of relational learning is to develop methods for the induction of hypotheses in representation formalisms that are more expressive than attribute-value representation. Most work on relational learning has been focused on induction in subsets of first order logic like Horn clauses. In this paper we introduce the representation formalism based on feature terms and we introduce the corresponding notions of subsumption and anti-unification. Then we explain INDIE, a heuristic bottom-up learning method that induces class hypotheses, in the form of feature terms, from positive and negative examples. The biases used in INDIE while searching the hypothesis space are explained while describing INDIE's algorithms. The representational bias of INDIE can be summarised in that it makes an intensive use of sorts and sort hierarchy, and in that it does not use negation but focuses on detecting path equalities. We show the results of INDIE in some classical relational datasets showing that it's able to find hypotheses at a level comparable to the original ones. The differences between INDIE's hypotheses and those of the other systems are explained by the bias in searching the hypothesis space and on the representational bias of the hypothesis language of each system.

Journal ArticleDOI
TL;DR: An approach based on stochastic grammatical inference is described which scales more naturally to large data sets and produces models with richer semantics and modifications that enable better interactive control of results are described.
Abstract: For a document collection in which structural elements are identified with markup, it is often necessary to construct a grammar retrospectively that constrains element nesting and ordering. This has been addressed by others as an application of grammatical inference. We describe an approach based on stochastic grammatical inference which scales more naturally to large data sets and produces models with richer semantics. We adopt an algorithm that produces stochastic finite automata and describe modifications that enable better interactive control of results. Our experimental evaluation uses four document collections with varying structure.

Journal ArticleDOI
TL;DR: An any-time relational inference algorithm that proceeds by stochastically sampling the inference search space, after this space has been judiciously restricted using strongly-typed logic-like declarations is presented.
Abstract: One of the obstacles to widely using first-order logic languages is the fact that relational inference is intractable in the worst case. This paper presents an any-time relational inference algorithm: it proceeds by stochastically sampling the inference search space, after this space has been judiciously restricted using strongly-typed logic-like declarations. We present a relational learner producing programs geared to stochastic inference, named STILL, to enforce the potentialities of this framework. STILL handles examples described as definite or constrained clauses, and uses sampling-based heuristics again to achieve any-time learning. Controlling both the construction and the exploitation of logic programs yields robust relational reasoning, where deductive biases are compensated for by inductive biases, and vice versa.

Journal ArticleDOI
TL;DR: This work describes a tagger which is able to use information of any kind, and in particular to incorporate the machine-learned decision trees, and addresses the problem of tagging when only limited training material is available, which is crucial in any process of constructing, from scratch, an annotated corpus.
Abstract: We have applied the inductive learning of statistical decision trees and relaxation labeling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part Of Speech Tagging). The learning process is supervised and obtains a language model oriented to resolve POS ambiguities, consisting of a set of statistical decision trees expressing distribution of tags and words in some relevant contexts. The acquired decision trees have been directly used in a tagger that is both relatively simple and fast, and which has been tested and evaluated on the Wall Street Journal (WSJ) corpus with competitive accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation labeling based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only limited training material is available, which is crucial in any process of constructing, from scratch, an annotated corpus. We show that high levels of accuracy can be achieved with our system in this situation, and report some results obtained when using it to develop a 5.5 million words Spanish corpus from scratch.