
Showing papers in "Journal of Artificial Intelligence Research in 1994"


Journal Article
TL;DR: In this article, error-correcting output codes are employed as a distributed output representation to improve the performance of decision-tree algorithms such as C4.5 and CART on multiclass learning problems.
Abstract: Multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k > 2 values (i.e., k "classes"). The definition is acquired by studying collections of training examples of the form (xᵢ, f(xᵢ)). Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decision-tree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which error-correcting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of overfitting avoidance techniques such as decision-tree pruning. Finally, we show that--like the other methods--the error-correcting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems.
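As a rough illustration of the error-correcting output code idea (not the authors' implementation; the code matrix, synthetic data, and scikit-learn base learner are assumptions), the sketch below trains one binary decision tree per code bit and decodes a test example by Hamming distance to the class codewords:

```python
# Illustrative sketch of error-correcting output codes (ECOC) for multiclass
# classification; the code matrix and base learner are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ecoc_fit(X, y, code_matrix):
    """Train one binary classifier per column (code bit) of the code matrix."""
    learners = []
    for bit in range(code_matrix.shape[1]):
        # Relabel each example with the code bit of its class.
        binary_targets = code_matrix[y, bit]
        learners.append(DecisionTreeClassifier().fit(X, binary_targets))
    return learners

def ecoc_predict(X, learners, code_matrix):
    """Predict each bit, then decode to the class with the closest codeword."""
    bits = np.column_stack([clf.predict(X) for clf in learners])
    # Hamming distance from the predicted bit string to every class codeword.
    distances = np.abs(bits[:, None, :] - code_matrix[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int)  # 4 classes
    # A 7-bit code for 4 classes; rows are class codewords.
    code_matrix = np.array([[0, 0, 0, 0, 1, 1, 1],
                            [0, 1, 1, 1, 0, 0, 1],
                            [1, 0, 1, 1, 0, 1, 0],
                            [1, 1, 0, 1, 1, 0, 0]])
    learners = ecoc_fit(X, y, code_matrix)
    print("training accuracy:", (ecoc_predict(X, learners, code_matrix) == y).mean())
```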

2,542 citations


Journal Article
TL;DR: In this paper, a new system for induction of oblique decision trees, OC1, combines deterministic hill-climbing with two forms of randomization to find a good oblique split (in the form of a hyperplane) at each node of a decision tree.
Abstract: This article describes a new system for induction of oblique decision trees. This system, OC1, combines deterministic hill-climbing with two forms of randomization to find a good oblique split (in the form of a hyperplane) at each node of a decision tree. Oblique decision tree methods are tuned especially for domains in which the attributes are numeric, although they can be adapted to symbolic or mixed symbolic/numeric attributes. We present extensive empirical studies, using both real and artificial data, that analyze OC1's ability to construct oblique trees that are smaller and more accurate than their axis-parallel counterparts. We also examine the benefits of randomization for the construction of oblique decision trees.
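A minimal sketch of the kind of search performed at a single tree node might look like the following; the Gini impurity measure and the random-perturbation hill-climbing used here are simplifying assumptions rather than OC1's exact coefficient-perturbation scheme:

```python
# Sketch of searching for an oblique split (hyperplane) at a tree node, in the
# spirit of hill-climbing with random restarts; impurity measure and
# perturbation scheme are simplifying assumptions.
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(X, y, w):
    """Weighted impurity of the two sides of the hyperplane w . [x, 1] = 0."""
    side = (X @ w[:-1] + w[-1]) > 0
    n = len(y)
    return (side.sum() * gini(y[side]) + (~side).sum() * gini(y[~side])) / n

def oblique_split(X, y, restarts=10, steps=200, seed=0):
    """Hill-climb hyperplane coefficients from several random starting points."""
    rng = np.random.default_rng(seed)
    best_w, best_cost = None, np.inf
    for _ in range(restarts):
        w = rng.normal(size=X.shape[1] + 1)
        cost = split_impurity(X, y, w)
        for _ in range(steps):
            candidate = w + rng.normal(scale=0.1, size=w.shape)  # random perturbation
            c = split_impurity(X, y, candidate)
            if c < cost:                 # keep only improving moves
                w, cost = candidate, c
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w, best_cost

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # the class boundary is oblique
    w, cost = oblique_split(X, y)
    print("impurity of best oblique split:", round(cost, 3))
```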

845 citations


Journal Article
TL;DR: In this article, a multidisciplinary review of empirical, statistical learning from a graphical model perspective is presented, including decomposition, differentiation, and manipulation of probability models from the exponential family.
Abstract: This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Well-known examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models are extended to model data analysis and empirical learning using the notation of plates. Graphical operations for simplifying and manipulating a problem are provided including decomposition, differentiation, and the manipulation of probability models from the exponential family. Two standard algorithm schemas for learning are reviewed in a graphical framework: Gibbs sampling and the expectation maximization algorithm. Using these operations and schemas, some popular algorithms can be synthesized from their graphical specification. This includes versions of linear regression, techniques for feed-forward networks, and learning Gaussian and discrete Bayesian networks from data. The paper concludes by sketching some implications for data analysis and summarizing how some popular algorithms fall within the framework presented. The main original contributions here are the decomposition techniques and the demonstration that graphical models provide a framework for understanding and developing complex learning algorithms.
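To make the Gibbs-sampling schema concrete, here is a minimal sketch for a three-node binary chain A -> B -> C with C observed; the conditional probability tables are invented for illustration and are not taken from the paper:

```python
# Minimal Gibbs-sampling sketch for a binary chain A -> B -> C with C observed;
# the conditional probability tables are illustrative assumptions.
import random

P_A = 0.3                                # P(A=1)
P_B_GIVEN_A = {0: 0.2, 1: 0.9}           # P(B=1 | A)
P_C_GIVEN_B = {0: 0.1, 1: 0.8}           # P(C=1 | B)

def bernoulli_from_weights(w0, w1, rng):
    """Sample 0 or 1 with probabilities proportional to w0 and w1."""
    return 1 if rng.random() < w1 / (w0 + w1) else 0

def gibbs_posterior_a_given_c1(iterations=100_000, burn_in=1_000, seed=0):
    """Estimate P(A=1 | C=1) by alternately resampling A and B with C clamped to 1."""
    rng = random.Random(seed)
    a, b = 0, 0
    count_a1 = 0
    for t in range(iterations):
        # Resample A from P(A | B=b): proportional to P(A) * P(b | A).
        a = bernoulli_from_weights(
            (1 - P_A) * (P_B_GIVEN_A[0] if b else 1 - P_B_GIVEN_A[0]),
            P_A * (P_B_GIVEN_A[1] if b else 1 - P_B_GIVEN_A[1]), rng)
        # Resample B from P(B | A=a, C=1): proportional to P(B | a) * P(C=1 | B).
        b = bernoulli_from_weights(
            (1 - P_B_GIVEN_A[a]) * P_C_GIVEN_B[0],
            P_B_GIVEN_A[a] * P_C_GIVEN_B[1], rng)
        if t >= burn_in:
            count_a1 += a
    return count_a1 / (iterations - burn_in)

if __name__ == "__main__":
    print("estimated P(A=1 | C=1):", round(gibbs_posterior_a_given_c1(), 3))
```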

617 citations


Journal Article
TL;DR: This paper introduces ICET, a new algorithm for cost-sensitive classification that uses a genetic algorithm to evolve a population of biases for a decision tree induction algorithm and establishes that ICET performs significantly better than its competitors.
Abstract: This paper introduces ICET, a new algorithm for cost-sensitive classification. ICET uses a genetic algorithm to evolve a population of biases for a decision tree induction algorithm. The fitness function of the genetic algorithm is the average cost of classification when using the decision tree, including both the costs of tests (features, measurements) and the costs of classification errors. ICET is compared here with three other algorithms for cost-sensitive classification -- EG2, CS-ID3, and IDX -- and also with C4.5, which classifies without regard to cost. The five algorithms are evaluated empirically on five real-world medical datasets. Three sets of experiments are performed. The first set examines the baseline performance of the five algorithms on the five datasets and establishes that ICET performs significantly better than its competitors. The second set tests the robustness of ICET under a variety of conditions and shows that ICET maintains its advantage. The third set looks at ICET's search in bias space and discovers a way to improve the search.
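The fitness signal ICET optimizes can be illustrated with a small sketch like the one below, which charges each case the cost of the tests along its path through the tree plus a penalty for misclassification; the tree encoding, cost values, and data are assumptions, not the paper's setup:

```python
# Sketch of the kind of fitness signal a cost-sensitive learner optimizes:
# the average cost of using a decision tree, counting test costs and error
# costs. Tree encoding, costs, and cases are illustrative assumptions.

TEST_COSTS = {"blood_test": 7.0, "x_ray": 25.0}   # cost of measuring each feature
ERROR_COST = 100.0                                # cost of a misclassification

# A tiny decision tree: internal nodes test a feature, leaves predict a class.
TREE = ("blood_test", 0.5,
        ("x_ray", 0.5, ("leaf", 0), ("leaf", 1)),
        ("leaf", 1))

def classify_with_cost(tree, case):
    """Follow the tree for one case, accumulating the cost of each test used."""
    cost = 0.0
    while tree[0] != "leaf":
        feature, threshold, left, right = tree
        cost += TEST_COSTS[feature]
        tree = left if case[feature] <= threshold else right
    return tree[1], cost

def average_cost(tree, cases):
    """Average cost per case: test costs plus error cost for wrong predictions."""
    total = 0.0
    for features, true_class in cases:
        predicted, cost = classify_with_cost(tree, features)
        total += cost + (ERROR_COST if predicted != true_class else 0.0)
    return total / len(cases)

if __name__ == "__main__":
    cases = [({"blood_test": 0.2, "x_ray": 0.9}, 1),
             ({"blood_test": 0.2, "x_ray": 0.1}, 0),
             ({"blood_test": 0.8, "x_ray": 0.3}, 1)]
    print("average cost of classification:", round(average_cost(TREE, cases), 2))
```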

534 citations


Journal Article
TL;DR: In this paper, the authors study the process of multi-agent reinforcement learning in the context of load balancing in a distributed system, without use of either central coordination or explicit communication.
Abstract: We study the process of multi-agent reinforcement learning in the context of load balancing in a distributed system, without use of either central coordination or explicit communication. We first define a precise framework in which to study adaptive load balancing, important features of which are its stochastic nature and the purely local information available to individual agents. Given this framework, we show illuminating results on the interplay between basic adaptive behavior parameters and their effect on system efficiency. We then investigate the properties of adaptive load balancing in heterogeneous populations, and address the issue of exploration vs. exploitation in that context. Finally, we show that naive use of communication may not improve, and might even harm system efficiency.
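A toy sketch of adaptive load balancing with purely local information, in the spirit of the framework described above, is given below; the reward model, epsilon-greedy exploration, and parameter values are illustrative assumptions rather than the paper's setup:

```python
# Toy sketch of adaptive load balancing with purely local information: each
# agent picks a resource using its own action-value estimates and updates them
# from the service rate it experiences; no central coordination, no messages.
import random

NUM_AGENTS, NUM_RESOURCES, ROUNDS = 50, 5, 2000
ALPHA, EPSILON = 0.1, 0.1   # learning rate and exploration probability

def simulate(seed=0):
    rng = random.Random(seed)
    # Each agent keeps its own estimate of each resource's value.
    values = [[0.0] * NUM_RESOURCES for _ in range(NUM_AGENTS)]
    load = [0] * NUM_RESOURCES
    for _ in range(ROUNDS):
        # Every agent independently chooses a resource (epsilon-greedy).
        choices = []
        for v in values:
            if rng.random() < EPSILON:
                choices.append(rng.randrange(NUM_RESOURCES))
            else:
                choices.append(max(range(NUM_RESOURCES), key=lambda r: v[r]))
        # A resource's efficiency drops with the number of agents using it.
        load = [choices.count(r) for r in range(NUM_RESOURCES)]
        for agent, r in enumerate(choices):
            reward = 1.0 / load[r]   # only local feedback: the agent's own service rate
            values[agent][r] += ALPHA * (reward - values[agent][r])
    return load

if __name__ == "__main__":
    print("final load per resource:", simulate())
```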

201 citations


Journal Article
TL;DR: It is shown how to construct agents with bounded optimality for a simple class of machine architectures in a broad class of real-time environments, and bounded optimality is related to similar trends in philosophy, economics, and game theory.
Abstract: Since its inception, artificial intelligence has relied upon a theoretical foundation centred around perfect rationality as the desired property of intelligent systems. We argue, as others have done, that this foundation is inadequate because it imposes fundamentally unsatisfiable requirements. As a result, there has arisen a wide gap between theory and practice in AI, hindering progress in the field. We propose instead a property called bounded optimality. Roughly speaking, an agent is bounded-optimal if its program is a solution to the constrained optimization problem presented by its architecture and the task environment. We show how to construct agents with this property for a simple class of machine architectures in a broad class of real-time environments. We illustrate these results using a simple model of an automated mail sorting facility. We also define a weaker property, asymptotic bounded optimality (ABO), that generalizes the notion of optimality in classical complexity theory. We then construct universal ABO programs, i.e., programs that are ABO no matter what real-time constraints are applied. Universal ABO programs can be used as building blocks for more complex systems. We conclude with a discussion of the prospects for bounded optimality as a theoretical basis for AI, and relate it to similar trends in philosophy, economics, and game theory.

180 citations


Journal Article
TL;DR: The algorithm's completeness ensures that the adaptation algorithm will eventually search the entire graph and its systematicity ensures that it will do so without redundantly searching any parts of the graph.
Abstract: The paradigms of transformational planning, case-based planning, and plan debugging all involve a process known as plan adaptation -- modifying or repairing an old plan so it solves a new problem. In this paper we provide a domain-independent algorithm for plan adaptation, demonstrate that it is sound, complete, and systematic, and compare it to other adaptation algorithms in the literature. Our approach is based on a view of planning as searching a graph of partial plans. Generative planning starts at the graph's root and moves from node to node using plan-refinement operators. In planning by adaptation, a library plan--an arbitrary node in the plan graph--is the starting point for the search, and the plan-adaptation algorithm can apply both the same refinement operators available to a generative planner and can also retract constraints and steps from the plan. Our algorithm's completeness ensures that the adaptation algorithm will eventually search the entire graph and its systematicity ensures that it will do so without redundantly searching any parts of the graph.

131 citations


Journal Article
TL;DR: Although the random-worlds method makes sense in general, the connection to maximum entropy seems to disappear in the non-unary case, and these observations suggest unexpected limitations to the applicability of maximum-entropy methods.
Abstract: Given a knowledge base KB containing first-order and statistical facts, we consider a principled method, called the random-worlds method, for computing a degree of belief that some formula Φ holds given KB. If we are reasoning about a world or system consisting of N individuals, then we can consider all possible worlds, or first-order models, with domain {1,..., N} that satisfy KB, and compute the fraction of them in which Φ is true. We define the degree of belief to be the asymptotic value of this fraction as N grows large. We show that when the vocabulary underlying Φ and KB uses constants and unary predicates only, we can naturally associate an entropy with each world. As N grows larger, there are many more worlds with higher entropy. Therefore, we can use a maximum-entropy computation to compute the degree of belief. This result is in a similar spirit to previous work in physics and artificial intelligence, but is far more general. Of equal interest to the result itself are the limitations on its scope. Most importantly, the restriction to unary predicates seems necessary. Although the random-worlds method makes sense in general, the connection to maximum entropy seems to disappear in the non-unary case. These observations suggest unexpected limitations to the applicability of maximum-entropy methods.
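For a very small domain the random-worlds computation can be carried out by brute force, as in the sketch below; the knowledge base and query are made-up examples, and the actual method takes the limit as N grows large:

```python
# Brute-force illustration of the random-worlds idea for a tiny domain: with
# one unary predicate P over N individuals, enumerate every world, keep those
# satisfying the knowledge base, and take the fraction in which the query
# holds. The KB and query are invented examples, not from the paper.
from itertools import product

N = 10

def satisfies_kb(world):
    """KB: exactly 3 of the 10 individuals are P, and individual 0 or 1 is P."""
    return sum(world) == 3 and (world[0] or world[1])

def query(world):
    """Query: individual 0 is P."""
    return world[0]

worlds = [w for w in product([0, 1], repeat=N) if satisfies_kb(w)]
degree_of_belief = sum(query(w) for w in worlds) / len(worlds)
print(f"{len(worlds)} worlds satisfy KB; degree of belief in query = {degree_of_belief:.3f}")
```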

99 citations


Journal Article
TL;DR: Learning the past tense of English verbs, a seemingly minor aspect of language acquisition, has generated heated debates since 1986 and has become a landmark task for testing the adequacy of cog... as discussed by the authors.
Abstract: Learning the past tense of English verbs - a seemingly minor aspect of language acquisition - has generated heated debates since 1986, and has become a landmark task for testing the adequacy of cog...

70 citations


Journal Article
TL;DR: It is shown that some subtle assumptions underlie the widespread intuitions regarding the supposed efficiency of partial-order planning, whose superiority can depend critically upon the search strategy and the structure of the search space.
Abstract: For many years, the intuitions underlying partial-order planning were largely taken for granted. Only in the past few years has there been renewed interest in the fundamental principles underlying this paradigm. In this paper, we present a rigorous comparative analysis of partial-order and total-order planning by focusing on two specific planners that can be directly compared. We show that there are some subtle assumptions that underlie the widespread intuitions regarding the supposed efficiency of partial-order planning. For instance, the superiority of partial-order planning can depend critically upon the search strategy and the structure of the search space. Understanding the underlying assumptions is crucial for constructing efficient planners.

62 citations


Journal Article
William W. Cohen
TL;DR: In this article, it is shown that a single k-ary recursive constant-depth determinate clause is learnable, and that two-clause programs consisting of one learnable recursive clause and one non-recursive clause are also learnable.
Abstract: We present algorithms that learn certain classes of function-free recursive logic programs in polynomial time from equivalence queries. In particular, we show that a single k-ary recursive constant-depth determinate clause is learnable. Two-clause programs consisting of one learnable recursive clause and one constant-depth determinate nonrecursive clause are also learnable, if an additional "basecase" oracle is assumed. These results immediately imply the pac-learnability of these classes. Although these classes of learnable recursive programs are very constrained, it is shown in a companion paper that they are maximally general, in that generalizing either class in any natural way leads to a computationally difficult learning problem. Thus, taken together with its companion paper, this paper establishes a boundary of efficient learnability for recursive logic programs.

Journal Article
William W. Cohen
TL;DR: Building on a companion paper, Valiant's model of pac-learnability is used to show that any natural generalization of the class of constant-depth determinate k-ary recursive clauses is hard to learn.
Abstract: In a companion paper it was shown that the class of constant-depth determinate k-ary recursive clauses is efficiently learnable. In this paper we present negative results showing that any natural generalization of this class is hard to learn in Valiant's model of pac-learnability. In particular, we show that the following program classes are cryptographically hard to learn: programs with an unbounded number of constant-depth linear recursive clauses; programs with one constant-depth determinate clause containing an unbounded number of recursive calls; and programs with one linear recursive clause of constant locality. These results immediately imply the non-learnability of any more general class of programs. We also show that learning a constant-depth determinate program with either two linear recursive clauses or one linear recursive clause and one non-recursive clause is as hard as learning boolean DNF. Together with positive results from the companion paper, these negative results establish a boundary of efficient learnability for recursive function-free clauses.

Journal Article
TL;DR: Examination of the issues of efficient and general implementation of TD(λ) for arbitrary λ, for use with reinforcement learning algorithms optimizing the discounted sum of rewards, suggests that using λ > 0 with the TTD procedure allows one to obtain a significant learning speedup at essentially the same cost as standard TD(0) learning.
Abstract: Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-step prediction problems, parameterized by a recency factor λ. Currently the most important application of these methods is to temporal credit assignment in reinforcement learning. Well known reinforcement learning algorithms, such as AHC or Q-learning, may be viewed as instances of TD learning. This paper examines the issues of the efficient and general implementation of TD(λ) for arbitrary λ, for use with reinforcement learning algorithms optimizing the discounted sum of rewards. The traditional approach, based on eligibility traces, is argued to suffer from both inefficiency and lack of generality. The TTD (Truncated Temporal Differences) procedure is proposed as an alternative, that indeed only approximates TD(λ), but requires very little computation per action and can be used with arbitrary function representation methods. The idea from which it is derived is fairly simple and not new, but probably unexplored so far. Encouraging experimental results are presented, suggesting that using λ > 0 with the TTD procedure allows one to obtain a significant learning speedup at essentially the same cost as usual TD(0) learning.
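A simplified sketch of the truncated temporal-difference idea is shown below: instead of eligibility traces, only the last m transitions are kept, and the oldest stored state is updated toward a lambda-return truncated to that window. The tabular random-walk task, parameter values, and update schedule are illustrative assumptions, not the paper's TTD procedure:

```python
# Simplified sketch of truncated temporal differences on a tabular random walk;
# parameters and update schedule are illustrative assumptions.
import random

GAMMA, LAMBDA, ALPHA, M = 1.0, 0.8, 0.1, 5
N_STATES = 7                       # random walk over states 0..6, both ends terminal
V = [0.0] * N_STATES               # value estimates; terminal states stay at 0

def truncated_lambda_return(window):
    """Backward recursion z = r + gamma * ((1 - lambda) * V(s') + lambda * z)."""
    z = V[window[-1][2]]           # bootstrap from the value of the final next-state
    for _, reward, next_state in reversed(window):
        z = reward + GAMMA * ((1 - LAMBDA) * V[next_state] + LAMBDA * z)
    return z

def update_oldest(window):
    """Move the value of the oldest state in the window toward its truncated return."""
    oldest = window[0][0]
    V[oldest] += ALPHA * (truncated_lambda_return(window) - V[oldest])

def run_episode(rng):
    window, s = [], N_STATES // 2
    while 0 < s < N_STATES - 1:
        s_next = s + rng.choice([-1, 1])
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        window.append((s, reward, s_next))
        if len(window) == M:       # the oldest state is now m steps old: update and drop it
            update_oldest(window)
            window.pop(0)
        s = s_next
    while window:                  # episode over: flush remaining states with shorter returns
        update_oldest(window)
        window.pop(0)

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3000):
        run_episode(rng)
    print("estimated state values:", [round(v, 2) for v in V[1:-1]])
```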

Journal Article
TL;DR: Wrap-Up is a trainable IE discourse component that makes intersentential inferences and identifies logical relations among information extracted from the text; it automatically decides what classifiers are needed and derives the feature set for each classifier.
Abstract: The vast amounts of on-line text now available have led to renewed interest in information extraction (IE) systems that analyze unrestricted text, producing a structured representation of selected information from the text. This paper presents a novel approach that uses machine learning to acquire knowledge for some of the higher level IE processing. Wrap-Up is a trainable IE discourse component that makes intersentential inferences and identifies logical relations among information extracted from the text. Previous corpus-based approaches were limited to lower level processing such as part-of-speech tagging, lexical disambiguation, and dictionary construction. Wrap-Up is fully trainable, and not only automatically decides what classifiers are needed, but even derives the feature set for each classifier automatically. Performance equals that of a partially trainable discourse module requiring manual customization for each domain.

Journal Article
TL;DR: Theory revision integrates inductive learning and background knowledge by combining training examples with a coarse domain theory to produce a more accurate theory; the behavior, limitations, and potential of theory-guided constructive induction are studied.
Abstract: Theory revision integrates inductive learning and background knowledge by combining training examples with a coarse domain theory to produce a more accurate theory. There are two challenges that theory revision and other theory-guided systems face. First, a representation language appropriate for the initial theory may be inappropriate for an improved theory. While the original representation may concisely express the initial theory, a more accurate theory forced to use that same representation may be bulky, cumbersome, and difficult to reach. Second, a theory structure suitable for a coarse domain theory may be insufficient for a fine-tuned theory. Systems that produce only small, local changes to a theory have limited value for accomplishing complex structural alterations that may be required. Consequently, advanced theory-guided learning systems require flexible representation and flexible structure. An analysis of various theory revision systems and theory-guided learning systems reveals specific strengths and weaknesses in terms of these two desired properties. Designed to capture the underlying qualities of each system, a new system uses theory-guided constructive induction. Experiments in three domains show improvement over previous theory-guided systems. This leads to a study of the behavior, limitations, and potential of theory-guided constructive induction.

Journal Article
TL;DR: This paper proposes a new decomposition method benefiting from both semantic properties of functional constraints (not bijective constraints) and structural properties of the network and shows that under some conditions, the existence of solutions can be guaranteed.
Abstract: Many studies have been carried out in order to increase the search efficiency of constraint satisfaction problems; among them, some make use of structural properties of the constraint network; others take into account semantic properties of the constraints, generally assuming that all the constraints possess the given property. In this paper, we propose a new decomposition method benefiting from both semantic properties of functional constraints (not bijective constraints) and structural properties of the network; furthermore, not all the constraints need to be functional. We show that under some conditions, the existence of solutions can be guaranteed. We first characterize a particular subset of the variables, which we name a root set. We then introduce pivot consistency, a new local consistency which is a weak form of path consistency and can be achieved in O(n²d²) complexity (instead of O(n³d³) for path consistency), and we present associated properties; in particular, we show that any consistent instantiation of the root set can be linearly extended to a solution, which leads to the presentation of the aforementioned new method for solving by decomposing functional CSPs.
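The benefit of functional constraints can be seen in a toy sketch like the following: once a root set is assigned, constraints of the form y = f(x) force the remaining variables by simple propagation. The variables, functions, and root choice are invented for illustration, and the paper's pivot-consistency machinery is not reproduced:

```python
# Toy illustration of extending a root-set assignment through functional
# constraints y = f(x); variables and functions are invented, and every
# variable is assumed reachable from the root set.

# Functional constraints: child variable -> (parent variable, function).
FUNCTIONAL = {
    "y": ("x", lambda v: v + 1),
    "z": ("y", lambda v: 2 * v),
    "w": ("x", lambda v: v * v),
}

def extend_root_assignment(root_assignment):
    """Linearly extend an assignment of the root variables to all variables."""
    assignment = dict(root_assignment)
    pending = dict(FUNCTIONAL)
    while pending:
        progressed = False
        for var, (parent, fn) in list(pending.items()):
            if parent in assignment:          # parent known: the child's value is forced
                assignment[var] = fn(assignment[parent])
                del pending[var]
                progressed = True
        if not progressed:
            raise ValueError("some variable is not reachable from the root set")
    return assignment

if __name__ == "__main__":
    print(extend_root_assignment({"x": 3}))   # {'x': 3, 'y': 4, 'z': 8, 'w': 9}
```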

Journal Article
TL;DR: A Japanese information extraction system that merges information using a pattern matcher and discourse processor is reported on, with performance that approaches human performance.
Abstract: Information extraction is the task of automatically picking up information of interest from an unconstrained text. Information of interest is usually extracted in two steps. First, sentence level processing locates relevant pieces of information scattered throughout the text; second, discourse processing merges coreferential information to generate the output. In the first step, pieces of information are locally identified without recognizing any relationships among them. A key word search or simple pattern search can achieve this purpose. The second step requires deeper knowledge in order to understand relationships among separately identified pieces of information. Previous information extraction systems focused on the first step, partly because they were not required to link up each piece of information with other pieces. To link the extracted pieces of information and map them onto a structured output format, complex discourse processing is essential. This paper reports on a Japanese information extraction system that merges information using a pattern matcher and discourse processor. Evaluation results show a high level of system performance which approaches human performance.
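The two-step extract-then-merge pipeline described above can be illustrated with a small sketch; the regular-expression patterns, example text, and output schema are invented, and the sketch is language-agnostic rather than Japanese-specific:

```python
# Two-step illustration of the extract-then-merge pipeline: simple patterns
# pull out company mentions sentence by sentence, then records referring to
# the same company are merged. Patterns, text, and schema are invented.
import re

TEXT = (
    "Acme Corp announced a new widget. "
    "The widget will ship in March, Acme Corp said. "
    "Globex unveiled a gadget."
)

# Step 1: sentence-level pattern matching locates local pieces of information.
PATTERN = re.compile(r"(?P<company>[A-Z]\w+(?: Corp)?) (?:announced|unveiled|said)")

def extract(text):
    records = []
    for sentence in text.split(". "):
        for match in PATTERN.finditer(sentence):
            records.append({"company": match.group("company"), "sentence": sentence})
    return records

# Step 2: discourse-level merging links pieces that refer to the same entity.
def merge(records):
    merged = {}
    for rec in records:
        merged.setdefault(rec["company"], []).append(rec["sentence"])
    return merged

if __name__ == "__main__":
    for company, sentences in merge(extract(TEXT)).items():
        print(company, "->", sentences)
```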

Journal Article
TL;DR: This paper introduces a framework for Planning while Learning where an agent is given a goal to achieve in an environment whose behavior is only partially known to the agent, and shows that, in most natural cases, the verification part can be carried out in an efficient algorithmic manner.
Abstract: This paper introduces a framework for Planning while Learning where an agent is given a goal to achieve in an environment whose behavior is only partially known to the agent. We discuss the tractability of various plan-design processes. We show that for a large natural class of Planning while Learning systems, a plan can be presented and verified in a reasonable time. However, coming up algorithmically with a plan, even for simple classes of systems, is apparently intractable. We emphasize the role of off-line plan-design processes, and show that, in most natural cases, the verification (projection) part can be carried out in an efficient algorithmic manner.

Journal Article
TL;DR: In this article, a simple change and reinterpretation of the DNA promoter sequences domain theory in terms of M-of-N concepts, involving no learning, is reported, resulting in an accuracy of 93.4% on the 106 items of the database.
Abstract: The DNA promoter sequences domain theory and database have become popular for testing systems that integrate empirical and analytical learning. This note reports a simple change and reinterpretation of the domain theory in terms of M-of-N concepts, involving no learning, that results in an accuracy of 93.4% on the 106 items of the database. Moreover, an exhaustive search of the space of M-of-N domain theory interpretations indicates that the expected accuracy of a randomly chosen interpretation is 76.5%, and that a maximum accuracy of 97.2% is achieved in 12 cases. This demonstrates the informativeness of the domain theory, without the complications of understanding the interactions between various learning algorithms and the theory. In addition, our results help characterize the difficulty of learning using the DNA promoters theory.
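An M-of-N concept is satisfied when at least M of its N listed conditions hold; a minimal sketch of evaluating such a rule is shown below, with made-up site conditions rather than the actual promoter domain theory:

```python
# Minimal sketch of evaluating an M-of-N concept: the concept fires when at
# least M of its N conditions hold. The conditions and sequence are invented
# for illustration, not taken from the DNA promoter domain theory.

def m_of_n(m, conditions, example):
    """True if at least m of the given condition functions hold for the example."""
    return sum(1 for cond in conditions if cond(example)) >= m

# Hypothetical conditions on a nucleotide string (positions and bases invented).
conditions = [
    lambda seq: seq[3] == "t",
    lambda seq: seq[4] == "a",
    lambda seq: seq[7] == "g",
    lambda seq: seq[10] == "c",
]

if __name__ == "__main__":
    example = "aattacgggcct"
    # Classify as a promoter if at least 3 of the 4 site conditions match.
    print("promoter?", m_of_n(3, conditions, example))
```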

Journal Article
TL;DR: In this paper, the authors report on a series of experiments in which all decision trees consistent with the training data are constructed and run to gain an understanding of the properties of the set.
Abstract: We report on a series of experiments in which all decision trees consistent with the training data are constructed. These experiments were run to gain an understanding of the properties of the set ...