
Showing papers on "Reinforcement learning" published in 1997


Book ChapterDOI
01 Oct 1997
TL;DR: This paper shows that additional sensation from another agent is beneficial if it can be used efficiently, sharing learned policies or episodes among agents speeds up learning at the cost of communication, and for joint tasks, agents engaging in partnership can significantly outperform independent agents although they may learn slowly in the beginning.
Abstract: Intelligent human agents exist in a cooperative social environment that facilitates learning. They learn not only by trial-and-error, but also through cooperation by sharing instantaneous information, episodic experience, and learned knowledge. The key investigations of this paper are, “Given the same number of reinforcement learning agents, will cooperative agents outperform independent agents who do not communicate during learning?” and “What is the price for such cooperation?” Using independent agents as a benchmark, cooperative agents are studied in the following ways: (1) sharing sensation, (2) sharing episodes, and (3) sharing learned policies. This paper shows that (a) additional sensation from another agent is beneficial if it can be used efficiently, (b) sharing learned policies or episodes among agents speeds up learning at the cost of communication, and (c) for joint tasks, agents engaging in partnership can significantly outperform independent agents, although they may learn slowly in the beginning. These tradeoffs are not just limited to multi-agent reinforcement learning.

1,387 citations


Journal ArticleDOI
TL;DR: This article summarizes four directions of machine-learning research: the improvement of classification accuracy by learning ensembles of classifiers, methods for scaling up supervised learning algorithms, reinforcement learning, and the learning of complex stochastic models.
Abstract: Machine-learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (1) the improvement of classification accuracy by learning ensembles of classifiers, (2) methods for scaling up supervised learning algorithms, (3) reinforcement learning, and (4) the learning of complex stochastic models.

1,250 citations


Journal ArticleDOI
TL;DR: In this paper, the authors discuss a variety of adaptive critic designs (ACDs) for neuro-control, which are suitable for learning in noisy, nonlinear, and nonstationary environments. They have common roots as generalizations of dynamic programming for neural reinforcement learning approaches.
Abstract: We discuss a variety of adaptive critic designs (ACDs) for neurocontrol. These are suitable for learning in noisy, nonlinear, and nonstationary environments. They have common roots as generalizations of dynamic programming for neural reinforcement learning approaches. Our discussion of these origins leads to an explanation of three design families: heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized dual heuristic programming (GDHP). The main emphasis is on DHP and GDHP as advanced ACDs. We suggest two new modifications of the original GDHP design that are currently the only working implementations of GDHP. They promise to be useful for many engineering applications in the areas of optimization and optimal control. Based on one of these modifications, we present a unified approach to all ACDs. This leads to a generalized training procedure for ACDs.
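For orientation, the critic-training step at the heart of these designs can be stated generically (standard notation, not the authors' exact formulation): in heuristic dynamic programming the critic J is trained toward a one-step temporal-difference target,

    J(x_t) \approx U(x_t) + \gamma\, J(x_{t+1}), \qquad E_t = \tfrac{1}{2}\bigl[J(x_t) - U(x_t) - \gamma\, J(x_{t+1})\bigr]^2,

where U is the one-step utility and \gamma the discount factor; DHP instead trains a critic that estimates the gradient of J with respect to the state, and GDHP combines both.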

1,109 citations


Proceedings Article
01 Dec 1997
TL;DR: This work presents provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrates their effectiveness on a problem with several thousand states.
Abstract: We present a new approach to reinforcement learning in which the policies considered by the learning process are constrained by hierarchies of partially specified machines. This allows for the use of prior knowledge to reduce the search space and provides a framework in which knowledge can be transferred across problems and in which component solutions can be recombined to solve larger and more complicated problems. Our approach can be seen as providing a link between reinforcement learning and "behavior-based" or "teleo-reactive" approaches to control. We present provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrate their effectiveness on a problem with several thousand states.

746 citations


Journal ArticleDOI
TL;DR: A version of Bush and Mosteller's stochastic learning theory in the context of games is considered and it is shown that in the continuous time limit the biological model coincides with the deterministic, continuous time replicator process.
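For reference, the replicator process mentioned here is the standard evolutionary-dynamics equation (stated generically, not quoted from the paper): the population share x_i of strategy i grows in proportion to its payoff advantage over the average,

    \dot{x}_i = x_i \bigl( u_i(x) - \bar{u}(x) \bigr), \qquad \bar{u}(x) = \sum_j x_j\, u_j(x),

so a learning rule whose continuous-time limit coincides with this process inherits its convergence and stability properties.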

708 citations


Journal ArticleDOI
TL;DR: This paper surveys ways in which locally weighted learning, a type of lazy learning, has been applied by the authors to control tasks, and explains the various forms that control tasks can take.
Abstract: Lazy learning methods provide useful representations and training algorithms for learning about complex phenomena during autonomous adaptive control of complex systems. This paper surveys ways in which locally weighted learning, a type of lazy learning, has been applied by us to control tasks. We explain various forms that control tasks can take, and how this affects the choice of learning paradigm. The discussion section explores the interesting impact that explicitly remembering all previous experiences has on the problem of learning to control.
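A minimal sketch of the locally weighted regression step that underlies this style of lazy learning (illustrative only, not the authors' code; the Gaussian kernel, bandwidth h, and toy data are assumptions):

    import numpy as np

    def locally_weighted_predict(X, y, x_query, h=0.5):
        """Predict y at x_query with a linear fit weighted toward nearby stored points."""
        # Gaussian kernel weights: remembered experiences near the query count more.
        dists = np.linalg.norm(X - x_query, axis=1)
        w = np.exp(-(dists ** 2) / (2.0 * h ** 2))

        # Weighted least squares on [1, x] features (rows scaled by sqrt of the weights).
        A = np.hstack([np.ones((X.shape[0], 1)), X])
        sw = np.sqrt(w)[:, None]
        beta, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)
        return float(np.concatenate(([1.0], x_query)) @ beta)

    # Toy usage: four remembered experiences, one query point.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 0.8, 0.9, 0.1])
    print(locally_weighted_predict(X, y, np.array([1.5]), h=0.5))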

619 citations


Journal ArticleDOI
TL;DR: This paper describes a formulation of reinforcement learning that enables learning in noisy, dynamic environments such as the complex concurrent multi-robot learning domain, and experimentally validates the approach on a group of four mobile robots learning a foraging task.
Abstract: This paper describes a formulation of reinforcement learning that enables learning in noisy, dynamic environments such as in the complex concurrent multi-robot learning domain. The methodology involves minimizing the learning space through the use of behaviors and conditions, and dealing with the credit assignment problem through shaped reinforcement in the form of heterogeneous reinforcement functions and progress estimators. We experimentally validate the approach on a group of four mobile robots learning a foraging task.

488 citations


Journal ArticleDOI
TL;DR: This article discusses a gradual evolution in the formal conception of rationality that brings it closer to our informal conception of intelligence and simultaneously reduces the gap between theory and practice, and indicates some directions for future research.

375 citations


Book
15 Jan 1997
TL;DR: This new edition, with substantial new material, takes account of important new developments in the theory of learning and deals extensively with the theory of learning control systems, which has now reached a level of maturity comparable to that of neural network learning.
Abstract: From the Publisher: How does it differ from the first edition? It includes new material on: support vector machines (SVMs), fat-shattering dimensions, applications to neural network learning, learning with dependent samples generated by a beta-mixing process, connections between system identification and learning theory, and probabilistic solution of "intractable" problems in robust control and matrix theory using randomised algorithms. In addition, solutions to some open problems posed in the first edition are included, and new open problems are added. The author is a respected authority in the field of control and systems theory. This new edition, with substantial new material, takes account of important new developments in the theory of learning. It also deals extensively with the theory of learning control systems, which has now reached a level of maturity comparable to that of neural network learning. The book is written in a manner that would suit self-study and contains comprehensive references. The chapters are also written to be as autonomous as possible and contain updated open problems to enhance further research and self-study.

361 citations


Proceedings ArticleDOI
20 Apr 1997
TL;DR: This paper compares direct reinforcement learning (no explicit model) and model-based reinforcement learning on a simple task, pendulum swing-up, and finds that in this task model-based approaches support reinforcement learning from smaller amounts of training data and efficient handling of changing goals.
Abstract: This paper compares direct reinforcement learning (no explicit model) and model-based reinforcement learning on a simple task: pendulum swing up. We find that in this task model-based approaches support reinforcement learning from smaller amounts of training data and efficient handling of changing goals.

270 citations


Proceedings Article
27 Jul 1997
TL;DR: Ultimately, this dissertation demonstrates that by learning portions of their cognitive processes, selectively communicating, and coordinating their behaviors via common knowledge, a group of independent agents can work towards a common goal in a complex, real-time, noisy, collaborative, and adversarial environment.
Abstract: Multi-agent systems in complex, real-time domains require agents to act effectively both autonomously and as part of a team. This dissertation addresses multi-agent systems consisting of teams of autonomous agents acting in real-time, noisy, collaborative, and adversarial environments. Because of the inherent complexity of this type of multi-agent system, this thesis investigates the use of machine learning within multi-agent systems. The dissertation makes four main contributions to the fields of Machine Learning and Multi-Agent Systems. First, the thesis defines a team member agent architecture within which a flexible team structure is presented, allowing agents to decompose the task space into flexible roles and allowing them to smoothly switch roles while acting. Team organization is achieved by the introduction of a locker-room agreement as a collection of conventions followed by all team members. It defines agent roles, team formations, and pre-compiled multi-agent plans. In addition, the team member agent architecture includes a communication paradigm for domains with single-channel, low-bandwidth, unreliable communication. The communication paradigm facilitates team coordination while being robust to lost messages and active interference from opponents. Second, the thesis introduces layered learning, a general-purpose machine learning paradigm for complex domains in which learning a mapping directly from agents' sensors to their actuators is intractable. Given a hierarchical task decomposition, layered learning allows for learning at each level of the hierarchy, with learning at each level directly affecting learning at the next higher level. Third, the thesis introduces a new multi-agent reinforcement learning algorithm, namely team-partitioned, opaque-transition reinforcement learning (TPOT-RL). TPOT-RL is designed for domains in which agents cannot necessarily observe the state changes when other team members act. It exploits local, action-dependent features to aggressively generalize its input representation for learning and partitions the task among the agents, allowing them to simultaneously learn collaborative policies by observing the long-term effects of their actions. Fourth, the thesis contributes a fully functioning multi-agent system that incorporates learning in a real-time, noisy domain with teammates and adversaries. Detailed algorithmic descriptions of the agents' behaviors as well as their source code are included in the thesis. Empirical results validate all four contributions within the simulated robotic soccer domain. The generality of the contributions is verified by applying them to the real robotic soccer and network routing domains. Ultimately, this dissertation demonstrates that by learning portions of their cognitive processes, selectively communicating, and coordinating their behaviors via common knowledge, a group of independent agents can work towards a common goal in a complex, real-time, noisy, collaborative, and adversarial environment.

Book
06 Nov 1997
TL;DR: Robot Shaping proposes a new engineering discipline, "behavior engineering," to provide the methodologies and tools for creating autonomous robots.
Abstract: From the Publisher: foreword by Lashon Booker "[This] book gives a clear and comprehensive exposition of [the authors'] extensive experience in integrating reinforcement learning and autonomous robotics. Their continuing contribution is to the development of a distinct engineering discipline ("behavior engineering") through which such robots can be created. I am excited because their efforts combine some of the best theoretical ideas with a strong eye for the practical - for what will actually work." -- Stewart W. Wilson, The Rowland Institute for Science To program an autonomous robot to act reliably in a dynamic environment is a complex task. The dynamics of the environment are unpredictable, and the robots' sensors provide noisy input. A learning autonomous robot, one that can acquire knowledge through interaction with its environment and then adapt its behavior, greatly simplifies the designer's work. A learning robot need not be given all of the details of its environment, and its sensors and actuators need not be finely tuned. Robot Shaping is about designing and building learning autonomous robots. The term "shaping" comes from experimental psychology, where it describes the incremental training of animals. The authors propose a new engineering discipline, "behavior engineering," to provide the methodologies and tools for creating autonomous robots. Their techniques are based on classifier systems, a reinforcement learning architecture originated by John Holland, to which they have added several new ideas, such as "mutespec," classifier system "energy," and dynamic population size. In the book they present Behavior Analysis and Training (BAT) as an example of a behavior engineering methodology.

Journal ArticleDOI
TL;DR: A filtering model is proposed that decomposes the overall task into subsystem functionalities and highlights the need for multiple adaptation techniques to cope with uncertainties.
Abstract: In information-filtering environments, uncertainties associated with changing interests of the user and the dynamic document stream must be handled efficiently. In this article, a filtering model is proposed that decomposes the overall task into subsystem functionalities and highlights the need for multiple adaptation techniques to cope with uncertainties. A filtering system, SIFTER, has been implemented based on the model, using established techniques in information retrieval and artificial intelligence. These techniques include document representation by a vector-space model, document classification by unsupervised learning, and user modeling by reinforcement learning. The system can filter information based on content and a user's specific interests. The user's interests are automatically learned with only limited user intervention in the form of optional relevance feedback for documents. We also describe experimental studies conducted with SIFTER to filter computer and information science documents collected from the Internet and commercial database services. The experimental results demonstrate that the system performs very well in filtering documents in a realistic problem setting.
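A tiny sketch of the content-plus-relevance-feedback loop described above (illustrative only; the toy term vectors, threshold, and update rule are assumptions, not SIFTER's actual implementation):

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    # Documents and the user profile live in the same term vector space.
    user_interest = np.array([0.2, 0.7, 0.1])          # assumed 3-term vocabulary
    doc = np.array([0.0, 0.9, 0.4])

    score = cosine(doc, user_interest)                  # content-based filtering score
    if score > 0.5:                                     # present the document to the user
        feedback = 1.0                                  # optional relevance feedback (+1 = relevant)
        # Reinforcement-style interest update: move the profile toward rewarded documents.
        user_interest += 0.1 * feedback * (doc - user_interest)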

Journal ArticleDOI
TL;DR: The authors show circumstances under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters.
Abstract: We discuss Hinton's (1989) relative payoff procedure (RPP), a static reinforcement learning algorithm whose foundation is not stochastic gradient ascent. We show circumstances under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters. The proof is based on a mapping between the RPP and a form of the expectation-maximization procedure of Dempster, Laird, and Rubin (1977).

Proceedings ArticleDOI
14 Dec 1997
TL;DR: A stochastic model for dialogue systems based on the Markov decision process is introduced, showing that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach.
Abstract: We introduce a stochastic model for dialogue systems based on the Markov decision process. Within this framework we show that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach. The advantages of this new paradigm include objective evaluation of dialogue systems and their automatic design and adaptation. We show some preliminary results on learning a dialogue strategy for an air travel information system.
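To make the optimization view concrete (a standard MDP statement, not taken from the paper): with dialogue states s_t, system actions a_t, and a per-turn reward R capturing dialogue cost and task success, strategy design amounts to finding

    \pi^{*} = \arg\max_{\pi} \; \mathbb{E}\Bigl[ \sum_{t} \gamma^{t} R(s_t, \pi(s_t)) \Bigr],

which reinforcement learning can approximate from dialogue data without an explicit transition model.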

Book
01 Jan 1997
TL;DR: This book provides a comprehensive introduction to the computational material that forms the underpinnings of the currently evolving set of brain models, stresses the broad spectrum of learning models, and situates the various models in their appropriate neural context.
Abstract: From the Publisher: It is now clear that the brain is unlikely to be understood without recourse to computational theories. The theme of An Introduction to Natural Computation is that ideas from diverse areas such as neuroscience, information theory, and optimization theory have recently been extended in ways that make them useful for describing the brain's programs. This book provides a comprehensive introduction to the computational material that forms the underpinnings of the currently evolving set of brain models. It stresses the broad spectrum of learning models--ranging from neural network learning through reinforcement learning to genetic learning--and situates the various models in their appropriate neural context. To write about models of the brain before the brain is fully understood is a delicate matter. Very detailed models of the neural circuitry risk losing track of the task the brain is trying to solve. At the other extreme, models that represent cognitive constructs can be so abstract that they lose all relationship to neurobiology. An Introduction to Natural Computation takes the middle ground and stresses the computational task while staying near the neurobiology.

ReportDOI
01 Jan 1997
TL;DR: The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL.
Abstract: The purpose of this tutorial is to provide an introduction to reinforcement learning (RL) at a level easily understood by students and researchers in a wide range of disciplines. The intent is not to present a rigorous mathematical discussion that requires a great deal of effort on the part of the reader, but rather to present a conceptual framework that might serve as an introduction to a more rigorous study of RL. The fundamental principles and techniques used to solve RL problems are presented. The most popular RL algorithms are presented. Section (1) presents an overview of RL and provides a simple example to develop intuition of the underlying dynamic programming mechanism. In Section (2) the parts of a reinforcement learning problem are discussed. These include the environment, reinforcement function, and value function. Section (3) gives a description of the most widely used reinforcement learning algorithms. These include TD(lambda) and both the residual and direct forms of value iteration, Q-learning, and advantage learning. In Section (4) some of the ancillary issues of RL are briefly discussed, such as choosing an exploration strategy and a discount factor. The conclusion is given in Section (5). Finally, Section (6) is a glossary of commonly used terms followed by references and bibliography.
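As a companion to the algorithms listed in Section (3), a minimal tabular Q-learning sketch follows (illustrative only; the environment interface, learning rate alpha, and epsilon-greedy exploration are assumptions, not taken from the report):

    import random

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
        """Minimal tabular Q-learning; env is assumed to expose reset(), step(a), and actions."""
        Q = {}  # maps (state, action) -> estimated return

        def q(s, a):
            return Q.get((s, a), 0.0)

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy exploration over the environment's action set.
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda act: q(s, act))
                s_next, r, done = env.step(a)
                # One-step temporal-difference update toward r + gamma * max_a' Q(s', a').
                best_next = 0.0 if done else max(q(s_next, act) for act in env.actions)
                Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))
                s = s_next
        return Q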

Journal ArticleDOI
TL;DR: Two implementations are proposed, one with a competitive multilayer perceptron and the other with a self-organising map; results show that the latter implementation is very effective, learning more than 40 times faster than the basic Q-learning implementation.

Journal ArticleDOI
TL;DR: A new method for continuous case-based reasoning is introduced, and its application to the dynamic selection, modification, and acquisition of robot behaviors in an autonomous navigation system, SINS (self-improving navigation system), is discussed.

Proceedings Article
01 Dec 1997
TL;DR: A more general form of temporally abstract model is introduced, the multi-time model, and its suitability for planning and learning by virtue of its relationship to the Bellman equations is established.
Abstract: Planning and learning at multiple levels of temporal abstraction is a key problem for artificial intelligence. In this paper we summarize an approach to this problem based on the mathematical framework of Markov decision processes and reinforcement learning. Current model-based reinforcement learning is based on one-step models that cannot represent common-sense higher-level actions, such as going to lunch, grasping an object, or flying to Denver. This paper generalizes prior work on temporally abstract models [Sutton, 1995] and extends it from the prediction setting to include actions, control, and planning. We introduce a more general form of temporally abstract model, the multi-time model, and establish its suitability for planning and learning by virtue of its relationship to the Bellman equations. This paper summarizes the theoretical framework of multi-time models and illustrates their potential advantages in a grid world planning task.
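The "relationship to the Bellman equations" can be sketched generically (a paraphrase of the idea, not the paper's notation): a multi-time model supplies, for each temporally extended course of action m, an expected discounted reward r_m(s) accumulated while it runs and a discounted terminal-state distribution p_m(s' | s), and these substitute directly for one-step rewards and transitions in a Bellman-style backup,

    V(s) = \max_{m} \Bigl[ r_m(s) + \sum_{s'} p_m(s' \mid s)\, V(s') \Bigr],

so planning can proceed at the abstract level while remaining consistent with the underlying MDP.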

Journal ArticleDOI
TL;DR: CHILD is described, an agent capable of Continual, Hierarchical, Incremental Learning and Development, which can quickly solve complicated non-Markovian reinforcement-learning tasks and can then transfer its skills to similar but even more complicated tasks, learning these faster still.
Abstract: Continual learning is the constant development of increasingly complex behaviors; the process of building more complicated skills on top of those already developed. A continual-learning agent should therefore learn incrementally and hierarchically. This paper describes CHILD, an agent capable of Continual, Hierarchical, Incremental Learning and Development. CHILD can quickly solve complicated non-Markovian reinforcement-learning tasks and can then transfer its skills to similar but even more complicated tasks, learning these faster still.

Book
01 Jan 1997
TL;DR: It is argued that local models have the potential to help solve problems in high-dimensional spaces, whereas global models do not, and a linear approximation of the system dynamics and a quadratic function describing the long-term reward are suggested to constitute a suitable local model.
Abstract: Reinforcement learning is a general and powerful way to formulate complex learning problems and acquire good system behaviour. The goal of a reinforcement learning system is to maximize a long term ...

01 Jan 1997
TL;DR: A new policy iteration method for dynamic programming problems with discounted and undiscounted cost is introduced based on the notion of temporal differences and is primarily geared to the case of large and complex problems where the use of approximations is essential.
Abstract: We introduce a new policy iteration method for dynamic programming problems with discounted and undiscounted cost. The method is based on the notion of temporal differences, and is primarily geared to the case of large and complex problems where the use of approximations is essential. We develop the theory of the method without approximation, we describe how to embed it within a neuro-dynamic programming/reinforcement learning context where feature-based approximation architectures are used, we relate it to TD(λ) methods, and we illustrate its use in the training of a tetris playing program.
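For reference, the temporal differences on which the method builds are the usual one-step quantities (standard notation, which may differ from the authors'):

    d_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),

and a TD(λ)-style evaluation updates the value estimate toward a λ-weighted sum of these differences, V(s_t) \leftarrow V(s_t) + \eta \sum_{k \ge t} (\gamma\lambda)^{k-t} d_k.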

Journal ArticleDOI
TL;DR: In this paper, a reinforcement learning algorithm is introduced which can be applied over a continuous range of actions, with a set of probability density functions being used to determine the action set.

Journal ArticleDOI
TL;DR: A new computational model of the conditioning process is described that attempts to capture some of the aspects that are missing from simple reinforcement learning: conditioned reinforcers, shifting reinforcement contingencies, explicit action sequencing, and state space refinement.
Abstract: Instrumental (or operant) conditioning, a form of animal learning, is similar to reinforcement learning (Watkins, 1989) in that it allows an agent to adapt its actions to gain maximally from the environment while being rewarded only for correct performance. However, animals learn much more complicated behaviors through instrumental conditioning than robots presently acquire through reinforcement learning. We describe a new computational model of the conditioning process that attempts to capture some of the aspects that are missing from simple reinforcement learning: conditioned reinforcers, shifting reinforcement contingencies, explicit action sequencing, and state space refinement. We apply our model to a task commonly used to study working memory in rats and monkeys—the delayed match-to-sample task. Animals learn this task in stages. In simulation, our model also acquires the task in stages, in a similar manner. We have used the model to train an RWI B21 robot.


Journal ArticleDOI
TL;DR: This article presents a lazy learning method that combines a deductive and an inductive strategy to efficiently learn control knowledge incrementally with experience, improving both search efficiency and the quality of the solutions generated by a nonlinear planner, namely prodigy4.0.
Abstract: General-purpose generative planners use domain-independent search heuristics to generate solutions for problems in a variety of domains. However, in some situations these heuristics force the planner to perform inefficiently or obtain solutions of poor quality. Learning from experience can help to identify the particular situations for which the domain-independent heuristics need to be overridden. Most of the past learning approaches are fully deductive and eagerly acquire correct control knowledge from a necessarily complete domain theory and a few examples to focus their scope. These learning strategies are hard to generalize in the case of nonlinear planning, where it is difficult to capture correct explanations of the interactions among goals, multiple planning operator choices, and situational data. In this article, we present a lazy learning method that combines a deductive and an inductive strategy to efficiently learn control knowledge incrementally with experience. We present hamlet, a system we developed that learns control knowledge to improve both search efficiency and the quality of the solutions generated by a nonlinear planner, namely prodigy4.0. We have identified three lazy aspects of our approach from which we believe hamlet greatly benefits: lazy explanation of successes, incremental refinement of acquired knowledge, and lazy learning to override only the default behavior of the problem solver. We show empirical results that support the effectiveness of this overall lazy learning approach, in terms of improving the efficiency of the problem solver and the quality of the solutions produced.

Posted Content
TL;DR: Experience Weighted Attraction (EWA) as discussed by the authors is a general model, which includes reinforcement learning and a class of weighted fictitious play belief models as special cases, and is able to combine the best features of both approaches, allowing attractions to begin and grow flexibly as choice reinforcement does, but reinforcing unchosen strategies substantially as belief-based models implicitly do.
Abstract: We describe a general model, 'experience-weighted attraction' (EWA) learning, which includes reinforcement learning and a class of weighted fictitious play belief models as special cases. In EWA, strategies have attractions which reflect prior predispositions, are updated based on payoff experience, and determine choice probabilities according to some rule (e.g., logit). A key feature is a parameter δ which weights the strength of hypothetical reinforcement of strategies which were not chosen according to the payoff they would have yielded. When δ = 0 choice reinforcement results. When δ = 1, levels of reinforcement of strategies are proportional to expected payoffs given beliefs based on past history. Another key feature is the growth rates of attractions. The EWA model controls the growth rates by two decay parameters, φ and ρ, which depreciate attractions and amount of experience separately. When φ = ρ belief-based models result; when ρ = 0 choice reinforcement results. Using three data sets, parameter estimates of the model were calibrated on part of the data and used to predict the rest. Estimates of δ are generally around .50, φ around 1, and ρ varies from 0 to φ. Choice reinforcement models often outperform belief-based models in the calibration phase and underperform in out-of-sample validation. Both special cases are generally rejected in favor of EWA, though sometimes belief models do better. EWA is able to combine the best features of both approaches, allowing attractions to begin and grow flexibly as choice reinforcement does, but reinforcing unchosen strategies substantially as belief-based models implicitly do.
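For concreteness, the EWA updates take roughly the following form (reconstructed from standard presentations of EWA; notation may differ from this working-paper version). With experience weight N(t) and attraction A_j^i(t) for player i's strategy j:

    N(t) = \rho\, N(t-1) + 1,
    A_j^i(t) = \frac{\phi\, N(t-1)\, A_j^i(t-1) + \bigl[\delta + (1-\delta)\, I(s_j^i, s^i(t))\bigr]\, \pi_i\bigl(s_j^i, s_{-i}(t)\bigr)}{N(t)},

where I(\cdot,\cdot) indicates whether strategy j was actually chosen and \pi_i is the payoff it earned or would have earned; choice probabilities then follow a logit rule in the attractions.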

Proceedings ArticleDOI
09 Jun 1997
TL;DR: This work examines the similarities and differences between CMACs, RBFs and normalized RBFs, and compares the performance of Q-learning with each representation applied to the mountain car problem.
Abstract: CMACs and radial basis functions are often used in reinforcement learning to learn value function approximations having local generalization properties. We examine the similarities and differences between CMACs, RBFs and normalized RBFs and compare the performance of Q-learning with each representation applied to the mountain car problem. We discuss ongoing research efforts to exploit the flexibility of adaptive units to better represent the local characteristics of the state space.
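A minimal sketch of the normalized-RBF value representation being compared (illustrative only; the centers, widths, and mountain car ranges are assumptions, not the authors' implementation):

    import numpy as np

    def normalized_rbf_features(state, centers, width=0.1):
        """Normalized radial basis features over a low-dimensional state (e.g. mountain car)."""
        acts = np.exp(-np.sum((centers - state) ** 2, axis=1) / (2.0 * width ** 2))
        return acts / (acts.sum() + 1e-12)  # normalization: features sum to one

    def q_value(weights, phi, action):
        # Linear Q-function: one weight vector per action, Q(s, a) = w[a] . phi(s).
        return weights[action] @ phi

    def q_update(weights, phi, action, target, alpha=0.1):
        # One gradient-descent Q-learning step on the linear approximator.
        weights[action] += alpha * (target - weights[action] @ phi) * phi
        return weights

    # Toy usage with a 5x5 grid of centers over (position, velocity).
    centers = np.array([[p, v] for p in np.linspace(-1.2, 0.6, 5)
                                for v in np.linspace(-0.07, 0.07, 5)])
    phi = normalized_rbf_features(np.array([-0.5, 0.0]), centers)
    weights = np.zeros((3, len(centers)))           # three mountain-car actions
    weights = q_update(weights, phi, action=1, target=-1.0)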

Journal ArticleDOI
TL;DR: This article compares three approaches to machine learning that have developed largely independently: classical statistics, Vapnik's statistical learning theory, and computational learning theory. It concludes that statisticians and data miners can profit by studying each other's methods and using a judiciously chosen combination of them.