
Showing papers on "Reinforcement learning" published in 1990


Book ChapterDOI
01 Jun 1990
TL;DR: This paper extends previous work with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods, and presents results for two Dyna architectures, Dyna-PI and Dyna-Q, the latter based on Watkins's Q-learning, a new kind of reinforcement learning.
Abstract: This paper extends previous work with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned model of the world. In this paper, I present and show results for two Dyna architectures. The Dyna-PI architecture is based on dynamic programming's policy iteration method and can be related to existing AI ideas such as evaluation functions and universal plans (reactive systems). Using a navigation task, results are shown for a simple Dyna-PI system that simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. The Dyna-Q architecture is based on Watkins's Q-learning, a new kind of reinforcement learning. Dyna-Q uses a less familiar set of data structures than does Dyna-PI, but is arguably simpler to implement and use. We show that Dyna-Q architectures are easy to adapt for use in changing environments.

1,592 citations
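
For orientation, here is a minimal tabular sketch of the Dyna-Q idea summarized above: Q-learning updates from real experience are interleaved with extra planning updates replayed from a learned model. The tiny corridor environment, action set, and constants are illustrative assumptions, not the paper's navigation experiment.

    import random
    from collections import defaultdict

    # Minimal tabular Dyna-Q sketch: Q-learning on real experience plus extra
    # planning updates replayed from a learned deterministic model.  The tiny
    # corridor environment and all hyperparameters are illustrative assumptions.
    N_STATES, ACTIONS = 6, [-1, +1]          # states 0..5, move left/right
    ALPHA, GAMMA, EPSILON, PLAN_STEPS = 0.1, 0.95, 0.1, 20

    Q = defaultdict(float)                   # Q[(state, action)]
    model = {}                               # model[(state, action)] -> (reward, next_state)

    def step(s, a):
        """Deterministic corridor: reward 1 only when reaching the right end."""
        s2 = max(0, min(N_STATES - 1, s + a))
        return (1.0 if s2 == N_STATES - 1 else 0.0), s2

    def greedy(s):
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    s = 0
    for t in range(2000):
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        r, s2 = step(s, a)                   # act in the real world
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])   # direct RL update
        model[(s, a)] = (r, s2)              # learn the world model
        for _ in range(PLAN_STEPS):          # planning: replay remembered transitions
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            best = max(Q[(ps2, b)] for b in ACTIONS)
            Q[(ps, pa)] += ALPHA * (pr + GAMMA * best - Q[(ps, pa)])
        s = 0 if s2 == N_STATES - 1 else s2  # restart the episode at the goal

    print([greedy(s) for s in range(N_STATES)])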


Book
03 Jan 1990
TL;DR: This chapter contains sections titled: Introduction and Overview, A Simple Two-Component Adaptive Critic Design, HDP and Dynamic Programming, Alternative Ways to Figure 3.2 in Adapting the Action Network, Alternatives to HDP in Adapting the Critic Network, Some Topics for Further Research, Equations and Code For Implementation.
Abstract: This chapter contains sections titled: Introduction and Overview, A Simple Two-Component Adaptive Critic Design, HDP and Dynamic Programming, Alternative Ways to Figure 3.2 in Adapting the Action Network, Alternatives to HDP in Adapting the Critic Network, Some Topics for Further Research, Equations and Code For Implementation, References

571 citations
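
A rough sketch of the HDP-style adaptive critic loop described in this chapter, under strong simplifying assumptions: the critic is adapted toward the target U(x) + gamma*J(x'), and an action parameter is adjusted to reduce the critic's cost estimate. The scalar linear plant, quadratic utility, and learning rates below are made up for illustration.

    import numpy as np

    # Sketch of an HDP-style adaptive critic: the critic J(x) is adapted toward
    # the target U(x) + gamma * J(x'), and the action parameter is nudged to
    # reduce predicted cost.  The scalar linear plant, quadratic utility, and
    # step sizes are illustrative assumptions, not the chapter's design.
    rng = np.random.default_rng(0)
    gamma, lr_critic, lr_actor = 0.95, 0.05, 0.01
    w_critic = np.zeros(2)          # J(x) ~ w0*x^2 + w1  (simple quadratic critic)
    k_actor = 0.0                   # control law u = -k_actor * x

    def J(x, w):                    # critic estimate of cost-to-go
        return w[0] * x * x + w[1]

    for episode in range(3000):
        x = rng.uniform(-1.0, 1.0)
        u = -k_actor * x
        x_next = 0.9 * x + u                       # assumed linear plant dynamics
        U = x * x + 0.1 * u * u                    # assumed one-step utility (cost)
        target = U + gamma * J(x_next, w_critic)   # HDP target for the critic
        err = target - J(x, w_critic)
        w_critic += lr_critic * err * np.array([x * x, 1.0])   # gradient step on critic
        # Adapt the action parameter to descend the predicted cost:
        # d[U + gamma*J(x')]/dk with u = -k*x and x' = 0.9*x + u, so du/dk = -x.
        dcost_dk = 0.2 * u * (-x) + gamma * 2.0 * w_critic[0] * x_next * (-x)
        k_actor -= lr_actor * dcost_dk

    print("learned feedback gain:", k_actor)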


Journal ArticleDOI
TL;DR: A stochastic reinforcement learning algorithm for learning functions with continuous outputs is presented, using a connectionist network that learns to perform an underconstrained positioning task with a simulated 3-degree-of-freedom robot arm.

306 citations
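
A hedged sketch of the general idea of stochastic reinforcement learning with continuous outputs: a unit emits a Gaussian-perturbed action and moves its mean toward perturbations that earned above-baseline reinforcement. The one-dimensional task, fixed exploration width, and learning rates are assumptions, not the paper's network or robot-arm setup.

    import numpy as np

    # Rough sketch of stochastic reinforcement learning with a continuous output:
    # a unit emits a Gaussian-perturbed action, and the mean is moved toward
    # perturbations that received above-baseline reinforcement.  The 1-D task,
    # fixed exploration width, and learning rates are illustrative assumptions.
    rng = np.random.default_rng(1)
    lr, sigma = 0.05, 0.2
    w = np.zeros(2)                          # action mean = w[0]*x + w[1]
    baseline = 0.0                           # running estimate of expected reinforcement

    for trial in range(5000):
        x = rng.uniform(-1.0, 1.0)           # input (e.g. a target position)
        mean = w[0] * x + w[1]
        action = mean + sigma * rng.standard_normal()
        r = -abs(action - 2.0 * x)           # reinforcement: best action is 2*x
        # Move the mean toward the noise direction if reinforcement beat the baseline.
        w += lr * (r - baseline) * (action - mean) / sigma * np.array([x, 1.0])
        baseline += 0.05 * (r - baseline)

    print("learned mapping  action ~ %.2f * x + %.2f" % (w[0], w[1]))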


Journal ArticleDOI
TL;DR: This article addresses the issue of consistency in using Heuristic Dynamic Programming (HDP), a procedure for adapting a “critic” neural network, closely related to Sutton's method of temporal differences.

178 citations


Journal ArticleDOI
TL;DR: This paper considers adaptive control architectures that integrate active sensorimotor systems with decision systems based on reinforcement learning, and describes a new decision system that overcomes the difficulties caused by perceptual aliasing by incorporating a perceptual subcycle within the overall decision cycle.
Abstract: This paper considers adaptive control architectures that integrate active sensorimotor systems with decision systems based on reinforcement learning. One unavoidable consequence of active perception is that the agent's internal representation often confounds external world states. We call this phenomenon perceptual aliasing and show that it destabilizes existing reinforcement learning algorithms with respect to the optimal decision policy. A new decision system that overcomes these difficulties is described. The system incorporates a perceptual subcycle within the overall decision cycle and uses a modified learning algorithm to suppress the effects of perceptual aliasing. The result is a control architecture that learns not only how to solve a task but also where to focus its attention in order to collect necessary sensory information.

136 citations
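
A tiny numerical illustration of perceptual aliasing as defined in the abstract: two world states that emit the same observation but reward opposite actions, so any policy defined over observations must forfeit value. The states, actions, and rewards below are made up for illustration.

    # Tiny illustration of perceptual aliasing (the specific rewards are made up):
    # two world states s1 and s2 emit the same observation o, but the action that
    # pays off in s1 fails in s2.  A learner that estimates values over
    # observations is forced to average the two and cannot represent the optimal
    # state-dependent policy.
    rewards = {                     # reward of each action in each *world* state
        "s1": {"left": +1.0, "right": -1.0},
        "s2": {"left": -1.0, "right": +1.0},
    }
    observation = {"s1": "o", "s2": "o"}    # both states look identical to the agent

    # Best possible return when acting on true states:
    state_optimal = sum(max(r.values()) for r in rewards.values())

    # Best possible return when one action value is shared per observation:
    obs_q = {}
    for s, acts in rewards.items():
        for a, r in acts.items():
            obs_q.setdefault((observation[s], a), 0.0)
            obs_q[(observation[s], a)] += r
    obs_optimal = max(obs_q[("o", a)] for a in ["left", "right"])

    print("acting on states:", state_optimal)              # 2.0
    print("acting on aliased observation:", obs_optimal)   # 0.0 -- value is lost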


Proceedings Article
01 Oct 1990
TL;DR: This work addresses three problems in reinforcement learning and adaptive neuro-control: non-Markovian interfaces between learner and environment, on-line learning based on system realization, and vector-valued adaptive critics, attacking difficulties with parallel learning by 'adaptive randomness'.
Abstract: This work addresses three problems with reinforcement learning and adaptive neuro-control: 1. Non-Markovian interfaces between learner and environment. 2. On-line learning based on system realization. 3. Vector-valued adaptive critics. An algorithm is described which is based on system realization and on two interacting fully recurrent continually running networks which may learn in parallel. Problems with parallel learning are attacked by 'adaptive randomness'. It is also described how interacting model/controller systems can be combined with vector-valued 'adaptive critics' (previous critics have been scalar).

128 citations


Proceedings ArticleDOI
17 Jun 1990
TL;DR: An online learning algorithm for reinforcement learning with continually running recurrent networks in nonstationary reactive environments is described and the possibility of using the system for planning future action sequences is investigated and this approach is compared to approaches based on temporal difference methods.
Abstract: An online learning algorithm for reinforcement learning with continually running recurrent networks in nonstationary reactive environments is described. Various kinds of reinforcement are considered as special types of input to an agent living in the environment. The agent's only goal is to maximize the amount of reinforcement received over time. Supervised learning techniques for recurrent networks serve to construct a differentiable model of the environmental dynamics which includes a model of future reinforcement. This model is used for learning goal-directed behavior in an online fashion. The possibility of using the system for planning future action sequences is investigated and this approach is compared to approaches based on temporal difference methods. A connection to metalearning (learning how to learn) is noted.

98 citations
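
A one-step, feedforward sketch of the model-based idea in the abstract: a differentiable model is trained to predict reinforcement from (state, action), and the controller is then improved by following the model's gradient. This is not the paper's fully recurrent architecture; the task, quadratic features, and step sizes are assumptions.

    import numpy as np

    # One-step sketch of model-based learning of behaviour: a differentiable model
    # is trained to predict reinforcement from (state, action), and the controller
    # is then adjusted by pushing its action in the direction the *model* says
    # increases reinforcement.  The task, quadratic features, and step sizes are
    # illustrative assumptions, not the paper's recurrent networks.
    rng = np.random.default_rng(0)
    m = np.zeros(3)                  # reward model: r_hat = m0*s^2 + m1*a^2 + m2*s*a
    k = 0.0                          # controller:   a = k * s
    lr_model, lr_ctrl, noise = 0.05, 0.02, 0.3

    for t in range(4000):
        s = rng.uniform(-1.0, 1.0)
        a = k * s + noise * rng.standard_normal()     # exploratory action
        r = -(a - 2.0 * s) ** 2                       # true (unknown) reinforcement
        feats = np.array([s * s, a * a, s * a])
        m += lr_model * (r - m @ feats) * feats       # LMS fit of the reward model
        # Improve the controller through the frozen model:
        # d r_hat / d k = (d r_hat / d a) * (d a / d k) = (2*m1*a + m2*s) * s at a = k*s
        a_det = k * s
        k += lr_ctrl * (2.0 * m[1] * a_det + m[2] * s) * s

    print("controller gain k = %.2f (optimum under the assumed reward is 2)" % k)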


Book ChapterDOI
01 Jun 1990
TL;DR: This paper considers adaptive control architectures that integrate active sensory-motor systems with decision systems based on reinforcement learning and proposes a new decision system that overcomes the effects of perceptual aliasing.
Abstract: This paper considers adaptive control architectures that integrate active sensory-motor systems with decision systems based on reinforcement learning. One unavoidable consequence of active perception is that the agent's internal representation often confounds external world states. We call this phenomenon perceptual aliasing and show that it destabilizes existing reinforcement learning algorithms with respect to the optimal decision policy. A new decision system that overcomes these difficulties is described. The system incorporates a perceptual subcycle within the overall decision cycle and uses a modified learning algorithm to suppress the effects of perceptual aliasing. The result is a control architecture that learns not only how to solve a task but also where to focus its attention in order to collect necessary sensory information.

90 citations


Journal ArticleDOI
TL;DR: A mathematical framework is established based on the functional spaces which have been heavily used in H∞ control theory and then applied to the study of the convergence problems associated with learning control.
Abstract: In this paper we study a control methodology now often referred to as learning control in the recent control literature. Our primary aim is to establish a mathematical framework which will allow systematic and continuing development of learning control theory. The framework is based on the functional spaces which have been heavily used in H∞ control theory. It is then applied to the study of the convergence problems associated with learning control. Three theorems on general convergence conditions are presented.

75 citations
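
As a hedged illustration of the flavor of such convergence results (generic operators, not the paper's exact theorem statements), a classical first-order learning law and the contraction-style condition that guarantees convergence can be written as:

    % Generic D-type learning law for a plant \dot{x} = Ax + Bu, y = Cx
    % (illustrative only, not the theorems proved in the paper).
    u_{k+1}(t) = u_k(t) + \Gamma\,\dot{e}_k(t), \qquad e_k(t) = y_d(t) - y_k(t),
    \qquad e_k \to 0 \ \text{whenever} \ \lVert I - C B \Gamma \rVert < 1 .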


Proceedings Article
01 Oct 1990
TL;DR: A summary of results with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods, shows that Dyna-Q architectures (based on Watkins's Q-learning) are easy to adapt for use in changing environments.
Abstract: This is a summary of results with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned forward model of the world. We describe and show results for two Dyna architectures, Dyna-AHC and Dyna-Q. Using a navigation task, results are shown for a simple Dyna-AHC system which simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. We show that Dyna-Q architectures (based on Watkins's Q-learning) are easy to adapt for use in changing environments.

55 citations


Journal ArticleDOI
TL;DR: The theory and methods proposed in this paper not only provide the learning control system with powerful tools for analysis and design, but also offer multi-dimensional system theory a new field of application and some new problems for further exploration.
Abstract: The connections between two research areas, intelligent control systems and multi-dimensional systems, are established. Two-dimensional (2-D) system theory is used to analyze a class of learning control systems. The 2-D state-space model of a learning control system is given. A class of learning control laws is proposed and the convergence of the learning process can be checked based on a 2-D model of the learning control system. The theory and methods proposed in this paper not only provide the learning control system with powerful tools for analysis and design, but also offer multi-dimensional system theory a new field of application and some new problems for further exploration.

Proceedings ArticleDOI
13 May 1990
TL;DR: An approach to iterative learning control system design based on 2D system theory is presented, and a learning control algorithm is proposed, and the convergence of learning using this algorithm is guaranteed by 2D stability.
Abstract: An approach to iterative learning control system design based on 2D system theory is presented. A 2D model for the iterative learning control system which reveals the connections between learning control systems and 2D system theory is established. A learning control algorithm is proposed, and the convergence of learning using this algorithm is guaranteed by 2D stability. The learning algorithm is applied successfully to the trajectory tracking control problem for a parallel link robot manipulator. The excellent performance of this learning algorithm is demonstrated by computer simulation results.
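
A minimal simulation sketch of the iterative-learning-control pattern described above: the same finite-duration tracking task is repeated, and after each trial the entire input trajectory is corrected with that trial's error, u_{k+1}(t) = u_k(t) + gamma*e_k(t). The first-order discrete plant and gains are illustrative assumptions, not the paper's 2D design or the parallel-link manipulator.

    import numpy as np

    # Minimal iterative-learning-control sketch: repeat the same finite-duration
    # tracking task, and between trials correct the whole input trajectory with
    # the trial's error.  The first-order plant, reference, and gain are
    # illustrative assumptions, not the paper's 2D algorithm.
    T, a, b, gamma = 50, 0.3, 1.0, 0.9
    t_axis = np.arange(T)
    y_desired = np.sin(2 * np.pi * t_axis / T)       # reference trajectory
    u = np.zeros(T)                                  # input trajectory, refined per trial

    def run_trial(u):
        """Simulate y(t) = a*x(t) + b*u(t) with x(t+1) = y(t), starting from rest."""
        x, y = 0.0, np.zeros(T)
        for t in range(T):
            x = a * x + b * u[t]
            y[t] = x
        return y

    for k in range(10):
        y = run_trial(u)
        e = y_desired - y
        print("trial %d  max |error| = %.4f" % (k, np.max(np.abs(e))))
        u = u + gamma * e                            # learning update between trials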

Proceedings ArticleDOI
26 Mar 1990
TL;DR: An examination is made of the problem of default hierarchy formation under the conventional bid competition method of LCS conflict resolution, and the necessity auction and a separate priority factor are suggested as modifications to this method.
Abstract: Consideration is given to the learning classifier system (LCS) as an approach to reinforcement learning problems. An LCS is a type of adaptive expert system that uses a knowledge base of production rules in a low-level syntax that can be manipulated by a genetic algorithm (GA). GAs are a class of computerized search procedures that are based on the mechanics of natural genetics. An important feature of the LCS paradigm is the possible adaptive formation of default hierarchies (layered sets of default and exception rules). An examination is made of the problem of default hierarchy formation under the conventional bid competition method of LCS conflict resolution, and the necessity auction and a separate priority factor are suggested as modifications to this method. Simulations show the utility of this method.
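
A minimal sketch of the conventional bid-competition conflict resolution that the paper examines: matching classifiers bid in proportion to strength and specificity, the winner pays its bid back to the previously active classifier (bucket-brigade credit flow), and external reward goes to the acting rule. The toy rules and constants are made up; the paper's proposed necessity auction and separate priority factor are not implemented here.

    import random

    # Minimal sketch of conventional bid-competition conflict resolution in an LCS.
    # The toy rules and constants are made up for illustration.
    BID_K = 0.1          # fraction of strength put up as a bid

    class Classifier:
        def __init__(self, condition, action, strength=10.0):
            self.condition, self.action, self.strength = condition, action, strength
        def matches(self, message):
            return all(c in ("#", m) for c, m in zip(self.condition, message))
        def specificity(self):
            return sum(c != "#" for c in self.condition) / len(self.condition)

    def resolve(classifiers, message, previous_winner):
        """One conflict-resolution step: collect bids, pick a winner, pass payment back."""
        matched = [c for c in classifiers if c.matches(message)]
        if not matched:
            return None
        bids = {c: BID_K * c.strength * (0.5 + c.specificity()) for c in matched}
        winner = max(matched, key=lambda c: bids[c] + 0.01 * random.random())  # noisy auction
        winner.strength -= bids[winner]                  # winner pays its bid ...
        if previous_winner is not None:
            previous_winner.strength += bids[winner]     # ... to the rule that set things up
        return winner

    rules = [Classifier("1#", "A"), Classifier("11", "B"), Classifier("##", "C")]
    prev = None
    for msg, reward in [("10", 0.0), ("11", 5.0)]:
        prev = resolve(rules, msg, prev)
        prev.strength += reward                          # external reward to the acting rule
    for r in rules:
        print(r.condition, r.action, round(r.strength, 2))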

Book ChapterDOI
01 Jun 1990
TL;DR: In control tasks, such as pole balancing, it is found that a program that quickly learns to balance the pole produces a control strategy so specific that it is impossible to transfer expertise from one related task to another.
Abstract: The most frequently used measure of performance for reinforcement learning algorithms is learning rate. That is, how many learning trials are required before the program is able to perform its task adequately. In this paper, we argue that this is not necessarily the best measure of performance and, in some cases, can even be misleading. In control tasks, such as pole balancing, we have found that a program that learns to balance the pole quickly produces a control strategy that is so specific as to make it impossible to transfer expertise from one related task to another. We examine the reasons for this and suggest ways of obtaining general control strategies. We also make the conjecture that, as a broad principle, there is a trade-off between rapid learning rate and the ability to generalise. We also introduce methods for analysing the results of reinforcement learning algorithms to produce readable control rules.

Proceedings ArticleDOI
05 Sep 1990
TL;DR: It is concluded that control architectures based on reinforcement learning are now in a position to satisfy many of the criteria associated with intelligent control.
Abstract: The focus of this work is on control architectures that are based on reinforcement learning. A number of recent advances that have contributed to the viability of reinforcement learning approaches to intelligent control are surveyed. These advances include the formalization of the relationship between reinforcement learning and dynamic programming, the use of internal predictive models to improve learning rate, and the integration of reinforcement learning with active perception. On the basis of these advances and other results, it is concluded that control architectures based on reinforcement learning are now in a position to satisfy many of the criteria associated with intelligent control.

Proceedings ArticleDOI
17 Jun 1990
TL;DR: A control system for mobile robots which is reminiscent of the modular distributed architecture of the brain is proposed, and demonstrates self-organizing, teacher-controlled, and reinforcement learning paradigms, and it integrates these into a system in which external events interact with internal emotional states.
Abstract: A control system for mobile robots which is reminiscent of the modular distributed architecture of the brain is proposed. This system is motivated by the various neural network models of S. Grossberg (1982, 1988) and M. Seibert and A.M. Waxman (1989). The approach takes the point of view that the robot must be adaptive to its environment and learn from experience. The system utilizes neural networks for learning and performance at all stages, from visual object recognition to behavioral conditioning. The system includes networks for early visual perception, pattern learning and recognition, object associations, emotional states, behavioral actions, and motor control. These networks are interconnected by a variety of adaptive pathways. The system demonstrates self-organizing, teacher-controlled, and reinforcement learning paradigms, and it integrates these into a system in which external events interact with internal emotional states. The system has been implemented on a mobile robot, MAVIN (mobile adaptive visual navigator), and a variety of behavioral conditioning paradigms are demonstrated.


Proceedings ArticleDOI
01 Aug 1990
TL;DR: An approach to robotic control patterned after models of human skill acquisition and the organization of the human motor control system is described, with the aim of developing autonomous robots capable of learning complex tasks in unstructured environments through rule-based inference and self-induced practice.
Abstract: This paper describes an approach to robotic control patterned after models of human skill acquisition and the organization of the human motor control system. The intent of the approach is to develop autonomous robots capable of learning complex tasks in unstructured environments through rule-based inference and self-induced practice. Features of the human motor control system emulated include a hierarchical and modular organization, antagonistic actuation, and multi-joint motor synergies. Human skill acquisition is emulated using declarative and reflexive representations of knowledge, feedback and feedforward implementations of control, and attentional mechanisms. Rule-based systems acquire rough-cut task execution and supervise the training of neural networks during the learning process. After the neural networks become capable of controlling system operation, reinforcement learning is used to further refine the system performance. The research described is interdisciplinary and addresses fundamental issues in learning and adaptive control, dexterous manipulation, redundancy management, knowledge-based system and neural network applications to control, and the computational modelling of cognitive and motor skill acquisition.

Journal ArticleDOI
TL;DR: In this paper, a network of two self-supervised simulated neurons using the drive-reinforcement rule for synaptic modification learns to balance a pole without experiencing failure and responds quickly and automatically to rapidly changing plant parameters.
Abstract: A network of two self-supervised simulated neurons using the drive-reinforcement rule for synaptic modification can learn to balance a pole without experiencing failure. This adaptive controller also responds quickly and automatically to rapidly changing plant parameters. Other aspects of the controller's performance investigated include the controller's response in a noisy environment, the effect of varying the partitioning of the state space of the plant, the effect of increasing the controller's response time, and the consequences of disabling learning at the beginning of a trial and during the progress of a trial. Earlier work with drive-reinforcement learning supports the claim that the theory's neuronal model can account for observed phenomena of classical conditioning; this work constitutes progress toward demonstrating that useful adaptive controllers can be fabricated from networks of classically conditionable elements.
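
A hedged sketch of a drive-reinforcement-style synaptic update in the spirit of the abstract: the weight change correlates the current change in postsynaptic output with recent positive changes in presynaptic inputs, each delay weighted by its own constant and gated by the weight magnitude. The traces, constants, and toy input stream are assumptions, not the pole-balancing controller studied in the paper.

    import numpy as np

    # Sketch of a drive-reinforcement-style synaptic rule: weight changes are
    # proportional to the current change in postsynaptic output correlated with
    # recent positive changes in presynaptic inputs, each delay having its own
    # learning-rate constant and gated by the current weight magnitude.  The
    # constants and toy input stream are illustrative assumptions.
    tau = 5
    c = np.array([1.0, 0.5, 0.25, 0.12, 0.06])   # learning-rate constant per delay
    w = np.full(2, 0.1)                           # two synapses, small nonzero start

    x_hist = [np.zeros(2)] * (tau + 1)            # recent inputs x(t-1) .. x(t-tau-1)
    y_prev = 0.0

    def neuron_output(x, w):
        return float(np.clip(w @ x, 0.0, 1.0))    # simple bounded activation

    rng = np.random.default_rng(3)
    for t in range(200):
        x = (rng.random(2) > 0.5).astype(float)   # toy binary input stream
        y = neuron_output(x, w)
        dy = y - y_prev
        for j in range(1, tau + 1):
            dx = np.maximum(x_hist[-j] - x_hist[-j - 1], 0.0)   # positive input changes only
            w += c[j - 1] * dy * np.abs(w) * dx
        x_hist = x_hist[1:] + [x]
        y_prev = y

    print("weights after exposure:", np.round(w, 3))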

Book ChapterDOI
01 Jan 1990
TL;DR: An on-line learning algorithm is described for attacking the fundamental credit assignment problem in non-stationary reactive environments by adapting a differentiable model of the environmental dynamics which includes a model of future reinforcement.
Abstract: We describe an on-line learning algorithm for attacking the fundamental credit assignment problem in non-stationary reactive environments. Reinforcement and pain are considered as special types of input to an agent living in the environment. The agent's only goal is to maximize cumulative reinforcement and to minimize cumulative pain. This simple goal may require producing complicated action sequences. Supervised learning techniques for recurrent networks serve to construct a differentiable model of the environmental dynamics which includes a model of future reinforcement. While this model is adapted, it is concurrently used for learning goal-directed behavior. The method extends work done by Munro, Robinson and Fallside, Werbos, Widrow, and Jordan.

Book ChapterDOI
01 Jan 1990
TL;DR: A system is presented which learns to associate early stimuli with later reinforcement by buffering unfamiliar input images until that reinforcement arrives, and is shown to learn to predict the immediate results of various actions in a given state.
Abstract: Many real-world problems involve sequences where an automaton executes an action but there is some delay before the results of that action become apparent. A system is presented which learns to associate early stimuli with later reinforcement by buffering unfamiliar input images until that reinforcement arrives. It is shown to learn to predict the immediate results of various actions in a given state, to avoid entering negative next-states, and also to avoid entering positive next-states which lead in turn only to negative states. The system is capable of learning across indefinitely long reinforcement delays while only buffering a small number of past states locally at the nodes.

Book ChapterDOI
01 Jun 1990
TL;DR: Both connectionist and classical statistics-based algorithms are presented, then compared empirically on three test problems and modifications and extensions that will allow the algorithms to work in more complex domains are discussed.
Abstract: An agent that must learn to act in the world by trial and error faces the reinforcement learning problem, which is quite different from standard concept learning. Although good algorithms exist for this problem in the general case, they are quite inefficient. One strategy is to find restricted classes of action strategies that can be learned more efficiently. This paper pursues that strategy by developing algorithms that can efficiently learn action maps that are expressible in k-DNF. Both connectionist and classical statistics-based algorithms are presented, then compared empirically on three test problems. Modifications and extensions that will allow the algorithms to work in more complex domains are also discussed.

Journal ArticleDOI
TL;DR: The approach of learning control is described and an analysis of the learning control problem is given for the case of linear, time-invariant plants and controllers, which includes typical robotic manipulator models.

Proceedings ArticleDOI
17 Jun 1990
TL;DR: A neural network model based on pulse signals is presented and it is shown that reinforcement learning is the type of learning which is best suited for this type of model and that this does not interfere with the properties mentioned above.
Abstract: A neural network model based on pulse signals is presented. Pulsed signals are shown to give the model such properties as a simplified hardware implementation, the ability to use stochastic search techniques, and biological plausibility. In this model, there is also the possibility of building modular neural systems. Since information representation in pulsed signals has to be studied carefully, the stochastic representation is outlined and it is shown how generation of a signal and estimation of information are obtained. Several design issues have to be solved before it is possible to use a model of this kind, and solutions are proposed for synaptic multiplication, summation, and nonlinearity. It is also shown that reinforcement learning is the type of learning which is best suited for this type of model and that this does not interfere with the properties mentioned above. Initial simulations of the model show promising results.

01 Jan 1990
TL;DR: An overview of some novel algorithms for reinforcement learning in non-stationary, possibly reactive environments and how ‘selective attention’ can be learned is given.
Abstract: This paper gives an overview of some novel algorithms for reinforcement learning in non-stationary, possibly reactive environments. I have decided to describe many ideas briefly rather than going into great detail on any one idea. The paper is structured as follows: In the first section some terminology is introduced. Then there follow five sections, each headed by a short abstract. The second section describes the entirely local ‘neural bucket brigade algorithm’. The third section applies Sutton's TD-methods to fully recurrent continually running probabilistic networks. The fourth section describes an algorithm based on system identification and on two interacting fully recurrent ‘self-supervised’ learning networks. The fifth section describes an application of adaptive control techniques to adaptive attentive vision: It demonstrates how ‘selective attention’ can be learned. Finally, the sixth section criticizes methods based on system identification and adaptive critics, and describes an adaptive subgoal generator.

01 Dec 1990
TL;DR: This paper reviews some basic issues and methods involved in using neural networks to respond in a desired fashion to a temporally-varying environment and some popular network models and training methods are introduced.
Abstract: This paper reviews some basic issues and methods involved in using neural networks to respond in a desired fashion to a temporally-varying environment. Some popular network models and training methods are introduced. A speech recognition example is then used to illustrate the central difficulty of temporal data processing: learning to notice and remember relevant contextual information. Feedforward network methods are applicable to cases where this problem is not severe. The application of these methods is explained and applications are discussed in the areas of pure mathematics, chemical and physical systems, and economic systems. A more powerful but less practical algorithm for temporal problems, the moving targets algorithm, is sketched and discussed. For completeness, a few remarks are made on reinforcement learning.

Proceedings Article
01 Jan 1990
TL;DR: A summary of results with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods, shows that Dyna-Q architectures (based on Watkins's Q-learning) are easy to adapt for use in changing environments.
Abstract: This is a summary of results with Dyna, a class of architectures for intelligent systems based on approximating dynamic programming methods. Dyna architectures integrate trial-and-error (reinforcement) learning and execution-time planning into a single process operating alternately on the world and on a learned forward model of the world. We describe and show results for two Dyna architectures, Dyna-AHC and Dyna-Q. Using a navigation task, results are shown for a simple Dyna-AHC system which simultaneously learns by trial and error, learns a world model, and plans optimal routes using the evolving world model. We show that Dyna-Q architectures (based on Watkins's Q-learning) are easy to adapt for use in changing environments.


01 Jan 1990
TL;DR: The Iterative Learning Control System (ILCS) is a new approach to the problem of improving transient behavior using previously acquired experience for systems that execute repetitive tasks, but the theoretical frameworks on the fundamentals of ILCS have not heretofore been established.
Abstract: The Iterative Learning Control System (ILCS) is a new approach to the problem of improving transient behavior using previously acquired experience for systems that execute repetitive tasks. However, theoretical frameworks for the fundamentals of ILCS have not heretofore been established. Theoretical results in two-dimensional (2-D) system theory have been developed during the last two decades, but few applications have been reported. In this dissertation research, the connections between these two research areas (ILCS and 2-D) are established. A class of iterative learning control systems is analyzed from a two-dimensional system point of view. A generic 2-D model for a class of ILCS is established which allows us to employ 2-D system theory to analyze and design the entire learning control system. A general structure of the learning controller is given based on the 2-D model. The analysis of the 2-D error equation shows that the 2-D asymptotic stability of the 2-D model guarantees the learning convergence of ILCS. Several classes of learning control algorithms have been proposed for which the learning convergence is proved. The learning gain matrices are obtained from a recursive 2-D estimator using the input and output data of the controlled plant obtained from previous operations. The estimation algorithms are derived for both time-invariant and time-variant systems. Comparisons of the 2-D ILCS algorithm with other learning algorithms are given. The results demonstrate superior performance of the 2-D based learning control system. The feasibility of applying the proposed 2-D learning algorithms to engineering systems is investigated through two application case studies. In the first case study, the proposed 2-D learning algorithm is applied to a parallel link robotic manipulator executing a repetitive motion. The learning process converges in four iterations and the actual trajectory of the manipulator follows the desired trajectory with acceptable accuracy. The second case study deals with the learning control problem for the Experimental Breeder Reactor II (EBR II) in a nuclear power plant. The tracking control tasks for the EBR II primary system and the steam generator system are implemented by employing the learning controller. The efficiency of the 2-D based learning algorithms is demonstrated by extensive simulations.

Book
03 Jan 1990
TL;DR: In this article, a description is given of several ways that backpropagation can be useful in training networks to perform associative reinforcement learning tasks; the mathematical validity of one of these techniques rests on the use of continuous-valued stochastic units.
Abstract: A description is given of several ways that backpropagation can be useful in training networks to perform associative reinforcement learning tasks. One way is to train a second network to model the environmental reinforcement signal and to backpropagate through this network into the first network. This technique has been proposed and explored previously in various forms. Another way is based on the use of the REINFORCE algorithm and amounts to backpropagating through deterministic parts of the network while performing a correlation-style computation where the behavior is stochastic. A third way, which is an extension of the second, allows backpropagation through the stochastic parts of the network as well. The mathematical validity of this third technique rests on the use of continuous-valued stochastic units. Some implications of this result for using supervised learning to train networks of stochastic units are noted, and it is also observed that such an approach even permits a seamless blend of associative reinforcement learning and supervised learning within the same network.
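
A minimal sketch of the second technique described in this abstract: back-propagate through the deterministic part of a network while performing a REINFORCE-style correlation at a Gaussian stochastic output unit. The one-hidden-unit network, the task, and all constants are illustrative assumptions, not the article's exact constructions.

    import numpy as np

    # Minimal sketch of backpropagation combined with a REINFORCE-style update:
    # the gradient signal at a Gaussian stochastic output unit is a correlation
    # of (reinforcement - baseline) with the output noise, and that signal is
    # then back-propagated through the deterministic part of the network.
    # The tiny network, task, and constants are illustrative assumptions.
    rng = np.random.default_rng(0)
    w1, w2, sigma, lr = 0.5, 0.0, 0.3, 0.02
    baseline = 0.0

    for trial in range(5000):
        x = rng.uniform(-1.0, 1.0)
        h = np.tanh(w1 * x)                    # deterministic hidden unit
        mu = w2 * h                            # mean of the stochastic output unit
        a = mu + sigma * rng.standard_normal() # sampled (stochastic) output
        r = -(a - x) ** 2                      # reinforcement: output should copy x
        # REINFORCE-style 'characteristic eligibility' at the stochastic unit ...
        d_mu = (r - baseline) * (a - mu) / sigma**2
        # ... back-propagated through the deterministic part of the network.
        g2 = d_mu * h
        g1 = d_mu * w2 * (1.0 - h**2) * x
        w2 += lr * g2
        w1 += lr * g1
        baseline += 0.05 * (r - baseline)

    print("w1 = %.2f, w2 = %.2f" % (w1, w2))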