
Showing papers on "Reinforcement learning published in 1992"


Journal ArticleDOI
TL;DR: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units; these algorithms are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.
Abstract: This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.

7,930 citations
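As a concrete illustration of the kind of update these algorithms perform, here is a minimal sketch of a REINFORCE-style weight adjustment for a single layer of Bernoulli-logistic stochastic units in an immediate-reinforcement setting; the layer sizes, learning rate, and baseline scheme are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(W, x, reward_fn, baseline=0.0, alpha=0.1):
    """One REINFORCE update for a layer of Bernoulli-logistic units:
    delta_W = alpha * (r - baseline) * (y - p) * x, where (y - p) * x is the
    characteristic eligibility of each unit's weights. W is modified in place."""
    p = 1.0 / (1.0 + np.exp(-W @ x))              # firing probabilities
    y = (rng.random(p.shape) < p).astype(float)   # stochastic 0/1 outputs
    r = reward_fn(y)                              # scalar reinforcement from the task
    W += alpha * (r - baseline) * np.outer(y - p, x)
    return r
```

For example, with `W = np.zeros((3, 4))`, a 4-dimensional input `x`, and a baseline maintained as a running average of past reinforcements, the expected weight change follows the gradient of expected reinforcement, which is the property the article proves.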


Journal ArticleDOI
TL;DR: In this article, it is shown that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action values are represented discretely.
Abstract: Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.

3,294 citations
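For reference, a minimal tabular Q-learning sketch showing the update whose convergence the theorem addresses; the environment interface (`reset`/`step`) and the epsilon-greedy exploration scheme are illustrative assumptions and not part of the paper.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy selection keeps all actions repeatedly sampled in all states
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = env.step(a)
            target = r if done else r + gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```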


Book
01 Jun 1992
TL;DR: Reinforcement learning as mentioned in this paper is an approach to artificial intelligence that emphasizes learning by the individual from its interaction with its environment; a central challenge is the tradeoff between exploration and exploitation, since neither can be pursued exclusively without failing at the task.
Abstract: Reinforcement learning is an approach to artificial intelligence that emphasizes learning by the individual from its interaction with its environment. This contrasts with classical approaches to artificial intelligence and machine learning, which have downplayed learning from interaction, focusing instead on learning from a knowledgeable teacher, or on reasoning from a complete model of the environment. Modern reinforcement learning research is highly interdisciplinary; it includes researchers specializing in operations research, genetic algorithms, neural networks, psychology, and control engineering. Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize a scalar reward signal. The learner is not told which action to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation, and through that all subsequent rewards. These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning. One of the challenges that arises in reinforcement learning and not in other kinds of learning is the tradeoff between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover which actions these are it has to select actions that it has not tried before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploitation nor exploration can be pursued exclusively without failing at the task. Modern reinforcement learning research uses the formal framework of Markov decision processes.

2,052 citations
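A common way to balance the exploration-exploitation tradeoff described above is epsilon-greedy action selection; the short sketch below is a generic illustration of the dilemma, not a construct quoted from the book.

```python
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """With probability eps try a random action (explore);
    otherwise take the action with the highest estimated value (exploit)."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```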


Journal ArticleDOI
TL;DR: This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning; the three extensions are experience replay, learning action models for planning, and teaching.
Abstract: To date, reinforcement learning has mostly been studied in the context of simple learning tasks. Reinforcement learning methods that have been studied so far typically converge slowly. The purpose of this work is thus two-fold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than previously studied, and 2) to investigate methods that will speed up reinforcement learning. This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due to Sutton, Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning. The three extensions are experience replay, learning action models for planning, and teaching. The frameworks were investigated using connectionism as an approach to generalization. To evaluate the performance of different frameworks, a dynamic environment was used as a testbed. The environment is moderately complex and nondeterministic. This paper describes these frameworks and algorithms in detail and presents an empirical evaluation of the frameworks.

1,691 citations
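Of the three extensions, experience replay is the easiest to sketch: past transitions are stored and repeatedly re-presented to the learner, so each real action execution supports many value updates. The buffer below shows a simple random-sampling variant; Lin's version replays whole sequences of experience (often backward), so treat this as an illustration of the idea rather than the paper's exact procedure.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store past (s, a, r, s', done) transitions and hand random ones back to
    the learner, so that real experience is reused for many value updates."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

A learner would call `sample()` every step and run its usual AHC or Q-learning update on each replayed transition.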


Journal ArticleDOI
TL;DR: The generalized approximate-reasoning-based intelligent control (GARIC) architecture learns and tunes a fuzzy logic controller even when only weak reinforcement is available; introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; and learns to produce real-valued control actions.
Abstract: A method for learning and tuning a fuzzy logic controller based on reinforcements from a dynamic system is presented. It is shown that the generalized approximate-reasoning-based intelligent control (GARIC) architecture: learns and tunes a fuzzy logic controller even when only weak reinforcement, such as a binary failure signal, is available; introduces a new conjunction operator in computing the rule strengths of fuzzy control rules; introduces a new localized mean of maximum (LMOM) method in combining the conclusions of several firing control rules; and learns to produce real-valued control actions. Learning is achieved by integrating fuzzy inference into a feedforward network, which can then adaptively improve performance by using gradient descent methods. The GARIC architecture is applied to a cart-pole balancing system and demonstrates significant improvements in terms of the speed of learning and robustness to changes in the dynamic system's parameters over previous schemes for cart-pole balancing.

987 citations


01 Jan 1992
TL;DR: This dissertation concludes that it is possible to build artificial agents that can acquire complex control policies effectively by reinforcement learning, enabling its application to complex robot-learning problems.
Abstract: Reinforcement learning agents are adaptive, reactive, and self-supervised. The aim of this dissertation is to extend the state of the art of reinforcement learning and enable its applications to complex robot-learning problems. In particular, it focuses on two issues. First, learning from sparse and delayed reinforcement signals is hard and in general a slow process. Techniques for reducing learning time must be devised. Second, most existing reinforcement learning methods assume that the world is a Markov decision process. This assumption is too strong for many robot tasks of interest. This dissertation demonstrates how we can possibly overcome the slow learning problem and tackle non-Markovian environments, making reinforcement learning more practical for realistic robot tasks: (1) Reinforcement learning can be naturally integrated with artificial neural networks to obtain high-quality generalization, resulting in a significant learning speedup. Neural networks are used in this dissertation, and they generalize effectively even in the presence of noise and a large number of binary and real-valued inputs. (2) Reinforcement learning agents can save many learning trials by using an action model, which can be learned on-line. With a model, an agent can mentally experience the effects of its actions without actually executing them. Experience replay is a simple technique that implements this idea, and is shown to be effective in reducing the number of action executions required. (3) Reinforcement learning agents can take advantage of instructive training instances provided by human teachers, resulting in a significant learning speedup. Teaching can also help learning agents avoid local optima during the search for optimal control. Simulation experiments indicate that even a small amount of teaching can save agents many learning trials. (4) Reinforcement learning agents can significantly reduce learning time by hierarchical learning--they first solve elementary learning problems and then combine solutions to the elementary problems to solve a complex problem. Simulation experiments indicate that a robot with hierarchical learning can solve a complex problem, which otherwise is hardly solvable within a reasonable time. (5) Reinforcement learning agents can deal with a wide range of non-Markovian environments by having a memory of their past. Three memory architectures are discussed. They work reasonably well for a variety of simple problems. One of them is also successfully applied to a nontrivial non-Markovian robot task. The results of this dissertation rely on computer simulation, including (1) an agent operating in a dynamic and hostile environment and (2) a mobile robot operating in a noisy and non-Markovian environment. The robot simulator is physically realistic. This dissertation concludes that it is possible to build artificial agents that can acquire complex control policies effectively by reinforcement learning.

911 citations


Book
31 Dec 1992
TL;DR: The material presented in this book addresses the analysis and design of learning control systems using a system-theoretic approach, and the application of artificial neural networks to the learning control problem.
Abstract: The material presented in this book addresses the analysis and design of learning control systems. It begins with an introduction to the concept of learning control, including a comprehensive literature review. The text follows with a complete and unifying analysis of the learning control problem for linear time-invariant (LTI) systems using a system-theoretic approach which offers insight into the nature of the solution of the learning control problem. Additionally, several design methods are given for LTI learning control, incorporating a technique based on parameter estimation and a one-step learning control algorithm for finite-horizon problems. Further chapters focus upon learning control for deterministic nonlinear systems, and a time-varying learning controller is presented which can be applied to a class of nonlinear systems, including the models of typical robotic manipulators. The book concludes with the application of artificial neural networks to the learning control problem. Three specific ways to use neural nets for this purpose are discussed, including two methods which use backpropagation training and reinforcement learning.

771 citations


Proceedings Article
30 Nov 1992
TL;DR: This paper shows how to create a Q-learning managerial hierarchy in which high-level managers learn how to set tasks to their submanagers who, in turn, learn how to satisfy them.
Abstract: One way to speed up reinforcement learning is to enable learning to happen simultaneously at multiple resolutions in space and time. This paper shows how to create a Q-learning managerial hierarchy in which high level managers learn how to set tasks to their submanagers who, in turn, learn how to satisfy them. Submanagers need not initially understand their managers' commands. They simply learn to maximise their reinforcement in the context of the current command. We illustrate the system using a simple maze task. As the system learns how to get around, satisfying commands at the multiple levels, it explores more efficiently than standard, flat, Q-learning and builds a more comprehensive map.

681 citations
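A rough sketch of the submanager side of such a managerial hierarchy, assuming a discrete maze-like task: the submanager learns Q-values conditioned on the manager's current command and is paid an internal reward when that command is satisfied. The class layout, reward convention, and hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class SubManager:
    """Learns Q(state, command, action). It is rewarded internally by its manager
    for satisfying the current command, not directly by the external task reward,
    so it need not initially understand what the command 'means'."""
    def __init__(self, n_states, n_commands, n_actions, alpha=0.1, gamma=0.9):
        self.Q = np.zeros((n_states, n_commands, n_actions))
        self.alpha, self.gamma = alpha, gamma

    def act(self, s, c, eps, rng):
        if rng.random() < eps:
            return int(rng.integers(self.Q.shape[-1]))
        return int(self.Q[s, c].argmax())

    def update(self, s, c, a, internal_reward, s2):
        target = internal_reward + self.gamma * self.Q[s2, c].max()
        self.Q[s, c, a] += self.alpha * (target - self.Q[s, c, a])
```

A manager one level up would run an analogous learner whose "actions" are the commands it issues, rewarded from the level above it (or by the task itself at the top level).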


Journal ArticleDOI
TL;DR: In this article, two algorithms for behavior learning are described that combine Q learning, a well-known scheme for propagating reinforcement values temporally across actions, with statistical clustering and Hamming distance.

632 citations


Journal ArticleDOI
TL;DR: Reinforcement learning methods are presented as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems.
Abstract: Neural network reinforcement learning methods are described and considered as a direct approach to adaptive optimal control of nonlinear systems. These methods have their roots in studies of animal learning and in early learning control work. An emerging deeper understanding of these methods is summarized that is obtained by viewing them as a synthesis of dynamic programming and stochastic approximation methods. The focus is on Q-learning systems, which maintain estimates of utilities for all state-action pairs and make use of these estimates to select actions. The use of hybrid direct/indirect methods is briefly discussed.

437 citations


Proceedings Article
12 Jul 1992
TL;DR: The predictive distinctions approach is introduced to compensate for perceptual aliasing caused by incomplete perception of the world; experimental results are given for a simple simulated domain, and additional issues are discussed.
Abstract: It is known that Perceptual Aliasing may significantly diminish the effectiveness of reinforcement learning algorithms [Whitehead and Ballard, 1991]. Perceptual aliasing occurs when multiple situations that are indistinguishable from immediate perceptual input require different responses from the system. For example, if a robot can only see forward, yet the presence of a battery charger behind it determines whether or not it should back up, immediate perception alone is insufficient for determining the most appropriate action. It is problematic since reinforcement algorithms typically learn a control policy from immediate perceptual input to the optimal choice of action. This paper introduces the predictive distinctions approach to compensate for perceptual aliasing caused by incomplete perception of the world. An additional component, a predictive model, is utilized to track aspects of the world that may not be visible at all times. In addition to the control policy, the model must also be learned, and to allow for stochastic actions and noisy perception, a probabilistic model is learned from experience. In the process, the system must discover, on its own, the important distinctions in the world. Experimental results are given for a simple simulated domain, and additional issues are discussed.

Journal ArticleDOI
TL;DR: A new learning algorithm and a modular architecture are presented that learn the decomposition of composite SDTs, and achieve transfer of learning by sharing the solutions of elemental SDTs across multiple composite SDTs.
Abstract: Although building sophisticated learning agents that operate in complex environments will require learning to perform multiple tasks, most applications of reinforcement learning have focused on single tasks. In this paper I consider a class of sequential decision tasks (SDTs), called composite sequential decision tasks, formed by temporally concatenating a number of elemental sequential decision tasks. Elemental SDTs cannot be decomposed into simpler SDTs. I consider a learning agent that has to learn to solve a set of elemental and composite SDTs. I assume that the structure of the composite tasks is unknown to the learning agent. The straightforward application of reinforcement learning to multiple tasks requires learning the tasks separately, which can waste computational resources, both memory and time. I present a new learning algorithm and a modular architecture that learns the decomposition of composite SDTs, and achieves transfer of learning by sharing the solutions of elemental SDTs across multiple composite SDTs. The solution of a composite SDT is constructed by computationally inexpensive modifications of the solutions of its constituent elemental SDTs. I provide a proof of one aspect of the learning algorithm.

01 Jan 1992
TL;DR: It is proved that for all finite deterministic domains, reinforcement learning using a directed technique can always be performed in polynomial time, demonstrating the important role of exploration in reinforcement learning.
Abstract: Exploration plays a fundamental role in any active learning system. This study evaluates the role of exploration in active learning and describes several local techniques for exploration in finite, discrete domains, embedded in a reinforcement learning framework (delayed reinforcement). This paper distinguishes between two families of exploration schemes: undirected and directed exploration. While the former family is closely related to random walk exploration, directed exploration techniques memorize exploration-specific knowledge which is used for guiding the exploration search. In many finite deterministic domains, any learning technique based on undirected exploration is inefficient in terms of learning time, i.e., learning time is expected to scale exponentially with the size of the state space. We prove that for all these domains, reinforcement learning using a directed technique can always be performed in polynomial time, demonstrating the important role of exploration in reinforcement learning. (The proof is given for one specific directed exploration technique named counter-based exploration.) Subsequently, several exploration techniques found in recent reinforcement learning and connectionist adaptive control literature are described. In order to trade off efficiently between exploration and exploitation --- a trade-off which characterizes many real-world active learning tasks --- combination methods are described which explore and avoid costs simultaneously. This includes a selective attention mechanism, which allows smooth switching between exploration and exploitation. All techniques are evaluated and compared on a discrete reinforcement learning task (robot navigation). The empirical evaluation is followed by an extensive discussion of benefits and limitations of this work.
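A minimal sketch of a counter-based (directed) action-selection rule: the exploration-specific knowledge is just a table of visit counts, and the agent is biased toward what it has tried least. The bonus form and the use of state-action counts (rather than counts of successor states) are simplifying assumptions, not the paper's exact rule.

```python
import numpy as np

def counter_based_action(s, Q, visit_counts, bonus_weight, rng):
    """Directed exploration sketch: add a count-based bonus so that rarely tried
    actions look attractive, then act greedily on the combined score."""
    counts = visit_counts[s]                        # how often each action was tried in s
    scores = Q[s] + bonus_weight / (1.0 + counts)   # hypothetical bonus shape
    best = np.flatnonzero(scores == scores.max())
    return int(rng.choice(best))                    # break ties randomly
```

Setting `bonus_weight` large gives nearly pure exploration, setting it to zero gives pure exploitation; sliding it is one crude way to realize the combination methods the abstract mentions.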

Journal ArticleDOI
TL;DR: Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one is adapted to demonstrate this strong form of convergence for a slightly modified version of TD.
Abstract: The method of temporal differences (TD) is one way of making consistent predictions about the future. This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from the case which only uses information from adjacent time steps to that involving information from arbitrary ones. It also considers how this version of TD behaves in the face of linearly dependent representations for states—demonstrating that it still converges, but to a different answer from the least mean squares algorithm. Finally it adapts Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.
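For orientation, the TD(λ) update with eligibility traces and linear function approximation can be written as follows (generic textbook notation, not quoted from the paper):

```latex
\delta_t = r_{t+1} + \gamma\, w_t^{\top}\phi(s_{t+1}) - w_t^{\top}\phi(s_t), \qquad
e_t = \gamma\lambda\, e_{t-1} + \phi(s_t), \qquad
w_{t+1} = w_t + \alpha_t\, \delta_t\, e_t
```

With linearly dependent features φ the iteration still converges, but its fixed point need not coincide with the least-mean-squares solution, which is the distinction the paper analyzes.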

Book
01 May 1992
TL;DR: This volume is relatively self-contained, as the necessary background material from logic, probability and complexity theory is included, and will form an introduction to the theory of computational learning, suitable for a broad spectrum of graduate students from theoretical computer science and mathematics.
Abstract: Computational learning theory is a subject which has been advancing rapidly in the last few years. The authors concentrate on the probably approximately correct model of learning, and gradually develop the ideas of efficiency considerations. Finally, applications of the theory to artificial neural networks are considered. Many exercises are included throughout, and the list of references is extensive. This volume is relatively self-contained, as the necessary background material from logic, probability and complexity theory is included. It will therefore form an introduction to the theory of computational learning, suitable for a broad spectrum of graduate students from theoretical computer science and mathematics.
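As an example of the kind of efficiency consideration developed in the probably approximately correct (PAC) model, the standard sample-complexity bound for a finite hypothesis class H is shown below; this is the textbook bound, given for orientation rather than quoted from this volume.

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```

With at least this many examples, any hypothesis consistent with the sample has true error at most ε with probability at least 1 − δ.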

Journal ArticleDOI
TL;DR: This analysis offers insight into the nature of the solution of the learning control problem by deriving sufficient convergence conditions; an approach to learning control for linear systems based on parameter estimation; and an analysis that shows that for finite-horizon problems it is possible to design a learning control algorithm that converges, with memory, in one step.
Abstract: Learning control is an iterative approach to the problem of improving transient behavior for processes that are repetitive in nature. Some results on iterative learning control are presented. A complete review of the literature is given first. Then, a general formulation of the problem is given. Next, a complete analysis of the learning control problem for the case of linear, time-invariant plants and controllers is presented. This analysis offers: insight into the nature of the solution of the learning control problem by deriving sufficient convergence conditions; an approach to learning control for linear systems based on parameter estimation; and an analysis that shows that for finite-horizon problems it is possible to design a learning control algorithm that converges, with memory, in one step. Finally, a time-varying learning controller is given for controlling the trajectory of a nonlinear robot manipulator. A brief simulation example is presented to illustrate the effectiveness of this scheme.
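The learning control law being analyzed has, in its simplest form, the familiar trial-to-trial update below, written generically; the gain Γ and the convergence condition quoted after it are the standard textbook ones, not taken verbatim from this paper.

```latex
u_{k+1}(t) = u_k(t) + \Gamma\, e_k(t+1), \qquad e_k(t) = y_d(t) - y_k(t)
```

For a discrete-time LTI plant with input-output matrix CB, a sufficient condition of the form ‖I − CBΓ‖ < 1 makes the tracking error contract from one repetition of the task to the next, which is the flavor of the sufficient convergence conditions mentioned above.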

Journal ArticleDOI
TL;DR: A neural model of how adaptively timed reinforcement learning occurs is described; the mechanism is suggested to exist in the hippocampus and to involve convergence of dentate granule cells on CA3 pyramidal cells, as well as N-methyl-D-aspartate receptors.

Journal ArticleDOI
TL;DR: The approximate reasoning based intelligent control (ARIC) architecture proposed here learns by updating its prediction of the physical system's behavior and fine-tunes a control knowledge base.

Book ChapterDOI
07 Jul 1992
TL;DR: A method that allows a human expert to interact in real-time with a reinforcement learning algorithm is shown to accelerate the learning process.
Abstract: This paper presents a method for accelerating the learning rates of reinforcement learning algorithms. Reinforcement learning algorithms are known for their slow learning rates, and researchers have focused recently on increasing those rates. In this paper, a method that allows a human expert to interact in real-time with a reinforcement learning algorithm is shown to accelerate the learning process. Two experiments, each with a different domain and a different reinforcement learning algorithm, illustrate that the unobtrusive method accelerates learning by more than an order of magnitude.
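The general shape of such real-time human interaction can be sketched as follows; the interaction protocol and the agent/environment interface here are illustrative assumptions and differ from the paper's specific method, which is not reproduced.

```python
def interactive_step(agent, env, state, human_action=None):
    """If the human expert supplies an action this step, execute it;
    otherwise use the agent's own policy. Either way the resulting transition
    is handed to the learner, so good expert choices are absorbed into the
    learned value function without interrupting normal training."""
    action = human_action if human_action is not None else agent.select_action(state)
    next_state, reward, done = env.step(action)
    agent.update(state, action, reward, next_state, done)
    return next_state, done
```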

01 May 1992
TL;DR: This paper studies three connectionist approaches which learn to use history to handle perceptual aliasing: the window-Q, recurrent- Q, and recurrent-model architectures.
Abstract: Reinforcement learning is a type of unsupervised learning for sequential decision making. Q-learning is probably the best-understood reinforcement learning algorithm. In Q-learning, the agent learns a mapping from states and actions to their utilities. An important assumption of Q-learning is the Markovian environment assumption, meaning that any information needed to determine the optimal actions is reflected in the agent's state representation. Consider an agent whose state representation is based solely on its immediate perceptual sensations. When its sensors are not able to make essential distinctions among world states, the Markov assumption is violated, causing a problem called perceptual aliasing. For example, when facing a closed box, an agent based on its current visual sensation cannot act optimally if the optimal action depends on the contents of the box. There are two basic approaches to addressing this problem -- using more sensors or using history to figure out the current world state. This paper studies three connectionist approaches which learn to use history to handle perceptual aliasing: the window-Q, recurrent-Q, and recurrent-model architectures. Empirical study of these architectures is presented. Their relative strengths and weaknesses are also discussed.
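The window-Q idea is the simplest of the three to sketch: the "state" handed to Q-learning is a sliding window over the last N sensations and actions, so recent history can disambiguate aliased situations. The window length and representation below are illustrative assumptions.

```python
from collections import deque

class WindowState:
    """Window-Q style state: feed Q-learning the concatenation of the last N
    (observation, previous action) pairs instead of the raw current sensation,
    giving the learner a short history to resolve perceptual aliasing."""
    def __init__(self, n=3):
        self.history = deque(maxlen=n)

    def push(self, observation, last_action):
        self.history.append((observation, last_action))

    def state(self):
        return tuple(self.history)   # hashable key for a tabular or dictionary Q
```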

Proceedings Article
12 Jul 1992
TL;DR: Simulations on a set of compositionally-structured navigation tasks show that H-DYNA can learn to solve them faster than conventional RL algorithms, and the abstract models can be used to solve stochastic control tasks.
Abstract: Reinforcement learning (RL) algorithms have traditionally been thought of as trial and error learning methods that use actual control experience to incrementally improve a control policy. Sutton's DYNA architecture demonstrated that RL algorithms can work as well using simulated experience from an environment model, and that the resulting computation was similar to doing one-step lookahead planning. Inspired by the literature on hierarchical planning, I propose learning a hierarchy of models of the environment that abstract temporal detail as a means of improving the scalability of RL algorithms. I present H-DYNA (Hierarchical DYNA), an extension to Sutton's DYNA architecture that is able to learn such a hierarchy of abstract models. H-DYNA differs from hierarchical planners in two ways: first, the abstract models are learned using experience gained while learning to solve other tasks in the same environment, and second, the abstract models can be used to solve stochastic control tasks. Simulations on a set of compositionally-structured navigation tasks show that H-DYNA can learn to solve them faster than conventional RL algorithms. The abstract models also serve as mechanisms for achieving transfer of learning across multiple tasks.
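H-DYNA builds on Sutton's DYNA idea of learning from simulated experience; a one-level Dyna-Q sketch is shown below to make that base idea concrete. The environment interface, hyperparameters, and the flat deterministic model are assumptions, and the hierarchy of temporally abstract models that distinguishes H-DYNA is not reproduced here.

```python
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200, planning_steps=20,
           alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """One-level Dyna-Q sketch: learn a model of (s, a) -> (r, s', done) from real
    experience and run extra value updates on simulated experience from that model."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # (s, a) -> (r, s', done)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = env.step(a)
            Q[s, a] += alpha * (r + (0.0 if done else gamma * Q[s2].max()) - Q[s, a])
            model[(s, a)] = (r, s2, done)
            for _ in range(planning_steps):      # mental rehearsal with the learned model
                ps, pa = list(model)[rng.integers(len(model))]
                pr, ps2, pdone = model[(ps, pa)]
                target = pr + (0.0 if pdone else gamma * Q[ps2].max())
                Q[ps, pa] += alpha * (target - Q[ps, pa])
            s = s2
    return Q
```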

Journal ArticleDOI
TL;DR: The learner is not told which action to take, but instead must discover which actions yield the highest reward by trying them; trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
Abstract: Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of machine learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation, and through that all subsequent rewards. These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning.

Proceedings Article
30 Nov 1992
TL;DR: An algorithm based on Q-learning is described that is proven to converge to the optimal controller for a large class of LQR problems, an important class of control problems involving continuous state and action spaces and requiring a simple type of non-linear function approximator.
Abstract: Recent research on reinforcement learning has focused on algorithms based on the principles of Dynamic Programming (DP). One of the most promising areas of application for these algorithms is the control of dynamical systems, and some impressive results have been achieved. However, there are significant gaps between practice and theory. In particular, there are no convergence proofs for problems with continuous state and action spaces, or for systems involving non-linear function approximators (such as multilayer perceptrons). This paper presents research applying DP-based reinforcement learning theory to Linear Quadratic Regulation (LQR), an important class of control problems involving continuous state and action spaces and requiring a simple type of non-linear function approximator. We describe an algorithm based on Q-learning that is proven to converge to the optimal controller for a large class of LQR problems. We also describe a slightly different algorithm that is only locally convergent to the optimal Q-function, demonstrating one of the possible pitfalls of using a non-linear function approximator with DP-based learning.
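What makes the LQR case tractable is that the optimal Q-function is exactly quadratic in the joint state-action vector, so the "non-linear function approximator" only has to recover a symmetric matrix; schematically (generic notation, not the paper's):

```latex
Q(x,u) =
\begin{bmatrix} x \\ u \end{bmatrix}^{\!\top}
\begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}
\begin{bmatrix} x \\ u \end{bmatrix},
\qquad
u^{*}(x) = -\,H_{uu}^{-1} H_{ux}\, x
```

Q-learning in this setting amounts, roughly, to estimating the entries of H from observed transitions and then improving the linear feedback policy with the greedy rule above.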

Proceedings Article
30 Nov 1992
TL;DR: A neural network learning method is introduced that generalizes rationally from many fewer data points, relying instead on prior knowledge encoded in previously learned neural networks, which is used to bias generalization when learning the target function.
Abstract: How can artificial neural nets generalize better from fewer examples? In order to generalize successfully, neural network learning methods typically require large training data sets. We introduce a neural network learning method that generalizes rationally from many fewer data points, relying instead on prior knowledge encoded in previously learned neural networks. For example, in robot control learning tasks reported here, previously learned networks that model the effects of robot actions are used to guide subsequent learning of robot control functions. For each observed training example of the target function (e.g. the robot control policy), the learner explains the observed example in terms of its prior knowledge, then analyzes this explanation to infer additional information about the shape, or slope, of the target function. This shape knowledge is used to bias generalization when learning the target function. Results are presented applying this approach to a simulated robot task based on reinforcement learning.
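The core of the method is that each training example contributes not just a target value but also a target slope extracted from the prior networks' explanation. A hedged sketch of the resulting fitting objective is below; the function names, the slope weight mu, and the squared-error form are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def value_and_slope_loss(f, grad_f, x, y_target, slope_target, mu=0.5):
    """Fit both the observed target value and the slope inferred from the
    prior-knowledge explanation. f(x) is the learner's prediction at input x,
    grad_f(x) its gradient with respect to x (both supplied by the caller)."""
    value_err = (f(x) - y_target) ** 2
    slope_err = np.sum((grad_f(x) - slope_target) ** 2)
    return value_err + mu * slope_err
```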

Journal ArticleDOI
TL;DR: A network architecture for process control is described, and it is explained quantitatively how the weights on the connections in the network can be adjusted to yield the desired control action.

Journal ArticleDOI
TL;DR: Neurocontrol — the use of ANNs to directly control motors, actuators, etc.

Proceedings ArticleDOI
12 May 1992
TL;DR: The results indicated that direct reinforcement learning can be used to learn a reactive control strategy that works well even in the presence of a high degree of noise and uncertainty.
Abstract: A peg-in-hole insertion task is used as an example to illustrate the utility of direct associative reinforcement learning methods for learning control under real-world conditions of uncertainty and noise. An associative reinforcement learning system has to learn appropriate actions in various situations through a search guided by evaluative performance feedback. The authors used such a learning system, implemented as a connectionist network, to learn active compliant control for peg-in-hole insertion. The results indicated that direct reinforcement learning can be used to learn a reactive control strategy that works well even in the presence of a high degree of noise and uncertainty.

Proceedings Article
12 Jul 1992
TL;DR: The task of automatically generating a computer program to enable an autonomous mobile robot to perform the task of moving a box from the middle of an irregular shaped room to the wall is considered.
Abstract: The goal in automatic programming is to get a computer to perform a task by telling it what needs to be done, rather than by explicitly programming it. This paper considers the task of automatically generating a computer program to enable an autonomous mobile robot to perform the task of moving a box from the middle of an irregular shaped room to the wall. We compare the ability of the recently developed genetic programming paradigm to produce such a program to the reported ability of reinforcement learning techniques, such as Q-learning, to produce such a program in the style of the subsumption architecture. The computational requirements of reinforcement learning necessitate considerable human knowledge and intervention, whereas genetic programming comes much closer to achieving the goal of getting the computer to perform the task without explicitly programming it. The solution produced by genetic programming emerges as a result of Darwinian natural selection and genetic crossover (sexual recombination) in a population of computer programs. The process is driven by a fitness measure which communicates the nature of the task to the computer and its learning paradigm.

01 Jan 1992
TL;DR: It is argued that for certain types of problems the latter approach, of which reinforcement learning is an example, can yield faster, more reliable learning, while the former approach is relatively inefficient.
Abstract: Learning control involves modifying a controller's behavior to improve its performance as measured by some predefined index of performance (IP). If control actions that improve performance as measured by the IP are known, supervised learning methods, or methods for learning from examples, can be used to train the controller. But when such control actions are not known a priori, appropriate control behavior has to be inferred from observations of the IP. One can distinguish between two classes of methods for training controllers under such circumstances. Indirect methods involve constructing a model of the problem's IP and using the model to obtain training information for the controller. On the other hand, direct, or model-free, methods obtain the requisite training information by observing the effects of perturbing the controlled process on the IP. Despite its reputation for inefficiency, we argue that for certain types of problems the latter approach, of which reinforcement learning is an example, can yield faster, more reliable learning. Using several control problems as examples, we illustrate how the complexity of model construction can often exceed that of solving the original control problem using direct reinforcement learning methods, making indirect methods relatively inefficient. These results indicate the importance of considering direct reinforcement learning methods as tools for learning to solve control problems. We also present several techniques for augmenting the power of reinforcement learning methods. These include (1) the use of local models to guide assigning credit to the components of a reinforcement learning system, (2) implementing a procedure from experimental psychology called "shaping" to improve the efficiency of learning, thereby making more complex problems amenable to solution, and (3) implementing a multi-level learning architecture designed for exploiting task decomposability by using previously-learned behaviors as primitives for learning more complex tasks.

01 Jan 1992
TL;DR: This dissertation applies reinforcement learning to the adaptive control of active sensory-motor systems by extending the technique to include two cooperative learning mechanisms, called Learning with an External Critic (LEC) and Learning By Watching (LBW), respectively, which significantly improve learning.
Abstract: This dissertation applies reinforcement learning to the adaptive control of active sensory-motor systems. Active sensory-motor systems, in addition to providing for overt action, also support active, selective sensing of the environment. The principal advantage of this active approach to perception is that the agent's internal representation can be made highly task specific, thus avoiding wasteful sensory processing and the representation of irrelevant information. One unavoidable consequence of active perception is that improper control can lead to internal states that confound functionally distinct states in the external world. This phenomenon, called perceptual aliasing, is shown to destabilize existing reinforcement learning algorithms with respect to optimal control. To overcome these difficulties, an approach to adaptive control, called the Consistent Representation (CR) method, is developed. This method is used to construct systems that learn not only the overt actions needed to solve a task, but also where to focus their attention in order to collect necessary sensory information. The principle of the CR method is to separate control into two stages: an identification stage, followed by an overt stage. The identification stage generates the task-specific internal representation that is used by the overt control stage. Adaptive identification is accomplished by a technique that involves the detection and suppression of perceptually aliased internal states. Q-learning is used for adaptive overt control. The technique is then extended to include two cooperative learning mechanisms, called Learning with an External Critic (LEC) and Learning By Watching (LBW), respectively, which significantly improve learning. Cooperative mechanisms exploit the presence of helpful agents in the environment to supply auxiliary sources of trial-and-error experience and to decrease the latency between the execution and evaluation of an action.