
Showing papers on "Reinforcement learning published in 1993"


Journal ArticleDOI
01 Aug 1993
TL;DR: A rigorous proof of convergence of DP-based learning algorithms is provided by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem, which establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
Abstract: Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(λ) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(λ) and Q-learning belong.
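
For readers scanning the entry, the flavor of the result can be seen from the tabular Q-learning iteration the theorem covers; the notation below (step size α_t, discount γ) is standard textbook notation rather than taken verbatim from the paper:

```latex
% One-step tabular Q-learning viewed as a stochastic-approximation iteration
Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t)
  + \alpha_t(s_t,a_t)\left[ r_t + \gamma \max_{a'} Q_t(s_{t+1},a') - Q_t(s_t,a_t) \right]

% Convergence with probability 1 requires, among the theorem's conditions,
% the usual Robbins--Monro step-size requirements for every visited pair (s,a):
\sum_t \alpha_t(s,a) = \infty, \qquad \sum_t \alpha_t(s,a)^2 < \infty
```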

936 citations


Journal ArticleDOI
TL;DR: This work presents a new algorithm, prioritized sweeping, for efficient prediction and control of stochastic Markov systems, which successfully solves large state-space real-time problems with which other methods have difficulty.
Abstract: We present a new algorithm, prioritized sweeping, for efficient prediction and control of stochastic Markov systems. Incremental learning methods such as temporal differencing and Q-learning have real-time performance. Classical methods are slower, but more accurate, because they make full use of the observations. Prioritized sweeping aims for the best of both worlds. It uses all previous experiences both to prioritize important dynamic programming sweeps and to guide the exploration of state-space. We compare prioritized sweeping with other reinforcement learning schemes for a number of different stochastic optimal control problems. It successfully solves large state-space real-time problems with which other methods have difficulty.
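
As a rough illustration of the mechanism described above, here is a minimal Python sketch of priority-queue sweeping over a learned tabular model; the data-structure names (`model`, `predecessors`) and the prediction-only simplification are assumptions for illustration, not the paper's code:

```python
import heapq
import itertools

def prioritized_sweeping(model, predecessors, V, gamma=0.95, theta=1e-4, n_backups=1000):
    """Minimal, prediction-only sketch of prioritized sweeping.

    model[s]        -> list of (prob, next_state, reward) triples from a learned model
    predecessors[s] -> iterable of states that can lead into s
    V               -> dict of current value estimates (missing states treated as 0.0)
    """
    tie = itertools.count()  # tie-breaker so the heap never compares states directly

    def backup(s):
        # Full dynamic-programming backup of s under the learned model.
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in model[s])

    # Seed the queue with every state's current Bellman error (negated: heapq is a min-heap).
    pq = [(-abs(backup(s) - V.get(s, 0.0)), next(tie), s) for s in model]
    heapq.heapify(pq)

    for _ in range(n_backups):
        if not pq:
            break
        neg_err, _, s = heapq.heappop(pq)
        if -neg_err <= theta:
            break  # every remaining priority is below the threshold
        V[s] = backup(s)
        # Changing V[s] may change the Bellman error of s's predecessors; re-queue them.
        for p in predecessors.get(s, ()):
            err = abs(backup(p) - V.get(p, 0.0))
            if err > theta:
                heapq.heappush(pq, (-err, next(tie), p))
    return V
```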

800 citations


Book
20 May 1993
TL;DR: This dissertation addresses the problem of designing algorithms for learning in embedded systems, presenting and empirically comparing several reinforcement-learning algorithms, including an interval-estimation algorithm that uses the statistical notion of confidence intervals to guide its generation of actions.
Abstract: This dissertation addresses the problem of designing algorithms for learning in embedded systems. This problem differs from the traditional supervised learning problem. An agent, finding itself in a particular input situation must generate an action. It then receives a reinforcement value from the environment, indicating how valuable the current state of the environment is for the agent. The agent cannot, however, deduce the reinforcement value that would have resulted from executing any of its other actions. A number of algorithms for learning action strategies from reinforcement values are presented and compared empirically with existing reinforcement-learning algorithms. The interval-estimation algorithm uses the statistical notion of confidence intervals to guide its generation of actions in the world, trading off acting to gain information against acting to gain reinforcement. It performs well in simple domains but does not exhibit any generalization and is computationally complex. The cascade algorithm is a structural credit-assignment method that allows an action strategy with many output bits to be learned by a collection of reinforcement-learning modules that learn Boolean functions. This method represents an improvement in computational complexity and often in learning rate. Two algorithms for learning Boolean functions in k-DNF are described. Both are based on Valiant's algorithm for learning such functions from input-output instances. The first uses Sutton's techniques for linear association and reinforcement comparison, while the second uses techniques from the interval estimation algorithm. They both perform well and have tractable complexity. A generate-and-test reinforcement-learning algorithm is presented. It allows symbolic representations of Boolean functions to be constructed incrementally and tested in the environment. It is highly parametrized and can be tuned to learn a broad range of function classes. Low-complexity functions can be learned very efficiently even in the presence of large numbers of irrelevant input bits. This algorithm is extended to construct simple sequential networks using a set-reset operator, which allows the agent to learn action strategies with state. These algorithms, in addition to being studied in simulation, were implemented and tested on a physical mobile robot.
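
The interval-estimation idea mentioned above can be sketched for the simplest (bandit-style) case as follows; the normal-approximation confidence bound is one plausible choice, not necessarily the exact interval construction used in the dissertation:

```python
import math

def interval_estimation_action(successes, trials, z=1.96):
    """Pick the action whose upper confidence bound on success probability is largest.

    successes[a] and trials[a] are per-action counts of rewarded and total executions.
    Acting on the upper bound trades off acting to gain information (wide intervals)
    against acting to gain reinforcement (high estimated reward).
    """
    best_a, best_ub = None, -1.0
    for a in range(len(trials)):
        if trials[a] == 0:
            return a  # untried actions get priority
        p = successes[a] / trials[a]
        ub = p + z * math.sqrt(p * (1 - p) / trials[a])
        if ub > best_ub:
            best_a, best_ub = a, ub
    return best_a
```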

677 citations


Book ChapterDOI
27 Jun 1993
TL;DR: A metric of undiscounted performance and an algorithm, R-learning, for finding action policies that maximize that measure are presented; the algorithm is modelled after the popular Q-learning algorithm.
Abstract: While most Reinforcement Learning work utilizes temporal discounting to evaluate performance, the reasons for this are unclear. Is it out of desire or necessity? We argue that it is not out of desire, and seek to dispel the notion that temporal discounting is necessary by proposing a framework for undiscounted optimization. We present a metric of undiscounted performance and an algorithm for finding action policies that maximize that measure. The technique, which we call R-learning, is modelled after the popular Q-learning algorithm [17]. Initial experimental results are presented which attest to a great improvement over Q-learning in some simple cases.
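
A minimal sketch of the R-learning update described above, with illustrative step sizes and a dictionary-based table rather than the paper's notation, might look like this:

```python
def r_learning_update(R, rho, s, a, reward, s_next, actions, alpha=0.1, beta=0.01):
    """One step of R-learning, the average-reward analogue of Q-learning.

    R maps (state, action) -> relative value; rho estimates the average reward per step.
    Step sizes alpha and beta are illustrative. Returns the updated rho.
    """
    best_next = max(R.get((s_next, a2), 0.0) for a2 in actions)
    best_here = max(R.get((s, a2), 0.0) for a2 in actions)
    old = R.get((s, a), 0.0)
    R[(s, a)] = old + alpha * (reward - rho + best_next - old)
    if old == best_here:  # rho is adjusted only on greedy (non-exploratory) steps
        rho += beta * (reward + best_next - best_here - rho)
    return rho
```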

326 citations


Proceedings Article
11 Jul 1993
TL;DR: It is argued that the machine learning approach to building interface agents is a feasible one which has several advantages over other approaches: it provides a customized and adaptive solution which is less costly and ensures better user acceptability.
Abstract: Interface agents are computer programs that employ Artificial Intelligence techniques in order to provide assistance to a user dealing with a particular computer application. The paper discusses an interface agent which has been modelled closely after the metaphor of a personal assistant. The agent learns how to assist the user by (i) observing the user's actions and imitating them, (ii) receiving user feedback when it takes wrong actions and (iii) being trained by the user on the basis of hypothetical examples. The paper discusses how this learning agent was implemented using memory-based learning and reinforcement learning techniques. It presents actual results from two prototype agents built using these techniques: one for a meeting scheduling application and one for electronic mail. It argues that the machine learning approach to building interface agents is a feasible one which has several advantages over other approaches: it provides a customized and adaptive solution which is less costly and ensures better user acceptability. The paper also discusses the advantages of the particular learning techniques used.

325 citations


Journal ArticleDOI
29 Nov 1993
TL;DR: Parti-game is a new algorithm for learning feasible trajectories to goal regions in high dimensional continuous state-spaces and applies techniques from game-theory and computational geometry to efficiently and adaptively concentrate high resolution only on critical areas.
Abstract: Parti-game is a new algorithm for learning feasible trajectories to goal regions in high dimensional continuous state-spaces. In high dimensions it is essential that neither planning nor exploration occurs uniformly over a state-space. Parti-game maintains a decision-tree partitioning of state-space and applies techniques from game-theory and computational geometry to efficiently and adaptively concentrate high resolution only on critical areas. The current version of the algorithm is designed to find feasible paths or trajectories to goal regions in high dimensional spaces. Future versions will be designed to find a solution that optimizes a real-valued criterion. Many simulated problems have been tested, ranging from two-dimensional to nine-dimensional state-spaces, including mazes, path planning, non-linear dynamics, and planar snake robots in restricted spaces. In all cases, a good solution is found in less than ten trials and a few minutes.

268 citations


Journal ArticleDOI
01 Apr 1993
TL;DR: A class of strategies designed to enhance the learning and planning power of Dyna systems by increasing their computational efficiency are examined.
Abstract: The Dyna class of reinforcement learning architectures enables the creation of integrated learning, planning and reacting systems. A class of strategies designed to enhance the learning and planning power of Dyna systems by increasing their computational efficiency is examined. The benefit of using these strategies is demonstrated on some simple abstract learning tasks. It is proposed that the backups to be performed in Dyna be prioritized in order to improve its efficiency. It is demonstrated on simple tasks that the use of some specific prioritizing schemes can lead to significant reductions in computational effort and corresponding improvements in learning performance.

241 citations


Book ChapterDOI
27 Jun 1993
TL;DR: This paper presents a method by which a reinforcement learning agent can solve the incomplete perception problem using memory by using a hidden Markov model to represent its internal state space and creating memory capacity by splitting states of the HMM.
Abstract: This paper presents a method by which a reinforcement learning agent can solve the incomplete perception problem using memory. The agent uses a hidden Markov model (HMM) to represent its internal state space and creates memory capacity by splitting states of the HMM. The key idea is a test to determine when and how a state should be split: the agent only splits a state when doing so will help the agent predict utility. Thus the agent can create only as much memory as needed to perform the task at hand—not as much as would be required to model all the perceivable world. I call the technique UDM, for Utile Distinction Memory.

189 citations


Journal ArticleDOI
TL;DR: On a simulated inverted-pendulum control problem, “genetic reinforcement learning” produces competitive results with AHC, another well-known reinforcement learning paradigm for neural networks that employs the temporal difference method.
Abstract: Empirical tests indicate that at least one class of genetic algorithms yields good performance for neural network weight optimization in terms of learning rates and scalability. The successful application of these genetic algorithms to supervised learning problems sets the stage for the use of genetic algorithms in reinforcement learning problems. On a simulated inverted-pendulum control problem, “genetic reinforcement learning” produces competitive results with AHC, another well-known reinforcement learning paradigm for neural networks that employs the temporal difference method. These algorithms are compared in terms of learning rates, performance-based generalization, and control behavior over time.
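
A hedged sketch of the "genetic reinforcement learning" loop, evolving neural-network weight vectors by episode return, is shown below; the `fitness` function, operator choices, and parameters are stand-ins rather than the specific genetic algorithm evaluated in the paper:

```python
import numpy as np

def genetic_reinforcement(fitness, n_weights, pop_size=50, generations=100,
                          mutation_scale=0.1, elite_frac=0.2, seed=0):
    """Evolve real-valued weight vectors using only scalar episode returns.

    `fitness(weights)` is assumed to run one or more episodes of the control task
    (e.g. a pole-balancing simulator with a fixed-architecture network) and
    return the total reward obtained.
    """
    rng = np.random.default_rng(seed)
    pop = rng.normal(0.0, 1.0, size=(pop_size, n_weights))
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        scores = np.array([fitness(w) for w in pop])
        order = np.argsort(scores)[::-1]              # best first
        elite = pop[order[:n_elite]]
        children = []
        while len(children) < pop_size - n_elite:
            pa = elite[rng.integers(n_elite)]
            pb = elite[rng.integers(n_elite)]
            mask = rng.random(n_weights) < 0.5        # uniform crossover
            child = np.where(mask, pa, pb) + rng.normal(0, mutation_scale, n_weights)
            children.append(child)
        pop = np.vstack([elite] + children)
    return pop[0]  # best individual found in the final evaluation

# usage sketch with a toy fitness (assumed): best = genetic_reinforcement(lambda w: -np.sum(w**2), n_weights=10)
```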

186 citations


Proceedings Article
11 Jul 1993
TL;DR: This paper analyzes the complexity of on-line reinforcement learning algorithms, namely asynchronous realtime versions of Q-learning and value-iteration, applied to the problem of reaching a goal state in deterministic domains and shows that the algorithms are tractable with only a simple change in the task representation or initialization.
Abstract: This paper analyzes the complexity of on-line reinforcement learning algorithms, namely asynchronous realtime versions of Q-learning and value-iteration, applied to the problem of reaching a goal state in deterministic domains. Previous work had concluded that, in many cases, tabula rasa reinforcement learning was exponential for such problems, or was tractable only if the learning algorithm was augmented. We show that, to the contrary, the algorithms are tractable with only a simple change in the task representation or initialization. We provide tight bounds on the worst-case complexity, and show how the complexity is even smaller if the reinforcement learning algorithms have initial knowledge of the topology of the state space or the domain has certain special properties. We also present a novel bidirectional Q-learning algorithm to find optimal paths from all states to a goal state and show that it is no more complex than the other algorithms.
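
The "simple change in initialization" result can be illustrated with a sketch like the following, which runs greedy Q-learning under an action-penalty reward with uniformly zero-initialized values in a deterministic goal domain; the environment interface (`step`, `goal_test`) and the single-trial framing are assumptions for illustration:

```python
def q_learning_goal_search(step, start, goal_test, actions, max_steps=100000):
    """Greedy Q-learning for goal search in a deterministic domain.

    Uses the action-penalty representation (every action costs -1, the goal is
    absorbing with value 0) together with zero-initialized Q-values, the kind of
    uniform initialization under which goal search remains tractable.
    `step(s, a)` is an assumed deterministic environment function.
    """
    Q = {}
    s = start
    for t in range(max_steps):
        if goal_test(s):
            return t                                             # steps taken to reach the goal
        a = max(actions, key=lambda act: Q.get((s, act), 0.0))   # greedy on current Q
        s_next = step(s, a)
        best_next = 0.0 if goal_test(s_next) else max(
            Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = -1.0 + best_next                             # deterministic: step size 1, gamma 1
        s = s_next
    return None
```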

134 citations


Book ChapterDOI
27 Jun 1993
TL;DR: It is shown that simple random-representation methods can perform as well as nearest-neighbor methods (while being more suited to online learning), and significantly better than backpropagation, suggesting that randomness has a useful role to play in online supervised learning and constructive induction.
Abstract: We consider the requirements of online learning: learning which must be done incrementally and in real time, with the results of learning available soon after each new example is acquired. Despite the abundance of methods for learning from examples, there are few that can be used effectively for online learning, e.g., as components of reinforcement learning systems. Most of these few, including radial basis functions, CMACs, Kohonen's self-organizing maps, and those developed in this paper, share the same structure. All expand the original input representation into a higher dimensional representation in an unsupervised way, and then map that representation to the final answer using a relatively simple supervised learner, such as a perceptron or LMS rule. Such structures learn very rapidly and reliably, but have been thought either to scale poorly or to require extensive domain knowledge. To the contrary, some researchers (Rosenblatt, 1962; Gallant & Smith, 1987; Kanerva, 1988; Prager & Fallside, 1988) have argued that the expanded representation can be chosen largely at random with good results. The main contribution of this paper is to develop and test this hypothesis. We show that simple random-representation methods can perform as well as nearest-neighbor methods (while being more suited to online learning), and significantly better than backpropagation. We find that the size of the random representation does increase with the dimensionality of the problem, but not unreasonably so, and that the required size can be reduced substantially using unsupervised-learning techniques. Our results suggest that randomness has a useful role to play in online supervised learning and constructive induction.
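
A minimal sketch of the structure the abstract describes, a fixed random expansion followed by an LMS output layer, is given below; the specific random-threshold-unit construction and parameters are illustrative assumptions:

```python
import numpy as np

def random_binary_features(rng, n_features, input_dim, threshold=0.0):
    """Fixed random linear-threshold units: one simple way to expand the input."""
    W = rng.normal(size=(n_features, input_dim))
    b = rng.normal(size=n_features)
    return lambda x: (W @ np.asarray(x) + b > threshold).astype(float)

def lms_online(xs, ys, expand, n_features, lr=0.1):
    """Online LMS (delta rule) on top of a fixed random expansion.

    Only the output weights are learned, which is what keeps the scheme cheap
    enough for incremental, real-time use.
    """
    w = np.zeros(n_features)
    for x, y in zip(xs, ys):
        phi = expand(x)
        err = y - w @ phi
        w += lr * err * phi
    return w

# usage sketch with assumed toy data xs, ys:
# rng = np.random.default_rng(0)
# expand = random_binary_features(rng, n_features=200, input_dim=5)
# w = lms_online(xs, ys, expand, n_features=200)
```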


Proceedings Article
28 Aug 1993
TL;DR: In this paper, two reinforcement learning algorithms, ACE and AGE, are proposed for the reinforcement learning of appropriate sequences of action sets in multi-agent systems, and experimental results illustrate the learning abilities of these algorithms.
Abstract: This paper deals with learning in reactive multi-agent systems. The central problem addressed is how several agents can collectively learn to coordinate their actions such that they solve a given environmental task together. In approaching this problem, two important constraints have to be taken into consideration: the incompatibility constraint, that is, the fact that different actions may be mutually exclusive; and the local information constraint, that is, the fact that each agent typically knows only a fraction of its environment. The contents of the paper are as follows. First, the topic of learning in multi-agent systems is motivated (section 1). Then, two algorithms called ACE and AGE (standing for "ACtion Estimation" and "Action Group Estimation", respectively) for the reinforcement learning of appropriate sequences of action sets in multi-agent systems are described (section 2). Next, experimental results illustrating the learning abilities of these algorithms are presented (section 3). Finally, the algorithms are discussed and an outlook on future research is provided (section 4).

Book ChapterDOI
01 Jan 1993
TL;DR: This chapter considers the application of reinforcement learning to a simple class of dynamic multi-goal tasks and considers several merging strategies, from simple ones that compare and combine modular information about the current state only, to more sophisticated strategies that use lookahead search to construct more accurate utility estimates.
Abstract: An ability to coordinate the pursuit of multiple, time-varying goals is important to an intelligent robot. In this chapter we consider the application of reinforcement learning to a simple class of dynamic multi-goal tasks. Not surprisingly, we find that the most straightforward, monolithic approach scales poorly, since the size of the state space is exponential in the number of goals. As an alternative, we propose a simple modular architecture which distributes the learning and control task amongst a set of separate control modules, one for each goal that the agent might encounter. Learning is facilitated since each module learns the optimal policy associated with its goal without regard for other current goals. This greatly simplifies the state representation and speeds learning time compared to a single monolithic controller. When the robot is faced with a single goal, the module associated with that goal is used to determine the overall control policy. When the robot is faced with multiple goals, information from each associated module is merged to determine the policy for the combined task. In general, these merged strategies yield good but suboptimal performance. Thus, the architecture trades poor initial performance, slow learning, and an optimal asymptotic policy in favor of good initial performance, fast learning, and a slightly sub-optimal asymptotic policy. We consider several merging strategies, from simple ones that compare and combine modular information about the current state only, to more sophisticated strategies that use lookahead search to construct more accurate utility estimates.
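
One of the simple current-state merging strategies mentioned above can be sketched as follows; summing per-goal Q-values ("greatest mass") is one such rule, and the data layout is an assumption for illustration:

```python
def merged_greedy_action(q_modules, active_goals, state, actions):
    """Pick an action for multiple simultaneous goals by merging per-goal modules.

    q_modules[g] is assumed to be a dict (state, action) -> Q learned for goal g alone.
    Summing the per-goal values and acting greedily is one of the simple
    current-state merges; the lookahead-based merges are not shown.
    """
    def merged_q(a):
        return sum(q_modules[g].get((state, a), 0.0) for g in active_goals)
    return max(actions, key=merged_q)
```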

Book ChapterDOI
27 Jun 1993
TL;DR: This research concludes that it is possible to build artificial agents that can acquire complex control policies effectively by reinforcement learning, and presents a series of scaling-up extensions that enable its application to complex robot-learning problems.
Abstract: The aim of this research is to extend the state of the art of reinforcement learning and enable its application to complex robot-learning problems. This paper presents a series of scaling-up extensions to reinforcement learning, including: generalization by neural networks, using action models, teaching, hierarchical learning, and having a short-term memory. These extensions have been tested in a physically-realistic robot simulator, and combined to solve a complex robot-learning problem. Simulation results indicate that each of the extensions could result in either significant learning speedup or new capabilities. This research concludes that it is possible to build artificial agents that can acquire complex control policies effectively by reinforcement learning.

Proceedings ArticleDOI
15 Dec 1993
TL;DR: The author uses these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establishes its convergence under conditions more general than previously available.
Abstract: Provides some general results on the convergence of a class of stochastic approximation algorithms and their parallel and asynchronous variants. The author then uses these results to study the Q-learning algorithm, a reinforcement learning method for solving Markov decision problems, and establishes its convergence under conditions more general than previously available.

Book ChapterDOI
01 Jan 1993
TL;DR: This work proposes a division of learning styles into four main types based on the amount of built-in structure and the type of information being learned, and discusses the effectiveness of various learning methodologies when applied in a real robot context.
Abstract: The weaknesses of existing learning techniques, and the variety of knowledge necessary to make a robot perform efficiently in the real world, suggest that many concurrent, complementary, and redundant learning methods are necessary. We propose a division of learning styles into four main types based on the amount of built-in structure and the type of information being learned. Using this classification, we discuss the effectiveness of various learning methodologies when applied in a real robot context.

Proceedings ArticleDOI
28 Mar 1993
TL;DR: The authors propose a reinforcement neural-network-based fuzzy logic control system (RNN-FLCS) for solving various reinforcement learning problems and finds it best applied to learning environments where obtaining exact training data is expensive.
Abstract: The authors propose a reinforcement neural-network-based fuzzy logic control system (RNN-FLCS) for solving various reinforcement learning problems. RNN-FLCS is best applied to learning environments where obtaining exact training data is expensive. It is constructed by integrating two neural-network-based fuzzy logic controllers (NN-FLCs), each of which is a connectionist model with a feedforward multilayered network developed for the realization of a fuzzy logic controller. One NN-FLC functions as a fuzzy predictor and the other as a fuzzy controller. Using the temporal difference prediction method, the fuzzy predictor can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the fuzzy controller. The fuzzy controller implements a stochastic exploratory algorithm to adapt itself according to the internal reinforcement signal. During the learning process, the RNN-FLCs can construct a fuzzy logic control system automatically and dynamically through a reward-penalty signal or through very simple fuzzy information feedback. Structure learning and parameter learning are performed simultaneously in the two NN-FLCs. Simulation results are presented.

Book ChapterDOI
27 Jun 1993
TL;DR: A density-adaptive reinforcement learning algorithm and a density-adaptive forgetting algorithm are described; the forgetting algorithm deletes observations from the learning set depending on whether subsequent evidence is available in a local region of the parameter space.
Abstract: We describe a density-adaptive reinforcement learning and a density-adaptive forgetting algorithm. This learning algorithm uses hybrid D κ-D/2 κ -trees to allow for a variable resolution partitioning and labelling of the input space. The density adaptive forgetting algorithm deletes observations from the learning set depending on whether subsequent evidence is available in a local region of the parameter space. The algorithms are demonstrated in a simulation for learning feasible robotic grasp approach directions and orientations and then adapting to subsequent mechanical failures in the gripper.

01 Nov 1993
TL;DR: This dissertation establishes a novel connection between stochastic approximation theory and RL that provides a uniform framework for understanding all the different RL algorithms that have been proposed to date and highlights a dimension that clearly separates all RL research from prior work on DP.
Abstract: This dissertation is about building learning control architectures for agents embedded in finite, stationary, and Markovian environments. Such architectures give embedded agents the ability to improve autonomously the efficiency with which they can achieve goals. Machine learning researchers have developed reinforcement learning (RL) algorithms based on dynamic programming (DP) that use the agent's experience in its environment to improve its decision policy incrementally. This is achieved by adapting an evaluation function in such a way that the decision policy that is "greedy" with respect to it improves with experience. This dissertation focuses on finite, stationary and Markovian environments for two reasons: it allows the development and use of a strong theory of RL, and there are many challenging real-world RL tasks that fall into this category. This dissertation establishes a novel connection between stochastic approximation theory and RL that provides a uniform framework for understanding all the different RL algorithms that have been proposed to date. It also highlights a dimension that clearly separates all RL research from prior work on DP. Two other theoretical results showing how approximations affect performance in RL provide partial justification for the use of compact function approximators in RL. In addition, a new family of "soft" DP algorithms is presented. These algorithms converge to solutions that are more robust than the solutions found by classical DP algorithms. Despite all of the theoretical progress, conventional RL architectures scale poorly enough to make them impractical for many real-world problems. This dissertation studies two aspects of the scaling issue: the need to accelerate RL, and the need to build RL architectures that can learn to solve multiple tasks. It presents three RL architectures, CQ-L, H-DYNA, and BB-RL, that accelerate learning by facilitating transfer of training from simple to complex tasks. Each architecture uses a different method to achieve transfer of training: CQ-L uses the evaluation functions for simple tasks as building blocks to construct the evaluation function for complex tasks. H-DYNA uses the evaluation functions for simple tasks to build an abstract environment model, and BB-RL uses the decision policies found for the simple tasks as the primitive actions for the complex tasks. A mixture of theoretical and empirical results are presented to support the new RL architectures developed in this dissertation.

Proceedings ArticleDOI
28 Mar 1993
TL;DR: It is shown how reinforcement learning can be made practical for complex problems by introducing hierarchical learning and artificial neural networks are used to generalize experiences.
Abstract: It is shown how reinforcement learning can be made practical for complex problems by introducing hierarchical learning. The agent at first learns elementary skills for solving elementary problems. To learn a new skill for solving a complex problem later on, the agent can ignore the low-level details and focus on the problem of coordinating the elementary skills it has developed. A physically-realistic mobile robot simulator is used to demonstrate the success and importance of hierarchical learning. For fast learning, artificial neural networks are used to generalize experiences, and a teaching technique is employed to save many learning trials of the simulated robot.

Book ChapterDOI
01 Jan 1993
TL;DR: This chapter discusses how learning can be speeded up by exploiting properties of the task, sensor configuration, environment, and existing control structure.
Abstract: For learning to be useful on real robots, whatever algorithm is used must converge in some “reasonable” amount of time. If each trial step takes on the order of seconds, a million steps would take several months of continuous run time. In many cases such extended runs are neither desirable nor practical. In this chapter we discuss how learning can be speeded up by exploiting properties of the task, sensor configuration, environment, and existing control structure.

Proceedings Article
29 Nov 1993
TL;DR: This paper presents a method that uses domain knowledge to reduce the number of failures during exploration and formulates the set of actions from which the RL agent composes a control policy to ensure that exploration is conducted in a policy space that excludes most of the unacceptable policies.
Abstract: While exploring to find better solutions, an agent performing online reinforcement learning (RL) can perform worse than is acceptable. In some cases, exploration might have unsafe, or even catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's environment. This paper presents a method that uses domain knowledge to reduce the number of failures during exploration. This method formulates the set of actions from which the RL agent composes a control policy to ensure that exploration is conducted in a policy space that excludes most of the unacceptable policies. The resulting action set has a more abstract relationship to the task being solved than is common in many applications of RL. Although the cost of this added safety is that learning may result in a suboptimal solution, we argue that this is an appropriate tradeoff in many problems. We illustrate this method in the domain of motion planning.

Book ChapterDOI
13 Sep 1993
TL;DR: This work denominates the method projective mapping, which is the most common method in feed forward neural networks, where an input vector is projected on a “weight vector”.
Abstract: A response generating system can be seen as a mapping from a set of external states (inputs) to a set of actions (outputs). This mapping can be done in principally different ways. One method is to divide the state space into a set of discrete states and store the optimal response for each state. This is denominated a memory mapping system. Another method is to approximate continuous functions from the input space to the output space. I denominate this method projective mapping, although the function does not have to be linear. The latter method is the most common one in feed forward neural networks, where an input vector is projected on a “weight vector”.
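
The two mappings contrasted above can be sketched side by side; the discretization and the linear projection below are deliberately minimal stand-ins for the chapter's examples:

```python
import numpy as np

# Memory mapping: discretize the state space and store one learned response per cell.
def memory_mapping_response(table, discretize, state):
    """table maps a discrete cell index to a stored action vector."""
    return table[discretize(state)]

# Projective mapping: a continuous function from input space to output space, here a
# single-layer feedforward network that projects the input on its weight vectors.
def projective_mapping_response(W, state):
    return W @ np.asarray(state)
```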

Book
01 Jan 1993
TL;DR: An edited collection on genetic algorithms for machine learning, including chapters on using genetic algorithms for concept learning, a knowledge-intensive genetic algorithm for supervised learning, and genetic reinforcement learning for neurocontrol problems.
Abstract: Introduction J.J. Grefenstette. Using Genetic Algorithms for Concept Learning K.A. De Jong, W.M. Spears, D.F. Gordon. A Knowledge-Intensive Genetic Algorithm for Supervised Learning C.Z. Janikow. Competition-Based Induction of Decision Models from Examples D.P. Greene, S.F. Smith. Genetic Reinforcement Learning for Neurocontrol Problems D. Whitley, S. Dominic, R. Das, C.W. Anderson. What Makes a Problem Hard for a Genetic Algorithm? Some Anomalous Results and Their Explanation S. Forrest, M. Mitchell. Subject Index.

Proceedings Article
29 Nov 1993
TL;DR: This paper presents a convergence result for indirect adaptive asynchronous value iteration algorithms for the case in which a look-up table is used to store the value function and implies convergence of several existing reinforcement learning algorithms.
Abstract: Reinforcement Learning methods based on approximating dynamic programming (DP) are receiving increased attention due to their utility in forming reactive control policies for systems embedded in dynamic environments. Environments are usually modeled as controlled Markov processes, but when the environment model is not known a priori, adaptive methods are necessary. Adaptive control methods are often classified as being direct or indirect. Direct methods directly adapt the control policy from experience, whereas indirect methods adapt a model of the controlled process and compute control policies based on the latest model. Our focus is on indirect adaptive DP-based methods in this paper. We present a convergence result for indirect adaptive asynchronous value iteration algorithms for the case in which a look-up table is used to store the value function. Our result implies convergence of several existing reinforcement learning algorithms such as adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993) and prioritized sweeping (Moore & Atkeson, 1993). Although the emphasis of researchers studying DP-based reinforcement learning has been on direct adaptive methods such as Q-Learning (Watkins, 1989) and methods using TD algorithms (Sutton, 1988), it is not clear that these direct methods are preferable in practice to indirect methods such as those analyzed in this paper.

Proceedings ArticleDOI
28 Mar 1993
TL;DR: It is demonstrated that it is possible to control the pitch, roll, and yaw of the space shuttle within a specified deadband by using fuzzy control rules and to adapt automatically to a reduced error tolerance.
Abstract: The authors discuss the results of applying two fuzzy reinforcement learning architectures to the difficult control problem of space shuttle attitude control. They demonstrate that it is possible to control the pitch, roll, and yaw of the space shuttle within a specified deadband by using fuzzy control rules and to adapt automatically to a reduced error tolerance. The performance of this controller is compared with a controller using conventional control theory and also a nonadaptive fuzzy controller. The results, using the orbital operations simulator system, demonstrate that more difficult tasks can be learned by the controller while the fuel efficiency remains very high.

Book ChapterDOI
01 Jan 1993
TL;DR: This chapter presents two basic approaches to building a controller when the transition probabilities and rewards are not initially specified: learn a model and construct an optimal policy off-line with a method such as Bellman's value iteration, or learn an evaluation function directly and use it to select actions without an explicit model.
Abstract: Publisher Summary This chapter discusses reinforcement learning for planning and control. Reinforcement learning is complicated by the fact that the reinforcement, which is in the form of rewards and punishments, is often intermittent and delayed. The controller may perform a long sequence of actions before receiving any reward. This condition makes it difficult to attribute credit or blame to actions when the reinforcement is finally received. The policy that the controller constructs represents a particular plan that indicates the best action to take in every possible state that it faces. There are two basic approaches to building a controller for problems in which the transition probabilities and rewards are not initially specified. In the first, the controller attempts to learn the transition probabilities and rewards and then constructs an optimal policy off-line, using a method such as Bellman's value iteration. In the second approach, the controller attempts to learn an optimal policy by constructing an evaluation function for use in selecting the best action to take in a given state. The controller constructs this evaluation function without recourse to an explicit model for the system dynamics and so while the system cannot predict what the state resulting from a given action will be, it can determine whether that state is better or worse than the state resulting from any other action.
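
The first (indirect, model-learning) approach described above can be sketched as follows: estimate the transition probabilities and rewards from experience, then run Bellman's value iteration on the estimated model. The array layout and parameters are illustrative assumptions:

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate P(s'|s,a) and R(s,a) from observed (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        rewards[s, a] += r
        visits[s, a] += 1
    visits = np.maximum(visits, 1)             # avoid division by zero for unvisited pairs
    return counts / visits[:, :, None], rewards / visits

def value_iteration(P, R, gamma=0.95, theta=1e-6):
    """Bellman's value iteration on the estimated model; returns V and a greedy policy."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V                  # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)
        V = V_new
```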

Journal ArticleDOI
01 Sep 1993
TL;DR: A method based on learning automata for designing controllers for the control of unknown complex dynamic systems using subsets of control actions to reduce the number of actions during a learning procedure is presented.
Abstract: The paper is concerned with the application of reinforcement learning techniques to the stochastic control problem, and in particular presents a method based on learning automata for designing controllers for the control of unknown complex dynamic systems. The work is focused on the design of a learning automaton using subsets of control actions to reduce the number of actions during a learning procedure. The subsets of actions can be expanded or contracted according to action probabilities which are reset from time to time so as to achieve a global selection over the action set. Two reinforcement schemes have been investigated alongside the variable subsets of control actions. A reference performance index and an approach to quantification and normalisation of the performance index are proposed in association with the two schemes to evaluate environment responses during the learning procedure. The method has been used to achieve learning control for an unknown nonlinear turbo-generator system.
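
As a hedged illustration of the learning-automaton machinery the paper builds on, the sketch below shows a standard linear reward-inaction probability update and sampling from a restricted action subset; the paper's specific reinforcement schemes and subset expansion/contraction rules are not reproduced here:

```python
import numpy as np

def linear_reward_inaction(p, chosen, beta, learning_rate=0.05):
    """Linear reward-inaction (L_R-I) update of an action-probability vector p.

    beta in [0, 1] is the normalized environment feedback (1 = fully favourable).
    Only the reward part of the scheme is shown; unfavourable responses leave p unchanged.
    """
    p = p.copy()
    indicator = (np.arange(len(p)) == chosen).astype(float)
    p += learning_rate * beta * (indicator - p)
    return p

def select_from_subset(p, subset, rng):
    """Sample an action from a restricted subset, renormalizing its probabilities,
    as a stand-in for working with a reduced set of control actions."""
    q = np.array([p[a] for a in subset])
    q /= q.sum()
    return subset[rng.choice(len(subset), p=q)]
```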

01 Jan 1993
TL;DR: A self-improving reactive control system for autonomous robotic navigation that combines case-based reasoning and reinforcement learning to continuously tune the navigation system through experience, resulting in an improved library of cases that capture environmental regularities necessary to perform on-line adaptation.
Abstract: This paper presents a self-improving reactive control system for autonomous robotic navigation. The navigation module uses a schema-based reactive control system to perform the navigation task. The learning module combines case-based reasoning and reinforcement learning to continuously tune the navigation system through experience. The case-based reasoning component perceives and characterizes the system's environment, retrieves an appropriate case, and uses the recommendations of the case to tune the parameters of the reactive control system. The reinforcement learning component refines the content of the cases based on the current experience. Together, the learning components perform on-line adaptation, resulting in improved performance as the reactive control system tunes itself to the environment, as well as on-line learning, resulting in an improved library of cases that capture environmental regularities necessary to perform on-line adaptation. The system is extensively evaluated through simulation studies using several performance metrics and system configurations.