
Showing papers on "Reinforcement learning" published in 1999


Proceedings Article
29 Nov 1999
TL;DR: This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
Abstract: Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
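In standard notation (a sketch of the result the abstract refers to, not a quotation from the paper), the policy gradient theorem can be written as

\[ \nabla_\theta J(\theta) \;=\; \sum_{s} d^{\pi}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s,a) \;=\; \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a) \big], \]

where d^{\pi} is the on-policy state distribution; Q^{\pi} may be replaced by a compatible approximation or by an advantage function without biasing the gradient, which is what makes estimation from experience possible.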

5,492 citations


Journal ArticleDOI
Vladimir Vapnik
TL;DR: This article demonstrates how the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms, and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems.
Abstract: Statistical learning theory was introduced in the late 1960's. Until the 1990's it was a purely theoretical analysis of the problem of function estimation from a given collection of data. In the middle of the 1990's new types of learning algorithms (called support vector machines) based on the developed theory were proposed. This made statistical learning theory not only a tool for the theoretical analysis but also a tool for creating practical algorithms for estimating multidimensional functions. This article presents a very general overview of statistical learning theory including both theoretical and algorithmic aspects of the theory. The goal of this overview is to demonstrate how the abstract learning theory established conditions for generalization which are more general than those discussed in classical statistical paradigms and how the understanding of these conditions inspired new algorithmic approaches to function estimation problems.
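A representative result from this theory, in its common textbook form (an illustrative statement, not necessarily the exact bound discussed in the article): with probability at least 1 - \eta, a classifier chosen from a class of VC dimension h using l training examples satisfies

\[ R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) + \sqrt{\frac{h\big(\ln\tfrac{2l}{h} + 1\big) - \ln\tfrac{\eta}{4}}{l}}, \]

so generalization is controlled by the capacity h of the function class rather than by the dimensionality of the input, which is the kind of condition the overview refers to.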

5,370 citations


Journal ArticleDOI
TL;DR: It is shown that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way, and may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning.
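The interchangeability with primitive actions can be illustrated by the SMDP-style Q-learning update commonly used with options (standard notation, a sketch rather than a quotation from the paper): after executing option o from state s for k steps, observing cumulative discounted reward r and resulting state s',

\[ Q(s, o) \;\leftarrow\; Q(s, o) + \alpha\,\big[\, r + \gamma^{k} \max_{o'} Q(s', o') - Q(s, o) \,\big], \]

which reduces to ordinary Q-learning when every option is a one-step primitive action.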

3,233 citations


Journal ArticleDOI
TL;DR: Experience-weighted attraction (EWA), as described in this paper, includes reinforcement learning and belief learning as special cases and hybridizes their key elements, allowing attractions to begin and grow flexibly as choice reinforcement does but reinforcing unchosen strategies substantially as belief-based models implicitly do.
Abstract: In ‘experience-weighted attraction’ (EWA) learning, strategies have attractions that reflect initial predispositions, are updated based on payoff experience, and determine choice probabilities according to some rule (e.g., logit). A key feature is a parameter δ that weights the strength of hypothetical reinforcement of strategies that were not chosen according to the payoff they would have yielded, relative to reinforcement of chosen strategies according to received payoffs. The other key features are two discount rates, φ and ρ, which separately discount previous attractions, and an experience weight. EWA includes reinforcement learning and weighted fictitious play (belief learning) as special cases, and hybridizes their key elements. When δ= 0 and ρ= 0, cumulative choice reinforcement results. When δ= 1 and ρ=φ, levels of reinforcement of strategies are exactly the same as expected payoffs given weighted fictitious play beliefs. Using three sets of experimental data, parameter estimates of the model were calibrated on part of the data and used to predict a holdout sample. Estimates of δ are generally around .50, φ around .8 − 1, and ρ varies from 0 to φ. Reinforcement and belief-learning special cases are generally rejected in favor of EWA, though belief models do better in some constant-sum games. EWA is able to combine the best features of previous approaches, allowing attractions to begin and grow flexibly as choice reinforcement does, but reinforcing unchosen strategies substantially as belief-based models implicitly do.
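A minimal sketch of the attraction update described above, assuming the standard EWA parameterisation (the variable names and the logit response below are ours, not the paper's):

    import numpy as np

    def ewa_update(attractions, N, chosen, payoffs, delta, phi, rho):
        # One EWA attraction update for a single player.
        # attractions: NumPy array of A_j(t-1), one entry per strategy j
        # N: experience weight N(t-1); chosen: index of the strategy actually played
        # payoffs: payoff each strategy would have earned against the observed opponents
        # delta: weight on foregone payoffs; phi, rho: discount rates
        N_new = rho * N + 1.0
        weight = delta + (1.0 - delta) * (np.arange(len(attractions)) == chosen)
        attractions_new = (phi * N * attractions + weight * payoffs) / N_new
        return attractions_new, N_new

    def logit_choice_probabilities(attractions, lam=1.0):
        # Map attractions to choice probabilities with a logit rule.
        z = lam * attractions - np.max(lam * attractions)   # subtract max for stability
        p = np.exp(z)
        return p / p.sum()

With delta = 0 and rho = 0 the update reduces to cumulative choice reinforcement, and with delta = 1 and rho = phi the attractions track weighted fictitious-play expected payoffs, matching the special cases listed in the abstract.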

1,450 citations


Journal ArticleDOI
TL;DR: This paper investigates how the learning modules specialized for these three kinds of learning can be assembled into goal-oriented behaving systems and presents a novel view that their computational roles can be characterized by asking what are the "goals" of their computation.

734 citations


Book
12 Feb 1999
TL;DR: Systems That Learn presents a mathematical framework for the study of learning in a variety of domains that provides the basic concepts and techniques of learning theory as well as a comprehensive account of what is currently known about a range of learning paradigms.
Abstract: Formal learning theory is one of several mathematical approaches to the study of intelligent adaptation to the environment. The analysis developed in this book is based on a number theoretical approach to learning and uses the tools of recursive-function theory to understand how learners come to an accurate view of reality. This revised and expanded edition of a successful text provides a comprehensive, self-contained introduction to the concepts and techniques of the theory. Exercises throughout the text provide experience in the use of computational arguments to prove facts about learning.

582 citations


Proceedings Article
31 Jul 1999
TL;DR: In this paper, the authors present an algorithm that, given only a generative model (simulator) for an arbitrary MDP, performs near-optimal planning with a running time that has no dependence on the number of states.
Abstract: An issue that is critical for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or even infinite state spaces, traditional planning and reinforcement learning algorithms are often inapplicable, since their running time typically scales linearly with the state space size. In this paper we present a new algorithm that, given only a generative model (simulator) for an arbitrary MDP, performs near-optimal planning with a running time that has no dependence on the number of states. Although the running time is exponential in the horizon time (which depends only on the discount factor γ and the desired degree of approximation to the optimal policy), our results establish for the first time that there are no theoretical barriers to computing near-optimal policies in arbitrarily large, unstructured MDPs. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs [KMN99].
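A minimal sketch of the sparse-sampling planner described in the abstract, assuming a generative_model(state, action) -> (next_state, reward) simulator interface (the interface and the parameter choices are illustrative, not the paper's):

    def sparse_q(generative_model, state, actions, depth, width, gamma):
        # Estimate Q(state, a) for every action by building a sampled look-ahead tree.
        if depth == 0:
            return {a: 0.0 for a in actions}
        q = {}
        for a in actions:
            total = 0.0
            for _ in range(width):                    # 'width' samples per action
                next_state, reward = generative_model(state, a)
                child = sparse_q(generative_model, next_state, actions,
                                 depth - 1, width, gamma)
                total += reward + gamma * max(child.values())
            q[a] = total / width
        return q

    def plan(generative_model, state, actions, depth=3, width=5, gamma=0.95):
        q = sparse_q(generative_model, state, actions, depth, width, gamma)
        return max(q, key=q.get)

The cost is on the order of (|A| * width)^depth calls to the simulator and is independent of the number of states, which is the property the abstract emphasises; the exponential dependence on depth corresponds to the exponential dependence on the horizon time.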

491 citations


Book
01 Jan 1999
TL;DR: A kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces and shows that the limiting distribution of the value function estimate is a Gaussian process.
Abstract: We present a kernel-based approach to reinforcement learning that overcomes the stability problems of temporal-difference learning in continuous state-spaces. First, our algorithm converges to a unique solution of an approximate Bellman's equation regardless of its initialization values. Second, the method is consistent in the sense that the resulting policy converges asymptotically to the optimal policy. Parametric value function estimates such as neural networks do not possess this property. Our kernel-based approach also allows us to show that the limiting distribution of the value function estimate is a Gaussian process. This information is useful in studying the bias-variance tradeoff in reinforcement learning. We find that all reinforcement learning approaches to estimating the value function, parametric or non-parametric, are subject to a bias. This bias is typically larger in reinforcement learning than in a comparable regression problem.
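The kernel-based approximation has roughly the following form (a sketch in standard notation, not the paper's exact operator): given sampled transitions (s_i, r_i, s'_i) generated by each action a, the approximate Bellman backup averages sampled outcomes with normalised kernel weights,

\[ (\hat{T}\hat{V})(s) \;=\; \max_{a} \sum_{i \in S_a} \frac{k\!\big(\lVert s - s_i \rVert / b\big)}{\sum_{j \in S_a} k\!\big(\lVert s - s_j \rVert / b\big)}\,\big[\, r_i + \gamma\, \hat{V}(s'_i) \,\big], \]

where S_a indexes the transitions generated by action a and b is a bandwidth. Because the operator is a weighted averager it remains a contraction, which is the intuition behind the stability and convergence properties claimed in the abstract.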

419 citations


01 Jan 1999
TL;DR: This paper gives a theoretical account of the phenomenon, deriving conditions under which one may expect it to cause learning to fail, and presents experimental results which support the theoretical findings.
Abstract: Reinforcement learning techniques address the problem of learning to select actions in unknown, dynamic environments. It is widely acknowledged that to be of use in complex domains, reinforcement learning techniques must be combined with generalizing function approximation methods such as artificial neural networks. Little, however, is understood about the theoretical properties of such combinations, and many researchers have encountered failures in practice. In this paper we identify a prime source of such failures—namely, a systematic overestimation of utility values. Using Watkins’ Q-Learning [18] as an example, we give a theoretical account of the phenomenon, deriving conditions under which one may expect it to cause learning to fail. Employing some of the most popular function approximators, we present experimental results which support the theoretical findings.
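The source of the overestimation can be seen in a toy simulation (illustrative numbers only): when every action's estimated value carries independent zero-mean error, the max over actions used in the Q-learning target is biased upward.

    import random

    def overestimation_demo(true_values=(1.0, 1.0, 1.0, 1.0, 1.0),
                            noise=0.3, trials=100000):
        # Compare the true maximum with the average of the max over noisy estimates.
        true_max = max(true_values)
        mean_est_max = sum(
            max(q + random.uniform(-noise, noise) for q in true_values)
            for _ in range(trials)
        ) / trials
        print("true max:", true_max, "average estimated max:", round(mean_est_max, 3))

    overestimation_demo()   # the average estimated max exceeds the true max

Bootstrapping then propagates and compounds this bias through successive backups, which is the failure mode the paper analyses.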

367 citations


Journal ArticleDOI
TL;DR: Strengths and weaknesses of the evolutionary approach to reinforcement learning are presented, along with a survey of representative applications.
Abstract: There are two distinct approaches to solving reinforcement learning problems, namely, searching in value function space and searching in policy space. Temporal difference methods and evolutionary algorithms are well-known examples of these approaches. Kaelbling, Littman and Moore recently provided an informative survey of temporal difference methods. This article focuses on the application of evolutionary algorithms to the reinforcement learning problem, emphasizing alternative policy representations, credit assignment methods, and problem-specific genetic operators. Strengths and weaknesses of the evolutionary approach to reinforcement learning are presented, along with a survey of representative applications.
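As a minimal illustration of searching directly in policy space (a generic (1+1)-style sketch, not any specific algorithm surveyed in the article):

    def evolutionary_policy_search(evaluate, mutate, initial_policy, generations=200):
        # Keep a single candidate policy; accept mutations that do not hurt its
        # measured episodic return.  'evaluate' and 'mutate' are user-supplied
        # placeholders, not functions from the article.
        best = initial_policy
        best_return = evaluate(best)
        for _ in range(generations):
            child = mutate(best)
            child_return = evaluate(child)
            if child_return >= best_return:
                best, best_return = child, child_return
        return best

Credit assignment here happens only at the level of whole policies via their episodic return, in contrast to the per-step temporal-difference updates of value-function methods.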

351 citations


Proceedings Article
27 Jun 1999
TL;DR: This paper presents a simpler derivation of the LSTD algorithm, which generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression.
Abstract: TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it makes inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (Bradtke and Barto, 1996) eliminates all stepsize parameters and improves data efficiency. This paper extends Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting algorithm is shown to be a practical formulation of supervised linear regression. Third, it presents a novel, intuitive interpretation of LSTD as a model-based reinforcement learning technique.
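A minimal sketch of the batch LSTD solution for the λ = 0 case with linear features (the LSTD(λ) generalisation additionally maintains eligibility traces; variable names are ours):

    import numpy as np

    def lstd0(transitions, features, gamma=0.95, ridge=1e-6):
        # transitions: list of (state, reward, next_state) generated by a fixed policy
        # features: function mapping a state to a NumPy feature vector phi(s)
        k = len(features(transitions[0][0]))
        A = ridge * np.eye(k)          # small ridge term keeps A invertible
        b = np.zeros(k)
        for s, r, s_next in transitions:
            phi, phi_next = features(s), features(s_next)
            A += np.outer(phi, phi - gamma * phi_next)
            b += r * phi
        return np.linalg.solve(A, b)   # weights w of the value estimate V(s) = w . phi(s)

No stepsize schedule appears anywhere, and every observed transition contributes to A and b, which is the data-efficiency argument made in the abstract.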

Journal ArticleDOI
TL;DR: The striking similarities in teaching signals and learning behavior between the computational and biological results suggest that dopamine-like reward responses may serve as effective teaching signals for learning behavioral tasks that are typical for primate cognitive behavior, such as spatial delayed responding.
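The teaching signal referred to here is standardly modelled as the temporal-difference prediction error (a sketch of the modelling assumption in common notation, not a quotation from the article):

\[ \delta_t \;=\; r_t + \gamma\, V(s_{t+1}) - V(s_t), \]

a prediction error that is positive when reward or predicted value arrives unexpectedly and negative when a predicted reward fails to occur, mirroring the phasic dopamine responses the comparison is based on.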

Proceedings Article
27 Jun 1999
TL;DR: This paper presents an algorithm for learning a value function that maps hyperlinks to future discounted reward using a naive Bayes text classifier and shows a threefold improvement in spidering efficiency over traditional breadth-first search, and up to a two-fold improvement over reinforcement learning with immediate reward.
Abstract: Consider the task of exploring the Web in order to find pages of a particular kind or on a particular topic. This task arises in the construction of search engines and Web knowledge bases. This paper argues that the creation of efficient web spiders is best framed and solved by reinforcement learning, a branch of machine learning that concerns itself with optimal sequential decision making. One strength of reinforcement learning is that it provides a formalism for measuring the utility of actions that give benefit only in the future. We present an algorithm for learning a value function that maps hyperlinks to future discounted reward using a naive Bayes text classifier. Experiments on two real-world spidering tasks show a threefold improvement in spidering efficiency over traditional breadth-first search, and up to a two-fold improvement over reinforcement learning with immediate reward only.
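A minimal sketch of the training-time target the abstract describes, assuming we can measure how many on-topic pages lie beyond each hyperlink (the function and parameters are illustrative):

    def discounted_link_value(hops_to_targets, gamma=0.5):
        # Value of following a hyperlink: each on-topic page reachable through it
        # contributes a reward discounted by the number of hops needed to reach it.
        return sum(gamma ** hops for hops in hops_to_targets)

    # Example: a link leading to two papers one hop away and one paper three hops away.
    print(discounted_link_value([1, 1, 3]))   # 0.5 + 0.5 + 0.125 = 1.125

The abstract's value function is then obtained by training a naive Bayes text classifier on the text surrounding each link to predict (a discretized version of) this target, so the spider can prefer links with high predicted future reward.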

Proceedings Article
29 Nov 1999
TL;DR: A general software tool (RLDS, for Reinforcement Learning for Dialogue Systems) based on the MDP framework is built and applied to dialogue corpora gathered from two dialogue systems built at AT&T Labs, demonstrating that RLDS holds promise as a tool for "browsing" and understanding correlations in complex, temporally dependent dialogue corpora.
Abstract: Recently, a number of authors have proposed treating dialogue systems as Markov decision processes (MDPs). However, the practical application of MDP algorithms to dialogue systems faces a number of severe technical challenges. We have built a general software tool (RLDS, for Reinforcement Learning for Dialogue Systems) based on the MDP framework, and have applied it to dialogue corpora gathered from two dialogue systems built at AT&T Labs. Our experiments demonstrate that RLDS holds promise as a tool for "browsing" and understanding correlations in complex, temporally dependent dialogue corpora.

Journal ArticleDOI
TL;DR: Algorithms for learning the optimal policy of a Markov decision process (MDP) based on simulated transitions are formulated and analyzed, which are variants of the well-known "actor-critic" (or "adaptive critic") algorithm in the artificial intelligence literature.
Abstract: Algorithms for learning the optimal policy of a Markov decision process (MDP) based on simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm in the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis involves two time scale stochastic approximations.
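A representative two-timescale scheme of the kind analysed (a generic sketch in standard notation, not the paper's exact recursions): with a linear critic V_v(s) = v^T φ(s) and a parameterised policy π_θ,

\[ \delta_t = r_t + \gamma V_{v_t}(s_{t+1}) - V_{v_t}(s_t), \qquad v_{t+1} = v_t + \beta_t\, \delta_t\, \phi(s_t), \qquad \theta_{t+1} = \theta_t + \alpha_t\, \delta_t\, \nabla_\theta \log \pi_{\theta_t}(a_t \mid s_t), \]

with stepsizes chosen so that α_t / β_t → 0, i.e. the critic runs on the faster time scale and the actor effectively sees a converged critic.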

Proceedings Article
30 Jul 1999
TL;DR: In this paper, the authors investigate ways to represent and reason about the agent's uncertainty about its value estimates in algorithms where the system attempts to learn a model of its environment; they explicitly represent uncertainty about the parameters of the model and build probability distributions over Q-values based on these.
Abstract: Reinforcement learning systems are often concerned with balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information - the expected improvement in future decision quality arising from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states. In this paper we investigate ways to represent and reason about this uncertainty in algorithms where the system attempts to learn a model of its environment. We explicitly represent uncertainty about the parameters of the model and build probability distributions over Q-values based on these. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation.
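A sample-based sketch of the myopic value-of-information computation described above (the paper derives closed-form expressions for particular Q-value distributions; Monte Carlo is used here only for illustration):

    import numpy as np

    def myopic_voi_scores(q_samples):
        # q_samples: array of shape (n_samples, n_actions); each row is a joint
        # draw from the agent's current distribution over Q-values for one state.
        means = q_samples.mean(axis=0)
        order = np.argsort(means)
        best, second = order[-1], order[-2]
        scores = np.zeros(len(means))
        for a in range(len(means)):
            draws = q_samples[:, a]
            if a == best:
                # Gain if the presumed-best action turns out worse than the runner-up.
                gain = np.maximum(means[second] - draws, 0.0)
            else:
                # Gain if this action turns out better than the presumed-best action.
                gain = np.maximum(draws - means[best], 0.0)
            scores[a] = means[a] + gain.mean()
        return scores            # pick the action with the highest score

Exploration is thus worth exactly as much as the expected improvement in the subsequent choice of action, which is the value-of-information balance the abstract describes.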


Posted Content
TL;DR: This paper surveys the emerging science of how to design a “COllective INtelligence” (COIN) and introduces an entirely new, profound design problem: Assuming the RL algorithms are able to achieve high rewards, what reward functions for the individual agents will result in high world utility?
Abstract: This paper surveys the emerging science of how to design a “COllective INtelligence” (COIN). A COIN is a large multi-agent system where: i) There is little to no centralized communication or control. ii) There is a provided world utility function that rates the possible histories of the full system. In particular, we are interested in COINs in which each agent runs a reinforcement learning (RL) algorithm. The conventional approach to designing large distributed systems to optimize a world utility does not use agents running RL algorithms. Rather, that approach begins with explicit modeling of the dynamics of the overall system, followed by detailed hand-tuning of the interactions between the components to ensure that they “cooperate” as far as the world utility is concerned. This approach is labor-intensive, often results in highly nonrobust systems, and usually results in design techniques that have limited applicability. In contrast, we wish to solve the COIN design problem implicitly, via the “adaptive” character of the RL algorithms of each of the agents. This approach introduces an entirely new, profound design problem: Assuming the RL algorithms are able to achieve high rewards, what reward functions for the individual agents will, when pursued by those agents, result in high world utility? In other words, what reward functions will best ensure that we do not have phenomena like the tragedy of the commons, Braess’s paradox, or the liquidity trap? Although still very young, research specifically concentrating on the COIN design problem has already resulted in successes in artificial domains, in particular in packet-routing, the leader-follower problem, and in variants of Arthur’s El Farol bar problem. It is expected that as it matures and draws upon other disciplines related to COINs, this research will greatly expand the range of tasks addressable by human engineers. Moreover, in addition to drawing on them, such a fully developed science of COIN design may provide much insight into other already established scientific fields, such as economics, game theory, and population biology.

Journal ArticleDOI
TL;DR: A new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique) is presented, along with a detailed study of the algorithm on a combinatorially large problem of determining the optimal preventive maintenance schedule of a production inventory system.
Abstract: A large class of problems of sequential decision making under uncertainty, of which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require for each action the one step transition probability and reward matrices, and obtaining these is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL), for computing near optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling, and dynamic channel allocation of cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks that are referred to as semi-Markov decision problems (SMDPs). In particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique). We present a detailed study of this algorithm on a combinatorially large problem of determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.
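The core update in an average-reward semi-Markov setting has roughly the following shape (our paraphrase in standard notation, not the paper's exact statement of SMART): after taking action a in state s, receiving reward r over a sojourn time τ and landing in s',

\[ R(s, a) \;\leftarrow\; (1 - \alpha)\, R(s, a) + \alpha\,\Big[\, r - \bar{\rho}\, \tau + \max_{a'} R(s', a') \,\Big], \]

where \bar{\rho} is a running estimate of the average reward per unit time; subtracting \bar{\rho}τ is what adapts Q-learning-style backups to the average-reward SMDP criterion.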

Proceedings Article
31 Jul 1999
TL;DR: The use of machine learning techniques is proposed to greatly automate the creation and maintenance of domain-specific search engines, and new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments is described.
Abstract: Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers available at www.cora.justresearch.com.

Proceedings Article
31 Jul 1999
TL;DR: This work presents a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN).
Abstract: We present a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN). Our algorithm generalizes the recent E3 algorithm of Kearns and Singh, and assumes that we are given both an algorithm for approximate planning, and the graphical structure (but not the parameters) of the DBN. Unlike the original E3 algorithm, our new algorithm exploits the DBN structure to achieve a running time that scales polynomially in the number of parameters of the DBN, which may be exponentially smaller than the number of global states.
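The factored transition model assumed by the algorithm has the usual DBN form (standard notation): each next-state variable depends only on a small set of parent variables and the action,

\[ P(s' \mid s, a) \;=\; \prod_{i=1}^{n} P\big(s'_i \;\big|\; \mathrm{pa}_i(s),\, a\big), \]

so the number of parameters grows with the sizes of the parent sets rather than with the number of global states, which is what makes the polynomial running time possible.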

Proceedings ArticleDOI
27 Jun 1999
TL;DR: An algorithm for distributed reinforcement learning based on distributing the representation of the value function across nodes is proposed; results show that the distributed value function algorithm outperforms the alternatives, and an analysis identifies which problems are best suited for distributed value functions.
Abstract: Many interesting problems, such as power grids, network switches, and traffic flow, that are candidates for solving with reinforcement learning (RL), also have properties that make distributed solutions desirable. We propose an algorithm for distributed reinforcement learning based on distributing the representation of the value function across nodes. Each node in the system only has the ability to sense state locally, choose actions locally, and receive reward locally (the goal of the system is to maximize the sum of the rewards over all nodes and over all time). However each node is allowed to give its neighbors the current estimate of its value function for the states it passes through. We present a value function learning rule, using that information, that allows each node to learn a value function that is an estimate of a weighted sum of future rewards for all the nodes in the network. With this representation, each node can choose actions to improve the performance of the overall system. We demonstrate our algorithm on the distributed control of a simulated power grid. We compare it against other methods including: use of a global reward signal, nodes that act locally with no communication, and nodes that share rewards (but not value function) information with each other. Our results show that the distributed value function algorithm outperforms the others, and we conclude with an analysis of what problems are best suited for distributed value functions and the new research directions opened up by this work.
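The value-function learning rule sketched in the abstract has roughly the following shape (our paraphrase, with w_{ij} denoting the weight node i places on neighbour j):

\[ V_i(s_i) \;\leftarrow\; (1 - \alpha)\, V_i(s_i) + \alpha\,\Big[\, r_i + \gamma \sum_{j \in \mathcal{N}(i)} w_{ij}\, V_j(s'_j) \,\Big], \qquad \sum_{j \in \mathcal{N}(i)} w_{ij} = 1, \]

so each node's estimate bootstraps from its neighbours' estimates and thereby tracks a weighted sum of future rewards accruing across the network rather than only its own local reward.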

Journal ArticleDOI
TL;DR: A powerful new theorem is presented that can provide a unified analysis of value-function-based reinforcement-learning algorithms and allows the convergence of a complex asynchronous reinforcement- learning algorithm to be proved by verifying that a simpler synchronous algorithm converges.
Abstract: Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.

Proceedings Article
29 Nov 1999
TL;DR: Upper bounds on the sample complexity are proved showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class Π, and exponentially on the horizon time.
Abstract: We consider the problem of reliably choosing a near-best strategy from a restricted class of strategies Π in a partially observable Markov decision process (POMDP). We assume we are given the ability to simulate the POMDP, and study what might be called the sample complexity -- that is, the amount of data one must generate in the POMDP in order to choose a good strategy. We prove upper bounds on the sample complexity showing that, even for infinitely large and arbitrarily complex POMDPs, the amount of data needed can be finite, and depends only linearly on the complexity of the restricted strategy class Π, and exponentially on the horizon time. This latter dependence can be eased in a variety of ways, including the application of gradient and local search algorithms. Our measure of complexity generalizes the classical supervised learning notion of VC dimension to the settings of reinforcement learning and planning.

Posted Content
TL;DR: In this paper, the authors compare the properties of reinforcement learning and stochastic fictitious play and find that they are far more similar than previously thought; in particular, exponential fictitious play and a suitably perturbed reinforcement learning model have the same expected motion and therefore the same asymptotic behaviour.
Abstract: Reinforcement learning and stochastic fictitious play are apparent rivals as models of human learning. They embody quite different assumptions about the processing of information and optimisation. This paper compares their properties and finds that they are far more similar than previously thought. In particular, exponential fictitious play and a suitably perturbed reinforcement learning model have the same expected motion and therefore will have the same asymptotic behaviour. It is also shown that, for more general models of stochastic fictitious play and perturbed reinforcement learning, the principal remaining difference between the two models is speed: stochastic fictitious play gives rise to faster learning.

Journal ArticleDOI
TL;DR: This paper proposes an alternative approach to solving the dynamic channel assignment (DCA) problem through a form of real-time reinforcement learning known as Q learning, which is able to perform better than the FCA in various situations and is capable of achieving a similar performance to that achieved by MAXAVAIL, but with a significantly reduced computational complexity.
Abstract: This paper deals with the problem of channel assignment in mobile communication systems. In particular, we propose an alternative approach to solving the dynamic channel assignment (DCA) problem through a form of real-time reinforcement learning known as Q learning. Instead of relying on a known teacher, the system is designed to learn an optimal assignment policy by directly interacting with the mobile communication environment. The performance of the Q-learning-based DCA was examined by extensive simulation studies on a 49-cell mobile communication system under various conditions including homogeneous and inhomogeneous traffic distributions, time-varying traffic patterns, and channel failures. Comparative studies with the fixed channel assignment (FCA) scheme and one of the best dynamic channel assignment strategies (MAXAVAIL) have revealed that the proposed approach is able to perform better than the FCA in various situations and is capable of achieving a similar performance to that achieved by MAXAVAIL, but with a significantly reduced computational complexity.
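A minimal tabular Q-learning step of the kind the approach builds on (a generic sketch; the state encoding, reward and transition functions below are placeholders rather than the paper's design):

    import random
    from collections import defaultdict

    Q = defaultdict(float)                 # Q[(state, channel)], initialised to zero

    def dca_step(state, available_channels, step_fn,
                 alpha=0.1, gamma=0.9, epsilon=0.05):
        # Choose a channel epsilon-greedily among those currently available,
        # observe the outcome, and apply the standard Q-learning backup.
        if random.random() < epsilon:
            channel = random.choice(available_channels)
        else:
            channel = max(available_channels, key=lambda c: Q[(state, c)])
        reward, next_state, next_channels = step_fn(state, channel)
        best_next = max((Q[(next_state, c)] for c in next_channels), default=0.0)
        Q[(state, channel)] += alpha * (reward + gamma * best_next - Q[(state, channel)])
        return next_state, next_channels

The learned Q-table plays the role of the assignment policy: for each call arrival the system simply picks the available channel with the largest Q-value, with no teacher or explicit traffic model required.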

01 Jan 1999
TL;DR: This thesis aims for a middle ground between algorithms that don't scale well because they use an impoverished representation for the evaluation function, and algorithms that cannot be analyzed because they use too complicated a representation.
Abstract: One of the basic problems of machine learning is deciding how to act in an uncertain world. For example, if I want my robot to bring me a cup of coffee, it must be able to compute the correct sequence of electrical impulses to send to its motors to navigate from the coffee pot to my office. In fact, since the results of its actions are not completely predictable, it is not enough just to compute the correct sequence; instead the robot must sense and correct for deviations from its intended path. In order for any machine learner to act reasonably in an uncertain environment, it must solve problems like the above one quickly and reliably. Unfortunately, the world is often so complicated that it is difficult or impossible to find the optimal sequence of actions to achieve a given goal. So, in order to scale our learners up to real-world problems, we usually must settle for approximate solutions. One representation for a learner's environment and goals is a Markov decision process or MDP. MDPs allow us to represent actions that have probabilistic outcomes, and to plan for complicated, temporally-extended goals. An MDP consists of a set of states that the environment can be in, together with rules for how the environment can change state and for what the learner is supposed to do. One way to approach a large MDP is to try to compute an approximation to its optimal state evaluation function, the function which tells us how much reward the learner can be expected to achieve if the world is in a particular state. If the approximation is good enough, we can use a shallow search to find a good action from most states. Researchers have tried many different ways to approximate evaluation functions. This thesis aims for a middle ground, between algorithms that don't scale well because they use an impoverished representation for the evaluation function and algorithms that we can't analyze because they use too complicated a representation.
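The "shallow search" over an approximate evaluation function can be sketched as a one-step lookahead (assuming a model(state, action) returning (probability, next_state, reward) triples; this interface is illustrative, not the thesis's):

    def greedy_action(V, state, actions, model, gamma=0.95):
        # Pick the action whose one-step backup through the approximate
        # evaluation function V looks best.
        def backup(a):
            return sum(p * (r + gamma * V(s2)) for p, s2, r in model(state, a))
        return max(actions, key=backup)

If V is close enough to the optimal evaluation function, this greedy choice is close to optimal, which is why the thesis focuses on how such approximations can be represented and computed.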

01 Jan 1999
TL;DR: New research in reinforcement learning, information extraction and text classification that enables efficient spidering, identifying informative text segments, and populating topic hierarchies is described.
Abstract: Domain-specific search engines are growing in popularity because they offer increased accuracy and extra functionality not possible with the general, Web-wide search engines. For example, www.campsearch.com allows complex queries by age-group, size, location and cost over summer camps. Unfortunately these domain-specific search engines are difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, identifying informative text segments, and populating topic hierarchies. Using these techniques, we have built a demonstration system: a search engine for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com.

Book ChapterDOI
06 Dec 1999
TL;DR: A method suitable for control tasks which require continuous actions, in response to continuous states, is described; the system consists of a neural network coupled with a novel interpolator.
Abstract: Q-learning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q-learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of Q-learning, is shown to enhance learning speed and reliability for this task.

Journal ArticleDOI
TL;DR: A vision-based reinforcement learning method that acquires cooperative behaviors in a dynamic environment is presented, using the robot soccer game initiated by RoboCup to illustrate the effectiveness of the proposed method.