
Showing papers on "Reinforcement learning published in 2002"


Journal ArticleDOI
TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
Abstract: Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the time. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

6,361 citations
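As an illustration of the kind of index policy the abstract refers to, here is a minimal sketch of a UCB1-style rule (play the arm with the highest empirical mean plus a confidence bonus); the Bernoulli reward model and arm means below are assumptions for the demo, not taken from the paper.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1-style index policy: play the arm maximizing
    mean reward + sqrt(2 ln t / n_pulls), which yields logarithmic
    regret for rewards bounded in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:                      # play each arm once first
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts, sums

# Illustrative Bernoulli bandit (the arm means are assumptions for the demo).
means = [0.2, 0.5, 0.7]
counts, _ = ucb1(lambda a: float(random.random() < means[a]), len(means), 10_000)
print(counts)   # most pulls should concentrate on the best arm
```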


Journal ArticleDOI
TL;DR: This paper presented a unified account of two neural systems concerned with the development and expression of adaptive behaviors: a mesencephalic dopamine system for reinforcement learning and a generic error-processing system associated with the anterior cingulate cortex.
Abstract: The authors present a unified account of 2 neural systems concerned with the development and expression of adaptive behaviors: a mesencephalic dopamine system for reinforcement learning and a “generic” error-processing system associated with the anterior cingulate cortex. The existence of the error-processing system has been inferred from the error-related negativity (ERN), a component of the event-related brain potential elicited when human participants commit errors in reaction-time tasks. The authors propose that the ERN is generated when a negative reinforcement learning signal is conveyed to the anterior cingulate cortex via the mesencephalic dopamine system and that this signal is used by the anterior cingulate cortex to modify performance on the task at hand. They provide support for this proposal using both computational modeling and psychophysiological experimentation.

3,438 citations
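The "negative reinforcement learning signal" invoked in this account is commonly modeled as a temporal-difference (TD) prediction error. The sketch below is a generic TD-error computation for illustration only, not the authors' specific model of the ERN; the toy rewards and values are assumptions.

```python
def td_errors(rewards, values, gamma=0.9):
    """Generic temporal-difference errors delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    A negative delta (outcome worse than expected) is the kind of signal the
    paper proposes is relayed to anterior cingulate cortex; the numbers here
    are purely illustrative."""
    deltas = []
    for t, r in enumerate(rewards):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        deltas.append(r + gamma * v_next - values[t])
    return deltas

# Toy episode: the expected reward fails to arrive at the last step -> negative error.
print(td_errors(rewards=[0.0, 0.0, 0.0], values=[0.5, 0.7, 0.9]))
```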


Journal ArticleDOI
TL;DR: NeuroEvolution of Augmenting Topologies (NEAT), as presented in this paper, employs a principled method of crossover of different topologies, protects structural innovation using speciation, and incrementally grows networks from minimal structure.
Abstract: An important question in neuroevolution is how to gain an advantage from evolving neural network topologies along with weights. We present a method, NeuroEvolution of Augmenting Topologies (NEAT), which outperforms the best fixed-topology method on a challenging benchmark reinforcement learning task. We claim that the increased efficiency is due to (1) employing a principled method of crossover of different topologies, (2) protecting structural innovation using speciation, and (3) incrementally growing from minimal structure. We test this claim through a series of ablation studies that demonstrate that each component is necessary to the system as a whole and to each other. What results is significantly faster learning. NEAT is also an important contribution to GAs because it shows how it is possible for evolution to both optimize and complexify solutions simultaneously, offering the possibility of evolving increasingly complex solutions over generations, and strengthening the analogy with biological evolution.

3,265 citations
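Speciation in NEAT relies on a compatibility distance between genomes built from excess genes, disjoint genes, and the weight differences of matching genes. The sketch below is a rough rendering of that idea; the coefficients, genome encoding, and species threshold are illustrative assumptions rather than the paper's exact settings.

```python
def compatibility(genome_a, genome_b, c1=1.0, c2=1.0, c3=0.4):
    """NEAT-style compatibility distance between two genomes, each represented
    here as a dict mapping innovation number -> connection weight.
    Coefficients and the normalization are illustrative choices."""
    innovs_a, innovs_b = set(genome_a), set(genome_b)
    matching = innovs_a & innovs_b
    cutoff = min(max(innovs_a), max(innovs_b))
    non_matching = innovs_a ^ innovs_b
    excess = sum(1 for i in non_matching if i > cutoff)     # beyond the other genome
    disjoint = len(non_matching) - excess                   # within shared range
    w_bar = (sum(abs(genome_a[i] - genome_b[i]) for i in matching) / len(matching)
             if matching else 0.0)
    n = max(len(genome_a), len(genome_b), 1)
    return c1 * excess / n + c2 * disjoint / n + c3 * w_bar

# Two genomes land in the same species if the distance falls under an assumed threshold.
same_species = compatibility({1: 0.5, 2: -0.3, 4: 0.8}, {1: 0.4, 2: -0.1, 3: 0.2}) < 3.0
print(same_species)
```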


Proceedings Article
08 Jul 2002

842 citations


Journal ArticleDOI
TL;DR: This article introduces the WoLF principle, “Win or Learn Fast”, for varying the learning rate, and examines this technique theoretically, proving convergence in self-play on a restricted class of iterated matrix games.

807 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the number of actions required to approach the optimal return is lower bounded by the mixing time of the optimal policy (in the undiscounted case) or by the horizon time T in the discounted case.
Abstract: We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

802 citations


Journal ArticleDOI
10 Oct 2002-Neuron
TL;DR: This work reviews the data and considers the involvement of a rich collection of different neural systems in various aspects of these forms of conditioning, including dopamine, which plays a pivotal, but complicated, role.

778 citations


Proceedings Article
01 Jan 2002
TL;DR: By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system.
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest learning a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly and, due to the properties of well-formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognizing and classifying previously learned movement skills. Evaluations in simulations and on an actual 30-degree-of-freedom humanoid robot exemplify the feasibility and robustness of our approach.

667 citations
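The canonical policies described here behave like dynamic movement primitives: a stable point attractor toward the goal, shaped by a phase-dependent forcing term that stands in for the nonparametric fit. A one-dimensional, Euler-integrated sketch follows; the gains, phase dynamics, and hand-written forcing term are illustrative assumptions, not the paper's exact formulation.

```python
import math

def rollout_dmp(y0, goal, forcing, tau=1.0, alpha=25.0, beta=6.25,
                alpha_x=3.0, dt=0.001, steps=1000):
    """One-dimensional attractor  tau*dv = alpha*(beta*(g - y) - v) + f(x),
    with a phase variable x decaying to 0 so the forcing term vanishes and
    the system falls back to the stable point attractor at the goal."""
    y, v, x = y0, 0.0, 1.0
    traj = []
    for _ in range(steps):
        f = forcing(x)                                   # learned shape term
        v += dt * (alpha * (beta * (goal - y) - v) + f) / tau
        y += dt * v / tau
        x += dt * (-alpha_x * x) / tau                   # canonical phase dynamics
        traj.append(y)
    return traj

# A hand-written forcing term stands in for the nonparametric regression fit.
traj = rollout_dmp(0.0, 1.0, forcing=lambda x: 50.0 * x * math.sin(8.0 * x))
print(traj[-1])   # ends near the goal regardless of the shape of the forcing term
```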


Journal ArticleDOI
TL;DR: The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics to enable multiple model-based reinforcement learning for nonlinear, nonstationary control tasks.
Abstract: We propose a modular reinforcement learning architecture for nonlinear, nonstationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The "responsibility signal," which is given by the softmax function of the prediction errors, is used to weight the outputs of multiple modules, as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL is demonstrated in the discrete case on a nonstationary hunting task in a grid world, and in the continuous case on a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters.

495 citations
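The responsibility signal can be sketched directly as a softmax over (scaled, negated) squared prediction errors, which then gates both the blended control output and the learning updates. The module predictions and the Gaussian-style scaling constant below are illustrative assumptions.

```python
import numpy as np

def responsibilities(pred_errors, sigma=1.0):
    """Softmax of negative scaled squared prediction errors: modules that
    predict the current dynamics well receive responsibility close to 1 and
    dominate both the control output and the learning updates."""
    scores = -np.square(pred_errors) / (2.0 * sigma ** 2)
    scores -= scores.max()                      # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def gated_output(module_actions, pred_errors, sigma=1.0):
    """Blend module controller outputs by their responsibility signals."""
    lam = responsibilities(np.asarray(pred_errors, dtype=float), sigma)
    return lam @ np.asarray(module_actions, dtype=float), lam

# Illustrative numbers: module 0 predicts the next state far better than module 1.
action, lam = gated_output(module_actions=[0.2, -1.0], pred_errors=[0.05, 0.9])
print(lam, action)
```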


Journal ArticleDOI
TL;DR: A TSK-type recurrent fuzzy network (TRFN), built from a series of recurrent fuzzy if-then rules with TSK-type consequent parts, is proposed; it can be designed by either neural network or genetic algorithms, depending on the learning environment.
Abstract: In this paper, a TSK-type recurrent fuzzy network (TRFN) structure is proposed. The TRFN can be designed by either a neural network or genetic algorithms, depending on the learning environment. The network develops from a series of recurrent fuzzy if-then rules with TSK-type consequent parts. The recurrent property comes from feeding the internal variables, derived from fuzzy firing strengths, back to both the network input and output layers. In this configuration, each internal variable is responsible for memorizing the temporal history of its corresponding fuzzy rule. The internal variable is also combined with external input variables in each rule's consequent, which further increases the network's learning ability. TRFN design under different learning environments is then presented. For problems where supervised training data is directly available, TRFN with supervised learning (TRFN-S) is proposed, and a neural network (NN) learning approach is adopted for TRFN-S design. An online learning algorithm with concurrent structure and parameter learning is proposed. With its flexible partitioning of the precondition part and TSK-type consequents, TRFN-S achieves both a small network size and high learning accuracy. For problems where gradient information for NN learning is costly to obtain or unavailable, as in reinforcement learning, TRFN with genetic learning (TRFN-G) is put forward. The precondition parts of TRFN-G are also partitioned in a flexible way, and all free parameters are designed concurrently by genetic algorithm. Owing to the well-designed network structure of TRFN, TRFN-G, like TRFN-S, is characterized by high learning accuracy. To demonstrate the superior properties of TRFN, TRFN-S is applied to dynamic system identification and TRFN-G to dynamic system control. By comparing the results with those of other types of recurrent networks and design configurations, the efficiency of TRFN is verified.

449 citations


Journal ArticleDOI
TL;DR: This paper presents a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states.
Abstract: A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor γ and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration—rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng, Neural Information Processing Systems 13, to appear).
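A rough sketch of the sparse-sampling idea: estimate Q-values at a state by drawing a small, fixed number of next states from the generative model for each action and recursing to a fixed depth, so the per-state cost is independent of the size of the state space. The sample width, depth, and toy chain MDP below are illustrative assumptions, not the bounds derived in the paper.

```python
import random

def sparse_sample_q(generative_model, state, actions, depth, width, gamma=0.95):
    """Estimate Q(state, a) for each action by recursive sparse sampling.
    generative_model(s, a) -> (next_state, reward) is assumed to be a cheap
    simulator; per-state cost depends on width and depth, not on |S|."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):                       # fixed number of samples per (s, a)
            s2, r = generative_model(state, a)
            q_next = sparse_sample_q(generative_model, s2, actions,
                                     depth - 1, width, gamma)
            total += r + gamma * max(q_next.values())
        q[a] = total / width
    return q

def plan(generative_model, state, actions, depth=3, width=5):
    q = sparse_sample_q(generative_model, state, actions, depth, width)
    return max(q, key=q.get)

# Toy chain MDP used only to exercise the planner.
def toy_model(s, a):
    s2 = min(s + 1, 10) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 10 else 0.0) + random.gauss(0.0, 0.01)

print(plan(toy_model, state=5, actions=["right", "left"]))
```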

Proceedings ArticleDOI
07 Aug 2002
TL;DR: This paper introduces a framework for reinforcement learning on mobile robots and describes the experiments using it to learn simple tasks.
Abstract: Programming mobile robots can be a long, time-consuming process. Specifying the low-level mapping from sensors to actuators is prone to programmer misconceptions, and debugging such a mapping can be tedious. The idea of having a robot learn how to accomplish a task, rather than being told explicitly, is an appealing one. It seems easier and much more intuitive for the programmer to specify what the robot should be doing, and to let it learn the fine details of how to do it. In this paper, we introduce a framework for reinforcement learning on mobile robots and describe our experiments using it to learn simple tasks.

Journal ArticleDOI
TL;DR: The design, construction and empirical evaluation of NJFun, an experimental spoken dialogue system that provides users with access to information about fun things to do in New Jersey, are reported on.
Abstract: Designing the dialogue policy of a spoken dialogue system involves many nontrivial choices. This paper presents a reinforcement learning approach for automatically optimizing a dialogue policy, which addresses the technical challenges in applying reinforcement learning to a working dialogue system with human users. We report on the design, construction and empirical evaluation of NJFun, an experimental spoken dialogue system that provides users with access to information about fun things to do in New Jersey. Our results show that by optimizing its performance via reinforcement learning, NJFun measurably improves system performance.

Proceedings Article
08 Jul 2002
TL;DR: These methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appears at the core of both the learning algorithm and the execution architecture.
Abstract: We present several new algorithms for multiagent reinforcement learning. A common feature of these algorithms is a parameterized, structured representation of a policy or value function. This structure is leveraged in an approach we call coordinated reinforcement learning, by which agents coordinate both their action selection activities and their parameter updates. Within the limits of our parametric representations, the agents will determine a jointly optimal action without explicitly considering every possible action in their exponentially large joint action space. Our methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appears at the core of both the learning algorithm and the execution architecture.

Journal ArticleDOI
TL;DR: This paper evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
Abstract: The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous-time, continuous-space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently calculable measure of the extent to which changes in some state affect the value function of other states. Variance is an efficiently calculable measure of how risky some state in a Markov chain is: a low-variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, non-linear, non-minimum-phase, two-dimensional problem of the “Car on the hill”. It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
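The variance measure can be illustrated with the standard Bellman-style recursion for the variance of the discounted return in a Markov chain. The sketch below is a generic construction in that spirit; the kd-trie representation, the influence measure, and the paper's exact implementation are omitted, and the toy chain is an assumption.

```python
import numpy as np

def return_variance(P, R, gamma=0.95):
    """Variance of the discounted return in a Markov chain, via the recursion
        W(s) = sum_s' P[s,s'] * ((R[s,s'] + g*V(s') - V(s))**2 + g**2 * W(s')).
    P is the transition matrix and R the per-transition reward. This is a
    generic construction in the spirit of the paper's 'variance' criterion,
    not its exact implementation."""
    n = P.shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P, (P * R).sum(axis=1))
    td_sq = (P * (R + gamma * V[None, :] - V[:, None]) ** 2).sum(axis=1)
    W = np.linalg.solve(np.eye(n) - gamma ** 2 * P, td_sq)
    return V, W

# Two-state toy chain: state 0 can jump to quite different futures, state 1 is absorbing.
P = np.array([[0.5, 0.5], [0.0, 1.0]])
R = np.array([[0.0, 1.0], [0.0, 0.0]])
V, W = return_variance(P, R)
print(V, W)   # a high-variance state is a natural candidate for splitting
```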

Journal ArticleDOI
TL;DR: This paper updates Bradtke and Barto's work in three significant ways: first, it presents a simpler derivation of the LSTD algorithm; second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression.
Abstract: TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
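A compact sketch of LSTD(λ) with linear features: accumulate the matrix A and vector b from eligibility-trace-weighted transitions, then solve one linear system for the weights, with no stepsize to tune. The feature map, ridge term, and toy episode below are illustrative assumptions.

```python
import numpy as np

def lstd_lambda(episodes, phi, n_features, gamma=0.95, lam=0.7, eps=1e-6):
    """Least-Squares TD(lambda) for linear value functions.
    episodes: list of episodes, each a list of (s, r, s_next, done) tuples.
    phi(s) -> feature vector.  Solves A w = b with
        A += z (phi(s) - gamma*phi(s'))^T,   b += z * r,
    where z is the eligibility trace."""
    A = eps * np.eye(n_features)          # small ridge term for invertibility
    b = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)
        for s, r, s_next, done in episode:
            f = phi(s)
            f_next = np.zeros(n_features) if done else phi(s_next)
            z = gamma * lam * z + f
            A += np.outer(z, f - gamma * f_next)
            b += z * r
    return np.linalg.solve(A, b)

# Tiny 3-state chain with one-hot features, evaluated from a single batch episode.
phi = lambda s: np.eye(3)[s]
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 2, True)]
w = lstd_lambda([episode], phi, n_features=3)
print(w)   # approximate values of states 0, 1, 2 under the sampled policy
```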

Book ChapterDOI
02 Aug 2002
TL;DR: This paper empirically explores a simple approach to creating options based on the intuition that states that are frequently visited on system trajectories could prove to be useful subgoals, and proposes a greedy algorithm for identifying subgoals based on state visitation counts.
Abstract: Temporally extended actions (e.g., macro actions) have proven very useful for speeding up learning, ensuring robustness and building prior knowledge into AI systems. The options framework (Precup, 2000; Sutton, Precup & Singh, 1999) provides a natural way of incorporating such actions into reinforcement learning systems, but leaves open the issue of how good options might be identified. In this paper, we empirically explore a simple approach to creating options. The underlying assumption is that the agent will be asked to perform different goal-achievement tasks in an environment that is otherwise the same over time. Our approach is based on the intuition that states that are frequently visited on system trajectories could prove to be useful subgoals (e.g., McGovern & Barto, 2001; Iba, 1989). We propose a greedy algorithm for identifying subgoals based on state visitation counts. We present empirical studies of this approach in two gridworld navigation tasks. One of the environments we explored contains bottleneck states, and the algorithm indeed finds these states, as expected. The second environment is an empty gridworld with no obstacles. Although the environment does not contain any obvious subgoals, our approach still finds useful options, which essentially allow the agent to explore the environment more quickly.
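The greedy visitation-count idea can be sketched in a few lines: count how often each state appears across recorded trajectories, exclude trivial states, and propose the most-visited states as candidate subgoals. This is a simplified illustration; the paper's exact filtering and option-construction steps are not reproduced, and the toy trajectories are assumptions.

```python
from collections import Counter

def greedy_subgoals(trajectories, n_subgoals=2, exclude=()):
    """Pick candidate option subgoals as the states visited most often across
    past trajectories, greedily, excluding trivial states such as start and
    goal states. A simplified illustration of the idea, not the paper's
    exact algorithm."""
    counts = Counter(s for traj in trajectories for s in traj)
    for s in exclude:
        counts.pop(s, None)
    return [s for s, _ in counts.most_common(n_subgoals)]

# Trajectories in a two-room gridworld all pass through the doorway,
# so the doorway state dominates the counts (toy data below).
trajs = [["A", "B", "door", "C", "G"],
         ["A", "door", "C", "D", "G"],
         ["B", "door", "D", "G"]]
print(greedy_subgoals(trajs, n_subgoals=1, exclude={"G"}))   # ['door']
```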

Proceedings Article
09 Jul 2002
TL;DR: NEAT shows that when structure is evolved with a principled method of crossover, by protecting structural innovation, and through incremental growth from minimal structure, learning is significantly faster and stronger than with the best fixed-topology methods.
Abstract: Neuroevolution is currently the strongest method on the pole-balancing benchmark reinforcement learning tasks. Although earlier studies suggested that there was an advantage in evolving the network topology as well as connection weights, the leading neuroevolution systems evolve fixed networks. Whether evolving structure can improve performance is an open question. In this article, we introduce such a system, NeuroEvolution of Augmenting Topologies (NEAT). We show that when structure is evolved (1) with a principled method of crossover, (2) by protecting structural innovation, and (3) through incremental growth from minimal structure, learning is significantly faster and stronger than with the best fixed-topology methods. NEAT also shows that it is possible to evolve populations of increasingly large genomes, achieving highly complex solutions that would otherwise be difficult to optimize.

Proceedings Article
01 Jan 2002
TL;DR: In this article, the authors present an adaptive learning algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game, even in the presence of multiple Nash equilibria.
Abstract: Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm's parameters are easy to set to meet the convergence conditions.

Book ChapterDOI
19 Aug 2002
TL;DR: The Q-Cut algorithm is presented, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm, and extended to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments.
Abstract: We present the Q-Cut algorithm, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm. The learning agent creates an on-line map of the process history, and uses an efficient Max-Flow/Min-Cut algorithm for identifying bottlenecks. The policies for reaching bottlenecks are separately learned and added to the model in the form of options (macro-actions). We then extend the basic Q-Cut algorithm to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments. Experiments show significant performance improvements, particularly in the initial learning phase.
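The bottleneck-detection step can be illustrated with an off-the-shelf max-flow/min-cut routine applied to the agent's transition-history graph. Using networkx and raw transition counts as edge capacities is an assumption made for this sketch, not the paper's exact cut-quality criterion, and the toy two-room history is invented for the demo.

```python
import networkx as nx

def bottleneck_states(transition_counts, source, target):
    """Find candidate bottlenecks by computing a minimum s-t cut on the graph
    of observed transitions. Edge capacities are simply the observed transition
    counts, which is an illustrative assumption; the paper derives its own
    criterion for selecting cuts."""
    g = nx.DiGraph()
    for (s, s2), n in transition_counts.items():
        g.add_edge(s, s2, capacity=float(n))
    cut_value, (side_a, side_b) = nx.minimum_cut(g, source, target)
    # States on the source side of cut edges are natural subgoal candidates.
    frontier = {u for u in side_a for v in g.successors(u) if v in side_b}
    return cut_value, frontier

# Toy two-room history: many transitions inside rooms, few through the doorway.
counts = {("a1", "a2"): 20, ("a2", "a1"): 18, ("a2", "door"): 3,
          ("door", "b1"): 3, ("b1", "b2"): 15, ("b2", "b1"): 14}
print(bottleneck_states(counts, source="a1", target="b2"))
```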

Proceedings ArticleDOI
28 Jul 2002
TL;DR: This paper explores safe state abstraction in hierarchical reinforcement learning, where learned behaviors must conform to a given partial, hierarchical program, and shows how to achieve this for a partial programming language that is essentially Lisp augmented with nondeterministic constructs.
Abstract: Safe state abstraction in reinforcement learning allows an agent to ignore aspects of its current state that are irrelevant to its current decision, and therefore speeds up dynamic programming and learning. This paper explores safe state abstraction in hierarchical reinforcement learning, where learned behaviors must conform to a given partial, hierarchical program. Unlike previous approaches to this problem, our methods yield significant state abstraction while maintaining hierarchical optimality, i.e., optimality among all policies consistent with the partial program. We show how to achieve this for a partial programming language that is essentially Lisp augmented with nondeterministic constructs. We demonstrate our methods on two variants of Dietterich's taxi domain, showing how state abstraction and hierarchical optimality result in faster learning of better policies and enable the transfer of learned skills from one problem to another.

Proceedings ArticleDOI
01 Jul 2002
TL;DR: An autonomous animated dog is built that can be trained with a technique used to train real dogs called "clicker training" and capabilities demonstrated include being trained to recognize and use acoustic patterns as cues for actions, as well as to synthesize new actions from novel paths through its motion space.
Abstract: The ability to learn is a potentially compelling and important quality for interactive synthetic characters. To that end, we describe a practical approach to real-time learning for synthetic characters. Our implementation is grounded in the techniques of reinforcement learning and informed by insights from animal training. It simplifies the learning task for characters by (a) enabling them to take advantage of predictable regularities in their world, (b) allowing them to make maximal use of any supervisory signals, and (c) making them easy to train by humans. We built an autonomous animated dog that can be trained with a technique used to train real dogs called "clicker training". Capabilities demonstrated include being trained to recognize and use acoustic patterns as cues for actions, as well as to synthesize new actions from novel paths through its motion space. A key contribution of this paper is to demonstrate that by addressing the three problems of state, action, and state-action space discovery at the same time, the solution for each becomes easier. Finally, we articulate heuristics and design principles that make learning practical for synthetic characters.

Proceedings Article
08 Jul 2002
TL;DR: HEXQ, an algorithm that automatically attempts to decompose and solve a model-free factored MDP hierarchically, is described; it uses temporal and state abstraction to construct a hierarchy of interlinked smaller MDPs.
Abstract: An open problem in reinforcement learning is discovering hierarchical structure. HEXQ, an algorithm which automatically attempts to decompose and solve a model-free factored MDP hierarchically, is described. By searching for aliased Markov sub-space regions based on the state variables, the algorithm uses temporal and state abstraction to construct a hierarchy of interlinked smaller MDPs.

Proceedings ArticleDOI
28 Jul 2002
TL;DR: This investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems focuses on a novel action selection strategy for Q-learning (Watkins 1989), and demonstrates empirically that this extension causes the agents to converge almost always to the optimal joint action even in these difficult cases.
Abstract: We report on an investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems. Specifically, we focus on a novel action selection strategy for Q-learning (Watkins 1989). The new technique is applicable to scenarios where mutual observation of actions is not possible. To date, reinforcement learning approaches for such independent agents did not guarantee convergence to the optimal joint action in scenarios with high miscoordination costs. We improve on previous results (Claus & Boutilier 1998) by demonstrating empirically that our extension causes the agents to converge almost always to the optimal joint action even in these difficult cases.
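The flavour of such an action selection strategy for independent Q-learners can be sketched as Boltzmann selection over action values that are optimistically biased toward the best reward observed so far for each action. This is an illustrative construction in the spirit of the approach, not necessarily the paper's exact heuristic; all parameter values are assumptions.

```python
import math
import random

class OptimisticIndependentLearner:
    """Independent Q-learner with an optimistic action-selection heuristic:
    the value used for Boltzmann selection is biased toward the highest reward
    observed so far for each action. Illustrative of the kind of heuristic
    studied in this line of work, not the paper's exact rule."""
    def __init__(self, actions, lr=0.1, temp=1.0, optimism=0.3):
        self.q = {a: 0.0 for a in actions}
        self.max_r = {a: float("-inf") for a in actions}
        self.lr, self.temp, self.optimism = lr, temp, optimism

    def select(self):
        # Optimistically biased estimates, turned into Boltzmann probabilities.
        ev = {a: self.q[a] + self.optimism * max(self.max_r[a], 0.0)
              for a in self.q}
        weights = [math.exp(ev[a] / self.temp) for a in self.q]
        return random.choices(list(self.q), weights=weights)[0]

    def update(self, action, reward):
        self.q[action] += self.lr * (reward - self.q[action])
        self.max_r[action] = max(self.max_r[action], reward)

# Usage: each agent keeps its own learner and only observes the joint reward.
agent = OptimisticIndependentLearner(actions=["a", "b", "c"])
chosen = agent.select()
agent.update(chosen, reward=1.0)
```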

Journal ArticleDOI
TL;DR: A new method that controls the balance between exploitation and exploration is presented, based on model-based RL, in which the Bayes inference with forgetting effect estimates the state-transition probability of the environment.

Journal ArticleDOI
TL;DR: In this article, the authors formulated the automatic generation control (AGC) problem as a stochastic multistage decision problem and used reinforcement learning (RL) to obtain an AGC controller.

Journal ArticleDOI
TL;DR: This work provides a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching in bumblebees, and shows risk aversion to emerge even when bees are evolved in a completely risk-less environment.
Abstract: Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. Using evolutionary computation techniques we evolve (near-)optimal neuronal learning rules in a simple neural network model of reinforcement learning in bumblebees foraging for nectar. The resulting neural networks exhibit efficient reinforcement learning, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented choice strategies of risk aversion and probability matching. Additionally, risk aversion is shown to emerge even when bees are evolved in a completely risk-less environment. In contrast to existing theories in economics and game theory, risk-averse behavior is shown to be a direct consequence of (near-)optimal reinforcement learning, without requiring additional assumptions such as the existence of a nonlinear subjective utility function for rewards. Our results are corroborated by a rigorous mathematical analysis, and their robustness in real-world situations is supported by experiments in a mobile robot. Thus we provide a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching.

Journal ArticleDOI
TL;DR: This paper presents an approach to manage inventory decisions at all stages of the supply chain in an integrated manner that allows an inventory order policy to be determined, aimed at optimizing the performance of the whole supply chain.

Journal ArticleDOI
TL;DR: In this article, the expected motion of stochastic fictitious play and reinforcement learning with experimentation can both be written as a perturbed form of the evolutionary replicator dynamics, and they will in many cases have the same asymptotic behavior.
Abstract: Reinforcement learning and stochastic fictitious play are apparent rivals as models of human learning. They embody quite different assumptions about the processing of information and optimization. This paper compares their properties and finds that they are far more similar than previously thought. In particular, the expected motion of stochastic fictitious play and reinforcement learning with experimentation can both be written as a perturbed form of the evolutionary replicator dynamics. Therefore they will in many cases have the same asymptotic behavior. In particular, local stability of mixed equilibria under stochastic fictitious play implies local stability under perturbed reinforcement learning. The main identifiable difference between the two models is speed: stochastic fictitious play gives rise to faster learning.
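The perturbed replicator dynamics referred to here can be sketched directly: each strategy's probability grows in proportion to its payoff advantage, plus a small perturbation term standing in for noise or experimentation. The perturbation form, payoff matrix, and step size below are illustrative assumptions; the paper derives the exact perturbed dynamics for each learning model.

```python
import numpy as np

def perturbed_replicator_step(x, payoff, opponent, eps=0.05, dt=0.01):
    """One Euler step of replicator dynamics with a simple perturbation toward
    the uniform mixture. The perturbation is an illustrative stand-in for the
    noise/experimentation term, not the exact form derived in the paper."""
    u = payoff @ opponent                    # expected payoff of each pure strategy
    avg = x @ u                              # average payoff under the current mix
    dx = x * (u - avg) + eps * (1.0 / len(x) - x)
    x = np.clip(x + dt * dx, 1e-9, None)
    return x / x.sum()

# Matching-pennies-like game: the mixed equilibrium is the long-run rest point.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = y = np.array([0.9, 0.1])
for _ in range(20_000):
    x, y = perturbed_replicator_step(x, A, y), perturbed_replicator_step(y, -A, x)
print(x, y)   # both players drift toward the mixed equilibrium (0.5, 0.5)
```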

Journal ArticleDOI
Gerald Tesauro1
TL;DR: This paper views machine learning as a tool in a programmer's toolkit, and considers how it can be combined with other programming techniques to achieve and surpass world-class backgammon play.