
Showing papers on "Reinforcement learning published in 2002"


Journal ArticleDOI
TL;DR: This work shows that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
Abstract: Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the time. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

6,361 citations
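As an illustration of the kind of index policy the abstract refers to, here is a minimal sketch of a UCB1-style rule (play the arm with the highest empirical mean plus a confidence bonus); the Bernoulli reward model and arm means below are assumptions for the demo, not taken from the paper.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1-style index policy: play the arm maximizing
    mean reward + sqrt(2 ln t / n_pulls), which yields logarithmic
    regret for rewards bounded in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:                      # play each arm once first
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts, sums

# Illustrative Bernoulli bandit (the arm means are assumptions for the demo).
means = [0.2, 0.5, 0.7]
counts, _ = ucb1(lambda a: float(random.random() < means[a]), len(means), 10_000)
print(counts)   # most pulls should concentrate on the best arm
```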


Journal ArticleDOI
TL;DR: This paper presented a unified account of two neural systems concerned with the development and expression of adaptive behaviors: a mesencephalic dopamine system for reinforcement learning and a generic error-processing system associated with the anterior cingulate cortex.
Abstract: The authors present a unified account of 2 neural systems concerned with the development and expression of adaptive behaviors: a mesencephalic dopamine system for reinforcement learning and a “generic” error-processing system associated with the anterior cingulate cortex. The existence of the error-processing system has been inferred from the error-related negativity (ERN), a component of the event-related brain potential elicited when human participants commit errors in reaction-time tasks. The authors propose that the ERN is generated when a negative reinforcement learning signal is conveyed to the anterior cingulate cortex via the mesencephalic dopamine system and that this signal is used by the anterior cingulate cortex to modify performance on the task at hand. They provide support for this proposal using both computational modeling and psychophysiological experimentation.

3,438 citations
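The "negative reinforcement learning signal" invoked in this account is commonly modeled as a temporal-difference (TD) prediction error. The sketch below is a generic TD-error computation for illustration only, not the authors' specific model of the ERN; the toy rewards and values are assumptions.

```python
def td_errors(rewards, values, gamma=0.9):
    """Generic temporal-difference errors delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    A negative delta (outcome worse than expected) is the kind of signal the
    paper proposes is relayed to anterior cingulate cortex; the numbers here
    are purely illustrative."""
    deltas = []
    for t, r in enumerate(rewards):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        deltas.append(r + gamma * v_next - values[t])
    return deltas

# Toy episode: the expected reward fails to arrive at the last step -> negative error.
print(td_errors(rewards=[0.0, 0.0, 0.0], values=[0.5, 0.7, 0.9]))
```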


Journal ArticleDOI
TL;DR: NeuroEvolution of Augmenting Topologies (NEAT), as presented in this paper, employs a principled method of crossover of different topologies, protects structural innovation using speciation, and incrementally grows networks from minimal structure.
Abstract: An important question in neuroevolution is how to gain an advantage from evolving neural network topologies along with weights. We present a method, NeuroEvolution of Augmenting Topologies (NEAT), which outperforms the best fixed-topology method on a challenging benchmark reinforcement learning task. We claim that the increased efficiency is due to (1) employing a principled method of crossover of different topologies, (2) protecting structural innovation using speciation, and (3) incrementally growing from minimal structure. We test this claim through a series of ablation studies that demonstrate that each component is necessary to the system as a whole and to each other. What results is significantly faster learning. NEAT is also an important contribution to GAs because it shows how it is possible for evolution to both optimize and complexify solutions simultaneously, offering the possibility of evolving increasingly complex solutions over generations, and strengthening the analogy with biological evolution.

3,265 citations
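Speciation in NEAT relies on a compatibility distance between genomes built from excess genes, disjoint genes, and the weight differences of matching genes. The sketch below is a rough rendering of that idea; the coefficients, genome encoding, and species threshold are illustrative assumptions rather than the paper's exact settings.

```python
def compatibility(genome_a, genome_b, c1=1.0, c2=1.0, c3=0.4):
    """NEAT-style compatibility distance between two genomes, each represented
    here as a dict mapping innovation number -> connection weight.
    Coefficients and the normalization are illustrative choices."""
    innovs_a, innovs_b = set(genome_a), set(genome_b)
    matching = innovs_a & innovs_b
    cutoff = min(max(innovs_a), max(innovs_b))
    non_matching = innovs_a ^ innovs_b
    excess = sum(1 for i in non_matching if i > cutoff)     # beyond the other genome
    disjoint = len(non_matching) - excess                   # within shared range
    w_bar = (sum(abs(genome_a[i] - genome_b[i]) for i in matching) / len(matching)
             if matching else 0.0)
    n = max(len(genome_a), len(genome_b), 1)
    return c1 * excess / n + c2 * disjoint / n + c3 * w_bar

# Two genomes land in the same species if the distance falls under an assumed threshold.
same_species = compatibility({1: 0.5, 2: -0.3, 4: 0.8}, {1: 0.4, 2: -0.1, 3: 0.2}) < 3.0
print(same_species)
```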


Proceedings Article
08 Jul 2002

842 citations


Journal ArticleDOI
TL;DR: This article introduces the WoLF principle, “Win or Learn Fast”, for varying the learning rate, and examines this technique theoretically, proving convergence in self-play on a restricted class of iterated matrix games.

807 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the number of actions required to approach the optimal return is lower bounded by the mixing time of the optimal policy (in the undiscounted case) or by the horizon time T in the discounted case.
Abstract: We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states and actions, for both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

802 citations


Journal ArticleDOI
10 Oct 2002-Neuron
TL;DR: This work reviews the data and considers the involvement of a rich collection of different neural systems in various aspects of these forms of conditioning, including dopamine, which plays a pivotal, but complicated, role.

778 citations


Proceedings Article
01 Jan 2002
TL;DR: By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system.
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest learning a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly and, due to the properties of well-formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognizing and classifying previously learned movement skills. Evaluations in simulations and on an actual 30-degree-of-freedom humanoid robot exemplify the feasibility and robustness of our approach.

667 citations
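The canonical policies described here behave like dynamic movement primitives: a stable point attractor toward the goal, shaped by a phase-dependent forcing term that stands in for the nonparametric fit. A one-dimensional, Euler-integrated sketch follows; the gains, phase dynamics, and hand-written forcing term are illustrative assumptions, not the paper's exact formulation.

```python
import math

def rollout_dmp(y0, goal, forcing, tau=1.0, alpha=25.0, beta=6.25,
                alpha_x=3.0, dt=0.001, steps=1000):
    """One-dimensional attractor  tau*dv = alpha*(beta*(g - y) - v) + f(x),
    with a phase variable x decaying to 0 so the forcing term vanishes and
    the system falls back to the stable point attractor at the goal."""
    y, v, x = y0, 0.0, 1.0
    traj = []
    for _ in range(steps):
        f = forcing(x)                                   # learned shape term
        v += dt * (alpha * (beta * (goal - y) - v) + f) / tau
        y += dt * v / tau
        x += dt * (-alpha_x * x) / tau                   # canonical phase dynamics
        traj.append(y)
    return traj

# A hand-written forcing term stands in for the nonparametric regression fit.
traj = rollout_dmp(0.0, 1.0, forcing=lambda x: 50.0 * x * math.sin(8.0 * x))
print(traj[-1])   # ends near the goal regardless of the shape of the forcing term
```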


Journal ArticleDOI
TL;DR: The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics to enable multiple model-based reinforcement learning for nonlinear, nonstationary control tasks.
Abstract: We propose a modular reinforcement learning architecture for nonlinear, nonstationary control tasks, which we call multiple model-based reinforcement learning (MMRL). The basic idea is to decompose a complex task into multiple domains in space and time based on the predictability of the environmental dynamics. The system is composed of multiple modules, each of which consists of a state prediction model and a reinforcement learning controller. The "responsibility signal," which is given by the softmax function of the prediction errors, is used to weight the outputs of multiple modules, as well as to gate the learning of the prediction models and the reinforcement learning controllers. We formulate MMRL for both the discrete-time, finite-state case and the continuous-time, continuous-state case. The performance of MMRL is demonstrated in the discrete case on a nonstationary hunting task in a grid world, and in the continuous case on a nonlinear, nonstationary control task of swinging up a pendulum with variable physical parameters.

495 citations
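The responsibility signal can be sketched directly as a softmax over (scaled, negated) squared prediction errors, which then gates both the blended control output and the learning updates. The module predictions and the Gaussian-style scaling constant below are illustrative assumptions.

```python
import numpy as np

def responsibilities(pred_errors, sigma=1.0):
    """Softmax of negative scaled squared prediction errors: modules that
    predict the current dynamics well receive responsibility close to 1 and
    dominate both the control output and the learning updates."""
    scores = -np.square(pred_errors) / (2.0 * sigma ** 2)
    scores -= scores.max()                      # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def gated_output(module_actions, pred_errors, sigma=1.0):
    """Blend module controller outputs by their responsibility signals."""
    lam = responsibilities(np.asarray(pred_errors, dtype=float), sigma)
    return lam @ np.asarray(module_actions, dtype=float), lam

# Illustrative numbers: module 0 predicts the next state far better than module 1.
action, lam = gated_output(module_actions=[0.2, -1.0], pred_errors=[0.05, 0.9])
print(lam, action)
```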


Journal ArticleDOI
TL;DR: A TSK-type recurrent fuzzy network (TRFN), built from a series of recurrent fuzzy if-then rules with TSK-type consequent parts, is proposed; it can be designed by either neural network or genetic algorithms, depending on the learning environment.
Abstract: In this paper, a TSK-type recurrent fuzzy network (TRFN) structure is proposed. The TRFN can be designed by either a neural network or genetic algorithms, depending on the learning environment. The network develops from a series of recurrent fuzzy if-then rules with TSK-type consequent parts. The recurrent property comes from feeding the internal variables, derived from fuzzy firing strengths, back to both the network input and output layers. In this configuration, each internal variable is responsible for memorizing the temporal history of its corresponding fuzzy rule. The internal variable is also combined with external input variables in each rule's consequent, which further increases the network's learning ability. TRFN design under different learning environments is then presented. For problems where supervised training data is directly available, TRFN with supervised learning (TRFN-S) is proposed, and a neural network (NN) learning approach is adopted for TRFN-S design. An online learning algorithm with concurrent structure and parameter learning is proposed. With its flexible partitioning of the precondition part and TSK-type consequents, TRFN-S achieves both a small network size and high learning accuracy. For problems where gradient information for NN learning is costly to obtain or unavailable, as in reinforcement learning, TRFN with genetic learning (TRFN-G) is put forward. The precondition parts of TRFN-G are also partitioned in a flexible way, and all free parameters are designed concurrently by genetic algorithm. Owing to the well-designed network structure of TRFN, TRFN-G, like TRFN-S, is characterized by high learning accuracy. To demonstrate the superior properties of TRFN, TRFN-S is applied to dynamic system identification and TRFN-G to dynamic system control. By comparing the results with those of other types of recurrent networks and design configurations, the efficiency of TRFN is verified.

449 citations


Journal ArticleDOI
TL;DR: This paper presents a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states.
Abstract: A critical issue for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochastic environments with very large or infinite state spaces, traditional planning and reinforcement learning algorithms may be inapplicable, since their running time typically grows linearly with the state space size in the worst case. In this paper we present a new algorithm that, given only a generative model (a natural and common type of simulator) for an arbitrary MDP, performs on-line, near-optimal planning with a per-state running time that has no dependence on the number of states. The running time is exponential in the horizon time (which depends only on the discount factor γ and the desired degree of approximation to the optimal policy). Our algorithm thus provides a different complexity trade-off than classical algorithms such as value iteration—rather than scaling linearly in both horizon time and state space size, our running time trades an exponential dependence on the former in exchange for no dependence on the latter. Our algorithm is based on the idea of sparse sampling. We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near-best strategy from a given class of strategies in very large partially observable MDPs (Kearns, Mansour, & Ng, Neural Information Processing Systems 13, to appear).
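A rough sketch of the sparse-sampling idea: estimate Q-values at a state by drawing a small, fixed number of next states from the generative model for each action and recursing to a fixed depth, so the per-state cost is independent of the size of the state space. The sample width, depth, and toy chain MDP below are illustrative assumptions, not the bounds derived in the paper.

```python
import random

def sparse_sample_q(generative_model, state, actions, depth, width, gamma=0.95):
    """Estimate Q(state, a) for each action by recursive sparse sampling.
    generative_model(s, a) -> (next_state, reward) is assumed to be a cheap
    simulator; per-state cost depends on width and depth, not on |S|."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(width):                       # fixed number of samples per (s, a)
            s2, r = generative_model(state, a)
            q_next = sparse_sample_q(generative_model, s2, actions,
                                     depth - 1, width, gamma)
            total += r + gamma * max(q_next.values())
        q[a] = total / width
    return q

def plan(generative_model, state, actions, depth=3, width=5):
    q = sparse_sample_q(generative_model, state, actions, depth, width)
    return max(q, key=q.get)

# Toy chain MDP used only to exercise the planner.
def toy_model(s, a):
    s2 = min(s + 1, 10) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 10 else 0.0) + random.gauss(0.0, 0.01)

print(plan(toy_model, state=5, actions=["right", "left"]))
```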

Proceedings ArticleDOI
07 Aug 2002
TL;DR: This paper introduces a framework for reinforcement learning on mobile robots and describes the experiments using it to learn simple tasks.
Abstract: Programming mobile robots can be a long, time-consuming process. Specifying the low-level mapping from sensors to actuators is prone to programmer misconceptions, and debugging such a mapping can be tedious. The idea of having a robot learn how to accomplish a task, rather than being told explicitly, is an appealing one. It seems easier and much more intuitive for the programmer to specify what the robot should be doing, and to let it learn the fine details of how to do it. In this paper, we introduce a framework for reinforcement learning on mobile robots and describe our experiments using it to learn simple tasks.

Journal ArticleDOI
TL;DR: The design, construction and empirical evaluation of NJFun, an experimental spoken dialogue system that provides users with access to information about fun things to do in New Jersey, are reported on.
Abstract: Designing the dialogue policy of a spoken dialogue system involves many nontrivial choices. This paper presents a reinforcement learning approach for automatically optimizing a dialogue policy, which addresses the technical challenges in applying reinforcement learning to a working dialogue system with human users. We report on the design, construction and empirical evaluation of NJFun, an experimental spoken dialogue system that provides users with access to information about fun things to do in New Jersey. Our results show that by optimizing its performance via reinforcement learning, NJFun measurably improves system performance.

Proceedings Article
08 Jul 2002
TL;DR: These methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appears at the core of both the learning algorithm and the execution architecture.
Abstract: We present several new algorithms for multiagent reinforcement learning. A common feature of these algorithms is a parameterized, structured representation of a policy or value function. This structure is leveraged in an approach we call coordinated reinforcement learning, by which agents coordinate both their action selection activities and their parameter updates. Within the limits of our parametric representations, the agents will determine a jointly optimal action without explicitly considering every possible action in their exponentially large joint action space. Our methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appears at the core of both the learning algorithm and the execution architecture.

Journal ArticleDOI
TL;DR: This paper evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
Abstract: The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous-time, continuous-space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently calculable measure of the extent to which changes in some state affect the value function of other states. Variance is an efficiently calculable measure of how risky some state in a Markov chain is: a low-variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, non-linear, non-minimum-phase, two-dimensional problem of the “Car on the hill”. It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
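The variance measure can be illustrated with the standard Bellman-style recursion for the variance of the discounted return in a Markov chain. The sketch below is a generic construction in that spirit; the kd-trie representation, the influence measure, and the paper's exact implementation are omitted, and the toy chain is an assumption.

```python
import numpy as np

def return_variance(P, R, gamma=0.95):
    """Variance of the discounted return in a Markov chain, via the recursion
        W(s) = sum_s' P[s,s'] * ((R[s,s'] + g*V(s') - V(s))**2 + g**2 * W(s')).
    P is the transition matrix and R the per-transition reward. This is a
    generic construction in the spirit of the paper's 'variance' criterion,
    not its exact implementation."""
    n = P.shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P, (P * R).sum(axis=1))
    td_sq = (P * (R + gamma * V[None, :] - V[:, None]) ** 2).sum(axis=1)
    W = np.linalg.solve(np.eye(n) - gamma ** 2 * P, td_sq)
    return V, W

# Two-state toy chain: state 0 can jump to quite different futures, state 1 is absorbing.
P = np.array([[0.5, 0.5], [0.0, 1.0]])
R = np.array([[0.0, 1.0], [0.0, 0.0]])
V, W = return_variance(P, R)
print(V, W)   # a high-variance state is a natural candidate for splitting
```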

Journal ArticleDOI
TL;DR: This paper updates Bradtke and Barto's work in three significant ways: first, it presents a simpler derivation of the LSTD algorithm; second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression.
Abstract: TD(λ) is a popular family of algorithms for approximate policy evaluation in large MDPs. TD(λ) works by incrementally updating the value function after each observed transition. It has two major drawbacks: it may make inefficient use of data, and it requires the user to manually tune a stepsize schedule for good performance. For the case of linear value function approximations and λ = 0, the Least-Squares TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning, 22:1–3, 33–57) eliminates all stepsize parameters and improves data efficiency. This paper updates Bradtke and Barto's work in three significant ways. First, it presents a simpler derivation of the LSTD algorithm. Second, it generalizes from λ = 0 to arbitrary values of λ; at the extreme of λ = 1, the resulting new algorithm is shown to be a practical, incremental formulation of supervised linear regression. Third, it presents a novel and intuitive interpretation of LSTD as a model-based reinforcement learning technique.
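A compact sketch of LSTD(λ) with linear features: accumulate the matrix A and vector b from eligibility-trace-weighted transitions, then solve one linear system for the weights, with no stepsize to tune. The feature map, ridge term, and toy episode below are illustrative assumptions.

```python
import numpy as np

def lstd_lambda(episodes, phi, n_features, gamma=0.95, lam=0.7, eps=1e-6):
    """Least-Squares TD(lambda) for linear value functions.
    episodes: list of episodes, each a list of (s, r, s_next, done) tuples.
    phi(s) -> feature vector.  Solves A w = b with
        A += z (phi(s) - gamma*phi(s'))^T,   b += z * r,
    where z is the eligibility trace."""
    A = eps * np.eye(n_features)          # small ridge term for invertibility
    b = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)
        for s, r, s_next, done in episode:
            f = phi(s)
            f_next = np.zeros(n_features) if done else phi(s_next)
            z = gamma * lam * z + f
            A += np.outer(z, f - gamma * f_next)
            b += z * r
    return np.linalg.solve(A, b)

# Tiny 3-state chain with one-hot features, evaluated from a single batch episode.
phi = lambda s: np.eye(3)[s]
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 2, True)]
w = lstd_lambda([episode], phi, n_features=3)
print(w)   # approximate values of states 0, 1, 2 under the sampled policy
```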

Book ChapterDOI
02 Aug 2002
TL;DR: This paper empirically explores a simple approach to creating options based on the intuition that states that are frequently visited on system trajectories could prove to be useful subgoals, and proposes a greedy algorithm for identifying subgoals based on state visitation counts.
Abstract: Temporally extended actions (e.g., macro actions) have proven very useful for speeding up learning, ensuring robustness and building prior knowledge into AI systems. The options framework (Precup, 2000; Sutton, Precup & Singh, 1999) provides a natural way of incorporating such actions into reinforcement learning systems, but leaves open the issue of how good options might be identified. In this paper, we empirically explore a simple approach to creating options. The underlying assumption is that the agent will be asked to perform different goal-achievement tasks in an environment that is otherwise the same over time. Our approach is based on the intuition that states that are frequently visited on system trajectories could prove to be useful subgoals (e.g., McGovern & Barto, 2001; Iba, 1989). We propose a greedy algorithm for identifying subgoals based on state visitation counts. We present empirical studies of this approach in two gridworld navigation tasks. One of the environments we explored contains bottleneck states, and the algorithm indeed finds these states, as expected. The second environment is an empty gridworld with no obstacles. Although the environment does not contain any obvious subgoals, our approach still finds useful options, which essentially allow the agent to explore the environment more quickly.
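The greedy visitation-count idea can be sketched in a few lines: count how often each state appears across recorded trajectories, exclude trivial states, and propose the most-visited states as candidate subgoals. This is a simplified illustration; the paper's exact filtering and option-construction steps are not reproduced, and the toy trajectories are assumptions.

```python
from collections import Counter

def greedy_subgoals(trajectories, n_subgoals=2, exclude=()):
    """Pick candidate option subgoals as the states visited most often across
    past trajectories, greedily, excluding trivial states such as start and
    goal states. A simplified illustration of the idea, not the paper's
    exact algorithm."""
    counts = Counter(s for traj in trajectories for s in traj)
    for s in exclude:
        counts.pop(s, None)
    return [s for s, _ in counts.most_common(n_subgoals)]

# Trajectories in a two-room gridworld all pass through the doorway,
# so the doorway state dominates the counts (toy data below).
trajs = [["A", "B", "door", "C", "G"],
         ["A", "door", "C", "D", "G"],
         ["B", "door", "D", "G"]]
print(greedy_subgoals(trajs, n_subgoals=1, exclude={"G"}))   # ['door']
```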

Proceedings Article
09 Jul 2002
TL;DR: NEAT shows that when structure is evolved with a principled method of crossover, by protecting structural innovation, and through incremental growth from minimal structure, learning is significantly faster and stronger than with the best fixed-topology methods.
Abstract: Neuroevolution is currently the strongest method on the pole-balancing benchmark reinforcement learning tasks. Although earlier studies suggested that there was an advantage in evolving the network topology as well as connection weights, the leading neuroevolution systems evolve fixed networks. Whether evolving structure can improve performance is an open question. In this article, we introduce such a system, NeuroEvolution of Augmenting Topologies (NEAT). We show that when structure is evolved (1) with a principled method of crossover, (2) by protecting structural innovation, and (3) through incremental growth from minimal structure, learning is significantly faster and stronger than with the best fixed-topology methods. NEAT also shows that it is possible to evolve populations of increasingly large genomes, achieving highly complex solutions that would otherwise be difficult to optimize.

Proceedings Article
01 Jan 2002
TL;DR: In this article, the authors present an adaptive learning algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game, even in the presence of multiple Nash equilibria.
Abstract: Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm's parameters are easy to set to meet the convergence conditions.

Book ChapterDOI
19 Aug 2002
TL;DR: The Q-Cut algorithm is presented, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm, and extended to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments.
Abstract: We present the Q-Cut algorithm, a graph theoretic approach for automatic detection of sub-goals in a dynamic environment, which is used for acceleration of the Q-Learning algorithm. The learning agent creates an on-line map of the process history, and uses an efficient Max-Flow/Min-Cut algorithm for identifying bottlenecks. The policies for reaching bottlenecks are separately learned and added to the model in the form of options (macro-actions). We then extend the basic Q-Cut algorithm to the Segmented Q-Cut algorithm, which uses previously identified bottlenecks for state space partitioning, necessary for finding additional bottlenecks in complex environments. Experiments show significant performance improvements, particularly in the initial learning phase.
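The bottleneck-detection step can be illustrated with an off-the-shelf max-flow/min-cut routine applied to the agent's transition-history graph. Using networkx and raw transition counts as edge capacities is an assumption made for this sketch, not the paper's exact cut-quality criterion, and the toy two-room history is invented for the demo.

```python
import networkx as nx

def bottleneck_states(transition_counts, source, target):
    """Find candidate bottlenecks by computing a minimum s-t cut on the graph
    of observed transitions. Edge capacities are simply the observed transition
    counts, which is an illustrative assumption; the paper derives its own
    criterion for selecting cuts."""
    g = nx.DiGraph()
    for (s, s2), n in transition_counts.items():
        g.add_edge(s, s2, capacity=float(n))
    cut_value, (side_a, side_b) = nx.minimum_cut(g, source, target)
    # States on the source side of cut edges are natural subgoal candidates.
    frontier = {u for u in side_a for v in g.successors(u) if v in side_b}
    return cut_value, frontier

# Toy two-room history: many transitions inside rooms, few through the doorway.
counts = {("a1", "a2"): 20, ("a2", "a1"): 18, ("a2", "door"): 3,
          ("door", "b1"): 3, ("b1", "b2"): 15, ("b2", "b1"): 14}
print(bottleneck_states(counts, source="a1", target="b2"))
```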

Proceedings ArticleDOI
28 Jul 2002
TL;DR: This paper explores safe state abstraction in hierarchical reinforcement learning, where learned behaviors must conform to a given partial, hierarchical program, and shows how to achieve this for a partial programming language that is essentially Lisp augmented with nondeterministic constructs.
Abstract: Safe state abstraction in reinforcement learning allows an agent to ignore aspects of its current state that are irrelevant to its current decision, and therefore speeds up dynamic programming and learning. This paper explores safe state abstraction in hierarchical reinforcement learning, where learned behaviors must conform to a given partial, hierarchical program. Unlike previous approaches to this problem, our methods yield significant state abstraction while maintaining hierarchical optimality, i.e., optimality among all policies consistent with the partial program. We show how to achieve this for a partial programming language that is essentially Lisp augmented with nondeterministic constructs. We demonstrate our methods on two variants of Dietterich's taxi domain, showing how state abstraction and hierarchical optimality result in faster learning of better policies and enable the transfer of learned skills from one problem to another.

Proceedings ArticleDOI
01 Jul 2002
TL;DR: An autonomous animated dog is built that can be trained with a technique used to train real dogs called "clicker training" and capabilities demonstrated include being trained to recognize and use acoustic patterns as cues for actions, as well as to synthesize new actions from novel paths through its motion space.
Abstract: The ability to learn is a potentially compelling and important quality for interactive synthetic characters. To that end, we describe a practical approach to real-time learning for synthetic characters. Our implementation is grounded in the techniques of reinforcement learning and informed by insights from animal training. It simplifies the learning task for characters by (a) enabling them to take advantage of predictable regularities in their world, (b) allowing them to make maximal use of any supervisory signals, and (c) making them easy to train by humans. We built an autonomous animated dog that can be trained with a technique used to train real dogs called "clicker training". Capabilities demonstrated include being trained to recognize and use acoustic patterns as cues for actions, as well as to synthesize new actions from novel paths through its motion space. A key contribution of this paper is to demonstrate that by addressing the three problems of state, action, and state-action space discovery at the same time, the solution for each becomes easier. Finally, we articulate heuristics and design principles that make learning practical for synthetic characters.

Proceedings Article
08 Jul 2002
TL;DR: HEXQ, an algorithm that automatically attempts to decompose and solve a model-free factored MDP hierarchically, is described; it uses temporal and state abstraction to construct a hierarchy of interlinked smaller MDPs.
Abstract: An open problem in reinforcement learning is discovering hierarchical structure. HEXQ, an algorithm which automatically attempts to decompose and solve a model-free factored MDP hierarchically, is described. By searching for aliased Markov sub-space regions based on the state variables, the algorithm uses temporal and state abstraction to construct a hierarchy of interlinked smaller MDPs.

Proceedings ArticleDOI
28 Jul 2002
TL;DR: This investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems focuses on a novel action selection strategy for Q-learning (Watkins 1989), and demonstrates empirically that this extension causes the agents to converge almost always to the optimal joint action even in these difficult cases.
Abstract: We report on an investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems. Specifically, we focus on a novel action selection strategy for Q-learning (Watkins 1989). The new technique is applicable to scenarios where mutual observation of actions is not possible. To date, reinforcement learning approaches for such independent agents did not guarantee convergence to the optimal joint action in scenarios with high miscoordination costs. We improve on previous results (Claus & Boutilier 1998) by demonstrating empirically that our extension causes the agents to converge almost always to the optimal joint action even in these difficult cases.
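The flavour of such an action selection strategy for independent Q-learners can be sketched as Boltzmann selection over action values that are optimistically biased toward the best reward observed so far for each action. This is an illustrative construction in the spirit of the approach, not necessarily the paper's exact heuristic; all parameter values are assumptions.

```python
import math
import random

class OptimisticIndependentLearner:
    """Independent Q-learner with an optimistic action-selection heuristic:
    the value used for Boltzmann selection is biased toward the highest reward
    observed so far for each action. Illustrative of the kind of heuristic
    studied in this line of work, not the paper's exact rule."""
    def __init__(self, actions, lr=0.1, temp=1.0, optimism=0.3):
        self.q = {a: 0.0 for a in actions}
        self.max_r = {a: float("-inf") for a in actions}
        self.lr, self.temp, self.optimism = lr, temp, optimism

    def select(self):
        # Optimistically biased estimates, turned into Boltzmann probabilities.
        ev = {a: self.q[a] + self.optimism * max(self.max_r[a], 0.0)
              for a in self.q}
        weights = [math.exp(ev[a] / self.temp) for a in self.q]
        return random.choices(list(self.q), weights=weights)[0]

    def update(self, action, reward):
        self.q[action] += self.lr * (reward - self.q[action])
        self.max_r[action] = max(self.max_r[action], reward)

# Usage: each agent keeps its own learner and only observes the joint reward.
agent = OptimisticIndependentLearner(actions=["a", "b", "c"])
chosen = agent.select()
agent.update(chosen, reward=1.0)
```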

Journal ArticleDOI
TL;DR: A new method that controls the balance between exploitation and exploration is presented, based on model-based RL, in which the Bayes inference with forgetting effect estimates the state-transition probability of the environment.

Journal ArticleDOI
TL;DR: In this article, the authors formulated the automatic generation control (AGC) problem as a stochastic multistage decision problem and used reinforcement learning (RL) to obtain an AGC controller.

Journal ArticleDOI
TL;DR: This work provides a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching in bumblebees, and shows risk aversion to emerge even when bees are evolved in a completely risk-less environment.
Abstract: Reinforcement learning is a fundamental process by which organisms learn to achieve goals from their interactions with the environment. Using evolutionary computation techniques we evolve (near-)optimal neuronal learning rules in a simple neural network model of reinforcement learning in bumblebees foraging for nectar. The resulting neural networks exhibit efficient reinforcement learning, allowing the bees to respond rapidly to changes in reward contingencies. The evolved synaptic plasticity dynamics give rise to varying exploration/exploitation levels and to the well-documented choice strategies of risk aversion and probability matching. Additionally, risk aversion is shown to emerge even when bees are evolved in a completely risk-less environment. In contrast to existing theories in economics and game theory, risk-averse behavior is shown to be a direct consequence of (near-)optimal reinforcement learning, without requiring additional assumptions such as the existence of a nonlinear subjective utility function for rewards. Our results are corroborated by a rigorous mathematical analysis, and their robustness in real-world situations is supported by experiments in a mobile robot. Thus we provide a biologically founded, parsimonious, and novel explanation for risk aversion and probability matching.

Journal ArticleDOI
TL;DR: This paper presents an approach to manage inventory decisions at all stages of the supply chain in an integrated manner that allows an inventory order policy to be determined, aimed at optimizing the performance of the whole supply chain.

Journal ArticleDOI
TL;DR: In this article, the expected motion of stochastic fictitious play and reinforcement learning with experimentation can both be written as a perturbed form of the evolutionary replicator dynamics, and they will in many cases have the same asymptotic behavior.
Abstract: Reinforcement learning and stochastic fictitious play are apparent rivals as models of human learning. They embody quite different assumptions about the processing of information and optimization. This paper compares their properties and finds that they are far more similar than previously thought. In particular, the expected motion of stochastic fictitious play and reinforcement learning with experimentation can both be written as a perturbed form of the evolutionary replicator dynamics. Therefore they will in many cases have the same asymptotic behavior. In particular, local stability of mixed equilibria under stochastic fictitious play implies local stability under perturbed reinforcement learning. The main identifiable difference between the two models is speed: stochastic fictitious play gives rise to faster learning.
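The perturbed replicator dynamics referred to here can be sketched directly: each strategy's probability grows in proportion to its payoff advantage, plus a small perturbation term standing in for noise or experimentation. The perturbation form, payoff matrix, and step size below are illustrative assumptions; the paper derives the exact perturbed dynamics for each learning model.

```python
import numpy as np

def perturbed_replicator_step(x, payoff, opponent, eps=0.05, dt=0.01):
    """One Euler step of replicator dynamics with a simple perturbation toward
    the uniform mixture. The perturbation is an illustrative stand-in for the
    noise/experimentation term, not the exact form derived in the paper."""
    u = payoff @ opponent                    # expected payoff of each pure strategy
    avg = x @ u                              # average payoff under the current mix
    dx = x * (u - avg) + eps * (1.0 / len(x) - x)
    x = np.clip(x + dt * dx, 1e-9, None)
    return x / x.sum()

# Matching-pennies-like game: the mixed equilibrium is the long-run rest point.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = y = np.array([0.9, 0.1])
for _ in range(20_000):
    x, y = perturbed_replicator_step(x, A, y), perturbed_replicator_step(y, -A, x)
print(x, y)   # both players drift toward the mixed equilibrium (0.5, 0.5)
```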

Journal ArticleDOI
Gerald Tesauro1
TL;DR: This paper views machine learning as a tool in a programmer's toolkit, and considers how it can be combined with other programming techniques to achieve and surpass world-class backgammon play.