
Showing papers on "Reinforcement learning published in 2006"


Proceedings Article
04 Dec 2006
TL;DR: This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel, using differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR).
Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR).
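
As a reference point for the control machinery described above (not the paper's own code), the finite-horizon LQR backward pass that DDP generalizes can be sketched as follows; the double-integrator dynamics and the cost weights are illustrative placeholders, and the real helicopter dynamics are nonlinear and learned from pilot data.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Q_final, horizon):
    """Finite-horizon discrete-time LQR: returns time-varying feedback gains K_t.

    DDP (used in the paper) generalizes this recursion by linearizing nonlinear
    dynamics and quadratizing the cost around a nominal trajectory at each iteration.
    """
    P = Q_final                      # value-function Hessian at the final step
    gains = []
    for _ in range(horizon):
        # K_t = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati recursion for the cost-to-go
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]               # gains ordered from t = 0 to horizon - 1

# Example: double integrator with illustrative cost weights
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K_seq = lqr_backward_pass(A, B, np.eye(2), np.eye(1), 10 * np.eye(2), horizon=50)
u0 = -K_seq[0] @ np.array([1.0, 0.0])   # control at t = 0 for state [1, 0]
```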

621 citations


Journal Article
TL;DR: A framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability) is described, and model-based and model-free variants of the elimination method are provided.
Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²)log(1/δ)) times to find an ε-optimal arm with probability of at least 1-δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
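
A minimal sketch of the action-elimination idea for the bandit case, assuming rewards in [0, 1] and a generic Hoeffding-style confidence radius; the `pull` callback, batch size, and logarithmic term are illustrative choices, not the paper's exact procedure or constants.

```python
import math
import random

def action_elimination(pull, n_arms, eps, delta, batch=100, max_rounds=200):
    """Keep pulling surviving arms; drop any arm whose upper confidence bound
    falls below the best lower confidence bound (minus eps)."""
    active = set(range(n_arms))
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for _ in range(max_rounds):
        for a in active:
            for _ in range(batch):
                r = pull(a)                       # reward assumed in [0, 1]
                counts[a] += 1
                means[a] += (r - means[a]) / counts[a]
        # Hoeffding radius; the log term is one standard choice, not the paper's constant
        rad = {a: math.sqrt(math.log(4 * n_arms * counts[a] ** 2 / delta) / (2 * counts[a]))
               for a in active}
        best_lcb = max(means[a] - rad[a] for a in active)
        active = {a for a in active if means[a] + rad[a] >= best_lcb - eps}
        if len(active) == 1 or max(rad.values()) < eps / 2:
            break
    return max(active, key=lambda a: means[a])

# Toy usage: Bernoulli arms with hidden success probabilities
probs = [0.2, 0.5, 0.55, 0.8]
best = action_elimination(lambda a: float(random.random() < probs[a]),
                          n_arms=4, eps=0.05, delta=0.05)
```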

604 citations


Proceedings ArticleDOI
01 Oct 2006
TL;DR: This paper gives an overview of learning with policy gradient methods for robotics, with a strong focus on recent advances in the field, and shows how the most recently developed methods can significantly improve learning performance.
Abstract: The acquisition and improvement of motor skills and control policies for robotics from trial and error is of essential importance if robots are ever to leave precisely pre-structured environments. However, to date only a few existing reinforcement learning methods have been scaled into the domains of high-dimensional robots such as manipulator, legged or humanoid robots. Policy gradient methods remain one of the few exceptions and have found a variety of applications. Nevertheless, the application of such methods is not without peril if done in an uninformed manner. In this paper, we give an overview of learning with policy gradient methods for robotics with a strong focus on recent advances in the field. We outline previous applications to robotics and show how the most recently developed methods can significantly improve learning performance. Finally, we evaluate our most promising algorithm in the application of hitting a baseball with an anthropomorphic arm.
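
As a baseline illustration of the family of methods this overview covers, here is vanilla episodic REINFORCE on a toy chain task, without the variance-reduction and natural-gradient refinements the paper emphasizes; the environment and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(theta, n_states=5, max_steps=50):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s, traj = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[s])
        a = rng.choice(2, p=probs)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        traj.append((s, a, r))
        s = s_next
        if r > 0:
            break
    return traj

def reinforce(n_states=5, episodes=500, alpha=0.1, gamma=0.95):
    theta = np.zeros((n_states, 2))          # one logit per (state, action)
    for _ in range(episodes):
        traj = run_episode(theta, n_states)
        G = 0.0
        for s, a, r in reversed(traj):       # Monte-Carlo returns, computed backwards
            G = r + gamma * G
            probs = softmax(theta[s])
            grad_log = -probs                # d log pi(a|s) / d theta[s]
            grad_log[a] += 1.0
            theta[s] += alpha * G * grad_log
    return theta

theta = reinforce()
```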

598 citations


Book ChapterDOI
01 Jan 2006
TL;DR: A successful application of reinforcement learning to designing a controller for sustained inverted flight on an autonomous helicopter, using a stochastic, nonlinear model of the helicopter’s dynamics.
Abstract: Helicopters have highly stochastic, nonlinear dynamics, and autonomous helicopter flight is widely regarded to be a challenging control problem. As helicopters are highly unstable at low speeds, it is particularly difficult to design controllers for low speed aerobatic maneuvers. In this paper, we describe a successful application of reinforcement learning to designing a controller for sustained inverted flight on an autonomous helicopter. Using data collected from the helicopter in flight, we began by learning a stochastic, nonlinear model of the helicopter’s dynamics. Then, a reinforcement learning algorithm was applied to automatically learn a controller for autonomous inverted hovering. Finally, the resulting controller was successfully tested on our autonomous helicopter platform.

587 citations


Journal ArticleDOI
TL;DR: The model successfully captures patterns of behavior resulting from OFC damage in decision making, reversal learning, and devaluation paradigms and makes additional predictions for the underlying source of these deficits.
Abstract: The authors explore the division of labor between the basal ganglia-dopamine (BG-DA) system and the orbitofrontal cortex (OFC) in decision making. They show that a primitive neural network model of the BG-DA system slowly learns to make decisions on the basis of the relative probability of rewards but is not as sensitive to (a) recency or (b) the value of specific rewards. An augmented model that explores BG-OFC interactions is more successful at estimating the true expected value of decisions and is faster at switching behavior when reinforcement contingencies change. In the augmented model, OFC areas exert top-down control on the BG and premotor areas by representing reinforcement magnitudes in working memory. The model successfully captures patterns of behavior resulting from OFC damage in decision making, reversal learning, and devaluation paradigms and makes additional predictions for the underlying source of these deficits.

580 citations


Journal ArticleDOI
TL;DR: In all 4 studies, the best-performing strategy from the participants' repertoires most accurately predicted the inferences after sufficient learning opportunities, and when testing SSL against 3 models representing extensions of SSL and against an exemplar model assuming a memory-based inference process, the authors found that SSL predicted the inferences most accurately.
Abstract: The assumption that people possess a repertoire of strategies to solve the inference problems they face has been raised repeatedly. However, a computational model specifying how people select strategies from their repertoire is still lacking. The proposed strategy selection learning (SSL) theory predicts a strategy selection process on the basis of reinforcement learning. The theory assumes that individuals develop subjective expectations for the strategies they have and select strategies proportional to their expectations, which are then updated on the basis of subsequent experience. The learning assumption was supported in 4 experimental studies. Participants substantially improved their inferences through feedback. In all 4 studies, the best-performing strategy from the participants' repertoires most accurately predicted the inferences after sufficient learning opportunities. When testing SSL against 3 models representing extensions of SSL and against an exemplar model assuming a memory-based inference process, the authors found that SSL predicted the inferences most accurately.
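
A minimal sketch of the selection rule the SSL theory describes: strategies are chosen in proportion to subjective expectancies, which are then reinforced by obtained payoffs. The class name, initial expectancy, learning rate, and toy strategies below are illustrative assumptions, not the fitted model from the studies.

```python
import random

class SSLSelector:
    """Strategy selection learning sketch: pick strategies in proportion to
    subjective expectancies and reinforce them with obtained payoffs."""
    def __init__(self, strategies, initial_expectancy=10.0, learning_rate=1.0):
        self.strategies = strategies                     # list of callables
        self.expectancies = [initial_expectancy] * len(strategies)
        self.lr = learning_rate

    def choose(self):
        total = sum(self.expectancies)
        r, acc = random.uniform(0, total), 0.0
        for i, e in enumerate(self.expectancies):
            acc += e
            if r <= acc:
                return i
        return len(self.expectancies) - 1

    def update(self, i, payoff):
        # Reinforce the chosen strategy; the floor keeps expectancies positive
        self.expectancies[i] = max(1e-3, self.expectancies[i] + self.lr * payoff)

# Usage: two toy inference strategies scored by a feedback function
guess_small = lambda pair: min(pair)
guess_large = lambda pair: max(pair)
ssl = SSLSelector([guess_small, guess_large])
for _ in range(100):
    pair = (random.randint(1, 9), random.randint(1, 9))
    i = ssl.choose()
    answer = ssl.strategies[i](pair)
    payoff = 1.0 if answer == max(pair) else -1.0        # feedback favors guess_large
    ssl.update(i, payoff)
```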

551 citations


Journal ArticleDOI
TL;DR: The results suggest that brain regions, such as the vmPFC, use an abstract model of task structure to guide behavioral choice, computations that may underlie the human capacity for complex social interactions and abstract strategizing.
Abstract: Many real-life decision-making problems incorporate higher-order structure, involving interdependencies between different stimuli, actions, and subsequent rewards. It is not known whether brain regions implicated in decision making, such as the ventromedial prefrontal cortex (vmPFC), use a stored model of the task structure to guide choice (model-based decision making) or merely learn action or state values without assuming higher-order structure as in standard reinforcement learning. To discriminate between these possibilities, we scanned human subjects with functional magnetic resonance imaging while they performed a simple decision-making task with higher-order structure, probabilistic reversal learning. We found that neural activity in a key decision-making region, the vmPFC, was more consistent with a computational model that exploits higher-order structure than with simple reinforcement learning. These results suggest that brain regions, such as the vmPFC, use an abstract model of task structure to guide behavioral choice, computations that may underlie the human capacity for complex social interactions and abstract strategizing.

482 citations


Proceedings ArticleDOI
25 Jun 2006
TL;DR: This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience, and Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
Abstract: For a Markov Decision Process with finite state (size S) and action spaces (size A per state), we propose a new algorithm---Delayed Q-Learning. We prove it is PAC, achieving near optimal performance except for O(SA) timesteps using O(SA) space, improving on the O(S²A) bounds of best previous algorithms. This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience---no resets nor parallel sampling is used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
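
A simplified sketch of the batched, optimistic update at the core of Delayed Q-learning: Q-values start at V_max, m targets are collected per state-action pair, and a single averaged update is applied only if it lowers the value by at least 2*eps1. The LEARN-flag bookkeeping the full algorithm needs for its PAC guarantee is omitted, and the constants are placeholders.

```python
from collections import defaultdict

def delayed_q_update(Q, buffers, s, a, r, s_next,
                     gamma=0.95, m=20, eps1=0.01, n_actions=4):
    """Simplified core of Delayed Q-learning: collect m targets for (s, a), then
    apply one averaged, optimistic update only if it lowers Q(s, a) enough."""
    target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
    buffers[(s, a)].append(target)
    if len(buffers[(s, a)]) == m:
        avg = sum(buffers[(s, a)]) / m
        if Q[(s, a)] - avg >= 2 * eps1:        # only "successful" updates change Q
            Q[(s, a)] = avg + eps1             # stay slightly optimistic
        buffers[(s, a)] = []
    return Q

# Optimistic initialization: with rewards in [0, 1], V_max = 1 / (1 - gamma)
gamma = 0.95
Q = defaultdict(lambda: 1.0 / (1 - gamma))
buffers = defaultdict(list)
Q = delayed_q_update(Q, buffers, s=0, a=1, r=0.5, s_next=2)
```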

474 citations


Proceedings Article
04 Dec 2006
TL;DR: A class of MDPs is introduced which greatly simplify reinforcement learning: they have discrete state spaces and continuous control spaces, and they enable efficient approximations to traditional MDPs.
Abstract: We introduce a class of MDPs which greatly simplify Reinforcement Learning. They have discrete state spaces and continuous control spaces. The controls have the effect of rescaling the transition probabilities of an underlying Markov chain. A control cost penalizing KL divergence between controlled and uncontrolled transition probabilities makes the minimization problem convex, and allows analytical computation of the optimal controls given the optimal value function. An exponential transformation of the optimal value function makes the minimized Bellman equation linear. Apart from their theoretical significance, the new MDPs enable efficient approximations to traditional MDPs. Shortest path problems are approximated to arbitrary precision with largest eigenvalue problems, yielding an O(n) algorithm. Accurate approximations to generic MDPs are obtained via continuous embedding reminiscent of LP relaxation in integer programming. Off-policy learning of the optimal value function is possible without need for state-action values; the new algorithm (Z-learning) outperforms Q-learning.
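
A minimal sketch of the Z-learning idea described above: learn the desirability function z = exp(-v) directly from transitions of the uncontrolled (passive) chain, exploiting the fact that the exponentiated Bellman equation is linear. The 1-D random-walk environment, episode counts, and learning rate are illustrative assumptions.

```python
import numpy as np

def z_learning(sample_passive_step, state_cost, n_states, goal, episodes=2000, alpha=0.1):
    """Z-learning sketch: learn the desirability function z = exp(-v) from
    transitions of the *uncontrolled* (passive) Markov chain; no actions needed."""
    z = np.ones(n_states)
    for _ in range(episodes):
        s = np.random.randint(n_states)
        for _ in range(100):
            s_next = sample_passive_step(s)
            # TD-style update of the linear Bellman equation z(s) = e^{-cost(s)} E[z(s')]
            z[s] += alpha * (np.exp(-state_cost(s)) * z[s_next] - z[s])
            s = s_next
            if s == goal:
                break
    return -np.log(z)        # recover the value (cost-to-go) function

# Toy 1-D random walk with unit cost per step and a zero-cost absorbing goal
n, goal = 10, 9
step = lambda s: s if s == goal else min(max(s + np.random.choice([-1, 1]), 0), n - 1)
cost = lambda s: 0.0 if s == goal else 1.0
v = z_learning(step, cost, n, goal)
```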

430 citations


Journal ArticleDOI
TL;DR: The role of the dialogue manager in a spoken dialogue system is summarized, a short introduction to reinforcement-learning of dialogue management strategies is given, the literature on user modelling for simulation-based strategy learning is reviewed and recent work on user model evaluation is described.
Abstract: Within the broad field of spoken dialogue systems, the application of machine-learning approaches to dialogue management strategy design is a rapidly growing research area. The main motivation is the hope of building systems that learn through trial-and-error interaction what constitutes a good dialogue strategy. Training of such systems could in theory be done using human users or using corpora of human–computer dialogue, but in practice the typically vast space of possible dialogue states and strategies cannot be explored without the use of automatic user simulation tools. This requirement for training statistical dialogue models has created an interesting new application area for predictive statistical user modelling, and a variety of different techniques for simulating user behaviour have been presented in the literature, ranging from simple Markov models to Bayesian networks. The development of reliable user simulation tools is critical to further progress on automatic dialogue management design, but it holds many challenges, some of which have been encountered in other areas of current research on statistical user modelling, such as the problem of ‘concept drift’, the problem of combining content-based and collaboration-based modelling techniques, and user model evaluation. The latter topic is of particular interest, because simulation-based learning is currently one of the few applications of statistical user modelling that employs both direct ‘accuracy-based’ and indirect ‘utility-based’ evaluation techniques. In this paper, we briefly summarize the role of the dialogue manager in a spoken dialogue system, give a short introduction to reinforcement-learning of dialogue management strategies and review the literature on user modelling for simulation-based strategy learning. We further describe recent work on user model evaluation and discuss some of the current research issues in simulation-based learning from a user modelling perspective.

378 citations


Proceedings ArticleDOI
12 Jun 2006
TL;DR: In this article, the authors combine the strengths of reinforcement learning and queuing models in a hybrid approach in which RL trains offline on data collected while a queuing model policy controls the system.
Abstract: Reinforcement Learning (RL) provides a promising new approach to systems performance management that differs radically from standard queuing-theoretic approaches making use of explicit system performance models. In principle, RL can automatically learn high-quality management policies without an explicit performance model or traffic model and with little or no built-in system specific knowledge. In our original work [1], [2], [3] we showed the feasibility of using online RL to learn resource valuation estimates (in lookup table form) which can be used to make high-quality server allocation decisions in a multi-application prototype Data Center scenario. The present work shows how to combine the strengths of both RL and queuing models in a hybrid approach in which RL trains offline on data collected while a queuing model policy controls the system. By training offline we avoid suffering potentially poor performance in live online training. We also now use RL to train nonlinear function approximators (e.g. multi-layer perceptrons) instead of lookup tables; this enables scaling to substantially larger state spaces. Our results now show that in both open-loop and closed-loop traffic, hybrid RL training can achieve significant performance improvements over a variety of initial model-based policies. We also find that, as expected, RL can deal effectively with both transients and switching delays, which lie outside the scope of traditional steady-state queuing theory.
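
A rough sketch of the offline, hybrid training setup described above: fit a nonlinear value estimator to transitions logged while an external (queuing-model) policy controlled the system, with no live exploration. The tuple format, the use of scikit-learn's MLPRegressor, and the fitted-iteration loop are assumptions for illustration, not the authors' exact training procedure.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def hybrid_offline_training(logged, sweeps=10, gamma=0.9):
    """Offline, SARSA-style training sketch on data logged while an external
    (e.g. queuing-model) policy controlled the system; a small MLP replaces the
    lookup table so larger state spaces can be handled."""
    X = np.array([np.append(s, a) for s, a, _, _, _ in logged])
    X_next = np.array([np.append(s2, a2) for _, _, _, s2, a2 in logged])
    r = np.array([rew for _, _, rew, _, _ in logged], dtype=float)

    q = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000)
    q.fit(X, r)                                   # initialize on immediate rewards
    for _ in range(sweeps):
        targets = r + gamma * q.predict(X_next)   # bootstrap from the current fit
        q.fit(X, targets)                         # refit on the new targets
    return q

# logged: list of (state_vector, action, reward, next_state_vector, next_action) tuples
```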

Journal ArticleDOI
TL;DR: This article describes a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting using the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), and introduces different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph.
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences.

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This work proposes a new algorithm, called BEETLE, for effective online learning that is computationally efficient while minimizing the amount of exploration, and takes a Bayesian model-based approach, framing RL as a partially observable Markov decision process.
Abstract: Reinforcement learning (RL) was originally proposed as a framework to allow agents to learn in an online fashion as they interact with their environment. Existing RL algorithms come short of achieving this goal because the amount of exploration required is often too costly and/or too time consuming for online learning. As a result, RL is mostly used for offline learning in simulated environments. We propose a new algorithm, called BEETLE, for effective online learning that is computationally efficient while minimizing the amount of exploration. We take a Bayesian model-based approach, framing RL as a partially observable Markov decision process. Our two main contributions are the analytical derivation that the optimal value function is the upper envelope of a set of multivariate polynomials, and an efficient point-based value iteration algorithm that exploits this simple parameterization.

Proceedings ArticleDOI
08 May 2006
TL;DR: Interestingly and almost as a side effect, Policy Reuse also identifies classes of similar policies, revealing a basis of core policies of the domain, demonstrating that such a basis can be built incrementally, contributing to the learning of the structure of a domain.
Abstract: We contribute Policy Reuse as a technique to improve a reinforcement learning agent with guidance from past learned similar policies. Our method relies on using the past policies as a probabilistic bias where the learning agent faces three choices: the exploitation of the ongoing learned policy, the exploration of random unexplored actions, and the exploitation of past policies. We introduce the algorithm and its major components: an exploration strategy to include the new reuse bias, and a similarity function to estimate the similarity of past policies with respect to a new one. We provide empirical results demonstrating that Policy Reuse improves the learning performance over different strategies that learn without reuse. Interestingly and almost as a side effect, Policy Reuse also identifies classes of similar policies revealing a basis of core policies of the domain. We demonstrate that such a basis can be built incrementally, contributing to the learning of the structure of a domain.
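
A minimal sketch of the reuse-biased exploration the abstract refers to: with probability psi follow the reused past policy, otherwise act epsilon-greedily on the new task's Q-function, with psi decaying within the episode. The environment callbacks (env_reset, env_step), the decay schedule, and the single reused policy are illustrative assumptions.

```python
import random
from collections import defaultdict

def pi_reuse_action(state, Q, past_policy, psi, epsilon, actions):
    """With probability psi follow the reused past policy, otherwise act
    epsilon-greedily on the Q-function of the new task."""
    if random.random() < psi:
        return past_policy(state)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def reuse_episode(env_reset, env_step, Q, past_policy, actions,
                  psi0=1.0, decay=0.95, epsilon=0.1, alpha=0.1, gamma=0.95, max_steps=100):
    """One Q-learning episode biased by a past policy; psi decays within the
    episode so early steps lean on the reused policy, later steps on the new one."""
    s, psi = env_reset(), psi0
    for _ in range(max_steps):
        a = pi_reuse_action(s, Q, past_policy, psi, epsilon, actions)
        s_next, r, done = env_step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s, psi = s_next, psi * decay
        if done:
            break
    return Q

Q = defaultdict(float)   # env_reset / env_step / past_policy are supplied by the caller
```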

Proceedings Article
16 Jul 2006
TL;DR: The importance of understanding the human-teacher/robot-learner system as a whole in order to design algorithms that support how people want to teach while simultaneously improving the robot's learning performance is demonstrated.
Abstract: As robots become a mass consumer product, they will need to learn new skills by interacting with typical human users. Past approaches have adapted reinforcement learning (RL) to accept a human reward signal; however, we question the implicit assumption that people shall only want to give the learner feedback on its past actions. We present findings from a human user study showing that people use the reward signal not only to provide feedback about past actions, but also to provide future directed rewards to guide subsequent actions. Given this, we made specific modifications to the simulated RL robot to incorporate guidance. We then analyze and evaluate its learning performance in a second user study, and we report significant improvements on several measures. This work demonstrates the importance of understanding the human-teacher/robot-learner system as a whole in order to design algorithms that support how people want to teach while simultaneously improving the robot's learning performance.

Journal ArticleDOI
TL;DR: It is pointed out how the fine arts can be formally understood as a consequence of the basic principle: given some subjective observer, great works of art and music yield observation histories exhibiting more novel, previously unknown compressibility/regularity/predictability than lesser works, thus deepening the observer’s understanding of the world and what is possible in it.
Abstract: Even in the absence of external reward, babies and scientists and others explore their world. Using some sort of adaptive predictive world model, they improve their ability to answer questions such as what happens if I do this or that? They lose interest in both the predictable things and those predicted to remain unpredictable despite some effort. One can design curious robots that do the same. The author’s basic idea (1990, 1991) for doing so is that a reinforcement learning (RL) controller is rewarded for action sequences that improve the predictor. Here, this idea is revisited in the context of recent results on optimal predictors and optimal RL machines. Several new variants of the basic principle are proposed. Finally, it is pointed out how the fine arts can be formally understood as a consequence of the principle: given some subjective observer, great works of art and music yield observation histories exhibiting more novel, previously unknown compressibility/regularity/predictability (with respect to th...
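
A toy sketch of the basic principle: the intrinsic reward is the improvement of an adaptive predictor rather than its raw error, so both the already-predictable and the irreducibly unpredictable eventually become boring. The linear predictor and learning rate are illustrative; the paper's formulation is far more general.

```python
import numpy as np

class CuriosityReward:
    """Intrinsic reward sketch in the spirit of the paper: reward equals the
    *improvement* of an adaptive predictor, not the raw prediction error."""
    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)
        self.lr = lr

    def step(self, x, target):
        err_before = (target - self.w @ x) ** 2
        self.w += self.lr * (target - self.w @ x) * x      # one gradient step on the predictor
        err_after = (target - self.w @ x) ** 2
        return max(0.0, err_before - err_after)            # learning progress as reward

# Usage: predictable signals yield progress at first, pure noise yields ~none
curio = CuriosityReward(n_features=3)
x = np.array([1.0, 0.5, -0.2])
r_intrinsic = curio.step(x, target=0.7)
```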

Journal ArticleDOI
TL;DR: This paper studies the consequences of importing this competition into a reinforcement learning context, demonstrates the resulting effects in an omission schedule and a maze navigation task, and discusses how it may be disciplined.

Book ChapterDOI
16 Aug 2006
TL;DR: Two kernel-based reinforcement learning algorithms, the ε-insensitive kernel-based reinforcement learning (ε-KRL) and the least-squares kernel-based reinforcement learning (LS-KRL), are proposed, and an example shows that the proposed methods can deal effectively with the reinforcement learning problem without having to explore many states.
Abstract: We consider the problem of approximating the cost-to-go functions in reinforcement learning. By mapping the state implicitly into a feature space, we perform a simple algorithm in the feature space, which corresponds to a complex algorithm in the original state space. Two kernel-based reinforcement learning algorithms, the ε-insensitive kernel-based reinforcement learning (ε-KRL) and the least-squares kernel-based reinforcement learning (LS-KRL), are proposed. An example shows that the proposed methods can deal effectively with the reinforcement learning problem without having to explore many states.

Journal Article
TL;DR: A fully implemented instantiation of evolutionary function approximation is presented which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method, and the resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators.
Abstract: Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice.

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper presents a hybrid algorithm that requires only an approximate model, and only a small number of real-life trials, and achieves near-optimal performance in the real system, even when the model is only approximate.
Abstract: In the model-based policy search approach to reinforcement learning (RL), policies are found using a model (or "simulator") of the Markov decision process. However, for high-dimensional continuous-state tasks, it can be extremely difficult to build an accurate model, and thus often the algorithm returns a policy that works in simulation but not in real-life. The other extreme, model-free RL, tends to require infeasibly large numbers of real-life trials. In this paper, we present a hybrid algorithm that requires only an approximate model, and only a small number of real-life trials. The key idea is to successively "ground" the policy evaluations using real-life trials, but to rely on the approximate model to suggest local changes. Our theoretical results show that this algorithm achieves near-optimal performance in the real system, even when the model is only approximate. Empirical results also demonstrate that---when given only a crude model and a small number of real-life trials---our algorithm can obtain near-optimal performance in the real system.

Journal ArticleDOI
TL;DR: Noise is applied to prevent early convergence of the cross-entropy method, using Tetris, a computer game, for demonstration, and the resulting policy outperforms previous RL algorithms by almost two orders of magnitude.
Abstract: The cross-entropy method is an efficient and general optimization algorithm. However, its applicability in reinforcement learning (RL) seems to be limited because it often converges to suboptimal policies. We apply noise for preventing early convergence of the cross-entropy method, using Tetris, a computer game, for demonstration. The resulting policy outperforms previous RL algorithms by almost two orders of magnitude.
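
A minimal sketch of the noisy cross-entropy method: standard CE fits a Gaussian to the elite samples, and the modification is the extra variance added each iteration so the search distribution cannot collapse prematurely. The constant noise term and the toy objective standing in for a Tetris evaluator are illustrative; the paper studies particular noise schedules on linear Tetris feature weights.

```python
import numpy as np

def noisy_cross_entropy(score, dim, iterations=50, pop=100, elite_frac=0.1, noise=4.0):
    """Noisy CE sketch: fit a Gaussian to the elite samples, but add extra
    variance ('noise') each iteration to prevent early convergence."""
    mean, var = np.zeros(dim), np.ones(dim) * 100.0
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iterations):
        samples = np.random.randn(pop, dim) * np.sqrt(var) + mean
        scores = np.array([score(w) for w in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        var = elite.var(axis=0) + noise        # the added noise term is the key modification
    return mean

# Toy objective standing in for a Tetris evaluator of a linear state-value weight vector
best_w = noisy_cross_entropy(lambda w: -np.sum((w - 3.0) ** 2), dim=5)
```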

Proceedings Article
04 Dec 2006
TL;DR: Upper confidence bounds are used to show that the UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
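
UCRL itself performs extended value iteration over a confidence region of plausible MDPs; as a rough, simplified stand-in for that optimism-in-the-face-of-uncertainty loop, the sketch below uses an empirical model plus a count-based reward bonus. The array shapes, bonus form, and discounting are assumptions and differ from the undiscounted regret setting of the paper.

```python
import numpy as np

def optimistic_value_iteration(counts, rew_sum, trans_counts, n_s, n_a,
                               bonus_scale=1.0, gamma=0.95, iters=200):
    """Simplified optimism sketch: empirical model plus a count-based reward
    bonus, then ordinary value iteration on the optimistic MDP."""
    n = np.maximum(counts, 1)
    r_hat = rew_sum / n + bonus_scale / np.sqrt(n)          # optimistic reward estimate
    p_hat = trans_counts / np.maximum(trans_counts.sum(axis=2, keepdims=True), 1)
    V = np.zeros(n_s)
    for _ in range(iters):
        V = np.max(r_hat + gamma * p_hat @ V, axis=1)
    return np.argmax(r_hat + gamma * p_hat @ V, axis=1)     # greedy policy w.r.t. optimistic Q

# Shapes assumed: counts and rew_sum are (n_s, n_a); trans_counts is (n_s, n_a, n_s)
```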

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper introduces the use of learned shaping rewards in reinforcement learning tasks, where an agent uses prior experience on a sequence of tasks to learn a portable predictor that estimates intermediate rewards, resulting in accelerated learning in later tasks that are related but distinct.
Abstract: We introduce the use of learned shaping rewards in reinforcement learning tasks, where an agent uses prior experience on a sequence of tasks to learn a portable predictor that estimates intermediate rewards, resulting in accelerated learning in later tasks that are related but distinct. Such agents can be trained on a sequence of relatively easy tasks in order to develop a more informative measure of reward that can be transferred to improve performance on more difficult tasks without requiring a hand coded shaping function. We use a rod positioning task to show that this significantly improves performance even after a very brief training period.
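
One plausible way to realize the idea of a portable learned shaping signal, sketched under assumptions: fit a regressor from agent-centric features to observed returns on earlier tasks, then inject its prediction on the new task, here in potential-based form. The feature/return tuple format and the Ridge regressor are illustrative, and the potential-based plumbing is one standard choice rather than necessarily the paper's exact mechanism.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_shaping_predictor(source_transitions):
    """Fit a portable predictor of return-to-go from agent-centric features
    gathered on earlier (easier) tasks."""
    X = np.array([f for f, _ in source_transitions])
    y = np.array([g for _, g in source_transitions])   # observed returns-to-go
    return Ridge(alpha=1.0).fit(X, y)

def shaped_reward(r, features, features_next, predictor, gamma=0.99):
    """Use the learned predictor as a potential function for shaping on a new,
    related task: F = gamma * phi(s') - phi(s)."""
    phi = predictor.predict(np.array([features]))[0]
    phi_next = predictor.predict(np.array([features_next]))[0]
    return r + gamma * phi_next - phi

# source_transitions: list of (feature_vector, return_to_go) pairs from earlier tasks
```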

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This work presents the first large-scale empirical application of reinforcement learning to the important problem of optimized trade execution in modern financial markets, based on 1.5 years of millisecond time-scale limit order data from NASDAQ.
Abstract: We present the first large-scale empirical application of reinforcement learning to the important problem of optimized trade execution in modern financial markets. Our experiments are based on 1.5 years of millisecond time-scale limit order data from NASDAQ, and demonstrate the promise of reinforcement learning methods for market microstructure problems. Our learning algorithm introduces and exploits a natural "low-impact" factorization of the state space.

Journal ArticleDOI
TL;DR: Opposition-based reinforcement learning, inspired by opposition-based learning, is introduced to speed up convergence: considering opposite actions simultaneously enables individual states to be updated more than once, shortening exploration and expediting convergence.
Abstract: Reinforcement learning is a machine intelligence scheme for learning in highly dynamic, probabilistic environments. By interaction with the environment, reinforcement agents learn optimal control policies, especially in the absence of a priori knowledge and/or a sufficiently large amount of training data. Despite its advantages, however, reinforcement learning suffers from a major drawback – high calculation cost, because convergence to an optimal solution usually requires that all states be visited frequently to ensure that the policy is reliable. This is not always possible, however, due to the complex, high-dimensional state space in many applications. This paper introduces opposition-based reinforcement learning, inspired by opposition-based learning, to speed up convergence. Considering opposite actions simultaneously enables individual states to be updated more than once, shortening exploration and expediting convergence. Three versions of the Q-learning algorithm will be given as examples. Experimental results for the grid world problem of different sizes demonstrate the superior performance of the proposed approach.
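
A minimal sketch of the opposition idea in a grid-world setting: each real transition triggers an ordinary Q-learning update for the taken action plus a second update for its opposite action, so two entries per state improve per step. The opposite-action mapping and the caller-supplied opposite_outcome estimator are assumptions; the paper presents three concrete Q-learning variants.

```python
from collections import defaultdict

OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def opposition_q_update(Q, s, a, r, s_next, opposite_outcome,
                        alpha=0.1, gamma=0.95, actions=("up", "down", "left", "right")):
    """Besides the ordinary update for the taken action, also update the
    opposite action using its (estimated) outcome."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    a_opp = OPPOSITE[a]
    r_opp, s_opp = opposite_outcome(s, a_opp)        # caller-supplied model or estimate
    best_opp = max(Q[(s_opp, b)] for b in actions)
    Q[(s, a_opp)] += alpha * (r_opp + gamma * best_opp - Q[(s, a_opp)])
    return Q

Q = defaultdict(float)   # opposite_outcome encodes how the opposite action would have ended
```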

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This work proposes to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error, or on the temporal difference (TD) error.
Abstract: We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castanon (1989) who proposed a method for automatically aggregating states to speed up value iteration. We propose to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error, or on the temporal difference (TD) error. We then place basis functions in the lower-dimensional space. These are added as new features for the linear function approximator. This approach is applied to a high-dimensional inventory control problem.

Proceedings ArticleDOI
25 Jun 2006
TL;DR: It is shown that RL-CD performs better than two standard reinforcement learning algorithms and that it has advantages over methods specifically designed to cope with non-stationarity.
Abstract: In this paper we introduce RL-CD, a method for solving reinforcement learning problems in non-stationary environments. The method is based on a mechanism for creating, updating and selecting one among several partial models of the environment. The partial models are incrementally built according to the system's capability of making predictions regarding a given sequence of observations. We propose, formalize and show the efficiency of this method both in a simple non-stationary environment and in a noisy scenario. We show that RL-CD performs better than two standard reinforcement learning algorithms and that it has advantages over methods specifically designed to cope with non-stationarity. Finally, we present known limitations of the method and future works.
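
A minimal sketch of the context-detection mechanism described above: keep several partial models, track an exponentially smoothed measure of how well each predicts the observation stream, activate the best one, and spawn a new model when none explains the data well. The count-based transition model, smoothing rate, and threshold are illustrative simplifications of the paper's quality measure (which also covers reward prediction).

```python
class PartialModel:
    """Tiny count-based predictor of (s, a) -> s' used to score how well a
    partial model explains the current stream of observations."""
    def __init__(self):
        self.counts = {}

    def prob(self, s, a, s_next):
        c = self.counts.get((s, a), {})
        total = sum(c.values())
        return c.get(s_next, 0) / total if total else 1e-2   # small prior for unseen pairs

    def update(self, s, a, s_next):
        self.counts.setdefault((s, a), {}).setdefault(s_next, 0)
        self.counts[(s, a)][s_next] += 1

def rlcd_step(models, qualities, s, a, s_next, rho=0.1, threshold=0.05):
    """Track a smoothed prediction quality per partial model; switch to the best
    one, and spawn a new model when none explains the data (a likely context change)."""
    for i, m in enumerate(models):
        qualities[i] += rho * (m.prob(s, a, s_next) - qualities[i])
    best = max(range(len(models)), key=lambda i: qualities[i])
    if qualities[best] < threshold:
        models.append(PartialModel())
        qualities.append(threshold)        # optimistic start for the new model
        best = len(models) - 1
    models[best].update(s, a, s_next)      # only the active model is trained
    return best

models, qualities = [PartialModel()], [0.5]
```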

Journal ArticleDOI
TL;DR: In this article, an adaptive reinforcement learning (ARL) algorithm is used to trade foreign exchange markets and relies on a layered structure consisting of a machine learning algorithm, a risk management overlay and a dynamic utility optimization layer.
Abstract: This paper introduces adaptive reinforcement learning (ARL) as the basis for a fully automated trading system application. The system is designed to trade foreign exchange (FX) markets and relies on a layered structure consisting of a machine learning algorithm, a risk management overlay and a dynamic utility optimization layer. An existing machine-learning method called recurrent reinforcement learning (RRL) was chosen as the underlying algorithm for ARL. One of the strengths of our approach is that the dynamic optimization layer makes a fixed choice of model tuning parameters unnecessary. It also allows for a risk-return trade-off to be made by the user within the system. The trading system is able to make consistent gains out-of-sample while avoiding large draw-downs.

Journal ArticleDOI
TL;DR: A "heterarchical reinforcement learning" model is proposed, where reward prediction made by more limbic and cognitive loops is propagated to motor loops by spiral projections between the striatum and substantia nigra, assisted by cortical projections to the pedunculopontine tegmental nucleus, which sends excitatory input to the substantia Nigra.

Journal ArticleDOI
TL;DR: The multi-agent HRL framework is extended to include communication decisions and a cooperative multi-agent HRL algorithm called COM-Cooperative HRL is proposed, which allows agents to learn coordination faster by sharing information at the level of cooperative subtasks.
Abstract: In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. Those levels of the hierarchy which include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. We also address the issue of rational communication behavior among autonomous agents in this paper. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. Before an agent makes a decision at a cooperative subtask, it decides if it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm as well as the relation between the communication cost and the learned communication policy using a multi-agent taxi problem.