
Showing papers on "Reinforcement learning published in 2006"


Proceedings Article
04 Dec 2006
TL;DR: This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel, using differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR).
Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR).
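
As a reference point for the control machinery described above (not the paper's own code), the finite-horizon LQR backward pass that DDP generalizes can be sketched as follows; the double-integrator dynamics and the cost weights are illustrative placeholders, and the real helicopter dynamics are nonlinear and learned from pilot data.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Q_final, horizon):
    """Finite-horizon discrete-time LQR: returns time-varying feedback gains K_t.

    DDP (used in the paper) generalizes this recursion by linearizing nonlinear
    dynamics and quadratizing the cost around a nominal trajectory at each iteration.
    """
    P = Q_final                      # value-function Hessian at the final step
    gains = []
    for _ in range(horizon):
        # K_t = (R + B^T P B)^{-1} B^T P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati recursion for the cost-to-go
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]               # gains ordered from t = 0 to horizon - 1

# Example: double integrator with illustrative cost weights
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K_seq = lqr_backward_pass(A, B, np.eye(2), np.eye(1), 10 * np.eye(2), horizon=50)
u0 = -K_seq[0] @ np.array([1.0, 0.0])   # control at t = 0 for state [1, 0]
```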

621 citations


Journal Article
TL;DR: A framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability) is described, and model-based and model-free variants of the elimination method are provided.
Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²)log(1/δ)) times to find an ε-optimal arm with probability of at least 1-δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
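
A minimal sketch of the action-elimination idea for the bandit case, assuming rewards in [0, 1] and a generic Hoeffding-style confidence radius; the `pull` callback, batch size, and logarithmic term are illustrative choices, not the paper's exact procedure or constants.

```python
import math
import random

def action_elimination(pull, n_arms, eps, delta, batch=100, max_rounds=200):
    """Keep pulling surviving arms; drop any arm whose upper confidence bound
    falls below the best lower confidence bound (minus eps)."""
    active = set(range(n_arms))
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for _ in range(max_rounds):
        for a in active:
            for _ in range(batch):
                r = pull(a)                       # reward assumed in [0, 1]
                counts[a] += 1
                means[a] += (r - means[a]) / counts[a]
        # Hoeffding radius; the log term is one standard choice, not the paper's constant
        rad = {a: math.sqrt(math.log(4 * n_arms * counts[a] ** 2 / delta) / (2 * counts[a]))
               for a in active}
        best_lcb = max(means[a] - rad[a] for a in active)
        active = {a for a in active if means[a] + rad[a] >= best_lcb - eps}
        if len(active) == 1 or max(rad.values()) < eps / 2:
            break
    return max(active, key=lambda a: means[a])

# Toy usage: Bernoulli arms with hidden success probabilities
probs = [0.2, 0.5, 0.55, 0.8]
best = action_elimination(lambda a: float(random.random() < probs[a]),
                          n_arms=4, eps=0.05, delta=0.05)
```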

604 citations


Proceedings ArticleDOI
01 Oct 2006
TL;DR: This paper gives an overview of learning with policy gradient methods for robotics, with a strong focus on recent advances in the field, and shows how the most recently developed methods can significantly improve learning performance.
Abstract: The acquisition and improvement of motor skills and control policies for robotics from trial and error is of essential importance if robots are ever to leave precisely pre-structured environments. However, to date only a few existing reinforcement learning methods have been scaled into the domains of high-dimensional robots such as manipulator, legged or humanoid robots. Policy gradient methods remain one of the few exceptions and have found a variety of applications. Nevertheless, the application of such methods is not without peril if done in an uninformed manner. In this paper, we give an overview of learning with policy gradient methods for robotics with a strong focus on recent advances in the field. We outline previous applications to robotics and show how the most recently developed methods can significantly improve learning performance. Finally, we evaluate our most promising algorithm in the application of hitting a baseball with an anthropomorphic arm.
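
As a baseline illustration of the family of methods this overview covers, here is vanilla episodic REINFORCE on a toy chain task, without the variance-reduction and natural-gradient refinements the paper emphasizes; the environment and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode(theta, n_states=5, max_steps=50):
    """Toy chain: action 1 moves right, action 0 moves left; reward 1 at the right end."""
    s, traj = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[s])
        a = rng.choice(2, p=probs)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        traj.append((s, a, r))
        s = s_next
        if r > 0:
            break
    return traj

def reinforce(n_states=5, episodes=500, alpha=0.1, gamma=0.95):
    theta = np.zeros((n_states, 2))          # one logit per (state, action)
    for _ in range(episodes):
        traj = run_episode(theta, n_states)
        G = 0.0
        for s, a, r in reversed(traj):       # Monte-Carlo returns, computed backwards
            G = r + gamma * G
            probs = softmax(theta[s])
            grad_log = -probs                # d log pi(a|s) / d theta[s]
            grad_log[a] += 1.0
            theta[s] += alpha * G * grad_log
    return theta

theta = reinforce()
```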

598 citations


Book ChapterDOI
01 Jan 2006
TL;DR: A successful application of reinforcement learning to designing a controller for sustained inverted flight on an autonomous helicopter, using a stochastic, nonlinear model of the helicopter’s dynamics.
Abstract: Helicopters have highly stochastic, nonlinear dynamics, and autonomous helicopter flight is widely regarded to be a challenging control problem. As helicopters are highly unstable at low speeds, it is particularly difficult to design controllers for low speed aerobatic maneuvers. In this paper, we describe a successful application of reinforcement learning to designing a controller for sustained inverted flight on an autonomous helicopter. Using data collected from the helicopter in flight, we began by learning a stochastic, nonlinear model of the helicopter’s dynamics. Then, a reinforcement learning algorithm was applied to automatically learn a controller for autonomous inverted hovering. Finally, the resulting controller was successfully tested on our autonomous helicopter platform.

587 citations


Journal ArticleDOI
TL;DR: The model successfully captures patterns of behavior resulting from OFC damage in decision making, reversal learning, and devaluation paradigms and makes additional predictions for the underlying source of these deficits.
Abstract: The authors explore the division of labor between the basal ganglia-dopamine (BG-DA) system and the orbitofrontal cortex (OFC) in decision making. They show that a primitive neural network model of the BG-DA system slowly learns to make decisions on the basis of the relative probability of rewards but is not as sensitive to (a) recency or (b) the value of specific rewards. An augmented model that explores BG-OFC interactions is more successful at estimating the true expected value of decisions and is faster at switching behavior when reinforcement contingencies change. In the augmented model, OFC areas exert top-down control on the BG and premotor areas by representing reinforcement magnitudes in working memory. The model successfully captures patterns of behavior resulting from OFC damage in decision making, reversal learning, and devaluation paradigms and makes additional predictions for the underlying source of these deficits.

580 citations


Journal ArticleDOI
TL;DR: In all 4 studies, the best-performing strategy from the participants' repertoires most accurately predicted the inferences after sufficient learning opportunities, and when testing SSL against 3 models representing extensions of SSL and against an exemplar model assuming a memory-based inference process, the authors found that SSL predicted the inferences most accurately.
Abstract: The assumption that people possess a repertoire of strategies to solve the inference problems they face has been raised repeatedly. However, a computational model specifying how people select strategies from their repertoire is still lacking. The proposed strategy selection learning (SSL) theory predicts a strategy selection process on the basis of reinforcement learning. The theory assumes that individuals develop subjective expectations for the strategies they have and select strategies proportional to their expectations, which are then updated on the basis of subsequent experience. The learning assumption was supported in 4 experimental studies. Participants substantially improved their inferences through feedback. In all 4 studies, the best-performing strategy from the participants' repertoires most accurately predicted the inferences after sufficient learning opportunities. When testing SSL against 3 models representing extensions of SSL and against an exemplar model assuming a memory-based inference process, the authors found that SSL predicted the inferences most accurately.
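
A minimal sketch of the selection rule the SSL theory describes: strategies are chosen in proportion to subjective expectancies, which are then reinforced by obtained payoffs. The class name, initial expectancy, learning rate, and toy strategies below are illustrative assumptions, not the fitted model from the studies.

```python
import random

class SSLSelector:
    """Strategy selection learning sketch: pick strategies in proportion to
    subjective expectancies and reinforce them with obtained payoffs."""
    def __init__(self, strategies, initial_expectancy=10.0, learning_rate=1.0):
        self.strategies = strategies                     # list of callables
        self.expectancies = [initial_expectancy] * len(strategies)
        self.lr = learning_rate

    def choose(self):
        total = sum(self.expectancies)
        r, acc = random.uniform(0, total), 0.0
        for i, e in enumerate(self.expectancies):
            acc += e
            if r <= acc:
                return i
        return len(self.expectancies) - 1

    def update(self, i, payoff):
        # Reinforce the chosen strategy; the floor keeps expectancies positive
        self.expectancies[i] = max(1e-3, self.expectancies[i] + self.lr * payoff)

# Usage: two toy inference strategies scored by a feedback function
guess_small = lambda pair: min(pair)
guess_large = lambda pair: max(pair)
ssl = SSLSelector([guess_small, guess_large])
for _ in range(100):
    pair = (random.randint(1, 9), random.randint(1, 9))
    i = ssl.choose()
    answer = ssl.strategies[i](pair)
    payoff = 1.0 if answer == max(pair) else -1.0        # feedback favors guess_large
    ssl.update(i, payoff)
```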

551 citations


Journal ArticleDOI
TL;DR: The results suggest that brain regions, such as the vmPFC, use an abstract model of task structure to guide behavioral choice, computations that may underlie the human capacity for complex social interactions and abstract strategizing.
Abstract: Many real-life decision-making problems incorporate higher-order structure, involving interdependencies between different stimuli, actions, and subsequent rewards. It is not known whether brain regions implicated in decision making, such as the ventromedial prefrontal cortex (vmPFC), use a stored model of the task structure to guide choice (model-based decision making) or merely learn action or state values without assuming higher-order structure as in standard reinforcement learning. To discriminate between these possibilities, we scanned human subjects with functional magnetic resonance imaging while they performed a simple decision-making task with higher-order structure, probabilistic reversal learning. We found that neural activity in a key decision-making region, the vmPFC, was more consistent with a computational model that exploits higher-order structure than with simple reinforcement learning. These results suggest that brain regions, such as the vmPFC, use an abstract model of task structure to guide behavioral choice, computations that may underlie the human capacity for complex social interactions and abstract strategizing.

482 citations


Proceedings ArticleDOI
25 Jun 2006
TL;DR: This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience, and Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
Abstract: For a Markov Decision Process with finite state (size S) and action spaces (size A per state), we propose a new algorithm---Delayed Q-Learning. We prove it is PAC, achieving near optimal performance except for O(SA) timesteps using O(SA) space, improving on the O(S²A) bounds of best previous algorithms. This result proves efficient reinforcement learning is possible without learning a model of the MDP from experience. Learning takes place from a single continuous thread of experience---no resets nor parallel sampling is used. Beyond its smaller storage and experience requirements, Delayed Q-learning's per-experience computation cost is much less than that of previous PAC algorithms.
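
A simplified sketch of the batched, optimistic update at the core of Delayed Q-learning: Q-values start at V_max, m targets are collected per state-action pair, and a single averaged update is applied only if it lowers the value by at least 2*eps1. The LEARN-flag bookkeeping the full algorithm needs for its PAC guarantee is omitted, and the constants are placeholders.

```python
from collections import defaultdict

def delayed_q_update(Q, buffers, s, a, r, s_next,
                     gamma=0.95, m=20, eps1=0.01, n_actions=4):
    """Simplified core of Delayed Q-learning: collect m targets for (s, a), then
    apply one averaged, optimistic update only if it lowers Q(s, a) enough."""
    target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
    buffers[(s, a)].append(target)
    if len(buffers[(s, a)]) == m:
        avg = sum(buffers[(s, a)]) / m
        if Q[(s, a)] - avg >= 2 * eps1:        # only "successful" updates change Q
            Q[(s, a)] = avg + eps1             # stay slightly optimistic
        buffers[(s, a)] = []
    return Q

# Optimistic initialization: with rewards in [0, 1], V_max = 1 / (1 - gamma)
gamma = 0.95
Q = defaultdict(lambda: 1.0 / (1 - gamma))
buffers = defaultdict(list)
Q = delayed_q_update(Q, buffers, s=0, a=1, r=0.5, s_next=2)
```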

474 citations


Proceedings Article
04 Dec 2006
TL;DR: A class of MDPs is introduced which greatly simplify reinforcement learning: they have discrete state spaces and continuous control spaces, and they enable efficient approximations to traditional MDPs.
Abstract: We introduce a class of MDPs which greatly simplify Reinforcement Learning. They have discrete state spaces and continuous control spaces. The controls have the effect of rescaling the transition probabilities of an underlying Markov chain. A control cost penalizing KL divergence between controlled and uncontrolled transition probabilities makes the minimization problem convex, and allows analytical computation of the optimal controls given the optimal value function. An exponential transformation of the optimal value function makes the minimized Bellman equation linear. Apart from their theoretical significance, the new MDPs enable efficient approximations to traditional MDPs. Shortest path problems are approximated to arbitrary precision with largest eigenvalue problems, yielding an O(n) algorithm. Accurate approximations to generic MDPs are obtained via continuous embedding reminiscent of LP relaxation in integer programming. Off-policy learning of the optimal value function is possible without need for state-action values; the new algorithm (Z-learning) outperforms Q-learning.
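
A minimal sketch of the Z-learning idea described above: learn the desirability function z = exp(-v) directly from transitions of the uncontrolled (passive) chain, exploiting the fact that the exponentiated Bellman equation is linear. The 1-D random-walk environment, episode counts, and learning rate are illustrative assumptions.

```python
import numpy as np

def z_learning(sample_passive_step, state_cost, n_states, goal, episodes=2000, alpha=0.1):
    """Z-learning sketch: learn the desirability function z = exp(-v) from
    transitions of the *uncontrolled* (passive) Markov chain; no actions needed."""
    z = np.ones(n_states)
    for _ in range(episodes):
        s = np.random.randint(n_states)
        for _ in range(100):
            s_next = sample_passive_step(s)
            # TD-style update of the linear Bellman equation z(s) = e^{-cost(s)} E[z(s')]
            z[s] += alpha * (np.exp(-state_cost(s)) * z[s_next] - z[s])
            s = s_next
            if s == goal:
                break
    return -np.log(z)        # recover the value (cost-to-go) function

# Toy 1-D random walk with unit cost per step and a zero-cost absorbing goal
n, goal = 10, 9
step = lambda s: s if s == goal else min(max(s + np.random.choice([-1, 1]), 0), n - 1)
cost = lambda s: 0.0 if s == goal else 1.0
v = z_learning(step, cost, n, goal)
```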

430 citations


Journal ArticleDOI
TL;DR: The role of the dialogue manager in a spoken dialogue system is summarized, a short introduction to reinforcement-learning of dialogue management strategies is given, the literature on user modelling for simulation-based strategy learning is reviewed and recent work on user model evaluation is described.
Abstract: Within the broad field of spoken dialogue systems, the application of machine-learning approaches to dialogue management strategy design is a rapidly growing research area. The main motivation is the hope of building systems that learn through trial-and-error interaction what constitutes a good dialogue strategy. Training of such systems could in theory be done using human users or using corpora of human–computer dialogue, but in practice the typically vast space of possible dialogue states and strategies cannot be explored without the use of automatic user simulation tools. This requirement for training statistical dialogue models has created an interesting new application area for predictive statistical user modelling, and a variety of different techniques for simulating user behaviour have been presented in the literature, ranging from simple Markov models to Bayesian networks. The development of reliable user simulation tools is critical to further progress on automatic dialogue management design, but it holds many challenges, some of which have been encountered in other areas of current research on statistical user modelling, such as the problem of ‘concept drift’, the problem of combining content-based and collaboration-based modelling techniques, and user model evaluation. The latter topic is of particular interest, because simulation-based learning is currently one of the few applications of statistical user modelling that employs both direct ‘accuracy-based’ and indirect ‘utility-based’ evaluation techniques. In this paper, we briefly summarize the role of the dialogue manager in a spoken dialogue system, give a short introduction to reinforcement-learning of dialogue management strategies and review the literature on user modelling for simulation-based strategy learning. We further describe recent work on user model evaluation and discuss some of the current research issues in simulation-based learning from a user modelling perspective.

378 citations


Proceedings ArticleDOI
12 Jun 2006
TL;DR: In this article, the authors combine the strengths of reinforcement learning and queuing models in a hybrid approach in which RL trains offline on data collected while a queuing model policy controls the system.
Abstract: Reinforcement Learning (RL) provides a promising new approach to systems performance management that differs radically from standard queuing-theoretic approaches making use of explicit system performance models. In principle, RL can automatically learn high-quality management policies without an explicit performance model or traffic model and with little or no built-in system specific knowledge. In our original work [1], [2], [3] we showed the feasibility of using online RL to learn resource valuation estimates (in lookup table form) which can be used to make high-quality server allocation decisions in a multi-application prototype Data Center scenario. The present work shows how to combine the strengths of both RL and queuing models in a hybrid approach in which RL trains offline on data collected while a queuing model policy controls the system. By training offline we avoid suffering potentially poor performance in live online training. We also now use RL to train nonlinear function approximators (e.g. multi-layer perceptrons) instead of lookup tables; this enables scaling to substantially larger state spaces. Our results now show that in both open-loop and closed-loop traffic, hybrid RL training can achieve significant performance improvements over a variety of initial model-based policies. We also find that, as expected, RL can deal effectively with both transients and switching delays, which lie outside the scope of traditional steady-state queuing theory.
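
A rough sketch of the offline, hybrid training setup described above: fit a nonlinear value estimator to transitions logged while an external (queuing-model) policy controlled the system, with no live exploration. The tuple format, the use of scikit-learn's MLPRegressor, and the fitted-iteration loop are assumptions for illustration, not the authors' exact training procedure.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def hybrid_offline_training(logged, sweeps=10, gamma=0.9):
    """Offline, SARSA-style training sketch on data logged while an external
    (e.g. queuing-model) policy controlled the system; a small MLP replaces the
    lookup table so larger state spaces can be handled."""
    X = np.array([np.append(s, a) for s, a, _, _, _ in logged])
    X_next = np.array([np.append(s2, a2) for _, _, _, s2, a2 in logged])
    r = np.array([rew for _, _, rew, _, _ in logged], dtype=float)

    q = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=1000)
    q.fit(X, r)                                   # initialize on immediate rewards
    for _ in range(sweeps):
        targets = r + gamma * q.predict(X_next)   # bootstrap from the current fit
        q.fit(X, targets)                         # refit on the new targets
    return q

# logged: list of (state_vector, action, reward, next_state_vector, next_action) tuples
```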

Journal ArticleDOI
TL;DR: This article describes a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting using the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), and introduces different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph.
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences.

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This work proposes a new algorithm, called BEETLE, for effective online learning that is computationally efficient while minimizing the amount of exploration, and takes a Bayesian model-based approach, framing RL as a partially observable Markov decision process.
Abstract: Reinforcement learning (RL) was originally proposed as a framework to allow agents to learn in an online fashion as they interact with their environment. Existing RL algorithms come short of achieving this goal because the amount of exploration required is often too costly and/or too time consuming for online learning. As a result, RL is mostly used for offline learning in simulated environments. We propose a new algorithm, called BEETLE, for effective online learning that is computationally efficient while minimizing the amount of exploration. We take a Bayesian model-based approach, framing RL as a partially observable Markov decision process. Our two main contributions are the analytical derivation that the optimal value function is the upper envelope of a set of multivariate polynomials, and an efficient point-based value iteration algorithm that exploits this simple parameterization.

Proceedings ArticleDOI
08 May 2006
TL;DR: Interestingly and almost as a side effect, Policy Reuse also identifies classes of similar policies, revealing a basis of core policies of the domain, demonstrating that such a basis can be built incrementally, contributing to the learning of the structure of a domain.
Abstract: We contribute Policy Reuse as a technique to improve a reinforcement learning agent with guidance from past learned similar policies. Our method relies on using the past policies as a probabilistic bias where the learning agent faces three choices: the exploitation of the ongoing learned policy, the exploration of random unexplored actions, and the exploitation of past policies. We introduce the algorithm and its major components: an exploration strategy to include the new reuse bias, and a similarity function to estimate the similarity of past policies with respect to a new one. We provide empirical results demonstrating that Policy Reuse improves the learning performance over different strategies that learn without reuse. Interestingly and almost as a side effect, Policy Reuse also identifies classes of similar policies revealing a basis of core policies of the domain. We demonstrate that such a basis can be built incrementally, contributing to the learning of the structure of a domain.
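
A minimal sketch of the reuse-biased exploration the abstract refers to: with probability psi follow the reused past policy, otherwise act epsilon-greedily on the new task's Q-function, with psi decaying within the episode. The environment callbacks (env_reset, env_step), the decay schedule, and the single reused policy are illustrative assumptions.

```python
import random
from collections import defaultdict

def pi_reuse_action(state, Q, past_policy, psi, epsilon, actions):
    """With probability psi follow the reused past policy, otherwise act
    epsilon-greedily on the Q-function of the new task."""
    if random.random() < psi:
        return past_policy(state)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def reuse_episode(env_reset, env_step, Q, past_policy, actions,
                  psi0=1.0, decay=0.95, epsilon=0.1, alpha=0.1, gamma=0.95, max_steps=100):
    """One Q-learning episode biased by a past policy; psi decays within the
    episode so early steps lean on the reused policy, later steps on the new one."""
    s, psi = env_reset(), psi0
    for _ in range(max_steps):
        a = pi_reuse_action(s, Q, past_policy, psi, epsilon, actions)
        s_next, r, done = env_step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s, psi = s_next, psi * decay
        if done:
            break
    return Q

Q = defaultdict(float)   # env_reset / env_step / past_policy are supplied by the caller
```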

Proceedings Article
16 Jul 2006
TL;DR: The importance of understanding the human-teacher/robot-learner system as a whole in order to design algorithms that support how people want to teach while simultaneously improving the robot's learning performance is demonstrated.
Abstract: As robots become a mass consumer product, they will need to learn new skills by interacting with typical human users. Past approaches have adapted reinforcement learning (RL) to accept a human reward signal; however, we question the implicit assumption that people shall only want to give the learner feedback on its past actions. We present findings from a human user study showing that people use the reward signal not only to provide feedback about past actions, but also to provide future directed rewards to guide subsequent actions. Given this, we made specific modifications to the simulated RL robot to incorporate guidance. We then analyze and evaluate its learning performance in a second user study, and we report significant improvements on several measures. This work demonstrates the importance of understanding the human-teacher/robot-learner system as a whole in order to design algorithms that support how people want to teach while simultaneously improving the robot's learning performance.

Journal ArticleDOI
TL;DR: It is pointed out how the fine arts can be formally understood as a consequence of the basic principle: given some subjective observer, great works of art and music yield observation histories exhibiting more novel, previously unknown compressibility/regularity/predictability than lesser works, thus deepening the observer’s understanding of the world and what is possible in it.
Abstract: Even in the absence of external reward, babies and scientists and others explore their world. Using some sort of adaptive predictive world model, they improve their ability to answer questions such as what happens if I do this or that? They lose interest in both the predictable things and those predicted to remain unpredictable despite some effort. One can design curious robots that do the same. The author’s basic idea (1990, 1991) for doing so is that a reinforcement learning (RL) controller is rewarded for action sequences that improve the predictor. Here, this idea is revisited in the context of recent results on optimal predictors and optimal RL machines. Several new variants of the basic principle are proposed. Finally, it is pointed out how the fine arts can be formally understood as a consequence of the principle: given some subjective observer, great works of art and music yield observation histories exhibiting more novel, previously unknown compressibility/regularity/predictability (with respect to th...
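
A toy sketch of the basic principle: the intrinsic reward is the improvement of an adaptive predictor rather than its raw error, so both the already-predictable and the irreducibly unpredictable eventually become boring. The linear predictor and learning rate are illustrative; the paper's formulation is far more general.

```python
import numpy as np

class CuriosityReward:
    """Intrinsic reward sketch in the spirit of the paper: reward equals the
    *improvement* of an adaptive predictor, not the raw prediction error."""
    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)
        self.lr = lr

    def step(self, x, target):
        err_before = (target - self.w @ x) ** 2
        self.w += self.lr * (target - self.w @ x) * x      # one gradient step on the predictor
        err_after = (target - self.w @ x) ** 2
        return max(0.0, err_before - err_after)            # learning progress as reward

# Usage: predictable signals yield progress at first, pure noise yields ~none
curio = CuriosityReward(n_features=3)
x = np.array([1.0, 0.5, -0.2])
r_intrinsic = curio.step(x, target=0.7)
```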

Journal ArticleDOI
TL;DR: This paper studies the consequences of importing this competition into a reinforcement learning context, demonstrates the resulting effects in an omission schedule and a maze navigation task, and discusses how it may be disciplined.

Book ChapterDOI
16 Aug 2006
TL;DR: Two kernel-based reinforcement learning algorithms, the ε-insensitive kernel-based reinforcement learning (ε-KRL) and the least-squares kernel-based reinforcement learning (LS-KRL), are proposed, and an example shows that the proposed methods can deal effectively with the reinforcement learning problem without having to explore many states.
Abstract: We consider the problem of approximating the cost-to-go functions in reinforcement learning. By mapping the state implicitly into a feature space, we perform a simple algorithm in the feature space, which corresponds to a complex algorithm in the original state space. Two kernel-based reinforcement learning algorithms, the ε-insensitive kernel-based reinforcement learning (ε-KRL) and the least-squares kernel-based reinforcement learning (LS-KRL), are proposed. An example shows that the proposed methods can deal effectively with the reinforcement learning problem without having to explore many states.

Journal Article
TL;DR: A fully implemented instantiation of evolutionary function approximation is presented which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method, and the resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators.
Abstract: Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice.

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper presents a hybrid algorithm that requires only an approximate model, and only a small number of real-life trials, and achieves near-optimal performance in the real system, even when the model is only approximate.
Abstract: In the model-based policy search approach to reinforcement learning (RL), policies are found using a model (or "simulator") of the Markov decision process. However, for high-dimensional continuous-state tasks, it can be extremely difficult to build an accurate model, and thus often the algorithm returns a policy that works in simulation but not in real-life. The other extreme, model-free RL, tends to require infeasibly large numbers of real-life trials. In this paper, we present a hybrid algorithm that requires only an approximate model, and only a small number of real-life trials. The key idea is to successively "ground" the policy evaluations using real-life trials, but to rely on the approximate model to suggest local changes. Our theoretical results show that this algorithm achieves near-optimal performance in the real system, even when the model is only approximate. Empirical results also demonstrate that---when given only a crude model and a small number of real-life trials---our algorithm can obtain near-optimal performance in the real system.

Journal ArticleDOI
TL;DR: Noise is applied to prevent early convergence of the cross-entropy method, using Tetris, a computer game, for demonstration, and the resulting policy outperforms previous RL algorithms by almost two orders of magnitude.
Abstract: The cross-entropy method is an efficient and general optimization algorithm. However, its applicability in reinforcement learning (RL) seems to be limited because it often converges to suboptimal policies. We apply noise for preventing early convergence of the cross-entropy method, using Tetris, a computer game, for demonstration. The resulting policy outperforms previous RL algorithms by almost two orders of magnitude.
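
A minimal sketch of the noisy cross-entropy method: standard CE fits a Gaussian to the elite samples, and the modification is the extra variance added each iteration so the search distribution cannot collapse prematurely. The constant noise term and the toy objective standing in for a Tetris evaluator are illustrative; the paper studies particular noise schedules on linear Tetris feature weights.

```python
import numpy as np

def noisy_cross_entropy(score, dim, iterations=50, pop=100, elite_frac=0.1, noise=4.0):
    """Noisy CE sketch: fit a Gaussian to the elite samples, but add extra
    variance ('noise') each iteration to prevent early convergence."""
    mean, var = np.zeros(dim), np.ones(dim) * 100.0
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iterations):
        samples = np.random.randn(pop, dim) * np.sqrt(var) + mean
        scores = np.array([score(w) for w in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        var = elite.var(axis=0) + noise        # the added noise term is the key modification
    return mean

# Toy objective standing in for a Tetris evaluator of a linear state-value weight vector
best_w = noisy_cross_entropy(lambda w: -np.sum((w - 3.0) ** 2), dim=5)
```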

Proceedings Article
04 Dec 2006
TL;DR: Upper confidence bounds are used to show that the UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm's online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
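
UCRL itself performs extended value iteration over a confidence region of plausible MDPs; as a rough, simplified stand-in for that optimism-in-the-face-of-uncertainty loop, the sketch below uses an empirical model plus a count-based reward bonus. The array shapes, bonus form, and discounting are assumptions and differ from the undiscounted regret setting of the paper.

```python
import numpy as np

def optimistic_value_iteration(counts, rew_sum, trans_counts, n_s, n_a,
                               bonus_scale=1.0, gamma=0.95, iters=200):
    """Simplified optimism sketch: empirical model plus a count-based reward
    bonus, then ordinary value iteration on the optimistic MDP."""
    n = np.maximum(counts, 1)
    r_hat = rew_sum / n + bonus_scale / np.sqrt(n)          # optimistic reward estimate
    p_hat = trans_counts / np.maximum(trans_counts.sum(axis=2, keepdims=True), 1)
    V = np.zeros(n_s)
    for _ in range(iters):
        V = np.max(r_hat + gamma * p_hat @ V, axis=1)
    return np.argmax(r_hat + gamma * p_hat @ V, axis=1)     # greedy policy w.r.t. optimistic Q

# Shapes assumed: counts and rew_sum are (n_s, n_a); trans_counts is (n_s, n_a, n_s)
```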

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper introduces the use of learned shaping rewards in reinforcement learning tasks, where an agent uses prior experience on a sequence of tasks to learn a portable predictor that estimates intermediate rewards, resulting in accelerated learning in later tasks that are related but distinct.
Abstract: We introduce the use of learned shaping rewards in reinforcement learning tasks, where an agent uses prior experience on a sequence of tasks to learn a portable predictor that estimates intermediate rewards, resulting in accelerated learning in later tasks that are related but distinct. Such agents can be trained on a sequence of relatively easy tasks in order to develop a more informative measure of reward that can be transferred to improve performance on more difficult tasks without requiring a hand coded shaping function. We use a rod positioning task to show that this significantly improves performance even after a very brief training period.
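
One plausible way to realize the idea of a portable learned shaping signal, sketched under assumptions: fit a regressor from agent-centric features to observed returns on earlier tasks, then inject its prediction on the new task, here in potential-based form. The feature/return tuple format and the Ridge regressor are illustrative, and the potential-based plumbing is one standard choice rather than necessarily the paper's exact mechanism.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_shaping_predictor(source_transitions):
    """Fit a portable predictor of return-to-go from agent-centric features
    gathered on earlier (easier) tasks."""
    X = np.array([f for f, _ in source_transitions])
    y = np.array([g for _, g in source_transitions])   # observed returns-to-go
    return Ridge(alpha=1.0).fit(X, y)

def shaped_reward(r, features, features_next, predictor, gamma=0.99):
    """Use the learned predictor as a potential function for shaping on a new,
    related task: F = gamma * phi(s') - phi(s)."""
    phi = predictor.predict(np.array([features]))[0]
    phi_next = predictor.predict(np.array([features_next]))[0]
    return r + gamma * phi_next - phi

# source_transitions: list of (feature_vector, return_to_go) pairs from earlier tasks
```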

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This work presents the first large-scale empirical application of reinforcement learning to the important problem of optimized trade execution in modern financial markets, based on 1.5 years of millisecond time-scale limit order data from NASDAQ.
Abstract: We present the first large-scale empirical application of reinforcement learning to the important problem of optimized trade execution in modern financial markets. Our experiments are based on 1.5 years of millisecond time-scale limit order data from NASDAQ, and demonstrate the promise of reinforcement learning methods for market microstructure problems. Our learning algorithm introduces and exploits a natural "low-impact" factorization of the state space.

Journal ArticleDOI
TL;DR: Opposition-based reinforcement learning, inspired by opposition-based learning, is introduced to speed up convergence: considering opposite actions simultaneously enables individual states to be updated more than once, shortening exploration and expediting convergence.
Abstract: Reinforcement learning is a machine intelligence scheme for learning in highly dynamic, probabilistic environments. By interaction with the environment, reinforcement agents learn optimal control policies, especially in the absence of a priori knowledge and/or a sufficiently large amount of training data. Despite its advantages, however, reinforcement learning suffers from a major drawback – high calculation cost, because convergence to an optimal solution usually requires that all states be visited frequently to ensure that the policy is reliable. This is not always possible, however, due to the complex, high-dimensional state space in many applications. This paper introduces opposition-based reinforcement learning, inspired by opposition-based learning, to speed up convergence. Considering opposite actions simultaneously enables individual states to be updated more than once, shortening exploration and expediting convergence. Three versions of the Q-learning algorithm will be given as examples. Experimental results for the grid world problem of different sizes demonstrate the superior performance of the proposed approach.
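
A minimal sketch of the opposition idea in a grid-world setting: each real transition triggers an ordinary Q-learning update for the taken action plus a second update for its opposite action, so two entries per state improve per step. The opposite-action mapping and the caller-supplied opposite_outcome estimator are assumptions; the paper presents three concrete Q-learning variants.

```python
from collections import defaultdict

OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def opposition_q_update(Q, s, a, r, s_next, opposite_outcome,
                        alpha=0.1, gamma=0.95, actions=("up", "down", "left", "right")):
    """Besides the ordinary update for the taken action, also update the
    opposite action using its (estimated) outcome."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    a_opp = OPPOSITE[a]
    r_opp, s_opp = opposite_outcome(s, a_opp)        # caller-supplied model or estimate
    best_opp = max(Q[(s_opp, b)] for b in actions)
    Q[(s, a_opp)] += alpha * (r_opp + gamma * best_opp - Q[(s, a_opp)])
    return Q

Q = defaultdict(float)   # opposite_outcome encodes how the opposite action would have ended
```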

Proceedings ArticleDOI
25 Jun 2006
TL;DR: This work proposes to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error, or on the temporal difference (TD) error.
Abstract: We address the problem of automatically constructing basis functions for linear approximation of the value function of a Markov Decision Process (MDP). Our work builds on results by Bertsekas and Castanon (1989) who proposed a method for automatically aggregating states to speed up value iteration. We propose to use neighborhood component analysis (Goldberger et al., 2005), a dimensionality reduction technique created for supervised learning, in order to map a high-dimensional state space to a low-dimensional space, based on the Bellman error, or on the temporal difference (TD) error. We then place basis functions in the lower-dimensional space. These are added as new features for the linear function approximator. This approach is applied to a high-dimensional inventory control problem.

Proceedings ArticleDOI
25 Jun 2006
TL;DR: It is shown that RL-CD performs better than two standard reinforcement learning algorithms and that it has advantages over methods specifically designed to cope with non-stationarity.
Abstract: In this paper we introduce RL-CD, a method for solving reinforcement learning problems in non-stationary environments. The method is based on a mechanism for creating, updating and selecting one among several partial models of the environment. The partial models are incrementally built according to the system's capability of making predictions regarding a given sequence of observations. We propose, formalize and show the efficiency of this method both in a simple non-stationary environment and in a noisy scenario. We show that RL-CD performs better than two standard reinforcement learning algorithms and that it has advantages over methods specifically designed to cope with non-stationarity. Finally, we present known limitations of the method and future works.
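
A minimal sketch of the context-detection mechanism described above: keep several partial models, track an exponentially smoothed measure of how well each predicts the observation stream, activate the best one, and spawn a new model when none explains the data well. The count-based transition model, smoothing rate, and threshold are illustrative simplifications of the paper's quality measure (which also covers reward prediction).

```python
class PartialModel:
    """Tiny count-based predictor of (s, a) -> s' used to score how well a
    partial model explains the current stream of observations."""
    def __init__(self):
        self.counts = {}

    def prob(self, s, a, s_next):
        c = self.counts.get((s, a), {})
        total = sum(c.values())
        return c.get(s_next, 0) / total if total else 1e-2   # small prior for unseen pairs

    def update(self, s, a, s_next):
        self.counts.setdefault((s, a), {}).setdefault(s_next, 0)
        self.counts[(s, a)][s_next] += 1

def rlcd_step(models, qualities, s, a, s_next, rho=0.1, threshold=0.05):
    """Track a smoothed prediction quality per partial model; switch to the best
    one, and spawn a new model when none explains the data (a likely context change)."""
    for i, m in enumerate(models):
        qualities[i] += rho * (m.prob(s, a, s_next) - qualities[i])
    best = max(range(len(models)), key=lambda i: qualities[i])
    if qualities[best] < threshold:
        models.append(PartialModel())
        qualities.append(threshold)        # optimistic start for the new model
        best = len(models) - 1
    models[best].update(s, a, s_next)      # only the active model is trained
    return best

models, qualities = [PartialModel()], [0.5]
```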

Journal ArticleDOI
TL;DR: In this article, an adaptive reinforcement learning (ARL) algorithm is used to trade foreign exchange markets and relies on a layered structure consisting of a machine learning algorithm, a risk management overlay and a dynamic utility optimization layer.
Abstract: This paper introduces adaptive reinforcement learning (ARL) as the basis for a fully automated trading system application. The system is designed to trade foreign exchange (FX) markets and relies on a layered structure consisting of a machine learning algorithm, a risk management overlay and a dynamic utility optimization layer. An existing machine-learning method called recurrent reinforcement learning (RRL) was chosen as the underlying algorithm for ARL. One of the strengths of our approach is that the dynamic optimization layer makes a fixed choice of model tuning parameters unnecessary. It also allows for a risk-return trade-off to be made by the user within the system. The trading system is able to make consistent gains out-of-sample while avoiding large draw-downs.

Journal ArticleDOI
TL;DR: A "heterarchical reinforcement learning" model is proposed, where reward prediction made by more limbic and cognitive loops is propagated to motor loops by spiral projections between the striatum and substantia nigra, assisted by cortical projections to the pedunculopontine tegmental nucleus, which sends excitatory input to the substantia Nigra.

Journal ArticleDOI
TL;DR: The multi-agent HRL framework is extended to include communication decisions and a cooperative multi-agent HRL algorithm called COM-Cooperative HRL is proposed, which allows agents to learn coordination faster by sharing information at the level of cooperative subtasks.
Abstract: In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. Those levels of the hierarchy which include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. We also address the issue of rational communication behavior among autonomous agents in this paper. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. Before an agent makes a decision at a cooperative subtask, it decides if it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm as well as the relation between the communication cost and the learned communication policy using a multi-agent taxi problem.