
Showing papers on "Reinforcement learning published in 2007"


Book
01 Jan 2007
TL;DR: In Introduction to Statistical Relational Learning, leading researchers in this emerging area of machine learning describe current formalisms, models, and algorithms that enable effective and robust reasoning about richly structured systems and data.
Abstract: Handling inherent uncertainty and exploiting compositional structure are fundamental to understanding and designing large-scale systems. Statistical relational learning builds on ideas from probability theory and statistics to address uncertainty while incorporating tools from logic, databases and programming languages to represent structure. In Introduction to Statistical Relational Learning, leading researchers in this emerging area of machine learning describe current formalisms, models, and algorithms that enable effective and robust reasoning about richly structured systems and data. The early chapters provide tutorials for material used in later chapters, offering introductions to representation, inference and learning in graphical models, and logic. The book then describes object-oriented approaches, including probabilistic relational models, relational Markov networks, and probabilistic entity-relationship models as well as logic-based formalisms including Bayesian logic programs, Markov logic, and stochastic logic programs. Later chapters discuss such topics as probabilistic models with unknown objects, relational dependency networks, reinforcement learning in relational domains, and information extraction. By presenting a variety of approaches, the book highlights commonalities and clarifies important differences among proposed approaches and, along the way, identifies important representational and algorithmic issues. Numerous applications are provided throughout. Lise Getoor is Assistant Professor in the Department of Computer Science at the University of Maryland. Ben Taskar is Assistant Professor in the Computer and Information Science Department at the University of Pennsylvania.

1,141 citations


Journal ArticleDOI
TL;DR: The mechanism of Intelligent Adaptive Curiosity is presented, an intrinsic motivation system which pushes a robot towards situations in which it maximizes its learning progress, thus permitting autonomous mental development.
Abstract: Exploratory activities seem to be intrinsically rewarding for children and crucial for their cognitive development. Can a machine be endowed with such an intrinsic motivation system? This is the question we study in this paper, presenting a number of computational systems that try to capture this drive towards novel or curious situations. After discussing related research coming from developmental psychology, neuroscience, developmental robotics, and active learning, this paper presents the mechanism of Intelligent Adaptive Curiosity, an intrinsic motivation system which pushes a robot towards situations in which it maximizes its learning progress. This drive makes the robot focus on situations which are neither too predictable nor too unpredictable, thus permitting autonomous mental development. The complexity of the robot's activities autonomously increases and complex developmental sequences self-organize without being constructed in a supervised manner. Two experiments are presented illustrating the stage-like organization emerging with this mechanism. In one of them, a physical robot is placed on a baby play mat with objects that it can learn to manipulate. Experimental results show that the robot first spends time in situations which are easy to learn, then shifts its attention progressively to situations of increasing difficulty, avoiding situations in which nothing can be learned. Finally, these various results are discussed in relation to more complex forms of behavioral organization and data coming from developmental psychology.

1,134 citations


Proceedings Article
06 Jan 2007
TL;DR: This paper shows how to combine prior knowledge and evidence from the expert's actions to derive a probability distribution over the space of reward functions and presents efficient algorithms that find solutions for the reward learning and apprenticeship learning tasks that generalize well over these distributions.
Abstract: Inverse Reinforcement Learning (IRL) is the problem of learning the reward function underlying a Markov Decision Process given the dynamics of the system and the behaviour of an expert. IRL is motivated by situations where knowledge of the rewards is a goal by itself (as in preference elicitation) and by the task of apprenticeship learning (learning policies from an expert). In this paper we show how to combine prior knowledge and evidence from the expert's actions to derive a probability distribution over the space of reward functions. We present efficient algorithms that find solutions for the reward learning and apprenticeship learning tasks that generalize well over these distributions. Experimental results show strong improvement for our methods over previous heuristic-based approaches.
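The posterior over rewards described above can be illustrated with a small example. Below is a minimal sketch (not the paper's experimental setup), assuming a five-state chain MDP with known dynamics, a Gaussian prior over state rewards, a Boltzmann likelihood for the expert's actions, and a Metropolis-Hastings random walk over reward vectors; all names and parameter values are illustrative.

```python
import numpy as np

# Toy MDP: 5-state chain, actions 0 = left, 1 = right, known dynamics.
n_states, n_actions, gamma = 5, 2, 0.95
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

def q_values(reward, iters=200):
    """Q* for a state-based reward via value iteration."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = reward[:, None] + gamma * P.dot(V)   # shape (S, A)
        V = Q.max(axis=1)
    return Q

def log_posterior(reward, demos, beta=5.0):
    """Gaussian log prior + Boltzmann log likelihood of the expert's actions."""
    Q = q_values(reward)
    logp = -0.5 * np.sum(reward ** 2)
    for s, a in demos:
        logp += beta * Q[s, a] - np.log(np.exp(beta * Q[s]).sum())
    return logp

# Expert demonstrations consistent with reward concentrated on the last state.
demos = [(0, 1), (1, 1), (2, 1), (3, 1)]

# Metropolis-Hastings random walk over reward vectors.
rng = np.random.default_rng(0)
r = np.zeros(n_states)
lp = log_posterior(r, demos)
samples = []
for step in range(3000):
    proposal = r + 0.3 * rng.normal(size=n_states)
    lp_new = log_posterior(proposal, demos)
    if np.log(rng.random()) < lp_new - lp:
        r, lp = proposal, lp_new
    samples.append(r.copy())

print("posterior mean reward:", np.mean(samples[1000:], axis=0).round(2))
```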

663 citations


Journal ArticleDOI
TL;DR: A neurocomputational dissociation between striatal and prefrontal dopaminergic mechanisms in reinforcement learning is supported and independent gene effects on three reinforcement learning parameters that can explain the observed dissociations are revealed.
Abstract: What are the genetic and neural components that support adaptive learning from positive and negative outcomes? Here, we show with genetic analyses that three independent dopaminergic mechanisms contribute to reward and avoidance learning in humans. A polymorphism in the DARPP-32 gene, associated with striatal dopamine function, predicted relatively better probabilistic reward learning. Conversely, the C957T polymorphism of the DRD2 gene, associated with striatal D2 receptor function, predicted the degree to which participants learned to avoid choices that had been probabilistically associated with negative outcomes. The Val/Met polymorphism of the COMT gene, associated with prefrontal cortical dopamine function, predicted participants' ability to rapidly adapt behavior on a trial-to-trial basis. These findings support a neurocomputational dissociation between striatal and prefrontal dopaminergic mechanisms in reinforcement learning. Computational maximum likelihood analyses reveal independent gene effects on three reinforcement learning parameters that can explain the observed dissociations.
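As an illustration of the kind of computational maximum-likelihood analysis mentioned above, the sketch below fits a Q-learning model with separate learning rates for positive and negative prediction errors (plus a softmax temperature) to simulated choice data from a two-option probabilistic task; the task, parameter names, and values are illustrative, not the study's actual paradigm or model.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def simulate(alpha_gain, alpha_loss, beta, n_trials=300, p_reward=(0.8, 0.2)):
    """Generate choices and outcomes from a dual-learning-rate Q-learner."""
    Q = np.zeros(2)
    choices, outcomes = [], []
    for _ in range(n_trials):
        p = np.exp(beta * Q) / np.exp(beta * Q).sum()
        c = rng.choice(2, p=p)
        r = float(rng.random() < p_reward[c])
        delta = r - Q[c]
        Q[c] += (alpha_gain if delta > 0 else alpha_loss) * delta
        choices.append(c)
        outcomes.append(r)
    return np.array(choices), np.array(outcomes)

def neg_log_likelihood(params, choices, outcomes):
    alpha_gain, alpha_loss, beta = params
    Q, nll = np.zeros(2), 0.0
    for c, r in zip(choices, outcomes):
        p = np.exp(beta * Q) / np.exp(beta * Q).sum()
        nll -= np.log(p[c] + 1e-12)
        delta = r - Q[c]
        Q[c] += (alpha_gain if delta > 0 else alpha_loss) * delta
    return nll

choices, outcomes = simulate(0.4, 0.1, 3.0)
fit = minimize(neg_log_likelihood, x0=[0.3, 0.3, 1.0],
               args=(choices, outcomes), method="L-BFGS-B",
               bounds=[(0.01, 1.0), (0.01, 1.0), (0.1, 10.0)])
print("recovered (alpha_gain, alpha_loss, beta):", fit.x.round(2))
```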

644 citations


Journal ArticleDOI
TL;DR: It is shown that the modulation of STDP by a global reward signal leads to reinforcement learning; learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity are derived analytically, and these rules may be used for training generic artificial spiking neural networks, regardless of the neural model used.
Abstract: The persistent modification of synaptic efficacy as a function of the relative timing of pre- and postsynaptic spikes is a phenomenon known as spike-timing-dependent plasticity (STDP). Here we show that the modulation of STDP by a global reward signal leads to reinforcement learning. We first derive analytically learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity, by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons. These rules have several features common to plasticity mechanisms experimentally found in the brain. We then demonstrate in simulations of networks of integrate-and-fire neurons the efficacy of two simple learning rules involving modulated STDP. One rule is a direct extension of the standard STDP model (modulated STDP), and the other involves an eligibility trace stored at each synapse that keeps a decaying memory of recent pre- and postsynaptic spike pairs (modulated STDP with eligibility trace). This latter rule permits learning even if the reward signal is delayed. The proposed rules are able to solve the XOR problem with both rate-coded and temporally coded input and to learn a target output firing-rate pattern. These learning rules are biologically plausible, may be used for training generic artificial spiking neural networks regardless of the neural model used, and suggest the experimental investigation in animals of the existence of reward-modulated STDP.
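A minimal sketch of the second rule described above (modulated STDP with an eligibility trace) for a single synapse is shown below; the spike trains are random stand-ins rather than output of a spike response or integrate-and-fire model, and the time constants and reward schedule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T = 1.0, 1000                  # ms time step, trial length
tau_plus = tau_minus = 20.0        # STDP trace time constants (ms)
tau_e = 200.0                      # eligibility trace time constant (ms)
A_plus, A_minus, lr = 1.0, 1.05, 0.01

w = 0.5                            # synaptic weight
x_pre = x_post = elig = 0.0        # pre/post spike traces, eligibility trace

for t in range(T):
    pre_spike = rng.random() < 0.02      # Poisson-like presynaptic spikes
    post_spike = rng.random() < 0.02     # stand-in for postsynaptic spikes

    # Decay the traces.
    x_pre *= np.exp(-dt / tau_plus)
    x_post *= np.exp(-dt / tau_minus)
    elig *= np.exp(-dt / tau_e)

    # Standard STDP contribution, accumulated into the eligibility trace
    # instead of being applied to the weight directly.
    if pre_spike:
        x_pre += 1.0
        elig -= A_minus * x_post         # post-before-pre: depression
    if post_spike:
        x_post += 1.0
        elig += A_plus * x_pre           # pre-before-post: potentiation

    # A delayed, global reward signal gates the actual weight change.
    reward = 1.0 if (t % 250 == 249) else 0.0
    w += lr * reward * elig
    w = float(np.clip(w, 0.0, 1.0))

print("final weight:", round(w, 3))
```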

383 citations


Journal ArticleDOI
TL;DR: In this article, the authors used a simple four-armed bandit task in which subjects were almost evenly split into two groups on the basis of their performance: those who learned to favor choice of the optimal action and those who did not.
Abstract: The computational framework of reinforcement learning has been used to forward our understanding of the neural mechanisms underlying reward learning and decision-making behavior. It is known that humans vary widely in their performance in decision-making tasks. Here, we used a simple four-armed bandit task in which subjects are almost evenly split into two groups on the basis of their performance: those who do learn to favor choice of the optimal action and those who do not. Using models of reinforcement learning we sought to determine the neural basis of these intrinsic differences in performance by scanning both groups with functional magnetic resonance imaging. We scanned 29 subjects while they performed the reward-based decision-making task. Our results suggest that these two groups differ markedly in the degree to which reinforcement learning signals in the striatum are engaged during task performance. While the learners showed robust prediction error signals in both the ventral and dorsal striatum during learning, the nonlearner group showed a marked absence of such signals. Moreover, the magnitude of prediction error signals in a region of dorsal striatum correlated significantly with a measure of behavioral performance across all subjects. These findings support a crucial role of prediction error signals, likely originating from dopaminergic midbrain neurons, in enabling learning of action selection preferences on the basis of obtained rewards. Thus, spontaneously observed individual differences in decision making performance demonstrate the suggested dependence of this type of learning on the functional integrity of the dopaminergic striatal system in humans.
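The trial-by-trial prediction-error regressor used in this kind of model-based fMRI analysis can be sketched as follows, assuming a four-armed bandit, a standard delta-rule learner, and softmax action selection; payoff probabilities and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
payoff_probs = [0.2, 0.4, 0.6, 0.8]      # four-armed bandit (illustrative)
alpha, beta, n_trials = 0.2, 3.0, 200

Q = np.zeros(4)
prediction_errors = []                   # trial-by-trial regressor for fMRI
for t in range(n_trials):
    p = np.exp(beta * Q) / np.exp(beta * Q).sum()   # softmax action selection
    choice = rng.choice(4, p=p)
    reward = float(rng.random() < payoff_probs[choice])
    delta = reward - Q[choice]           # reward prediction error
    prediction_errors.append(delta)
    Q[choice] += alpha * delta           # delta-rule update

print("mean |PE| early vs late:",
      round(np.mean(np.abs(prediction_errors[:50])), 2),
      round(np.mean(np.abs(prediction_errors[-50:])), 2))
```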

372 citations


01 Jan 2007
TL;DR: This dissertation describes a novel framework for the design and analysis of online learning algorithms and proposes a new perspective on regret bounds which is based on the notion of duality in convex optimization.
Abstract: Online learning is the process of answering a sequence of questions given knowledge of the correct answers to previous questions and possibly additional available information. Answering questions in an intelligent fashion and being able to make rational decisions as a result is a basic feature of everyday life. Will it rain today (so should I take an umbrella)? Should I fight the wild animal that is after me, or should I run away? Should I open an attachment in an email message or is it a virus? The study of online learning algorithms is thus an important domain in machine learning, and one that has interesting theoretical properties and practical applications. This dissertation describes a novel framework for the design and analysis of online learning algorithms. We show that various online learning algorithms can all be derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Online learning is performed in a sequence of consecutive rounds, where at each round the learner is given a question and is required to provide an answer to this question. After predicting an answer, the correct answer is revealed and the learner suffers a loss if there is a discrepancy between his answer and the correct one. The algorithmic framework for online learning we propose in this dissertation stems from a connection that we make between the notions of regret in online learning and weak duality in convex optimization. Regret bounds are the common thread in the analysis of online learning algorithms. A regret bound measures the performance of an online algorithm relative to the performance of a competing prediction mechanism, called a competing hypothesis. The competing hypothesis can be chosen in hindsight from a class of hypotheses, after observing the entire sequence of question-answer pairs. Over the years, competitive analysis techniques have been refined and extended to numerous prediction problems by employing complex and varied notions of progress toward a good competing hypothesis. We propose a new perspective on regret bounds which is based on the notion of duality in convex optimization. Regret bounds are universal in the sense that they hold for any possible fixed hypothesis in a given hypothesis class. We therefore cast the universal bound as a lower bound
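The regret bookkeeping described above can be made concrete with a small sketch: online gradient descent answers a stream of squared-loss questions, and its cumulative loss is compared with that of the best fixed hypothesis chosen in hindsight. The data stream and step-size schedule are illustrative, not the dissertation's framework.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 5, 500
w_true = rng.normal(size=d)

w = np.zeros(d)                      # online learner's hypothesis
losses, stream = [], []
for t in range(1, T + 1):
    x = rng.normal(size=d)
    y = w_true @ x + 0.1 * rng.normal()
    loss = 0.5 * (w @ x - y) ** 2    # learner answers, then suffers a loss
    losses.append(loss)
    stream.append((x, y))
    grad = (w @ x - y) * x
    w -= grad / np.sqrt(t)           # online gradient descent step

# Best fixed hypothesis in hindsight (least squares on the whole stream).
X = np.array([x for x, _ in stream])
Y = np.array([y for _, y in stream])
w_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
hindsight = 0.5 * np.sum((X @ w_star - Y) ** 2)

regret = sum(losses) - hindsight
print("cumulative regret:", round(regret, 2))
```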

359 citations


Journal ArticleDOI
TL;DR: The authors construct a TDRL model that can accommodate extinction and renewal through two simple processes: a TDRL process that learns the value of situation-action pairs and a situation recognition process that categorizes the observed cues into situations.
Abstract: Because learned associations are quickly renewed following extinction, the extinction process must include processes other than unlearning. However, reinforcement learning models, such as the temporal difference reinforcement learning (TDRL) model, treat extinction as an unlearning of associated value and are thus unable to capture renewal. TDRL models are based on the hypothesis that dopamine carries a reward prediction error signal; these models predict reward by driving that reward error to zero. The authors construct a TDRL model that can accommodate extinction and renewal through two simple processes: (a) a TDRL process that learns the value of situation-action pairs and (b) a situation recognition process that categorizes the observed cues into situations. This model has implications for dysfunctional states, including relapse after addiction and problem gambling.
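A schematic sketch of the two processes described above is given below: a delta-rule (TD-like) learner over situation values, plus a crude situation-recognition step that assigns observed cues to the nearest known situation or creates a new one. The cue representation, threshold, and contexts are illustrative simplifications, not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, threshold = 0.3, 1.0
situations, values = [], []          # situation prototypes and learned values

def recognize(cues):
    """Assign cues to the nearest known situation, or create a new one."""
    if situations:
        d = [np.linalg.norm(cues - s) for s in situations]
        if min(d) < threshold:
            return int(np.argmin(d))
    situations.append(cues.copy())
    values.append(0.0)
    return len(situations) - 1

def trial(context_cues, reward):
    s = recognize(context_cues + 0.05 * rng.normal(size=2))
    delta = reward - values[s]       # prediction error
    values[s] += alpha * delta
    return s

context_A, context_B = np.array([0.0, 0.0]), np.array([5.0, 5.0])

for _ in range(30):                  # acquisition in context A (rewarded)
    trial(context_A, reward=1.0)
for _ in range(30):                  # extinction in context B (not rewarded)
    trial(context_B, reward=0.0)

# Renewal test: back in context A, the original situation's value is intact,
# because extinction was learned as a *different* situation, not as unlearning.
s = recognize(context_A)
print("situations learned:", len(situations),
      "| value in context A:", round(values[s], 2))
```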

339 citations


Journal Article
TL;DR: A novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies is introduced, and several strategies for scaling the proposed framework to large MDPs are outlined.
Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) a general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators; (ii) a specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP; (iii) a three-phased procedure called representation policy iteration (RPI), comprising a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions; (iv) a specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method; (v) several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nystrom extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs; and (vi) a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaborations of the proposed framework are briefly summarized at the end.
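Component (ii) above can be illustrated directly: the sketch below builds the undirected state graph of a small grid world, forms the combinatorial graph Laplacian, and takes its smoothest eigenvectors as proto-value functions. The grid size and number of basis functions are illustrative, and the sample-collection and LSPI phases are omitted.

```python
import numpy as np

n = 10                                   # 10 x 10 grid world
n_states = n * n

# Adjacency matrix of the undirected state graph induced by 4-neighbour moves.
A = np.zeros((n_states, n_states))
for r in range(n):
    for c in range(n):
        s = r * n + c
        for dr, dc in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n:
                A[s, rr * n + cc] = 1.0

# Combinatorial graph Laplacian L = D - A and its eigenvectors.
D = np.diag(A.sum(axis=1))
L = D - A
eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order

# Proto-value functions: the k smoothest eigenvectors (smallest eigenvalues),
# used as basis functions for approximating value functions over this graph.
k = 10
pvfs = eigvecs[:, :k]                    # shape (n_states, k)
print("basis shape:", pvfs.shape, "| first eigenvalues:", eigvals[:4].round(3))
```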

336 citations


Journal ArticleDOI
TL;DR: An extended algorithm, GNP with Reinforcement Learning (GNPRL), is proposed, which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments.
Abstract: This paper proposes a graph-based evolutionary algorithm called Genetic Network Programming (GNP). Our goal is to develop GNP, which can deal with dynamic environments efficiently and effectively, based on the distinguished expression ability of the graph (network) structure. The characteristics of GNP are as follows. 1) GNP programs are composed of a number of nodes which execute simple judgment/processing, and these nodes are connected by directed links to each other. 2) The graph structure enables GNP to re-use nodes, thus the structure can be very compact. 3) The node transition of GNP is executed according to its node connections without any terminal nodes, thus the past history of the node transition affects the current node to be used and this characteristic works as an implicit memory function. These structural characteristics are useful for dealing with dynamic environments. Furthermore, we propose an extended algorithm, “GNP with Reinforcement Learning (GNPRL)” which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. In this paper, we applied GNP to the problem of determining agents' behavior to evaluate its effectiveness. Tileworld was used as the simulation environment. The results show some advantages for GNP over conventional methods.

329 citations


Proceedings ArticleDOI
20 Jun 2007
TL;DR: This work considers the problem of multi-task reinforcement learning, where the agent needs to solve a sequence of Markov Decision Processes chosen randomly from a fixed but unknown distribution, using a hierarchical Bayesian infinite mixture model.
Abstract: We consider the problem of multi-task reinforcement learning, where the agent needs to solve a sequence of Markov Decision Processes (MDPs) chosen randomly from a fixed but unknown distribution. We model the distribution over MDPs using a hierarchical Bayesian infinite mixture model. For each novel MDP, we use the previously learned distribution as an informed prior for model-based Bayesian reinforcement learning. The hierarchical Bayesian framework provides a strong prior that allows us to rapidly infer the characteristics of new environments based on previous environments, while the use of a nonparametric model allows us to quickly adapt to environments we have not encountered before. In addition, the use of infinite mixtures allows for the model to automatically learn the number of underlying MDP components. We evaluate our approach and show that it leads to significant speedups in convergence to an optimal policy after observing only a small number of tasks.

Journal ArticleDOI
TL;DR: The authors found that the magnitude of ERPs after losing to the computer opponent predicted whether subjects would change decision behavior on the subsequent trial, and that FRNs to decision outcomes were disproportionately larger over the motor cortex contralateral to the response hand that was used to make the decision.
Abstract: Optimal behavior in a competitive world requires the flexibility to adapt decision strategies based on recent outcomes. In the present study, we tested the hypothesis that this flexibility emerges through a reinforcement learning process, in which reward prediction errors are used dynamically to adjust representations of decision options. We recorded event-related brain potentials (ERPs) while subjects played a strategic economic game against a computer opponent to evaluate how neural responses to outcomes related to subsequent decision-making. Analyses of ERP data focused on the feedback-related negativity (FRN), an outcome-locked potential thought to reflect a neural prediction error signal. Consistent with predictions of a computational reinforcement learning model, we found that the magnitude of ERPs after losing to the computer opponent predicted whether subjects would change decision behavior on the subsequent trial. Furthermore, FRNs to decision outcomes were disproportionately larger over the motor cortex contralateral to the response hand that was used to make the decision. These findings provide novel evidence that humans engage a reinforcement learning process to adjust representations of competing decision options.

01 Jan 2007
TL;DR: In this paper, the authors presented a cognitive model that simulates how people seek information on the Web, called SNIF-ACT, which stands for Scent-based Navigation and Information Foraging in the ACT architecture.
Abstract: SNIF-ACT (Scent-based Navigation and Information Foraging in the ACT architecture) is a computational cognitive model that simulates how people seek information on the Web (Fu & Pirolli, 2007). SNIF-ACT provides an account of how people use information scent cues, such as the text associated with Web links, in order to make navigation decisions such as judging where to go next on the Web, or when to give up on a particular path of knowledge search. SNIF-ACT is shaped by rational analyses of the Web developed by combining the Bayesian satisficing model (Fu & Gray, 2006; Fu, 2007) with information foraging theory (Pirolli & Card, 1999). We describe the current status of the SNIF-ACT model and the results from testing it against two data sets from real-world human subjects; at this point, our goal is to validate the model's predictions on unfamiliar information-seeking tasks for general users. Our model was successful in predicting users' behavior in these tasks, especially in identifying the "attractor" pages that most users visited. We focus here on the newest development of the model, SNIF-ACT 2.0, which includes an adaptive link selection mechanism that sequentially evaluates links on a Web page according to their position. The mechanism was derived from a rational analysis of link selection on a Web page and the process of satisficing in action selection (Simon, 1956). It allows the model to dynamically update the evaluation of actions (e.g., to follow a link or leave a Web site) based on sequential assessments of link texts on a Web page. This dynamic assessment allows online adjustment of the aspiration levels of different actions in the satisficing process, based on the information scent values of links as well as implicit feedback (or reinforcement) received during each action cycle (Fu & Anderson, 2006), such that the action selection process is directly influenced by the content of the Web page; for example, the model's decision on when to click on a link or leave a page is sensitive to experiences with previously visited links and pages. SNIF-ACT 2.0 was validated on a data set obtained from 74 subjects. Monte Carlo simulations of the model showed that SNIF-ACT 2.0 provided better fits to human data than SNIF-ACT 1.0 and a Position model that used the position of links on a Web page to decide which link to select. We conclude that the combination of information foraging theory and the Bayesian satisficing model provides a good description of user-Web interaction. Practical implications of the model are discussed.

Keywords: Information Seeking, Cognitive Models, Reinforcement Learning, Bayesian Satisficing Model.

References:
Fu, W.-T. (2007). Adaptive tradeoffs between exploration and exploitation: A rational-ecological approach. In W. D. Gray (Ed.), Integrated Models of Cognitive Systems. Oxford: Oxford University Press.
Fu, W.-T., & Anderson, J. R. (2006). From recurrent choice to skilled learning: A reinforcement learning model. Journal of Experimental Psychology: General, 135(2), 184-206.
Fu, W.-T., & Gray, W. D. (2006). Suboptimal tradeoffs in information-seeking. Cognitive Psychology, 52, 195-242.
Fu, W.-T., & Pirolli, P. (2007). SNIF-ACT: A cognitive model of user navigation on the World Wide Web. Human-Computer Interaction.
Pirolli, P., & Card, S. K. (1999). Information foraging. Psychological Review, 106, 643-675.
Simon, H. A. (1956). Rational choice and the structure of environments. Psychological Review, 63, 129-138.

Journal ArticleDOI
TL;DR: The KLSPI algorithm provides a general RL method with generalization performance and convergence guarantee for large-scale Markov decision problems (MDPs) and can be applied to online learning control by incorporating an initial controller to ensure online performance.
Abstract: In this paper, we present a kernel-based least squares policy iteration (KLSPI) algorithm for reinforcement learning (RL) in large or continuous state spaces, which can be used to realize adaptive feedback control of uncertain dynamic systems. By using KLSPI, near-optimal control policies can be obtained without much a priori knowledge on dynamic models of control plants. In KLSPI, Mercer kernels are used in the policy evaluation of a policy iteration process, where a new kernel-based least squares temporal-difference algorithm called KLSTD-Q is proposed for efficient policy evaluation. To keep the sparsity and improve the generalization ability of KLSTD-Q solutions, a kernel sparsification procedure based on approximate linear dependency (ALD) is performed. Compared to previous work on approximate RL methods, KLSPI makes two advances that eliminate the main difficulties of existing approaches. One is a better convergence and (near-)optimality guarantee, obtained by using the KLSTD-Q algorithm for policy evaluation with high precision. The other is automatic feature selection using the ALD-based kernel sparsification. Therefore, the KLSPI algorithm provides a general RL method with generalization performance and convergence guarantee for large-scale Markov decision problems (MDPs). Experimental results on a typical RL task for a stochastic chain problem demonstrate that KLSPI can consistently achieve better learning efficiency and policy quality than the previous least squares policy iteration (LSPI) algorithm. Furthermore, the KLSPI method was also evaluated on two nonlinear feedback control problems, including a ship heading control problem and the swing up control of a double-link underactuated pendulum called acrobot. Simulation results illustrate that the proposed method can optimize controller performance using little a priori information of uncertain dynamic systems. It is also demonstrated that KLSPI can be applied to online learning control by incorporating an initial controller to ensure online performance.

Journal ArticleDOI
TL;DR: Using a large cohort of subjects and fMRI, it is reported that fictive learning signals strongly predict changes in subjects' investment behavior and correlate with fMRI signals measured in dopaminoceptive structures known to be involved in valuation and choice.
Abstract: Reinforcement learning models now provide principled guides for a wide range of reward learning experiments in animals and humans. One key learning (error) signal in these models is experiential and reports ongoing temporal differences between expected and experienced reward. However, these same abstract learning models also accommodate the existence of another class of learning signal that takes the form of a fictive error encoding ongoing differences between experienced returns and returns that "could have been experienced" if decisions had been different. These observations suggest the hypothesis that, for all real-world learning tasks, one should expect the presence of both experiential and fictive learning signals. Motivated by this possibility, we used a sequential investment game and fMRI to probe ongoing brain responses to both experiential and fictive learning signals generated throughout the game. Using a large cohort of subjects (n = 54), we report that fictive learning signals strongly predict changes in subjects' investment behavior and correlate with fMRI signals measured in dopaminoceptive structures known to be involved in valuation and choice. Keywords: counterfactual signals, decision-making, neuroeconomics, reinforcement learning.
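The distinction between the two error signals can be sketched for a simplified sequential investment setting: the experiential error compares the obtained gain with a running expectation, while the fictive error compares it with the gain of the best allocation in hindsight. The market returns and fixed fractional bet below are illustrative, not the paper's task or model.

```python
import numpy as np

rng = np.random.default_rng(12)
returns = rng.normal(0.01, 0.05, size=200)   # per-round market returns
bet, expected_gain, alpha = 0.3, 0.0, 0.1    # fixed fractional bet (illustrative)

experiential, fictive = [], []
for r_market in returns:
    gain = bet * r_market

    # Experiential (TD-like) error: obtained gain vs. current expectation.
    experiential.append(gain - expected_gain)
    expected_gain += alpha * experiential[-1]

    # Fictive error: gain of the best allocation in hindsight (fully invested
    # if the market rose, fully out if it fell) minus the gain obtained.
    best_gain = max(0.0, r_market)
    fictive.append(best_gain - gain)

print("mean |experiential error|:", round(np.mean(np.abs(experiential)), 4))
print("mean fictive error       :", round(np.mean(fictive), 4))
```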

Proceedings Article
06 Jan 2007
TL;DR: This work introduces the notion of learning options in agent-space, the space generated by a feature set that is present and retains the same semantics across successive problem instances, rather than in problem-space.
Abstract: The options framework provides methods for reinforcement learning agents to build new high-level skills. However, since options are usually learned in the same state space as the problem the agent is solving, they cannot be used in other tasks that are similar but have different state spaces. We introduce the notion of learning options in agent-space, the space generated by a feature set that is present and retains the same semantics across successive problem instances, rather than in problem-space. Agent-space options can be reused in later tasks that share the same agent-space but have different problem-spaces. We present experimental results demonstrating the use of agent-space options in building transferable skills, and show that they perform best when used in conjunction with problem-space options.

Proceedings ArticleDOI
01 Apr 2007
TL;DR: This work presents a new class of algorithms named continuous actor critic learning automaton (CACLA) that can handle continuous states and actions and shows that CACLA performs much better than the other algorithms, especially when it is combined with a Gaussian exploration method.
Abstract: Quite some research has been done on reinforcement learning in continuous environments, but the research on problems where the actions can also be chosen from a continuous space is much more limited. We present a new class of algorithms named continuous actor critic learning automaton (CACLA) that can handle continuous states and actions. The resulting algorithm is straightforward to implement. An experimental comparison is made between this algorithm and other algorithms that can handle continuous action spaces. These experiments show that CACLA performs much better than the other algorithms, especially when it is combined with a Gaussian exploration method.
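A minimal sketch of the CACLA update on a toy one-dimensional problem is shown below, with a linear actor and critic and Gaussian exploration; the task, features, and step sizes are illustrative. The defining step is that the actor is moved toward the executed action only when the temporal-difference error is positive.

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.zeros(2)        # actor weights  : a(s) = theta . [1, s]
v = np.zeros(2)            # critic weights : V(s) = v . [1, s]
alpha_actor, alpha_critic, sigma = 0.1, 0.1, 0.3

def features(s):
    return np.array([1.0, s])

for step in range(5000):
    s = rng.uniform(-1.0, 1.0)
    phi = features(s)
    a_mean = theta @ phi
    a = a_mean + sigma * rng.normal()          # Gaussian exploration
    reward = -(a - 0.5 * s) ** 2               # optimal action is 0.5 * s

    # One-step episode: the TD error is just reward minus the critic's value.
    delta = reward - v @ phi
    v += alpha_critic * delta * phi            # critic update

    # CACLA: only when the TD error is positive, move the actor's output
    # toward the action that was actually taken.
    if delta > 0:
        theta += alpha_actor * (a - a_mean) * phi

print("learned actor weights (target ~ [0, 0.5]):", theta.round(2))
```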

Journal ArticleDOI
TL;DR: This article compares learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrates that directly transferring the action-value function can lead to a dramatic speedup in learning with all three.
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005).

Proceedings ArticleDOI
20 Jun 2007
TL;DR: This work uses a generalization of the EM-based reinforcement learning framework suggested by Dayan & Hinton to reduce the problem of learning with immediate rewards to a reward-weighted regression problem with an adaptive, integrated reward transformation for faster convergence.
Abstract: Many robot control problems of practical importance, including operational space control, can be reformulated as immediate reward reinforcement learning problems. However, few of the known optimization or reinforcement learning algorithms can be used in online learning control for robots, as they are either prohibitively slow, do not scale to interesting domains of complex robots, or require trying out policies generated by random search, which are infeasible for a physical system. Using a generalization of the EM-based reinforcement learning framework suggested by Dayan & Hinton, we reduce the problem of learning with immediate rewards to a reward-weighted regression problem with an adaptive, integrated reward transformation for faster convergence. The resulting algorithm is efficient, learns smoothly without dangerous jumps in solution space, and works well in applications of complex high degree-of-freedom robots.
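The reward-weighted regression step can be sketched for a simple immediate-reward problem with a linear-Gaussian policy: rollouts are collected under the current policy, and the policy mean is refit by weighted least squares with the (positive) rewards as weights. The task, features, and reward shape are illustrative, and the paper's adaptive reward transformation is omitted.

```python
import numpy as np

rng = np.random.default_rng(7)
theta = np.zeros(2)                       # policy mean: a(s) = theta . [1, s]
sigma = 0.4

def features(s):
    return np.stack([np.ones_like(s), s], axis=-1)

for iteration in range(20):
    # Roll out the current stochastic policy on a batch of states.
    s = rng.uniform(-1.0, 1.0, size=200)
    Phi = features(s)                     # shape (200, 2)
    a = Phi @ theta + sigma * rng.normal(size=200)
    reward = np.exp(-4.0 * (a - s) ** 2)  # positive, highest near a = s

    # Reward-weighted regression: weighted least squares of actions on
    # features, with the rewards as the weights.
    W = reward
    A_mat = Phi.T @ (Phi * W[:, None])
    b = Phi.T @ (W * a)
    theta = np.linalg.solve(A_mat, b)

print("learned policy weights (target ~ [0, 1]):", theta.round(2))
```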

Proceedings ArticleDOI
29 Oct 2007
TL;DR: This article focuses on decentralized reinforcement learning (RL) in cooperative MAS, where a team of independent learning robots (IL) try to coordinate their individual behavior to reach a coherent joint behavior, and suggests a Q-learning extension for ILs, called hysteretic Q- learning.
Abstract: Multi-agent systems (MAS) are a field of study of growing interest in a variety of domains such as robotics or distributed controls. The article focuses on decentralized reinforcement learning (RL) in cooperative MAS, where a team of independent learning robots (IL) try to coordinate their individual behavior to reach a coherent joint behavior. We assume that each robot has no information about its teammates' actions. To date, RL approaches for such ILs have not guaranteed convergence to the optimal joint policy in scenarios where coordination is difficult. We report an investigation of existing algorithms for the learning of coordination in cooperative MAS, and suggest a Q-learning extension for ILs, called hysteretic Q-learning. This algorithm does not require any additional communication between robots. Its advantages are shown and compared to other methods on various applications: bi-matrix games, a collaborative ball-balancing task, and the pursuit domain.
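The hysteretic update itself is a small change to Q-learning, sketched below on a single-state cooperative matrix game with two independent learners; the payoff matrix (climbing-game style) and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
# Cooperative matrix game (climbing-game style payoffs, illustrative):
payoff = np.array([[ 11, -30,  0],
                   [-30,   7,  6],
                   [  0,   0,  5]], dtype=float)

alpha, beta, epsilon = 0.1, 0.01, 0.1      # beta < alpha: hysteresis
Q1, Q2 = np.zeros(3), np.zeros(3)          # independent learners

def hysteretic_update(Q, a, reward):
    delta = reward - Q[a]
    # Optimistic asymmetry: learn fast from increases, slowly from decreases,
    # so a good joint action is not abandoned because the partner explored.
    Q[a] += (alpha if delta >= 0 else beta) * delta

def choose(Q):
    if rng.random() < epsilon:
        return int(rng.integers(3))
    return int(np.argmax(Q))

for episode in range(20000):
    a1, a2 = choose(Q1), choose(Q2)
    r = payoff[a1, a2]                     # shared team reward
    hysteretic_update(Q1, a1, r)
    hysteretic_update(Q2, a2, r)

print("greedy joint action:", int(np.argmax(Q1)), int(np.argmax(Q2)),
      "| payoff:", payoff[np.argmax(Q1), np.argmax(Q2)])
```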

Proceedings Article
03 Dec 2007
TL;DR: A rigorous analysis is provided of a variant of fitted Q-iteration in which the greedy action selection is replaced by searching for a policy, within a restricted set of candidate policies, that maximizes the average action values; the analysis yields what is believed to be the first finite-time bound for value-function-based algorithms for continuous state and action problems.
Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.
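A rough sketch of the analyzed variant is given below, assuming a one-dimensional toy batch: a regression model of Q is refit at each iteration, and instead of a greedy maximization over actions, the next policy is chosen from a small restricted set of candidate (linear) policies by maximizing the average action value over the batch. The dynamics, regressor, and candidate set are illustrative, not the paper's setting.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(9)
gamma = 0.9

# Batch from a random behaviour policy on a 1-D system we want to drive to zero.
S = rng.uniform(-1, 1, size=2000)
A = rng.uniform(-1, 1, size=2000)
S_next = np.clip(S + 0.5 * A + 0.05 * rng.normal(size=2000), -1, 1)
R = -S_next ** 2

# Restricted candidate policy class: linear policies a = k * s, clipped.
candidates = [-2.0, -1.0, 0.0, 1.0]
def policy_action(k, s):
    return np.clip(k * s, -1, 1)

Q, best_k = None, 0.0
for it in range(30):
    if Q is None:
        targets = R
    else:
        a_next = policy_action(best_k, S_next)
        targets = R + gamma * Q.predict(np.column_stack([S_next, a_next]))
    Q = ExtraTreesRegressor(n_estimators=30, random_state=0)
    Q.fit(np.column_stack([S, A]), targets)

    # Instead of a greedy max over actions, pick the candidate policy that
    # maximizes the *average* action value over the batch of states.
    scores = [Q.predict(np.column_stack([S, policy_action(k, S)])).mean()
              for k in candidates]
    best_k = candidates[int(np.argmax(scores))]

print("selected policy gain k:", best_k)
```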

Proceedings ArticleDOI
10 Apr 2007
TL;DR: This paper shows how the GP-enhanced model can be used in conjunction with reinforcement learning to generate a blimp controller that is superior to those learned with ODE or GP models alone.
Abstract: Blimps are a promising platform for aerial robotics and have been studied extensively for this purpose. Unlike other aerial vehicles, blimps are relatively safe and also possess the ability to loiter for long periods. These advantages, however, have been difficult to exploit because blimp dynamics are complex and inherently non-linear. The classical approach to system modeling represents the system as an ordinary differential equation (ODE) based on Newtonian principles. A more recent modeling approach is based on representing state transitions as a Gaussian process (GP). In this paper, we present a general technique for system identification that combines these two modeling approaches into a single formulation. This is done by training a Gaussian process on the residual between the non-linear model and ground truth training data. The result is a GP-enhanced model that provides an estimate of uncertainty in addition to giving better state predictions than either ODE or GP alone. We show how the GP-enhanced model can be used in conjunction with reinforcement learning to generate a blimp controller that is superior to those learned with ODE or GP models alone.
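The residual-modelling idea can be sketched on a scalar toy system standing in for blimp dynamics: a simplified ODE-style model predicts the next state, and a Gaussian process is trained on the residual between that prediction and observed data, so the combined model returns both a corrected prediction and an uncertainty estimate. The dynamics and kernel choice are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(10)

def true_dynamics(x, u):
    # Unknown "ground truth" (illustrative): includes a nonlinear drag term.
    return 0.9 * x + 0.5 * u - 0.2 * x * abs(x)

def ode_model(x, u):
    # Simplified Newtonian/ODE-style model: misses the nonlinear term.
    return 0.9 * x + 0.5 * u

# Collect training data and compute the residual the ODE model cannot explain.
X = rng.uniform(-2, 2, size=(200, 2))                 # columns: state, control
y_true = np.array([true_dynamics(x, u) for x, u in X])
y_ode = np.array([ode_model(x, u) for x, u in X])
residual = y_true - y_ode + 0.01 * rng.normal(size=200)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, residual)

def enhanced_model(x, u):
    """ODE prediction plus GP residual correction, with uncertainty."""
    mean, std = gp.predict(np.array([[x, u]]), return_std=True)
    return ode_model(x, u) + mean[0], std[0]

pred, std = enhanced_model(1.5, -0.3)
print("true:", round(true_dynamics(1.5, -0.3), 3),
      "| enhanced prediction:", round(pred, 3), "+/-", round(std, 3))
```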

Journal ArticleDOI
TL;DR: Self-tuning experience weighted attraction does as well as EWA in predicting behavior in new games, even though it has fewer parameters, and fits reliably better than the QRE equilibrium benchmark.

Journal ArticleDOI
01 Apr 2007
TL;DR: In this paper, an adaptive-critic-based neural network (NN) controller in discrete time is designed to deliver a desired tracking performance for a class of nonlinear systems in the presence of actuator constraints.
Abstract: A novel adaptive-critic-based neural network (NN) controller in discrete time is designed to deliver a desired tracking performance for a class of nonlinear systems in the presence of actuator constraints. The constraints of the actuator are treated in the controller design as the saturation nonlinearity. The adaptive critic NN controller architecture based on state feedback includes two NNs: the critic NN is used to approximate the "strategic" utility function, whereas the action NN is employed to minimize both the strategic utility function and the unknown nonlinear dynamic estimation errors. The critic and action NN weight updates are derived by minimizing certain quadratic performance indexes. Using the Lyapunov approach and with novel weight updates, the uniform ultimate boundedness of the closed-loop tracking error and weight estimates is shown in the presence of NN approximation errors and bounded unknown disturbances. The proposed NN controller works in the presence of multiple nonlinearities, unlike other schemes that normally approximate one nonlinearity. Moreover, the adaptive critic NN controller does not require an explicit offline training phase, and the NN weights can be initialized at zero or random. Simulation results justify the theoretical analysis.

Proceedings ArticleDOI
20 Jun 2007
TL;DR: It is demonstrated that Rule Transfer can effectively speed up learning in Keepaway, a benchmark RL problem in the robot soccer domain, based on experience from source tasks in the gridworld domain, as measured by three distinct transfer metrics.
Abstract: A typical goal for transfer learning algorithms is to utilize knowledge gained in a source task to learn a target task faster. Recently introduced transfer methods in reinforcement learning settings have shown considerable promise, but they typically transfer between pairs of very similar tasks. This work introduces Rule Transfer, a transfer algorithm that first learns rules to summarize a source task policy and then leverages those rules to learn faster in a target task. This paper demonstrates that Rule Transfer can effectively speed up learning in Keepaway, a benchmark RL problem in the robot soccer domain, based on experience from source tasks in the gridworld domain. We empirically show, through the use of three distinct transfer metrics, that Rule Transfer is effective across these domains.
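A schematic sketch of the Rule Transfer pipeline is given below, with a decision tree standing in for the rule learner, a hand-written stand-in for the learned source policy, and a trivial hand-coded state mapping between a 1-D source and target gridworld; none of this is the paper's Keepaway setup, and the reuse schedule is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(11)

# Source task: sample (state, action) pairs from a learned source policy.
# Here the "learned" source policy is a simple hand-written stand-in:
# in a 1-D gridworld, move toward the goal at position 0.
source_states = rng.uniform(-10, 10, size=(500, 1))
source_actions = (source_states[:, 0] < 0).astype(int)   # 1 = right, 0 = left

# Summarize the source policy as rules (decision tree as a rule-learner proxy).
rules = DecisionTreeClassifier(max_depth=3).fit(source_states, source_actions)

# Target task: similar structure but a different goal (at +5).
def step(s, a):
    s_next = s + (1 if a == 1 else -1)
    return s_next, (1.0 if abs(s_next - 5) < 1 else -0.01)

Q = np.zeros((41, 2))                      # states -20..20, discretized
def idx(s): return int(np.clip(round(s), -20, 20)) + 20

alpha, gamma, follow_rules = 0.2, 0.95, 0.5
for episode in range(300):
    s = rng.uniform(-20, 20)
    for t in range(60):
        # Early on, follow the transferred rules part of the time; the rules
        # are queried through a hand-coded inter-task state mapping (shift by 5).
        if rng.random() < follow_rules * (1 - episode / 300):
            a = int(rules.predict([[s - 5]])[0])
        else:
            a = int(np.argmax(Q[idx(s)])) if rng.random() > 0.1 else int(rng.integers(2))
        s_next, r = step(s, a)
        Q[idx(s), a] += alpha * (r + gamma * Q[idx(s_next)].max() - Q[idx(s), a])
        s = s_next

print("greedy action at s=0 (should move right, toward +5):",
      int(np.argmax(Q[idx(0.0)])))
```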

Book
02 Jul 2007
TL;DR: This monograph provides a concise introduction to the subject of multiagent systems, covering the theoretical foundations as well as more recent developments in a coherent and readable manner.
Abstract: Multiagent systems is an expanding field that blends classical fields like game theory and decentralized control with modern fields like computer science and machine learning. This monograph provides a concise introduction to the subject, covering the theoretical foundations as well as more recent developments in a coherent and readable manner. The text is centered on the concept of an agent as decision maker. Chapter 1 is a short introduction to the field of multiagent systems. Chapter 2 covers the basic theory of single-agent decision making under uncertainty. Chapter 3 is a brief introduction to game theory, explaining classical concepts like Nash equilibrium. Chapter 4 deals with the fundamental problem of coordinating a team of collaborative agents. Chapter 5 studies the problem of multiagent reasoning and decision making under partial observability. Chapter 6 focuses on the design of protocols that are stable against manipulations by self-interested agents. Chapter 7 provides a short introduction to the rapidly expanding field of multiagent reinforcement learning. The material can be used for teaching a half-semester course on multiagent systems covering, roughly, one chapter per lecture.

Book
12 Oct 2007
TL;DR: In this paper, a unified framework based on a sensitivity point of view is proposed for performance optimization of modern engineering systems. Because most engineering systems are too complicated to model, or their parameters cannot be easily identified, learning techniques have to be applied.
Abstract: Performance optimization is vital in the design and operation of modern engineering systems, including communications, manufacturing, robotics, and logistics. Most engineering systems are too complicated to model, or the system parameters cannot be easily identified, so learning techniques have to be applied. This book provides a unified framework based on a sensitivity point of view. It also introduces new approaches and proposes new research topics within this sensitivity-based framework. This new perspective on a popular topic is presented by a well-respected expert in the field.

Proceedings Article
03 Dec 2007
TL;DR: The results extend prior two-timescale convergence results for actor-critic methods by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor- Criterion methods by providing the first convergence proofs and the first fully incremental algorithms.
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.

Book ChapterDOI
09 Sep 2007
TL;DR: Recurrent Policy Gradients is presented, a model-free reinforcement learning (RL) method creating limited-memory stochastic policies for partially observable Markov decision problems (POMDPs) that require long-term memories of past observations.
Abstract: This paper presents Recurrent Policy Gradients, a model-free reinforcement learning (RL) method creating limited-memory stochastic policies for partially observable Markov decision problems (POMDPs) that require long-term memories of past observations. The approach involves approximating a policy gradient for a Recurrent Neural Network (RNN) by backpropagating return-weighted characteristic eligibilities through time. Using a "Long Short-Term Memory" architecture, we are able to outperform other RL methods on two important benchmark tasks. Furthermore, we show promising results on a complex car driving simulation task.

Journal ArticleDOI
TL;DR: A system for teaching the robot constrained reaching tasks is described, based on a dynamical system generator modulated by a learned speed trajectory and combined with a reinforcement learning module that allows the robot to adapt the trajectory when facing a new situation, e.g., in the presence of obstacles.
Abstract: The goal of developing algorithms for programming robots by demonstration is to create an easy way of programming robots that can be accomplished by everyone. When a demonstrator teaches a task to a robot, he/she shows some ways of fulfilling the task, but not all the possibilities. The robot must then be able to reproduce the task even when unexpected perturbations occur. In this case, it has to learn a new solution. In this paper, we describe a system to teach the robot constrained reaching tasks. Our system is based on a dynamical system generator modulated with a learned speed trajectory and combined with a reinforcement learning module to allow the robot to adapt the trajectory when facing a new situation, such as avoiding obstacles.