
Showing papers on "Reinforcement learning published in 2009"


Journal ArticleDOI
TL;DR: This work describes mathematical formulations for reinforcement learning and a practical implementation method known as adaptive dynamic programming that give insight into the design of controllers for man-made engineered systems that both learn and exhibit optimal behavior.
Abstract: Living organisms learn by acting on their environment, observing the resulting reward stimulus, and adjusting their actions accordingly to improve the reward. This action-based or reinforcement learning can capture notions of optimal behavior occurring in natural systems. We describe mathematical formulations for reinforcement learning and a practical implementation method known as adaptive dynamic programming. These give us insight into the design of controllers for man-made engineered systems that both learn and exhibit optimal behavior.

1,163 citations
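As a rough illustration of the dynamic-programming core that adaptive dynamic programming approximates online, here is a minimal policy-iteration sketch on a small synthetic MDP; the transition and reward arrays are made up, and nothing here reflects the paper's continuous-time, model-learning formulation.

```python
# Hypothetical illustration: exact policy iteration on a tiny discrete MDP,
# the dynamic-programming backbone that ADP methods approximate online.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # expected reward

policy = np.zeros(n_states, dtype=int)
for _ in range(50):
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the one-step lookahead.
    Q = R + gamma * P @ V
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("greedy policy:", policy, "values:", np.round(V, 3))
```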


Book
01 Apr 2009
TL;DR: This book describes algorithms with code examples backed up by a website that provides working implementations in Python and includes examples based on widely available datasets and practical and theoretical problems to test understanding and application of the material.
Abstract: Written in an easily accessible style, this book provides the ideal blend of theory and practical, applicable knowledge. It covers neural networks, graphical models, reinforcement learning, evolutionary algorithms, dimensionality reduction methods, and the important area of optimization. It treads the fine line between adequate academic rigor and overwhelming students with equations and mathematical concepts. The author includes examples based on widely available datasets and practical and theoretical problems to test understanding and application of the material. The book describes algorithms with code examples backed up by a website that provides working implementations in Python.

989 citations


Journal ArticleDOI
Yael Niv
TL;DR: The formal reinforcement learning framework is introduced, and the discussion is extended to aspects of learning not associated with phasic dopamine signals, such as learning of goal-directed responding that may not be dopamine-dependent, and learning about the vigor with which actions should be performed, which has been linked to tonic aspects of dopaminergic signaling.

605 citations
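To make the reinforcement learning framework concrete, the sketch below computes the temporal-difference prediction error commonly identified with the phasic dopamine signal; the trial structure and parameters are invented for illustration.

```python
# Hypothetical sketch of the TD prediction error over repeated cue-reward trials.
import numpy as np

n_steps, n_trials, alpha, gamma = 10, 200, 0.1, 1.0
V = np.zeros(n_steps + 1)          # value of each time step within a trial
reward = np.zeros(n_steps)
reward[-1] = 1.0                   # cue at t = 0, reward at t = 9

for trial in range(n_trials):
    deltas = np.zeros(n_steps)
    for t in range(n_steps):
        delta = reward[t] + gamma * V[t + 1] - V[t]   # TD prediction error
        V[t] += alpha * delta
        deltas[t] = delta
    if trial in (0, 20, n_trials - 1):
        # Early in training the error occurs at reward delivery; with learning it
        # propagates backward and shrinks once the reward is fully predicted.
        print(f"trial {trial:3d}:", np.round(deltas, 2))
```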


Journal ArticleDOI
TL;DR: This paper reexamines behavioral hierarchy and its neural substrates from the point of view of recent developments in computational reinforcement learning and considers a set of approaches known collectively as hierarchical reinforcement learning, which extend the reinforcement learning paradigm by allowing the learning agent to aggregate actions into reusable subroutines or skills.

568 citations


Journal ArticleDOI
TL;DR: Four new reinforcement learning algorithms based on actor-critic, natural-gradient, and function-approximation ideas are presented together with their convergence proofs; these are the first convergence proofs and the first fully incremental algorithms of this kind.

530 citations
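The following sketch shows a plain one-step actor-critic with a softmax policy and linear (one-hot) value features on a made-up chain task; it omits the natural-gradient machinery that distinguishes the algorithms in the paper.

```python
# A minimal actor-critic sketch (not the paper's natural-gradient variants).
import numpy as np

n_states, gamma, alpha_v, alpha_pi = 5, 0.95, 0.1, 0.05
rng = np.random.default_rng(1)
w = np.zeros(n_states)                 # critic weights (one-hot features)
theta = np.zeros((n_states, 2))        # actor preferences for actions {left, right}

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for episode in range(500):
    s = 0
    for _ in range(100):
        probs = np.exp(theta[s] - theta[s].max()); probs /= probs.sum()
        a = rng.choice(2, p=probs)
        s2, r, done = step(s, a)
        target = r if done else r + gamma * w[s2]
        delta = target - w[s]                     # TD error drives both updates
        w[s] += alpha_v * delta                   # critic: TD(0)
        grad_log = -probs; grad_log[a] += 1.0     # d log pi(a|s) / d theta[s]
        theta[s] += alpha_pi * delta * grad_log   # actor: policy-gradient step
        s = s2
        if done:
            break

print("learned state values:", np.round(w, 2))
print("P(right) per state:", np.round(np.exp(theta)[:, 1] / np.exp(theta).sum(axis=1), 2))
```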


Journal ArticleDOI
TL;DR: This work presents two network-selection algorithms, one based on population evolution and one on reinforcement learning; the population-evolution algorithm reaches the evolutionary equilibrium faster, but it requires a centralized controller to gather, process, and broadcast information about the users in the corresponding service area.
Abstract: Next-generation wireless networks will integrate multiple wireless access technologies to provide seamless mobility to mobile users with high-speed wireless connectivity. This will give rise to a heterogeneous wireless access environment where network selection becomes crucial for load balancing to avoid network congestion and performance degradation. We study the dynamics of network selection in a heterogeneous wireless network using the theory of evolutionary games. The competition among groups of users in different service areas to share the limited amount of bandwidth in the available wireless access networks is formulated as a dynamic evolutionary game, and the evolutionary equilibrium is considered to be the solution to this game. We present two algorithms, namely, population evolution and reinforcement-learning algorithms for network selection. Although the network-selection algorithm based on population evolution can reach the evolutionary equilibrium faster, it requires a centralized controller to gather, process, and broadcast information about the users in the corresponding service area. In contrast, with reinforcement learning, a user can gradually learn (by interacting with the service provider) and adapt the decision on network selection to reach evolutionary equilibrium without any interaction with other users. Performance of the dynamic evolutionary game-based network-selection algorithms is empirically investigated. The accuracy of the numerical results obtained from the game model is evaluated by using simulations.

487 citations
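A toy sketch of the decentralized learning idea: each user keeps its own payoff estimates per network and adapts its selection from its observed throughput share alone. The capacities, learning rate, and softmax rule are illustrative assumptions, not the paper's evolutionary-game model.

```python
# Hypothetical decentralized network selection via stateless Q-learning per user.
import numpy as np

n_users, n_networks = 20, 3
capacity = np.array([10.0, 6.0, 4.0])          # made-up network capacities
alpha, temperature, rounds = 0.1, 0.2, 300
rng = np.random.default_rng(2)
q = np.zeros((n_users, n_networks))            # per-user payoff estimates

for _ in range(rounds):
    # Each user samples a network from a softmax over its own estimates.
    logits = q / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    choices = np.array([rng.choice(n_networks, p=p) for p in probs])
    # Payoff: equal share of the chosen network's capacity.
    counts = np.bincount(choices, minlength=n_networks)
    payoff = capacity[choices] / np.maximum(counts[choices], 1)
    # Update uses only each user's own observed payoff (no coordination needed).
    q[np.arange(n_users), choices] += alpha * (payoff - q[np.arange(n_users), choices])

print("users per network:", np.bincount(choices, minlength=n_networks))
```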


BookDOI
06 Aug 2009
TL;DR: This book provides a broad survey of game AI, covering movement, pathfinding, decision making, tactical and strategic AI, learning (including decision tree learning, reinforcement learning, and artificial neural networks), board-game search, supporting technologies such as execution management and world interfacing, tools and content creation, and the design of game AI for different genres.
Abstract (table of contents): AI and Games: Introduction (What Is AI?; Model of Game AI; Algorithms, Data Structures, and Representations; On the Website; Layout of the Book); Game AI (The Complexity Fallacy; The Kind of AI in Games; Speed and Memory; The AI Engine). Techniques: Movement (The Basics of Movement Algorithms; Kinematic Movement Algorithms; Steering Behaviors; Combining Steering Behaviors; Predicting Physics; Jumping; Coordinated Movement; Motor Control; Movement in the Third Dimension); Pathfinding (The Pathfinding Graph; Dijkstra; A*; World Representations; Improving on A*; Hierarchical Pathfinding; Other Ideas in Pathfinding; Continuous Time Pathfinding; Movement Planning); Decision Making (Overview of Decision Making; Decision Trees; State Machines; Behavior Trees; Fuzzy Logic; Markov Systems; Goal-Oriented Behavior; Rule-Based Systems; Blackboard Architectures; Scripting; Action Execution); Tactical and Strategic AI (Waypoint Tactics; Tactical Analyses; Tactical Pathfinding; Coordinated Action); Learning (Learning Basics; Parameter Modification; Action Prediction; Decision Learning; Naive Bayes Classifiers; Decision Tree Learning; Reinforcement Learning; Artificial Neural Networks); Board Games (Game Theory; Minimaxing; Transposition Tables and Memory; Memory-Enhanced Test Algorithms; Opening Books and Other Set Plays; Further Optimizations; Turn-Based Strategy Games). Supporting Technologies: Execution Management (Scheduling; Anytime Algorithms; Level of Detail); World Interfacing (Communication; Getting Knowledge Efficiently; Event Managers; Polling Stations; Sense Management); Tools and Content Creation (Knowledge for Pathfinding and Waypoint Tactics; Knowledge for Movement; Knowledge for Decision Making; The Toolchain). Designing Game AI: Designing Game AI (The Design; Shooters; Driving; Real-Time Strategy; Sports; Turn-Based Strategy Games); AI-Based Game Genres (Teaching Characters; Flocking and Herding Games). Appendix: Books, Periodicals, and Papers; Games.

472 citations


Journal ArticleDOI
TL;DR: This work proposes a more structured formulation that greatly simplifies the construction of optimal control laws in both discrete and continuous domains, and enables computations that were not possible before.
Abstract: Optimal choice of actions is a fundamental problem relevant to fields as diverse as neuroscience, psychology, economics, computer science, and control engineering. Despite this broad relevance the abstract setting is similar: we have an agent choosing actions over time, an uncertain dynamical system whose state is affected by those actions, and a performance criterion that the agent seeks to optimize. Solving problems of this kind remains hard, in part, because of overly generic formulations. Here, we propose a more structured formulation that greatly simplifies the construction of optimal control laws in both discrete and continuous domains. An exhaustive search over actions is avoided and the problem becomes linear. This yields algorithms that outperform Dynamic Programming and Reinforcement Learning, and thereby solve traditional problems more efficiently. Our framework also enables computations that were not possible before: composing optimal control laws by mixing primitives, applying deterministic methods to stochastic systems, quantifying the benefits of error tolerance, and inferring goals from behavioral data via convex optimization. Development of a general class of easily solvable problems tends to accelerate progress—as linear systems theory has done, for example. Our framework may have similar impact in fields where optimal choice of actions is relevant.

455 citations
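The linear structure can be illustrated on a toy first-exit problem: with state cost q(s) and passive dynamics p(s'|s), the desirability z(s) = exp(-v(s)) satisfies a linear fixed-point equation that plain iteration solves without searching over actions. The dynamics and costs below are arbitrary.

```python
# Toy linearly-solvable MDP: solve z = exp(-q) * P z with the goal state fixed at z = 1.
import numpy as np

n = 6                                  # states 0..5, state 5 is the goal
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n), size=n)  # passive (uncontrolled) dynamics
q = np.full(n, 0.5); q[n - 1] = 0.0    # per-step state cost, zero at the goal
P[n - 1] = 0.0; P[n - 1, n - 1] = 1.0  # goal is absorbing

z = np.ones(n)
for _ in range(2000):
    z_new = np.exp(-q) * (P @ z)
    z_new[n - 1] = 1.0                 # boundary condition at the absorbing goal
    if np.max(np.abs(z_new - z)) < 1e-12:
        break
    z = z_new

v = -np.log(z)                         # optimal cost-to-go
u_star = P * z[None, :]                # optimal transitions proportional to p(s'|s) z(s')
u_star /= u_star.sum(axis=1, keepdims=True)
print("cost-to-go:", np.round(v, 3))
```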


Proceedings Article
01 Jan 2009
TL;DR: This paper presents, in a continuous-time framework, an online approach to direct adaptive optimal control with infinite-horizon cost for nonlinear systems; convergence of the algorithm is proven under the realistic assumption that the two neural networks do not provide perfect representations of the nonlinear control and cost functions.
Abstract: In this paper we present in a continuous-time framework an online approach to direct adaptive optimal control with infinite horizon cost for nonlinear systems. The algorithm converges online to the optimal control solution without knowledge of the internal system dynamics. Closed-loop dynamic stability is guaranteed throughout. The algorithm is based on a reinforcement learning scheme, namely Policy Iterations, and makes use of neural networks, in an Actor/Critic structure, to parametrically represent the control policy and the performance of the control system. The two neural networks are trained to express the optimal controller and optimal cost function which describes the infinite horizon control performance. Convergence of the algorithm is proven under the realistic assumption that the two neural networks do not provide perfect representations for the nonlinear control and cost functions. The result is a hybrid control structure which involves a continuous-time controller and a supervisory adaptation structure which operates based on data sampled from the plant and from the continuous-time performance dynamics. Such control structure is unlike any standard form of controllers previously seen in the literature. Simulation results, obtained considering two second-order nonlinear systems, are provided.

422 citations


Journal ArticleDOI
TL;DR: In this article, an online approach to direct adaptive optimal control with infinite horizon cost for nonlinear systems is presented, based on a reinforcement learning scheme, namely Policy Iterations, and makes use of neural networks, in an Actor/Critic structure, to parametrically represent the control policy and the performance of the control system.

411 citations


Journal ArticleDOI
29 Jul 2009-PLOS ONE
TL;DR: It is shown that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception, which results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming.
Abstract: This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

Journal ArticleDOI
TL;DR: A computational view of serotonin's involvement in the control of appetitively and aversively motivated actions is presented, suggesting that it is only a partial reflection of dopamine because of essential asymmetries between the natural statistics of rewards and punishments.
Abstract: Serotonin is a neuromodulator that is extensively entangled in fundamental aspects of brain function and behavior. We present a computational view of its involvement in the control of appetitively and aversively motivated actions. We first describe a range of its effects in invertebrates, endowing specific structurally fixed networks with plasticity at multiple spatial and temporal scales. We then consider its rather widespread distribution in the mammalian brain. We argue that this is associated with a more unified representational and functional role in aversive processing that is amenable to computational analyses with the kinds of reinforcement learning techniques that have helped elucidate dopamine's role in appetitive behavior. Finally, we suggest that it is only a partial reflection of dopamine because of essential asymmetries between the natural statistics of rewards and punishments.

Proceedings ArticleDOI
02 Aug 2009
TL;DR: This paper presents a reinforcement learning approach for mapping natural language instructions to sequences of executable actions, and uses a policy gradient algorithm to estimate the parameters of a log-linear model for action selection.
Abstract: In this paper, we present a reinforcement learning approach for mapping natural language instructions to sequences of executable actions. We assume access to a reward function that defines the quality of the executed actions. During training, the learner repeatedly constructs action sequences for a set of documents, executes those actions, and observes the resulting reward. We use a policy gradient algorithm to estimate the parameters of a log-linear model for action selection. We apply our method to interpret instructions in two domains --- Windows troubleshooting guides and game tutorials. Our results demonstrate that this method can rival supervised learning techniques while requiring few or no annotated training examples.
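A schematic sketch of the policy-gradient recipe with a log-linear action-selection model; the candidate-action features and the stand-in reward below are fabricated, unlike the paper's environment-derived reward from executing instructions.

```python
# Toy REINFORCE with a log-linear (softmax over features) policy.
import numpy as np

d, n_candidates, lr = 4, 3, 0.1
rng = np.random.default_rng(4)
theta = np.zeros(d)

def rollout(theta):
    """One hypothetical 'document': 5 decisions, reward = fraction of good picks."""
    grads, correct = [], 0
    for _ in range(5):
        feats = rng.normal(size=(n_candidates, d))     # features of each candidate action
        scores = feats @ theta
        probs = np.exp(scores - scores.max()); probs /= probs.sum()
        a = rng.choice(n_candidates, p=probs)
        grads.append(feats[a] - probs @ feats)         # gradient of log pi(a | state)
        correct += a == feats[:, 0].argmax()           # stand-in for a successful execution
    return sum(grads), correct / 5.0

for _ in range(2000):
    grad_logp, reward = rollout(theta)
    theta += lr * reward * grad_logp                   # policy-gradient update: R * grad log pi

print("learned weights:", np.round(theta, 2))
```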

Journal ArticleDOI
TL;DR: This overview of reinforcement learning is aimed at uncovering the mathematical roots of this science so that readers gain a clear understanding of the core concepts and are able to use them in their own research.
Abstract: In the last few years, reinforcement learning (RL), also called adaptive (or approximate) dynamic programming, has emerged as a powerful tool for solving complex sequential decision-making problems in control theory. Although seminal research in this area was performed in the artificial intelligence (AI) community, more recently it has attracted the attention of optimization theorists because of several noteworthy success stories from operations management. It is on large-scale and complex problems of dynamic optimization, in particular the Markov decision problem (MDP) and its variants, that the power of RL becomes more obvious. It has been known for many years that on large-scale MDPs, the curse of dimensionality and the curse of modeling render classical dynamic programming (DP) ineffective. The excitement in RL stems from its direct attack on these curses, which allows it to solve problems that were considered intractable via classical DP in the past. The success of RL is due to its strong mathematical roots in the principles of DP, Monte Carlo simulation, function approximation, and AI. Topics treated in some detail in this survey are temporal differences, Q-learning, semi-MDPs, and stochastic games. Several recent advances in RL, e.g., policy gradients and hierarchical RL, are covered along with references. Pointers to numerous examples of applications are provided. This overview is aimed at uncovering the mathematical roots of this science so that readers gain a clear understanding of the core concepts and are able to use them in their own research. The survey points to more than 100 references from the literature.
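As a concrete instance of one of the core algorithms the survey treats, here is tabular Q-learning on a hypothetical four-state chain.

```python
# Minimal tabular Q-learning sketch on a made-up chain task.
import numpy as np

n_states, n_actions, gamma, alpha, eps = 4, 2, 0.9, 0.1, 0.1
rng = np.random.default_rng(5)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for _ in range(2000):
    s = 0
    while True:
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        target = r if done else r + gamma * Q[s2].max()
        Q[s, a] += alpha * (target - Q[s, a])      # Q-learning (off-policy TD) update
        s = s2
        if done:
            break

print(np.round(Q, 2))
```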

Journal ArticleDOI
TL;DR: The current state-of-the-art for near-optimal behavior in finite Markov Decision Processes with a polynomial number of samples is summarized by presenting bounds for the problem in a unified theoretical framework.
Abstract: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.

Proceedings ArticleDOI
14 Jun 2009
TL;DR: A simple algorithm is presented, and it is proved that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps.
Abstract: We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we can achieve lower sample complexity bounds than existing algorithms, while using an exploration strategy that is much greedier than the (extremely cautious) exploration of PAC-MDP algorithms.

Journal ArticleDOI
TL;DR: Several variants of the general batch learning framework are discussed, particularly tailored to the use of multilayer perceptrons to approximate value functions over continuous state spaces, which are successfully used to learn crucial skills in soccer-playing robots participating in the RoboCup competitions.
Abstract: Batch reinforcement learning methods provide a powerful framework for learning efficiently and effectively in autonomous robots. The paper reviews some recent work of the authors aiming at the successful application of reinforcement learning in a challenging and complex domain. It discusses several variants of the general batch learning framework, particularly tailored to the use of multilayer perceptrons to approximate value functions over continuous state spaces. The batch learning framework is successfully used to learn crucial skills in our soccer-playing robots participating in the RoboCup competitions. This is demonstrated on three different case studies.
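A rough sketch of batch (fitted) Q iteration with a multilayer perceptron in the spirit of the framework described above; the 1-D task, network size, and number of iterations are arbitrary choices, not the robot-soccer setup.

```python
# Hypothetical fitted Q iteration: state x in [-1, 1], actions {-0.1, +0.1},
# reward 1 near the origin; learn offline from a fixed batch of transitions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
actions = np.array([-0.1, 0.1])
gamma = 0.95

# Collect a batch of random-policy transitions once.
x = rng.uniform(-1, 1, size=2000)
a_idx = rng.integers(2, size=2000)
x2 = np.clip(x + actions[a_idx], -1, 1)
r = (np.abs(x2) < 0.05).astype(float)

q_net = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=500, random_state=0)
targets = r.copy()
for it in range(20):                   # fitted Q iteration
    inputs = np.column_stack([x, actions[a_idx]])
    q_net.fit(inputs, targets)
    # Recompute targets with the current network: r + gamma * max_a Q(x', a).
    q_next = np.column_stack([
        q_net.predict(np.column_stack([x2, np.full_like(x2, a)])) for a in actions
    ])
    targets = r + gamma * q_next.max(axis=1)

test = np.array([[0.5, -0.1], [0.5, 0.1]])
print("Q(0.5, left), Q(0.5, right):", np.round(q_net.predict(test), 2))
```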

Proceedings Article
07 Dec 2009
TL;DR: A skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward is introduced that creates appropriate skills and achieves performance benefits in a challenging continuous domain.
Abstract: We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates appropriate skills and achieves performance benefits in a challenging continuous domain.

01 Jan 2009
TL;DR: A transversal view through microfluidics theory and applications, covering different kinds of phenomena, from continuous to multiphase flow, and a vision of two-phase microfluidic phenomena is given through nonlinear analyses applied to experimental time series.
Abstract: This paper first offers a transversal view through microfluidics theory and applications, starting from a brief overview on microfluidic systems and related theoretical issues, covering different kinds of phenomena, from continuous to multiphase flow. Multidimensional models, from lumped parameters to numerical models and computational solutions, are then considered as preliminary tools for the characterization of spatio-temporal dynamics in microfluidic flows. Following these, experimental approaches through original monitoring opto-electronic interfaces and systems are discussed. Finally, a vision of two phase microfluidic phenomena is given through nonlinear analyses applied to experimental time series.

Proceedings ArticleDOI
Jia Rao, Xiangping Bu, Cheng-Zhong Xu, Le Yi Wang, George Yin
15 Jun 2009
TL;DR: A reinforcement learning (RL) based approach, namely VCONF, to automate the VM configuration process, which employs model-based RL algorithms to address the scalability and adaptability issues in applying RL in systems management.
Abstract: Virtual machine (VM) technology enables multiple VMs to share resources on the same host. Resources allocated to the VMs should be re-configured dynamically in response to changes in application demands or resource supply. Because VM execution involves the privileged domain and the VM monitor, the mapping from a VM's resources to its performance is uncertain, which poses challenges for the online determination of appropriate VM configurations. In this paper, we propose a reinforcement learning (RL) based approach, namely VCONF, to automate the VM configuration process. VCONF employs model-based RL algorithms to address the scalability and adaptability issues in applying RL to systems management. Experimental results in both controlled environments and a cloud testbed with Xen VMs and representative server workloads demonstrate the effectiveness of VCONF. The approach is able to find optimal (or near-optimal) configurations in small-scale systems and shows good adaptability and scalability.

Journal ArticleDOI
TL;DR: Reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by taking into account delayed effects even when the relationship between actions and outcomes is not fully known.
Abstract: There has been significant recent research activity in developing therapies that are tailored to each individual. Finding such therapies in treatment settings involving multiple decision times is a major challenge. In this dissertation, we develop reinforcement learning trials for discovering these optimal regimens for life-threatening diseases such as cancer. A temporal-difference learning method called Q-learning is utilized which involves learning an optimal policy from a single training set of finite longitudinal patient trajectories. Approximating the Q-function with time-indexed parameters can be achieved by using support vector regression or extremely randomized trees. Within this framework, we demonstrate that the procedure can extract optimal strategies directly from clinical data without relying on the identification of any accurate mathematical models, unlike approaches based on adaptive design. We show that reinforcement learning has tremendous potential in clinical research because it can select actions that improve outcomes by taking into account delayed effects even when the relationship between actions and outcomes is not fully known. To support our claims, the methodology's practical utility is firstly illustrated in a virtual simulated clinical trial. We then apply this general strategy with significant refinements to studying and discovering optimal treatments for advanced metastatic stage IIIB/IV non-small cell lung cancer (NSCLC). In addition to the complexity of the NSCLC problem of selecting optimal compounds for first and second-line treatments based on prognostic factors, another primary scientific goal is to determine the optimal time to initiate second-line therapy, either immediately or delayed after induction therapy, yielding the longest overall survival time. We show that reinforcement learning not only successfully identifies optimal strategies for two lines of treatment from clinical data, but also reliably selects the best initial time for second-line therapy while taking into account heterogeneities of NSCLC across patients.
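A highly simplified sketch of the two-decision Q-learning idea using extremely randomized trees: fit the second-stage Q-function, back up its maximum as the first-stage target, and read off a treatment recommendation. The data below are synthetic and the covariates hypothetical.

```python
# Two-stage fitted Q-learning with extremely randomized trees on synthetic data.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(7)
n = 2000
x1 = rng.normal(size=n)                               # baseline prognostic factor
a1 = rng.integers(2, size=n)                          # first-line treatment (0/1)
x2 = x1 + 0.5 * a1 + rng.normal(scale=0.5, size=n)    # intermediate state
a2 = rng.integers(2, size=n)                          # second-line treatment (0/1)
# Synthetic survival-like outcome observed after stage 2.
y = 1.0 + 0.5 * a1 * (x1 > 0) + 0.8 * a2 * (x2 < 0) + rng.normal(scale=0.3, size=n)

# Stage 2: regress outcome on (x2, a2), then back up the best achievable value.
q2 = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(np.column_stack([x2, a2]), y)
v2 = np.maximum(q2.predict(np.column_stack([x2, np.zeros(n)])),
                q2.predict(np.column_stack([x2, np.ones(n)])))

# Stage 1: regress the backed-up value on (x1, a1).
q1 = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(np.column_stack([x1, a1]), v2)

patient = np.array([[1.2]])                           # hypothetical new patient, x1 = 1.2
q1_vals = [q1.predict(np.column_stack([patient, [[a]]]))[0] for a in (0, 1)]
print("recommended first-line treatment:", int(np.argmax(q1_vals)))
```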

Proceedings ArticleDOI
29 Jun 2009
TL;DR: A user study shows that the learned optimal strategy differs from the fixed one and supports more effective and efficient interaction sessions; the approach allows conversational systems to autonomously improve a fixed strategy and eventually learn a better one using reinforcement learning techniques.
Abstract: Conversational recommender systems (CRSs) assist online users in their information-seeking and decision making tasks by supporting an interactive process. Although these processes could be rather diverse, CRSs typically follow a fixed strategy, e.g., based on critiquing or on iterative query reformulation. In a previous paper, we proposed a novel recommendation model that allows conversational systems to autonomously improve a fixed strategy and eventually learn a better one using reinforcement learning techniques. This strategy is optimal for the given model of the interaction and it is adapted to the users' behaviors. In this paper we validate our approach in an online CRS by means of a user study involving several hundreds of testers. We show that the optimal strategy is different from the fixed one, and supports more effective and efficient interaction sessions.

Journal ArticleDOI
TL;DR: A Bayesian optimization method that dynamically trades off exploration and exploitation for optimal sensing with a mobile robot and is applicable to other closely-related domains, including active vision, sequential experimental design, dynamic sensing and calibration with mobile sensors.
Abstract: We address the problem of online path planning for optimal sensing with a mobile robot. The objective of the robot is to learn the most about its pose and the environment given time constraints. We use a POMDP with a utility function that depends on the belief state to model the finite horizon planning problem. We replan as the robot progresses throughout the environment. The POMDP is high-dimensional, continuous, non-differentiable, nonlinear, non-Gaussian and must be solved in real-time. Most existing techniques for stochastic planning and reinforcement learning are therefore inapplicable. To solve this extremely complex problem, we propose a Bayesian optimization method that dynamically trades off exploration (minimizing uncertainty in unknown parts of the policy space) and exploitation (capitalizing on the current best solution). We demonstrate our approach with a visually guided mobile robot. The solution proposed here is also applicable to other closely-related domains, including active vision, sequential experimental design, dynamic sensing and calibration with mobile sensors.
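The exploration/exploitation trade-off at the heart of the method can be illustrated with a generic Bayesian-optimization loop: a Gaussian-process surrogate plus an expected-improvement acquisition on a made-up 1-D objective, not the paper's belief-state POMDP planner.

```python
# Generic Bayesian optimization sketch with a GP surrogate and expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)   # unknown reward to maximize

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(3, 1))                   # a few initial probes
y = objective(X).ravel()
grid = np.linspace(0, 1, 200).reshape(-1, 1)

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-6).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = grid[ei.argmax()].reshape(1, 1)               # explore/exploit trade-off
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best location found:", float(X[y.argmax()][0]))
```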

Journal ArticleDOI
TL;DR: It is found that dopaminergic drugs selectively modulate learning from positive outcomes, and a novel dopamine-dependent effect on decision making that is not accounted for by reinforcement learning models is found.
Abstract: Making appropriate choices often requires the ability to learn the value of available options from experience. Parkinson's disease is characterized by a loss of dopamine neurons in the substantia nigra, neurons hypothesized to play a role in reinforcement learning. Although previous studies have shown that Parkinson's patients are impaired in tasks involving learning from feedback, they have not directly tested the widely held hypothesis that dopamine neuron activity specifically encodes the reward prediction error signal used in reinforcement learning models. To test a key prediction of this hypothesis, we fit choice behavior from a dynamic foraging task with reinforcement learning models and show that treatment with dopaminergic drugs alters choice behavior in a manner consistent with the theory. More specifically, we found that dopaminergic drugs selectively modulate learning from positive outcomes. We observed no effect of dopaminergic drugs on learning from negative outcomes. We also found a novel dopamine-dependent effect on decision making that is not accounted for by reinforcement learning models: perseveration in choice, independent of reward history, increases with Parkinson's disease and decreases with dopamine therapy.
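A hypothetical sketch of the model-fitting approach: simulate two-armed choices from a Q-learning agent with separate learning rates for positive and negative outcomes, then recover the parameters by maximum likelihood. Task probabilities and parameter values are invented.

```python
# Fit an asymmetric Q-learning model to simulated choice data by maximum likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
p_reward = np.array([0.7, 0.3])
true = dict(alpha_pos=0.4, alpha_neg=0.1, beta=4.0)

def simulate(params, n_trials=500):
    q = np.zeros(2); choices = []; rewards = []
    for _ in range(n_trials):
        p = np.exp(params["beta"] * q); p /= p.sum()
        c = rng.choice(2, p=p)
        r = float(rng.random() < p_reward[c])
        lr = params["alpha_pos"] if r - q[c] > 0 else params["alpha_neg"]
        q[c] += lr * (r - q[c])
        choices.append(c); rewards.append(r)
    return np.array(choices), np.array(rewards)

def neg_log_lik(theta, choices, rewards):
    a_pos, a_neg, beta = theta
    q = np.zeros(2); nll = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * q); p /= p.sum()
        nll -= np.log(p[c] + 1e-12)
        lr = a_pos if r - q[c] > 0 else a_neg
        q[c] += lr * (r - q[c])
    return nll

choices, rewards = simulate(true)
fit = minimize(neg_log_lik, x0=[0.3, 0.3, 2.0], args=(choices, rewards),
               bounds=[(0.01, 1), (0.01, 1), (0.1, 20)])
print("recovered (alpha_pos, alpha_neg, beta):", np.round(fit.x, 2))
```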

Journal ArticleDOI
TL;DR: This paper has two main objectives: to present problems, methods, approaches and practices in traffic engineering (especially regarding traffic signal control); and to highlight open problems and challenges so that future research in multiagent systems can address them.
Abstract: The increasing demand for mobility in our society poses various challenges to traffic engineering, computer science in general, and artificial intelligence and multiagent systems in particular. As is often the case, it is not possible to provide additional capacity, so a more efficient use of the available transportation infrastructure is necessary. This relates closely to multiagent systems, as many problems in traffic management and control are inherently distributed. Also, many actors in a transportation system fit the concept of autonomous agents very well: the driver, the pedestrian, the traffic expert; in some cases, the intersection and the traffic signal controller can also be regarded as autonomous agents. However, the "agentification" of a transportation system is associated with some challenging issues: the number of agents is high; agents are typically highly adaptive; they react to changes in the environment at the individual level but cause an unpredictable collective pattern; and they act in a highly coupled environment. Therefore, this domain poses many challenges for standard techniques from multiagent systems such as coordination and learning. This paper has two main objectives: (i) to present problems, methods, approaches and practices in traffic engineering (especially regarding traffic signal control); and (ii) to highlight open problems and challenges so that future research in multiagent systems can address them.

Journal ArticleDOI
TL;DR: Modified versions of an action–value learning model captured a variety of choice strategies of rats, including win-stay–lose-switch and persevering behavior, and predicted rats' choice sequences better than the best multistep Markov model.
Abstract: Reinforcement learning theory plays a key role in understanding the behavioral and neural mechanisms of choice behavior in animals and humans. Especially, intermediate variables of learning models estimated from behavioral data, such as the expectation of reward for each candidate choice (action value), have been used in searches for the neural correlates of computational elements in learning and decision making. The aims of the present study are as follows: (1) to test which computational model best captures the choice learning process in animals and (2) to elucidate how action values are represented in different parts of the corticobasal ganglia circuit. We compared different behavioral learning algorithms to predict the choice sequences generated by rats during a free-choice task and analyzed associated neural activity in the nucleus accumbens (NAc) and ventral pallidum (VP). The major findings of this study were as follows: (1) modified versions of an action-value learning model captured a variety of choice strategies of rats, including win-stay-lose-switch and persevering behavior, and predicted rats' choice sequences better than the best multistep Markov model; and (2) information about action values and future actions was coded in both the NAc and VP, but was less dominant than information about trial types, selected actions, and reward outcome. The results of our model-based analysis suggest that the primary role of the NAc and VP is to monitor information important for updating choice behaviors. Information represented in the NAc and VP might contribute to a choice mechanism that is situated elsewhere.

Journal ArticleDOI
TL;DR: This article introduces Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm, and proposes to learn probabilistic models of the a priori unknown transition dynamics and the value functions on the fly.

Proceedings ArticleDOI
14 Jun 2009
TL;DR: This paper proposes a regularization framework for the LSTD algorithm, which is robust to irrelevant features and also serves as a method for feature selection, and presents an algorithm similar to the Least Angle Regression algorithm that can efficiently compute the optimal solution.
Abstract: We consider the task of reinforcement learning with linear value function approximation. Temporal difference algorithms, and in particular the Least-Squares Temporal Difference (LSTD) algorithm, provide a method for learning the parameters of the value function, but when the number of features is large this algorithm can over-fit to the data and is computationally expensive. In this paper, we propose a regularization framework for the LSTD algorithm that overcomes these difficulties. In particular, we focus on the case of l1 regularization, which is robust to irrelevant features and also serves as a method for feature selection. Although the l1 regularized LSTD solution cannot be expressed as a convex optimization problem, we present an algorithm similar to the Least Angle Regression (LARS) algorithm that can efficiently compute the optimal solution. Finally, we demonstrate the performance of the algorithm experimentally.
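A compact sketch of LSTD on sampled transitions; for simplicity it uses an l2 (ridge) regularizer, whereas the paper's l1 case requires the LARS-style homotopy method described there. The features and toy chain are assumptions.

```python
# LSTD with ridge regularization on transitions from a fixed policy.
import numpy as np

gamma, ridge, n_features = 0.95, 1e-3, 8
rng = np.random.default_rng(10)

def features(s):
    """Hypothetical radial-basis features of a scalar state in [0, 1]."""
    centers = np.linspace(0, 1, n_features)
    return np.exp(-((s - centers) ** 2) / 0.02)

# Transitions from a fixed policy on a toy chain: s' = s + 0.05, reward near s = 1.
A = np.zeros((n_features, n_features))
b = np.zeros(n_features)
for _ in range(5000):
    s = rng.uniform(0, 0.95)
    s2 = min(1.0, s + 0.05)
    r = 1.0 if s2 >= 0.95 else 0.0
    phi, phi2 = features(s), features(s2)
    A += np.outer(phi, phi - gamma * phi2)     # LSTD accumulators
    b += r * phi

w = np.linalg.solve(A + ridge * np.eye(n_features), b)
print("approximate values at s = 0.1, 0.5, 0.9:",
      [round(float(features(s) @ w), 2) for s in (0.1, 0.5, 0.9)])
```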

Journal ArticleDOI
01 Apr 2009
TL;DR: This paper compares reinforcement learning with model predictive control in a unified framework and reports experimental results of their application to the synthesis of a controller for a nonlinear and deterministic electrical power oscillations damping problem.
Abstract: This paper compares reinforcement learning (RL) with model predictive control (MPC) in a unified framework and reports experimental results of their application to the synthesis of a controller for a nonlinear and deterministic electrical power oscillations damping problem. Both families of methods are based on the formulation of the control problem as a discrete-time optimal control problem. The considered MPC approach exploits an analytical model of the system dynamics and cost function and computes open-loop policies by applying an interior-point solver to a minimization problem in which the system dynamics are represented by equality constraints. The considered RL approach infers in a model-free way closed-loop policies from a set of system trajectories and instantaneous cost values by solving a sequence of batch-mode supervised learning problems. The results obtained provide insight into the pros and cons of the two approaches and show that RL may certainly be competitive with MPC even in contexts where a good deterministic system model is available.

Journal ArticleDOI
TL;DR: Neural network simulations that capture the interactions between instruction-driven and reinforcement-driven behavior via two potential neural circuits are presented, suggesting the existence of a "confirmation bias" in which the PFC/HC system trains the reinforcement system by amplifying outcomes that are consistent with instructions while diminishing inconsistent outcomes.