Showing papers on "Reinforcement learning published in 2014"

PDF

Open Access

Posted Content•

[...]

Ian Goodfellow¹, Jean Pouget-Abadie², Mehdi Mirza², Bing Xu², David Warde-Farley², Sherjil Ozair², Aaron Courville², Yoshua Bengio² - Show less +4 more•Institutions (2)

Google¹, Université de Montréal²

10 Jun 2014-arXiv: Machine Learning

TL;DR: In this article, a generative adversarial network (GAN) is proposed to estimate generative models via an adversarial process, in which two models are simultaneously trained: a generator G and a discriminator D that estimates the probability that a sample came from the training data rather than G.

...read moreread less

Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

...read moreread less

2,657 citations

Proceedings Article•

Deterministic Policy Gradient Algorithms

[...]

David Silver, Guy Lever¹, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller - Show less +2 more•Institutions (1)

University College London¹

21 Jun 2014

TL;DR: This paper introduces an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy and demonstrates that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.

...read moreread less

Abstract: In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.

...read moreread less

2,174 citations

Posted Content•

Recurrent Models of Visual Attention

[...]

Volodymyr Mnih¹, Nicolas Heess¹, Alex Graves¹, Koray Kavukcuoglu¹•Institutions (1)

Google¹

24 Jun 2014-arXiv: Learning

TL;DR: In this article, a recurrent neural network (RNN) model is proposed to extract information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution.

...read moreread less

Abstract: Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.

...read moreread less

2,107 citations

Proceedings Article•

Recurrent Models of Visual Attention

[...]

Volodymyr Mnih¹, Nicolas Heess¹, Alex Graves¹, Koray Kavukcuoglu¹•Institutions (1)

Google¹

08 Dec 2014

TL;DR: A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.

...read moreread less

1,649 citations

Journal Article•DOI•

A Review of Auto-scaling Techniques for Elastic Applications in Cloud Environments

[...]

Tania Lorido-Botran¹, Jose Miguel-Alonso¹, Jose A. Lozano¹•Institutions (1)

University of the Basque Country¹

01 Dec 2014

TL;DR: This work proposes a classification of techniques for automating application scaling in the cloud into five main categories: static threshold-based rules, control theory, reinforcement learning, queuing theory and time series analysis, and uses this classification to carry out a literature review of proposals.

...read moreread less

Abstract: Cloud computing environments allow customers to dynamically scale their applications. The key problem is how to lease the right amount of resources, on a pay-as-you-go basis. Application re-dimensioning can be implemented effortlessly, adapting the resources assigned to the application to the incoming user demand. However, the identification of the right amount of resources to lease in order to meet the required Service Level Agreement, while keeping the overall cost low, is not an easy task. Many techniques have been proposed for automating application scaling. We propose a classification of these techniques into five main categories: static threshold-based rules, control theory, reinforcement learning, queuing theory and time series analysis. Then we use this classification to carry out a literature review of proposals for auto-scaling in the cloud.

...read moreread less

688 citations

Journal Article•DOI•

Policy Iteration Adaptive Dynamic Programming Algorithm for Discrete-Time Nonlinear Systems

[...]

Derong Liu, Qinglai Wei

01 Mar 2014-IEEE Transactions on Neural Networks

TL;DR: It is shown that the iterative performance index function is nonincreasingly convergent to the optimal solution of the Hamilton-Jacobi-Bellman equation and it is proven that any of the iteratives control laws can stabilize the nonlinear systems.

...read moreread less

Abstract: This paper is concerned with a new discrete-time policy iteration adaptive dynamic programming (ADP) method for solving the infinite horizon optimal control problem of nonlinear systems. The idea is to use an iterative ADP technique to obtain the iterative control law, which optimizes the iterative performance index function. The main contribution of this paper is to analyze the convergence and stability properties of policy iteration method for discrete-time nonlinear systems for the first time. It shows that the iterative performance index function is nonincreasingly convergent to the optimal solution of the Hamilton-Jacobi-Bellman equation. It is also proven that any of the iterative control laws can stabilize the nonlinear systems. Neural networks are used to approximate the performance index function and compute the optimal control law, respectively, for facilitating the implementation of the iterative ADP algorithm, where the convergence of the weight matrices is analyzed. Finally, the numerical results and analysis are presented to illustrate the performance of the developed method.

...read moreread less

535 citations

Journal Article•DOI•

Temporal structure of motor variability is dynamically regulated and predicts motor learning ability

[...]

Howard G. Wu¹, Yohsuke R. Miyamoto¹, Luis Nicolas Gonzalez Castro¹, Bence P. Ölveczky¹, Maurice A. Smith¹ - Show less +1 more•Institutions (1)

Harvard University¹

01 Feb 2014-Nature Neuroscience

TL;DR: Surprisingly, it was found that higher levels of task-relevant motor variability predicted faster learning both across individuals and across tasks in two different paradigms, one relying on reward-based learning to shape specific arm movement trajectories and the other relying on error- based learning to adapt movements in novel physical environments.

...read moreread less

Abstract: Individual differences in motor learning ability are widely acknowledged, yet little is known about the factors that underlie them. Here we explore whether movement-to-movement variability in motor output, a ubiquitous if often unwanted characteristic of motor performance, predicts motor learning ability. Surprisingly, we found that higher levels of task-relevant motor variability predicted faster learning both across individuals and across tasks in two different paradigms, one relying on reward-based learning to shape specific arm movement trajectories and the other relying on error-based learning to adapt movements in novel physical environments. We proceeded to show that training can reshape the temporal structure of motor variability, aligning it with the trained task to improve learning. These results provide experimental support for the importance of action exploration, a key idea from reinforcement learning theory, showing that motor variability facilitates motor learning in humans and that our nervous systems actively regulate it to improve learning.

...read moreread less

534 citations

Journal Article•DOI•

Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning

[...]

Hamidreza Modares¹, Frank L. Lewis¹•Institutions (1)

University of Texas at Arlington¹

01 Jul 2014-Automatica

TL;DR: This formulation extends the integral reinforcement learning (IRL) technique, a method for solving optimal regulation problems, to learn the solution to the OTCP, and it also takes into account the input constraints a priori.

...read moreread less

440 citations

Book•DOI•

Reinforcement Learning: State-of-the-Art

[...]

Marco A. Wiering, Martijn van Otterlo

16 Apr 2014-Springer US

TL;DR: The main goal of this book is to present an up-to-date series of survey articles on the main contemporary sub-fields of reinforcement learning, including surveys on partially observable environments, hierarchical task decompositions, relational knowledge representation and predictive state representations.

...read moreread less

Abstract: Reinforcement learning encompasses both a science of adaptive behavior of rational beings in uncertain environments and a computational methodology for finding optimal behaviors for challenging problems in control, optimization and adaptive behavior of intelligent agents. As a field, reinforcement learning has progressed tremendously in the past decade.The main goal of this book is to present an up-to-date series of survey articles on the main contemporary sub-fields of reinforcement learning. This includes surveys on partially observable environments, hierarchical task decompositions, relational knowledge representation and predictive state representations. Furthermore, topics such as transfer, evolutionary methods and continuous spaces in reinforcement learning are surveyed. In addition, several chapters review reinforcement learning methods in robotics, in games, and in computational neuroscience. In total seventeen different subfields are presented by mostly young experts in those areas, and together they truly represent a state-of-the-art of current reinforcement learning research.Marco Wiering works at the artificial intelligence department of the University of Groningen in the Netherlands. He has published extensively on various reinforcement learning topics. Martijn van Otterlo works in the cognitive artificial intelligence group at the Radboud University Nijmegen in The Netherlands. He has mainly focused on expressive knowledgerepresentation in reinforcement learning settings.

...read moreread less

420 citations

Journal Article•DOI•

Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems

[...]

Hamidreza Modares¹, Frank L. Lewis², Mohammad-Bagher Naghibi-Sistani¹•Institutions (2)

Ferdowsi University of Mashhad¹, University of Texas at Arlington²

01 Jan 2014-Automatica

TL;DR: An integral reinforcement learning algorithm on an actor-critic structure is developed to learn online the solution to the Hamilton-Jacobi-Bellman equation for partially-unknown constrained-input systems and it is shown that using this technique, an easy-to-check condition on the richness of the recorded data is sufficient to guarantee convergence to a near-optimal control law.

...read moreread less

410 citations

Journal Article•DOI•

Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics

[...]

Bahare Kiumarsi¹, Frank L. Lewis², Hamidreza Modares¹, Ali Karimpour¹, Mohammad Bagher Naghibi-Sistani¹ - Show less +1 more•Institutions (2)

Ferdowsi University of Mashhad¹, University of Texas at Arlington²

01 Apr 2014-Automatica

TL;DR: A novel approach based on the Q -learning algorithm is proposed to solve the infinite-horizon linear quadratic tracker (LQT) for unknown discrete-time systems in a causal manner and the optimal control input is obtained by only solving an augmented ARE.

...read moreread less

Proceedings Article•

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning

[...]

Xiaoxiao Guo¹, Satinder Singh¹, Honglak Lee¹, Richard L. Lewis¹, Xiaoshi Wang¹ - Show less +1 more•Institutions (1)

University of Michigan¹

08 Dec 2014

TL;DR: The central idea is to use the slow planning-based agents to provide training data for a deep-learning architecture capable of real-time play, and proposed new agents based on this idea are proposed and shown to outperform DQN.

...read moreread less

Abstract: The combination of modern Reinforcement Learning and Deep Learning approaches holds the promise of making significant progress on challenging applications requiring both rich perception and policy-selection. The Arcade Learning Environment (ALE) provides a set of Atari games that represent a useful benchmark set of such applications. A recent breakthrough in combining model-free reinforcement learning with deep learning, called DQN, achieves the best real-time agents thus far. Planning-based approaches achieve far higher scores than the best model-free approaches, but they exploit information that is not available to human players, and they are orders of magnitude slower than needed for real-time play. Our main goal in this work is to build a better real-time Atari game playing agent than DQN. The central idea is to use the slow planning-based agents to provide training data for a deep-learning architecture capable of real-time play. We proposed new agents based on this idea and show that they outperform DQN.

...read moreread less

Journal Article•DOI•

Opponent actor learning (OpAL): modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive.

[...]

Anne G.E. Collins¹, Michael J. Frank¹•Institutions (1)

Brown University¹

01 Jul 2014-Psychological Review

TL;DR: A novel algorithmic model expanding the classical actor-critic architecture to include fundamental interactive properties of neural circuit models, incorporating both incentive and learning effects into a single theoretical framework simultaneously captures documented effects of dopamine on both learning and choice incentive.

...read moreread less

Abstract: The striatal dopaminergic system has been implicated in reinforcement learning (RL), motor performance, and incentive motivation. Various computational models have been proposed to account for each of these effects individually, but a formal analysis of their interactions is lacking. Here we present a novel algorithmic model expanding the classical actor-critic architecture to include fundamental interactive properties of neural circuit models, incorporating both incentive and learning effects into a single theoretical framework. The standard actor is replaced by a dual opponent actor system representing distinct striatal populations, which come to differentially specialize in discriminating positive and negative action values. Dopamine modulates the degree to which each actor component contributes to both learning and choice discriminations. In contrast to standard frameworks, this model simultaneously captures documented effects of dopamine on both learning and choice incentive-and their interactions-across a variety of studies, including probabilistic RL, effort-based choice, and motor skill learning.

...read moreread less

Journal Article•DOI•

Linear Quadratic Tracking Control of Partially-Unknown Continuous-Time Systems Using Reinforcement Learning

[...]

Hamidreza Modares¹, Frank L. Lewis¹•Institutions (1)

University of Texas at Austin¹

11 Apr 2014-IEEE Transactions on Automatic Control

TL;DR: An online learning algorithm is developed to solve the linear quadratic tracking (LQT) problem for partially-unknown continuous-time systems and it is shown that the value function is Quadratic in terms of the state of the system and the command generator.

...read moreread less

Abstract: In this technical note, an online learning algorithm is developed to solve the linear quadratic tracking (LQT) problem for partially-unknown continuous-time systems. It is shown that the value function is quadratic in terms of the state of the system and the command generator. Based on this quadratic form, an LQT Bellman equation and an LQT algebraic Riccati equation (ARE) are derived to solve the LQT problem. The integral reinforcement learning technique is used to find the solution to the LQT ARE online and without requiring the knowledge of the system drift dynamics or the command generator dynamics. The convergence of the proposed online algorithm to the optimal control solution is verified. To show the efficiency of the proposed approach, a simulation example is provided.

...read moreread less

Proceedings Article•DOI•

Reachability-based safe learning with Gaussian processes

[...]

Anayo K. Akametalu¹, Shahab Kaynama¹, Jaime F. Fisac¹, Melanie N. Zeilinger¹, Jeremy H. Gillula¹, Claire J. Tomlin¹ - Show less +2 more•Institutions (1)

University of California, Berkeley¹

01 Dec 2014

TL;DR: This work proposes a novel method that uses a principled approach to learn the system's unknown dynamics based on a Gaussian process model and iteratively approximates the maximal safe set and further incorporates safety into the reinforcement learning performance metric, allowing a better integration of safety and learning.

...read moreread less

Abstract: Reinforcement learning for robotic applications faces the challenge of constraint satisfaction, which currently impedes its application to safety critical systems Recent approaches successfully introduce safety based on reachability analysis, determining a safe region of the state space where the system can operate However, overly constraining the freedom of the system can negatively affect performance, while attempting to learn less conservative safety constraints might fail to preserve safety if the learned constraints are inaccurate We propose a novel method that uses a principled approach to learn the system's unknown dynamics based on a Gaussian process model and iteratively approximates the maximal safe set A modified control strategy based on real-time model validation preserves safety under weaker conditions than current approaches Our framework further incorporates safety into the reinforcement learning performance metric, allowing a better integration of safety and learning We demonstrate our algorithm on simulations of a cart-pole system and on an experimental quadrotor application and show how our proposed scheme succeeds in preserving safety where current approaches fail to avoid an unsafe condition

...read moreread less

Book•DOI•

Inductive Logic Programming: 23rd International Conference, ILP 2013, Rio de Janeiro, Brazil, August 28-30, 2013, Revised Selected Papers

[...]

Gerson Zaverucha, Vítor Santos Costa, Aline Paes

24 Sep 2014

Abstract: This book constitutes the thoroughly refereed post-proceedings of the 23rd International Conference on Inductive Logic Programming, ILP 2013, held in Rio de Janeiro, Brazil, in August 2013. The 9 revised extended papers were carefully reviewed and selected from 42 submissions. The conference now focuses on all aspects of learning in logic, multi-relational learning and data mining, statistical relational learning, graph and tree mining, relational reinforcement learning, and other forms of learning from structured data.

...read moreread less

Journal Article•DOI•

Decentralized Stabilization for a Class of Continuous-Time Nonlinear Interconnected Systems Using Online Learning Optimal Control Approach

[...]

Derong Liu, Ding Wang, Hongliang Li

01 Feb 2014-IEEE Transactions on Neural Networks

TL;DR: It is proven that the decentralized control strategy of the overall system can be established by adding appropriate feedback gains to the optimal control policies of the isolated subsystems, and an online policy iteration algorithm is presented to solve the Hamilton-Jacobi-Bellman equations.

...read moreread less

Abstract: In this paper, using a neural-network-based online learning optimal control approach, a novel decentralized control strategy is developed to stabilize a class of continuous-time nonlinear interconnected large-scale systems. First, optimal controllers of the isolated subsystems are designed with cost functions reflecting the bounds of interconnections. Then, it is proven that the decentralized control strategy of the overall system can be established by adding appropriate feedback gains to the optimal control policies of the isolated subsystems. Next, an online policy iteration algorithm is presented to solve the Hamilton-Jacobi-Bellman equations related to the optimal control problem. Through constructing a set of critic neural networks, the cost functions can be obtained approximately, followed by the control policies. Furthermore, the dynamics of the estimation errors of the critic networks are verified to be uniformly and ultimately bounded. Finally, a simulation example is provided to illustrate the effectiveness of the present decentralized control scheme.

...read moreread less

Journal Article•DOI•

Reinforcement Learning Output Feedback NN Control Using Deterministic Learning Technique

[...]

Bin Xu, Chenguang Yang¹, Zhongke Shi•Institutions (1)

University of Plymouth¹

01 Mar 2014-IEEE Transactions on Neural Networks

TL;DR: A novel adaptive-critic-based neural network (NN) controller is investigated for nonlinear pure-feedback systems and a deterministic learning technique has been employed to guarantee that the partial persistent excitation condition of internal states is satisfied during tracking control to a periodic reference orbit.

...read moreread less

Abstract: In this brief, a novel adaptive-critic-based neural network (NN) controller is investigated for nonlinear pure-feedback systems. The controller design is based on the transformed predictor form, and the actor-critic NN control architecture includes two NNs, whereas the critic NN is used to approximate the strategic utility function, and the action NN is employed to minimize both the strategic utility function and the tracking error. A deterministic learning technique has been employed to guarantee that the partial persistent excitation condition of internal states is satisfied during tracking control to a periodic reference orbit. The uniformly ultimate boundedness of closed-loop signals is shown via Lyapunov stability analysis. Simulation results are presented to demonstrate the effectiveness of the proposed control.

...read moreread less

Posted Content•

Reinforcement and Imitation Learning via Interactive No-Regret Learning

[...]

Stephane Ross, J. Andrew Bagnell

23 Jun 2014-arXiv: Learning

TL;DR: This work develops an interactive imitation learning approach that leverages cost information and extends the technique to address reinforcement learning, suggesting a broad new family of algorithms and providing a unifying view of existing techniques for imitation and reinforcement learning.

...read moreread less

Abstract: Recent work has demonstrated that problems-- particularly imitation learning and structured prediction-- where a learner's predictions influence the input-distribution it is tested on can be naturally addressed by an interactive approach and analyzed using no-regret online learning. These approaches to imitation learning, however, neither require nor benefit from information about the cost of actions. We extend existing results in two directions: first, we develop an interactive imitation learning approach that leverages cost information; second, we extend the technique to address reinforcement learning. The results provide theoretical support to the commonly observed successes of online approximate policy iteration. Our approach suggests a broad new family of algorithms and provides a unifying view of existing techniques for imitation and reinforcement learning.

...read moreread less

Journal Article•DOI•

Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design

[...]

Biao Luo¹, Huai-Ning Wu², Tingwen Huang³, Derong Liu¹•Institutions (3)

Chinese Academy of Sciences¹, Beihang University², Qatar Airways³

01 Dec 2014-Automatica

TL;DR: This paper addresses the model-free nonlinear optimal control problem based on data by introducing the reinforcement learning (RL) technique by using a data-based approximate policy iteration (API) method by using real system data rather than a system model.

...read moreread less

Journal Article•DOI•

First steps towards an intelligent laser welding architecture using deep neural networks and reinforcement learning

[...]

Johannes Günther¹, Patrick M. Pilarski², Gerhard Helfrich¹, Hao Shen¹, Klaus Diepold¹ - Show less +1 more•Institutions (2)

Technische Universität München¹, University of Alberta²

01 Jan 2014-Procedia Technology

TL;DR: The intelligent laser-welding architecture introduced in this work has the capacity to improve its performance without further human assistance and therefore addresses key requirements of modern industry.

...read moreread less

Journal Article•DOI•

Learning by the Dendritic Prediction of Somatic Spiking

[...]

Robert Urbanczik¹, Walter Senn¹•Institutions (1)

University of Bern¹

05 Feb 2014-Neuron

TL;DR: A simple compartmental neuron model is presented together with a non-Hebbian, biologically plausible learning rule for dendritic synapses where plasticity is modulated by these three factors, and a single plasticity rule supports diverse learning paradigms.

...read moreread less

Book•

Hierarchical Relative Entropy Policy Search

[...]

Christian Daniel¹, Gerhard Neumann¹, Oliver Kroemer¹, Jan Peters¹•Institutions (1)

Technische Universität Darmstadt¹

04 Jan 2014

TL;DR: In this article, the problem of learning sub-policies in continuous state action spaces is defined as finding a hierarchical policy that is composed of a high-level gating policy to select the low-level sub-tasks for execution by the agent.

...read moreread less

Abstract: Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that are strongly structured. Such task structures can be exploited by incorporating hierarchical policies that consist of gating networks and sub-policies. However, this concept has only been partially explored for real world settings and complete methods, derived from first principles, are needed. Real world settings are challenging due to large and continuous state-action spaces that are prohibitive for exhaustive sampling methods. We define the problem of learning sub-policies in continuous state action spaces as finding a hierarchical policy that is composed of a high-level gating policy to select the low-level sub-policies for execution by the agent. In order to efficiently share experience with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables which allows for distribution of the update information between the sub-policies. We present three different variants of our algorithm, designed to be suitable for a wide variety of real world robot learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons.

...read moreread less

Journal Article•DOI•

Internally generated sequences in learning and executing goal-directed behavior

[...]

Giovanni Pezzulo¹, Matthijs A. A. van der Meer², Carien S. Lansink³, Cyriel M. A. Pennartz³•Institutions (3)

National Research Council¹, University of Waterloo², University of Amsterdam³

01 Dec 2014-Trends in Cognitive Sciences

TL;DR: Using computational modeling, it is proposed that internally generated sequences may be productively considered a component of goal-directed decision systems, implementing a sampling-based inference engine that optimizes goal acquisition at multiple timescales of on-line choice, action control, and learning.

...read moreread less

Book Chapter•DOI•

Verification of Markov Decision Processes Using Learning Algorithms

[...]

Tomáš Brázdil¹, Krishnendu Chatterjee², Martin Chmelík², Vojtěch Forejt³, Jan Křetínský², Marta Kwiatkowska³, David Parker⁴, Mateusz Ujma³ - Show less +4 more•Institutions (4)

Masaryk University¹, Institute of Science and Technology Austria², University of Oxford³, University of Birmingham⁴

03 Nov 2014

TL;DR: A general framework for applying machine-learning algorithms to the verification of Markov decision processes (MDPs) and focuses on probabilistic reachability, which is a core property for verification, and is illustrated through two distinct instantiations.

...read moreread less

Abstract: We present a general framework for applying machine-learning algorithms to the verification of Markov decision processes (MDPs). The primary goal of these techniques is to improve performance by avoiding an exhaustive exploration of the state space. Our framework focuses on probabilistic reachability, which is a core property for verification, and is illustrated through two distinct instantiations. The first assumes that full knowledge of the MDP is available, and performs a heuristic-driven partial exploration of the model, yielding precise lower and upper bounds on the required probability. The second tackles the case where we may only sample the MDP, and yields probabilistic guarantees, again in terms of both the lower and upper bounds, which provides efficient stopping criteria for the approximation. The latter is the first extension of statistical model checking for unbounded properties in MDPs. In contrast with other related techniques, our approach is not restricted to time-bounded (finite-horizon) or discounted properties, nor does it assume any particular properties of the MDP. We also show how our methods extend to LTL objectives. We present experimental results showing the performance of our framework on several examples.

...read moreread less

Journal Article•DOI•

Risk-sensitive reinforcement learning

[...]

Yun Shen, Michael J. Tobia¹, Tobias Sommer¹, Klaus Obermayer•Institutions (1)

University of Hamburg¹

01 Jul 2014-Neural Computation

TL;DR: A risk-sensitive Q-learning algorithm is derived, which is necessary for modeling human behavior when transition probabilities are unknown, and applied to quantify human behavior in a sequential investment task and is found to provide a significantly better fit to the behavioral data and leads to an interpretation of the subject's responses that is indeed consistent with prospect theory.

...read moreread less

Abstract: We derive a family of risk-sensitive reinforcement learning methods for agents, who face sequential decision-making tasks in uncertain environments. By applying a utility function to the temporal difference TD error, nonlinear transformations are effectively applied not only to the received rewards but also to the true transition probabilities of the underlying Markov decision process. When appropriate utility functions are chosen, the agents' behaviors express key features of human behavior as predicted by prospect theory Kahneman & Tversky, 1979, for example, different risk preferences for gains and losses, as well as the shape of subjective probability curves. We derive a risk-sensitive Q-learning algorithm, which is necessary for modeling human behavior when transition probabilities are unknown, and prove its convergence. As a proof of principle for the applicability of the new framework, we apply it to quantify human behavior in a sequential investment task. We find that the risk-sensitive variant provides a significantly better fit to the behavioral data and that it leads to an interpretation of the subject's responses that is indeed consistent with prospect theory. The analysis of simultaneously measured fMRI signals shows a significant correlation of the risk-sensitive TD error with BOLD signal change in the ventral striatum. In addition we find a significant correlation of the risk-sensitive Q-values with neural activity in the striatum, cingulate cortex, and insula that is not present if standard Q-values are used.

...read moreread less

Journal Article•DOI•

Multi-objective reinforcement learning using sets of pareto dominating policies

[...]

Kristof Van Moffaert¹, Ann Nowé¹•Institutions (1)

Vrije Universiteit Brussel¹

01 Jan 2014-Journal of Machine Learning Research

TL;DR: A novel temporal dierence learning algorithm that integrates the Pareto dominance relation into a reinforcement learning approach and outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies.

...read moreread less

Abstract: Many real-world problems involve the optimization of multiple, possibly conflicting objectives. Multi-objective reinforcement learning (MORL) is a generalization of standard reinforcement learning where the scalar reward signal is extended to multiple feedback signals, in essence, one for each objective. MORL is the process of learning policies that optimize multiple criteria simultaneously. In this paper, we present a novel temporal difference learning algorithm that integrates the Pareto dominance relation into a reinforcement learning approach. This algorithm is a multi-policy algorithm that learns a set of Pareto dominating policies in a single run. We name this algorithm Pareto Q-learning and it is applicable in episodic environments with deterministic as well as stochastic transition functions. A crucial aspect of Pareto Q-learning is the updating mechanism that bootstraps sets of Q-vectors. One of our main contributions in this paper is a mechanism that separates the expected immediate reward vector from the set of expected future discounted reward vectors. This decomposition allows us to update the sets and to exploit the learned policies consistently throughout the state space. To balance exploration and exploitation during learning, we also propose three set evaluation mechanisms. These three mechanisms evaluate the sets of vectors to accommodate for standard action selection strategies, such as e-greedy. More precisely, these mechanisms use multi-objective evaluation principles such as the hypervolume measure, the cardinality indicator and the Pareto dominance relation to select the most promising actions. We experimentally validate the algorithm on multiple environments with two and three objectives and we demonstrate that Pareto Q-learning outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies. We note that (1) Pareto Q-learning is able to learn the entire Pareto front under the usual assumption that each state-action pair is sufficiently sampled, while (2) not being biased by the shape of the Pareto front. Furthermore, (3) the set evaluation mechanisms provide indicative measures for local action selection and (4) the learned policies can be retrieved throughout the state and action space.

...read moreread less

Journal Article•

Policy evaluation with temporal differences: a survey and comparison

[...]

Christoph Dann¹, Gerhard Neumann¹, Jan Peters²•Institutions (2)

Technische Universität Darmstadt¹, Max Planck Society²

01 Jan 2014-Journal of Machine Learning Research

TL;DR: Policy Evaluation with Temporal Differences: A Survey and Comparison and Comparison Journal of Machine Learning Research, 15, 809-883.

...read moreread less

Abstract: Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.

...read moreread less

Journal Article•DOI•

Multi-agent discrete-time graphical games and reinforcement learning solutions

[...]

Mohammed Abouheaf¹, Frank L. Lewis², Kyriakos G. Vamvoudakis³, Sofie Haesaert⁴, Robert Babuska⁵ - Show less +1 more•Institutions (5)

King Fahd University of Petroleum and Minerals¹, University of Texas at Arlington², University of California, Santa Barbara³, Eindhoven University of Technology⁴, Delft University of Technology⁵

01 Dec 2014-Automatica

TL;DR: A novel reinforcement learning value iteration algorithm is given to solve the dynamic graphical games in an online manner along with its proof of convergence, and it is proved that this notion holds if all agents are in Nash equilibrium and the graph is strongly connected.

...read moreread less

Journal Article•DOI•

Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework

[...]

Mohamed A. Khamis¹, Walid Gomaa², Walid Gomaa¹•Institutions (2)

Egypt-Japan University of Science and Technology¹, Alexandria University²

01 Mar 2014-Engineering Applications of Artificial Intelligence

TL;DR: A multi-agent multi-objective reinforcement learning (RL) traffic signal control framework that simulates the driver's behavior (acceleration/deceleration) continuously in space and time dimensions and significantly outperforms the underlying single objective controller.

...read moreread less

Collapse