
Showing papers on "Reinforcement learning" published in 2003


Journal Article
TL;DR: It is shown how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off, improving the regret from O(T^(3/4)) to O(T^(1/2)).

Abstract: We show how a standard tool from statistics, namely confidence bounds, can be used to elegantly deal with situations which exhibit an exploitation-exploration trade-off. Our technique for designing and analyzing algorithms for such situations is general and can be applied when an algorithm has to make exploitation-versus-exploration decisions based on uncertain information provided by a random process. We apply our technique to two models with such an exploitation-exploration trade-off. For the adversarial bandit problem with shifting, our new algorithm suffers only O((ST)^(1/2)) regret with high probability over T trials with S shifts. Such a regret bound was previously known only in expectation. The second model we consider is associative reinforcement learning with linear value functions. For this model our technique improves the regret from O(T^(3/4)) to O(T^(1/2)).

1,496 citations
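
A minimal sketch of the confidence-bound idea in its simplest stochastic-bandit form (a UCB-style index rule). This illustrates the general technique rather than the authors' shifting-bandit or associative algorithms; the reward oracle pull and the width constant are assumptions of this sketch.

import math

def ucb_choose(counts, means, t):
    # Try every arm once, then pick the arm maximizing empirical mean plus
    # a confidence width that shrinks as the arm is sampled more often.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))

def run_bandit(pull, n_arms, horizon):
    counts, means = [0] * n_arms, [0.0] * n_arms
    for t in range(1, horizon + 1):
        a = ucb_choose(counts, means, t)
        r = pull(a)                             # observe a stochastic reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return means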


Journal ArticleDOI
TL;DR: The new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework.
Abstract: We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach is motivated by the least-squares temporal-difference learning algorithm (LSTD) for prediction problems, which is known for its efficient use of sample experiences compared to pure temporal-difference algorithms. Heretofore, LSTD has not had a straightforward application to control problems mainly because LSTD learns the state value function of a fixed policy which cannot be used for action selection and control without a model of the underlying process. Our new algorithm, least-squares policy iteration (LSPI), learns the state-action value function which allows for action selection without a model and for incremental policy improvement within a policy-iteration framework. LSPI is a model-free, off-policy method which can use efficiently (and reuse in each iteration) sample experiences collected in any manner. By separating the sample collection method, the choice of the linear approximation architecture, and the solution method, LSPI allows for focused attention on the distinct elements that contribute to practical reinforcement learning. LSPI is tested on the simple task of balancing an inverted pendulum and the harder task of balancing and riding a bicycle to a target location. In both cases, LSPI learns to control the pendulum or the bicycle by merely observing a relatively small number of trials where actions are selected randomly. LSPI is also compared against Q-learning (both with and without experience replay) using the same value function architecture. While LSPI achieves good performance fairly consistently on the difficult bicycle task, Q-learning variants were rarely able to balance for more than a small fraction of the time needed to reach the target location.

1,405 citations
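
A rough sketch of the LSTD-Q solve that LSPI iterates, assuming a feature map phi(s, a) that returns a NumPy vector, a finite action set, and a batch of (s, a, r, s') samples; the names and the small ridge term are mine, not the paper's.

import numpy as np

def lstdq(samples, phi, weights, actions, gamma=0.95):
    # One LSTD-Q solve: A w = b, with next actions chosen greedily by the
    # current weights (the policy being evaluated).
    k = len(weights)
    A, b = np.zeros((k, k)), np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        a_next = max(actions, key=lambda ap: phi(s_next, ap) @ weights)
        A += np.outer(f, f - gamma * phi(s_next, a_next))
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)  # small ridge for stability

def lspi(samples, phi, k, actions, iters=20):
    w = np.zeros(k)
    for _ in range(iters):
        w = lstdq(samples, phi, w, actions)  # evaluation + implicit improvement
    return w

Each iteration reuses the same samples and re-solves the linear system with greedy next actions, which is the implicit policy-improvement step.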


Journal ArticleDOI
TL;DR: This work reviews several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed and discusses extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability.
Abstract: Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting temporal abstraction, where decisions are not required at each step, but rather invoke the execution of temporally-extended activities which follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.

1,175 citations
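
The semi-Markov view boils down to a Q-learning-style update in which the bootstrap term is discounted by the duration of the temporally extended activity. A sketch, assuming Q is a dict keyed by (state, option) pairs; all names are illustrative.

def smdp_q_update(Q, s, option, cum_reward, duration, s_next, options,
                  alpha=0.1, gamma=0.99):
    # cum_reward is the discounted reward accumulated while the option ran;
    # the bootstrap term is discounted by the option's duration in steps.
    best_next = max(Q[(s_next, o)] for o in options)
    target = cum_reward + (gamma ** duration) * best_next
    Q[(s, option)] += alpha * (target - Q[(s, option)])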


Journal ArticleDOI
TL;DR: R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time and formally justifies the "optimism under uncertainty" bias used in many RL algorithms.
Abstract: R-MAX is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-MAX, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model. The model is initialized in an optimistic fashion: all actions in all states return the maximal possible reward (hence the name). During execution, it is updated based on the agent's observations. R-MAX improves upon several previous algorithms: (1) It is simpler and more general than Kearns and Singh's E3 algorithm, covering zero-sum stochastic games. (2) It has a built-in mechanism for resolving the exploration vs. exploitation dilemma. (3) It formally justifies the "optimism under uncertainty" bias used in many RL algorithms. (4) It is simpler, more general, and more efficient than Brafman and Tennenholtz's LSG algorithm for learning in single controller stochastic games. (5) It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games. (6) It is the only algorithm for learning in repeated games, to date, which is provably efficient, considerably improving and simplifying previous algorithms by Banos and by Megiddo.

1,011 citations
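
A compressed sketch of the bookkeeping R-MAX describes, for the plain MDP case: state-action pairs are treated optimistically (maximal reward, fictitious absorbing successor) until visited m times. The threshold m and the names are illustrative, and the planning step (solving the optimistic model, e.g. by value iteration) is omitted.

from collections import defaultdict

class RMaxModel:
    def __init__(self, r_max, m=20):
        self.r_max, self.m = r_max, m
        self.counts = defaultdict(int)                      # visits to (s, a)
        self.rew = defaultdict(float)                       # summed rewards
        self.trans = defaultdict(lambda: defaultdict(int))  # successor counts

    def known(self, s, a):
        return self.counts[(s, a)] >= self.m

    def observe(self, s, a, r, s_next):
        if not self.known(s, a):
            self.counts[(s, a)] += 1
            self.rew[(s, a)] += r
            self.trans[(s, a)][s_next] += 1

    def model(self, s, a):
        # Unknown pairs point to a fictitious state paying r_max forever
        # (the optimistic initialization); known pairs use empirical estimates.
        if not self.known(s, a):
            return self.r_max, {"fictitious": 1.0}
        n = self.counts[(s, a)]
        return self.rew[(s, a)] / n, {sp: c / n for sp, c in self.trans[(s, a)].items()}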


Proceedings Article
01 Dec 2003
TL;DR: This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic, where the actor updates are based on stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression.
Abstract: This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.

764 citations
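
A crude episodic sketch of the idea: regress episode returns on the summed compatible features (the policy's score function) plus a constant baseline, and step the actor along the resulting weights, which the abstract argues estimate the natural gradient. The sketch assumes undiscounted episodic returns and a user-supplied grad_log_pi(s, a, theta) returning a NumPy vector; it is not the paper's full algorithm.

import numpy as np

def episodic_nac_step(theta, episodes, grad_log_pi, alpha=0.05):
    # episodes: list of trajectories, each a list of (state, action, reward).
    X, y = [], []
    for transitions in episodes:
        ret = sum(r for _, _, r in transitions)
        feats = sum(grad_log_pi(s, a, theta) for s, a, _ in transitions)
        X.append(np.append(feats, 1.0))   # compatible features + baseline term
        y.append(ret)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return theta + alpha * w[:-1]         # actor step along the estimated natural gradient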


Journal ArticleDOI
TL;DR: This article proposes and analyzes a class of actor-critic algorithms in which the critic uses temporal difference learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction, based on information provided by the critic.
Abstract: In this article, we propose and analyze a class of actor-critic algorithms. These are two-time-scale algorithms in which the critic uses temporal difference learning with a linearly parameterized approximation architecture, and the actor is updated in an approximate gradient direction, based on information provided by the critic. We show that the features for the critic should ideally span a subspace prescribed by the choice of parameterization of the actor. We study actor-critic algorithms for Markov decision processes with Polish state and action spaces. We state and prove two results regarding their convergence.

634 citations
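
Schematically (my notation, not the paper's exact algorithm), the two-time-scale structure pairs a fast TD critic over features phi with a slow actor step driven by the critic's error signal:

\[
\begin{aligned}
  \delta_t     &= r_t + \gamma\, w_t^{\top}\phi(s_{t+1},a_{t+1}) - w_t^{\top}\phi(s_t,a_t),\\
  w_{t+1}      &= w_t + \beta_t\,\delta_t\,\phi(s_t,a_t) \quad \text{(fast critic step)},\\
  \theta_{t+1} &= \theta_t + \alpha_t\,\delta_t\,\nabla_\theta \log\pi_{\theta_t}(a_t \mid s_t) \quad \text{(slow actor step)}, \qquad \alpha_t/\beta_t \to 0.
\end{aligned}
\]

The paper's condition on the critic corresponds to the features phi spanning the compatible features \nabla_\theta \log \pi_\theta.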


Dissertation
01 Jan 2003
TL;DR: Novel algorithms with more restricted guarantees are suggested whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.
Abstract: This thesis is a detailed investigation into the following question: how much data must an agent collect in order to perform "reinforcement learning" successfully? This question is analogous to the classical issue of the sample complexity in supervised learning, but is harder because of the increased realism of the reinforcement learning setting. This thesis summarizes recent sample complexity results in the reinforcement learning literature and builds on these results to provide novel algorithms with strong performance guarantees. We focus on a variety of reasonable performance criteria and sampling models by which agents may access the environment. For instance, in a policy search setting, we consider the problem of how much simulated experience is required to reliably choose a "good" policy among a restricted class of policies Π (as in Kearns, Mansour, and Ng [2000]). In a more online setting, we consider the case in which an agent is placed in an environment and must follow one unbroken chain of experience with no access to "offline" simulation (as in Kearns and Singh [1998]). We build on the sample based algorithms suggested by Kearns, Mansour, and Ng [2000]. Their sample complexity bounds have no dependence on the size of the state space, an exponential dependence on the planning horizon time, and linear dependence on the complexity of Π. We suggest novel algorithms with more restricted guarantees whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class Π, but have only a polynomial dependence on the horizon time. We pay particular attention to the tradeoffs made by such algorithms.

626 citations


Journal ArticleDOI
TL;DR: An introduction to Q-learning, a simple yet powerful reinforcement learning algorithm, is presented, along with a case study applying it to traffic signal control; the broader research objective is optimal control of heavily congested traffic across a two-dimensional road network.
Abstract: The ability to exert real-time, adaptive control of transportation processes is the core of many intelligent transportation systems decision support tools. Reinforcement learning, an artificial intelligence approach undergoing development in the machine-learning community, offers key advantages in this regard. The ability of a control agent to learn relationships between control actions and their effect on the environment while pursuing a goal is a distinct improvement over prespecified models of the environment. Prespecified models are a prerequisite of conventional control methods and their accuracy limits the performance of control agents. This paper contains an introduction to Q-learning, a simple yet powerful reinforcement learning algorithm, and presents a case study involving application to traffic signal control. Encouraging results of the application to an isolated traffic signal, particularly under variable traffic conditions, are presented. A broader research effort is outlined, including extension to linear and networked signal systems and integration with dynamic route guidance. The research objective involves optimal control of heavily congested traffic across a two-dimensional road network—a challenging task for conventional traffic signal control methodologies.

459 citations
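
For reference, the tabular Q-learning update and epsilon-greedy selection the paper's introduction covers, in sketch form (Q is assumed to be a dict over (state, action) pairs; the signal-control state and action encodings are not shown):

import random

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # One-step Q-learning: move Q(s, a) toward the reward plus the
    # discounted value of the best action in the next state.
    best_next = max(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Explore with probability eps, otherwise act greedily.
    if random.random() < eps:
        return random.choice(list(actions))
    return max(actions, key=lambda ap: Q[(s, ap)])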


Book
30 Jun 2003
TL;DR: This book examines the mathematical governing principles of simulation-based optimization, thereby providing the reader with the ability to model relevant real-life problems using these techniques, and outlines the computational technology underlying these methods.
Abstract: From the Publisher: "Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning introduces the evolving area of simulation-based optimization. Since it became possible to analyze random systems using computers, scientists and engineers have sought the means to optimize systems using simulation models. Only recently, however, has this objective had success in practice. Cutting-edge work in computational operations research, including non-linear programming (simultaneous perturbation), dynamic programming (reinforcement learning), and game theory (learning automata) has made it possible to use simulation in conjunction with optimization techniques. As a result, this research has given simulation added dimensions and power that it did not have in the recent past." "The book's objective is two-fold: (1) It examines the mathematical governing principles of simulation-based optimization, thereby providing the reader with the ability to model relevant real-life problems using these techniques. (2) It outlines the computational technology underlying these methods. Taken together, these two aspects demonstrate that the mathematical and computational methods discussed in this book do work." "Broadly speaking, the book has two parts: (1) parametric (static) optimization and (2) control (dynamic) optimization. Some of the book's special features are: an accessible introduction to reinforcement learning and parametric-optimization techniques; a step-by-step description of several algorithms of simulation-based optimization; a clear and simple introduction to the methodology of neural networks; a gentle introduction to convergence analysis of some of the methods enumerated above; and computer programs for many algorithms of simulation-based optimization." This book is written for students and researchers in the fields of engineering (electrical, industrial and computer), computer science, operations research, management science, and applied mathematics.

442 citations


Proceedings Article
01 Sep 2003
TL;DR: This paper discusses different approaches to reinforcement learning in terms of their applicability in humanoid robotics, and demonstrates that ‘vanilla' policy gradient methods can be significantly improved using the natural policy gradient instead of the regular policy gradient.
Abstract: Reinforcement learning offers one of the most general frameworks for taking traditional robotics towards true autonomy and versatility. However, applying reinforcement learning to high-dimensional movement systems like humanoid robots remains an unsolved problem. In this paper, we discuss different approaches to reinforcement learning in terms of their applicability in humanoid robotics. Methods can be coarsely classified into three different categories, i.e., greedy methods, ‘vanilla' policy gradient methods, and natural gradient methods. We argue that greedy methods are not likely to scale into the domain of humanoid robotics as they are problematic when used with function approximation. ‘Vanilla' policy gradient methods, on the other hand, have been successfully applied to real-world robots, including at least one humanoid robot [3]. We demonstrate that these methods can be significantly improved using the natural policy gradient instead of the regular policy gradient. A derivation of the natural policy gradient is provided, proving that the average policy gradient of Kakade [10] is indeed the true natural gradient. A general algorithm for estimating the natural gradient, the Natural Actor-Critic algorithm, is introduced. This algorithm converges to the nearest local minimum of the cost function with respect to the Fisher information metric under suitable conditions. The algorithm outperforms non-natural policy gradients by far in a cart-pole balancing evaluation, and for learning nonlinear dynamic motor primitives for humanoid robot control. It offers a promising route for the development of reinforcement learning for truly high-dimensional, continuous state-action systems.

361 citations
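
In standard notation (not copied from the paper), the natural policy gradient premultiplies the ordinary gradient by the inverse Fisher information of the policy:

\[
  \tilde{\nabla}_\theta J(\theta) \;=\; F(\theta)^{-1}\,\nabla_\theta J(\theta),
  \qquad
  F(\theta) \;=\; \mathbb{E}_{s,a\sim\pi_\theta}\!\left[
      \nabla_\theta \log\pi_\theta(a\mid s)\,\nabla_\theta \log\pi_\theta(a\mid s)^{\top}\right].
\]

Ascending under the Fisher metric is what makes the update invariant to how the policy is parameterized, which is the property the paper exploits.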


Journal ArticleDOI
TL;DR: This paper investigates two models of accuracy-based learning classifier systems on different types of classification problems, and provides a model of the learning complexity of LCS which is based on the representative examples given to the system.
Abstract: Recently, Learning Classifier Systems (LCS) and particularly XCS have arisen as promising methods for classification tasks and data mining. This paper investigates two models of accuracy-based learning classifier systems on different types of classification problems. Departing from XCS, we analyze the evolution of a complete action map as a knowledge representation. We propose an alternative, UCS, which evolves a best action map more efficiently. We also investigate how the fitness pressure guides the search towards accurate classifiers. While XCS bases fitness on a reinforcement learning scheme, UCS defines fitness from a supervised learning scheme. We find significant differences in how the fitness pressure leads towards accuracy, and suggest the use of a supervised approach, especially for multi-class problems and problems with unbalanced classes. We also investigate the complexity factors which arise in each type of accuracy-based LCS. We provide a model of the learning complexity of LCS which is based on the representative examples given to the system. The results and observations are also extended to a set of real world classification problems, where accuracy-based LCS are shown to perform competitively with respect to other learning algorithms. The work presents an extended analysis of accuracy-based LCS, gives insight into the understanding of the LCS dynamics, and suggests open issues for further improvement of LCS on classification tasks.

Proceedings Article
09 Dec 2003
TL;DR: This paper first fits a stochastic, nonlinear model of the helicopter dynamics, then uses the model to learn to hover in place and to fly a number of maneuvers taken from an RC helicopter competition.
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.

01 Jan 2003
TL;DR: The recent work in AI on multi-agent reinforcement learning is surveyed and it is argued that, while exciting, this work is flawed; the fundamental flaw is unclarity about the problem or problems being addressed.
Abstract: We survey the recent work in AI on multi-agent reinforcement learning (that is, learning in stochastic games). We then argue that, while exciting, this work is flawed. The fundamental flaw is unclarity about the problem or problems being addressed. After tracing a representative sample of the recent literature, we identify four well-defined problems in multi-agent reinforcement learning, single out the problem that in our view is most suitable for AI, and make some remarks about how we believe progress is to be made on this problem.

Proceedings Article
09 Dec 2003
TL;DR: It is speculated that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
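
Generic GP regression of the kind the paper builds on, with a squared-exponential kernel; this sketch only computes the posterior mean at test points (e.g. a value estimate over states), not the paper's closed-form policy iteration, and the hyperparameters are placeholders.

import numpy as np

def gp_posterior_mean(X_train, y_train, X_test, lengthscale=1.0, noise=1e-3):
    # X_train: (n, d) array of inputs, y_train: (n,) targets, X_test: (m, d).
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)   # squared-exponential kernel
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    return k(X_test, X_train) @ np.linalg.solve(K, y_train)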

Journal ArticleDOI
TL;DR: It is suggested that the phasic and tonic components of dopamine neuron firing can encode the signals required for meta-learning in reinforcement learning.

Journal ArticleDOI
TL;DR: A formal model of implicit imitation that can accelerate reinforcement learning dramatically in certain cases is proposed and studied, and the benefits of implicit imitate are illustrated by integrating it with prioritized sweeping, and demonstrating improved performance and convergence through observation of single and multiple mentors.
Abstract: Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments an agent's ability to learn useful behaviors by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents. We propose and study a formal model of implicit imitation that can accelerate reinforcement learning dramatically in certain cases. Roughly, by observing a mentor, a reinforcement-learning agent can extract information about its own capabilities in, and the relative value of, unvisited parts of the state space. We study two specific instantiations of this model, one in which the learning agent and the mentor have identical abilities, and one designed to deal with agents and mentors with different action sets. We illustrate the benefits of implicit imitation by integrating it with prioritized sweeping, and demonstrating improved performance and convergence through observation of single and multiple mentors. Though we make some stringent assumptions regarding observability and possible interactions, we briefly comment on extensions of the model that relax these restrictions.

Proceedings ArticleDOI
09 Aug 2003
TL;DR: This work proposes a natural metric on controller parameterization that results from considering the manifold of probability distributions over paths induced by a stochastic controller that leads to a covariant gradient ascent rule.
Abstract: We investigate the problem of non-covariant behavior of policy gradient reinforcement learning algorithms. The policy gradient approach is amenable to analysis by information geometric methods. This leads us to propose a natural metric on controller parameterization that results from considering the manifold of probability distributions over paths induced by a stochastic controller. Investigation of this approach leads to a covariant gradient ascent rule. Interesting properties of this rule are discussed, including its relation with actor-critic style reinforcement learning algorithms. The algorithms discussed here are computationally quite efficient and on some interesting problems lead to dramatic performance improvement over noncovariant rules.

Journal ArticleDOI
01 Sep 2003
TL;DR: A new hybrid, synergistic approach in applying computational intelligence concepts to implement a cooperative, hierarchical, multiagent system for real-time traffic signal control of a complex traffic network is presented.
Abstract: This paper presents a new hybrid, synergistic approach in applying computational intelligence concepts to implement a cooperative, hierarchical, multiagent system for real-time traffic signal control of a complex traffic network. The large-scale traffic signal control problem is divided into various subproblems, and each subproblem is handled by an intelligent agent with a fuzzy neural decision-making module. The decisions made by lower-level agents are mediated by their respective higher-level agents. Through adopting a cooperative distributed problem solving approach, coordinated control by the agents is achieved. In order for the multiagent architecture to adapt itself continuously to the dynamically changing problem domain, a multistage online learning process for each agent is implemented involving reinforcement learning, learning rate and weight adjustment as well as dynamic update of fuzzy relations using an evolutionary algorithm. The test bed used for this research is a section of the Central Business District of Singapore. The performance of the proposed multiagent architecture is evaluated against the set of signal plans used by the current real-time adaptive traffic control system. The multiagent architecture produces significant improvements in the conditions of the traffic network, reducing the total mean delay by 40% and total vehicle stoppage time by 50%.

Proceedings Article
09 Dec 2003
TL;DR: If a "baseline distribution" is given (indicating roughly how often the authors expect a good policy to visit each state), then a policy search algorithm is derived that terminates in a finite number of steps, and for which the author can provide non-trivial performance guarantees.
Abstract: We consider the policy search approach to reinforcement learning. We show that if a "baseline distribution" is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.

Proceedings ArticleDOI
14 Jul 2003
TL;DR: This work develops tractable approximations to optimal Bayesian exploration in MARL problems that allows these exploration costs to be weighed against their expected benefits using the notion of value of information.
Abstract: Much emphasis in multiagent reinforcement learning (MARL) research is placed on ensuring that MARL algorithms (eventually) converge to desirable equilibria. As in standard reinforcement learning, convergence generally requires sufficient exploration of strategy space. However, exploration often comes at a price in the form of penalties or foregone opportunities. In multiagent settings, the problem is exacerbated by the need for agents to "coordinate" their policies on equilibria. We propose a Bayesian model for optimal exploration in MARL problems that allows these exploration costs to be weighed against their expected benefits using the notion of value of information. Unlike standard RL models, this model requires reasoning about how one's actions will influence the behavior of other agents. We develop tractable approximations to optimal Bayesian exploration, and report on experiments illustrating the benefits of this approach in identical interest games.

Proceedings Article
21 Aug 2003
TL;DR: It is argued that the use of SVMs, particularly in combination with the kernel trick, can make it easier to apply reinforcement learning as an "out-of-the-box" technique, without extensive feature engineering.
Abstract: The basic tools of machine learning appear in the inner loop of most reinforcement learning algorithms, typically in the form of Monte Carlo methods or function approximation techniques. To a large extent, however, current reinforcement learning algorithms draw upon machine learning techniques that are at least ten years old and, with a few exceptions, very little has been done to exploit recent advances in classification learning for the purposes of reinforcement learning. We use a variant of approximate policy iteration based on rollouts that allows us to use a pure classification learner, such as a support vector machine (SVM), in the inner loop of the algorithm. We argue that the use of SVMs, particularly in combination with the kernel trick, can make it easier to apply reinforcement learning as an "out-of-the-box" technique, without extensive feature engineering. Our approach opens the door to modern classification methods, but does not preclude the use of classical methods. We present experimental results in the pendulum balancing and bicycle riding domains using both SVMs and neural networks for classifiers.
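
A sketch of the rollout-based policy improvement loop described above, assuming a generative simulator env_sim(s, a) -> (next_state, reward, done) and a classifier trainer fit_classifier(X, y) (e.g. wrapping an SVM) supplied by the user; the constants are illustrative.

def rollout_value(env_sim, s, a, policy, gamma=0.99, depth=50, n_rollouts=10):
    # Monte Carlo estimate of Q^pi(s, a): take action a, then follow policy.
    total = 0.0
    for _ in range(n_rollouts):
        sp, r, done = env_sim(s, a)
        ret, disc = r, gamma
        for _ in range(depth):
            if done:
                break
            sp, r, done = env_sim(sp, policy(sp))
            ret += disc * r
            disc *= gamma
        total += ret
    return total / n_rollouts

def improve_policy(states, actions, env_sim, policy, fit_classifier):
    # Label each sampled state with its empirically best action, then train
    # a classifier to represent the improved policy.
    X, y = [], []
    for s in states:
        q = {a: rollout_value(env_sim, s, a, policy) for a in actions}
        X.append(s)
        y.append(max(q, key=q.get))
    return fit_classifier(X, y)

The returned classifier maps states to actions and becomes the policy rolled out in the next iteration.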

Dissertation
01 Jan 2003
TL;DR: This dissertation develops a methodology for solving real world control tasks consisting of an efficient neuroevolution algorithm that solves difficult non-linear control tasks by coevolving neurons, an incremental evolution method to scale the algorithm to the most challenging tasks, and a technique for making controllers robust so that they can transfer from simulation to the real world.
Abstract: Many complex control problems require sophisticated solutions that are not amenable to traditional controller design. Not only is it difficult to model real world systems, but often it is unclear what kind of behavior is required to solve the task. Reinforcement learning approaches have made progress in such problems, but have so far not scaled well. Neuroevolution has improved upon conventional reinforcement learning, but has still not been successful in full-scale, non-linear control problems. This dissertation develops a methodology for solving real world control tasks consisting of three components: (1) an efficient neuroevolution algorithm that solves difficult non-linear control tasks by coevolving neurons, (2) an incremental evolution method to scale the algorithm to the most challenging tasks, and (3) a technique for making controllers robust so that they can transfer from simulation to the real world. The method is faster than other approaches on a set of difficult learning benchmarks, and is used in two full-scale control tasks demonstrating its applicability to real world problems.

Proceedings Article
21 Aug 2003
TL;DR: This paper presents a method for incorporating arbitrary advice into the reward structure of a reinforcement learning agent without altering the optimal policy, and develops two qualitatively different methods for converting a potential function into advice for the agent.
Abstract: An important issue in reinforcement learning is how to incorporate expert knowledge in a principled manner, especially as we scale up to real-world tasks. In this paper, we present a method for incorporating arbitrary advice into the reward structure of a reinforcement learning agent without altering the optimal policy. This method extends the potential-based shaping method proposed by Ng et al. (1999) to the case of shaping functions based on both states and actions. This allows for much more specific information to guide the agent - which action to choose - without requiring the agent to discover this from the rewards on states alone. We develop two qualitatively different methods for converting a potential function into advice for the agent. We also provide theoretical and experimental justifications for choosing between these advice-giving algorithms based on the properties of the potential function.
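
One natural form of state-action potential shaping consistent with the description above, as a sketch (the potential phi(s, a) is assumed to be supplied by the advice-giver); the shaped term is added to the environment reward and, by the potential-based argument, does not alter the optimal policy.

def shaping_reward(phi, s, a, s_next, a_next, gamma=0.99):
    # Advice over states *and* actions: reward the agent for moving toward
    # higher-potential state-action pairs.
    return gamma * phi(s_next, a_next) - phi(s, a)

# The learner then updates from r + shaping_reward(phi, s, a, s_next, a_next)
# instead of r alone.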

Journal ArticleDOI
TL;DR: This work proposes a method for constructing safe, reliable reinforcement learning agents based on Lyapunov design principles that ensures qualitatively satisfactory agent behavior for virtually any reinforcement learning algorithm and at all times, including while the agent is learning and taking exploratory actions.
Abstract: Lyapunov design methods are used widely in control engineering to design controllers that achieve qualitative objectives, such as stabilizing a system or maintaining a system's state in a desired operating range. We propose a method for constructing safe, reliable reinforcement learning agents based on Lyapunov design principles. In our approach, an agent learns to control a system by switching among a number of given, base-level controllers. These controllers are designed using Lyapunov domain knowledge so that any switching policy is safe and enjoys basic performance guarantees. Our approach thus ensures qualitatively satisfactory agent behavior for virtually any reinforcement learning algorithm and at all times, including while the agent is learning and taking exploratory actions. We demonstrate the process of designing safe agents for four different control problems. In simulation experiments, we find that our theoretically motivated designs also enjoy a number of practical benefits, including reasonable performance initially and throughout learning, and accelerated learning.

Proceedings ArticleDOI
08 Dec 2003
TL;DR: The CMA-ES is applied to the optimization of the weights of neural networks for solving reinforcement learning problems, and results with fixed network topologies are significantly better than those reported for the best evolutionary method so far.
Abstract: We apply the CMA-ES, an evolution strategy which efficiently adapts the covariance matrix of the mutation distribution, to the optimization of the weights of neural networks for solving reinforcement learning problems. It turns out that the topology of the networks considerably influences the time to find a suitable control strategy. Still, our results with fixed network topologies are significantly better than those reported for the best evolutionary method so far, which adapts both the weights and the structure of the networks.
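
A bare-bones evolution strategy over a flat weight vector, to illustrate the "search directly in weight space" setup; it deliberately omits the covariance-matrix adaptation that defines CMA-ES, and evaluate (which loads the weights into the fixed-topology network and returns an episode's total reward) is assumed to be supplied by the user.

import numpy as np

def simple_es(evaluate, dim, pop=20, elite=5, sigma=0.3, iters=100, seed=0):
    # (mu, lambda)-style strategy: sample around the mean, keep the elite,
    # and recenter.  CMA-ES additionally adapts sigma and a full covariance.
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    for _ in range(iters):
        candidates = mean + sigma * rng.standard_normal((pop, dim))
        scores = np.array([evaluate(w) for w in candidates])  # episode returns
        best = candidates[np.argsort(scores)[-elite:]]
        mean = best.mean(axis=0)
    return mean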

Proceedings Article
21 Aug 2003
TL;DR: This paper explores a very simple agent design method called Q-decomposition, in which a complex agent is built from simpler subagents, each with its own reward function and its own reinforcement learning process.
Abstract: The paper explores a very simple agent design method called Q-decomposition, wherein a complex agent is built from simpler subagents. Each subagent has its own reward function and runs its own reinforcement learning process. It supplies to a central arbitrator the Q-values (according to its own reward function) for each possible action. The arbitrator selects an action maximizing the sum of Q-values from all the subagents. This approach has advantages over designs in which subagents recommend actions. It also has the property that if each subagent runs the Sarsa reinforcement learning algorithm to learn its local Q-function, then a globally optimal policy is achieved. (On the other hand, local Q-learning leads to globally suboptimal behavior.) In some cases, this form of agent decomposition allows the local Q-functions to be expressed by much-reduced state and action spaces. These results are illustrated in two domains that require effective coordination of behaviors.
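
The arbitration step itself is tiny; a sketch assuming each subagent exposes its local Q-function as a callable q(state, action):

def arbitrate(q_functions, state, actions):
    # Central arbitrator: each subagent reports Q_j(state, a) under its own
    # reward; pick the action maximizing the sum of the reported values.
    return max(actions, key=lambda a: sum(q(state, a) for q in q_functions))

Per the abstract, each subagent learns its local Q-function with Sarsa on its own reward signal, and this summed-Q arbitration then yields a globally optimal policy.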

Journal ArticleDOI
01 Feb 2003
TL;DR: A neural fuzzy system with mixed coarse learning and fine learning phases is proposed, which is able to perform collision-free navigation and a new learning method using a modification of Sutton and Barto's model is proposed to strengthen the exploration.
Abstract: Fuzzy logic systems are promising for efficient obstacle avoidance. However, it is difficult to maintain the correctness, consistency, and completeness of a fuzzy rule base constructed and tuned by a human expert. A reinforcement learning method is capable of learning the fuzzy rules automatically. However, it incurs a heavy learning phase and may result in an insufficiently learned rule base due to the curse of dimensionality. In this paper, we propose a neural fuzzy system with mixed coarse learning and fine learning phases. In the first phase, a supervised learning method is used to determine the membership functions for input and output variables simultaneously. After sufficient training, fine learning is applied, which employs a reinforcement learning algorithm to fine-tune the membership functions for output variables. For sufficient learning, a new learning method using a modification of Sutton and Barto's model is proposed to strengthen the exploration. Through this two-step tuning approach, the mobile robot is able to perform collision-free navigation. To deal with the difficulty of acquiring a large amount of training data with high consistency for supervised learning, we develop a virtual environment (VE) simulator, which is able to provide desktop virtual environment (DVE) and immersive virtual environment (IVE) visualization. Through operating a mobile robot in the virtual environment (DVE/IVE) by a skilled human operator, training data are readily obtained and used to train the neural fuzzy system.

Proceedings Article
09 Dec 2003
TL;DR: This work provides a simple and efficient algorithm that in part uses a linear system to model the world from a single agent's limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy.
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent's limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy.

Journal ArticleDOI
TL;DR: It is proved that a reinforcement learner with initial Q-values based on the shaping algorithm's potential function makes the same updates throughout learning as a learner receiving potential-based shaping rewards.
Abstract: Shaping has proven to be a powerful but precarious means of improving reinforcement learning performance. Ng, Harada, and Russell (1999) proposed the potential-based shaping algorithm for adding shaping rewards in a way that guarantees the learner will learn optimal behavior. In this note, we prove certain similarities between this shaping algorithm and the initialization step required for several reinforcement learning algorithms. More specifically, we prove that a reinforcement learner with initial Q-values based on the shaping algorithm's potential function makes the same updates throughout learning as a learner receiving potential-based shaping rewards. We further prove that under a broad category of policies, the behavior of these two learners is indistinguishable. The comparison provides intuition on the theoretical properties of the shaping algorithm as well as a suggestion for a simpler method for capturing the algorithm's benefit. In addition, the equivalence raises previously unaddressed issues concerning the efficiency of learning with potential-based shaping.
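
One way to write the correspondence the note proves, in my notation: with a state potential Phi, the learner receiving shaping rewards F(s, s') = gamma*Phi(s') - Phi(s) and the learner initialized from Phi maintain estimates that differ by Phi at every step,

\[
  \hat{Q}^{\text{shaped}}_t(s,a) \;=\; \hat{Q}^{\text{init}}_t(s,a) - \Phi(s),
  \qquad \hat{Q}^{\text{init}}_0(s,a) = \Phi(s),
\]

so any policy that depends only on differences of Q-values within a state behaves identically under the two schemes.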

Proceedings Article
21 Aug 2003
TL;DR: The metric-E3 algorithm is a generalization of the E3 algorithm that assumes a black box for approximate planning and finds a near-optimal policy in an amount of time that does not directly depend on the size of the state space, but instead depends on the covering number, i.e., the number of neighborhoods required for accurate local modeling.
Abstract: We present metric-E3, a provably near-optimal algorithm for reinforcement learning in Markov decision processes in which there is a natural metric on the state space that allows the construction of accurate local models. The algorithm is a generalization of the E3 algorithm of Kearns and Singh, and assumes a black box for approximate planning. Unlike the original E3, metric-E3 finds a near-optimal policy in an amount of time that does not directly depend on the size of the state space, but instead depends on the covering number of the state space. Informally, the covering number is the number of neighborhoods required for accurate local modeling.