
Showing papers on "Reinforcement learning" published in 2000


Journal ArticleDOI
TL;DR: The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction.
Abstract: This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics--as a subroutine hierarchy--and a declarative semantics--as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this nonhierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.

1,486 citations
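
For readers who want the core identity behind the decomposition, MAXQ writes the action value of invoking subtask a within a parent task i as the value of a itself plus a "completion" term (standard MAXQ notation):

    Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a)

where V^{\pi}(a, s) is the expected return accumulated while subtask a executes from state s and C^{\pi}(i, s, a) is the expected return for completing task i after a finishes; MAXQ-Q learns the completion terms (and the values of primitive actions) recursively.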


Journal ArticleDOI
TL;DR: A reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems is presented and algorithms for estimating value functions and improving policies with the use of function approximators are derived.
Abstract: This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, a continuous actor-critic method and a value-gradient-based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.

974 citations
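
The central quantity in this framework is a continuous-time analogue of the TD error; with a discounting time constant tau, it takes the form

    \delta(t) = r(t) - \frac{1}{\tau} V(t) + \dot{V}(t)

and the backward Euler and exponential eligibility trace updates mentioned above are, in effect, different ways of approximating the time derivative of V online.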


Proceedings Article
29 Jun 2000
TL;DR: This paper considers the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method, and analyzes and compares this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling.
Abstract: Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policies can be learned about from the same data stream, and have been identified as particularly useful for learning about subgoals and temporally extended macro-actions. In this paper we consider the off-policy version of the policy evaluation problem, for which only one eligibility trace algorithm is known, a Monte Carlo method. We analyze and compare this and four new eligibility trace algorithms, emphasizing their relationships to the classical statistical technique known as importance sampling. Our main results are 1) to establish the consistency and bias properties of the new methods and 2) to empirically rank the new methods, showing improvement over one-step and Monte Carlo methods. Our results are restricted to model-free, table-lookup methods and to offline updating (at the end of each episode) although several of the algorithms could be applied more generally.

679 citations
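
As a rough illustration of how importance sampling enters an eligibility-trace update, here is a minimal tabular sketch of off-policy TD(lambda) evaluation with per-decision importance weights. It is an online approximation written for readability rather than any of the estimators analyzed in the paper (which use offline, per-episode updates), and it assumes the target and behavior policies are given as probability tables.

import numpy as np

def offpolicy_td_lambda(episodes, pi, b, n_states, gamma=0.99, lam=0.9, alpha=0.05):
    """Tabular off-policy TD(lambda) evaluation with per-decision importance weights.

    episodes : list of trajectories, each a list of (s, a, r, s_next) tuples
    pi, b    : target and behavior policies as arrays of shape (n_states, n_actions)
    Illustrative sketch only; variants differ in where the weight is applied.
    """
    V = np.zeros(n_states)
    for trajectory in episodes:
        e = np.zeros(n_states)                    # eligibility trace
        for (s, a, r, s_next) in trajectory:
            rho = pi[s, a] / b[s, a]              # per-decision importance weight
            e = rho * (gamma * lam * e)           # reweight previously assigned credit
            e[s] += rho                           # ... and the current state's credit
            delta = r + gamma * V[s_next] - V[s]  # ordinary one-step TD error
            V += alpha * delta * e
    return V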


Journal ArticleDOI
TL;DR: This paper examines the convergence of single-step on-policy RL algorithms for control with both decaying exploration and persistent exploration and provides examples of exploration strategies that result in convergence to both optimal values and optimal policies.
Abstract: An important application of reinforcement learning (RL) is to finite-state control problems and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.

660 citations
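
In tabular form, the kind of single-step on-policy algorithm with decaying exploration that the paper analyzes might look like the sketch below (SARSA with a 1/k epsilon schedule; the reset()/step() environment interface and the exact decay schedule are illustrative assumptions).

import numpy as np

def eps_greedy(Q, s, eps, rng):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, gamma=0.95, alpha=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for k in range(1, episodes + 1):
        eps = 1.0 / k                         # decaying exploration
        s = env.reset()
        a = eps_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(Q, s2, eps, rng)
            target = r + (0.0 if done else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])   # on-policy (SARSA) update
            s, a = s2, a2
    return Q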


Journal ArticleDOI
TL;DR: The experimental results show that it is indeed possible to find a simple criterion, a state space representation, and a simulated user parameterization in order to automatically learn a relatively complex dialog behavior, similar to one that was heuristically designed by several research groups.
Abstract: We propose a quantitative model for dialog systems that can be used for learning the dialog strategy. We claim that the problem of dialog design can be formalized as an optimization problem with an objective function reflecting different dialog dimensions relevant for a given application. We also show that any dialog system can be formally described as a sequential decision process in terms of its state space, action set, and strategy. With additional assumptions about the state transition probabilities and cost assignment, a dialog system can be mapped to a stochastic model known as Markov decision process (MDP). A variety of data driven algorithms for finding the optimal strategy (i.e., the one that optimizes the criterion) is available within the MDP framework, based on reinforcement learning. For an effective use of the available training data we propose a combination of supervised and reinforcement learning: the supervised learning is used to estimate a model of the user, i.e., the MDP parameters that quantify the user's behavior. Then a reinforcement learning algorithm is used to estimate the optimal strategy while the system interacts with the simulated user. This approach is tested for learning the strategy in an air travel information system (ATIS) task. The experimental results we present in this paper show that it is indeed possible to find a simple criterion, a state space representation, and a simulated user parameterization in order to automatically learn a relatively complex dialog behavior, similar to one that was heuristically designed by several research groups.

572 citations
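
To make the MDP framing concrete, here is a heavily simplified sketch of learning a slot-filling strategy against a simulated user with tabular Q-learning. The state representation, action names, costs, and user model below are all invented for illustration and are far cruder than the ATIS system described above, where the user model is estimated from corpus data.

import random
from collections import defaultdict

ACTIONS = ["ask_origin", "ask_destination", "ask_date", "present_results"]

def simulated_user(state, action):
    """Toy user simulation: slot filling succeeds with some probability.
    Returns (next_state, turn_cost); in practice these dynamics are the MDP
    parameters estimated from data by supervised learning."""
    filled = set(state)
    if action.startswith("ask_"):
        if random.random() < 0.8:                # the user answers and is understood
            filled.add(action[4:])
        return frozenset(filled), 1.0            # every turn has a small cost
    if action == "present_results" and len(filled) == 3:
        return "DONE", -20.0                     # large negative cost = task success
    return frozenset(filled), 2.0                # premature or unhelpful action

def learn_strategy(episodes=5000, gamma=0.95, alpha=0.1, eps=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] = expected cost
    for _ in range(episodes):
        state = frozenset()
        while state != "DONE":
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = min(ACTIONS, key=lambda x: Q[(state, x)])
            nxt, cost = simulated_user(state, a)
            best = 0.0 if nxt == "DONE" else min(Q[(nxt, x)] for x in ACTIONS)
            Q[(state, a)] += alpha * (cost + gamma * best - Q[(state, a)])
            state = nxt
    return Q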


Journal ArticleDOI
TL;DR: It is shown that stability of the stochastic approximation algorithm is implied by the asymptotic stability of the origin for an associated ODE, which in turn implies convergence of the algorithm.
Abstract: It is shown here that stability of the stochastic approximation algorithm is implied by the asymptotic stability of the origin for an associated ODE. This in turn implies convergence of the algorithm. Several specific classes of algorithms are considered as applications. It is found that the results provide (i) a simpler derivation of known results for reinforcement learning algorithms; (ii) a proof for the first time that a class of asynchronous stochastic approximation algorithms are convergent without using any a priori assumption of stability; (iii) a proof for the first time that asynchronous adaptive critic and Q-learning algorithms are convergent for the average cost optimal control problem.

515 citations
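
For orientation, the object of study is the stochastic approximation recursion and its associated ODE, written schematically (generic notation, with step sizes a_n and martingale-difference noise M_{n+1}):

    \theta_{n+1} = \theta_n + a_n \bigl[ h(\theta_n) + M_{n+1} \bigr], \qquad \dot{\theta}(t) = h(\theta(t))

The result referred to above is that boundedness (stability) of the iterates, and hence their convergence, follows from asymptotic stability of the origin for an associated (suitably scaled) ODE; Q-learning and related reinforcement learning updates are then treated as instances of the recursion.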


Proceedings Article
29 Jun 2000
TL;DR: It is proposed that the learning process estimates online the full posterior distribution over models and to determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to the hypothesis is obtained by dynamic programming.
Abstract: The reinforcement learning problem can be decomposed into two parallel types of inference: (i) estimating the parameters of a model for the underlying process; (ii) determining behavior which maximizes return under the estimated model. Following Dearden, Friedman and Andre (1999), it is proposed that the learning process estimates online the full posterior distribution over models. To determine behavior, a hypothesis is sampled from this distribution and the greedy policy with respect to the hypothesis is obtained by dynamic programming. By using a different hypothesis for each trial, appropriate exploratory and exploitative behavior is obtained. This Bayesian method always converges to the optimal policy for a stationary process with discrete states.

489 citations
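
A minimal sketch of the sampling scheme (in later terminology, posterior or Thompson sampling for MDPs) is given below, assuming a tabular MDP with known rewards and a Dirichlet posterior over transition probabilities; both assumptions are made here only to keep the sketch short.

import numpy as np

def sample_greedy_policy(counts, R, gamma=0.95, sweeps=200):
    """Sample one transition model from the posterior and return its greedy policy.

    counts : array (nS, nA, nS) of observed transition counts
    R      : array (nS, nA) of rewards, assumed known for brevity
    """
    nS, nA, _ = counts.shape
    P = np.zeros(counts.shape)
    for s in range(nS):
        for a in range(nA):
            P[s, a] = np.random.dirichlet(counts[s, a] + 1.0)   # Dirichlet(1) prior
    V = np.zeros(nS)
    for _ in range(sweeps):                                     # dynamic programming
        Q = R + gamma * (P @ V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def run_trial(env, counts, R, horizon=100):
    """One trial: sample a hypothesis, act greedily with respect to it, update the posterior."""
    policy = sample_greedy_policy(counts, R)
    s = env.reset()
    for _ in range(horizon):
        a = int(policy[s])
        s_next, reward, done = env.step(a)
        counts[s, a, s_next] += 1                               # posterior update
        s = s_next
        if done:
            break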



Book ChapterDOI
18 Sep 2000
TL;DR: It is shown that five principles gleaned from biology are crucial: reinforcement learning, self-organisation, selectionism, co-evolution through structural coupling, and level formation.
Abstract: The paper surveys recent work on modeling the origins of communication systems in groups of autonomous distributed agents. It is shown that five principles gleaned from biology are crucial: reinforcement learning, self-organisation, selectionism, co-evolution through structural coupling, and level formation.

352 citations


Book
01 Jan 2000
TL;DR: This book looks at multiagent systems that consist of teams of autonomous agents acting in real-time, noisy, collaborative, and adversarial environments and introduces a new multiagent reinforcement learning algorithm--team-partitioned, opaque-transition reinforcement learning (TPOT-RL)--designed for domains in which agents cannot necessarily observe the state-changes caused by other agents' actions.
Abstract: From the Publisher: This book looks at multiagent systems that consist of teams of autonomous agents acting in real-time, noisy, collaborative, and adversarial environments. The book makes four main contributions to the fields of machine learning and multiagent systems. First, it describes an architecture within which a flexible team structure allows member agents to decompose a task into flexible roles and to switch roles while acting. Second, it presents layered learning, a general-purpose machine-learning method for complex domains in which learning a mapping directly from agents' sensors to their actuators is intractable with existing machine-learning methods. Third, the book introduces a new multiagent reinforcement learning algorithm--team-partitioned, opaque-transition reinforcement learning (TPOT-RL)--designed for domains in which agents cannot necessarily observe the state-changes caused by other agents' actions. The final contribution is a fully functioning multiagent system that incorporates learning in a real-time, noisy domain with teammates and adversaries--a computer-simulated robotic soccer team. Peter Stone's work is the basis for the CMUnited Robotic Soccer Team, which has dominated recent RoboCup competitions. RoboCup not only helps roboticists to prove their theories in a realistic situation, but has drawn considerable public and professional attention to the field of intelligent robotics. The CMUnited team won the 1999 Stockholm simulator competition, outscoring its opponents by the rather impressive cumulative score of 110-0.

311 citations


01 Jan 2000
TL;DR: A general framework for prediction, control, and learning at multiple temporal scales is developed, showing how multi-time models can be used to produce plans of behavior very quickly using classical dynamic programming or reinforcement learning techniques.
Abstract: Decision making usually involves choosing among different courses of action over a broad range of time scales. For instance, a person planning a trip to a distant location makes high-level decisions regarding what means of transportation to use, but also chooses low-level actions, such as the movements for getting into a car. The problem of picking an appropriate time scale for reasoning and learning has been explored in artificial intelligence, control theory and robotics. In this dissertation we develop a framework that allows novel solutions to this problem, in the context of Markov Decision Processes (MDPs) and reinforcement learning. In this dissertation, we present a general framework for prediction, control and learning at multiple temporal scales. In this framework, temporally extended actions are represented by a way of behaving (a policy) together with a termination condition. An action represented in this way is called an option. Options can be easily incorporated in MDPs, allowing an agent to use existing controllers, heuristics for picking actions, or learned courses of action. The effects of behaving according to an option can be predicted using multi-time models, learned by interacting with the environment. In this dissertation we develop multi-time models, and we illustrate the way in which they can be used to produce plans of behavior very quickly, using classical dynamic programming or reinforcement learning techniques. The most interesting feature of our framework is that it allows an agent to work simultaneously with high-level and low-level temporal representations. The interplay of these levels can be exploited in order to learn and plan more efficiently and more accurately. We develop new algorithms that take advantage of this structure to improve the quality of plans, and to learn in parallel about the effects of many different options.
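
A minimal sketch of the option abstraction described above, assuming a discrete environment with a step() interface (initiation sets and the multi-time models are left out):

from dataclasses import dataclass
from typing import Callable
import random

@dataclass
class Option:
    """A temporally extended action: a way of behaving plus a termination condition."""
    policy: Callable[[int], int]      # state -> primitive action
    beta: Callable[[int], float]      # state -> probability of terminating here

def execute_option(env, state, option, gamma=0.99):
    """Run an option until it terminates and return the SMDP-style outcome:
    (next_state, accumulated discounted reward, discount for the elapsed duration)."""
    total, disc = 0.0, 1.0
    while True:
        action = option.policy(state)
        state, reward, done = env.step(action)
        total += disc * reward
        disc *= gamma
        if done or random.random() < option.beta(state):
            return state, total, disc

The returned (next_state, reward, discount) triple is exactly what an SMDP-style update such as Q(s, o) += alpha * (reward + discount * max_o' Q(next_state, o') - Q(s, o)) consumes.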

Proceedings ArticleDOI
27 Jul 2000
TL;DR: This paper focuses on a systematic treatment for developing a generic online learning control system based on the fundamental principle of reinforcement learning or more specifically neuro-dynamic programming.
Abstract: This paper focuses on a systematic treatment for developing a generic online learning control system based on the fundamental principle of reinforcement learning or, more specifically, neuro-dynamic programming. This real-time learning system improves its performance over time in two aspects: it learns from its own mistakes through the reinforcement signal from the external environment and tries to adjust its actions to improve future performance; and states associated with positive reinforcement are memorized through a network learning process, so that in the future similar states will be more strongly associated with control actions that lead to positive reinforcement. Two successful candidates for online learning control design are introduced. Real-time learning algorithms can be derived for individual components in the learning system. Some analytical insights are provided to give guidelines on the entire online learning control system.

Proceedings Article
29 Jun 2000
TL;DR: This paper introduces an algorithm that safely approximates the value function for continuous state control tasks, and that learns quickly from a small amount of data, and gives experimental results using this algorithm to learn policies for both a simulated task and also for a real robot, operating in an unaltered environment.
Abstract: Dynamic control tasks are good candidates for the application of reinforcement learning techniques. However, many of these tasks inherently have continuous state or action variables. This can cause problems for traditional reinforcement learning algorithms which assume discrete states and actions. In this paper, we introduce an algorithm that safely approximates the value function for continuous state control tasks, and that learns quickly from a small amount of data. We give experimental results using this algorithm to learn policies for both a simulated task and also for a real robot, operating in an unaltered environment. The algorithm works well in a traditional learning setting, and demonstrates extremely good learning when bootstrapped with a small amount of human-provided data.
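
The paper's specific algorithm is not reproduced here; as a generic illustration of approximating a value function over continuous states from a small number of stored samples, a locally weighted (kernel regression) estimate might look like the following, with the bandwidth and the refuse-to-extrapolate threshold being illustrative choices.

import numpy as np

def lwr_q_estimate(query, X, y, bandwidth=0.5):
    """Locally weighted estimate of Q(query) from stored (state-action, return) pairs.

    query : feature vector for the state-action pair being evaluated
    X     : array (n, d) of past state-action feature vectors
    y     : array (n,) of observed returns or backed-up values
    Generic kernel-regression sketch, not the paper's exact method; refusing to
    answer far from the data is one simple safeguard against wild extrapolation.
    """
    d2 = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))   # Gaussian kernel weights
    if w.sum() < 1e-6:                         # too far from all stored samples
        return None
    return float(np.dot(w, y) / w.sum())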

Journal ArticleDOI
Marilyn A. Walker
TL;DR: A novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users, based on a combination of reinforcement learning and performance modeling of spoken dialogue systems.
Abstract: This paper describes a novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users. The method is based on a combination of reinforcement learning and performance modeling of spoken dialogue systems. The reinforcement learning component applies Q-learning (Watkins, 1989), while the performance modeling component applies the PARADISE evaluation framework (Walker et al., 1997) to learn the performance function (reward) used in reinforcement learning. We illustrate the method with a spoken dialogue system named elvis (EmaiL Voice Interactive System), that supports access to email over the phone. We conduct a set of experiments for training an optimal dialogue strategy on a corpus of 219 dialogues in which human users interact with elvis over the phone. We then test that strategy on a corpus of 18 dialogues. We show that elvis can learn to optimize its strategy selection for agent initiative, for reading messages, and for summarizing email folders.

Journal ArticleDOI
01 Jan 2000
TL;DR: A new reinforcement learning paradigm that explicitly takes into account input disturbance as well as modeling errors is proposed, which is called robust reinforcement learning (RRL) and tested on the control task of an inverted pendulum.
Abstract: This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both offline learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a "disturbing" agent tries to make the worst possible disturbance while a "control" agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.
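
Schematically, the min-max value function described above has roughly the following shape (the exact weighting and sign conventions for the disturbance norm follow the H∞ literature and vary between formulations; this is only meant to convey the structure):

    V(x(t)) = \max_{u(\cdot)} \min_{w(\cdot)} \int_t^{\infty} e^{-\frac{s-t}{\tau}} \left( r(x(s), u(s)) + \gamma^{2} \, \| w(s) \|^{2} \right) ds

The control agent maximizes over u while the disturbing agent minimizes over w, paying for the disturbance it injects through the weighted norm term.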

Book ChapterDOI
15 Sep 2000
TL;DR: This chapter provides an overview of the ACS2 system, including all parameters as well as its framework, structure, and environmental interaction, and gives a precise description of all algorithms in ACS2.
Abstract: The various modifications and extensions of the anticipatory classifier system (ACS) recently led to the introduction of ACS2, an enhanced and modified version of ACS. This chapter provides an overview of the system, including all parameters as well as its framework, structure, and environmental interaction. Moreover, a precise description of all algorithms in ACS2 is provided.

Journal ArticleDOI
TL;DR: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents that browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion.
Abstract: This paper discusses a novel distributed adaptive algorithm and representation used to construct populations of adaptive Web agents. These InfoSpiders browse networked information environments on-line in search of pages relevant to the user, by traversing hyperlinks in an autonomous and intelligent fashion. Each agent adapts to the spatial and temporal regularities of its local context thanks to a combination of machine learning techniques inspired by ecological models: evolutionary adaptation with local selection, reinforcement learning and selective query expansion by internalization of environmental signals, and optional relevance feedback. We evaluate the feasibility and performance of these methods in three domains: a general class of artificial graph environments, a controlled subset of the Web, and (preliminarily) the full Web. Our results suggest that InfoSpiders could take advantage of the starting points provided by search engines, based on global word statistics, and then use linkage topology to guide their search on-line. We show how this approach can complement the current state of the art, especially with respect to the scalability challenge.

Proceedings Article
29 Jun 2000
TL;DR: A kind of MDP that models the algorithm selection problem by allowing multiple state transitions is introduced, and the well known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and Temporal Difference methods.
Abstract: Many computational problems can be solved by multiple algorithms, with different algorithms fastest for different problem sizes, input distributions, and hardware characteristics. We consider the problem of algorithm selection: dynamically choosing an algorithm to attack an instance of a problem with the goal of minimizing the overall execution time. We formulate the problem as a kind of Markov decision process (MDP), and use ideas from reinforcement learning to solve it. This paper introduces a kind of MDP that models the algorithm selection problem by allowing multiple state transitions. The well-known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and temporal-difference methods. This work also uses the Least-Squares Temporal Difference algorithm (LSTD(0)) of Boyan and extends it to control problems. The experimental study focuses on the classic problems of order statistic selection and sorting. The encouraging results reveal the potential of applying learning methods to traditional computational problems.
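
For reference, the LSTD(0) building block mentioned above solves for linear value-function weights in closed form from a batch of fixed-policy transitions; the feature map and the small ridge term here are illustrative choices.

import numpy as np

def lstd0(transitions, phi, n_features, gamma=0.95, ridge=1e-3):
    """LSTD(0): closed-form weights theta such that V(s) is approximated by theta . phi(s).

    transitions : iterable of (s, r, s_next, terminal) tuples from fixed-policy data
    phi(s)      : feature map returning a vector of length n_features
    """
    A = ridge * np.eye(n_features)       # small ridge keeps A invertible on little data
    b = np.zeros(n_features)
    for s, r, s_next, terminal in transitions:
        f = phi(s)
        f_next = np.zeros(n_features) if terminal else phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)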

Journal ArticleDOI
TL;DR: This paper defines the recognition process as a sequential decision problem with the objective of disambiguating initial object hypotheses; reinforcement learning provides an efficient method to autonomously develop near-optimal decision strategies in terms of sensorimotor mappings.

Proceedings ArticleDOI
28 May 2000
TL;DR: An algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process is presented and it is proved that the algorithm converges with probability 1.
Abstract: Many control, scheduling, planning and game-playing tasks can be formulated as reinforcement learning problems, in which an agent chooses actions to take in some environment, aiming to maximize a reward function. We present an algorithm for computing approximations to the gradient of the average reward from a single sample path of a controlled partially observable Markov decision process. We show that the accuracy of these approximations depends on the relationship between a time constant used by the algorithm and the mixing time of the Markov chain, and that the error can be made arbitrarily small by setting the time constant suitably large. We prove that the algorithm converges with probability 1.

01 Oct 2000
TL;DR: This paper contributes a comprehensive presentation of the relevant techniques for solving stochastic games from both the game theory and reinforcement learning communities, and examines the assumptions and limitations of these algorithms.
Abstract: Learning behaviors in a multiagent environment is crucial for developing and adapting multiagent systems. Reinforcement learning techniques have addressed this problem for a single agent acting in a stationary environment, which is modeled as a Markov decision process (MDP). But multiagent environments are inherently non-stationary, since the other agents are free to change their behavior as they also learn and adapt. Stochastic games, first studied in the game theory community, are a natural extension of MDPs to include multiple agents. In this paper we contribute a comprehensive presentation of the relevant techniques for solving stochastic games from both the game theory and reinforcement learning communities. We examine the assumptions and limitations of these algorithms, and identify similarities between these algorithms, single-agent reinforcement learners, and basic game theory techniques.
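
As one concrete example of the algorithms this survey covers, the minimax-Q update for a two-player zero-sum stochastic game replaces the max of ordinary Q-learning with the value of a matrix game (notation here is generic):

    Q(s, a, o) \leftarrow (1 - \alpha) \, Q(s, a, o) + \alpha \left( r + \gamma \max_{\pi \in \Pi(A)} \min_{o'} \sum_{a'} \pi(a') \, Q(s', a', o') \right)

where a is the learner's action, o the opponent's, and the inner max-min over the learner's mixed strategies Pi(A) is typically computed by linear programming.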

Journal ArticleDOI
TL;DR: Methods of neuro-dynamic programming [reinforcement learning (RL)], together with a decomposition approach, are used to construct dynamic (state-dependent) call admission control and routing policies based on state-dependent link costs.
Abstract: We consider the problem of call admission control (CAC) and routing in an integrated services network that handles several classes of calls of different value and with different resource requirements. The problem of maximizing the average value of admitted calls per unit time (or of revenue maximization) is naturally formulated as a dynamic programming problem, but is too complex to allow for an exact solution. We use methods of neuro-dynamic programming (NDP) [reinforcement learning (RL)], together with a decomposition approach, to construct dynamic (state-dependent) call admission control and routing policies. These policies are based on state-dependent link costs, and a simulation-based learning method is employed to tune the parameters that define these link costs. A broad set of experiments shows the robustness of our policy and compares its performance with a commonly used heuristic.

Journal ArticleDOI
01 Nov 2000
TL;DR: A general approach to image segmentation and object recognition that can adapt the image segmentation algorithm parameters to changing environmental conditions is presented, and the performance improvement over time is shown.
Abstract: The paper presents a general approach to image segmentation and object recognition that can adapt the image segmentation algorithm parameters to the changing environmental conditions. Segmentation parameters are represented by a team of generalized stochastic learning automata and learned using connectionist reinforcement learning techniques. The edge-border coincidence measure is first used as reinforcement for segmentation evaluation to reduce computational expenses associated with model matching during the early stage of adaptation. This measure alone, however, cannot reliably predict the outcome of object recognition. Therefore, it is used in conjunction with model matching where the matching confidence is used as a reinforcement signal to provide optimal segmentation evaluation in a closed-loop object recognition system. The adaptation alternates between global and local segmentation processes in order to achieve optimal recognition performance. Results are presented for both indoor and outdoor color images where the performance improvement over time is shown for both image segmentation and object recognition.

Proceedings Article
01 Jan 2000
TL;DR: This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region.
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

Journal ArticleDOI
TL;DR: A probabilistic framework for modelling spoken-dialogue systems on the assumption that the overall system behaviour can be represented as a Markov decision process is presented, and the optimization of dialogue-management strategy using reinforcement learning is reviewed.
Abstract: This paper presents a probabilistic framework for modelling spoken-dialogue systems. On the assumption that the overall system behaviour can be represented as a Markov decision process, the optimization of dialogue-management strategy using reinforcement learning is reviewed. Examples of learning behaviour are presented for both dynamic programming and sampling methods, but the latter are preferred. The paper concludes by noting the importance of user simulation models for the practical application of these techniques and the need for developing methods of mapping system features in order to achieve sufficiently compact state spaces.

Proceedings Article
01 Jan 2000
TL;DR: Together, the methods presented in this work comprise a system for agent design that allows the programmer to specify what they know, hint at what they suspect using soft shaping, and leave unspecified that which they don't know; the system then optimally completes the program through experience and takes advantage of the hierarchical structure of the specified program to speed learning.
Abstract: This dissertation examines the use of partial programming as a means of designing agents for large Markov Decision Problems. In this approach, a programmer specifies only that which they know to be correct, and the system then learns the rest from experience using reinforcement learning. In contrast to previous low-level languages for partial programming, this dissertation presents ALisp, a Lisp-based high-level partial programming language. ALisp allows the programmer to constrain the policies considered by a learning process and to express his or her prior knowledge in a concise manner. Optimally completing a partial ALisp program is shown to be equivalent to solving a Semi-Markov Decision Problem (SMDP). Under a finite memory-use condition, online learning algorithms for ALisp are proved to converge to an optimal solution of the SMDP and thus to an optimal completion of the partial program. The dissertation then presents methods for exploiting the modularity of partial programs. State abstraction allows an agent to ignore aspects of its current state that are irrelevant to its current decision, and therefore speeds up reinforcement learning; decomposing representations of the value of actions along subroutine boundaries preserves hierarchical optimality, i.e., optimality among all policies consistent with the partial program. These methods are demonstrated on two simulated taxi tasks. Function approximation, a method for representing the value of actions, allows reinforcement learning to be applied to problems where exact methods are intractable. Soft shaping is a method for guiding an agent toward a solution without constraining the search space. Both can be integrated with ALisp. ALisp with function approximation and reward shaping is successfully applied to a difficult continuous variant of the simulated taxi task. Together, the methods presented in this work comprise a system for agent design that allows the programmer to specify what they know, hint at what they suspect using soft shaping, and leave unspecified that which they don't know; the system then optimally completes the program through experience and takes advantage of the hierarchical structure of the specified program to speed learning.

Journal ArticleDOI
TL;DR: In this paper, the optimal policy for a learning by doing problem is characterized using numerical methods and shown to have a substantial degree of experimentation under a wide range of initial beliefs about the unknown parameters.

Proceedings Article
30 Jul 2000
TL;DR: ADVISOR, a two-agent machine learning architecture for intelligent tutoring systems (ITS), is constructed to centralize the reasoning of an ITS into a single component to allow customization of teaching goals and to simplify improving the ITS.
Abstract: We have constructed ADVISOR, a two-agent machine learning architecture for intelligent tutoring systems (ITS). The purpose of this architecture is to centralize the reasoning of an ITS into a single component to allow customization of teaching goals and to simplify improving the ITS. The first agent is responsible for learning a model of how students perform using the tutor in a variety of contexts. The second agent is provided this model of student behavior and a goal specifying the desired educational objective. Reinforcement learning is used by this agent to derive a teaching policy that meets the specified educational goal. Component evaluation studies show each agent performs adequately in isolation. We have also conducted an evaluation with actual students of the complete architecture. Results show ADVISOR was successful in learning a teaching policy that met the educational objective provided. Although this set of machine learning agents has been integrated with a specific intelligent tutor, the general technique could be applied to a broad class of ITS.

Proceedings Article
29 Jun 2000
TL;DR: GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy, is introduced and its convergence is proved.
Abstract: This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bias-variance trade-off, and it requires no knowledge of the underlying state. We prove convergence of GPOMDP and show how the gradient estimates produced by GPOMDP can be used in a conjugate-gradient procedure to find local optima of the average reward.
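
A compact sketch of the GPOMDP estimator is given below, assuming an ergodic (non-episodic) environment and a parameterized policy object exposing sample() and grad_log(); these interfaces are assumptions made for the illustration.

import numpy as np

def gpomdp_estimate(env, policy, theta, T=100000, beta=0.9):
    """Estimate the gradient of the average reward from a single sample path.

    policy.sample(obs, theta)      : draw an action from pi(. | obs; theta)
    policy.grad_log(obs, a, theta) : gradient of log pi(a | obs; theta) w.r.t. theta
    beta in [0, 1) trades bias (small beta) against variance (beta near 1).
    """
    z = np.zeros_like(theta)       # eligibility trace of score functions
    grad = np.zeros_like(theta)    # running average of reward * trace
    obs = env.reset()
    for t in range(1, T + 1):
        a = policy.sample(obs, theta)
        score = policy.grad_log(obs, a, theta)
        obs, reward = env.step(a)            # non-episodic: step returns (obs, reward)
        z = beta * z + score
        grad += (reward * z - grad) / t      # incremental average
    return grad

The resulting estimate can then be fed to the conjugate-gradient ascent procedure described in the paper.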

Journal ArticleDOI
TL;DR: Continuous action reinforcement learning automata (CARLA) are used to simultaneously tune the parameters of a three-term controller on-line to minimise a performance objective; the algorithm is first applied in simulation on a nominal engine model, followed by a practical study using a Ford Zetec engine in a test cell.