Proceedings Article

Non-parametric Approximate Dynamic Programming via the Kernel Method

TL;DR: A novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees and can serve as a viable alternative to state-of-the-art parametric ADP algorithms.
Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches.
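The paper's mathematical program is not reproduced here, but the underlying idea of non-parametric ADP (representing the value function as a kernel expansion over sampled states, rather than over a hand-designed basis) can be sketched as fitted value iteration with kernel ridge regression on Bellman targets. The toy dynamics, reward, and parameter choices below are illustrative, not taken from the paper:

```python
import numpy as np

def gaussian_gram(X, Y, bw=0.3):
    # Gram matrix K[i, j] = exp(-(x_i - y_j)^2 / (2 * bw^2)) for 1-D inputs
    return np.exp(-((X[:, None] - Y[None, :]) ** 2) / (2 * bw ** 2))

rng = np.random.default_rng(0)
states = rng.uniform(0.0, 1.0, size=50)        # sampled states in [0, 1]
rewards = states ** 2                          # illustrative reward r(s) = s^2
next_states = 0.9 * states                     # toy deterministic dynamics
gamma, lam = 0.8, 1e-2                         # discount factor, ridge penalty

# Fitted value iteration: each sweep regresses the Bellman targets
# r(s) + gamma * V(s') onto a kernel expansion over the sampled states.
K = gaussian_gram(states, states)
K_next = gaussian_gram(next_states, states)
alpha = np.zeros(len(states))                  # kernel expansion coefficients
for _ in range(100):
    targets = rewards + gamma * (K_next @ alpha)
    alpha = np.linalg.solve(K + lam * np.eye(len(states)), targets)

V = K @ alpha   # fitted value estimates at the sampled states
```

For this toy chain the exact fixed point is V(s) = s^2 / (1 - 0.8 * 0.81), roughly 2.84 s^2, which the kernel fit tracks closely; at no point did a basis of features have to be specified.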


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors introduce a theoretical framework for studying the behavior of reinforcement learning algorithms that learn to control an MDP using history-based feature abstraction mappings, and numerically evaluate its effectiveness on a set of continuous control tasks.
Abstract: Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a Partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.

2 citations

Proceedings Article
24 May 2022
TL;DR: This paper studies the convergence of the regularized non-parametric TD(0) algorithm, in both the independent and Markovian observation settings, and proves convergence of the averaged iterates to the optimal value function, even when it does not belong to the RKHS.
Abstract: Temporal-difference learning is a popular algorithm for policy evaluation. In this paper, we study the convergence of the regularized non-parametric TD(0) algorithm, in both the independent and Markovian observation settings. In particular, when TD is performed in a universal reproducing kernel Hilbert space (RKHS), we prove convergence of the averaged iterates to the optimal value function, even when it does not belong to the RKHS. We provide explicit convergence rates that depend on a source condition relating the regularity of the optimal value function to the RKHS. We illustrate this convergence numerically on a simple continuous-state Markov reward process.
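The regularized algorithm analyzed in the paper is more involved, but plain TD(0) in an RKHS can be sketched as a functional-gradient method: the value estimate is a growing kernel expansion over visited states, and each step appends a new center weighted by the step size times the TD error. The Markov reward process below is an illustrative stand-in:

```python
import numpy as np

def k(x, y, bw=0.3):
    # Gaussian kernel on the real line
    return np.exp(-((x - y) ** 2) / (2 * bw ** 2))

def V(x, centers, coefs):
    # Value estimate as a kernel expansion: V(x) = sum_t c_t * k(x, s_t)
    return sum(c * k(x, s) for c, s in zip(coefs, centers))

rng = np.random.default_rng(1)
gamma, step = 0.9, 0.1
centers, coefs = [], []

# Simple continuous-state Markov reward process:
# s' = 0.5 * s + noise, with reward r(s) = s.
s = 0.5
for _ in range(500):
    s_next = 0.5 * s + 0.1 * rng.standard_normal()
    delta = s + gamma * V(s_next, centers, coefs) - V(s, centers, coefs)
    centers.append(s)            # functional-gradient TD step:
    coefs.append(step * delta)   # V <- V + step * delta * k(s, .)
    s = s_next
```

Each update adds one kernel center, so the representation grows with the data; the regularization and iterate averaging studied in the paper are what address this plain version's unbounded growth.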

2 citations

Journal ArticleDOI
TL;DR: This paper captures the future cost of having nurse practitioners positioned at certain locations around the city by estimating a cost-to-go function using a tree or tree ensemble, and leverages formulation techniques from Mixed Integer Optimization to solve the optimization problem at each stage sufficiently fast.
Abstract: This paper is motivated from a collaboration with a start-up delivering on-demand health care services. In this setting, nurse practitioners need to be dynamically routed to patients' houses as service requests are received. We solve this problem using approximate dynamic programming and machine learning techniques. At every stage, an optimization problem needs to be solved to assign nurse practitioners to patients. We capture the future cost of having nurse practitioners positioned at certain locations around the city by estimating a cost-to-go function. Many previous approaches treat the cost-to-go function as a black box, which requires the enumeration of all possible actions to make a routing decision at each stage. This is an intractable problem in the reoptimization setting. Our approach overcomes this issue by approximating the cost-to-go function using a tree or tree ensemble, and leverages formulation techniques from Mixed Integer Optimization to solve the optimization problem at each stage sufficiently fast. Furthermore, this approach has the advantage of being more accurate relative to separable approximations used in the literature. We also apply this approach to several other online optimization problems in operations management which can be modeled through an underlying optimization formulation that needs to be solved at each stage, and show improvements relative to state-of-the-art methods.

2 citations


Cites background or methods from "Non-parametric Approximate Dynamic ..."

  • ...A range of approximations have been explored in the literature including linear (Powell and Carvalho (1998), De Farias and Van Roy (2003)), separable concave (Topaloglu and Powell (2003)), non-parametric kernel (Bhat et al. (2012)) or neural networks (Mnih et al. (2013), Mnih et al. (2015)). A number of these techniques have also been applied to a vehicle routing setting (Godfrey and Powell (2002), Powell and Topaloglu (2003), Topaloglu and Powell (2006), Powell et al. (2007), Novoa and Storer (2009))....

    [...]

Posted Content
TL;DR: Self-guided ALPs are found to significantly reduce policy cost fluctuations and improve on the optimality gaps of an ALP approach that employs basis functions tailored to the former application, and to deliver optimality gaps comparable to a known adaptive basis function generation approach targeting the latter application.
Abstract: Approximate linear programs (ALPs) are well-known models based on value function approximations (VFAs) to obtain heuristic policies and lower bounds on the optimal policy cost of Markov decision processes (MDPs). The ALP VFA is a linear combination of predefined basis functions that are chosen using domain knowledge and updated heuristically if the ALP optimality gap is large. We side-step the need for such basis function engineering in ALP -- an implementation bottleneck -- by proposing a sequence of ALPs that embed increasing numbers of random basis functions obtained via inexpensive sampling. We provide a sampling guarantee and show that the VFAs from this sequence of models converge to the exact value function. Nevertheless, the performance of the ALP policy can fluctuate significantly as more basis functions are sampled. To mitigate these fluctuations, we "self-guide" our convergent sequence of ALPs using past VFA information such that a worst-case measure of policy performance is improved. We perform numerical experiments on perishable inventory control and generalized joint replenishment applications, which, respectively, give rise to challenging discounted-cost MDPs and average-cost semi-MDPs. We find that self-guided ALPs (i) significantly reduce policy cost fluctuations and improve the optimality gaps from an ALP approach that employs basis functions tailored to the former application, and (ii) deliver optimality gaps that are comparable to a known adaptive basis function generation approach targeting the latter application. More broadly, our methodology provides application-agnostic policies and lower bounds to benchmark approaches that exploit application structure.

1 citation


Cites methods from "Non-parametric Approximate Dynamic ..."

  • ...The initial selection and potential modification of basis functions in steps (i) and (iv), respectively, are implementation bottlenecks when using ALP but this issue has received limited attention in the literature (Klabjan and Adelman 2007, Adelman and Klabjan 2012, and Bhat et al. 2012)....

    [...]

  • ...Bhat et al. (2012) side-step basis function selection when computing a VFA by applying the kernel trick (see, e.g., chapter 5 of Mohri et al. 2012) to replace inner-products of such functions in the dual of a regularized ALP relaxation....

    [...]
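The kernel trick mentioned above rests on the identity ⟨φ(x), φ(y)⟩ = k(x, y): inner products of (possibly high-dimensional) feature maps are replaced by direct kernel evaluations. A minimal numerical check with the degree-2 polynomial kernel (an illustrative kernel, not the one used by Bhat et al.):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2:
    # k(x, y) = (x . y)^2  corresponds to  phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
explicit = phi(x) @ phi(y)   # inner product computed in feature space
implicit = (x @ y) ** 2      # kernel evaluation, no feature map needed
assert np.isclose(explicit, implicit)   # both equal (x . y)^2 = 1.0
```

In the ALP dual, the same identity lets the program be posed purely in terms of Gram-matrix entries, so the basis functions never have to be written down explicitly.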

Journal ArticleDOI
TL;DR: In this article, a linear control policy for dynamic portfolio selection is developed by incorporating time-series behaviors of asset returns on the basis of coherent risk minimization, and the dual form of the optimization model is analyzed.

1 citation

References
More filters
Book
01 May 1995
TL;DR: The leading and most up-to-date textbook on the far-ranging algorithmic methodology of Dynamic Programming, which can be used for optimal control, Markovian decision problems, planning and sequential decision making under uncertainty, and discrete/combinatorial optimization.
Abstract: The leading and most up-to-date textbook on the far-ranging algorithmic methodology of Dynamic Programming, which can be used for optimal control, Markovian decision problems, planning and sequential decision making under uncertainty, and discrete/combinatorial optimization. The treatment focuses on basic unifying themes, and conceptual foundations. It illustrates the versatility, power, and generality of the method with many examples and applications from engineering, operations research, and other fields. It also addresses extensively the practical application of the methodology, possibly through the use of approximations, and provides an extensive treatment of the far-reaching methodology of Neuro-Dynamic Programming/Reinforcement Learning.

10,834 citations

BookDOI
01 Dec 2001
TL;DR: Learning with Kernels provides an introduction to SVMs and related kernel methods that provide all of the concepts necessary to enable a reader equipped with some basic mathematical knowledge to enter the world of machine learning using theoretically well-founded yet easy-to-use kernel algorithms.
Abstract: From the Publisher: In the 1990s, a new type of learning algorithm was developed, based on results from statistical learning theory: the Support Vector Machine (SVM). This gave rise to a new class of theoretically elegant learning machines that use a central concept of SVMs, kernels, for a number of learning tasks. Kernel machines provide a modular framework that can be adapted to different tasks and domains by the choice of the kernel function and the base algorithm. They are replacing neural networks in a variety of fields, including engineering, information retrieval, and bioinformatics. Learning with Kernels provides an introduction to SVMs and related kernel methods. Although the book begins with the basics, it also includes the latest research. It provides all of the concepts necessary to enable a reader equipped with some basic mathematical knowledge to enter the world of machine learning using theoretically well-founded yet easy-to-use kernel algorithms and to understand and apply the powerful algorithms that have been developed over the last few years.

7,880 citations


"Non-parametric Approximate Dynamic ..." refers background in this paper

  • ...For certain sets S, Mercer’s theorem provides another important construction of such a Hilbert space; more examples can be found in the text of Scholkopf and Smola (2001)....

    [...]

  • ...The Gaussian kernel is known to be full-dimensional (see, e.g., Theorem 2.18, Scholkopf and Smola, 2001), so that employing such a kernel in our setting would correspond to working with an infinite dimensional approximation architecture....

    [...]

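A concrete way to see the full-dimensionality of the Gaussian kernel referenced above: its Gram matrix on any set of distinct points is strictly positive definite, so no finite sample exhausts the approximation architecture. A quick numerical check (the bandwidth and points are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))   # 10 distinct points in R^3

# Gaussian (RBF) Gram matrix with unit bandwidth
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 2.0)

eigenvalues = np.linalg.eigvalsh(K)
assert eigenvalues.min() > 0   # strictly positive definite: full rank
```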

Book
01 Jan 1968
TL;DR: This book shows engineers how to use optimization theory to solve complex problems with a minimum of mathematics and unifies the large field of optimization with a few geometric principles.
Abstract: From the Publisher: Engineers must make decisions regarding the distribution of expensive resources in a manner that will be economically beneficial. This problem can be realistically formulated and logically analyzed with optimization theory. This book shows engineers how to use optimization theory to solve complex problems. Unifies the large field of optimization with a few geometric principles. Covers functional analysis with a minimum of mathematics. Contains problems that relate to the applications in the book.

5,667 citations

Journal ArticleDOI
TL;DR: The stability of a queueing network with interdependent servers is considered and a policy is obtained which is optimal in the sense that its Stability Region is a superset of the stability region of every other scheduling policy, and this stability region is characterized.
Abstract: The stability of a queueing network with interdependent servers is considered. The dependency among the servers is described by the definition of their subsets that can be activated simultaneously. Multihop radio networks provide a motivation for the consideration of this system. The problem of scheduling the server activation under the constraints imposed by the dependency among servers is studied. The performance criterion of a scheduling policy is its throughput that is characterized by its stability region, that is, the set of vectors of arrival and service rates for which the system is stable. A policy is obtained which is optimal in the sense that its stability region is a superset of the stability region of every other scheduling policy, and this stability region is characterized. The behavior of the network is studied for arrival rates that lie outside the stability region. Implications of the results in certain types of concurrent database and parallel processing systems are discussed.
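The throughput-optimal policy obtained in the paper is the Max-Weight rule: at each slot, activate the feasible server subset that maximizes the total queue-length-weighted service. A minimal simulation sketch (the activation sets and arrival rates below are illustrative, chosen to lie strictly inside the stability region):

```python
import numpy as np

rng = np.random.default_rng(0)
# Feasible activation sets: each row says which servers may run together.
activations = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0]])
arrival_rates = np.array([0.3, 0.2, 0.3])   # inside the stability region

q = np.zeros(3)   # queue lengths
for _ in range(10_000):
    arrivals = (rng.random(3) < arrival_rates).astype(float)
    # Max-Weight: pick the activation set maximizing sum_i q_i * service_i.
    chosen = activations[np.argmax(activations @ q)]
    served = np.minimum(q, chosen)   # each active server serves one job
    q = q + arrivals - served
```

Because (0.3, 0.2, 0.3) is dominated by a convex combination of the activation vectors, Max-Weight keeps the queues stable without ever knowing the arrival rates.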

3,018 citations


"Non-parametric Approximate Dynamic ..." refers methods in this paper

  • ...Max-Weight (Tassiulas and Ephremides, 1992)....

    [...]

  • ...We prepare the ground for the proof by developing appropriate uniform concentration guarantees for appropriate function classes....

    [...]

Book ChapterDOI
01 Mar 2003
TL;DR: In this paper, the authors investigate the use of data-dependent estimates of the complexity of a function class, called Rademacher and Gaussian complexities, in a decision theoretic setting and prove general risk bounds in terms of these complexities.
Abstract: We investigate the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and Gaussian complexities. In a decision theoretic setting, we prove general risk bounds in terms of these complexities. We consider function classes that can be expressed as combinations of functions from basis classes and show how the Rademacher and Gaussian complexities of such a function class can be bounded in terms of the complexity of the basis classes. We give examples of the application of these techniques in finding data-dependent risk bounds for decision trees, neural networks and support vector machines.
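The empirical Rademacher complexity, R_hat(H) = E_sigma[sup_{h in H} (1/n) sum_i sigma_i h(x_i)], can be estimated by Monte Carlo for any finite class; the threshold classifiers below are a toy example, not one from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1.0, 1.0, size=n)

# Small finite class: threshold classifiers h_t(x) = sign(x - t).
thresholds = np.linspace(-1.0, 1.0, 21)
H = np.sign(X[None, :] - thresholds[:, None])   # H[j, i] = h_{t_j}(x_i)

# Empirical Rademacher complexity:
#   R_hat = E_sigma[ sup_h (1/n) * sum_i sigma_i * h(x_i) ]
draws = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=n)
    draws.append((H @ sigma).max() / n)
R_hat = float(np.mean(draws))
```

For a finite class of bounded functions, Massart's lemma bounds this quantity by sqrt(2 * ln|H| / n), roughly 0.25 here, and the Monte Carlo estimate should come out below that.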

2,535 citations