Posted Content

A Linearly Relaxed Approximate Linear Program for Markov Decision Processes

TL;DR: A linearly relaxed approximate linear program (LRALP) is defined that has a tractable number of constraints, obtained as positive linear combinations of the original constraints of the ALP.
Abstract: Approximate linear programming (ALP) and its variants have been widely applied to Markov Decision Processes (MDPs) with a large number of states. A serious limitation of ALP is that it has an intractable number of constraints, as a result of which constraint approximations are of interest. In this paper, we define a linearly relaxed approximate linear program (LRALP) that has a tractable number of constraints, obtained as positive linear combinations of the original constraints of the ALP. The main contribution is a novel performance bound for LRALP.
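To make the construction concrete, here is a minimal, hedged sketch of the LRALP idea on a tiny synthetic MDP: the exact ALP constraints are stacked into a matrix and then replaced by a small number of positive linear combinations of its rows. The problem sizes, the random aggregation matrix `W`, and all variable names are illustrative assumptions, not taken from the paper.

```python
# A hedged sketch of the LRALP construction on a tiny synthetic MDP.
# Problem sizes, the random aggregation matrix W, and all names are
# illustrative assumptions, not taken from the paper.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nS, nA, k, m, gamma = 20, 3, 4, 6, 0.9   # states, actions, features, relaxed constraints

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)  # P[a, s, s'] = P(s'|s, a)
R = rng.random((nA, nS))                                         # rewards r(s, a)
Phi = np.hstack([np.ones((nS, 1)), rng.random((nS, k - 1))])     # features (constant + random)
c = np.full(nS, 1.0 / nS)                                        # state-relevance weights

# Exact ALP: minimize c^T Phi w  s.t.  (Phi - gamma * P_a Phi) w >= r_a  for all (s, a).
A = np.vstack([Phi - gamma * P[a] @ Phi for a in range(nA)])     # (nS * nA) x k
b = np.concatenate([R[a] for a in range(nA)])

# LRALP: keep only m positive linear combinations of the rows:  W^T A w >= W^T b.
W = rng.random((nS * nA, m))                                     # nonnegative aggregation matrix

def solve(A_ge, b_ge):
    # linprog minimizes subject to A_ub x <= b_ub, so flip the ">=" constraints.
    res = linprog(c @ Phi, A_ub=-A_ge, b_ub=-b_ge,
                  bounds=[(None, None)] * k, method="highs")
    return res.x if res.success else None  # a poor W can leave the relaxation unbounded

print("ALP weights:  ", solve(A, b))              # all nS * nA constraints
print("LRALP weights:", solve(W.T @ A, W.T @ b))  # only m aggregated constraints
```

A randomly chosen W serves only to illustrate the mechanics; the paper's performance bound concerns precisely how the quality of the LRALP solution depends on this choice.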
Citations
Proceedings Article
01 Jan 2018
TL;DR: This work studies a primal-dual formulation of the ALP, and develops a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided, proving that it is sample-efficient.

44 citations


Cites background or methods from "A Linearly Relaxed Approximate Line..."

  • ...Others have studied various approaches to compress the large constraint set into a smaller one (Taylor and Parr, 2012; Lakshminarayanan et al., 2017)....


Posted Content
TL;DR: An equivalent form of the dual problem is identified that relates the dual LP to a sample average approximation of a stochastic program, and a new type of OLP algorithm, the action-history-dependent learning algorithm, is proposed; it improves on previous algorithms by taking into account both the past input data and the past decisions/actions.
Abstract: We study an online linear programming (OLP) problem under a random input model in which the columns of the constraint matrix along with the corresponding coefficients in the objective function are generated i.i.d. from an unknown distribution and revealed sequentially over time. Virtually all pre-existing online algorithms were based on learning the dual optimal solutions/prices of the linear programs (LP), and their analyses were focused on the aggregate objective value and solving the packing LP where all coefficients in the constraint matrix and objective are nonnegative. However, two major open questions were: (i) Does the set of LP optimal dual prices learned in the pre-existing algorithms converge to those of the "offline" LP, and (ii) Could the results be extended to general LP problems where the coefficients can be either positive or negative. We resolve these two questions by establishing convergence results for the dual prices under moderate regularity conditions for general LP problems. Specifically, we identify an equivalent form of the dual problem which relates the dual LP with a sample average approximation to a stochastic program. Furthermore, we propose a new type of OLP algorithm, Action-History-Dependent Learning Algorithm, which improves the previous algorithm performances by taking into account the past input data as well as decisions/actions already made. We derive an $O(\log n \log \log n)$ regret bound (under a locally strong convexity and smoothness condition) for the proposed algorithm, against the $O(\sqrt{n})$ bound for typical dual-price learning algorithms, where $n$ is the number of decision variables. Numerical experiments demonstrate the effectiveness of the proposed algorithm and the action-history-dependent design.
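As a rough illustration of the dual-price viewpoint described above, the hedged sketch below runs a simple online rule on synthetic data: periodically re-solve the LP over the columns seen so far with the right-hand side scaled to the remaining budget, read off the dual prices, and accept a column only if its reward exceeds the priced resource consumption. This is one natural history-dependent instantiation of dual-price learning, not the paper's exact Action-History-Dependent Learning Algorithm; the data, the re-solve schedule, and the budget scaling are assumptions for the example.

```python
# A hedged sketch of dual-price based online LP decisions on synthetic data.
# Not the paper's exact algorithm; data, re-solve schedule, and budget scaling
# are assumptions for the example.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 500, 3                      # decision variables (columns) and resource constraints
b = 0.25 * n * np.ones(m)          # total resource budget
r = rng.random(n)                  # objective coefficients, revealed one at a time
A = rng.random((m, n))             # constraint columns, revealed one at a time

x = np.zeros(n)
used = np.zeros(m)
p = np.zeros(m)                    # current dual-price estimate

for t in range(n):
    if t > 0 and t % 50 == 0:
        # Re-estimate dual prices from the history, with the right-hand side
        # scaled to the remaining budget per remaining round.
        rhs = (b - used) * t / (n - t)
        res = linprog(-r[:t], A_ub=A[:, :t], b_ub=rhs,
                      bounds=[(0, 1)] * t, method="highs")
        if res.success:
            p = -res.ineqlin.marginals   # duals of A_ub x <= b_ub (SciPy HiGHS backend)
    # Accept the column if its reward beats the priced resource use and budget remains.
    if r[t] > p @ A[:, t] and np.all(used + A[:, t] <= b):
        x[t] = 1.0
        used += A[:, t]

print("online objective value:", r @ x)
```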

43 citations


Cites methods from "A Linearly Relaxed Approximate Line..."

  • ...The approximate algorithm can be viewed as a constraint sampling procedure (De Farias and Van Roy, 2004; Lakshminarayanan et al., 2017) for the dual LP....


Posted Content
TL;DR: In this article, a simple and fast online algorithm is presented for solving a class of binary integer linear programs (LPs) arising in general resource allocation problems; it requires only a single pass through the input data and performs no matrix inversion.
Abstract: In this paper, we develop a simple and fast online algorithm for solving a class of binary integer linear programs (LPs) arising in general resource allocation problems. The algorithm requires only a single pass through the input data and performs no matrix inversion. It can be viewed as both an approximate algorithm for solving binary integer LPs and a fast algorithm for solving online LP problems. The algorithm is inspired by an equivalent form of the dual problem of the relaxed LP and it essentially performs (one-pass) projected stochastic subgradient descent in the dual space. We analyze the algorithm in two different models, stochastic input and random permutation, with minimal technical assumptions on the input data. The algorithm achieves $O\left(m \sqrt{n}\right)$ expected regret under the stochastic input model and $O\left((m+\log n)\sqrt{n}\right)$ expected regret under the random permutation model, and it achieves $O(m \sqrt{n})$ expected constraint violation under both models, where $n$ is the number of decision variables and $m$ is the number of constraints. The algorithm enjoys the same performance guarantee when generalized to a multi-dimensional LP setting which covers a wider range of applications. In addition, we employ the notion of permutational Rademacher complexity and derive regret bounds for two earlier online LP algorithms for comparison. Both algorithms improve the regret bound by a factor of $\sqrt{m}$ at the cost of more computation. Furthermore, we demonstrate how to convert a possibly infeasible solution into a feasible one through a randomized procedure. Numerical experiments illustrate the general applicability and effectiveness of the algorithms.
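Here is a minimal sketch of a one-pass dual scheme of the kind the abstract describes: each column is inspected once, the decision is a threshold test against the current dual prices, and the prices are updated by a projected stochastic subgradient step. The step size, budgets, and data below are illustrative assumptions, not the paper's tuned choices.

```python
# A minimal sketch of a one-pass projected dual subgradient scheme on synthetic
# data. Step size, budgets, and data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 4
b = 0.2 * n * np.ones(m)           # resource budgets
d = b / n                          # average budget available per round
r = rng.random(n)                  # rewards, revealed one at a time
A = rng.random((m, n))             # resource consumption columns

p = np.zeros(m)                    # dual prices
eta = 1.0 / np.sqrt(n)             # subgradient step size
x = np.zeros(n)

for t in range(n):
    a_t = A[:, t]
    x[t] = 1.0 if r[t] > p @ a_t else 0.0            # threshold decision, no matrix inversion
    p = np.maximum(0.0, p + eta * (a_t * x[t] - d))  # projected dual subgradient step

print("objective value     :", r @ x)
print("constraint violation:", np.maximum(A @ x - b, 0.0).max())
```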

28 citations

Posted Content
TL;DR: A new reinforcement learning algorithm is proposed, derived from a regularized linear-programming formulation of optimal control in MDPs, whose convex policy-evaluation loss serves as a theoretically sound alternative to the widely used squared Bellman error.
Abstract: We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method introduces a Q-function that enables efficient exact model-free implementation. The main feature of our algorithm (called QREPS) is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error. We provide a practical saddle-point optimization method for minimizing this loss function and provide an error-propagation analysis that relates the quality of the individual updates to the performance of the output policy. Finally, we demonstrate the effectiveness of our method on a range of benchmark problems.

21 citations


Cites background or methods from "A Linearly Relaxed Approximate Line..."

  • ...Our method is based on a subtle variation on the classic LP formulation of optimal control in MDPs due to Manne (1960). One key element in our formulation is a linear relaxation of some of the constraints in this LP, a technique with a long history: a similar relaxation was first proposed by Schweitzer and Seidmann (1985), whose approach was later popularized by the influential work of de Farias and Van Roy (2003)....


  • ...This latter paper initiated a long line of work studying the properties of solutions to various linearly relaxed versions of the LP, mostly focusing on the quality of value functions extracted from the solutions (see, e.g., Petrik and Zilberstein, 2009; Desai et al., 2012; Lakshminarayanan et al., 2018)....


References
Book
15 Apr 1994
TL;DR: Puterman provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models, focusing primarily on infinite-horizon discrete-time models and models with discrete state spaces, while also examining models with arbitrary state spaces, finite-horizon models, and continuous-time discrete-state models.
Abstract: From the Publisher: The past decade has seen considerable theoretical and applied research on Markov decision processes, as well as the growing use of these models in ecology, economics, communications engineering, and other fields where outcomes are uncertain and sequential decision-making processes are needed. A timely response to this increased activity, Martin L. Puterman's new work provides a uniquely up-to-date, unified, and rigorous treatment of the theoretical, computational, and applied research on Markov decision process models. It discusses all major research directions in the field, highlights many significant applications of Markov decision process models, and explores numerous important topics that have previously been neglected or given cursory coverage in the literature. Markov Decision Processes focuses primarily on infinite horizon discrete time models and models with discrete state spaces while also examining models with arbitrary state spaces, finite horizon models, and continuous-time discrete state models. The book is organized around optimality criteria, using a common framework centered on the optimality (Bellman) equation for presenting results. The results are presented in a "theorem-proof" format and elaborated on through both discussion and examples, including results that are not available in any other book. A two-state Markov decision process model, presented in Chapter 3, is analyzed repeatedly throughout the book and demonstrates many results and algorithms. Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria. It also explores several topics that have received little or no attention in other books, including modified policy iteration, multichain models with average reward criterion, and sensitive optimality. In addition, a Bibliographic Remarks section in each chapter comments on relevant historic
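Since the book is organized around the optimality (Bellman) equation, a small illustrative example may help: value iteration on a randomly generated discounted MDP, which repeatedly applies the Bellman optimality operator until a fixed point is reached. The problem sizes and data are arbitrary and serve only to show the equation in action.

```python
# A small illustration of the optimality (Bellman) equation: value iteration
# on a randomly generated discounted MDP. Sizes and data are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 10, 3, 0.95
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)  # P[a, s, s'] = P(s'|s, a)
R = rng.random((nA, nS))                                         # rewards r(s, a)

V = np.zeros(nS)
for _ in range(1000):
    # Bellman optimality operator: (TV)(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P @ V          # shape (nA, nS)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print("optimal values:", V_new.round(3))
print("greedy policy :", Q.argmax(axis=0))
```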

11,625 citations


Journal ArticleDOI
TL;DR: This paper considers problems related to stability or stabilizability of linear systems with parametric uncertainty, robust control, time-varying linear systems, nonlinear and hybrid systems, and stochastic optimal control.

785 citations


Additional excerpts

  • ...In this paper we adopt the framework of discrete-time, discounted MDPs, in which a controller steers the stochastically evolving state of a system while receiving rewards that depend on the states visited and actions chosen....


  • ...Approximate linear programming (ALP) and its variants have been widely applied to Markov Decision Processes (MDPs) with a large number of states....


  • ...While the second assumption limits the scope of MDPs that the result can be applied to, the other two assumptions limit the choice of the basis functions....


  • ...The book of [KM12] gives a relatively fresh, algorithm-centered summary of existing methods suitable for planning in MDPs. AI research tends to focus on empirical results through the development of various benchmarks, and little if any effort is devoted to the theoretical understanding of the quality-effort tradeoff exhibited by the various algorithms developed in this field....


  • ...Keywords: Markov Decision Processes (MDPs), Approximate Linear Programming (ALP). I. INTRODUCTION: Markov decision processes (MDPs) have proved to be an indispensable model for sequential decision making under uncertainty, with applications in networking, traffic control, robotics, operations research, business, finance, artificial intelligence, health-care and more (see, e.g., [Whi93; Rus96a; FS02; HY07; SB10; BR11; Put94; LL12; AA+15; BD17])....


Journal ArticleDOI
TL;DR: In this article, an efficient method based on linear programming for approximating solutions to large-scale stochastic control problems is proposed, with error bounds that guide the selection of basis functions and state-relevance weights; experiments in queueing network control provide empirical support for the methodology.
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach "fits" a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. We develop error bounds that offer performance guarantees and also guide the selection of both basis functions and "state-relevance weights" that influence quality of the approximation. Experimental results in the domain of queueing network control provide empirical support for the methodology.
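To illustrate the "fit" and the role of the state-relevance weights described above, here is a hedged sketch on a tiny synthetic MDP, stated for reward maximization (symmetric to the cost-to-go setting in the paper): the ALP is solved for two different weight vectors c and the c-weighted error against the exact value function is reported. All data, sizes, and the specific weight choices are assumptions for the example.

```python
# A hedged sketch of fitting basis functions via the ALP and measuring the
# weighted approximation error. Reward-maximization variant; data are synthetic.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
nS, nA, k, gamma = 15, 2, 4, 0.9
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nA, nS))
Phi = np.hstack([np.ones((nS, 1)), rng.random((nS, k - 1))])  # constant feature + random ones

V = np.zeros(nS)                      # exact values via value iteration, for reference
for _ in range(2000):
    V = (R + gamma * P @ V).max(axis=0)

def alp_fit(c):
    # min_w c^T Phi w  s.t.  Phi w >= r_a + gamma * P_a Phi w  for every action a
    A_ge = np.vstack([Phi - gamma * P[a] @ Phi for a in range(nA)])
    b_ge = np.concatenate([R[a] for a in range(nA)])
    res = linprog(c @ Phi, A_ub=-A_ge, b_ub=-b_ge,
                  bounds=[(None, None)] * k, method="highs")
    return Phi @ res.x

for c in (np.full(nS, 1.0 / nS), rng.dirichlet(np.ones(nS))):
    print("c-weighted approximation error:", c @ np.abs(alp_fit(c) - V))
```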

643 citations

Journal ArticleDOI
TL;DR: This paper presents two approximate solution algorithms that exploit structure in factored MDPs by using an approximate value function represented as a linear combination of basis functions, where each basis function involves only a small subset of the domain variables.
Abstract: This paper addresses the problem of planning under uncertainty in large Markov Decision Processes (MDPs). Factored MDPs represent a complex state space using state variables and the transition model using a dynamic Bayesian network. This representation often allows an exponential reduction in the representation size of structured MDPs, but the complexity of exact solution algorithms for such MDPs can grow exponentially in the representation size. In this paper, we present two approximate solution algorithms that exploit structure in factored MDPs. Both use an approximate value function represented as a linear combination of basis functions, where each basis function involves only a small subset of the domain variables. A key contribution of this paper is that it shows how the basic operations of both algorithms can be performed efficiently in closed form, by exploiting both additive and context-specific structure in a factored MDP. A central element of our algorithms is a novel linear program decomposition technique, analogous to variable elimination in Bayesian networks, which reduces an exponentially large LP to a provably equivalent, polynomial-sized one. One algorithm uses approximate linear programming, and the second approximate dynamic programming. Our dynamic programming algorithm is novel in that it uses an approximation based on max-norm, a technique that more directly minimizes the terms that appear in error bounds for approximate MDP algorithms. We provide experimental results on problems with over $10^{40}$ states, demonstrating a promising indication of the scalability of our approach, and compare our algorithm to an existing state-of-the-art approach, showing, in some problems, exponential gains in computation time.
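A small sketch of the value-function representation described above: a linear combination of basis functions, each depending only on a small subset (scope) of the state variables, so the function can be stored and evaluated without ever enumerating the exponential joint state space. The scopes, tables, and weights below are made up for illustration and do not reproduce the paper's LP decomposition or variable-elimination machinery.

```python
# A sketch of a factored linear value function: V(x) = sum_i w_i * h_i(x[scope_i]),
# where each basis function h_i depends only on a small subset of the variables.
# Scopes, tables, and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)
n_vars = 40                                   # 2^40 joint states, never enumerated

# Each basis function: (scope of variable indices, table over its joint assignments).
basis = []
for i in range(0, n_vars, 2):
    scope = (i, (i + 1) % n_vars)             # pairwise scopes over binary variables
    table = rng.random((2, 2))                # h_i over the 4 assignments of its scope
    basis.append((scope, table))
w = rng.random(len(basis))                    # weights of the linear combination

def value(x, basis, w):
    """Evaluate V(x) by summing each weighted basis function on its own scope."""
    return sum(wi * table[tuple(x[j] for j in scope)]
               for wi, (scope, table) in zip(w, basis))

x = rng.integers(0, 2, size=n_vars)           # one state out of 2^40
print("V(x) =", value(x, basis, w))
```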

503 citations