Proceedings ArticleDOI

Policy Gradient Reinforcement Learning for Solving Supply-Chain Management Problems

TL;DR: This work tackles the general stochastic supply-chain management problem by formulating it as a non-contextual multi-armed bandit problem and then taking a policy gradient descent approach (a Reinforcement Learning approach) to find a robust policy.
Abstract: Supply-chain management problems are common across industries, and it is becoming increasingly necessary to handle uncertainty in decision-making due to the rapid rise in production and consumption levels and the shortening of product life cycles. In our work, we tackle the general stochastic supply-chain management problem by formulating it as a non-contextual multi-armed bandit problem and then taking a policy gradient descent approach (a Reinforcement Learning approach) to find a robust policy. The gradient descent is guided by the cost reported by a simulator that models demand, lead times, and other uncertainties. Our experiments demonstrate that this approach finds better solutions than naive worst-case linear programming solutions to such problems.
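The paper's implementation is not reproduced on this page; the sketch below is only a minimal illustration of the approach the abstract describes: a softmax policy over discrete order quantities, trained by score-function (REINFORCE) policy gradient descent against a simulated cost. The newsvendor-style simulator and all constants are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): a non-contextual multi-armed
# bandit over discrete order quantities, trained with the score-function
# (REINFORCE) policy gradient to descend a simulated expected cost.
import numpy as np

rng = np.random.default_rng(0)
ORDER_LEVELS = np.arange(0, 21)          # candidate order quantities (arms)
theta = np.zeros(len(ORDER_LEVELS))      # softmax policy parameters

def simulate_cost(order_qty):
    """Toy stochastic simulator: holding cost vs. shortage cost."""
    demand = rng.poisson(10)
    return 1.0 * max(order_qty - demand, 0) + 4.0 * max(demand - order_qty, 0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr, baseline = 0.05, 0.0
for step in range(5000):
    probs = softmax(theta)
    a = rng.choice(len(ORDER_LEVELS), p=probs)
    cost = simulate_cost(ORDER_LEVELS[a])
    baseline += 0.01 * (cost - baseline)        # running baseline cuts variance
    grad_log = -probs
    grad_log[a] += 1.0                          # gradient of log softmax prob
    theta -= lr * (cost - baseline) * grad_log  # descend the expected cost

print("learned order quantity:", ORDER_LEVELS[softmax(theta).argmax()])
```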
Citations
Journal ArticleDOI
TL;DR: In this article, a semi-systematic literature review explores the current state of the art of reinforcement learning in supply chain management (SCM) and proposes a classification framework, which classifies academic papers based on supply chain drivers, algorithms, data sources, and industrial sectors.
Abstract: Decision-making in supply chains is challenged by high complexity, a combination of continuous and discrete processes, integrated and interdependent operations, dynamics, and adaptability. Rapidly increasing data availability, computing power, and intelligent algorithms unveil new potential for adaptive, data-driven decision-making. Reinforcement Learning, a class of machine learning algorithms, is one such data-driven method. This semi-systematic literature review explores the current state of the art of reinforcement learning in supply chain management (SCM) and proposes a classification framework. The framework classifies academic papers based on supply chain drivers, algorithms, data sources, and industrial sectors. The review revealed a few critical insights. First, the classic Q-learning algorithm is still the most popular one. Second, inventory management is the most common application of reinforcement learning in supply chains, as it is a pivotal element of supply chain synchronisation. Last, most reviewed papers address toy-like SCM problems driven by artificial data. Shifting to industry-scale problems will therefore be a crucial challenge in the coming years. If this shift succeeds, the vision of real-time, data-driven decision-making could become a reality.
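The review singles out classic Q-learning as the most popular algorithm and inventory management as the dominant application. Purely as an illustration of that combination, a minimal tabular Q-learning loop on a toy single-item inventory MDP might look like the following; the environment, costs, and hyperparameters are our assumptions, not taken from any reviewed paper.

```python
# Illustrative only: tabular Q-learning on a toy single-item inventory MDP,
# echoing the review's finding that classic Q-learning dominates SCM work.
import numpy as np

rng = np.random.default_rng(1)
MAX_INV, MAX_ORDER = 20, 10
Q = np.zeros((MAX_INV + 1, MAX_ORDER + 1))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(inv, order):
    """One period: receive the order, serve random demand, collect reward."""
    inv = min(inv + order, MAX_INV)
    demand = rng.poisson(5)
    sold = min(inv, demand)
    reward = 5.0 * sold - 1.0 * (inv - sold) - 2.0 * order  # revenue - holding - purchase
    return inv - sold, reward

inv = 0
for t in range(100_000):
    a = rng.integers(MAX_ORDER + 1) if rng.random() < eps else int(Q[inv].argmax())
    next_inv, r = step(inv, a)
    Q[inv, a] += alpha * (r + gamma * Q[next_inv].max() - Q[inv, a])  # TD update
    inv = next_inv

print("greedy order when stock is empty:", int(Q[0].argmax()))
```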

14 citations

Journal ArticleDOI
01 Jan 2020
TL;DR: A reinforcement-learning-integrated heuristic search method (RLIH) is proposed for self-driving vehicles using blockchain in supply chain management, combining the advantages of reinforcement learning and heuristic search.
Abstract: Blockchain is a distributed open (public) ledger used to record transactions across many computers. Blockchain technology can be applied in domains such as banking, healthcare, real estate, travel, food, and supply chains. In supply chain management, Artificial Intelligence (AI) and Machine Learning (ML) algorithms are also integrated with blockchain technology to train self-driving vehicles. In this paper, we propose a reinforcement-learning-integrated heuristic search method (RLIH) for self-driving vehicles using blockchain in supply chain management, combining the advantages of reinforcement learning and heuristic search. RLIH is implemented as a decentralized app, and results show that the proposed method outperforms the existing heuristic search method in terms of service time and data traffic.
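The abstract gives no algorithmic detail for RLIH. One generic way to combine reinforcement learning with heuristic search, shown here strictly as an assumption rather than the paper's method, is to seed Q-values with a search heuristic so that early exploration follows it; a toy routing example:

```python
# Assumption, not the paper's RLIH: Q-learning on a toy grid-routing task
# whose Q-values are initialized from a Manhattan-distance heuristic, so
# greedy actions follow the heuristic until experience overrides it.
import numpy as np

N = 8                                    # N x N grid, goal at (N-1, N-1)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
GOAL = (N - 1, N - 1)

def heuristic(s):
    return -(abs(GOAL[0] - s[0]) + abs(GOAL[1] - s[1]))  # negative Manhattan distance

# Seed Q with the heuristic value of each action's successor state.
Q = np.zeros((N, N, len(ACTIONS)))
for x in range(N):
    for y in range(N):
        for i, (dx, dy) in enumerate(ACTIONS):
            nx, ny = min(max(x + dx, 0), N - 1), min(max(y + dy, 0), N - 1)
            Q[x, y, i] = heuristic((nx, ny))

rng = np.random.default_rng(2)
alpha, gamma, eps = 0.2, 0.95, 0.1
for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        i = rng.integers(4) if rng.random() < eps else int(Q[s].argmax())
        dx, dy = ACTIONS[i]
        s2 = (min(max(s[0] + dx, 0), N - 1), min(max(s[1] + dy, 0), N - 1))
        r = 0.0 if s2 == GOAL else -1.0              # unit step cost until goal
        Q[s][i] += alpha * (r + gamma * Q[s2].max() - Q[s][i])
        s = s2
```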

13 citations

Book ChapterDOI
01 Jan 2021
TL;DR: This paper presents the application of policy-based reinforcement learning algorithms to the control of a centralized, distributed inventory management system, and compares these approaches to an existing approach involving a fixed replenishment scheme.
Abstract: This paper presents our approach to controlling a centralized, distributed inventory management system using reinforcement learning (RL). We propose the application of policy-based reinforcement learning algorithms to tackle this problem effectively. We formulate the problem as a Markov decision process (MDP) and create an environment that tracks multiple products across multiple warehouses, returning a reward signal that directly corresponds to the total revenue across all warehouses at every time step. In this environment, we apply various policy-based reinforcement learning algorithms, such as Advantage Actor-Critic, Trust Region Policy Optimization, and Proximal Policy Optimization, to decide the amount of each product to be stocked in every warehouse. We evaluate the performance of these algorithms in maximizing average revenue over time, sampling demand at each time step of each training episode from various statistical distributions. We also compare these approaches to an existing approach involving a fixed replenishment scheme. In conclusion, we elaborate upon the results of our evaluation and the scope for future work on the topic.
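As a sketch of the setup this abstract describes (a multi-product, multi-warehouse replenishment MDP whose reward is revenue, handed to an off-the-shelf policy-based algorithm), assuming gymnasium and stable-baselines3 (v2+, gymnasium API) are available; all sizes, prices, and costs are illustrative, not the paper's environment:

```python
# Sketch under stated assumptions (gymnasium + stable-baselines3 >= 2.0):
# a multi-product, multi-warehouse replenishment MDP whose reward is
# revenue net of ordering and holding costs, solved with PPO.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

N_WAREHOUSES, N_PRODUCTS, MAX_STOCK = 2, 3, 50

class InventoryEnv(gym.Env):
    def __init__(self):
        shape = (N_WAREHOUSES, N_PRODUCTS)
        self.observation_space = spaces.Box(0, MAX_STOCK, shape, dtype=np.float32)
        self.action_space = spaces.Box(0, 10, shape, dtype=np.float32)  # order qty
        self.rng = np.random.default_rng(0)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.stock = np.zeros((N_WAREHOUSES, N_PRODUCTS), dtype=np.float32)
        self.t = 0
        return self.stock.copy(), {}

    def step(self, action):
        order = np.clip(action, 0, 10)
        self.stock = np.minimum(self.stock + order, MAX_STOCK)
        demand = self.rng.poisson(4, self.stock.shape)
        sold = np.minimum(self.stock, demand)
        self.stock = (self.stock - sold).astype(np.float32)
        # Reward mirrors the abstract's signal: total revenue across warehouses.
        reward = float((3.0 * sold - 1.0 * order - 0.1 * self.stock).sum())
        self.t += 1
        return self.stock.copy(), reward, False, self.t >= 100, {}

model = PPO("MlpPolicy", InventoryEnv(), verbose=0)
model.learn(total_timesteps=50_000)
```

Swapping in A2C requires only changing the algorithm class; TRPO is available in the separate sb3-contrib package.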

1 citation

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors investigated a group of agents working concurrently to solve similar combinatorial bandit problems while maintaining quality constraints, and proposed a Privacy-preserving Federated Combinatorial Bandit algorithm, P-FCB.
Abstract: There is a rapid increase in the cooperative learning paradigm in online learning settings, i.e., federated learning (FL). Unlike most FL settings, there are many situations in which the agents are competitive. Each agent would like to learn from others, but part of the information it shares for others to learn from could be sensitive; thus, it desires privacy. This work investigates a group of agents working concurrently to solve similar combinatorial bandit problems while maintaining quality constraints. Can these agents collectively learn while keeping their sensitive information confidential by employing differential privacy? We observe that communicating can reduce regret. However, differential privacy techniques for protecting sensitive information make the data noisy and may hurt rather than help regret. Hence, we note that it is essential to decide when to communicate and what shared data to learn from in order to strike a functional balance between regret and privacy. For such a federated combinatorial MAB setting, we propose a Privacy-preserving Federated Combinatorial Bandit algorithm, P-FCB. We illustrate the efficacy of P-FCB through simulations. We further show that our algorithm improves regret while upholding the quality threshold and providing meaningful privacy guarantees.
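P-FCB's actual protocol is not reproduced here; the sketch below only illustrates the core idea the abstract names: agents privatize the statistics they share before federated aggregation. The noise scale, rewards, and aggregation rule are assumptions, and privacy-budget composition across rounds is ignored.

```python
# Illustration only, not P-FCB itself: competing agents solve the same
# stochastic bandit, and each privatizes the per-arm reward sums it shares
# with Laplace noise before federated aggregation (counts are shared in
# the clear here for simplicity).
import numpy as np

rng = np.random.default_rng(3)
N_AGENTS, N_ARMS, EPSILON = 5, 4, 1.0
true_means = rng.uniform(0, 1, N_ARMS)     # Bernoulli rewards, bounded in [0, 1]

sums = np.zeros((N_AGENTS, N_ARMS))        # local per-arm reward sums
counts = np.zeros((N_AGENTS, N_ARMS))      # local per-arm pull counts

def local_round(agent, estimates):
    """Epsilon-greedy pull against the current federated estimates."""
    a = rng.integers(N_ARMS) if rng.random() < 0.1 else int(np.argmax(estimates))
    sums[agent, a] += float(rng.random() < true_means[a])
    counts[agent, a] += 1

def private_share(agent):
    """Laplace mechanism: a [0,1]-bounded sum has sensitivity 1."""
    return sums[agent] + rng.laplace(0.0, 1.0 / EPSILON, N_ARMS), counts[agent]

estimates = np.full(N_ARMS, 0.5)
for _ in range(200):
    for agent in range(N_AGENTS):
        local_round(agent, estimates)
    shared = [private_share(agent) for agent in range(N_AGENTS)]
    total_sum = sum(s for s, _ in shared)
    total_cnt = sum(c for _, c in shared)
    estimates = total_sum / np.maximum(total_cnt, 1)   # noisy federated means

print("estimated:", np.round(estimates, 2), "true:", np.round(true_means, 2))
```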
References
Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations

Journal ArticleDOI

1,577 citations

Journal ArticleDOI
TL;DR: It is shown that the structure of the optimal robust policy is of the same base-stock character as the optimal stochastic policy for a wide range of inventory problems in single installations, series systems, and general supply chains.
Abstract: We propose a general methodology based on robust optimization to address the problem of optimally controlling a supply chain subject to stochastic demand in discrete time. This problem has been studied in the past using dynamic programming, which suffers from dimensionality problems and assumes full knowledge of the demand distribution. The proposed approach takes into account the uncertainty of the demand in the supply chain without assuming a specific distribution, while remaining highly tractable and providing insight into the corresponding optimal policy. It also allows adjustment of the level of robustness of the solution to trade off performance and protection against uncertainty. An attractive feature of the proposed approach is its numerical tractability, especially when compared to multidimensional dynamic programming problems in complex supply chains, as the robust problem is of the same difficulty as the nominal problem, that is, a linear programming problem when there are no fixed costs, and a mixed-integer programming problem when fixed costs are present. Furthermore, we show that the optimal policy obtained in the robust approach is identical to the optimal policy obtained in the nominal case for a modified and explicitly computable demand sequence. In this way, we show that the structure of the optimal robust policy is of the same base-stock character as the optimal stochastic policy for a wide range of inventory problems in single installations, series systems, and general supply chains. Preliminary computational results are very promising.
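Schematically, and in our notation rather than the paper's, the robust counterpart for a single installation with a budget-of-uncertainty demand set takes the following epigraph form:

```latex
% Schematic robust counterpart (our notation): order quantities u_k,
% starting inventory x_0, unit costs c (ordering), h (holding), p (backlog),
% and demands w constrained to a budget-of-uncertainty set W.
\begin{align*}
\min_{u \ge 0,\, y} \quad & \sum_{k=0}^{T-1} \left( c\, u_k + y_k \right) \\
\text{s.t.} \quad & y_k \ge h \Big( x_0 + \sum_{i=0}^{k} (u_i - w_i) \Big)
  \quad \forall\, w \in \mathcal{W}, \\
& y_k \ge -\,p \Big( x_0 + \sum_{i=0}^{k} (u_i - w_i) \Big)
  \quad \forall\, w \in \mathcal{W}, \\
\mathcal{W} &= \Big\{ w : w_k = \bar{w}_k + \hat{w}_k z_k,\ |z_k| \le 1,\
  \textstyle\sum_{i=0}^{k} |z_i| \le \Gamma_k \Big\}.
\end{align*}
```

With no fixed ordering costs this becomes a linear program once the inner maximization over W is dualized, consistent with the tractability claim above; fixed costs add binary variables, giving the mixed-integer case.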

619 citations

Journal ArticleDOI
TL;DR: Techniques for optimizing stochastic discrete-event systems via simulation, including perturbation analysis, the likelihood ratio method, and frequency domain experimentation, are reviewed.
Abstract: We review techniques for optimizing stochastic discrete-event systems via simulation. We discuss both the discrete-parameter case and the continuous-parameter case, but concentrate on the latter, which has dominated most of the recent research in the area. For the discrete-parameter case, we focus on techniques for optimization from a finite set: multiple-comparison procedures and ranking-and-selection procedures. For the continuous-parameter case, we focus on gradient-based methods, including perturbation analysis, the likelihood ratio method, and frequency domain experimentation. For illustrative purposes, we compare and contrast the implementation of the techniques for some simple discrete-event systems such as the (s, S) inventory system and the GI/G/1 queue. Finally, we speculate on future directions for the field, particularly in the context of the rapid advances being made in parallel computing.
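For reference, the likelihood ratio (score function) identity the abstract mentions, in its standard form and our notation: the gradient of an expected cost moves onto the log-density, so it can be estimated from simulated sample paths without differentiating the cost itself.

```latex
% Standard likelihood-ratio / score-function gradient estimator:
\nabla_\theta\, \mathbb{E}_{X \sim f_\theta}\!\left[ C(X) \right]
  = \mathbb{E}_{X \sim f_\theta}\!\left[ C(X)\, \nabla_\theta \log f_\theta(X) \right]
  \approx \frac{1}{N} \sum_{n=1}^{N} C(x_n)\, \nabla_\theta \log f_\theta(x_n),
  \qquad x_n \sim f_\theta .
```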

444 citations

Journal ArticleDOI
TL;DR: Simulation-based methods for estimating sensitivities of inventory costs with respect to policy parameters are developed, and these estimates are shown to converge to the correct values for finite-horizon and infinite-horizon discounted and average cost criteria.
Abstract: Effective management of inventories in large-scale production and distribution systems requires methods for bringing model solutions closer to the complexities of real systems. Motivated by this need, we develop simulation-based methods for estimating sensitivities of inventory costs with respect to policy parameters. These sensitivity estimates are useful in adjusting optimal parameters predicted by a simplified model to complexities that can be incorporated in a simulation. We consider capacitated, multiechelon systems operating under base-stock policies and develop estimators of derivatives with respect to base-stock levels. We show that these estimates converge to the correct value for finite-horizon and infinite-horizon discounted and average cost criteria. Our methods are easy to implement and experiments suggest that they converge quickly. We illustrate their use by optimizing base-stock levels for a subsystem of the PC assembly and distribution system of a major computer manufacturer.
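As a sketch of the infinitesimal perturbation analysis (IPA) idea behind such estimators, reduced to the simplest case (a single-stage, zero-lead-time base-stock policy; the paper's capacitated multiechelon setting is not reproduced here), with an illustrative demand model:

```python
# IPA sketch, not the paper's estimator: with zero lead time and order-up-to
# level s, one period's cost is h*(s-D)^+ + p*(D-s)^+, whose pathwise
# derivative w.r.t. s is h*1{D < s} - p*1{D > s}. We plug that estimate
# into a stochastic approximation loop to optimize s.
import numpy as np

rng = np.random.default_rng(4)
h, p = 1.0, 4.0

def period_cost_and_ipa(s):
    d = rng.exponential(10.0)                     # illustrative demand model
    cost = h * max(s - d, 0.0) + p * max(d - s, 0.0)
    dcost_ds = h if d < s else -p                 # pathwise (IPA) derivative
    return cost, dcost_ds

s = 5.0
for t in range(1, 20_001):
    _, g = period_cost_and_ipa(s)
    s -= (10.0 / t) * g                           # diminishing step size

# For this cost the optimum is the p/(h+p) demand quantile (newsvendor).
print(f"IPA-optimized base stock: {s:.2f}")
print(f"newsvendor target:        {-10.0 * np.log(1 - p / (h + p)):.2f}")
```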

286 citations