Proceedings Article

Trust Region Policy Optimization

06 Jul 2015 - pp. 1889-1897
TL;DR: A practical algorithm for optimizing control policies with guaranteed monotonic improvement, called Trust Region Policy Optimization (TRPO), obtained by making several approximations to a theoretically justified scheme.
Abstract: In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
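
The core idea in the abstract — maximize a surrogate objective for the new policy while constraining how far it moves from the old policy, measured by KL divergence — can be illustrated in a few lines. The sketch below is a deliberately crude stand-in (a finite-difference gradient plus backtracking instead of the paper's conjugate-gradient step and line search), with a tabular softmax policy and synthetic advantages; all names and constants are illustrative assumptions, not the paper's implementation.

```python
# Minimal trust-region sketch: maximize a surrogate objective subject to a
# mean-KL constraint, with a tabular softmax policy and synthetic data.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def surrogate(theta, theta_old, states, actions, advantages):
    """L(theta) = E[ pi_theta(a|s) / pi_theta_old(a|s) * A ]."""
    idx = np.arange(len(actions))
    ratios = softmax(theta[states])[idx, actions] / softmax(theta_old[states])[idx, actions]
    return np.mean(ratios * advantages)

def mean_kl(theta_old, theta, states):
    p_old = softmax(theta_old[states])
    p_new = softmax(theta[states])
    return np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1))

rng = np.random.default_rng(0)
n_states, n_actions, delta = 5, 3, 0.01        # delta = trust-region size
theta = rng.normal(size=(n_states, n_actions)) # policy logits
states = rng.integers(0, n_states, size=256)
actions = rng.integers(0, n_actions, size=256)
advantages = rng.normal(size=256)              # stand-in for estimated advantages

# Crude update: follow a finite-difference gradient of the surrogate and
# backtrack until the mean KL constraint is satisfied (TRPO proper uses a
# conjugate-gradient step followed by a line search).
grad = np.zeros_like(theta)
eps = 1e-5
base = surrogate(theta, theta, states, actions, advantages)
for idx in np.ndindex(theta.shape):
    t = theta.copy(); t[idx] += eps
    grad[idx] = (surrogate(t, theta, states, actions, advantages) - base) / eps

step = 1.0
while mean_kl(theta, theta + step * grad, states) > delta:
    step *= 0.5
theta_new = theta + step * grad
print("surrogate gain:", surrogate(theta_new, theta, states, actions, advantages) - base)
print("mean KL:", mean_kl(theta, theta_new, states))
```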


Citations
Posted Content
TL;DR: A new family of policy gradient methods for reinforcement learning is proposed, which alternates between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent.
Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
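
As a concrete illustration of the "surrogate" objective mentioned above, the sketch below computes the clipped surrogate used by the most common PPO variant; the clipping keeps several epochs of minibatch gradient updates close to the data-collecting policy. Inputs are synthetic placeholders, and the epsilon value is just a customary default, not a claim about the paper's experiments.

```python
# Clipped surrogate objective (one common PPO variant) on synthetic inputs.
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L^CLIP = E[ min(r * A, clip(r, 1 - eps, 1 + eps) * A) ]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

rng = np.random.default_rng(0)
ratio = np.exp(rng.normal(scale=0.3, size=1024))   # pi_new(a|s) / pi_old(a|s)
advantage = rng.normal(size=1024)                  # estimated advantages

print("clipped surrogate:", ppo_clip_objective(ratio, advantage))
# In PPO this objective (plus value and entropy terms) is maximized with
# stochastic gradient ascent for several epochs over minibatches drawn from
# the same batch of collected trajectories.
```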

9,020 citations

Proceedings Article
06 Aug 2017
TL;DR: An algorithm for meta-learning is proposed that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning.
Abstract: We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples. In our approach, the parameters of the model are explicitly trained such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. In effect, our method trains the model to be easy to fine-tune. We demonstrate that this approach leads to state-of-the-art performance on two few-shot image classification benchmarks, produces good results on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
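
The following sketch illustrates the inner/outer-loop mechanics described above on toy linear-regression tasks, using the first-order approximation for brevity (it ignores second derivatives and does not reproduce the paper's few-shot results); the task construction, step sizes, and names are illustrative assumptions.

```python
# First-order MAML-style sketch: meta-train parameters so that one inner
# gradient step on a new task's small support set already fits that task.
import numpy as np

rng = np.random.default_rng(0)
dim, inner_lr, outer_lr = 5, 0.1, 0.01
theta = np.zeros(dim)                      # meta-parameters

def sample_task():
    """Each task is y = x @ w_task with its own random weights."""
    w = rng.normal(size=dim)
    X = rng.normal(size=(10, dim))
    return X, X @ w

def grad_mse(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

for _ in range(2000):
    meta_grad = np.zeros(dim)
    for _ in range(4):                                    # small batch of tasks
        X, y = sample_task()
        Xs, ys, Xq, yq = X[:5], y[:5], X[5:], y[5:]       # support / query split
        adapted = theta - inner_lr * grad_mse(theta, Xs, ys)  # inner step
        meta_grad += grad_mse(adapted, Xq, yq)            # first-order outer grad
    theta -= outer_lr * meta_grad / 4

# Adapt to a held-out task with one inner step and report its query error.
X, y = sample_task()
adapted = theta - inner_lr * grad_mse(theta, X[:5], y[:5])
print("query MSE after one inner step:", float(np.mean((X[5:] @ adapted - y[5:]) ** 2)))
```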

7,027 citations


Cites methods from "Trust Region Policy Optimization"

  • ...The gradient updates are computed using vanilla policy gradient (REINFORCE) (Williams, 1992), and we use trust-region policy optimization (TRPO) as the meta-optimizer (Schulman et al., 2015)....

  • ...Practical implementations of this method may also use a variety of improvements recently proposed for policy gradient algorithms, including state or action-dependent baselines and trust regions (Schulman et al., 2015)....

Proceedings Article
19 Jun 2016
TL;DR: A conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent to optimize deep neural network controllers; the asynchronous actor-critic variant is shown to succeed on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes from visual input.
Abstract: We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
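
The sketch below shows, in a single process, the advantage actor-critic update that each asynchronous worker would apply to the shared parameters: collect a short n-step rollout, bootstrap the return from the value estimate, and update both the policy and the value function with the resulting advantage. The toy MDP and hyperparameters are illustrative assumptions; the parallel, asynchronous machinery itself is omitted.

```python
# Single-process advantage actor-critic update on a toy chain MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
policy_logits = np.zeros((n_states, n_actions))
values = np.zeros(n_states)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step_env(s, a):
    """Toy chain: action 1 moves right (reward 1 at the last state), action 0 resets."""
    if a == 1:
        s2 = min(s + 1, n_states - 1)
        return s2, float(s2 == n_states - 1)
    return 0, 0.0

s = 0
for update in range(5000):
    # Collect a short n-step rollout (A3C uses t_max steps per worker).
    traj = []
    for _ in range(5):
        probs = softmax(policy_logits[s])
        a = rng.choice(n_actions, p=probs)
        s2, r = step_env(s, a)
        traj.append((s, a, r))
        s = s2
    # Bootstrap the n-step return from the value estimate of the last state.
    R = values[s]
    for (st, at, rt) in reversed(traj):
        R = rt + gamma * R
        advantage = R - values[st]
        # Policy gradient on log pi(a|s) * advantage, plus a value regression step.
        probs = softmax(policy_logits[st])
        grad_logp = -probs
        grad_logp[at] += 1.0
        policy_logits[st] += 0.1 * advantage * grad_logp
        values[st] += 0.1 * advantage

print("learned policy (prob of moving right):",
      [round(float(softmax(policy_logits[i])[1]), 2) for i in range(n_states)])
```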

6,736 citations


Cites background or methods from "Trust Region Policy Optimization"

  • ...By storing the agent’s data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps....

  • ...The asynchronous advantage actor-critic method could be potentially improved by using other ways of estimating the advantage function, such as generalized advantage estimation of (Schulman et al., 2015b)....

  • ...Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a)....

Proceedings ArticleDOI
20 Mar 2017
TL;DR: This paper explores domain randomization, a simple technique for training models on simulated images that transfer to real images by randomizing rendering in the simulator, and achieves the first successful transfer of a deep neural network trained only on simulated RGB images to the real world for the purpose of robotic control.
Abstract: Bridging the ‘reality gap’ that separates simulated robotics from experiments on hardware could accelerate robotic research through improved data availability. This paper explores domain randomization, a simple technique for training models on simulated images that transfer to real images by randomizing rendering in the simulator. With enough variability in the simulator, the real world may appear to the model as just another variation. We focus on the task of object localization, which is a stepping stone to general robotic manipulation skills. We find that it is possible to train a real-world object detector that is accurate to 1.5 cm and robust to distractors and partial occlusions using only data from a simulator with non-realistic random textures. To demonstrate the capabilities of our detectors, we show they can be used to perform grasping in a cluttered environment. To our knowledge, this is the first successful transfer of a deep neural network trained only on simulated RGB images (without pre-training on real images) to the real world for the purpose of robotic control.
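
In code, the technique described above amounts to resampling rendering parameters for every simulated training image. The sketch below shows one hypothetical sampler; the parameter names and ranges are illustrative assumptions rather than the paper's exact randomization settings, and render()/detector.update() are placeholders, not real APIs.

```python
# Domain-randomization sketch: draw a fresh rendering configuration for each
# simulated image so the real world looks like "just another variation".
import random

def sample_render_config(num_distractors_max=10):
    return {
        "textures": [random.choice(["checker", "noise", "flat"]) for _ in range(5)],
        "camera_position_jitter": [random.uniform(-0.05, 0.05) for _ in range(3)],
        "light_positions": [[random.uniform(-1, 1) for _ in range(3)]
                            for _ in range(random.randint(1, 4))],
        "num_distractors": random.randint(0, num_distractors_max),
    }

# Training-loop outline (render() and detector.update() are placeholders for
# a simulator and an object-localization model, not real APIs):
# for step in range(num_steps):
#     cfg = sample_render_config()
#     image, object_position = render(scene, cfg)
#     detector.update(image, object_position)
print(sample_render_config())
```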

2,079 citations


Cites result from "Trust Region Policy Optimization"

  • ...Learning in simulation is especially promising for building on recent results using deep reinforcement learning to achieve human-level performance on tasks like Atari [29] and robotic control [26], [41]....

Posted Content
TL;DR: This paper builds on Double Q-learning by taking the minimum value between a pair of critics to limit overestimation, and draws the connection between target networks and overestimation bias.
Abstract: In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
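
A minimal sketch of the two mechanisms named in the abstract follows: the critic target takes the minimum over a pair of target critics to limit overestimation, and the actor is updated less often than the critics ("delayed" policy updates). Networks are stubbed out with simple callables, and all names and hyperparameters are illustrative assumptions.

```python
# Clipped double-Q target plus delayed policy updates, with stub networks.
import numpy as np

rng = np.random.default_rng(0)
gamma, policy_delay = 0.99, 2

# Stand-ins for target critics Q1', Q2' and the target actor pi'.
q1_target = lambda s, a: (s * a).sum(axis=-1)
q2_target = lambda s, a: (s * a).sum(axis=-1) + rng.normal(scale=0.1, size=len(s))
pi_target = lambda s: np.tanh(s)

def critic_target(reward, next_state, done):
    next_action = pi_target(next_state)
    # Clipped double-Q: use the smaller of the two target-critic estimates.
    q_min = np.minimum(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))
    return reward + gamma * (1.0 - done) * q_min

for update_step in range(6):
    batch = dict(reward=rng.normal(size=4),
                 next_state=rng.normal(size=(4, 3)),
                 done=np.zeros(4))
    y = critic_target(**batch)      # both critics regress toward this target
    if update_step % policy_delay == 0:
        pass                        # actor and target networks would be updated here
    print(update_step, "target mean:", round(float(y.mean()), 3))
```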

1,968 citations


Cites methods from "Trust Region Policy Optimization"

  • ...We compare our algorithm against DDPG (Lillicrap et al., 2015) as well as the state of art policy gradient algorithms: PPO (Schulman et al., 2017), ACKTR (Wu et al., 2017) and TRPO (Schulman et al., 2015), as implemented by OpenAI’s baselines repository (Dhariwal et al., 2017), and SAC (Haarnoja et al., 2018), as implemented by the author’s GitHub1....

References
Book
01 May 1995
TL;DR: The leading and most up-to-date textbook on the far-ranging algorithmic methodology of Dynamic Programming, which can be used for optimal control, Markovian decision problems, planning and sequential decision making under uncertainty, and discrete/combinatorial optimization.

Abstract: The leading and most up-to-date textbook on the far-ranging algorithmic methodology of Dynamic Programming, which can be used for optimal control, Markovian decision problems, planning and sequential decision making under uncertainty, and discrete/combinatorial optimization. The treatment focuses on basic unifying themes, and conceptual foundations. It illustrates the versatility, power, and generality of the method with many examples and applications from engineering, operations research, and other fields. It also addresses extensively the practical application of the methodology, possibly through the use of approximations, and provides an extensive treatment of the far-reaching methodology of Neuro-Dynamic Programming/Reinforcement Learning.
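
Since the book's subject is dynamic programming for Markovian decision problems, a minimal value-iteration sketch may help fix ideas; the tiny MDP below is made up for the example and is not taken from the book.

```python
# Value iteration on a small random MDP.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # expected rewards

V = np.zeros(n_states)
for _ in range(200):
    # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print("optimal values:", np.round(V, 3), "greedy policy:", Q.argmax(axis=1))
```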

10,834 citations

Posted Content
TL;DR: This work presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning; it outperforms all previous approaches on six of the seven Atari games tested and surpasses a human expert on three of them.
Abstract: We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
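
The model described above is trained toward the one-step Q-learning target r + γ·max_a Q(s′, a). The sketch below shows that target and the corresponding update with a linear stand-in for the network and synthetic transitions; preprocessing, experience replay, and the convolutional architecture are omitted, and all constants are illustrative assumptions.

```python
# Q-learning target and update with a linear stand-in for the Q-network.
import numpy as np

rng = np.random.default_rng(0)
n_actions, gamma, lr = 4, 0.99, 1e-3
W = np.zeros((8, n_actions))           # stand-in for the Q-network parameters

def q_values(states):                  # Q(s, .) for a batch of states
    return states @ W

for _ in range(100):
    s = rng.normal(size=(32, 8))       # synthetic transitions (s, a, r, s')
    a = rng.integers(0, n_actions, size=32)
    r = rng.normal(size=32)
    s_next = rng.normal(size=(32, 8))

    target = r + gamma * q_values(s_next).max(axis=1)       # Bellman target
    pred = q_values(s)[np.arange(32), a]
    td_error = target - pred
    # Gradient step on the squared TD error w.r.t. the chosen action's weights.
    for i in range(32):
        W[:, a[i]] += lr * td_error[i] * s[i]

print("mean |TD error| on last batch:", float(np.mean(np.abs(td_error))))
```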

8,757 citations


"Trust Region Policy Optimization" refers methods in this paper

  • ...We tested our algorithms on the same seven games reported on in (Mnih et al., 2013) and (Guo et al., 2014), which are made available through the Arcade Learning Environment (Bellemare et al., 2013). The images were preprocessed following the protocol in Mnih et al. (2013), and the policy was represented by the convolutional neural network shown in Figure 3, with two convolutional layers with 16…...

  • ...The results of the vine and single path algorithms are summarized in Table 1, which also includes an expert human performance and two recent methods: deep Q-learning (Mnih et al., 2013), and a combination of Monte-Carlo Tree Search with supervised training (Guo et al., 2014), called UCC-I....

Proceedings ArticleDOI
24 Dec 2012
TL;DR: A new physics engine tailored to model-based control, based on the modern velocity-stepping approach which avoids the difficulties with spring-dampers, which can compute both forward and inverse dynamics.
Abstract: We describe a new physics engine tailored to model-based control. Multi-joint dynamics are represented in generalized coordinates and computed via recursive algorithms. Contact responses are computed via efficient new algorithms we have developed, based on the modern velocity-stepping approach which avoids the difficulties with spring-dampers. Models are specified using either a high-level C++ API or an intuitive XML file format. A built-in compiler transforms the user model into an optimized data structure used for runtime computation. The engine can compute both forward and inverse dynamics. The latter are well-defined even in the presence of contacts and equality constraints. The model can include tendon wrapping as well as actuator activation states (e.g. pneumatic cylinders or muscles). To facilitate optimal control applications and in particular sampling and finite differencing, the dynamics can be evaluated for different states and controls in parallel. Around 400,000 dynamics evaluations per second are possible on a 12-core machine, for a 3D humanoid with 18 dofs and 6 active contacts. We have already used the engine in a number of control applications. It will soon be made publicly available.
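
For orientation, here is a brief sketch of loading a model and stepping the forward dynamics through MuJoCo's present-day official Python bindings (which postdate the cited paper; the paper describes C++ and XML interfaces). The toy XML model and the applied torque are illustrative assumptions.

```python
# Load a tiny MJCF model and step its forward dynamics.
import mujoco

XML = """
<mujoco>
  <worldbody>
    <body>
      <joint name="hinge" type="hinge"/>
      <geom type="capsule" size="0.02" fromto="0 0 0 0 0 0.3"/>
    </body>
  </worldbody>
  <actuator>
    <motor joint="hinge"/>
  </actuator>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(XML)   # compile XML into the runtime model
data = mujoco.MjData(model)

for _ in range(100):
    data.ctrl[:] = 0.1                        # joint torque (generalized control)
    mujoco.mj_step(model, data)               # forward dynamics step
print("joint angle:", float(data.qpos[0]), "joint velocity:", float(data.qvel[0]))
```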

4,018 citations


"Trust Region Policy Optimization" refers methods in this paper

  • ...games from images using convolutional neural networks with tens of thousands of parameters. 8.1 Simulated Robotic Locomotion: We conducted the robotic locomotion experiments using the MuJoCo simulator (Todorov et al., 2012). The three simulated robots are shown in Figure 2. The states of the robots are their generalized positions and velocities, and the controls are joint torques. Underactuation, high dimensionality, an....

Journal ArticleDOI
01 Sep 1983
TL;DR: It is shown that a system consisting of two neuron-like adaptive elements can solve a difficult learning control problem, in which the task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base.
Abstract: It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem. The task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this version of the pole-balancing problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. The differences between this approach and other attempts to solve problems using neurolike elements are discussed, as is the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.
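
The sketch below mimics the ASE/ACE arrangement described above on an abstract discrete-state task rather than the actual cart-pole physics: the ACE learns a state evaluation and emits a TD-like internal reinforcement, and the ASE adjusts its action weights using that signal. States, dynamics, and constants are illustrative assumptions.

```python
# ASE/ACE-style sketch: a critic element supplies internal reinforcement to
# an associative search element on a toy discrete-state task.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 10, 0.95
v = np.zeros(n_states)          # ACE: evaluation weights, one per state "box"
w = np.zeros(n_states)          # ASE: action weights (push left vs. right)

def act(s):
    # ASE: noisy threshold on its weight -> action in {-1, +1}
    return 1 if w[s] + rng.normal(scale=0.5) > 0 else -1

def step(s, a):
    # Toy dynamics: the "correct" action in each state is encoded by its parity;
    # failure (external reinforcement -1) can occur when the wrong action is taken.
    correct = 1 if s % 2 == 0 else -1
    failed = (a != correct) and rng.random() < 0.3
    return rng.integers(n_states), (-1.0 if failed else 0.0), failed

s = rng.integers(n_states)
for t in range(20000):
    a = act(s)
    s2, r, failed = step(s, a)
    # ACE: internal reinforcement = r + gamma * v(s') - v(s)  (no bootstrap on failure)
    r_hat = r + (0.0 if failed else gamma * v[s2]) - v[s]
    v[s] += 0.05 * r_hat
    w[s] += 0.5 * r_hat * a        # ASE: reinforce the action just taken by r_hat
    s = rng.integers(n_states) if failed else s2

print("learned action preferences (sign):", np.sign(w).astype(int))
```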

3,240 citations

Book
01 Dec 2008
TL;DR: Markov Chains and Mixing Times is an introduction to the modern approach to the theory of Markov chains, whose main goal is to determine the rate of convergence of a Markov chain to the stationary distribution as a function of the size and geometry of the state space; it assumes only a modest understanding of probability theory and linear algebra at an undergraduate level.
Abstract: This book is an introduction to the modern approach to the theory of Markov chains. The main goal of this approach is to determine the rate of convergence of a Markov chain to the stationary distribution as a function of the size and geometry of the state space. The authors develop the key tools for estimating convergence times, including coupling, strong stationary times, and spectral methods. Whenever possible, probabilistic methods are emphasized. The book includes many examples and provides brief introductions to some central models of statistical mechanics. Also provided are accounts of random walks on networks, including hitting and cover times, and analyses of several methods of shuffling cards. As a prerequisite, the authors assume a modest understanding of probability theory and linear algebra at an undergraduate level. ""Markov Chains and Mixing Times"" is meant to bring the excitement of this active area of research to a wide audience.
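
The book's central quantity is the distance to stationarity as a function of time; the short computation below tracks the total-variation distance for a lazy random walk on a cycle and reports the spectral gap that governs its decay. The chain is a standard toy example, not one taken from the book's text.

```python
# Total-variation distance to stationarity and spectral gap for a lazy
# random walk on an n-cycle.
import numpy as np

n = 8
P = np.zeros((n, n))
for i in range(n):                     # lazy random walk on an n-cycle
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

pi = np.full(n, 1.0 / n)               # uniform stationary distribution
mu = np.zeros(n); mu[0] = 1.0          # start from a point mass

for t in [1, 10, 50, 200]:
    dist = mu @ np.linalg.matrix_power(P, t)
    tv = 0.5 * np.abs(dist - pi).sum() # total-variation distance at time t
    print(f"t={t:4d}  TV distance = {tv:.4f}")

eigenvalues = np.sort(np.linalg.eigvals(P).real)[::-1]
print("spectral gap 1 - lambda_2 =", round(1 - eigenvalues[1], 4))
```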

2,573 citations