Policy Gradient Methods for Reinforcement Learning with Function Approximation
Citations
Cites methods from "Policy Gradient Methods for Reinforcement Learning with Function Approximation"
...An important class of direct search (DS) methods for NNs is the family of Policy Gradient methods (Williams, 1986, 1988, 1992a; Baxter and Bartlett, 1999; Sutton et al., 1999a; Aberdeen, 2003; Ghavamzadeh and Mahadevan, 2003; Kohl and Stone, 2004; Wierstra et al., 2007, 2008; Rückstieß et al., 2008; Peters and Schaal, 2008b,a; Sehnke et al., 2010; Grüttner et al., 2010; Wierstra et al., 2010; Peters, 2010; Bartlett and Baxter, 2011; Grondman et al., 2012)....
Cites background from "Policy Gradient Methods for Reinforcement Learning with Function Approximation"
...Some of the most popular white-box general reinforcement learning techniques that have translated particularly well into the domain of robotics include: (i) policy gradient approaches based on likelihood-ratio estimation (Sutton et al., 1999), (ii) policy updates inspired by expectation–maximization (EM) (Toussaint et al., 2010), and (iii) the path integral methods...
...Modeling exploration with probability distributions has surprising implications, e.g., stochastic policies have been shown to be the optimal stationary policies for selected problems (Jaakkola et al., 1993; Sutton et al., 1999) and can even break the curse of dimensionality (Rust, 1997)....
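Why can a stochastic stationary policy be optimal? A common intuition comes from state aliasing. The sketch below is my own toy example, not taken from the cited papers: two hidden states share one observation and the wrong action keeps the agent in place, so any deterministic memoryless policy traps the agent in one of the two states, while a randomized one escapes both.

```python
# A toy two-state aliased world (my own illustration, not from the cited papers):
# both hidden states emit the same observation, so a memoryless policy must pick
# action 'right' with the same fixed probability p in both of them.
def expected_steps(p):
    """Expected steps to the goal, averaged over the two aliased start states.
    From the left state the goal needs 'right' (geometric, mean 1/p);
    from the right state it needs 'left' (geometric, mean 1/(1-p))."""
    if p <= 0.0 or p >= 1.0:
        return float("inf")  # a deterministic policy loops forever in one state
    return 0.5 * (1.0 / p + 1.0 / (1.0 - p))

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  E[steps]={expected_steps(p):.2f}")
# The expectation is minimized at p=0.5: the best stationary
# memoryless policy for this problem is strictly stochastic.
```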
...by actions taken at the end of an episode, we can replace the return of the episode $J^\tau$ by the state–action value function $Q^\pi(s, a)$ and obtain (Peters and Schaal, 2008c)
$$\nabla_\theta J_\theta = \mathbb{E}\left\{ \sum_{h=1}^{H} \nabla_\theta \log \pi_\theta(s_h, a_h)\, Q^\pi(s_h, a_h) \right\},$$
which is equivalent to the policy gradient theorem (Sutton et al., 1999)....
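A minimal sketch of the likelihood-ratio estimator this expectation suggests, on a hypothetical three-state chain (the MDP, constants, and helper names below are my own illustration; the Monte Carlo return-to-go stands in for $Q^\pi$, of which it is an unbiased sample):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, H = 3, 2, 5

# Hypothetical deterministic dynamics: action 0 steps left, action 1 steps right.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

def reward(s_next):
    return 1.0 if s_next == n_states - 1 else 0.0  # +1 for reaching the right end

theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    # gradient of log softmax: one-hot(a) - pi(.|s); only row s is nonzero
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

def episode_gradient():
    """One sample of sum_h grad log pi(s_h, a_h) * Q^pi(s_h, a_h),
    with Q^pi replaced by the Monte Carlo return-to-go."""
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(n_actions, p=policy(s))
        s_next = rng.choice(n_states, p=P[s, a])
        traj.append((s, a, reward(s_next)))
        s = s_next
    g, ret = np.zeros_like(theta), 0.0
    for s, a, r in reversed(traj):
        ret += r                      # undiscounted return-to-go from step h
        g += grad_log_pi(s, a) * ret
    return g

for _ in range(2000):                 # plain stochastic gradient ascent on J
    theta += 0.1 * episode_gradient()

print("P(step right | s):", [round(float(policy(s)[1]), 3) for s in range(n_states)])
```

After training, the probability of stepping right should approach 1 in every state, since only the rightmost state pays reward.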
References
"Policy Gradient Methods for Reinfor..." refers background in this paper
...1 Policy Gradient Theorem. We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP)....
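For reference, the theorem that section establishes (Theorem 1 of the paper, in its average-reward formulation, with $\rho$ the performance measure, $d^\pi$ the stationary state distribution under $\pi$, and $\pi(s,a)$ the probability of taking action $a$ in state $s$):
$$\frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s,a)}{\partial \theta}\, Q^{\pi}(s,a).$$
The same expression holds for the discounted start-state formulation, with $d^\pi$ reinterpreted as the discounted state occupancy.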
"Policy Gradient Methods for Reinfor..." refers background or methods in this paper
...These, together with the step-size requirements, are the necessary conditions to apply Proposition 3.5 from page 96 of Bertsekas and Tsitsiklis (1996), which assures convergence to a local optimum....
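The step-size requirements referred to are, presumably, the usual stochastic-approximation (Robbins–Monro) conditions on the sequence $\{\alpha_k\}$, stated here for completeness:
$$\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty,$$
satisfied, for example, by $\alpha_k = 1/(k+1)$: the steps are large enough in total to reach any optimum, yet shrink fast enough to quench the noise.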
...Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996)....
...For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and Van Roy, 1996; Bertsekas and Tsitsiklis, 1996)....
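As a concrete illustration of such divergence, here is a minimal sketch loosely following the classic two-state counterexample of Tsitsiklis and Van Roy; the constants and the off-policy update schedule are my own illustrative choices:

```python
# Two states with linear value estimates V(s1) = w, V(s2) = 2*w
# (scalar features 1 and 2). Semi-gradient TD(0) updates are applied
# off-policy, only on the s1 -> s2 transition (reward 0).
import numpy as np

phi = np.array([1.0, 2.0])   # features of s1 and s2
gamma, alpha = 0.99, 0.1
w = 1.0

# Each update multiplies w by (1 + alpha * (2*gamma - 1)),
# so |w| grows without bound whenever gamma > 0.5.
for step in range(1, 51):
    td_error = 0.0 + gamma * phi[1] * w - phi[0] * w
    w += alpha * td_error * phi[0]
    if step % 10 == 0:
        print(f"step {step:3d}: w = {w:.3e}")
```

Running this prints a weight that grows geometrically rather than converging, which is the kind of failure mode the excerpt refers to.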