Reinforcement Learning: An Introduction
Citations
[...]
38,208 citations
23,074 citations
14,635 citations
Cites background from "Reinforcement Learning: An Introduc..."
...Such NNs learn to perceive/encode/predict/ classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys, e.g., Kaelbling et al., 1996; Sutton & Barto, 1998; Wiering & van Otterlo, 2012)....
[...]
...The latter is often explained in a probabilistic framework (e.g., Sutton & Barto, 1998), but its basic idea can already be conveyed in a deterministic setting....
[...]
...Such NNs learn to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act in the more general sense of Reinforcement Learning (RL) in unknown environments (e.g., Kaelbling et al., 1996; Sutton and Barto, 1998)....
[...]
...Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Baird, 1994; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamarı́a et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010)....
[...]
...This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec....
[...]
14,377 citations
10,141 citations
References
312 citations
310 citations
309 citations
309 citations
"Reinforcement Learning: An Introduc..." refers background or methods in this paper
...The example of Phil’s breakfast in this chapter was inspired by Agre (1988). We direct the reader to Chapter 6 for references to the kind of temporal-difference method we used in the tic-tac-toe example....
[...]
...1 The overall view of planning and learning presented here has developed gradually over a number of years, in part by the authors (Sutton, 1990, 1991a, 1991b; Barto, Bradtke, and Singh, 1991, 1995; Sutton and Pinette, 1985; Sutton and Barto, 1981b); it has been strongly influenced by Agre and Chapman (1990; Agre 1988), Bertsekas and Tsitsiklis (1989), Singh (1993), and others....
[...]
...Experts agree that the major stumbling block to creating stronger-than-amateur Go programs is the difficulty of defining an adequate position evaluation function. A good evaluation function allows search to be truncated at a feasible depth by providing relatively easy-to-compute predictions of what deeper search would likely yield. According to Müller (2002): “No simple yet reasonable evaluation function will ever be found for Go....
[...]
304 citations