Adaptive properties of differential learning rates for positive and negative outcomes
References
37,989 citations
"Adaptive properties of differential..." refers background in this paper
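...Many reinforcement learning models implicitly assume that positive and negative RPEs impact estimation of action and state values through a common gain factor or learning rate (Sutton and Barto 1998; O’Doherty et al. 2007)....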
[...]
...A central element in the major theories of reinforcement learning is the reward prediction error (RPE) which, in its simplest form, is the difference between the amount of received and expected reward (Sutton and Barto 1998)....
[...]
...To understand this pattern of results, recall that reinforcement learners face a trade-off between exploitation and exploration (Sutton and Barto 1998): choosing to exploit the option with the highest estimated Q-value does not guarantee that there is not an (insufficiently explored) better option available....
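The trade-off described in this excerpt can be illustrated with a softmax choice rule, which usually picks the option with the highest estimated Q-value but still samples the others. This is only a minimal sketch: the excerpt does not say which choice rule the paper uses, and the names `softmax_probs`, `choose`, and the inverse temperature `beta` are assumptions introduced here for illustration.

```python
import math
import random


def softmax_probs(q_values, beta):
    """Turn Q-values into choice probabilities (beta is a hypothetical inverse temperature)."""
    q_max = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def choose(q_values, beta, rng):
    """Sample an action index; low beta explores more, high beta exploits more."""
    probs = softmax_probs(q_values, beta)
    return rng.choices(range(len(q_values)), weights=probs)[0]


rng = random.Random(0)
action = choose([0.4, 0.6], beta=2.0, rng=rng)
```

With a small `beta` the two options are chosen almost equally often, so a better but insufficiently explored option still gets sampled; with a large `beta` the agent almost always exploits its current best estimate.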
[...]
...The three agents are tested on two versions of a difficult two-armed bandit task, with large variance in the two outcomes but small differences in the means (Sutton and Barto 1998)....
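A bandit of the kind described here (high outcome variance, small difference in means) might be sketched as follows. The excerpt does not give the reward distributions, so the Bernoulli payoffs and the probabilities `0.45` and `0.55` are purely hypothetical values chosen for illustration, as is the helper name `pull`.

```python
import random


def pull(arm, probs, rng):
    """Return a binary reward from the chosen arm (assumed Bernoulli payoff)."""
    return 1.0 if rng.random() < probs[arm] else 0.0


# Hypothetical reward probabilities: the means differ by only 0.1,
# while each arm's outcome variance p * (1 - p) is close to its maximum of 0.25.
probs = (0.45, 0.55)
rng = random.Random(0)

n_pulls = 10_000
empirical_means = [
    sum(pull(arm, probs, rng) for _ in range(n_pulls)) / n_pulls
    for arm in (0, 1)
]
```

Because single outcomes are so noisy relative to the gap between the means, many pulls are needed before the better arm can be identified reliably, which is what makes the task difficult.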
[...]
35,067 citations
7,016 citations
"Adaptive properties of differential..." refers background in this paper
...The three agents are tested on two versions of a difficult two-armed bandit task, with large variance in the two outcomes but small differences in the means (Sutton and Barto 1998)....
[...]
...A central element in the major theories of reinforcement learning is the reward prediction error (RPE) which, in its simplest form, is the difference between the amount of received and expected reward (Sutton and Barto 1998)....
[...]
...To understand this pattern of results, recall that reinforcement learners face a trade-off between exploitation and exploration (Sutton and Barto 1998): choosing to exploit the option with the highest estimated Q-value does not guarantee that there is not an (insufficiently explored) better option…...
[...]
...Many reinforcement learning models implicitly assume that positive and negative RPEs impact estimation of action and state values through a common gain factor or learning rate (Sutton and Barto 1998; O’Doherty et al. 2007)....
[...]
4,916 citations
"Adaptive properties of differential..." refers background or methods in this paper
...Unlike the standard form of typical reinforcement learning algorithms such as Q-learning (Watkins 1989), however, our agents employ different learning rates for positive and negative prediction errors $\Delta Q_t = (r_t - Q_t)$:

$$Q_{t+1} = Q_t + \begin{cases} \alpha^{+}\,\Delta Q_t & \text{if } \Delta Q_t \ge 0 \\ \alpha^{-}\,\Delta Q_t & \text{if } \Delta Q_t < 0 \end{cases}$$

$Q_t$, by convention, corresponds to the expected reward of taking a certain action at time step $t$ (“Q-value”). $r_t$ is the actual reward received at time step $t$....
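The update rule quoted above can be written as a short function. This is a minimal sketch of that single update step only; the function name `update_q` and the example learning rates are assumptions introduced here, not values taken from the paper.

```python
def update_q(q, reward, alpha_pos, alpha_neg):
    """Apply one update with separate learning rates for positive and negative errors."""
    delta = reward - q  # prediction error: received minus expected reward
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return q + alpha * delta


# Hypothetical rates: learn faster from positive than from negative errors.
q = 0.5
q = update_q(q, reward=1.0, alpha_pos=0.4, alpha_neg=0.1)  # positive error, uses alpha_pos
q = update_q(q, reward=0.0, alpha_pos=0.4, alpha_neg=0.1)  # negative error, uses alpha_neg
```

Setting `alpha_pos` equal to `alpha_neg` recovers the usual single-learning-rate update, so the asymmetric rule is a direct generalization of it.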
[...]
...Q-learning and related models are in widespread use as models for fitting behavior and neurophysiological data in neuroscience, providing much evidence that biological learning mechanisms share at least some properties with these models (O’Doherty et al. 2007; Maia and Frank 2011, but see Gershman and Niv 2010)....
[...]
...The analytical results predicted with high accuracy the behavior of a modified Q-learning algorithm (5,000 iterations of 800 trials each; error bars in Fig....
[...]
...In particular, we simulate the performance of three Q-learning agents on two different “two-armed bandit” problems....
[...]
2,946 citations
"Adaptive properties of differential..." refers background in this paper
...An influential proposal arising from this idea is that RPEs may have different effects on direct (D1) and indirect (D2) pathways in the striatum (Gerfen et al. 1990; Kravitz et al. 2012)....
[...]
...Depending on the specific setting, these distinct mechanisms are referred to as approach/avoid, go/noGo, or direct/indirect pathways in the basal ganglia, associated with predominantly D1- and D2-expressing projection neurons in the striatum, respectively (Gerfen et al. 1990; Kravitz et al. 2012)....
[...]