PVLV: The Primary Value and Learned Value Pavlovian
Learning Algorithm
Randall C. O’Reilly
University of Colorado at Boulder
Michael J. Frank
University of Arizona
Thomas E. Hazy and Brandon Watz
University of Colorado at Boulder
The authors present their primary value learned value (PVLV) model for understanding the reward-
predictive firing properties of dopamine (DA) neurons as an alternative to the temporal-differences (TD)
algorithm. PVLV is more directly related to underlying biology and is also more robust to variability in
the environment. The primary value (PV) system controls performance and learning during primary
rewards, whereas the learned value (LV) system learns about conditioned stimuli. The PV system is
essentially the Rescorla–Wagner/delta-rule and comprises the neurons in the ventral striatum/nucleus
accumbens that inhibit DA cells. The LV system comprises the neurons in the central nucleus of the
amygdala that excite DA cells. The authors show that the PVLV model can account for critical aspects
of the DA firing data, making a number of clear predictions about lesion effects, several of which are
consistent with existing data. For example, first- and second-order conditioning can be anatomically
dissociated, which is consistent with PVLV and not TD. Overall, the model provides a biologically
plausible framework for understanding the neural basis of reward learning.
Keywords: basal ganglia, dopamine, reinforcement learning, Pavlovian conditioning, computational
modeling
An important and longstanding challenge for both the cognitive
neuroscience and artificial intelligence communities has been to
develop an adequate understanding (and a correspondingly robust
model) of Pavlovian learning. Such a model should account for the
full range of signature findings in the rich literature on this phe-
nomenon. Pavlovian conditioning refers to the ability of previ-
ously neutral stimuli that reliably co-occur with primary rewards to
elicit new conditioned behaviors and to take on reward value
themselves (e.g., Pavlov’s famous case of the bell signaling food
for hungry dogs; Pavlov, 1927).
Pavlovian conditioning is distinguished from instrumental con-
ditioning in that the latter involves the learning of new behaviors
that are reliably associated with reward, either first order (US), or
second order (CS). Although Pavlovian conditioning also involves
behaviors (conditioned and unconditioned responses), reward de-
livery is not contingent on behavior but is instead reliably paired
with a stimulus regardless of behavior. In contrast, instrumental
conditioning explicitly makes reward contingent on a particular
“operant” or “instrumental” response. Both stimulus–reward (Pav-
lovian) and stimulus–response–reward (instrumental) associations,
however, are thought to be trained by the same phasic dopamine
signal that occurs at the time of primary reward (US) as described
below. In practice, the distinction is often blurry as the two types
of conditioning interact (e.g., second-order instrumental condition-
ing and so-called Pavlovian instrumental transfer effects).
The dominant theoretical perspective for both Pavlovian and
instrumental conditioning since the seminal Rescorla and Wagner
(1972) model is that learning is based on the discrepancy between
actual rewards received and predictions thereof (i.e., reward pre-
diction error). Currently, the temporal differences (TD) reward
prediction framework (Sutton, 1988; Sutton & Barto, 1998) is by
far the most widely adopted computational level account of Pav-
lovian (and instrumental) conditioning and dopamine firing (e.g.,
Barto, 1995; Daw, Courville, & Touretzky, 2003; Daw, Kakade, &
Dayan, 2002; Dayan, 2001; Dayan & Balleine, 2002; Houk, Ad-
ams, & Barto, 1995; Kakade & Dayan, 2002a, 2002b; Montague,
Dayan, & Sejnowski, 1996; Suri & Schultz, 1999, 2001; see
Brown, Bullock, & Grossberg, 1999; Contreras-Vidal & Schultz,
1999; Sporns & Alexander, 2002, for alternative models, and Joel,
Niv, & Ruppin, 2002, for a biologically oriented review).
One important reason for the popularity of TD is that a reward
prediction error signal has been established in the brain, in the
pattern of midbrain dopamine neuron activation (e.g., Schultz,
1998; Schultz, Apicella, & Ljungberg, 1993; see Figure 1). These
neurons initially fire short phasic bursts of activity for primary
rewards and over the course of learning come to fire similarly at
the onset of previously neutral, reward-predictive stimuli (i.e.,
conditioned stimuli; CS), and no longer to the reward itself.
Generally, there is a time period when both CS- and reward-related
firing is occurring (Pan, Schmidt, Wickens, & Hyland, 2005;
Schultz, 2002).

Randall C. O’Reilly, Thomas E. Hazy, and Brandon Watz, Department
of Psychology, University of Colorado at Boulder; Michael J. Frank,
Department of Psychology and Program in Neuroscience, University of
Arizona. This work was supported by Office of Naval Research Grant
N00014-03-1-0428 and National Institute of Mental Health Grants
MH069597 and MH64445. We thank Peter Dayan, Nathaniel Daw, Yael Niv,
Eric Claus, and the CCN lab for discussions of these ideas.
Correspondence concerning this article should be addressed to Randall
C. O’Reilly, Department of Psychology, University of Colorado at
Boulder, 345 UCB, Boulder, CO 80309. E-mail: oreilly@psych.colorado.edu
However, it remains unclear exactly what brain mechanisms
lead to this behavior on the part of dopamine cells. Most research-
ers agree that the critical learning processes are taking place
upstream from the midbrain dopamine neurons themselves. But
which areas are doing what? Because it is an abstract, unitary (and
elegant) framework, the TD model does not map directly onto the
relatively large collection of neural substrates known to be in-
volved in reinforcement learning, including areas of the basal
ganglia, amygdala, midbrain dopamine nuclei, and ventromedial
prefrontal cortex (Cardinal, Parkinson, Hall, & Everitt, 2002).
Indeed, relatively few specific proposals have been made for a
biological mapping of the TD model (Houk et al., 1995; Joel et al.,
2002).
In this article, we offer a multicomponent model of Pavlovian
learning called PVLV, which provides a more direct mapping onto
the underlying neural substrates. PVLV is composed of two sub-
systems: primary value (PV) and learned value (LV). The PV
system is engaged by primary reward (i.e., an unconditioned
stimulus; US) and learns to expect the occurrence of a given US,
thereby inhibiting the dopamine burst that would otherwise occur
for it. The LV system learns about conditioned stimuli that are
reliably associated with primary rewards, and it drives phasic
dopamine burst firing at the time of CS onset. This decomposition
is similar to the model of Brown et al. (1999), but as we discuss
later, there are several important functional and anatomical differ-
ences between the two models.
The PV and LV systems are further subdivided into excitatory
and inhibitory subcomponents, which provide a good fit with a
wide range of data (reviewed in detail later) on three different
brain areas. The excitatory subcomponent of PV (denoted PVe) is
associated with the reward-driven excitatory projections from the
lateral hypothalamus onto midbrain dopamine neurons in the sub-
stantia nigra pars compacta (SNc) and the ventral tegmental area
(VTA) as we discuss in more detail later in the section “Biological
Mapping of PVLV.” The inhibitory subcomponent of PV (PVi) is
associated with neurons in the ventral striatum/nucleus accumbens
(VS/NAc) that have direct GABAergic projections to the SNc and
VTA and fire just in advance of primary rewards. The excitatory
subcomponent of the LV system (LVe) is associated with neurons
in the central nucleus of the amygdala (CNA), which have a net
excitatory effect on the SNc and VTA. Thus, we suggest that these
CNA neurons learn to associate CSs with reward and drive exci-
tatory dopamine bursts at CS onset. Finally, there is an inhibitory
component of the LV (LVi) that is also associated with the VS/
NAc, which slowly learns to inhibit the excitatory LVe drive on
the dopamine neurons.
In addition to these core PVLV mechanisms, a number of other
brain areas play a critical role in reinforcement learning. For
example, we think of the prefrontal cortex (PFC) and hippocampus
as providing something akin to an eligibility trace (as in TD(λ);
Sutton & Barto, 1998; Pan et al., 2005). We believe this sort of
actively maintained working memory representation is particularly
crucial in trace conditioning paradigms in which there is an inter-
val of time between CS-offset and US-onset. As we discuss later,
PVLV explicitly accounts for known dissociations between delay
versus trace conditioning paradigms that occur under PFC and/or
hippocampal lesions, something about which TD is less explicit. In
fact, PVLV actually requires that models learn to hold onto work-
ing memory representations under trace conditioning paradigms.
Although for the models in this article we apply the working
memory inputs directly (so as to focus on the core PVLV mech-
anisms), our larger prefrontal cortex-basal ganglia (PBWM)
model, of which PVLV is a subcomponent, demonstrates how the
system can learn to maintain task-relevant information in working
memory (O’Reilly & Frank, 2006). TD does not address the
learning of working memory representations explicitly; instead, it
finesses the issue by assuming that the needed information is just
there in the eligibility trace.
In addition, the cerebellum (and possibly other brain areas)
provides a representation of time (e.g., Mauk & Buonomano,
2004; Ivry, 1996) that acts as an additional input signal that can
become associated with reward, as in the framework of Savastano
and Miller (1998). The basolateral nucleus of the amygdala (BLA)
is important for second-order conditioning in this framework be-
cause detailed properties of the PVLV mechanism prevent the LVe
(CNA) from performing both first- and second-order conditioning.
This is consistent with data showing anatomical dissociations
between these forms of conditioning (e.g., Hatfield, Han, Conley,
& Holland, 1996). Note that this dissociation, and many others
reviewed later, would not be predicted by the abstract, unitary TD
mechanism. Thus, the PVLV mechanism provides an important
bridge between the more abstract TD model and the details of the
underlying neural systems.
The mapping between TD and PVLV is not perfect, however,
and PVLV makes several distinctive predictions relative to TD in
various behavioral paradigms. For example, PVLV strongly pre-
dicts that higher order conditioning beyond second-order should be
weak to nonexistent, whereas TD makes no distinction between
these different levels of conditioning. There is a remarkable
lack of published work on third or higher levels of conditioning,
and the two references we were able to find indicate that it is
nonexistent or weak at best (Denny & Ratner, 1970; Dinsmoor,
2001). Another difference comes from paradigms with variable
CS–US intervals. As we show later, PVLV is very robust to this
variability, but TD is not. The data indicate that animals are also
robust to this form of variability (H. Davis, McIntire, & Cohen,
1969; Kamin, 1960; Kirkpatrick & Church, 2000). PVLV also
makes a very strong distinction between delay and trace conditioning,
as do animals, whereas this distinction in TD is considerably more
arbitrary.

Figure 1. Schematic of dopamine (DA) recording data for a simple
conditioning experiment in which a conditioned stimulus (CS) reliably
precedes the delivery of a reward (Rew). During acquisition (a), DA
initially bursts at the time of reward delivery but then starts spiking at
stimulus onset, diminishing at the time of reward delivery. Note that there
is no strong evidence of a backward-propagating burst over training, as
predicted by some versions of the temporal-differences model but not by
primary value learned value (PVLV). After training (b), if reward is
omitted, a dip in DA below baseline tonic levels is observed.
The remainder of the article is organized as follows. First we
develop the PVLV algorithm at a computational level and
provide several demonstrations of basic Pavlovian learning
phenomena by using the PVLV model. Next, we discuss the
mapping of PVLV onto the brain areas as summarized above
and review a range of empirical data that are consistent with
this model. We conclude by comparing our model with other
models in the literature, including the Brown et al. (1999)
model, which has several important similarities and several
differences relative to our model.
The PVLV Algorithm
The PVLV algorithm starts with the basic Rescorla and Wagner
(1972) learning rule (which is formally identical to the earlier delta
rule; Widrow & Hoff, 1960, as originally pointed out by Sutton &
Barto, 1981), which captures the core principle that learning
should be based on the discrepancy between predictions and actual
outcomes:
$$\delta_t = r_t - \hat{r}_t , \qquad (1)$$

where $r_t$ is the current reward value at time $t$, $\hat{r}_t$ is the expected or
predicted reward value, and $\delta_t$ is the discrepancy or error between
the two. This $\delta_t$ value then drives synaptic weight changes for the
system computing $\hat{r}_t$. For example, a simple neural model would
involve a single neural unit that computes the estimated value $\hat{r}_t$ by
using synaptic weights $w_i$ from a set of sensory inputs $x_i$:

$$\hat{r}_t = \sum_i w_i^t x_i^t . \qquad (2)$$

The change in the weight values needed to improve the estimated
reward value is simply

$$\Delta w_i^t = \epsilon \, \delta_t \, x_i^t . \qquad (3)$$
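To make Equations 1–3 concrete, the following minimal sketch applies the delta rule to a single rewarded stimulus. It is illustrative only: the variable names and parameter values (e.g., a learning rate of 0.1) are assumptions for the example, not values from the published implementation.

```python
import numpy as np

# Delta-rule (Rescorla-Wagner) sketch of Equations 1-3, with illustrative values.
epsilon = 0.1                    # learning rate
w = np.zeros(2)                  # weights w_i from two input stimuli
x = np.array([1.0, 0.0])         # stimulus A present, stimulus B absent

for trial in range(100):
    r_hat = w @ x                # predicted reward, Equation 2
    r = 1.0                      # actual reward delivered with stimulus A
    delta = r - r_hat            # prediction error, Equation 1
    w += epsilon * delta * x     # weight update, Equation 3

print(np.round(w, 2))            # weight for stimulus A approaches 1.0; B stays 0.0
```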
This model does an excellent job of learning to expect primary
rewards, and, if we take the $\delta_t$ to represent the dopamine firing
deviations from baseline, it can explain the cancellation of
dopamine bursting at the onset of the US in a classic Pavlovian
paradigm (Figure 1). However, it cannot account for the firing
of dopamine bursts at the earlier onset of a CS because in fact
there is no actual primary reward ($r_t$) present at that time, and
thus the system will not learn to expect anything at that time.
This CS-triggered dopamine firing plays a critical functional
role in learning because it allows the system to learn which
situations and actions can lead to subsequent reward. For ex-
ample, initial exposure to the presence of cookies in a cookie jar
can enable a subsequent dopamine-reinforced approach and
opening of the jar.
The TD algorithm corrects this critical limitation of the
Rescorla–Wagner algorithm by adopting a temporally extended
prediction framework, where the objective is to predict future
rewards, not just present rewards. The consequence of this is that
the $\delta_t$ at one point in time drives learning based on the immediately
prior sensory input state $x_i^{t-1}$. This produces a chain-reaction
effect in which a reward prediction error at one point in time
propagates earlier and earlier in time, to the earliest reliable predictor
of a subsequent reward. Hence, the $\delta_t$ value, and thus the
dopamine bursting, can move earlier in time to the onset of the CS.
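For comparison, the following sketch shows the chaining behavior of a generic tabular TD(0) learner (not part of PVLV): the prediction error at each time step trains the value of the preceding step, so reward prediction creeps backward toward CS onset only over repeated trials. The step counts and learning rate are arbitrary illustrative choices.

```python
import numpy as np

# Generic TD(0) sketch illustrating backward chaining of reward prediction.
n_steps = 5                 # time steps within a trial; CS at t=0, US at t=4
gamma = 1.0                 # no discounting, for simplicity
alpha = 0.1
V = np.zeros(n_steps + 1)   # predicted future reward at each step (terminal = 0)

for trial in range(200):
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0   # primary reward at the last step
        delta = r + gamma * V[t + 1] - V[t]    # TD error
        V[t] += alpha * delta                  # trains the preceding prediction

print(np.round(V[:n_steps], 2))  # early steps acquire value only via the chain
```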
The PVLV algorithm takes a different approach: The basic
Rescorla–Wagner learning rule is retained as the PV (primary
value) system, and an additional system (LV, learned value) is
added to learn about reward associations for conditioned stimuli.
In addition to the biological motivations for such a division of
labor mentioned earlier (and elaborated below), there are some
computational advantages for adopting this approach. Principally,
the relationship between a CS and a subsequent US is not always
very reliable, and having separate PV and LV systems enables the
system to be very robust to such variability. In contrast, the
chaining mechanism present in the TD algorithm is designed to
work optimally when there is a very reliable sequence of events
leading from the CS to the US. Intuitively, the chain between CS
and US must remain unbroken for the predictive signal to propa-
gate backward over learning, and this chain is only as strong as its
weakest link. This problem can be mitigated to some extent by
using an eligibility trace as in TD(λ), where 0 < λ < 1 parameterizes
an exponentially decaying trace of the input stimuli used for
learning. This can smooth over rough spots in the chain but at the
potential cost of reducing the temporal precision of reward pre-
dictions as a result of excessive smearing. In contrast, PVLV
avoids this problem entirely by not relying on a chaining mecha-
nism at all.
There are many situations in which the CS–US relationship is
unreliable. For example, in many working memory tasks, a highly
variable number of distractor stimuli can intervene between a
stimulus to be encoded in working memory and the subsequent
demand to recall that stimulus (Hochreiter & Schmidhuber, 1997;
O’Reilly & Frank, 2006). Any dog owner knows that dogs come to
associate the jingling of a leash with the idea that they will soon be
going on a walk, despite a variable amount of time and intervening
events between the leash jingle and the walk itself (e.g., the owner
may go to the bathroom, turn off the television, and check e-mail).
In the animal learning literature, there are (only) a few experiments
in which the CS–US relationship is variable (H. Davis et al., 1969;
Kamin, 1960; Kirkpatrick & Church, 2000), but it is clear that
conditioning is very robust in this case, equivalent to comparison
conditions that have fixed CS–US intervals. This finding is con-
sistent with PVLV and poses a challenge to TD-based approaches.
In short, we think the PVLV mechanism has the simplicity and
robustness that are often characteristic of biological systems, with
the cost of being less elegant than the TD system (two systems are
required instead of one). In the subsequent sections, we provide the
details for how the PV and LV systems operate.
The PV System
We can rewrite the Rescorla–Wagner equation in terms of the
excitatory (PVe) and inhibitory (PVi) subcomponents of the PV
system. The excitatory PV system represents the value implicitly
hardwired into a primary reward (US), $PV_e^t = r_t$ in the notation of
Rescorla–Wagner, whereas the inhibitory system learns to cancel
out these rewards, $PV_i^t = \hat{r}_t$. Thus, in this terminology, the PV
delta is

$$\delta_{pv}^t = PV_e^t - PV_i^t = r_t - \hat{r}_t , \qquad (4)$$
and this value is used to train the PVi system as described
earlier (Equation 3). As a consequence, when primary rewards
are delivered, the PVi system associates the current state of the
system with the US (reward). This current state information
includes any sensory inputs that coincide with reward, together
with internally generated timing signals (e.g., if rewards are
always delivered precisely 2 s following an input stimulus, then
the 2-s timing signal becomes associated with the US just as an
external sensory stimulus can become associated with it; Savas-
tano & Miller, 1998). As these associations increase, $PV_i^t$ at the
time of primary reward increases to match $PV_e^t$, and the $\delta_{pv}^t$
value (i.e., dopamine bursting) decreases, which is the observed
pattern.
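A minimal sketch of the PV computation just described (Equations 3 and 4): PVe is clamped to the primary reward value, while PVi uses the delta rule to learn an expectation that cancels the DA burst at the time of the US. The function and variable names (pv_step, w_pvi), the timing-unit input, and the parameter values are illustrative assumptions, not the authors' published code.

```python
import numpy as np

epsilon_pv = 0.1
w_pvi = np.zeros(3)               # PVi weights from state inputs (e.g., timing units)

def pv_step(x, r):
    """One time step of the PV system; x = state input vector, r = primary reward."""
    pv_e = r                      # PVe: hardwired primary reward value
    pv_i = w_pvi @ x              # PVi: learned expectation of that reward
    delta_pv = pv_e - pv_i        # Equation 4
    w_pvi[:] += epsilon_pv * delta_pv * x   # delta-rule training of PVi (Equation 3)
    return delta_pv

# Repeated pairings of the same state with reward shrink delta_pv
# (the DA burst at US onset) toward zero:
state = np.array([0.0, 0.0, 1.0])            # e.g., a "2 s after CS" timing unit
for trial in range(50):
    d = pv_step(state, r=1.0)
print(round(float(d), 3))
```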
The LV System
The LV system also uses the Rescorla–Wagner learning rule but
has a few key differences that enable it to signal reward associa-
tions at the time of CS onset. Like the PV system, the LV system
has two components, excitatory (LVe) and inhibitory (LVi). We
focus first on the LVe component, which learns CS associations
and drives the excitatory dopamine bursts at CS onset. The most
important property of the LVe system is that it only learns when
primary rewards are present or expected. In contrast, the PVi
system learns at all times about the current primary reward status
(PVe or r
t
). This difference protects the LVe system from having
to learn that there are no actual primary rewards present at the time
of CS onset. Therefore, unlike the PV system, it is able to signal
the reward association of a CS and not have this signal (dopamine
burst) trained away, as otherwise it would be if pure Rescorla–
Wagner learning were at work.
More formally, the LVe learning is conditioned on the state
of the PV system, according to the following filtering condition:
$$PV_{filter} = \left( PV_i^t > \theta_{pv} \right) \text{ or } \left( PV_e^t > \theta_{pv} \right) , \qquad (5)$$

where $\theta_{pv}$ is a threshold on PV activation, above which it is
considered that the PV system is expecting or receiving a reward
at this time (in the Appendix we present a more general condition
that allows for representation of both reward and punishment
expectations).

For clarity, note that $PV_{filter}$ is thus a boolean variable such that

$$PV_{filter} = \begin{cases} 1 & \text{if primary reward present or expected} \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

The boolean value of $PV_{filter}$ then regulates the learning of the LVe
system,

$$\Delta w_i^t = \begin{cases} \epsilon \left( PV_e^t - LV_e^t \right) x_i^t & \text{if } PV_{filter} \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$
The dependence of the secondary LV system on the primary PV
system for learning ensures that actual reward outcomes have the
final say in shaping all of the reward associations learned by the
system. Also, note that it is the primary reward value itself (PV
e
t
or r
t
) that drives the learning of the LV system, not the PV or LV
delta value, which is defined next. These features have important
implications for understanding various conditioning phenomena,
as elaborated below.
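The gating of LVe learning by the PV filter (Equations 5–7) can be sketched as follows. The example assumes delay conditioning, so the CS input is still active at the time of the US; the names, threshold, and learning rate are illustrative assumptions rather than the published implementation.

```python
import numpy as np

theta_pv = 0.2                    # threshold on PV activation (Equation 5)
epsilon_lv = 0.1
w_lve = np.zeros(2)               # LVe weights from stimulus inputs

def lve_step(x, pv_e, pv_i):
    lv_e = w_lve @ x
    pv_filter = (pv_i > theta_pv) or (pv_e > theta_pv)      # Equations 5-6
    if pv_filter:                                           # Equation 7
        w_lve[:] += epsilon_lv * (pv_e - lv_e) * x
    return lv_e

cs = np.array([1.0, 0.0])
for trial in range(50):
    lve_step(cs, pv_e=1.0, pv_i=0.0)       # US time: PV_filter holds, so LVe learns
print(round(float(lve_step(cs, pv_e=0.0, pv_i=0.0)), 2))
# CS onset on a later trial: LVe fires (~1.0) but is not trained away,
# because PV_filter is false and no learning occurs at that time.
```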
The LVi system performs a similar role for a learned CS as the
PVi system does for the US: It learns to cancel out dopamine
bursts for a highly learned CS. The LVi system is essentially the
same as the LVe system, except that it uses a slower learning rate
(ε), and it produces a net inhibitory drive on the dopamine system
like the PVi system. The LV delta is then the difference between
the excitatory and inhibitory components (just as with the PV
delta),
$$\delta_{lv}^t = LV_e^t - LV_i^t . \qquad (8)$$
Because of its slower learning rate, LVi slowly learns which
CSs are reliably associated with reward and decreases the
dopamine bursts for such CSs relative to those that have more
recently become associated with reward (which have been
learned by the faster LVe but not the slower LVi system).
Furthermore, if a CS that has been reliably associated with
reward subsequently becomes less strongly associated with
reward, the LV delta can become negative (because LVe has
learned this new lower reward association, but LVi retains the
previous more positive association), indicating the change in
reward association. Thus, consistent with the computational
motivation for the delta rule, the LV delta in Equation 8
represents the discrepancy between what was previously known
or expected (as encoded in the LVi weights of the system
through prior learning) and what is more recently happening
(encoded through the LVe weights). This LVi value does not
much affect the simple conditioning simulations shown below,
but it is more important for the efficacy of PVLV in training an
actor (in our case for working memory updating; O’Reilly &
Frank, 2006). Specifically, without LVi a stimulus associated
with reward would always drive a DA burst (even if its reward
association had recently decreased), and it would always rein-
force actions with a constant dopamine burst, to the point that
such actions would be massively overlearned.
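The role of the slower LVi component (Equation 8) can be illustrated by running the same learning rule at two rates: after acquisition, the LV delta is positive for a recently learned CS and shrinks as LVi catches up. The learning rates and names below are illustrative assumptions.

```python
import numpy as np

eps_lve, eps_lvi = 0.1, 0.01          # fast LVe, slow LVi
w_lve, w_lvi = np.zeros(2), np.zeros(2)

def lv_learn(x, pv_e, pv_filter):
    if pv_filter:                     # both LV components learn only when PV is active
        w_lve[:] += eps_lve * (pv_e - w_lve @ x) * x
        w_lvi[:] += eps_lvi * (pv_e - w_lvi @ x) * x

def lv_delta(x):
    return w_lve @ x - w_lvi @ x      # Equation 8

cs = np.array([1.0, 0.0])
for trial in range(50):
    lv_learn(cs, pv_e=1.0, pv_filter=True)
print(round(float(lv_delta(cs)), 2))  # positive for a recently learned CS; decays with further training
```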
How do the PV and LV systems each contribute to the
dopamine output signal? Because there are two delta signals in
PVLV, from PV and LV, these need to be combined to produce
an overall delta value that can be used as a global dopamine
signal (e.g., to train an actor system in an actor–critic architecture).
The most functionally transparent mechanism is to
have the PV delta apply whenever there is a primary reward
present or expected by the PV system. But when no rewards are
present, the LV delta can still drive dopamine firing. As before
(see Equation 5), PVLV implements this by using the boolean
variable $PV_{filter}$, where

$$PV_{filter} = \begin{cases} 1 & \text{if primary reward present or expected} \\ 0 & \text{otherwise,} \end{cases} \qquad (9)$$

and $PV_{filter}$ is evaluated as

$$PV_{filter} = \left( PV_i^t > \theta_{pv} \right) \text{ or } \left( PV_e^t > \theta_{pv} \right) . \qquad (10)$$

Thus,

$$\delta_t = \begin{cases} \delta_{pv}^t & \text{if } PV_{filter} \\ \delta_{lv}^t & \text{otherwise.} \end{cases} \qquad (11)$$

This is also consistent with the equation that determines when the
LV system learns according to PV expectations and actual reward
delivery (Equation 7).¹
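Equations 9–11 amount to a simple selection rule, sketched below; the helper name overall_delta and the threshold value are assumptions for illustration, not part of the published implementation.

```python
def overall_delta(pv_e, pv_i, lv_e, lv_i, theta_pv=0.2):
    """Combine PV and LV deltas into one dopamine-like signal (Equations 9-11)."""
    delta_pv = pv_e - pv_i                                 # Equation 4
    delta_lv = lv_e - lv_i                                 # Equation 8
    pv_filter = (pv_i > theta_pv) or (pv_e > theta_pv)     # Equations 9-10
    return delta_pv if pv_filter else delta_lv             # Equation 11

print(overall_delta(pv_e=1.0, pv_i=0.1, lv_e=0.75, lv_i=0.25))  # US present: PV delta (0.9)
print(overall_delta(pv_e=0.0, pv_i=0.0, lv_e=0.75, lv_i=0.25))  # CS onset only: LV delta (0.5)
```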
Figure 2 summarizes the PVLV system’s operation in the simple
CS–US conditioning paradigm we have been considering. The PV
continuously learns about the occurrence of primary rewards (both
presence and absence), and as it learns to expect reward delivery
it cancels the dopamine burst (i.e., PV delta value) that would
otherwise occur at that time. The reward also trains the LV system,
which produces increasing weights from the CS (as long as this is
active in the input at this time). On subsequent trials, the LV
system is then able to fire naturally at CS onset, producing a
dopamine burst (i.e., LV delta value). By this mechanism, the time
gap between CS-onset and US is bridged automatically by the
CS–US association, without recourse to the kind of explicit pre-
diction that is central to the TD model. The biological mapping of
the PVLV mechanisms shown in the figure is discussed in detail
below.
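A small end-to-end sketch of the acquisition dynamics summarized above is given below, under simplifying assumptions: discrete time steps, delay conditioning (the CS stays on through the US), an interval-timing input for PVi, and an onset-only CS signal standing in for the synaptic-depression mechanism described in the next section. All parameter values and names are illustrative, not the published implementation.

```python
eps_pv, eps_lve, eps_lvi = 0.1, 0.1, 0.01
theta_pv = 0.2
w_pvi = w_lve = w_lvi = 0.0

for trial in range(30):
    das = []
    for t in range(3):                        # t=0: CS onset, t=2: US delivery
        cs_present = 1.0                      # delay conditioning: CS on throughout
        cs_onset = 1.0 if t == 0 else 0.0     # onset-only drive on the LV output
        timing = 1.0 if t == 2 else 0.0       # timing unit at the expected US time
        r = 1.0 if t == 2 else 0.0            # primary reward (US)

        pv_e, pv_i = r, w_pvi * timing
        lv_e, lv_i = w_lve * cs_onset, w_lvi * cs_onset
        pv_filter = (pv_i > theta_pv) or (pv_e > theta_pv)
        das.append((pv_e - pv_i) if pv_filter else (lv_e - lv_i))

        w_pvi += eps_pv * (pv_e - pv_i) * timing           # PVi learns at all times
        if pv_filter:                                      # LV learns only when PV is active
            w_lve += eps_lve * (pv_e - w_lve * cs_present) * cs_present
            w_lvi += eps_lvi * (pv_e - w_lvi * cs_present) * cs_present

print([round(d, 2) for d in das])   # after training: burst at CS onset, little at the US
```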
Additional Mechanisms
There are two additional mechanisms required for the overall
system to function (and to be consistent with available data). First
(as previously noted), the PVi system must take advantage of some
kind of timing signal that enables it to fire at the expected time of
actual reward input and not otherwise. In Figure 2B, we illustrate
a ramping timing signal triggered by CS onset, which is intended
to represent the kind of interval timing signal provided by the
cerebellum (e.g., Ivry, 1996; Mauk & Buonomano, 2004), but any
kind of regular activity pattern would work just as well for our
model (see Lustig, Matell, & Meck, 2005, for a model of timing
signals within the basal ganglia). We discuss this issue further in
comparison with an alternative model of DA firing by Brown et al.
(1999) below, which depends on an intrinsic timing mechanism as
an integral part of their system.
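As one concrete possibility (purely illustrative, not the authors' implementation), a CS-onset-triggered timing input can be represented as a bank of units coarsely coding elapsed time, which the PVi delta rule can then associate with the US just like an external sensory input:

```python
import numpy as np

def timing_input(t_since_cs, n_units=10, width=1.0):
    """Coarse-coded (Gaussian-bump) representation of time elapsed since CS onset."""
    centers = np.arange(n_units)
    return np.exp(-0.5 * ((t_since_cs - centers) / width) ** 2)

# The pattern active at the expected US time (e.g., 2 steps after CS onset)
# is what PVi learns to associate with the reward.
print(np.round(timing_input(2), 2))
```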
The second additional mechanism required is a novelty de-
tection (and familiarity suppression) mechanism, so that the LV
system does not continue to trigger dopamine spiking during the
entire duration of CS input. With such a mechanism in place,
the first onset of a stimulus input triggers a burst of LV firing,
but this then decreases as the stimulus stays on. One solution to
this problem is to use a habituation mechanism on the LV
system to achieve this effect (e.g., Brown et al., 1999), but this
would generalize across various different stimuli and would
therefore prevent a second stimulus that could be associated
with a different or larger reward from evoking DA firing.
Instead, in our implementation we have adopted a synaptic
depression mechanism (e.g., Abbott, Varela, Sen, & Nelson,
1997; Markram & Tsodyks, 1996; Zucker & Regehr, 2002;
Huber & O’Reilly, 2003), which causes a habituation of the LV
DA-burst firing response only to the stimulus that was initially
active (i.e., only active synapses are depressed). With this
mechanism in place, the LVe system accommodates to any
constant sensory inputs and responds only to changes in input
signals, causing it to fire only at the onset of a stimulus tone.
Such synaptic depression mechanisms are ubiquitous through-
out the vertebrate and invertebrate brain (Zucker & Regehr,
2002). Nevertheless, there are a large number of ways of
implementing such an overall function, so we are confident that,
if our overall hypothesis about the PVLV mechanism is correct,
the brain will have found a means of achieving this function.²
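One way to express such a synaptic-depression-style onset filter on the LV drive is sketched below: each active synapse loses efficacy while its input stays on and recovers once the input goes off, so only changes in the input drive LV/DA bursts. The class name and depletion/recovery constants are illustrative assumptions, not the published mechanism.

```python
import numpy as np

class OnsetFilter:
    """Synaptic-depression-style filter: passes stimulus onsets, attenuates sustained input."""
    def __init__(self, n_inputs, depletion=0.8, recovery=0.2):
        self.avail = np.ones(n_inputs)    # available synaptic resources per input
        self.depletion = depletion
        self.recovery = recovery

    def __call__(self, x):
        effective = x * self.avail                         # depressed drive onto LV
        self.avail -= self.depletion * effective           # active synapses deplete
        self.avail += self.recovery * (1.0 - self.avail)   # all synapses slowly recover
        self.avail = np.clip(self.avail, 0.0, 1.0)
        return effective

f = OnsetFilter(1)
print([round(float(f(np.array([1.0]))[0]), 2) for _ in range(4)])
# Output falls from 1.0 at stimulus onset toward a much weaker sustained level.
```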
For full details about the PVLV algorithm and implementation,
see the Appendix.
Application to Conditioning Data
At the level of the basic DA firing data represented in Figure 1,
both TD and PVLV account for the most basic findings of DA
bursting at tone onset and cancellation of the burst at reward
delivery. However, as noted earlier, simple TD models (but not
PVLV) also predict a chaining of DA bursts “backward in time”
from the reward to the stimulus onset, which has not been reliably
observed empirically (Fiorillo, Tobler, & Schultz, 2005; Pan et al.,
2005). However, this particular aspect of the data is still contro-
versial (e.g., Niv, Duff, & Dayan, 2005) and also depends critically
on the way that the input environment is represented. For example,
Pan et al. (2005) recently showed how a TD(λ) model with a high
lambda value could reproduce the empirically observed pattern
(i.e., no evidence of backward marching dopamine bursts). Fur-
thermore, the data often show dopamine bursts at both the CS and
US (Pan et al., 2005; Schultz, 2002)—this is incompatible with
¹ A simpler possible implementation would be to just add the two delta
values to produce a summed DA value, but this double counts the reward-related
deltas because both the LV and PV contribute in this case. Nevertheless,
because LV and PV deltas otherwise occur at different times,
Equation 11 is very similar to adding the deltas; the PV system just
dominates when external rewards are presented or expected. It is also
possible to consider an additive equation that also conditionalizes the
contribution of the PV component; this was found in O’Reilly and Frank
(2006) to work slightly better than Equation 11 in a working memory
model (see Appendix for details).

² Available evidence suggests that a mechanism such as proposed here
most likely exists in the pathway somewhere distal to the LVe representations
themselves (which PVLV proposes to be in the central nucleus of
the amygdala, see below), as electrophysiological recording data show
sustained (i.e., not onset-only) firing in CNA cells throughout CS duration
(Ono, Nishijo, & Uwano, 1995). For example, downstream synaptic
depression/habituation may occur in the pedunculopontine nucleus, or
it could be intrinsic to local dynamics in the midbrain dopamine nuclei
themselves.

References

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Zucker, R. S., & Regehr, W. G. (2002). Short-term synaptic plasticity. Annual Review of Physiology, 64, 355–405.