PVLV: The Primary Value and Learned Value Pavlovian
Learning Algorithm
Randall C. O’Reilly
University of Colorado at Boulder
Michael J. Frank
University of Arizona
Thomas E. Hazy and Brandon Watz
University of Colorado at Boulder
The authors present their primary value learned value (PVLV) model for understanding the reward-
predictive firing properties of dopamine (DA) neurons as an alternative to the temporal-differences (TD)
algorithm. PVLV is more directly related to underlying biology and is also more robust to variability in
the environment. The primary value (PV) system controls performance and learning during primary
rewards, whereas the learned value (LV) system learns about conditioned stimuli. The PV system is
essentially the Rescorla–Wagner/delta-rule and comprises the neurons in the ventral striatum/nucleus
accumbens that inhibit DA cells. The LV system comprises the neurons in the central nucleus of the
amygdala that excite DA cells. The authors show that the PVLV model can account for critical aspects
of the DA firing data, making a number of clear predictions about lesion effects, several of which are
consistent with existing data. For example, first- and second-order conditioning can be anatomically
dissociated, which is consistent with PVLV and not TD. Overall, the model provides a biologically
plausible framework for understanding the neural basis of reward learning.
Keywords: basal ganglia, dopamine, reinforcement learning, Pavlovian conditioning, computational
modeling
An important and longstanding challenge for both the cognitive
neuroscience and artificial intelligence communities has been to
develop an adequate understanding (and a correspondingly robust
model) of Pavlovian learning. Such a model should account for the
full range of signature findings in the rich literature on this phe-
nomenon. Pavlovian conditioning refers to the ability of previ-
ously neutral stimuli that reliably co-occur with primary rewards to
elicit new conditioned behaviors and to take on reward value
themselves (e.g., Pavlov’s famous case of the bell signaling food
for hungry dogs; Pavlov, 1927).
Pavlovian conditioning is distinguished from instrumental con-
ditioning in that the latter involves the learning of new behaviors
that are reliably associated with reward, either first order (US), or
second order (CS). Although Pavlovian conditioning also involves
behaviors (conditioned and unconditioned responses), reward de-
livery is not contingent on behavior but is instead reliably paired
with a stimulus regardless of behavior. In contrast, instrumental
conditioning explicitly makes reward contingent on a particular
“operant” or “instrumental” response. Both stimulus–reward (Pav-
lovian) and stimulus–response–reward (instrumental) associations,
however, are thought to be trained by the same phasic dopamine
signal that occurs at the time of primary reward (US) as described
below. In practice, the distinction is often blurry as the two types
of conditioning interact (e.g., second-order instrumental condition-
ing and so-called Pavlovian instrumental transfer effects).
The dominant theoretical perspective for both Pavlovian and
instrumental conditioning since the seminal Rescorla and Wagner
(1972) model is that learning is based on the discrepancy between
actual rewards received and predictions thereof (i.e., reward pre-
diction error). Currently, the temporal differences (TD) reward
prediction framework (Sutton, 1988; Sutton & Barto, 1998) is by
far the most widely adopted computational level account of Pav-
lovian (and instrumental) conditioning and dopamine firing (e.g.,
Barto, 1995; Daw, Courville, & Touretzky, 2003; Daw, Kakade, &
Dayan, 2002; Dayan, 2001; Dayan & Balleine, 2002; Houk, Ad-
ams, & Barto, 1995; Kakade & Dayan, 2002a, 2002b; Montague,
Dayan, & Sejnowski, 1996; Suri & Schultz, 1999, 2001; see
Brown, Bullock, & Grossberg, 1999; Contreras-Vidal & Schultz,
1999; Sporns & Alexander, 2002, for alternative models, and Joel,
Niv, & Ruppin, 2002, for a biologically oriented review).
One important reason for the popularity of TD is that a reward
prediction error signal has been established in the brain, in the
pattern of midbrain dopamine neuron activation (e.g., Schultz,
1998; Schultz, Apicella, & Ljungberg, 1993; see Figure 1). These
neurons initially fire short phasic bursts of activity for primary
rewards and over the course of learning come to fire similarly at
the onset of previously neutral, reward-predictive stimuli (i.e.,
conditioned stimuli; CS), and no longer to the reward itself.
Generally, there is a time period when both CS- and reward-related
firing is occurring (Pan, Schmidt, Wickens, & Hyland, 2005;
Schultz, 2002).

Randall C. O’Reilly, Thomas E. Hazy, and Brandon Watz, Department
of Psychology, University of Colorado at Boulder; Michael J. Frank,
Department of Psychology and Program in Neuroscience, University of
Arizona. This work was supported by Office of Naval Research Grant
N00014-03-1-0428 and National Institute of Mental Health Grants
MH069597 and MH64445. We thank Peter Dayan, Nathaniel Daw, Yael Niv,
Eric Claus, and the CCN lab for discussions of these ideas.
Correspondence concerning this article should be addressed to Randall
C. O’Reilly, Department of Psychology, University of Colorado at
Boulder, 345 UCB, Boulder, CO 80309. E-mail: oreilly@psych.colorado.edu
However, it remains unclear exactly what brain mechanisms
lead to this behavior on the part of dopamine cells. Most research-
ers agree that the critical learning processes are taking place
upstream from the midbrain dopamine neurons themselves. But
which areas are doing what? Because it is an abstract, unitary (and
elegant) framework, the TD model does not map directly onto the
relatively large collection of neural substrates known to be in-
volved in reinforcement learning, including areas of the basal
ganglia, amygdala, midbrain dopamine nuclei, and ventromedial
prefrontal cortex (Cardinal, Parkinson, Hall, & Everitt, 2002).
Indeed, relatively few specific proposals have been made for a
biological mapping of the TD model (Houk et al., 1995; Joel et al.,
2002).
In this article, we offer a multicomponent model of Pavlovian
learning called PVLV, which provides a more direct mapping onto
the underlying neural substrates. PVLV is composed of two sub-
systems: primary value (PV) and learned value (LV). The PV
system is engaged by primary reward (i.e., an unconditioned
stimulus; US) and learns to expect the occurrence of a given US,
thereby inhibiting the dopamine burst that would otherwise occur
for it. The LV system learns about conditioned stimuli that are
reliably associated with primary rewards, and it drives phasic
dopamine burst firing at the time of CS onset. This decomposition
is similar to the model of Brown et al. (1999), but as we discuss
later, there are several important functional and anatomical differ-
ences between the two models.
The PV and LV systems are further subdivided into excitatory
and inhibitory subcomponents, which provide a good fit with a
wide range of data (reviewed in detail later) on three different
brain areas. The excitatory subcomponent of PV (denoted PVe) is
associated with the reward-driven excitatory projections from the
lateral hypothalamus onto midbrain dopamine neurons in the sub-
stantia nigra pars compacta (SNc) and the ventral tegmental area
(VTA) as we discuss in more detail later in the section “Biological
Mapping of PVLV.” The inhibitory subcomponent of PV (PVi) is
associated with neurons in the ventral striatum/nucleus accumbens
(VS/NAc) that have direct GABAergic projections to the SNc and
VTA and fire just in advance of primary rewards. The excitatory
subcomponent of the LV system (LVe) is associated with neurons
in the central nucleus of the amygdala (CNA), which have a net
excitatory effect on the SNc and VTA. Thus, we suggest that these
CNA neurons learn to associate CSs with reward and drive exci-
tatory dopamine bursts at CS onset. Finally, there is an inhibitory
component of the LV (LVi) that is also associated with the VS/
NAc, which slowly learns to inhibit the excitatory LVe drive on
the dopamine neurons.
In addition to these core PVLV mechanisms, a number of other
brain areas play a critical role in reinforcement learning. For
example, we think of the prefrontal cortex (PFC) and hippocampus
as providing something akin to an eligibility trace (as in TD(λ);
Sutton & Barto, 1998; Pan et al., 2005). We believe this sort of
actively maintained working memory representation is particularly
crucial in trace conditioning paradigms in which there is an inter-
val of time between CS-offset and US-onset. As we discuss later,
PVLV explicitly accounts for known dissociations between delay
versus trace conditioning paradigms that occur under PFC and/or
hippocampal lesions, something about which TD is less explicit. In
fact, PVLV actually requires that models learn to hold onto work-
ing memory representations under trace conditioning paradigms.
Although for the models in this article we apply the working
memory inputs directly (so as to focus on the core PVLV mech-
anisms), our larger prefrontal cortex-basal ganglia (PBWM)
model, of which PVLV is a subcomponent, demonstrates how the
system can learn to maintain task-relevant information in working
memory (O’Reilly & Frank, 2006). TD does not address the
learning of working memory representations explicitly; instead, it
finesses the issue by assuming that the needed information is just
there in the eligibility trace.
In addition, the cerebellum (and possibly other brain areas)
provides a representation of time (e.g., Mauk & Buonomano,
2004; Ivry, 1996) that acts as an additional input signal that can
become associated with reward, as in the framework of Savastano
and Miller (1998). The basolateral nucleus of the amygdala (BLA)
is important for second-order conditioning in this framework be-
cause detailed properties of the PVLV mechanism prevent the LVe
(CNA) from performing both first- and second-order conditioning.
This is consistent with data showing anatomical dissociations
between these forms of conditioning (e.g., Hatfield, Han, Conley,
& Holland, 1996). Note that this dissociation, and many others
reviewed later, would not be predicted by the abstract, unitary TD
mechanism. Thus, the PVLV mechanism provides an important
bridge between the more abstract TD model and the details of the
underlying neural systems.
The mapping between TD and PVLV is not perfect, however,
and PVLV makes several distinctive predictions relative to TD in
various behavioral paradigms. For example, PVLV strongly pre-
dicts that higher order conditioning beyond second-order should be
weak to nonexistent, whereas TD makes no distinction between
these different levels of conditioning. There is a remarkable
lack of published work on third or higher levels of conditioning,
and the two references we were able to find indicate that it is
nonexistent or weak at best (Denny & Ratner, 1970; Dinsmoor,
2001). Another difference comes from paradigms with variable
CS–US intervals. As we show later, PVLV is very robust to this
variability, but TD is not. The data indicate that animals are also
robust to this form of variability (H. Davis, McIntire, & Cohen,
1969; Kamin, 1960; Kirkpatrick & Church, 2000). PVLV also
makes a very strong distinction between delay and trace conditioning,
as do animals, whereas this distinction in TD is considerably more
arbitrary.

Figure 1. Schematic of dopamine (DA) recording data for a simple
conditioning experiment in which a conditioned stimulus (CS) reliably
precedes the delivery of a reward (Rew). During acquisition (a), DA
initially bursts at the time of reward delivery but then starts spiking at
stimulus onset, diminishing at the time of reward delivery. Note that there
is no strong evidence of a backward-propagating burst over training, as
predicted by some versions of the temporal-differences model but not by
primary value learned value (PVLV). After training (b), if reward is
omitted, a dip in DA below baseline tonic levels is observed.
The remainder of the article is organized as follows. First we
develop the PVLV algorithm at a computational level and
provide several demonstrations of basic Pavlovian learning
phenomena by using the PVLV model. Next, we discuss the
mapping of PVLV onto the brain areas as summarized above
and review a range of empirical data that are consistent with
this model. We conclude by comparing our model with other
models in the literature, including the Brown et al. (1999)
model, which has several important similarities and several
differences relative to our model.
The PVLV Algorithm
The PVLV algorithm starts with the basic Rescorla and Wagner
(1972) learning rule (which is formally identical to the earlier delta
rule; Widrow & Hoff, 1960, as originally pointed out by Sutton &
Barto, 1981), which captures the core principle that learning
should be based on the discrepancy between predictions and actual
outcomes:
$$\delta_t = r_t - \hat{r}_t , \qquad (1)$$

where $r_t$ is the current reward value at time $t$, $\hat{r}_t$ is the expected or
predicted reward value, and $\delta_t$ is the discrepancy or error between
the two. This $\delta_t$ value then drives synaptic weight changes for the
system computing $\hat{r}_t$. For example, a simple neural model would
involve a single neural unit that computes the estimated value $\hat{r}_t$ by
using synaptic weights $w_i$ from a set of sensory inputs $x_i$:

$$\hat{r}_t = \sum_i w_i^t x_i^t . \qquad (2)$$

The change in the weight values needed to improve the estimated
reward value is simply

$$\Delta w_i^t = \epsilon \, \delta_t \, x_i^t . \qquad (3)$$
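To make Equations 1–3 concrete, the following minimal sketch applies the delta rule to a single rewarded stimulus. It is illustrative only: the variable names and parameter values (e.g., a learning rate of 0.1) are assumptions for the example, not values from the published implementation.

```python
import numpy as np

# Delta-rule (Rescorla-Wagner) sketch of Equations 1-3, with illustrative values.
epsilon = 0.1                    # learning rate
w = np.zeros(2)                  # weights w_i from two input stimuli
x = np.array([1.0, 0.0])         # stimulus A present, stimulus B absent

for trial in range(100):
    r_hat = w @ x                # predicted reward, Equation 2
    r = 1.0                      # actual reward delivered with stimulus A
    delta = r - r_hat            # prediction error, Equation 1
    w += epsilon * delta * x     # weight update, Equation 3

print(np.round(w, 2))            # weight for stimulus A approaches 1.0; B stays 0.0
```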
This model does an excellent job of learning to expect primary
rewards, and, if we take the $\delta_t$ to represent the dopamine firing
deviations from baseline, it can explain the cancellation of
dopamine bursting at the onset of the US in a classic Pavlovian
paradigm (Figure 1). However, it cannot account for the firing
of dopamine bursts at the earlier onset of a CS because in fact
there is no actual primary reward ($r_t$) present at that time, and
thus the system will not learn to expect anything at that time.
This CS-triggered dopamine firing plays a critical functional
role in learning because it allows the system to learn which
situations and actions can lead to subsequent reward. For ex-
ample, initial exposure to the presence of cookies in a cookie jar
can enable a subsequent dopamine-reinforced approach and
opening of the jar.
The TD algorithm corrects this critical limitation of the
Rescorla–Wagner algorithm by adopting a temporally extended
prediction framework, where the objective is to predict future
rewards, not just present rewards. The consequence of this is that
the $\delta_t$ at one point in time drives learning based on the immediately
prior sensory input state $x_i^{t-1}$. This produces a chain-reaction
effect in which a reward prediction error at one point in time
propagates earlier and earlier in time, to the earliest reliable predictor
of a subsequent reward. Hence, the $\delta_t$ value, and thus the
dopamine bursting, can move earlier in time to the onset of the CS.
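For comparison, the following sketch shows the chaining behavior of a generic tabular TD(0) learner (not part of PVLV): the prediction error at each time step trains the value of the preceding step, so reward prediction creeps backward toward CS onset only over repeated trials. The step counts and learning rate are arbitrary illustrative choices.

```python
import numpy as np

# Generic TD(0) sketch illustrating backward chaining of reward prediction.
n_steps = 5                 # time steps within a trial; CS at t=0, US at t=4
gamma = 1.0                 # no discounting, for simplicity
alpha = 0.1
V = np.zeros(n_steps + 1)   # predicted future reward at each step (terminal = 0)

for trial in range(200):
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0   # primary reward at the last step
        delta = r + gamma * V[t + 1] - V[t]    # TD error
        V[t] += alpha * delta                  # trains the preceding prediction

print(np.round(V[:n_steps], 2))  # early steps acquire value only via the chain
```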
The PVLV algorithm takes a different approach: The basic
Rescorla–Wagner learning rule is retained as the PV (primary
value) system, and an additional system (LV, learned value) is
added to learn about reward associations for conditioned stimuli.
In addition to the biological motivations for such a division of
labor mentioned earlier (and elaborated below), there are some
computational advantages for adopting this approach. Principally,
the relationship between a CS and a subsequent US is not always
very reliable, and having separate PV and LV systems enables the
system to be very robust to such variability. In contrast, the
chaining mechanism present in the TD algorithm is designed to
work optimally when there is a very reliable sequence of events
leading from the CS to the US. Intuitively, the chain between CS
and US must remain unbroken for the predictive signal to propa-
gate backward over learning, and this chain is only as strong as its
weakest link. This problem can be mitigated to some extent by
using an eligibility trace as in TD(λ), where 0 < λ < 1 parameterizes
an exponentially decaying trace of the input stimuli used for
learning. This can smooth over rough spots in the chain but at the
potential cost of reducing the temporal precision of reward pre-
dictions as a result of excessive smearing. In contrast, PVLV
avoids this problem entirely by not relying on a chaining mecha-
nism at all.
There are many situations in which the CS–US relationship is
unreliable. For example, in many working memory tasks, a highly
variable number of distractor stimuli can intervene between a
stimulus to be encoded in working memory and the subsequent
demand to recall that stimulus (Hochreiter & Schmidhuber, 1997;
O’Reilly & Frank, 2006). Any dog owner knows that dogs come to
associate the jingling of a leash with the idea that they will soon be
going on a walk, despite a variable amount of time and intervening
events between the leash jingle and the walk itself (e.g., the owner
may go to the bathroom, turn off the television, and check e-mail).
In the animal learning literature, there are (only) a few experiments
in which the CS–US relationship is variable (H. Davis et al., 1969;
Kamin, 1960; Kirkpatrick & Church, 2000), but it is clear that
conditioning is very robust in this case, equivalent to comparison
conditions that have fixed CS–US intervals. This finding is con-
sistent with PVLV and poses a challenge to TD-based approaches.
In short, we think the PVLV mechanism has the simplicity and
robustness that are often characteristic of biological systems, with
the cost of being less elegant than the TD system (two systems are
required instead of one). In the subsequent sections, we provide the
details for how the PV and LV systems operate.
The PV System
We can rewrite the Rescorla–Wagner equation in terms of the
excitatory (PVe) and inhibitory (PVi) subcomponents of the PV
system. The excitatory PV system represents the value implicitly
hardwired into a primary reward (US), $PV_e^t = r_t$ in the notation of
Rescorla–Wagner, whereas the inhibitory system learns to cancel
out these rewards, $PV_i^t = \hat{r}_t$. Thus, in this terminology, the PV
delta is

$$\delta_{pv}^t = PV_e^t - PV_i^t = r_t - \hat{r}_t , \qquad (4)$$
and this value is used to train the PVi system as described
earlier (Equation 3). As a consequence, when primary rewards
are delivered, the PVi system associates the current state of the
system with the US (reward). This current state information
includes any sensory inputs that coincide with reward, together
with internally generated timing signals (e.g., if rewards are
always delivered precisely 2 s following an input stimulus, then
the 2-s timing signal becomes associated with the US just as an
external sensory stimulus can become associated with it; Savas-
tano & Miller, 1998). As these associations increase, $PV_i^t$ at the
time of primary reward increases to match $PV_e^t$, and the $\delta_{pv}^t$
value (i.e., dopamine bursting) decreases, which is the observed
pattern.
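A minimal sketch of the PV computation just described (Equations 3 and 4): PVe is clamped to the primary reward value, while PVi uses the delta rule to learn an expectation that cancels the DA burst at the time of the US. The function and variable names (pv_step, w_pvi), the timing-unit input, and the parameter values are illustrative assumptions, not the authors' published code.

```python
import numpy as np

epsilon_pv = 0.1
w_pvi = np.zeros(3)               # PVi weights from state inputs (e.g., timing units)

def pv_step(x, r):
    """One time step of the PV system; x = state input vector, r = primary reward."""
    pv_e = r                      # PVe: hardwired primary reward value
    pv_i = w_pvi @ x              # PVi: learned expectation of that reward
    delta_pv = pv_e - pv_i        # Equation 4
    w_pvi[:] += epsilon_pv * delta_pv * x   # delta-rule training of PVi (Equation 3)
    return delta_pv

# Repeated pairings of the same state with reward shrink delta_pv
# (the DA burst at US onset) toward zero:
state = np.array([0.0, 0.0, 1.0])            # e.g., a "2 s after CS" timing unit
for trial in range(50):
    d = pv_step(state, r=1.0)
print(round(float(d), 3))
```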
The LV System
The LV system also uses the Rescorla–Wagner learning rule but
has a few key differences that enable it to signal reward associa-
tions at the time of CS onset. Like the PV system, the LV system
has two components, excitatory (LVe) and inhibitory (LVi). We
focus first on the LVe component, which learns CS associations
and drives the excitatory dopamine bursts at CS onset. The most
important property of the LVe system is that it only learns when
primary rewards are present or expected. In contrast, the PVi
system learns at all times about the current primary reward status
(PVe or r
t
). This difference protects the LVe system from having
to learn that there are no actual primary rewards present at the time
of CS onset. Therefore, unlike the PV system, it is able to signal
the reward association of a CS and not have this signal (dopamine
burst) trained away, as otherwise it would be if pure Rescorla–
Wagner learning were at work.
More formally, the LVe learning is conditioned on the state
of the PV system, according to the following filtering condition:
$$PV_{filter} = \left( PV_i^t > \theta_{pv} \right) \text{ or } \left( PV_e^t > \theta_{pv} \right) , \qquad (5)$$

where $\theta_{pv}$ is a threshold on PV activation, above which it is
considered that the PV system is expecting or receiving a reward
at this time (in the Appendix we present a more general condition
that allows for representation of both reward and punishment
expectations).

For clarity, note that $PV_{filter}$ is thus a boolean variable such that

$$PV_{filter} = \begin{cases} 1 & \text{if primary reward present or expected} \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

The boolean value of $PV_{filter}$ then regulates the learning of the LVe
system,

$$\Delta w_i^t = \begin{cases} \epsilon \left( PV_e^t - LV_e^t \right) x_i^t & \text{if } PV_{filter} \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$
The dependence of the secondary LV system on the primary PV
system for learning ensures that actual reward outcomes have the
final say in shaping all of the reward associations learned by the
system. Also, note that it is the primary reward value itself (PV
e
t
or r
t
) that drives the learning of the LV system, not the PV or LV
delta value, which is defined next. These features have important
implications for understanding various conditioning phenomena,
as elaborated below.
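The gating of LVe learning by the PV filter (Equations 5–7) can be sketched as follows. The example assumes delay conditioning, so the CS input is still active at the time of the US; the names, threshold, and learning rate are illustrative assumptions rather than the published implementation.

```python
import numpy as np

theta_pv = 0.2                    # threshold on PV activation (Equation 5)
epsilon_lv = 0.1
w_lve = np.zeros(2)               # LVe weights from stimulus inputs

def lve_step(x, pv_e, pv_i):
    lv_e = w_lve @ x
    pv_filter = (pv_i > theta_pv) or (pv_e > theta_pv)      # Equations 5-6
    if pv_filter:                                           # Equation 7
        w_lve[:] += epsilon_lv * (pv_e - lv_e) * x
    return lv_e

cs = np.array([1.0, 0.0])
for trial in range(50):
    lve_step(cs, pv_e=1.0, pv_i=0.0)       # US time: PV_filter holds, so LVe learns
print(round(float(lve_step(cs, pv_e=0.0, pv_i=0.0)), 2))
# CS onset on a later trial: LVe fires (~1.0) but is not trained away,
# because PV_filter is false and no learning occurs at that time.
```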
The LVi system performs a similar role for a learned CS as the
PVi system does for the US: It learns to cancel out dopamine
bursts for a highly learned CS. The LVi system is essentially the
same as the LVe system, except that it uses a slower learning rate
(ε), and it produces a net inhibitory drive on the dopamine system
like the PVi system. The LV delta is then the difference between
the excitatory and inhibitory components (just as with the PV
delta),
$$\delta_{lv}^t = LV_e^t - LV_i^t . \qquad (8)$$
Because of its slower learning rate, LVi slowly learns which
CSs are reliably associated with reward and decreases the
dopamine bursts for such CSs relative to those that have more
recently become associated with reward (which have been
learned by the faster LVe but not the slower LVi system).
Furthermore, if a CS that has been reliably associated with
reward subsequently becomes less strongly associated with
reward, the LV delta can become negative (because LVe has
learned this new lower reward association, but LVi retains the
previous more positive association), indicating the change in
reward association. Thus, consistent with the computational
motivation for the delta rule, the LV delta in Equation 8
represents the discrepancy between what was previously known
or expected (as encoded in the LVi weights of the system
through prior learning) and what is more recently happening
(encoded through the LVe weights). This LVi value does not
much affect the simple conditioning simulations shown below,
but it is more important for the efficacy of PVLV in training an
actor (in our case for working memory updating; O’Reilly &
Frank, 2006). Specifically, without LVi a stimulus associated
with reward would always drive a DA burst (even if its reward
association had recently decreased), and it would always rein-
force actions with a constant dopamine burst, to the point that
such actions would be massively overlearned.
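The role of the slower LVi component (Equation 8) can be illustrated by running the same learning rule at two rates: after acquisition, the LV delta is positive for a recently learned CS and shrinks as LVi catches up. The learning rates and names below are illustrative assumptions.

```python
import numpy as np

eps_lve, eps_lvi = 0.1, 0.01          # fast LVe, slow LVi
w_lve, w_lvi = np.zeros(2), np.zeros(2)

def lv_learn(x, pv_e, pv_filter):
    if pv_filter:                     # both LV components learn only when PV is active
        w_lve[:] += eps_lve * (pv_e - w_lve @ x) * x
        w_lvi[:] += eps_lvi * (pv_e - w_lvi @ x) * x

def lv_delta(x):
    return w_lve @ x - w_lvi @ x      # Equation 8

cs = np.array([1.0, 0.0])
for trial in range(50):
    lv_learn(cs, pv_e=1.0, pv_filter=True)
print(round(float(lv_delta(cs)), 2))  # positive for a recently learned CS; decays with further training
```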
How do the PV and LV systems each contribute to the
dopamine output signal? Because there are two delta signals in
PVLV, from PV and LV, these need to be combined to produce
an overall delta value that can be used as a global dopamine
signal (e.g., to train an actor system in an actor–critic architecture).
The most functionally transparent mechanism is to
have the PV delta apply whenever there is a primary reward
present or expected by the PV system. But when no rewards are
present, the LV delta can still drive dopamine firing. As before
(see Equation 5), PVLV implements this by using the boolean
variable $PV_{filter}$, where

$$PV_{filter} = \begin{cases} 1 & \text{if primary reward present or expected} \\ 0 & \text{otherwise,} \end{cases} \qquad (9)$$

and $PV_{filter}$ is evaluated as

$$PV_{filter} = \left( PV_i^t > \theta_{pv} \right) \text{ or } \left( PV_e^t > \theta_{pv} \right) . \qquad (10)$$

Thus,

$$\delta_t = \begin{cases} \delta_{pv}^t & \text{if } PV_{filter} \\ \delta_{lv}^t & \text{otherwise.} \end{cases} \qquad (11)$$

This is also consistent with the equation that determines when the
LV system learns according to PV expectations and actual reward
delivery (Equation 7).¹
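Equations 9–11 amount to a simple selection rule, sketched below; the helper name overall_delta and the threshold value are assumptions for illustration, not part of the published implementation.

```python
def overall_delta(pv_e, pv_i, lv_e, lv_i, theta_pv=0.2):
    """Combine PV and LV deltas into one dopamine-like signal (Equations 9-11)."""
    delta_pv = pv_e - pv_i                                 # Equation 4
    delta_lv = lv_e - lv_i                                 # Equation 8
    pv_filter = (pv_i > theta_pv) or (pv_e > theta_pv)     # Equations 9-10
    return delta_pv if pv_filter else delta_lv             # Equation 11

print(overall_delta(pv_e=1.0, pv_i=0.1, lv_e=0.75, lv_i=0.25))  # US present: PV delta (0.9)
print(overall_delta(pv_e=0.0, pv_i=0.0, lv_e=0.75, lv_i=0.25))  # CS onset only: LV delta (0.5)
```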
Figure 2 summarizes the PVLV system’s operation in the simple
CS–US conditioning paradigm we have been considering. The PV
continuously learns about the occurrence of primary rewards (both
presence and absence), and as it learns to expect reward delivery
it cancels the dopamine burst (i.e., PV delta value) that would
otherwise occur at that time. The reward also trains the LV system,
which produces increasing weights from the CS (as long as this is
active in the input at this time). On subsequent trials, the LV
system is then able to fire naturally at CS onset, producing a
dopamine burst (i.e., LV delta value). By this mechanism, the time
gap between CS-onset and US is bridged automatically by the
CS–US association, without recourse to the kind of explicit pre-
diction that is central to the TD model. The biological mapping of
the PVLV mechanisms shown in the figure is discussed in detail
below.
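A small end-to-end sketch of the acquisition dynamics summarized above is given below, under simplifying assumptions: discrete time steps, delay conditioning (the CS stays on through the US), an interval-timing input for PVi, and an onset-only CS signal standing in for the synaptic-depression mechanism described in the next section. All parameter values and names are illustrative, not the published implementation.

```python
eps_pv, eps_lve, eps_lvi = 0.1, 0.1, 0.01
theta_pv = 0.2
w_pvi = w_lve = w_lvi = 0.0

for trial in range(30):
    das = []
    for t in range(3):                        # t=0: CS onset, t=2: US delivery
        cs_present = 1.0                      # delay conditioning: CS on throughout
        cs_onset = 1.0 if t == 0 else 0.0     # onset-only drive on the LV output
        timing = 1.0 if t == 2 else 0.0       # timing unit at the expected US time
        r = 1.0 if t == 2 else 0.0            # primary reward (US)

        pv_e, pv_i = r, w_pvi * timing
        lv_e, lv_i = w_lve * cs_onset, w_lvi * cs_onset
        pv_filter = (pv_i > theta_pv) or (pv_e > theta_pv)
        das.append((pv_e - pv_i) if pv_filter else (lv_e - lv_i))

        w_pvi += eps_pv * (pv_e - pv_i) * timing           # PVi learns at all times
        if pv_filter:                                      # LV learns only when PV is active
            w_lve += eps_lve * (pv_e - w_lve * cs_present) * cs_present
            w_lvi += eps_lvi * (pv_e - w_lvi * cs_present) * cs_present

print([round(d, 2) for d in das])   # after training: burst at CS onset, little at the US
```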
Additional Mechanisms
There are two additional mechanisms required for the overall
system to function (and to be consistent with available data). First
(as previously noted), the PVi system must take advantage of some
kind of timing signal that enables it to fire at the expected time of
actual reward input and not otherwise. In Figure 2B, we illustrate
a ramping timing signal triggered by CS onset, which is intended
to represent the kind of interval timing signal provided by the
cerebellum (e.g., Ivry, 1996; Mauk & Buonomano, 2004), but any
kind of regular activity pattern would work just as well for our
model (see Lustig, Matell, & Meck, 2005, for a model of timing
signals within the basal ganglia). We discuss this issue further in
comparison with an alternative model of DA firing by Brown et al.
(1999) below, which depends on an intrinsic timing mechanism as
an integral part of their system.
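As one concrete possibility (purely illustrative, not the authors' implementation), a CS-onset-triggered timing input can be represented as a bank of units coarsely coding elapsed time, which the PVi delta rule can then associate with the US just like an external sensory input:

```python
import numpy as np

def timing_input(t_since_cs, n_units=10, width=1.0):
    """Coarse-coded (Gaussian-bump) representation of time elapsed since CS onset."""
    centers = np.arange(n_units)
    return np.exp(-0.5 * ((t_since_cs - centers) / width) ** 2)

# The pattern active at the expected US time (e.g., 2 steps after CS onset)
# is what PVi learns to associate with the reward.
print(np.round(timing_input(2), 2))
```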
The second additional mechanism required is a novelty de-
tection (and familiarity suppression) mechanism, so that the LV
system does not continue to trigger dopamine spiking during the
entire duration of CS input. With such a mechanism in place,
the first onset of a stimulus input triggers a burst of LV firing,
but this then decreases as the stimulus stays on. One solution to
this problem is to use a habituation mechanism on the LV
system to achieve this effect (e.g., Brown et al., 1999), but this
would generalize across various different stimuli and would
therefore prevent a second stimulus that could be associated
with a different or larger reward from evoking DA firing.
Instead, in our implementation we have adopted a synaptic
depression mechanism (e.g., Abbott, Varela, Sen, & Nelson,
1997; Markram & Tsodyks, 1996; Zucker & Regehr, 2002;
Huber & O’Reilly, 2003), which causes a habituation of the LV
DA-burst firing response only to the stimulus that was initially
active (i.e., only active synapses are depressed). With this
mechanism in place, the LVe system accommodates to any
constant sensory inputs and responds only to changes in input
signals, causing it to fire only at the onset of a stimulus tone.
Such synaptic depression mechanisms are ubiquitous through-
out the vertebrate and invertebrate brain (Zucker & Regehr,
2002). Nevertheless, there are a large number of ways of
implementing such an overall function, so we are confident that,
if our overall hypothesis about the PVLV mechanism is correct,
the brain will have found a means of achieving this function.²
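One way to express such a synaptic-depression-style onset filter on the LV drive is sketched below: each active synapse loses efficacy while its input stays on and recovers once the input goes off, so only changes in the input drive LV/DA bursts. The class name and depletion/recovery constants are illustrative assumptions, not the published mechanism.

```python
import numpy as np

class OnsetFilter:
    """Synaptic-depression-style filter: passes stimulus onsets, attenuates sustained input."""
    def __init__(self, n_inputs, depletion=0.8, recovery=0.2):
        self.avail = np.ones(n_inputs)    # available synaptic resources per input
        self.depletion = depletion
        self.recovery = recovery

    def __call__(self, x):
        effective = x * self.avail                         # depressed drive onto LV
        self.avail -= self.depletion * effective           # active synapses deplete
        self.avail += self.recovery * (1.0 - self.avail)   # all synapses slowly recover
        self.avail = np.clip(self.avail, 0.0, 1.0)
        return effective

f = OnsetFilter(1)
print([round(float(f(np.array([1.0]))[0]), 2) for _ in range(4)])
# Output falls from 1.0 at stimulus onset toward a much weaker sustained level.
```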
For full details about the PVLV algorithm and implementation,
see the Appendix.
Application to Conditioning Data
At the level of the basic DA firing data represented in Figure 1,
both TD and PVLV account for the most basic findings of DA
bursting at tone onset and cancellation of the burst at reward
delivery. However, as noted earlier, simple TD models (but not
PVLV) also predict a chaining of DA bursts “backward in time”
from the reward to the stimulus onset, which has not been reliably
observed empirically (Fiorillo, Tobler, & Schultz, 2005; Pan et al.,
2005). However, this particular aspect of the data is still contro-
versial (e.g., Niv, Duff, & Dayan, 2005) and also depends critically
on the way that the input environment is represented. For example,
Pan et al. (2005) recently showed how a TD(λ) model with a high
lambda value could reproduce the empirically observed pattern
(i.e., no evidence of backward marching dopamine bursts). Fur-
thermore, the data often show dopamine bursts at both the CS and
US (Pan et al., 2005; Schultz, 2002)—this is incompatible with
¹ A simpler possible implementation would be to just add the two delta
values to produce a summed DA value, but this double counts the reward-related
deltas because both the LV and PV contribute in this case. Nevertheless,
because LV and PV deltas otherwise occur at different times,
Equation 11 is very similar to adding the deltas; the PV system just
dominates when external rewards are presented or expected. It is also
possible to consider an additive equation that also conditionalizes the
contribution of the PV component; this was found in O’Reilly and Frank
(2006) to work slightly better than Equation 11 in a working memory
model (see Appendix for details).

² Available evidence suggests that a mechanism such as proposed here
most likely exists in the pathway somewhere distal to the LVe representations
themselves (which PVLV proposes to be in the central nucleus of
the amygdala, see below), as electrophysiological recording data show
sustained (i.e., not onset-only) firing in CNA cells throughout CS duration
(Ono, Nishijo, & Uwano, 1995). For example, downstream synaptic
depression/habituation may occur in the pedunculopontine nucleus, or
it could be intrinsic to local dynamics in the midbrain dopamine nuclei
themselves.

References

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Zucker, R. S., & Regehr, W. G. (2002). Short-term synaptic plasticity. Annual Review of Physiology, 64, 355–405.