Mixing memory and desire: How memory reactivation supports
deliberative decision-making.
Shaoming Wang 1*, Samuel F. Feng 2,3, Aaron M. Bornstein 4,5,6*

1 Department of Psychology, New York University
2 Department of Mathematics, Khalifa University of Science and Technology, Abu Dhabi, UAE
3 Khalifa University Centre for Biotechnology, Khalifa University of Science and Technology, Abu Dhabi, UAE
4 Department of Cognitive Sciences, University of California, Irvine
5 Center for the Neurobiology of Learning & Memory, University of California, Irvine
6 Institute for Mathematical Behavioral Sciences, University of California, Irvine
* Correspondence: shaoming@nyu.edu, aaron.bornstein@uci.edu
Abstract
Memories affect nearly every aspect of our mental life. They allow us to both resolve uncertainty
in the present and to construct plans for the future. Recently, renewed interest in the role
memory plays in adaptive behavior has led to new theoretical advances and empirical
observations. We review key findings, with particular emphasis on how the retrieval of many
kinds of memories affects deliberative action selection. These results are interpreted in a
sequential inference framework, in which reinstatements from memory serve as “samples” of
potential action outcomes. The resulting model suggests a central role for the dynamics of
memory reactivation in determining the influence of different kinds of memory in decisions. We
propose that representation-specific dynamics can implement a bottom-up “product of experts”
rule that integrates multiple sets of action-outcome predictions weighted on the basis of their
uncertainty. We close by reviewing related findings and identifying areas for further research.
Introduction
Most decisions involve some form of memory. Decades of research have focused on
understanding how one kind of memory, about the summary statistics of a task or environment,
is employed in the service of evaluating choice options, either through incremental learning of
stimulus-outcome associations, or via extracting regularities present in the structure of the
environment (Balleine, 2007; Daw et al., 2011; Dayan, 1993; Gläscher et al., 2010; Tolman,
1948) . These types of memories are differentiated by their distinct representational properties
and divergent neural substrates (Dolan & Dayan, 2013; Poldrack & Packard, 2003; Yin &
Knowlton, 2006). Critically, however, they share in common a reliance on extensive experience,
often measured within a narrowly controlled, highly repetitive laboratory task, in order to
learn usable statistics (Behrens et al., 2007; Daw et al., 2011). This leaves open the question of
how decisions are made on the basis of little direct experience (Lengyel & Dayan, 2008) , or in
complex environments from which it may be intractable to extract sufficiently detailed
regularities (Kaelbling et al., 1998; Silver & Veness, 2010), as in many real-world decisions
faced by humans and animals (Lake et al., 2015; Lien & Cheng, 2000; Niv et al., 2015) .
Humans and animals constantly draw on memories of the past to inform decisions about
the future (Redish, 2016; Schacter et al., 2017) . An emerging framework describes this
phenomenon as a simulation-driven estimation process, in which decision-makers examine
what might result from each available action by consulting memories of similar previous
settings. This approach, generally referred to as memory sampling (Bordalo et al., 2020;
Gershman & Daw, 2017; Kuwabara & Pillemer, 2010; Lengyel & Dayan, 2008; Lieder et al.,
2018; Ritter et al., 2018; Shadlen & Shohamy, 2016; Zhao et al., 2019) , can approximate the
sorts of option value estimates that would be learned across repeated experience by, e.g.,
temporal-difference reinforcement learning (TDRL; Bornstein et al., 2017; Gershman & Daw,
2017; Lengyel & Dayan, 2008) , while retaining the flexibility to diverge from long-run averages
when doing so may be adaptive. At one extreme, drawing on individual memories in this way
allows one to effectively tackle choice problems even in the low-data limit (e.g., in novel
environments), where processes that rely on abstraction over multiple experiences are
unreliable (Lengyel & Dayan, 2008) .
Examining memory retrieval from the perspective of reinforcement learning
complements the use of RL to study representation formation -- e.g. of cached values (Barto et
al., 1995) , motor sequences (Botvinick et al., 2009; Keramati et al., 2016; Miller et al., 2018,
2019) , or environmental structure (Dayan, 1993; Gershman, 2018; Wilson et al., 2014) .
Therefore, we begin this review by describing the RL formulation of the computational problem
of optimal action selection among immediately available options. We continue with a review of
how known cognitive and neurobiological properties of long- and short-term memory retrieval in
humans and animals suggest an implementation of one form of approximate solution to this
problem, the stochastic sampling of past experiences. Then, we briefly introduce the
mathematical framework that describes the optimal solution to two-alternative forced choice on
the basis of unreliable evidence, the drift-diffusion model (DDM), with emphasis on what is
known about how organisms approach the special case of evidence in the form of
internally-generated signals.
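For readers less familiar with this framework, the following minimal simulation sketches the standard two-alternative DDM: noisy evidence with a constant drift rate is accumulated over time until it crosses one of two decision bounds. The drift, noise, and bound settings below are illustrative placeholders rather than parameters from any study discussed in this review.

import numpy as np

def simulate_ddm(drift=0.3, noise=1.0, bound=1.0, dt=0.001, max_t=5.0, rng=None):
    """Simulate one drift-diffusion trial; returns (choice, reaction_time).

    Evidence x starts at 0 and evolves as dx = drift*dt + noise*sqrt(dt)*N(0, 1)
    until it crosses +bound (choice 1) or -bound (choice 0).
    """
    rng = rng or np.random.default_rng()
    x, t = 0.0, 0.0
    while abs(x) < bound and t < max_t:
        x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return (1 if x >= bound else 0), t

# Example: a higher drift rate (more reliable evidence) yields faster, more accurate choices.
rng = np.random.default_rng(0)
trials = [simulate_ddm(drift=0.5, rng=rng) for _ in range(1000)]
print("accuracy:", np.mean([c for c, _ in trials]),
      "mean RT:", np.mean([t for _, t in trials]))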
We next review theoretical frameworks and key empirical studies that describe how
various kinds of memory, ranging from action sequences to “cognitive maps” to long-term
autobiographical memories, can provide these internally-generated signals for action selection.
We focus especially on a representative selection of studies that have shown that episodic
features mediate the selection of which memories are retrieved during decision deliberation;¹
these constitute an informative limiting case of the memory sampling framework.
Next, we examine how these properties of memory retrieval during action selection
constrain the process of accumulating evidence from memory. We focus on areas in which the
properties of memory sampling contrast with those of sensory evidence accumulation, such as
the relationship between representational properties and retrieval dynamics, and the sequential
structure of retrieval.
We close with a synthesis of the reviewed findings, and suggest that action selection
based on memory retrieval can be best described by a time-varying evidence accumulation
process, in which the momentary rate of accumulation is determined by several cognitive and
neural factors. The resulting model approximates a “product of experts” rule for integrating
action tendencies from multiple control processes, in this case memory representations with
different associative content, relational structure, and history-dependence. It follows directly that
the involvement of different forms of memory in action selection depends on the temporal
dynamics of these factors, via their influence on the effective rate of production of evidence
samples, which can implement the principle of uncertainty-weighted arbitration between different
decision systems (Daw et al., 2005; Keramati et al., 2011). Finally, we briefly review existing
empirical evidence in support of this model and suggest potential directions for further
research.
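To make the "product of experts" idea concrete before developing it below, consider a minimal sketch in which each expert's action-outcome prediction is summarized as a Gaussian over outcome value; this Gaussian form, and the particular means and variances used, are expository assumptions rather than the specific model developed later. Multiplying the experts' densities yields a Gaussian whose mean is a precision-weighted average, so less certain predictions automatically contribute less.

import numpy as np

def product_of_gaussian_experts(means, variances):
    """Combine Gaussian predictions by multiplying their densities.

    The product of Gaussians is Gaussian with precision (1/variance) equal to the
    sum of the experts' precisions, and mean equal to the precision-weighted
    average of the experts' means.
    """
    precisions = 1.0 / np.asarray(variances, dtype=float)
    combined_precision = precisions.sum()
    combined_mean = np.dot(precisions, means) / combined_precision
    return combined_mean, 1.0 / combined_precision

# Hypothetical example: a precise incrementally learned estimate and an informative
# but noisy episodic-sampling estimate.
mean, var = product_of_gaussian_experts(means=[0.2, 0.8], variances=[0.05, 0.50])
print(f"integrated prediction: mean={mean:.3f}, variance={var:.3f}")
# The integrated mean lies closer to 0.2, the lower-variance (more certain) expert.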
I. The view from Reinforcement Learning
We begin by detailing key aspects of the predominant framework for value-based decisions,
Reinforcement Learning (RL; Sutton & Barto, 2018). We start here because memory sampling
shares with RL the use of primitives such as states, actions, and rewards -- but, crucially, it
operates on these elements with a different computational form that provides a distinct set of
guarantees about efficiency and optimality. Understanding these guarantees clarifies why each
approach makes different empirical predictions in certain settings. Importantly, RL provides a
formal understanding of the value estimation problem, and thus a basis for evaluating different
kinds of estimates. This framework will be crucial for understanding our later description of how
and why multiple memory systems can contribute to decisions.
RL examines the problem of learning how to best navigate an uncertain environment
guided primarily by feedback, in the form of reward or punishment, obtained after taking actions
within that environment. While the framework allows for a wide range of possible approaches,
its primary applications in neuroscience research to date have followed a particular form
involving incremental learning of a value function relating states and actions to the long-term,
discounted rewards that can be expected to result (Eqn. 1).² When fitting human behavior, a
common practice (Daw et al., 2011) is to specify an action selection function that translates
these values into a likelihood of taking each available action (Eqn. 2). We next describe
particular instances of these equations and the key features relevant to the current review:

(1)  $Q(a, s) \leftarrow Q(a, s) + \alpha\,[R + \gamma \max_{a'} Q(a', s') - Q(a, s)]$

(2)  $P(a^{*} = A) = \frac{\exp[\beta\, Q(A, s)]}{\sum_{a'} \exp[\beta\, Q(a', s)]}$

The first equation describes the incremental, experience-driven learning of value
expectations (the value function, Q). The quantity specified by the value function is an estimate
of the total future reward expected after taking action a in state s (and continuing to act optimally
thereafter). This future reward is the sum of the reward directly obtained by taking the action
(R), plus the total future reward to be obtained by taking the best action in the ensuing state s'.
(Future rewards are, throughout, treated as less important to momentary action selection than
immediate rewards, so they are discounted according to a constant 0 < γ < 1.) The expectation
is updated by the difference between this sum and the previous value of the expectation, after
scaling by a learning rate (0 < α < 1) in order to regularize the estimate. The second equation
specifies the probability of choosing a given action (A) as the relative profitability of that action,
versus all candidate actions. The sensitivity of this likelihood to the value difference is specified
by the temperature parameter, β.

¹ We use the term "memories with episodic features" to refer to representations of past experience that exhibit dense, multi-sensory associations, formed during a single experience, which potentially include attributes incidental to goals at the time of that experience (Allen & Fortin, 2013; Bornstein & Pickard, 2020, Box 1). Though "episodic memory" has variously been defined by its relationship to conscious, declarative recall, these properties may not be functionally necessary for an influence on choices, and so we sidestep the question of awareness in the present review.
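As a concrete illustration of Equations 1 and 2, the following sketch implements the temporal-difference update of a cached value table and softmax action selection in a toy two-state environment; the environment and the settings of α, γ, and β are invented for exposition and are not drawn from any study reviewed here.

import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Eqn. 1: move Q(a, s) toward the reward plus the discounted best next value."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def softmax_policy(Q, s, beta=3.0, rng=None):
    """Eqn. 2: choose an action with probability proportional to exp(beta * Q)."""
    rng = rng or np.random.default_rng()
    prefs = beta * Q[s]
    probs = np.exp(prefs - prefs.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy usage: 2 states x 2 actions, with reward only for action 1 in state 0.
rng = np.random.default_rng(1)
Q = np.zeros((2, 2))
for _ in range(500):
    s = rng.integers(2)
    a = softmax_policy(Q, s, rng=rng)
    r = 1.0 if (s == 0 and a == 1) else 0.0
    td_update(Q, s, a, r, s_next=rng.integers(2))
print(Q)  # Q[0, 1] acquires the highest value, reflecting the rewarded state-action pair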
Importantly, the first equation is an approximation to the full value computation (Eqn. 3),
which incorporates knowledge about the transition structure of the world: the likelihood that
taking a given action a in state s is going to lead to a particular state s'. The true discounted
future reward thus integrates over transition probabilities to all possible successor states. An
agent with knowledge of this transition structure may be able to make better decisions than one
who just learns reward values, but representing and working with this structure can be quite
costly.

(3)  $Q(a, s) = \sum_{s'} T(s, a, s')\, V(s')$

Note that the future return of the target states, $V(s')$, is recursively defined:

(4)  $V(s') = R(s') + \gamma V(s'')$
Unrolling the recursion gives a converging sum of (discounted) rewards:
(5)  $V(s') = R(s') + \gamma R(s'') + \gamma^{2} R(s''') + \ldots$

where future states after s' are denoted by s'', s''', and so on. Computing this (recursive)
expectation is difficult in practice, especially with limited experience of the transition structure.
Therefore, approximate computations may be employed, either via the incremental approach of
Equation 1 above, which marginalizes over transitions, or via methods that directly estimate the
transition structure (Daw et al., 2005). More broadly, however, the computational goal
(choosing on the basis of total discounted future reward) can be achieved in multiple ways.

² Multiple variants of each equation achieve similar goals under different settings. For more in-depth treatment, see Sutton & Barto (2018); for a review of the neural instantiation of these variables, see Glimcher (2011).
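As a counterpart to Equations 3-5, the sketch below evaluates Q(a, s) from an explicit, and here entirely hypothetical, transition model T and reward function R: it iterates the recursion for V to convergence and then takes the expectation over successor states. For simplicity it evaluates a uniform policy over actions, a choice made purely for illustration rather than specified by the equations above.

import numpy as np

def state_values(T, R, gamma=0.95, tol=1e-8):
    """Eqns. 4-5: V(s) = R(s) + gamma * E[V(s')], iterated to convergence.

    T has shape (n_states, n_actions, n_states); here the continuation is evaluated
    under a policy that averages uniformly over actions, purely for illustration.
    """
    P = T.mean(axis=1)                    # state-to-state transitions under the uniform policy
    V = np.zeros(T.shape[0])
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def q_from_model(T, R, gamma=0.95):
    """Eqn. 3: Q(a, s) as the expectation of successor-state values under T."""
    V = state_values(T, R, gamma)
    # sum over s' of T(s, a, s') * V(s'), for every (s, a) pair
    return np.einsum('san,n->sa', T, V)

# Hypothetical 3-state, 2-action world; only state 2 is rewarded.
T = np.array([[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
              [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
              [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]]])
R = np.array([0.0, 0.0, 1.0])
print(q_from_model(T, R))  # actions leading toward the rewarded state score higher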
One approach, called memory sampling, avoids the dependence on extensive
experience by simply consulting the values obtained directly, “remembering” individual
experiences with the current (and potential future) state(s). Formally, rather than computing this
estimate by updating a cached value function with each experience (Eqn. 1), the alternative
computes it dynamically, possibly even on-demand (Eldar et al., 2020) , by sampling past
encounters with the states of interest (and, potentially, generalizing from similar states) and
averaging the resulting values. This approach can be used to estimate both the reward to be
received from the current action (Bornstein et al., 2017) and the reward associated with states
that follow from each action (Bornstein & Norman, 2017; Gershman & Daw, 2017; Vikbladh et al., 2017). When
multiple relevant experiences exist, they can be selected from according to a sample-selection
function (Fig. 1; Equation 6a, function S) that specifies some probability distribution over
rewards for each action given by the distance between current state s and given sample state s’
in a probability space defined over their shared features (Eqn. 6b). While in practice this
distance incorporates any set of features relevant to the current comparison (Fig. 1), in
laboratory experiments task states are usually distinguishable along only a small number of
well-controlled dimensions. For example, samples could be weighted by their proximity in time
to the current moment (Eqn. 6c), capturing the intuition that the remembered states most like
the state I am currently in are the states I have most recently visited. In this formulation,
samples at time t are most likely to be drawn from the most recent trial ( i=t -1), and exponentially
less likely to be drawn from preceding trials i (i.e. where i=t -2, t -3, t -4, …), with decay specified
by the parameter α. Because the value of α is between 0 and 1, exponentiating this value by t - i
will result in progressively smaller probabilities for trials further in the past (greater t - i). Values
estimated by this approach have the same form of dependence on recent experience as do
those learned by TDRL (Bornstein et al., 2017) .
(6a)  $(s', r') \sim S(s, a)$

(6b)  $P(Q(a, s) = R(s')) \propto \lVert s - s' \rVert$

(6c)  $P(Q(a, s) = R_{i}) = \alpha (1 - \alpha)^{t - i}$
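The following sketch illustrates the sampling alternative in its simplest, recency-weighted form (Eqn. 6c): instead of maintaining a cached Q, the agent stores each experienced (trial, state, action, reward) episode and, at choice time, draws a handful of past encounters with the current state-action pair, with probability decaying geometrically in elapsed trials, then averages the sampled rewards. The sample count and decay parameter below are illustrative choices, not fitted values.

import numpy as np

def sample_value(episodes, s, a, t, decay=0.3, n_samples=10, rng=None):
    """Estimate Q(a, s) by recency-weighted sampling of past episodes (Eqn. 6c).

    `episodes` is a list of (trial_index, state, action, reward) tuples; episodes
    matching the current state and action are sampled with probability
    proportional to (1 - decay) ** (t - i).
    """
    rng = rng or np.random.default_rng()
    matches = [(i, r) for (i, st, ac, r) in episodes if st == s and ac == a]
    if not matches:
        return 0.0                         # no relevant experience: fall back to a default
    weights = np.array([(1 - decay) ** (t - i) for i, _ in matches])
    probs = weights / weights.sum()
    rewards = np.array([r for _, r in matches])
    draws = rng.choice(len(matches), size=n_samples, p=probs)
    return rewards[draws].mean()

# Toy usage: the same state-action pair paid off early on but not recently.
episodes = [(i, 0, 1, 1.0) for i in range(10)] + [(i, 0, 1, 0.0) for i in range(10, 20)]
print(sample_value(episodes, s=0, a=1, t=20, rng=np.random.default_rng(2)))
# The estimate is pulled toward the recent (zero-reward) outcomes, mirroring the
# recency dependence of values learned by TDRL.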

References

Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407-428.

Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2), 99-134.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85(2), 59-108.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.

Tolman, E. C. (1948). Cognitive maps in rats and men. Psychological Review, 55(4), 189-208.