Trial-by-trial data analysis using computational models
Nathaniel D. Daw
August 27, 2009
[manuscript for Attention & Performance XXIII.]
1 Introduction
In numerous and high-profile studies, researchers have recently begun to integrate computational models
into the analysis of data from experiments on reward learning and decision making (Platt and Glimcher,
1999; O’Doherty et al., 2003; Sugrue et al., 2004; Barraclough et al., 2004; Samejima et al., 2005; Daw et al.,
2006; Li et al., 2006; Frank et al., 2007; Tom et al., 2007; Kable and Glimcher, 2007; Lohrenz et al., 2007;
Schonberg et al., 2007; Wittmann et al., 2008; Hare et al., 2008; Hampton et al., 2008; Plassmann et al., 2008).
As these techniques are spreading rapidly, but have been developed and documented somewhat sporadi-
cally alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty
et al., 2007). In particular, we discuss the rationale for these methods and the questions they are suited to
address. We then offer a relatively practical tutorial on the basic statistical methods for answering them
and how they can be applied to data analysis. The techniques are illustrated with fits of simple models
to simulated datasets. Throughout, we flag interpretational and technical pitfalls of which we believe au-
thors, reviewers, and readers should be aware. We focus on cataloging the particular, admittedly somewhat
idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these tech-
niques as instances of a general set of tools that can be applied to analyze behavioral and neural data of
many sorts.
A number of other reviews (Daw and Doya, 2006; Dayan and Niv, 2008) have focused on the scientific
conclusions that have been obtained with these methods, an issue we omit almost entirely here. There are
also excellent books that cover statistical inference of this general sort with much greater generality, formal
precision, and detail (MacKay, 2003; Gelman et al., 2004; Bishop, 2006; Gelman and Hill, 2007).
2 Background
Much work in this area grew out of the celebrated observation (Barto, 1995; Schultz et al., 1997) that the
firing of midbrain dopamine neurons (and also the BOLD signal measured via fMRI in their primary target,
the striatum; Delgado et al., 2000; Knutson et al., 2000; McClure et al., 2003; O’Doherty et al., 2003) resembles
a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL,
i.e., trial-and-error learning in decision problems; Sutton and Barto, 1998). Although the original empirical
articles reported activity averaged across many trials, and the mean behavior of computational simulations
was compared to these reports, in fact, a more central issue in learning is how behavior (or the underlying
neural activity) changes trial by trial in response to feedback. Indeed, the computational theories are framed
in just these terms, and so more recent work on the system (O’Doherty et al., 2003; Bayer and Glimcher,
2005) has focused on comparing their predictions to raw data timeseries, trial by trial: measuring, in effect,
the theories’ goodness of fit to the data, on average, rather than their goodness of fit to the averaged data.

This change in approach represents a major advance in the use of computational models for experimen-
tal design and analysis, which is still unfolding. Used this way, computational models represent excep-
tionally detailed, quantitative hypotheses about how the brain approaches a problem, which are amenable
to direct experimental test. As noted, such trial-by-trial analyses are particularly suitable to developing a
more detailed and dynamic picture of learning than was previously available.
In a standard experiential decision experiment, such as a “bandit” task (Sugrue et al., 2004; Lau and
Glimcher, 2005; Daw et al., 2006), a subject is offered repeated opportunities to choose between multiple
options (e.g. slot machines) and receives rewards or punishments according to her choice on each trial.
Data might consist of a series of choices and outcomes (one per trial). In principle, any arbitrary relation-
ship might obtain between the entire list of past choices and outcomes, and the next one. Computational
theories constitute particular claims about some more restricted function by which previous choices and
feedback give rise to subsequent choices. For instance, standard RL models (such as “Q learning”; Watkins,
1989) envision that subjects track the expected reward for each slot machine, via some sort of running av-
erage over the feedback, and it is only through these aggregated “value” predictions that past feedback
determines future choices.
This example points to another important feature of this approach, which is that the theories purport
to quantify, trial-by-trial, variables such as the reward expected for a choice (and the “prediction error,”
or difference between the received and expected rewards). That is, the theories permit the estimation of
quantities (expectations, expectation violations) that would otherwise be subjective; this, in turn, enables
the search for neural correlates of these estimates (Platt and Glimcher, 1999; Sugrue et al., 2004).
By comparing the model’s predictions to trial-by-trial experimental data, such as choices or BOLD sig-
nals, it is possible using a mixture of Bayesian and classical statistical techniques to answer two sorts of
questions about a model, which are discussed in Sections 3 and 4 below. The art is framing questions of
scientific interest in these terms.
The first question is parameter estimation. RL models typically have a number of free parameters
measuring quantities such as the “learning rate,” or the degree to which subjects update their beliefs in
response to feedback. Often, these parameters characterize (or new parameters can be introduced so as to
characterize) factors that are of experimental interest. For instance, Behrens et al. (2007) tested predictions
about how particular task manipulations would affect the learning rate.
The second type of question that can be addressed is model comparison. Different computational mod-
els, in effect, constitute different hypotheses about the learning process that gave rise to the data. These
hypotheses may be tested against one another on the basis of their fit to the data. For example, Hampton et al. (2008) used this method to ask which of several different approaches subjects use for anticipating an opponent’s behavior in a multiplayer competitive game.
Learning and observation models: In order to appreciate the extent to which the same methods may
be applied to different sets of data, it is useful to separate a computational theory into two parts. The first,
which we will call the learning model, describes the dynamics of the model’s internal variables such as the
reward expected for each slot machine. The second part, which we will call the observation model, describes
how the model’s internal variables are reflected in observed data: for instance, how expected values drive
choice or how prediction errors produce neural spiking. Essentially, the observation model regresses the
learning model’s internal variables onto the observed data; it plays a role similar to (and is often, in fact,
identical to) the “link function” in generalized linear modeling. In this way, a common learning process
(a single learning model) may be viewed as giving rise to distinct observable data streams in a number of
different modalities (e.g., choices and BOLD, through two separate observation models). Thus, although we
describe the methods in this tutorial primarily in terms of choice data, they are directly applicable to other
modalities simply by substituting a different observation model.
Crucially, whereas the learning model is typically deterministic, the observation models are noisy: that
is, given the internal variables produced by the learning model, an observation model assigns some proba-
bility to any possible observations. Thus the “fit” of different learning models, or their parameters, to any
observed data can be quantified statistically in terms of the probability they assign to the data, a procedure
at the core of the methods that follow.
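
As a schematic illustration of this decomposition (a sketch only; the function and argument names below are hypothetical, not taken from the chapter), the probability a composite model assigns to a dataset can be accumulated trial by trial while keeping the two components separate:

```python
def dataset_log_probability(trials, learning_update, observation_log_prob, params):
    """Schematic only: accumulate the (log) probability a composite model assigns to a dataset.
    `learning_update` stands in for a deterministic learning model and
    `observation_log_prob` for a noisy observation model; all names are hypothetical."""
    internal = params["initial_internal_variables"]      # e.g., initial value estimates
    total_log_prob = 0.0
    for trial in trials:
        # Observation model: probability of this trial's observed datum (choice, BOLD, spikes, ...)
        total_log_prob += observation_log_prob(trial, internal, params)
        # Learning model: deterministically update the internal variables given this trial's feedback
        internal = learning_update(trial, internal, params)
    return total_log_prob
```

On this view, fitting a different data modality amounts to swapping in a different observation model (here, `observation_log_prob`) while leaving the learning model untouched.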

3 Parameter estimation
Model parameters can characterize a variety of scientifically interesting quantities, from how quickly sub-
jects learn (Behrens et al., 2007) to how sensitive they are to different rewards and punishments (Tom et al.,
2007). Here we consider how to obtain statistical results about parameters’ values from data. We first con-
sider the general statistical rationale underlying the problem; then develop the details for an example RL
model before considering various pragmatic factors of actually performing these analyses on data. Finally,
having discussed these details in terms of choice data, we discuss how the same methods may be applied
to other sorts of data.
Suppose we have some model M, with a vector of free parameters θ_M. The model (here, the composite of our learning and observation models) describes a probability distribution, or likelihood function, P(D | M, θ_M) over possible data sets D. Then, Bayes’ rule tells us that, having observed a data set D,

P(θ_M | D, M) ∝ P(D | M, θ_M) · P(θ_M | M)    (1)
That is, the posterior probability distribution over the free parameters, given the data, is proportional to the product of two factors: (1) the likelihood of the data, given the free parameters, and (2) the prior probability of the parameters. This equation famously shows how to start with a theory of how parameters (noisily) produce data, and invert it into a theory by which data (noisily) reveal the parameters that produced them. Classically, we seek a point estimate of the parameters θ_M rather than a posterior distribution over all possible values; if we neglect (or treat as flat) the prior over the parameters P(θ_M | M), then the most probable value for θ_M is the maximum likelihood estimate: the setting of the parameters that maximizes the likelihood function, P(D | M, θ_M). We denote this θ̂_M.
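
In symbols, taking logs of Equation 1 gives

log P(θ_M | D, M) = log P(D | M, θ_M) + log P(θ_M | M) + const

and, neglecting the prior, the maximum likelihood estimate is θ̂_M = argmax over θ_M of P(D | M, θ_M), the quantity sought throughout the rest of this section.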
3.1 Maximum likelihood estimation for RL
An RL model: We may see how the general ideas play out in a simple reinforcement learning setting. Consider a simple game in which on each trial t, a subject makes a choice c_t (= L or R) between a left and a right slot machine, and receives a reward r_t (= $1 or $0) stochastically. According to a simple Q-learning model (Watkins, 1989), on each trial the subject assigns an expected value to each machine: Q_t(L) and Q_t(R). We initialize these values to (say) 0, and then on each trial, the value for the chosen machine is updated as

Q_{t+1}(c_t) = Q_t(c_t) + α · δ_t    (2)
where 0 ≤ α ≤ 1 is a free learning rate parameter, and δ_t = r_t − Q_t(c_t) is the prediction error. Equation 2 is our learning model. To explain the choices c_t in terms of the values Q_t, we assume an observation model. In RL, it is often assumed that subjects choose probabilistically according to a softmax distribution:

P(c_t = L | Q_t(L), Q_t(R)) = exp(β · Q_t(L)) / [exp(β · Q_t(R)) + exp(β · Q_t(L))]    (3)
Here, β is a free parameter known in RL as the inverse temperature parameter. However, note that Equation 3 is also equivalent to standard logistic regression where the dependent variable is the binary choice variable c_t and there is one predictor variable, the difference in values Q_t(L) − Q_t(R). Therefore, β can also be viewed as the regression weight connecting the Qs to the choices. More generally, when there are more than two choice options, the softmax model corresponds to a generalization of logistic regression known as conditional logit regression (McFadden, 1974).
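
For concreteness, choices could be simulated from this model (Equations 2 and 3) roughly as follows. This is a minimal Python sketch; the reward probabilities of the two machines and the other default settings are hypothetical illustrations, not values specified in the text.

```python
import numpy as np

def simulate_q_learning(n_trials=1000, alpha=0.25, beta=1.0,
                        reward_probs=(0.6, 0.4), seed=0):
    """Simulate choices and rewards from the Q-learning/softmax model of
    Equations 2 and 3. reward_probs are hypothetical Bernoulli reward
    probabilities for the two machines (index 0 = L, 1 = R)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(2)                       # Q(L), Q(R), initialized to 0
    choices, rewards = [], []
    for t in range(n_trials):
        # Equation 3: softmax probability of choosing L (index 0)
        p_L = np.exp(beta * Q[0]) / (np.exp(beta * Q[0]) + np.exp(beta * Q[1]))
        c = 0 if rng.random() < p_L else 1
        r = 1.0 if rng.random() < reward_probs[c] else 0.0   # $1 or $0
        Q[c] += alpha * (r - Q[c])        # Equation 2: delta_t = r_t - Q_t(c_t)
        choices.append(c)
        rewards.append(r)
    return np.array(choices), np.array(rewards)
```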
The model of Equations 2 and 3 is only a representative example of the sorts of algorithms used to
study reinforcement learning. Since our focus here is on the methodology for estimation given a model,
a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and
Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments). That said, most models in the literature
are variants on the example shown here. Another commonly used (Daw et al., 2006; Behrens et al., 2007) and
seemingly rather different family of learning methods is Bayesian models such as the Kalman filter (Kakade
[Figure 1 appears here: a 2-D likelihood surface plotted over β (horizontal axis, 0 to 2) and α (vertical axis, 0 to 1).]
Figure 1: Likelihood surface for simulated reinforcement learning data, as a function of two free parame-
ters. Lighter colors denote higher data likelihood. The maximum likelihood estimate is shown as an “o”
surrounded by an ellipse of one standard error (a region of about 90% confidence); the true parameters
from which the data were generated are denoted by an “x”.
and Dayan, 2002). In fact, the Q-learning rule of Equation 2 can be seen as a simplified case of the Kalman
filter: the Bayesian model uses the same learning rule but has additional machinery that determines the
learning rate parameter α on a trial-by-trial basis (Kakade and Dayan, 2002; Behrens et al., 2007; Daw et al.,
2008).
Data likelihood: Given the model described above, the probability of a whole dataset D (i.e., a whole sequence of choices c = c_{1...T}, given the rewards r = r_{1...T}) is just the product of their probabilities from Equation 3,

∏_t P(c_t | Q_t(L), Q_t(R))    (4)

Note that the terms Q_t in the softmax are determined (via Equation 2) by the rewards r_{1...t−1} and choices c_{1...t−1} on trials prior to t.
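
Equivalently, working on a log scale (which will matter for the numerical issues discussed in Section 3.2), the log likelihood of the dataset is the sum over trials

log P(D | M, θ_M) = Σ_t log P(c_t | Q_t(L), Q_t(R)).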
Together, Equations 2 and 3 constitute a full likelihood function P(D | M, θ_M), and we can estimate the free parameters (θ_M = ⟨α, β⟩) by maximum likelihood. Figure 1 illustrates the process: 1,000 choice trials were simulated according to the model (with parameters α = .25 and β = 1, red x). The likelihood of the observed data was then computed for a range of parameters, and plotted (with brighter colors for higher likelihood) on a 2-D grid. In this case, the maximum likelihood point (α̂ = .34 and β̂ = .93, blue circle) was near the true parameters.
Confidence intervals: Of course, in order actually to test a hypothesis about the parameters’ values, we need to be able to make statistical claims about the quality of the estimate θ̂_M. Intuitively, the degree to which our estimate can be trusted depends on how much better it accounts for the data than other nearby parameter estimates, that is, on how sharply peaked is the “hill” of data likelihoods in the space of parameters. Such peakiness is characterized by the second derivative (the Hessian) of the likelihood function with respect to the parameters. The Hessian is a square matrix (here, 2×2) with a row and column for each parameter. Evaluated at the peak point θ̂_M, the elements of the Hessian are larger the more rapidly the likelihood function is dropping off away from it in different directions, which corresponds to a more reliable estimate of the parameters. Conversely, the matrix inverse of the Hessian (like the reciprocal of a scalar) is larger for poorer estimates, like error bars. More precisely, if H is the Hessian of the negative log of the likelihood function at the maximum likelihood point θ̂_M, then a standard estimator for the covariance of the parameter estimates is its matrix inverse H⁻¹ (MacKay, 2003).

The diagonal terms of H⁻¹ correspond to variances for each parameter separately, and their square roots measure one standard error on the parameter. Thus, for instance, 95% confidence intervals around the maximum likelihood estimate may be estimated as θ̂ plus or minus 1.96 standard errors.
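
As a rough sketch of how such error bars might be computed in practice, the Hessian can be approximated by finite differences and then inverted. The names `negloglik` (a function returning the negative log likelihood of the data at a parameter vector) and `theta_hat` (its minimizer) below are hypothetical placeholders; analytic or library-provided derivatives would also serve.

```python
import numpy as np

def numerical_hessian(f, theta, eps=1e-4):
    """Central-difference approximation to the Hessian of f at the point theta."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            t_pp = theta.copy(); t_pp[i] += eps; t_pp[j] += eps
            t_pm = theta.copy(); t_pm[i] += eps; t_pm[j] -= eps
            t_mp = theta.copy(); t_mp[i] -= eps; t_mp[j] += eps
            t_mm = theta.copy(); t_mm[i] -= eps; t_mm[j] -= eps
            H[i, j] = (f(t_pp) - f(t_pm) - f(t_mp) + f(t_mm)) / (4 * eps ** 2)
    return H

# Assuming negloglik is the negative log-likelihood function and theta_hat its
# minimizer (both hypothetical names):
# H = numerical_hessian(lambda th: negloglik(th, choices, rewards), theta_hat)
# cov = np.linalg.inv(H)            # estimated covariance of the parameter estimates
# se = np.sqrt(np.diag(cov))        # one standard error per parameter
# ci_95 = np.stack([theta_hat - 1.96 * se, theta_hat + 1.96 * se])
```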
Covariance between parameters: The off-diagonal terms of H⁻¹ measure covariance between the parameters, and are useful for diagnosing model fit. In general, large off-diagonal terms are a symptom of a poorly specified model or some kinds of bad data. In the worst case, two parameters may be redundant, so that there is no unique optimum. The Q-learning model has a more moderate coupling between the parameters. As can be seen from the elongated, tilted shape of the “ridge” in Figure 1, estimates of α and β tend to be inversely coupled in this model. By increasing β while decreasing α (or vice versa: moving northwest or southeast in the figure), a similar likelihood is obtained. This is because the reward r_t is multiplied by both α (in Equation 2, to update Q_t) and then by β (in Equation 3) before affecting the choice likelihood on the next trial. As a result, neither parameter individually can be estimated very tightly by itself (the “ridge” is a bit wide if you cross it horizontally in β or vertically in α), but their product is well estimated (the hill is narrowest when crossed from northeast to southwest). The blue oval in the figure traces out a one-standard-error ellipse in the two parameters jointly, derived from H⁻¹; its tilt follows the contour of the ridge.
Often in applications such as logistic regression, a corrected covariance estimator is used that is thought to be more robust to problems such as mismatch between the true and assumed models. This “Huber-White” or “sandwich” estimator (Huber, 1967; Freedman, 2006) is H⁻¹BH⁻¹, where B = Σ_t g(c_t)ᵀ g(c_t), and g(c_t), in turn, is the gradient (vector of first partial derivatives with respect to the parameters) of the negative log likelihood of the t-th data point c_t, evaluated at θ̂_M. This is harder to compute in practice, since it involves keeping track of g, which is laborious. However, as discussed below, g can also be useful when searching for the maximum likelihood point.
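
For illustration only: if the per-trial gradients were collected as rows of a matrix G (one row per trial, one column per parameter; a hypothetical layout, not prescribed by the chapter), the sandwich estimate could be assembled as follows.

```python
import numpy as np

def sandwich_covariance(H, G):
    """Huber-White ("sandwich") covariance estimate, H^-1 B H^-1.
    H: Hessian of the negative log likelihood at the ML estimate (n_params x n_params).
    G: per-trial gradients of the negative log likelihood at the ML estimate,
       stacked as rows (n_trials x n_params); row t is g(c_t)."""
    H_inv = np.linalg.inv(H)
    B = G.T @ G                       # sum over trials of g(c_t)^T g(c_t)
    return H_inv @ B @ H_inv
```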
3.2 Pragmatics
Above, we developed the general equations for maximum likelihood parameter estimation in an RL model;
how can these be implemented in practice for data analysis?
First, although we have noted an equivalence between Equation 3 and logistic regression, it is not possible simply to use an off-the-shelf regression package to estimate the parameters. This is because although the observation stage of the model represents a logistic regression from values Q_t to choices c_t, the values are not fixed but themselves depend on the free parameters (here, α) of the learning process. As these do not enter the likelihood linearly, they cannot be estimated by a generalized linear model. Thus, we must search for the full set of free parameters that optimize the likelihood function.
Likelihood function: At the heart of optimizing the likelihood function is computing it. It is straightforward to write a function that takes in a dataset (a sequence of choices and rewards) and a candidate setting of the free parameters, loops over the data computing Equations 2 and 3, and returns the aggregate likelihood of the data. Importantly, the product in Equation 4 is often an exceptionally small number; it is thus numerically more stable to compute its log, i.e., the sum over trials of the log of the choice probability from Equation 3, which is β · Q_t(c_t) − log(exp(β · Q_t(L)) + exp(β · Q_t(R))). Since log is a monotonic function, this quantity has the same optimum but is less likely to underflow the minimum floating point value representable by a computer. (Another numerical trick is to note that Equation 3 is invariant to the addition or subtraction of any constant to all of the Q values. The chance of the exponential under- or overflowing can thus be reduced by evaluating the log probability for Q values after first subtracting their mean.)
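
A minimal Python sketch of such a likelihood function, and of one way of handing it to a general-purpose optimizer, might look as follows; choices are assumed to be coded 0 (L) and 1 (R), and the variable names, parameter bounds, and starting point are illustrative assumptions rather than recommendations.

```python
import numpy as np
from scipy.optimize import minimize

def negloglik(theta, choices, rewards):
    """Negative log likelihood of a choice sequence under Equations 2-4.
    theta = (alpha, beta); choices coded 0 (L) / 1 (R); rewards 0 or 1."""
    alpha, beta = theta
    Q = np.zeros(2)                       # initial values Q(L), Q(R)
    nll = 0.0
    for c, r in zip(choices, rewards):
        z = beta * (Q - Q.mean())         # subtract a constant to avoid overflow in exp
        # log P(c_t) = beta * Q_t(c_t) - log(exp(beta * Q_t(L)) + exp(beta * Q_t(R)))
        nll -= z[c] - np.log(np.exp(z).sum())
        Q[c] += alpha * (r - Q[c])        # Equation 2 (applied after scoring this trial's choice)
    return nll

# One way to search for the maximum likelihood point (bounds keep alpha in
# [0, 1] and beta nonnegative; the upper bound on beta and the starting point
# are arbitrary choices):
# fit = minimize(negloglik, x0=[0.5, 1.0], args=(choices, rewards),
#                bounds=[(0.0, 1.0), (0.0, 10.0)])
# alpha_hat, beta_hat = fit.x
```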
How, then, to find the optimal likelihood? In general, it may be tempting to discretize the space of
free parameters, compute the likelihood everywhere, and simply search for the best, much as is illustrated
in Figure 1. We recommend against this approach. First, most models of interest have more than two
parameters, and exhaustively testing all combinations in a higher dimensional space becomes intractable.