Trial-by-trial data analysis using computational models
Nathaniel D. Daw
August 27, 2009
[manuscript for Attention & Performance XXIII.]
1 Introduction
In numerous and high-profile studies, researchers have recently begun to integrate computational models
into the analysis of data from experiments on reward learning and decision making (Platt and Glimcher,
1999; O’Doherty et al., 2003; Sugrue et al., 2004; Barraclough et al., 2004; Samejima et al., 2005; Daw et al.,
2006; Li et al., 2006; Frank et al., 2007; Tom et al., 2007; Kable and Glimcher, 2007; Lohrenz et al., 2007;
Schonberg et al., 2007; Wittmann et al., 2008; Hare et al., 2008; Hampton et al., 2008; Plassmann et al., 2008).
As these techniques are spreading rapidly, but have been developed and documented somewhat sporadi-
cally alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty
et al., 2007). In particular, we discuss the rationale for these methods and the questions they are suited to
address. We then offer a relatively practical tutorial on the basic statistical methods for answering them
and how they can be applied to data analysis. The techniques are illustrated with fits of simple models
to simulated datasets. Throughout, we flag interpretational and technical pitfalls of which we believe au-
thors, reviewers, and readers should be aware. We focus on cataloging the particular, admittedly somewhat
idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these tech-
niques as instances of a general set of tools that can be applied to analyze behavioral and neural data of
many sorts.
A number of other reviews (Daw and Doya, 2006; Dayan and Niv, 2008) have focused on the scientific
conclusions that have been obtained with these methods, an issue we omit almost entirely here. There are
also excellent books that cover statistical inference of this general sort with much greater generality, formal
precision, and detail (MacKay, 2003; Gelman et al., 2004; Bishop, 2006; Gelman and Hill, 2007).
2 Background
Much work in this area grew out of the celebrated observation (Barto, 1995; Schultz et al., 1997) that the
firing of midbrain dopamine neurons (and also the BOLD signal measured via fMRI in their primary target,
the striatum; Delgado et al., 2000; Knutson et al., 2000; McClure et al., 2003; O’Doherty et al., 2003) resembles
a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL,
i.e., trial-and-error learning in decision problems; Sutton and Barto, 1998). Although the original empirical
articles reported activity averaged across many trials, and the mean behavior of computational simulations
was compared to these reports, in fact, a more central issue in learning is how behavior (or the underlying
neural activity) changes trial by trial in response to feedback. Indeed, the computational theories are framed
in just these terms, and so more recent work on the system (O’Doherty et al., 2003; Bayer and Glimcher,
2005) has focused on comparing their predictions to raw data timeseries, trial by trial: measuring, in effect,
the theories’ goodness of fit to the data, on average, rather than their goodness of fit to the averaged data.

This change in approach represents a major advance in the use of computational models for experimen-
tal design and analysis, which is still unfolding. Used this way, computational models represent excep-
tionally detailed, quantitative hypotheses about how the brain approaches a problem, which are amenable
to direct experimental test. As noted, such trial-by-trial analyses are particularly suitable to developing a
more detailed and dynamic picture of learning than was previously available.
In a standard experiential decision experiment, such as a “bandit” task (Sugrue et al., 2004; Lau and
Glimcher, 2005; Daw et al., 2006), a subject is offered repeated opportunities to choose between multiple
options (e.g. slot machines) and receives rewards or punishments according to her choice on each trial.
Data might consist of a series of choices and outcomes (one per trial). In principle, any arbitrary relation-
ship might obtain between the entire list of past choices and outcomes, and the next one. Computational
theories constitute particular claims about some more restricted function by which previous choices and
feedback give rise to subsequent choices. For instance, standard RL models (such as “Q learning”; Watkins,
1989) envision that subjects track the expected reward for each slot machine, via some sort of running av-
erage over the feedback, and it is only through these aggregated “value” predictions that past feedback
determines future choices.
This example points to another important feature of this approach, which is that the theories purport
to quantify, trial-by-trial, variables such as the reward expected for a choice (and the “prediction error,”
or difference between the received and expected rewards). That is, the theories permit the estimation of
quantities (expectations, expectation violations) that would otherwise be subjective; this, in turn, enables
the search for neural correlates of these estimates (Platt and Glimcher, 1999; Sugrue et al., 2004).
By comparing the model’s predictions to trial-by-trial experimental data, such as choices or BOLD sig-
nals, it is possible using a mixture of Bayesian and classical statistical techniques to answer two sorts of
questions about a model, which are discussed in Sections 3 and 4 below. The art is framing questions of
scientific interest in these terms.
The first question is parameter estimation. RL models typically have a number of free parameters
measuring quantities such as the “learning rate,” or the degree to which subjects update their beliefs in
response to feedback. Often, these parameters characterize (or new parameters can be introduced so as to
characterize) factors that are of experimental interest. For instance, Behrens et al. (2007) tested predictions
about how particular task manipulations would affect the learning rate.
The second type of question that can be addressed is model comparison. Different computational mod-
els, in effect, constitute different hypotheses about the learning process that gave rise to the data. These
hypotheses may be tested against one another on the basis of their fit to the data. For example, Hampton et al. (2008) used this method to ask which of several different approaches subjects use for anticipating an opponent’s behavior in a multiplayer competitive game.
Learning and observation models: In order to appreciate the extent to which the same methods may
be applied to different sets of data, it is useful to separate a computational theory into two parts. The first,
which we will call the learning model, describes the dynamics of the model’s internal variables such as the
reward expected for each slot machine. The second part, which we will call the observation model, describes
how the model’s internal variables are reflected in observed data: for instance, how expected values drive
choice or how prediction errors produce neural spiking. Essentially, the observation model regresses the
learning model’s internal variables onto the observed data; it plays a role similar to (and is often, in fact,
identical to) the “link function” in generalized linear modeling. In this way, a common learning process
(a single learning model) may be viewed as giving rise to distinct observable data streams in a number of
different modalities (e.g., choices and BOLD, through two separate observation models). Thus, although we
describe the methods in this tutorial primarily in terms of choice data, they are directly applicable to other
modalities simply by substituting a different observation model.
Crucially, whereas the learning model is typically deterministic, the observation models are noisy: that
is, given the internal variables produced by the learning model, an observation model assigns some proba-
bility to any possible observations. Thus the “fit” of different learning models, or their parameters, to any
observed data can be quantified statistically in terms of the probability they assign to the data, a procedure
at the core of the methods that follow.
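
As a schematic illustration of this decomposition (a sketch only; the function and argument names below are hypothetical, not taken from the chapter), the probability a composite model assigns to a dataset can be accumulated trial by trial while keeping the two components separate:

```python
def dataset_log_probability(trials, learning_update, observation_log_prob, params):
    """Schematic only: accumulate the (log) probability a composite model assigns to a dataset.
    `learning_update` stands in for a deterministic learning model and
    `observation_log_prob` for a noisy observation model; all names are hypothetical."""
    internal = params["initial_internal_variables"]      # e.g., initial value estimates
    total_log_prob = 0.0
    for trial in trials:
        # Observation model: probability of this trial's observed datum (choice, BOLD, spikes, ...)
        total_log_prob += observation_log_prob(trial, internal, params)
        # Learning model: deterministically update the internal variables given this trial's feedback
        internal = learning_update(trial, internal, params)
    return total_log_prob
```

On this view, fitting a different data modality amounts to swapping in a different observation model (here, `observation_log_prob`) while leaving the learning model untouched.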

3 Parameter estimation
Model parameters can characterize a variety of scientifically interesting quantities, from how quickly sub-
jects learn (Behrens et al., 2007) to how sensitive they are to different rewards and punishments (Tom et al.,
2007). Here we consider how to obtain statistical results about parameters’ values from data. We first con-
sider the general statistical rationale underlying the problem; then develop the details for an example RL
model before considering various pragmatic factors of actually performing these analyses on data. Finally,
having discussed these details in terms of choice data, we discuss how the same methods may be applied
to other sorts of data.
Suppose we have some model M, with a vector of free parameters θ_M. The model (here, the composite of our learning and observation models) describes a probability distribution, or likelihood function, P(D | M, θ_M) over possible data sets D. Then, Bayes’ rule tells us that, having observed a data set D,

P(θ_M | D, M) ∝ P(D | M, θ_M) · P(θ_M | M)    (1)
That is, the posterior probability distribution over the free parameters, given the data, is proportional to the product of two factors: (1) the likelihood of the data, given the free parameters, and (2) the prior probability of the parameters. This equation famously shows how to start with a theory of how parameters (noisily) produce data, and invert it into a theory by which data (noisily) reveal the parameters that produced them. Classically, we seek a point estimate of the parameters θ_M rather than a posterior distribution over all possible values; if we neglect (or treat as flat) the prior over the parameters P(θ_M | M), then the most probable value for θ_M is the maximum likelihood estimate: the setting of the parameters that maximizes the likelihood function, P(D | M, θ_M). We denote this θ̂_M.
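
In symbols, taking logs of Equation 1 gives

log P(θ_M | D, M) = log P(D | M, θ_M) + log P(θ_M | M) + const

and, neglecting the prior, the maximum likelihood estimate is θ̂_M = argmax over θ_M of P(D | M, θ_M), the quantity sought throughout the rest of this section.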
3.1 Maximum likelihood estimation for RL
An RL model: We may see how the general ideas play out in a simple reinforcement learning setting. Consider a simple game in which on each trial t, a subject makes a choice c_t (= L or R) between a left and a right slot machine, and receives a reward r_t (= $1 or $0) stochastically. According to a simple Q-learning model (Watkins, 1989), on each trial the subject assigns an expected value to each machine: Q_t(L) and Q_t(R). We initialize these values to (say) 0, and then on each trial, the value for the chosen machine is updated as

Q_{t+1}(c_t) = Q_t(c_t) + α · δ_t    (2)
where 0 ≤ α ≤ 1 is a free learning rate parameter, and δ_t = r_t − Q_t(c_t) is the prediction error. Equation 2 is our learning model. To explain the choices c_t in terms of the values Q_t, we assume an observation model. In RL, it is often assumed that subjects choose probabilistically according to a softmax distribution:

P(c_t = L | Q_t(L), Q_t(R)) = exp(β · Q_t(L)) / [exp(β · Q_t(R)) + exp(β · Q_t(L))]    (3)
Here, β is a free parameter known in RL as the inverse temperature parameter. However, note that Equation 3 is also equivalent to standard logistic regression where the dependent variable is the binary choice variable c_t and there is one predictor variable, the difference in values Q_t(L) − Q_t(R). Therefore, β can also be viewed as the regression weight connecting the Qs to the choices. More generally, when there are more than two choice options, the softmax model corresponds to a generalization of logistic regression known as conditional logit regression (McFadden, 1974).
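
For concreteness, choices could be simulated from this model (Equations 2 and 3) roughly as follows. This is a minimal Python sketch; the reward probabilities of the two machines and the other default settings are hypothetical illustrations, not values specified in the text.

```python
import numpy as np

def simulate_q_learning(n_trials=1000, alpha=0.25, beta=1.0,
                        reward_probs=(0.6, 0.4), seed=0):
    """Simulate choices and rewards from the Q-learning/softmax model of
    Equations 2 and 3. reward_probs are hypothetical Bernoulli reward
    probabilities for the two machines (index 0 = L, 1 = R)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(2)                       # Q(L), Q(R), initialized to 0
    choices, rewards = [], []
    for t in range(n_trials):
        # Equation 3: softmax probability of choosing L (index 0)
        p_L = np.exp(beta * Q[0]) / (np.exp(beta * Q[0]) + np.exp(beta * Q[1]))
        c = 0 if rng.random() < p_L else 1
        r = 1.0 if rng.random() < reward_probs[c] else 0.0   # $1 or $0
        Q[c] += alpha * (r - Q[c])        # Equation 2: delta_t = r_t - Q_t(c_t)
        choices.append(c)
        rewards.append(r)
    return np.array(choices), np.array(rewards)
```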
The model of Equations 2 and 3 is only a representative example of the sorts of algorithms used to
study reinforcement learning. Since our focus here is on the methodology for estimation given a model,
a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and
Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments). That said, most models in the literature
are variants on the example shown here. Another commonly used (Daw et al., 2006; Behrens et al., 2007) and
seemingly rather different family of learning methods is Bayesian models such as the Kalman filter (Kakade
[Figure 1 appears here: a 2-D likelihood surface plotted over β (horizontal axis, 0 to 2) and α (vertical axis, 0 to 1).]
Figure 1: Likelihood surface for simulated reinforcement learning data, as a function of two free parame-
ters. Lighter colors denote higher data likelihood. The maximum likelihood estimate is shown as an “o”
surrounded by an ellipse of one standard error (a region of about 90% confidence); the true parameters
from which the data were generated are denoted by an “x”.
and Dayan, 2002). In fact, the Q-learning rule of Equation 2 can be seen as a simplified case of the Kalman
filter: the Bayesian model uses the same learning rule but has additional machinery that determines the
learning rate parameter α on a trial-by-trial basis (Kakade and Dayan, 2002; Behrens et al., 2007; Daw et al.,
2008).
Data likelihood: Given the model described above, the probability of a whole dataset D (i.e., a whole sequence of choices c = c_{1...T}, given the rewards r = r_{1...T}) is just the product of their probabilities from Equation 3,

∏_t P(c_t | Q_t(L), Q_t(R))    (4)

Note that the terms Q_t in the softmax are determined (via Equation 2) by the rewards r_{1...t−1} and choices c_{1...t−1} on trials prior to t.
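
Equivalently, working on a log scale (which will matter for the numerical issues discussed in Section 3.2), the log likelihood of the dataset is the sum over trials

log P(D | M, θ_M) = Σ_t log P(c_t | Q_t(L), Q_t(R)).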
Together, Equations 2 and 3 constitute a full likelihood function P(D | M, θ_M), and we can estimate the free parameters (θ_M = ⟨α, β⟩) by maximum likelihood. Figure 1 illustrates the process: 1,000 choice trials were simulated according to the model (with parameters α = .25 and β = 1, red x). The likelihood of the observed data was then computed for a range of parameters, and plotted (with brighter colors for higher likelihood) on a 2-D grid. In this case, the maximum likelihood point (α̂ = .34 and β̂ = .93, blue circle) was near the true parameters.
Confidence intervals: Of course, in order actually to test a hypothesis about the parameters’ values, we need to be able to make statistical claims about the quality of the estimate θ̂_M. Intuitively, the degree to which our estimate can be trusted depends on how much better it accounts for the data than other nearby parameter estimates, that is, on how sharply peaked is the “hill” of data likelihoods in the space of parameters. Such peakiness is characterized by the second derivative (the Hessian) of the likelihood function with respect to the parameters. The Hessian is a square matrix (here, 2×2) with a row and column for each parameter. Evaluated at the peak point θ̂_M, the elements of the Hessian are larger the more rapidly the likelihood function is dropping off away from it in different directions, which corresponds to a more reliable estimate of the parameters. Conversely, the matrix inverse of the Hessian (like the reciprocal of a scalar) is larger for poorer estimates, like error bars. More precisely, if H is the Hessian of the negative log of the likelihood function at the maximum likelihood point θ̂_M, then a standard estimator for the covariance of the parameter estimates is its matrix inverse H⁻¹ (MacKay, 2003).

The diagonal terms of H⁻¹ correspond to variances for each parameter separately, and their square roots measure one standard error on the parameter. Thus, for instance, 95% confidence intervals around the maximum likelihood estimate may be estimated as θ̂ plus or minus 1.96 standard errors.
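
As a rough sketch of how such error bars might be computed in practice, the Hessian can be approximated by finite differences and then inverted. The names `negloglik` (a function returning the negative log likelihood of the data at a parameter vector) and `theta_hat` (its minimizer) below are hypothetical placeholders; analytic or library-provided derivatives would also serve.

```python
import numpy as np

def numerical_hessian(f, theta, eps=1e-4):
    """Central-difference approximation to the Hessian of f at the point theta."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            t_pp = theta.copy(); t_pp[i] += eps; t_pp[j] += eps
            t_pm = theta.copy(); t_pm[i] += eps; t_pm[j] -= eps
            t_mp = theta.copy(); t_mp[i] -= eps; t_mp[j] += eps
            t_mm = theta.copy(); t_mm[i] -= eps; t_mm[j] -= eps
            H[i, j] = (f(t_pp) - f(t_pm) - f(t_mp) + f(t_mm)) / (4 * eps ** 2)
    return H

# Assuming negloglik is the negative log-likelihood function and theta_hat its
# minimizer (both hypothetical names):
# H = numerical_hessian(lambda th: negloglik(th, choices, rewards), theta_hat)
# cov = np.linalg.inv(H)            # estimated covariance of the parameter estimates
# se = np.sqrt(np.diag(cov))        # one standard error per parameter
# ci_95 = np.stack([theta_hat - 1.96 * se, theta_hat + 1.96 * se])
```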
Covariance between parameters: The off-diagonal terms of H⁻¹ measure covariance between the parameters, and are useful for diagnosing model fit. In general, large off-diagonal terms are a symptom of a poorly specified model or some kinds of bad data. In the worst case, two parameters may be redundant, so that there is no unique optimum. The Q-learning model has a more moderate coupling between the parameters. As can be seen from the elongated, tilted shape of the “ridge” in Figure 1, estimates of α and β tend to be inversely coupled in this model. By increasing β while decreasing α (or vice versa: moving northwest or southeast in the figure), a similar likelihood is obtained. This is because the reward r_t is multiplied by both α (in Equation 2, to update Q_t) and then by β (in Equation 3) before affecting the choice likelihood on the next trial. As a result, neither parameter individually can be estimated very tightly by itself (the “ridge” is a bit wide if you cross it horizontally in β or vertically in α), but their product is well estimated (the hill is narrowest when crossed from northeast to southwest). The blue oval in the figure traces out a one-standard-error ellipse in the two parameters jointly, derived from H⁻¹; its tilt follows the contour of the ridge.
Often in applications such as logistic regression, a corrected covariance estimator is used that is thought to be more robust to problems such as mismatch between the true and assumed models. This “Huber-White” or “sandwich” estimator (Huber, 1967; Freedman, 2006) is H⁻¹BH⁻¹, where B = Σ_t g(c_t)ᵀ g(c_t), and g(c_t), in turn, is the gradient (vector of first partial derivatives with respect to the parameters) of the negative log likelihood of the t-th data point c_t, evaluated at θ̂_M. This is harder to compute in practice, since it involves keeping track of g, which is laborious. However, as discussed below, g can also be useful when searching for the maximum likelihood point.
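
For illustration only: if the per-trial gradients were collected as rows of a matrix G (one row per trial, one column per parameter; a hypothetical layout, not prescribed by the chapter), the sandwich estimate could be assembled as follows.

```python
import numpy as np

def sandwich_covariance(H, G):
    """Huber-White ("sandwich") covariance estimate, H^-1 B H^-1.
    H: Hessian of the negative log likelihood at the ML estimate (n_params x n_params).
    G: per-trial gradients of the negative log likelihood at the ML estimate,
       stacked as rows (n_trials x n_params); row t is g(c_t)."""
    H_inv = np.linalg.inv(H)
    B = G.T @ G                       # sum over trials of g(c_t)^T g(c_t)
    return H_inv @ B @ H_inv
```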
3.2 Pragmatics
Above, we developed the general equations for maximum likelihood parameter estimation in an RL model;
how can these be implemented in practice for data analysis?
First, although we have noted an equivalence between Equation 3 and logistic regression, it is not possible simply to use an off-the-shelf regression package to estimate the parameters. This is because although the observation stage of the model represents a logistic regression from values Q_t to choices c_t, the values are not fixed but themselves depend on the free parameters (here, α) of the learning process. As these do not enter the likelihood linearly, they cannot be estimated by a generalized linear model. Thus, we must search for the full set of free parameters that optimize the likelihood function.
Likelihood function: At the heart of optimizing the likelihood function is computing it. It is straightforward to write a function that takes in a dataset (a sequence of choices and rewards) and a candidate setting of the free parameters, loops over the data computing Equations 2 and 3, and returns the aggregate likelihood of the data. Importantly, the product in Equation 4 is often an exceptionally small number; it is thus numerically more stable to compute its log, i.e., the sum over trials of the log of the choice probability from Equation 3, which is β · Q_t(c_t) − log(exp(β · Q_t(L)) + exp(β · Q_t(R))). Since log is a monotonic function, this quantity has the same optimum but is less likely to underflow the minimum floating point value representable by a computer. (Another numerical trick is to note that Equation 3 is invariant to the addition or subtraction of any constant to all of the Q values. The chance of the exponential under- or overflowing can thus be reduced by evaluating the log probability for Q values after first subtracting their mean.)
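
A minimal Python sketch of such a likelihood function, and of one way of handing it to a general-purpose optimizer, might look as follows; choices are assumed to be coded 0 (L) and 1 (R), and the variable names, parameter bounds, and starting point are illustrative assumptions rather than recommendations.

```python
import numpy as np
from scipy.optimize import minimize

def negloglik(theta, choices, rewards):
    """Negative log likelihood of a choice sequence under Equations 2-4.
    theta = (alpha, beta); choices coded 0 (L) / 1 (R); rewards 0 or 1."""
    alpha, beta = theta
    Q = np.zeros(2)                       # initial values Q(L), Q(R)
    nll = 0.0
    for c, r in zip(choices, rewards):
        z = beta * (Q - Q.mean())         # subtract a constant to avoid overflow in exp
        # log P(c_t) = beta * Q_t(c_t) - log(exp(beta * Q_t(L)) + exp(beta * Q_t(R)))
        nll -= z[c] - np.log(np.exp(z).sum())
        Q[c] += alpha * (r - Q[c])        # Equation 2 (applied after scoring this trial's choice)
    return nll

# One way to search for the maximum likelihood point (bounds keep alpha in
# [0, 1] and beta nonnegative; the upper bound on beta and the starting point
# are arbitrary choices):
# fit = minimize(negloglik, x0=[0.5, 1.0], args=(choices, rewards),
#                bounds=[(0.0, 1.0), (0.0, 10.0)])
# alpha_hat, beta_hat = fit.x
```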
How, then, to find the optimal likelihood? In general, it may be tempting to discretize the space of
free parameters, compute the likelihood everywhere, and simply search for the best, much as is illustrated
in Figure 1. We recommend against this approach. First, most models of interest have more than two
parameters, and exhaustively testing all combinations in a higher dimensional space becomes intractable.