Book Chapter

Trial-by-trial data analysis using computational models

01 May 2011
TL;DR: The present review aims to clarify the toolbox by cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of many sorts.
Abstract: In numerous and high-profile studies, researchers have recently begun to integrate computational models into the analysis of data from experiments on reward learning and decision making (Platt and Glimcher, 1999; O’Doherty et al., 2003; Sugrue et al., 2004; Barraclough et al., 2004; Samejima et al., 2005; Daw et al., 2006; Li et al., 2006; Frank et al., 2007; Tom et al., 2007; Kable and Glimcher, 2007; Lohrenz et al., 2007; Schonberg et al., 2007; Wittmann et al., 2008; Hare et al., 2008; Hampton et al., 2008; Plassmann et al., 2008). As these techniques are spreading rapidly, but have been developed and documented somewhat sporadically alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty et al., 2007). In particular, we discuss the rationale for these methods and the questions they are suited to address. We then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis. The techniques are illustrated with fits of simple models to simulated datasets. Throughout, we flag interpretational and technical pitfalls of which we believe authors, reviewers, and readers should be aware. We focus on cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of many sorts. A number of other reviews (Daw and Doya, 2006; Dayan and Niv, 2008) have focused on the scientific conclusions that have been obtained with these methods, an issue we omit almost entirely here. There are also excellent books that cover statistical inference of this general sort with much greater generality, formal precision, and detail (MacKay, 2003; Gelman et al., 2004; Bishop, 2006; Gelman and Hill, 2007).

Summary (4 min read)

1 Introduction

  • The authors then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis.
  • The techniques are illustrated with fits of simple models to simulated datasets.
  • Throughout, the authors flag interpretational and technical pitfalls of which they believe authors, reviewers, and readers should be aware.

2 Background

  • Although the original empirical articles reported activity averaged across many trials, and the mean behavior of computational simulations was compared to these reports, a more central issue in learning is how behavior (or the underlying neural activity) changes trial by trial in response to feedback.
  • This example points to another important feature of this approach, which is that the theories purport to quantify, trial-by-trial, variables such as the reward expected for a choice (and the “prediction error,” or difference between the received and expected rewards).
  • These hypotheses may be tested against one another on the basis of their fit to the data.
  • The second part, which the authors will call the observation model, describes how the model’s internal variables are reflected in observed data: for instance, how expected values drive choice or how prediction errors produce neural spiking.

3 Parameter estimation

  • Model parameters can characterize a variety of scientifically interesting quantities, from how quickly subjects learn (Behrens et al., 2007) to how sensitive they are to different rewards and punishments (Tom et al., 2007).
  • Here the authors consider how to obtain statistical results about parameters’ values from data.
  • The authors first consider the general statistical rationale underlying the problem; then develop the details for an example RL model before considering various pragmatic factors of actually performing these analyses on data.
  • That is, the posterior probability distribution over the free parameters, given the data, is proportional to the product of two factors, (1) the likelihood of the data, given the free parameters, and (2) the prior probability of the parameters.
  • This equation famously shows how to start with a theory of how parameters produce data, and invert it into a theory by which data reveal the parameters that produced them.

3.1 Maximum likelihood estimation for RL

  • The authors may see how the general ideas play out in a simple reinforcement learning setting.
  • Since their focus here is on the methodology for estimation given a model, a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments).
  • Evaluated at the peak point θ̂M, the elements of the Hessian are larger the more rapidly the likelihood function is dropping off away from it in different directions, which corresponds to a more reliable estimate of the parameters.
  • The diagonal terms of H−1 correspond to variances for each parameter separately, and their square roots measure one standard error on the parameter.
  • The Q learning model has a more moderate coupling between the parameters.

3.2 Pragmatics

  • At the heart of optimizing the likelihood function is computing it.
  • Second, and not unrelated, discretizing the parameters too coarsely, or searching within an inappropriate range, can lead to poor results; worse yet, since the parameters are typically coupled, a poor search on one will also corrupt the estimates for other parameters.
  • Thus, it makes some sense to constrain the parameter within this range.
  • There may, for instance, be features of the data that can only be captured within the model in question by adopting seemingly nonsensical parameters.

3.3 Intersubject variability and random effects

  • For population-level questions, treating parameters as fixed effects and thereby conflating within- and between-subject variability can lead to serious problems such as overstating the true significance of results.
  • P(µα, µβ, σα, σβ | C1 ... CN) ∝ P(C1 ... CN | µα, µβ, σα, σβ) · P(µα, µβ, σα, σβ) (Equation 7). Estimating population parameters in a hierarchical model: Equation 7 puts us in a position, in principle, to estimate the population parameters from the set of all subjects’ choices, using maximum likelihood or maximum a posteriori methods exactly as discussed for individual subjects in the previous section.
  • Moreover, assuming the distributions P(αi | µα, σα) and P(βi | µβ, σβ) are Gaussian, finding the population parameters for these expressions is just the familiar problem of estimating a Gaussian distribution from samples (a minimal sketch follows this list).
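As a minimal illustration of the simplest (summary statistics) reading of this idea, assuming each subject’s α and β have already been estimated individually as in Section 3.1, the population-level Gaussian parameters and an ordinary between-group comparison can be sketched as follows (all names and numbers are placeholders of our own, not the chapter’s):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha_hat = rng.uniform(0.1, 0.6, size=20)   # stand-ins for per-subject ML estimates of alpha
beta_hat = rng.uniform(0.5, 3.0, size=20)    # stand-ins for per-subject ML estimates of beta

# Population parameters of the assumed Gaussian distributions over subjects.
mu_alpha, sigma_alpha = alpha_hat.mean(), alpha_hat.std(ddof=1)
mu_beta, sigma_beta = beta_hat.mean(), beta_hat.std(ddof=1)

# A population-level (random effects) question, e.g. whether the mean learning rate differs
# between two groups of ten subjects, then reduces to an ordinary between-subject test.
t_stat, p_value = stats.ttest_ind(alpha_hat[:10], alpha_hat[10:])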

3.5 Extensions

  • The basic procedures outlined above admit of many extensions.
  • In fact, this observation model (augmented with a hierarchical random effects model over the regression weights, such as β1, across the population) is identical to the general linear model used in standard fMRI analysis packages such as SPM.
  • Now the authors may also compute a second timeseries ∂δt/∂α: the partial derivative of the prediction error timeseries with respect to α (a small numerical sketch follows this list).
  • Parameters µα1 and so on can be estimated to determine what the modes are that best fit the data; π1 controls the predominance of subject type 1; and the question of how many types of subjects the data support is a model selection question, answerable by the methods discussed in Section 4.
  • In all, fast acquisition followed by stable choice of the better option might be modeled with a decrease over trials in the learning rate, perhaps combined with an increase in the softmax temperature.
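A small sketch of how such a derivative timeseries might be obtained numerically (our own illustration; the choice/reward sequences and the fitted learning rate alpha_hat below are placeholders):

import numpy as np

def prediction_errors(choices, rewards, alpha):
    # Run the Q-learning rule of Section 3.1 and record the prediction error on every trial.
    q = {'L': 0.0, 'R': 0.0}
    deltas = []
    for c, r in zip(choices, rewards):
        delta = r - q[c]
        deltas.append(delta)
        q[c] += alpha * delta
    return np.array(deltas)

choices = ['L', 'R', 'L', 'L']     # placeholder data
rewards = [1, 0, 0, 1]
alpha_hat, eps = 0.3, 1e-3

delta_hat = prediction_errors(choices, rewards, alpha_hat)
ddelta_dalpha = (prediction_errors(choices, rewards, alpha_hat + eps) - delta_hat) / eps
# Both delta_hat and ddelta_dalpha can be entered as parametric regressors in an fMRI design.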

4 Model comparison

  • So far, the authors have assumed a fixed model of the data and sought to estimate its free parameters.
  • In particular, models like that of Equation 2 and its alternatives constitute different hypotheses about the mechanisms or algorithms that the brain uses to solve RL problems.
  • The methods discussed below address this problem.
  • In some cases, questions of interest might be framed either in terms of parameter estimation or model selection, and thus addressed using either the methods of the previous section or those of the current one.

4.1 Examples from RL

  • The authors illustrate the issues of model selection using some simple alternatives to the model of choice behavior discussed thus far.
  • In fact, the practical ingredients for model evaluation are basically the same as those for parameter estimation; as before, what is needed is simply to compute data likelihoods under a model, optimize parameters, and estimate Hessians.
  • Depending on β (assumed to be fixed), these asymptotic learned values may imply less-than-complete preference for the better option over the worse.
  • As already noted, this introduces some difficulty in comparing them.

4.2 Classical model comparison

  • Conversely, a model fits well exactly to the extent that it captures the repeatable aspects of the data, allowing good predictions of additional datasets.
  • In contrast, it has rarely been used in studies of reinforcement learning (though see Camerer and Ho, 1999).
  • Let us consider again the likelihood of a single dataset, using best-fitting parameters.

4.3 Bayesian model comparison in theory

  • The key quantity here is P(D | M), known as the model evidence: the probability of the data under the model.
  • Importantly, this expression does not make reference to any particular parameter settings such as θ̂M, since in asking how well a model predicts data the authors are not given any particular parameters.
  • This means that a more flexible model (one with more parameters that is able to achieve good fit to many data sets with different particular parameter settings) must correspondingly assign lower P(D | M) to all of them since a fixed probability of 1 is divided among them all.
  • The result of a Bayesian model comparison is a statistical claim about the relative fit of one model over another.
  • Kass and Raftery (1995) present a table of conventions for interpreting Bayes factors; note that their logs are taken in base-10 rather than base-e.

4.4 Bayesian model comparison in practice

  • The theory of Bayesian model selection is a very useful conceptual framework; for instance, it clarifies why the maximum likelihood score is an inappropriate metric for model comparison.
  • The authors have mostly ignored priors thus far, because their subjective nature arguably makes them problematic in the context of objective scientific communication.
  • Thus, most of the methods discussed below do require assuming a prior over parameters.
  • See MacKay (2003) and Bishop (2006) for discussion of more elaborate sampling techniques that attempt to cope with this situation.
  • Similarly, H is the Hessian of the function being optimized (minus the sum of the first two terms of Equation 17), evaluated at the MAP point, not the Hessian of just the log likelihood.

4.5 Summary and recommendations

  • Models may be compared to one another on the basis of the likelihood they assign to data; however, if this likelihood is computed at parameters chosen to optimize it, the measure must be corrected for overfitting to allow a fair comparison between models with different numbers of parameters.
  • In practice, when models are nested, the authors suggest using a likelihood ratio test, since this permits reporting a classical p-value and is well accepted (a minimal sketch of this test follows this list).
  • If one is willing to define a prior, and defend it, the authors suggest exploring the Laplace approximation, which is almost as simple but far better founded.
  • Even if two candidate models have the same number of parameters — and thus scores like BIC are equivalent to just comparing raw likelihoods — the complexity penalty implied by Equation 15 may not actually be the same between them if the two sets of parameters are differently constrained, either a priori or by the data.
  • As in Figure 5, this more accurate assessment can have salutary effects.
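A minimal numerical sketch of these recommendations (a likelihood ratio test for nested models, plus a BIC-style score for comparisons that account for the number of parameters); the log likelihood values below are placeholders of our own, not results from any dataset:

import numpy as np
from scipy import stats

# Placeholder values: maximized log likelihoods of a full model (k = 3 parameters) and of a
# nested, restricted model (k = 2 parameters, i.e., one parameter fixed), over T = 300 trials.
loglik_full, loglik_restricted = -610.2, -614.9
T = 300

# Likelihood ratio test: twice the log likelihood difference is asymptotically chi-squared,
# with degrees of freedom equal to the number of parameters fixed in the restricted model.
lr_stat = 2 * (loglik_full - loglik_restricted)
p_value = stats.chi2.sf(lr_stat, df=1)

# BIC (lower is better) penalizes each extra parameter by log(T) and can also be used
# when the models are not nested.
bic = lambda loglik, k: -2 * loglik + k * np.log(T)
bic_full, bic_restricted = bic(loglik_full, 3), bic(loglik_restricted, 2)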

4.6 Model comparison and populations

  • So far the authors have described model comparison mostly in the abstract, with applications to choice data at the single subject level.
  • There are a number of possibilities, of which the simplest will often suffice.
  • These aggregates can then be compared between two models to compute a Bayes factor over the population.
  • Finally, one could take the identity of the model as varying over subjects, i.e. as a random effect (Stephan et al., 2009).
  • This involves adding another level to the hierarchy of Figure 2, according to which, for each subject, one of a set of models is drawn with some probability according to a multinomial distribution (given by new free parameters), and then the model’s parameters and the data are drawn as before.

5 Pitfalls and alternatives

  • The authors close this tutorial by identifying some pitfalls, caveats, and concerns with these methods that they think it is important for readers to appreciate.
  • As mentioned, even if learning parameters such as temperature and learning rate are actually independent from one another (i.e., in terms of their distribution across a population), the estimates of those parameters from data may be correlated due to their having similar expected effects on observable data.
  • For choice data, exponentiating this average log likelihood, exp(L/T), produces a probability that is easily interpreted relative to the chance level (a short numerical example follows this list).
  • This points again to the fact that these methods are suited to drawing relative conclusions comparing multiple hypotheses (the data support model A over model B).
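A two-line illustration of that computation (with placeholder numbers of our own):

import numpy as np

loglik, n_trials = -180.0, 300                 # total log likelihood and number of choices (placeholders)
avg_choice_prob = np.exp(loglik / n_trials)    # about 0.55 here, versus a chance level of 0.5 for two options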




Trial-by-trial data analysis using computational models
Nathaniel D. Daw
August 27, 2009
[manuscript for Attention & Performance XXIII.]
1 Introduction
In numerous and high-profile studies, researchers have recently begun to integrate computational models
into the analysis of data from experiments on reward learning and decision making (Platt and Glimcher,
1999; O’Doherty et al., 2003; Sugrue et al., 2004; Barraclough et al., 2004; Samejima et al., 2005; Daw et al.,
2006; Li et al., 2006; Frank et al., 2007; Tom et al., 2007; Kable and Glimcher, 2007; Lohrenz et al., 2007;
Schonberg et al., 2007; Wittmann et al., 2008; Hare et al., 2008; Hampton et al., 2008; Plassmann et al., 2008).
As these techniques are spreading rapidly, but have been developed and documented somewhat sporadically alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty et al., 2007). In particular, we discuss the rationale for these methods and the questions they are suited to address. We then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis. The techniques are illustrated with fits of simple models to simulated datasets. Throughout, we flag interpretational and technical pitfalls of which we believe authors, reviewers, and readers should be aware. We focus on cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of
many sorts.
A number of other reviews (Daw and Doya, 2006; Dayan and Niv, 2008) have focused on the scientific
conclusions that have been obtained with these methods, an issue we omit almost entirely here. There are
also excellent books that cover statistical inference of this general sort with much greater generality, formal
precision, and detail (MacKay, 2003; Gelman et al., 2004; Bishop, 2006; Gelman and Hill, 2007).
2 Background
Much work in this area grew out of the celebrated observation (Barto, 1995; Schultz et al., 1997) that the
firing of midbrain dopamine neurons (and also the BOLD signal measured via fMRI in their primary target,
the striatum; Delgado et al., 2000; Knutson et al., 2000; McClure et al., 2003; O’Doherty et al., 2003) resembles
a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL,
i.e. trial and error learning in decision problems; Sutton and Barto, 1998). Although the original empirical
articles reported activity averaged across many trials, and the mean behavior of computational simulations
was compared to these reports, in fact, a more central issue in learning is how behavior (or the underlying
neural activity) changes trial by trial in response to feedback. In fact, the computational theories are framed
in just these terms, and so more recent work on the system (O’Doherty et al., 2003; Bayer and Glimcher,
2005) has focused on comparing their predictions to raw data timeseries, trial by trial: measuring, in effect,
the theories’ goodness of fit to the data, on average, rather than their goodness of fit to the averaged data.

This change in approach represents a major advance in the use of computational models for experimental design and analysis, which is still unfolding. Used this way, computational models represent exceptionally detailed, quantitative hypotheses about how the brain approaches a problem, which are amenable
to direct experimental test. As noted, such trial-by-trial analyses are particularly suitable to developing a
more detailed and dynamic picture of learning than was previously available.
In a standard experiential decision experiment, such as a “bandit” task (Sugrue et al., 2004; Lau and
Glimcher, 2005; Daw et al., 2006), a subject is offered repeated opportunities to choose between multiple
options (e.g. slot machines) and receives rewards or punishments according to her choice on each trial.
Data might consist of a series of choices and outcomes (one per trial). In principle, any arbitrary relationship might obtain between the entire list of past choices and outcomes, and the next one. Computational
theories constitute particular claims about some more restricted function by which previous choices and
feedback give rise to subsequent choices. For instance, standard RL models (such as “Q learning”; Watkins,
1989) envision that subjects track the expected reward for each slot machine, via some sort of running average over the feedback, and it is only through these aggregated “value” predictions that past feedback
determines future choices.
This example points to another important feature of this approach, which is that the theories purport
to quantify, trial-by-trial, variables such as the reward expected for a choice (and the “prediction error,”
or difference between the received and expected rewards). That is, the theories permit the estimation of
quantities (expectations, expectation violations) that would otherwise be subjective; this, in turn, enables
the search for neural correlates of these estimates (Platt and Glimcher, 1999; Sugrue et al., 2004).
By comparing the model’s predictions to trial-by-trial experimental data, such as choices or BOLD signals, it is possible using a mixture of Bayesian and classical statistical techniques to answer two sorts of
questions about a model, which are discussed in Sections 3 and 4 below. The art is framing questions of
scientific interest in these terms.
The first question is parameter estimation. RL models typically have a number of free parameters
measuring quantities such as the “learning rate,” or the degree to which subjects update their beliefs in
response to feedback. Often, these parameters characterize (or new parameters can be introduced so as to
characterize) factors that are of experimental interest. For instance, Behrens et al. (2007) tested predictions
about how particular task manipulations would affect the learning rate.
The second type of question that can be addressed is model comparison. Different computational models, in effect, constitute different hypotheses about the learning process that gave rise to the data. These hypotheses may be tested against one another on the basis of their fit to the data. For example, Hampton et al. (2008) use this method to compare which of different approaches subjects use for anticipating an
opponent’s behavior in a multiplayer competitive game.
Learning and observation models: In order to appreciate the extent to which the same methods may
be applied to different sets of data, it is useful to separate a computational theory into two parts. The first,
which we will call the learning model, describes the dynamics of the model’s internal variables such as the
reward expected for each slot machine. The second part, which we will call the observation model, describes
how the model’s internal variables are reflected in observed data: for instance, how expected values drive
choice or how prediction errors produce neural spiking. Essentially, the observation model regresses the
learning model’s internal variables onto the observed data; it plays a similar role as (and is often, in fact,
identical to) the “link function” in generalized linear modeling. In this way, a common learning process
(a single learning model) may be viewed as giving rise to distinct observable data streams in a number of
different modalities (e.g., choices and BOLD, through two separate observation models). Thus, although we
describe the methods in this tutorial primarily in terms of choice data, they are directly applicable to other
modalities simply by substituting a different observation model.
Crucially, whereas the learning model is typically deterministic, the observation models are noisy: that
is, given the internal variables produced by the learning model, an observation model assigns some probability to any possible observations. Thus the “fit” of different learning models, or their parameters, to any
observed data can be quantified statistically in terms of the probability they assign to the data, a procedure
at the core of the methods that follow.

3 Parameter estimation
Model parameters can characterize a variety of scientifically interesting quantities, from how quickly subjects learn (Behrens et al., 2007) to how sensitive they are to different rewards and punishments (Tom et al., 2007). Here we consider how to obtain statistical results about parameters’ values from data. We first consider the general statistical rationale underlying the problem; then develop the details for an example RL
model before considering various pragmatic factors of actually performing these analyses on data. Finally,
having discussed these details in terms of choice data, we discuss how the same methods may be applied
to other sorts of data.
Suppose we have some model M, with a vector of free parameters θM. The model (here, the composite of our learning and observation models) describes a probability distribution, or likelihood function, P(D | M, θM) over possible data sets D. Then, Bayes’ rule tells us that having observed a data set D,

P(θM | D, M) ∝ P(D | M, θM) · P(θM | M)    (1)

That is, the posterior probability distribution over the free parameters, given the data, is proportional to the product of two factors: (1) the likelihood of the data, given the free parameters, and (2) the prior probability of the parameters. This equation famously shows how to start with a theory of how parameters (noisily) produce data, and invert it into a theory by which data (noisily) reveal the parameters that produced them. Classically, we seek a point estimate of the parameters θM rather than a posterior distribution over all possible values; if we neglect (or treat as flat) the prior over the parameters P(θM | M), then the most probable value for θM is the maximum likelihood estimate: the setting of the parameters that maximizes the likelihood function, P(D | M, θM). We denote this θ̂M.
3.1 Maximum likelihood estimation for RL
An RL model: We may see how the general ideas play out in a simple reinforcement learning setting. Consider a simple game in which on each trial t, a subject makes a choice ct (= L or R) between a left and a right slot machine, and receives a reward rt (= $1 or $0) stochastically. According to a simple Q-learning model (Watkins, 1989), on each trial the subject assigns an expected value to each machine: Qt(L) and Qt(R). We initialize these values to (say) 0, and then on each trial, the value for the chosen machine is updated as

Qt+1(ct) = Qt(ct) + α · δt    (2)

where 0 ≤ α ≤ 1 is a free learning rate parameter, and δt = rt − Qt(ct) is the prediction error. Equation 2 is our learning model. To explain the choices ct in terms of the values Qt we assume an observation model. In RL, it is often assumed that subjects choose probabilistically according to a softmax distribution:

P(ct = L | Qt(L), Qt(R)) = exp(β · Qt(L)) / [exp(β · Qt(R)) + exp(β · Qt(L))]    (3)

Here, β is a free parameter known in RL as the inverse temperature parameter. However, note that Equation 3 is also equivalent to standard logistic regression where the dependent variable is the binary choice variable ct and there is one predictor variable, the difference in values Qt(L) − Qt(R). Therefore, β can also be viewed as the regression weight connecting the Qs to the choices. More generally, when there are more than two choice options, the softmax model corresponds to a generalization of logistic regression known as conditional logit regression (McFadden, 1974).
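To make Equations 2 and 3 concrete, the following minimal sketch (in Python, our own choice of language; all function and variable names are ours, not the chapter's) implements the learning model and the observation model for the two-machine task:

import numpy as np

def softmax_prob_left(q_left, q_right, beta):
    # Equation 3: probability of choosing the left machine given the current Q values.
    # Subtracting the larger value first is the under/overflow trick noted in Section 3.2.
    q = np.array([q_left, q_right]) - max(q_left, q_right)
    expq = np.exp(beta * q)
    return expq[0] / expq.sum()

def q_update(q, choice, reward, alpha):
    # Equation 2: move the chosen value toward the obtained reward.
    delta = reward - q[choice]            # prediction error
    q = dict(q)                           # copy, so the original values stay untouched
    q[choice] += alpha * delta
    return q, delta

# One illustrative trial: equal values, choose L, win $1.
q = {'L': 0.0, 'R': 0.0}
p_left = softmax_prob_left(q['L'], q['R'], beta=1.0)   # 0.5 when the values are equal
q, delta = q_update(q, 'L', 1.0, alpha=0.25)           # Q(L) becomes 0.25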
The model of Equations 2 and 3 is only a representative example of the sorts of algorithms used to
study reinforcement learning. Since our focus here is on the methodology for estimation given a model,
a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and
Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments). That said, most models in the literature
are variants on the example shown here. Another commonly used (Daw et al., 2006; Behrens et al., 2007) and seemingly rather different family of learning methods is Bayesian models such as the Kalman filter (Kakade and Dayan, 2002). In fact, the Q-learning rule of Equation 2 can be seen as a simplified case of the Kalman filter: the Bayesian model uses the same learning rule but has additional machinery that determines the learning rate parameter α on a trial-by-trial basis (Kakade and Dayan, 2002; Behrens et al., 2007; Daw et al., 2008).

Figure 1: Likelihood surface for simulated reinforcement learning data, as a function of the two free parameters (β on the horizontal axis, from 0 to 2; α on the vertical axis, from 0 to 1). Lighter colors denote higher data likelihood. The maximum likelihood estimate is shown as an “o” surrounded by an ellipse of one standard error (a region of about 90% confidence); the true parameters from which the data were generated are denoted by an “x”.
Data likelihood: Given the model described above, the probability of a whole dataset D (i.e., a whole sequence of choices c = c1...T, given the rewards r = r1...T) is just the product of their probabilities from Equation 3,

P(D | M, θM) = Πt P(ct | Qt(L), Qt(R))    (4)

Note that the terms Qt in the softmax are determined (via Equation 2) by the rewards r1...t−1 and choices c1...t−1 on trials prior to t.
Together, Equations 2 and 3 constitute a full likelihood function P(D | M, θM), and we can estimate the free parameters (θM = ⟨α, β⟩) by maximum likelihood. Figure 1 illustrates the process. 1,000 choice trials were simulated according to the model (with parameters α = .25 and β = 1, red “x”). The likelihood of the observed data was then computed for a range of parameters, and plotted (with brighter colors for higher likelihood) on a 2-D grid. In this case, the maximum likelihood point (α̂ = .34 and β̂ = .93, blue circle) was near the true parameters.
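Building on the sketch above (again an illustration of ours, not the chapter's own code), a function that loops over a choice/reward sequence and returns the log of the data likelihood in Equation 4 can be written as follows; evaluating such a function over a range of α and β values is what produces a surface like the one in Figure 1:

def data_log_likelihood(choices, rewards, alpha, beta):
    # choices: sequence of 'L'/'R'; rewards: sequence of 0/1 payoffs, one per trial.
    q = {'L': 0.0, 'R': 0.0}
    loglik = 0.0
    for c, r in zip(choices, rewards):
        p_left = softmax_prob_left(q['L'], q['R'], beta)
        p_choice = p_left if c == 'L' else 1.0 - p_left
        loglik += np.log(p_choice)       # summing logs rather than multiplying probabilities
        q, _ = q_update(q, c, r, alpha)  # values are updated only after the choice is observed
    return loglik

For example, data_log_likelihood(choices, rewards, alpha=0.34, beta=0.93) evaluates the fit at the maximum likelihood point shown in Figure 1.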
Confidence intervals: Of course, in order actually to test a hypothesis about the parameters’ values, we need to be able to make statistical claims about the quality of the estimate θ̂M. Intuitively, the degree to which our estimate can be trusted depends on how much better it accounts for the data than other nearby parameter estimates, that is, on how sharply peaked is the “hill” of data likelihoods in the space of parameters. Such peakiness is characterized by the second derivative (the Hessian) of the likelihood function with respect to the parameters. The Hessian is a square matrix (here, 2x2) with a row and column for each parameter. Evaluated at the peak point θ̂M, the elements of the Hessian are larger the more rapidly the likelihood function is dropping off away from it in different directions, which corresponds to a more reliable estimate of the parameters. Conversely, the matrix inverse of the Hessian (like the reciprocal of a scalar) is larger for poorer estimates, like error bars. More precisely, if H is the Hessian of the negative log of the likelihood function at the maximum likelihood point θ̂M, then a standard estimator for the covariance of the parameter estimates is its matrix inverse H−1 (MacKay, 2003).

The diagonal terms of H−1 correspond to variances for each parameter separately, and their square roots measure one standard error on the parameter. Thus, for instance, 95% confidence intervals around the maximum likelihood estimate may be estimated as θ̂ plus or minus 1.96 standard errors.
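An analytic Hessian is rarely available for models like these; one workaround, sketched here with a hand-rolled finite-difference helper (ours, not from any particular package), is to differentiate the negative log likelihood numerically at the maximum likelihood point and invert the result. The sketch assumes choices, rewards, and a fitted parameter array theta_hat (holding α̂ and β̂) are already in hand:

def numerical_hessian(f, theta, eps=1e-4):
    # Central-difference approximation to the matrix of second derivatives of f at theta.
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            def f_shift(di, dj):
                t = theta.copy()
                t[i] += di
                t[j] += dj
                return f(t)
            H[i, j] = (f_shift(eps, eps) - f_shift(eps, -eps)
                       - f_shift(-eps, eps) + f_shift(-eps, -eps)) / (4 * eps ** 2)
    return H

negloglik = lambda th: -data_log_likelihood(choices, rewards, th[0], th[1])
H = numerical_hessian(negloglik, theta_hat)   # Hessian of the negative log likelihood at the MLE
cov = np.linalg.inv(H)                        # estimated covariance of the parameter estimates
se = np.sqrt(np.diag(cov))                    # one standard error per parameter
ci95 = np.c_[theta_hat - 1.96 * se, theta_hat + 1.96 * se]   # 95% confidence intervals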
Covariance between parameters: The off-diagonal terms of H−1 measure covariance between the parameters, and are useful for diagnosing model fit. In general, large off-diagonal terms are a symptom of a poorly specified model or some kinds of bad data. In the worst case, two parameters may be redundant, so that there is no unique optimum. The Q-learning model has a more moderate coupling between the parameters. As can be seen from the elongated, tilted shape of the “ridge” in Figure 1, estimates of α and β tend to be inversely coupled in this model. By increasing β while decreasing α (or vice versa: moving northwest or southeast in the figure), a similar likelihood is obtained. This is because the reward rt is multiplied by both α (in Equation 2, to update Qt) and then by β (in Equation 3) before affecting the choice likelihood on the next trial. As a result, neither parameter individually can be estimated very tightly by itself (the “ridge” is a bit wide if you cross it horizontally in β or vertically in α), but their product is well estimated (the hill is most narrow when crossed from northeast to southwest). The blue oval in the figure traces out a one-standard-error ellipse in the two parameters jointly, derived from H−1; its tilt follows the contour of the ridge.
Often in applications such as logistic regression, a corrected covariance estimator is used that is thought to be more robust to problems such as mismatch between the true and assumed models. This “Huber-White” or “sandwich” estimator (Huber, 1967; Freedman, 2006) is H−1 B H−1, where B = Σt g(ct)ᵀ g(ct), and g(ct), in turn, is the gradient (vector of first partial derivatives with respect to the parameters) of the negative log likelihood of the tth data point ct, evaluated at θ̂M. This is harder to compute in practice, since it involves keeping track of g, which is laborious. However, as discussed below, g can also be useful when searching for the maximum likelihood point.
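A rough sketch of how this estimator can be assembled from the ingredients above (ours; the per-trial gradients g(ct) are taken by finite differences rather than analytically, and the helpers from the earlier sketches are reused):

def per_trial_negloglik(theta, choices, rewards):
    # Vector of -log P(ct | Qt) values, one entry per trial, at theta = [alpha, beta].
    alpha, beta = theta
    q = {'L': 0.0, 'R': 0.0}
    out = []
    for c, r in zip(choices, rewards):
        p_left = softmax_prob_left(q['L'], q['R'], beta)
        out.append(-np.log(p_left if c == 'L' else 1.0 - p_left))
        q, _ = q_update(q, c, r, alpha)
    return np.array(out)

def sandwich_covariance(theta_hat, choices, rewards, eps=1e-4):
    # Huber-White estimator H^-1 B H^-1, with B the sum over trials of the outer products g(ct) g(ct)^T.
    theta_hat = np.asarray(theta_hat, dtype=float)
    G = np.zeros((len(choices), len(theta_hat)))          # row t holds the per-trial gradient g(ct)
    for k in range(len(theta_hat)):
        up, down = theta_hat.copy(), theta_hat.copy()
        up[k] += eps
        down[k] -= eps
        G[:, k] = (per_trial_negloglik(up, choices, rewards)
                   - per_trial_negloglik(down, choices, rewards)) / (2 * eps)
    B = G.T @ G
    Hinv = np.linalg.inv(numerical_hessian(
        lambda th: per_trial_negloglik(th, choices, rewards).sum(), theta_hat))
    return Hinv @ B @ Hinv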
3.2 Pragmatics
Above, we developed the general equations for maximum likelihood parameter estimation in an RL model;
how can these be implemented in practice for data analysis?
First, although we have noted an equivalence between Equation 3 and logistic regression, it is not possible simply to use an off-the-shelf regression package to estimate the parameters. This is because although the observation stage of the model represents a logistic regression from values Qt to choices ct, the values are not fixed but themselves depend on the free parameters (here, α) of the learning process. As these do not enter the likelihood linearly they cannot be estimated by a generalized linear model. Thus, we must search for the full set of free parameters that optimize the likelihood function.
Likelihood function: At the heart of optimizing the likelihood function is computing it. It is straightforward to write a function that takes in a dataset (a sequence of choices and rewards) and a candidate setting of the free parameters, loops over the data computing Equations 2 and 3, and returns the aggregate likelihood of the data. Importantly, the product in Equation 4 is often an exceptionally small number; it is thus numerically more stable to compute its log, i.e. the sum over trials of the log of the choice probability from Equation 3, which is β · Qt(ct) − log(exp(β · Qt(L)) + exp(β · Qt(R))). Since log is a monotonic function, this quantity has the same optimum but is less likely to underflow the minimum floating point value representable by a computer. (Another numerical trick is to note that Equation 3 is invariant to the addition or subtraction of any constant to all of the Q values. The chance of the exponential under- or overflowing can thus be reduced by evaluating the log probability for Q values after first subtracting their mean.)
How, then, to find the optimal likelihood? In general, it may be tempting to discretize the space of
free parameters, compute the likelihood everywhere, and simply search for the best, much as is illustrated
in Figure 1. We recommend against this approach. First, most models of interest have more than two
parameters, and exhaustively testing all combinations in a higher dimensional space becomes intractable.
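An alternative in keeping with this advice is to hand the negative log likelihood to a general-purpose nonlinear optimizer. The following sketch (our own illustration, reusing data_log_likelihood from above) uses SciPy's minimize with box constraints and a few random restarts to reduce the risk of stopping in a local optimum:

import numpy as np
from scipy.optimize import minimize

def fit_subject(choices, rewards, n_starts=10, seed=0):
    # Maximize the data likelihood over theta = [alpha, beta] from several random starting points.
    rng = np.random.default_rng(seed)
    negloglik = lambda th: -data_log_likelihood(choices, rewards, th[0], th[1])
    best = None
    for _ in range(n_starts):
        theta0 = [rng.uniform(0, 1), rng.uniform(0, 5)]         # random initial alpha and beta
        result = minimize(negloglik, theta0, method='L-BFGS-B',
                          bounds=[(0, 1), (0, 20)])             # e.g., constrain alpha to [0, 1]
        if best is None or result.fun < best.fun:
            best = result                                       # keep the run with the best likelihood
    return best   # best.x holds the estimates; best.fun the minimized negative log likelihood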

Citations
Journal Article
24 Mar 2011-Neuron
TL;DR: A multistep decision task designed to challenge the notion of a separate model-free learner and suggest a more integrated computational architecture for high-level human decision-making.

1,411 citations

Journal Article
TL;DR: This work reviews recent advances in data driven and theory driven Computational psychiatry, with an emphasis on clinical applications, and highlights the utility of combining them.
Abstract: Translating advances in neuroscience into benefits for patients with mental illness presents enormous challenges because it involves both the most complex organ, the brain, and its interaction with a similarly complex environment. Dealing with such complexities demands powerful techniques. Computational psychiatry combines multiple levels and types of computation with multiple types of data in an effort to improve understanding, prediction and treatment of mental illness. Computational psychiatry, broadly defined, encompasses two complementary approaches: data driven and theory driven. Data-driven approaches apply machine-learning methods to high-dimensional data to improve classification of disease, predict treatment outcomes or improve treatment selection. These approaches are generally agnostic as to the underlying mechanisms. Theory-driven approaches, in contrast, use models that instantiate prior knowledge of, or explicit hypotheses about, such mechanisms, possibly at multiple levels of analysis and abstraction. We review recent advances in both approaches, with an emphasis on clinical applications, and highlight the utility of combining them.

672 citations

Journal Article
01 Jun 2011-Brain
TL;DR: The findings support the suggestion that a disruption in the encoding of prediction error signals contributes to anhedonia symptoms in depression and support proposals that psychiatric syndromes reflect different disorders of neural valuation and incentive salience formation.
Abstract: The dopamine system has been linked to anhedonia in depression and both the positive and negative symptoms of schizophrenia, but it remains unclear how dopamine dysfunction could mechanistically relate to observed symptoms. There is considerable evidence that phasic dopamine signals encode prediction error (differences between expected and actual outcomes), with reinforcement learning theories being based on prediction error-mediated learning of associations. It has been hypothesized that abnormal encoding of neural prediction error signals could underlie anhedonia in depression and negative symptoms in schizophrenia by disrupting learning and blunting the salience of rewarding events, and contribute to psychotic symptoms by promoting aberrant perceptions and the formation of delusions. To test this, we used model based functional magnetic resonance imaging and an instrumental reward-learning task to investigate the neural correlates of prediction errors and expected-reward values in patients with depression (n=15), patients with schizophrenia (n=14) and healthy controls (n=17). Both patient groups exhibited abnormalities in neural prediction errors, but the spatial pattern of abnormality differed, with the degree of abnormality correlating with syndrome severity. Specifically, reduced prediction errors in the striatum and midbrain were found in depression, with the extent of signal reduction in the bilateral caudate, nucleus accumbens and midbrain correlating with increased anhedonia severity. In schizophrenia, reduced prediction error signals were observed in the caudate, thalamus, insula and amygdala-hippocampal complex, with a trend for reduced prediction errors in the midbrain, and the degree of blunting in the encoding of prediction errors in the insula, amygdala-hippocampal complex and midbrain correlating with increased severity of psychotic symptoms. Schizophrenia was also associated with disruption in the encoding of expected-reward values in the bilateral amygdala-hippocampal complex and parahippocampal gyrus, with the degree of disruption correlating with psychotic symptom severity. Neural signal abnormalities did not correlate with negative symptom severity in schizophrenia. These findings support the suggestion that a disruption in the encoding of prediction error signals contributes to anhedonia symptoms in depression. In schizophrenia, the findings support the postulate of an abnormality in error-dependent updating of inferences and beliefs driving psychotic symptoms. Phasic dopamine abnormalities in depression and schizophrenia are suggested by our observation of prediction error abnormalities in dopamine-rich brain areas, given the evidence for dopamine encoding prediction errors. The findings are consistent with proposals that psychiatric syndromes reflect different disorders of neural valuation and incentive salience formation, which helps bridge the gap between biological and phenomenological levels of understanding.

410 citations


Cites background from "Trial-by-trial data analysis using ..."

  • ...…parameters were re-estimated applying prior information about the likely range of parameters (the prior being derived from the previous stage) to regularize estimates and avoid extreme (implausible) - or -values due to the inherent noisiness of the maximum likelihood estimation (Daw, 2009)....


  • ...In addition, three patients with schizophrenia received long-term anti-depressant medication because of previous episodes of depressive illness: sertraline 50 mg, paroxetine 50 mg and citalopram 20 mg, per day....


  • ...…(Pessiglione et al., 2006; Murray et al., 2007), a single set of parameters was fitted across all groups and subjects since it has been noticed that multi-subject functional MRI results are more robust if a single set of parameters is used to generate regressors for all subjects (Daw, 2009)....


Journal Article
TL;DR: This article used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk and found that the learned values of cues that predict rewards of equal mean but different variance are indeed modulated by experienced risk.
Abstract: Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric–psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.

319 citations

Journal Article
TL;DR: This study proposes a new computational model that accounts for the dynamic integration of RL and WM processes observed in subjects’ behavior, and specifies distinct influences of the high‐level and low‐level cognitive functions on instrumental learning, beyond the possibilities offered by simple RL models.
Abstract: Instrumental learning involves corticostriatal circuitry and the dopaminergic system. This system is typically modeled in the reinforcement learning (RL) framework by incrementally accumulating reward values of states and actions. However, human learning also implicates prefrontal cortical mechanisms involved in higher level cognitive functions. The interaction of these systems remains poorly understood, and models of human behavior often ignore working memory (WM) and therefore incorrectly assign behavioral variance to the RL system. Here we designed a task that highlights the profound entanglement of these two processes, even in simple learning problems. By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. We propose a new computational model that accounts for the dynamic integration of RL and WM processes observed in subjects' behavior. Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size. The WM component also allowed for a more reasonable estimation of a single RL process. Finally, we report effects of two genetic polymorphisms having relative specificity for prefrontal and basal ganglia functions. Whereas the COMT gene coding for catechol-O-methyl transferase selectively influenced model estimates of WM capacity, the GPR6 gene coding for G-protein-coupled receptor 6 influenced the RL learning rate. Thus, this study allowed us to specify distinct influences of the high-level and low-level cognitive functions on instrumental learning, beyond the possibilities offered by simple RL models.

316 citations


Cites result from "Trial-by-trial data analysis using ..."

  • ...…we also confirmed that the results reported below hold for a simple hierarchical fitting procedure in which summary statistics for each parameter were estimated across the entire group of subjects and which then acted as priors for the estimation of individual subject parameters (Daw, 2011)....


References
Journal Article
TL;DR: In this article, a new estimate minimum information theoretical criterion estimate (MAICE) is introduced for the purpose of statistical identification, which is free from the ambiguities inherent in the application of conventional hypothesis testing procedure.
Abstract: The history of the development of statistical hypothesis testing in time series analysis is reviewed briefly and it is pointed out that the hypothesis testing procedure is not adequately defined as the procedure for statistical model identification. The classical maximum likelihood estimation procedure is reviewed and a new estimate minimum information theoretical criterion (AIC) estimate (MAICE) which is designed for the purpose of statistical identification is introduced. When there are several competing models the MAICE is defined by the model and the maximum likelihood estimates of the parameters which give the minimum of AIC defined by AIC = (-2)log-(maximum likelihood) + 2(number of independently adjusted parameters within the model). MAICE provides a versatile procedure for statistical model identification which is free from the ambiguities inherent in the application of conventional hypothesis testing procedure. The practical utility of MAICE in time series analysis is demonstrated with some numerical examples.

47,133 citations


"Trial-by-trial data analysis using ..." refers background in this paper

  • ...The most common is the Akaike information criterion (AIC; Akaike, 1974), log(P(D | M, θ̂M))− n....


Journal Article
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.

38,681 citations

Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations


"Trial-by-trial data analysis using ..." refers background or methods in this paper

  • ...…the striatum; Delgado et al., 2000; Knutson et al., 2000; McClure et al., 2003; O’Doherty et al., 2003) resembles a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL, i.e. trial and error learning in decision problems; Sutton and Barto, 1998)....


  • ..., 2003) resembles a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL, i.e. trial and error learning in decision problems; Sutton and Barto, 1998)....


  • ...Since our focus here is on the methodology for estimation given a model, a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments)....


01 Jan 2005
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.

36,760 citations


"Trial-by-trial data analysis using ..." refers methods in this paper

  • ...BIC and cousins: A simpler approximation, which can be obtained from Equation 17 in a limit of large data, is the Bayesian Information Criterion (BIC; Schwarz, 1978)....


Book
Christopher M. Bishop
17 Aug 2006
TL;DR: Probability Distributions, linear models for Regression, Linear Models for Classification, Neural Networks, Graphical Models, Mixture Models and EM, Sampling Methods, Continuous Latent Variables, Sequential Data are studied.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

22,840 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Trial-by-trial data analysis using computational models"?

As these techniques are spreading rapidly, but have been developed and documented somewhat sporadically alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty et al., 2007). In particular, the authors discuss the rationale for these methods and the questions they are suited to address. Throughout, the authors flag interpretational and technical pitfalls of which they believe authors, reviewers, and readers should be aware. The authors focus on cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of many sorts. The authors then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis.