Book Chapter

Trial-by-trial data analysis using computational models

01 May 2011
TL;DR: The present review aims to clarify the toolbox by cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of many sorts.
Abstract: In numerous and high-profile studies, researchers have recently begun to integrate computational models into the analysis of data from experiments on reward learning and decision making (Platt and Glimcher, 1999; O’Doherty et al., 2003; Sugrue et al., 2004; Barraclough et al., 2004; Samejima et al., 2005; Daw et al., 2006; Li et al., 2006; Frank et al., 2007; Tom et al., 2007; Kable and Glimcher, 2007; Lohrenz et al., 2007; Schonberg et al., 2007; Wittmann et al., 2008; Hare et al., 2008; Hampton et al., 2008; Plassmann et al., 2008). As these techniques are spreading rapidly, but have been developed and documented somewhat sporadically alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty et al., 2007). In particular, we discuss the rationale for these methods and the questions they are suited to address. We then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis. The techniques are illustrated with fits of simple models to simulated datasets. Throughout, we flag interpretational and technical pitfalls of which we believe authors, reviewers, and readers should be aware. We focus on cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of many sorts. A number of other reviews (Daw and Doya, 2006; Dayan and Niv, 2008) have focused on the scientific conclusions that have been obtained with these methods, an issue we omit almost entirely here. There are also excellent books that cover statistical inference of this general sort with much greater generality, formal precision, and detail (MacKay, 2003; Gelman et al., 2004; Bishop, 2006; Gelman and Hill, 2007).

Summary (4 min read)

1 Introduction

  • The authors then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis.
  • The techniques are illustrated with fits of simple models to simulated datasets.
  • Throughout, the authors flag interpretational and technical pitfalls of which they believe authors, reviewers, and readers should be aware.

2 Background

  • Although the original empirical articles reported activity averaged across many trials, and the mean behavior of computational simulations was compared to these reports, a more central issue in learning is how behavior (or the underlying neural activity) changes trial by trial in response to feedback.
  • This example points to another important feature of this approach, which is that the theories purport to quantify, trial-by-trial, variables such as the reward expected for a choice (and the “prediction error,” or difference between the received and expected rewards).
  • These hypotheses may be tested against one another on the basis of their fit to the data.
  • The second part, which the authors will call the observation model, describes how the model’s internal variables are reflected in observed data: for instance, how expected values drive choice or how prediction errors produce neural spiking.

3 Parameter estimation

  • Model parameters can characterize a variety of scientifically interesting quantities, from how quickly subjects learn (Behrens et al., 2007) to how sensitive they are to different rewards and punishments (Tom et al., 2007).
  • Here the authors consider how to obtain statistical results about parameters’ values from data.
  • The authors first consider the general statistical rationale underlying the problem; then develop the details for an example RL model before considering various pragmatic factors of actually performing these analyses on data.
  • That is, the posterior probability distribution over the free parameters, given the data, is proportional to the product of two factors, (1) the likelihood of the data, given the free parameters, and (2) the prior probability of the parameters.
  • This equation famously shows how to start with a theory of how parameters produce data, and invert it into a theory by which data reveal the parameters that produced them.

3.1 Maximum likelihood estimation for RL

  • The authors may see how the general ideas play out in a simple reinforcement learning setting.
  • Since their focus here is on the methodology for estimation given a model, a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments).
  • Evaluated at the peak point θ̂M, the elements of the Hessian are larger the more rapidly the likelihood function is dropping off away from it in different directions, which corresponds to a more reliable estimate of the parameters.
  • The diagonal terms of H−1 correspond to variances for each parameter separately, and their square roots measure one standard error on the parameter.
  • The Q learning model has a more moderate coupling between the parameters.

3.2 Pragmatics

  • At the heart of optimizing the likelihood function is computing it.
  • Second, and not unrelated, discretizing the parameters too coarsely, or searching within an inappropriate range, can lead to poor results; worse yet, since the parameters are typically coupled, a poor search on one will also corrupt the estimates for other parameters.
  • Thus, it makes some sense to constrain the parameter within this range.
  • There may, for instance, be features of the data that can only be captured within the model in question by adopting seemingly nonsensical parameters.

3.3 Intersubject variability and random effects

  • For population-level questions, treating parameters as fixed effects and thereby conflating within- and between-subject variability can lead to serious problems such as overstating the true significance of results.
  • P(µα, µβ, σα, σβ | C1 ... CN) ∝ P(C1 ... CN | µα, µβ, σα, σβ) · P(µα, µβ, σα, σβ) (Equation 7). Estimating population parameters in a hierarchical model: Equation 7 puts us in a position, in principle, to estimate the population parameters from the set of all subjects’ choices, using maximum likelihood or maximum a posteriori methods exactly as discussed for individual subjects in the previous section.
  • Moreover, assuming the distributions P(αi | µα, σα) and P(βi | µβ, σβ) are Gaussian, finding the population parameters for these expressions is just the familiar problem of estimating a Gaussian distribution from samples (a minimal sketch follows this list).
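As a minimal illustration of the simplest (summary statistics) reading of this idea, assuming each subject’s α and β have already been estimated individually as in Section 3.1, the population-level Gaussian parameters and an ordinary between-group comparison can be sketched as follows (all names and numbers are placeholders of our own, not the chapter’s):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha_hat = rng.uniform(0.1, 0.6, size=20)   # stand-ins for per-subject ML estimates of alpha
beta_hat = rng.uniform(0.5, 3.0, size=20)    # stand-ins for per-subject ML estimates of beta

# Population parameters of the assumed Gaussian distributions over subjects.
mu_alpha, sigma_alpha = alpha_hat.mean(), alpha_hat.std(ddof=1)
mu_beta, sigma_beta = beta_hat.mean(), beta_hat.std(ddof=1)

# A population-level (random effects) question, e.g. whether the mean learning rate differs
# between two groups of ten subjects, then reduces to an ordinary between-subject test.
t_stat, p_value = stats.ttest_ind(alpha_hat[:10], alpha_hat[10:])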

3.5 Extensions

  • The basic procedures outlined above admit of many extensions.
  • In fact, this observation model (augmented with a hierarchical random effects model over the regression weights, such as β1, across the population) is identical to the general linear model used in standard fMRI analysis packages such as SPM.
  • Now the authors may also compute a second timeseries ∂δt/∂α: the partial derivative of the prediction error timeseries with respect to α (a small numerical sketch follows this list).
  • Parameters µα1 and so on can be estimated to determine what the modes are that best fit the data; π1 controls the predominance of subject type 1; and the question of how many types of subjects the data support is a model selection question, answerable by the methods discussed in Section 4.
  • In all, fast acquisition followed by stable choice of the better option might be modeled with a decrease over trials in the learning rate, perhaps combined with an increase in the softmax temperature.
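A small sketch of how such a derivative timeseries might be obtained numerically (our own illustration; the choice/reward sequences and the fitted learning rate alpha_hat below are placeholders):

import numpy as np

def prediction_errors(choices, rewards, alpha):
    # Run the Q-learning rule of Section 3.1 and record the prediction error on every trial.
    q = {'L': 0.0, 'R': 0.0}
    deltas = []
    for c, r in zip(choices, rewards):
        delta = r - q[c]
        deltas.append(delta)
        q[c] += alpha * delta
    return np.array(deltas)

choices = ['L', 'R', 'L', 'L']     # placeholder data
rewards = [1, 0, 0, 1]
alpha_hat, eps = 0.3, 1e-3

delta_hat = prediction_errors(choices, rewards, alpha_hat)
ddelta_dalpha = (prediction_errors(choices, rewards, alpha_hat + eps) - delta_hat) / eps
# Both delta_hat and ddelta_dalpha can be entered as parametric regressors in an fMRI design.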

4 Model comparison

  • So far, the authors have assumed a fixed model of the data and sought to estimate its free parameters.
  • In particular, models like that of Equation 2 and its alternatives constitute different hypotheses about the mechanisms or algorithms that the brain uses to solve RL problems.
  • The methods discussed below address this problem.
  • In some cases, questions of interest might be framed either in terms of parameter estimation or model selection, and thus addressed using either the methods of the previous section or those of the current one.

4.1 Examples from RL

  • The authors illustrate the issues of model selection using some simple alternatives to the model of choice behavior discussed thus far.
  • In fact, the practical ingredients for model evaluation are basically the same as those for parameter estimation; as before, what is needed is simply to compute data likelihoods under a model, optimize parameters, and estimate Hessians.
  • Depending on β (assumed to be fixed), these asymptotic learned values may imply less-than-complete preference for the better option over the worse.
  • As already noted, this introduces some difficulty in comparing them.

4.2 Classical model comparison

  • Conversely, a model fits well exactly to the extent that it captures the repeatable aspects of the data, allowing good predictions of additional datasets.
  • In contrast, it has rarely been used in studies of reinforcement learning (though see Camerer and Ho, 1999).
  • Let us consider again the likelihood of a single dataset, using best-fitting parameters.

4.3 Bayesian model comparison in theory

  • The key quantity here is P(D | M), known as the model evidence: the probability of the data under the model.
  • Importantly, this expression does not make reference to any particular parameter settings such as θ̂M, since in asking how well a model predicts data the authors are not given any particular parameters.
  • This means that a more flexible model (one with more parameters that is able to achieve good fit to many data sets with different particular parameter settings) must correspondingly assign lower P(D | M) to all of them since a fixed probability of 1 is divided among them all.
  • The result of a Bayesian model comparison is a statistical claim about the relative fit of one model over another.
  • Kass and Raftery (1995) present a table of conventions for interpreting Bayes factors; note that their logs are taken in base-10 rather than base-e.

4.4 Bayesian model comparison in practice

  • The theory of Bayesian model selection is a very useful conceptual framework; for instance, it clarifies why the maximum likelihood score is an inappropriate metric for model comparison.
  • The authors have mostly ignored priors thus far, because their subjective nature arguably makes them problematic in the context of objective scientific communication.
  • Thus, most of the methods discussed below do require assuming a prior over parameters.
  • See MacKay (2003) and Bishop (2006) for discussion of more elaborate sampling techniques that attempt to cope with this situation.
  • Similarly, H is the Hessian of the function being optimized (minus the sum of the first two terms of Equation 17), evaluated at the MAP point, not the Hessian of just the log likelihood.

4.5 Summary and recommendations

  • Models may be compared to one another on the basis of the likelihood they assign to data; however, if this likelihood is computed at parameters chosen to optimize it, the measure must be corrected for overfitting to allow a fair comparison between models with different numbers of parameters.
  • In practice, when models are nested, the authors suggest using a likelihood ratio test, since this permits reporting a classical p-value and is well accepted (a minimal sketch of this test follows this list).
  • If one is willing to define a prior, and defend it, the authors suggest exploring the Laplace approximation, which is almost as simple but far better founded.
  • Even if two candidate models have the same number of parameters — and thus scores like BIC are equivalent to just comparing raw likelihoods — the complexity penalty implied by Equation 15 may not actually be the same between them if the two sets of parameters are differently constrained, either a priori or by the data.
  • As in Figure 5, this more accurate assessment can have salutary effects.
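A minimal numerical sketch of these recommendations (a likelihood ratio test for nested models, plus a BIC-style score for comparisons that account for the number of parameters); the log likelihood values below are placeholders of our own, not results from any dataset:

import numpy as np
from scipy import stats

# Placeholder values: maximized log likelihoods of a full model (k = 3 parameters) and of a
# nested, restricted model (k = 2 parameters, i.e., one parameter fixed), over T = 300 trials.
loglik_full, loglik_restricted = -610.2, -614.9
T = 300

# Likelihood ratio test: twice the log likelihood difference is asymptotically chi-squared,
# with degrees of freedom equal to the number of parameters fixed in the restricted model.
lr_stat = 2 * (loglik_full - loglik_restricted)
p_value = stats.chi2.sf(lr_stat, df=1)

# BIC (lower is better) penalizes each extra parameter by log(T) and can also be used
# when the models are not nested.
bic = lambda loglik, k: -2 * loglik + k * np.log(T)
bic_full, bic_restricted = bic(loglik_full, 3), bic(loglik_restricted, 2)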

4.6 Model comparison and populations

  • So far the authors have described model comparison mostly in the abstract, with applications to choice data at the single subject level.
  • There are a number of possibilities, of which the simplest will often suffice.
  • These aggregates can then be compared between two models to compute a Bayes factor over the population.
  • Finally, one could take the identity of the model as varying over subjects, i.e. as a random effect (Stephan et al., 2009).
  • This involves adding another level to the hierarchy of Figure 2, according to which, for each subject, one of a set of models is drawn with some probability according to a multinomial distribution (given by new free parameters), and then the model’s parameters and the data are drawn as before.

5 Pitfalls and alternatives

  • The authors close this tutorial by identifying some pitfalls, caveats, and concerns with these methods that they think it is important for readers to appreciate.
  • As mentioned, even if learning parameters such as temperature and learning rate are actually independent from one another (i.e., in terms of their distribution across a population), the estimates of those parameters from data may be correlated due to their having similar expected effects on observable data.
  • For choice data, exponentiating this average log likelihood, exp(L/T), produces a probability that is easily interpreted relative to the chance level (a short numerical example follows this list).
  • This points again to the fact that these methods are suited to drawing relative conclusions comparing multiple hypotheses (the data support model A over model B).
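A two-line illustration of that computation (with placeholder numbers of our own):

import numpy as np

loglik, n_trials = -180.0, 300                 # total log likelihood and number of choices (placeholders)
avg_choice_prob = np.exp(loglik / n_trials)    # about 0.55 here, versus a chance level of 0.5 for two options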




Trial-by-trial data analysis using computational models
Nathaniel D. Daw
August 27, 2009
[manuscript for Attention & Performance XXIII.]
1 Introduction
In numerous and high-profile studies, researchers have recently begun to integrate computational models
into the analysis of data from experiments on reward learning and decision making (Platt and Glimcher,
1999; O’Doherty et al., 2003; Sugrue et al., 2004; Barraclough et al., 2004; Samejima et al., 2005; Daw et al.,
2006; Li et al., 2006; Frank et al., 2007; Tom et al., 2007; Kable and Glimcher, 2007; Lohrenz et al., 2007;
Schonberg et al., 2007; Wittmann et al., 2008; Hare et al., 2008; Hampton et al., 2008; Plassmann et al., 2008).
As these techniques are spreading rapidly, but have been developed and documented somewhat sporadically alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty et al., 2007). In particular, we discuss the rationale for these methods and the questions they are suited to address. We then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis. The techniques are illustrated with fits of simple models to simulated datasets. Throughout, we flag interpretational and technical pitfalls of which we believe authors, reviewers, and readers should be aware. We focus on cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of
many sorts.
A number of other reviews (Daw and Doya, 2006; Dayan and Niv, 2008) have focused on the scientific
conclusions that have been obtained with these methods, an issue we omit almost entirely here. There are
also excellent books that cover statistical inference of this general sort with much greater generality, formal
precision, and detail (MacKay, 2003; Gelman et al., 2004; Bishop, 2006; Gelman and Hill, 2007).
2 Background
Much work in this area grew out of the celebrated observation (Barto, 1995; Schultz et al., 1997) that the
firing of midbrain dopamine neurons (and also the BOLD signal measured via fMRI in their primary target,
the striatum; Delgado et al., 2000; Knutson et al., 2000; McClure et al., 2003; O’Doherty et al., 2003) resembles
a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL,
i.e. trial and error learning in decision problems; Sutton and Barto, 1998). Although the original empirical
articles reported activity averaged across many trials, and the mean behavior of computational simulations
was compared to these reports, in fact, a more central issue in learning is how behavior (or the underlying
neural activity) changes trial by trial in response to feedback. In fact, the computational theories are framed
in just these terms, and so more recent work on the system (O’Doherty et al., 2003; Bayer and Glimcher,
2005) has focused on comparing their predictions to raw data timeseries, trial by trial: measuring, in effect,
the theories’ goodness of fit to the data, on average, rather than their goodness of fit to the averaged data.

This change in approach represents a major advance in the use of computational models for experimental design and analysis, which is still unfolding. Used this way, computational models represent exceptionally detailed, quantitative hypotheses about how the brain approaches a problem, which are amenable
to direct experimental test. As noted, such trial-by-trial analyses are particularly suitable to developing a
more detailed and dynamic picture of learning than was previously available.
In a standard experiential decision experiment, such as a “bandit” task (Sugrue et al., 2004; Lau and
Glimcher, 2005; Daw et al., 2006), a subject is offered repeated opportunities to choose between multiple
options (e.g. slot machines) and receives rewards or punishments according to her choice on each trial.
Data might consist of a series of choices and outcomes (one per trial). In principle, any arbitrary relationship might obtain between the entire list of past choices and outcomes, and the next one. Computational
theories constitute particular claims about some more restricted function by which previous choices and
feedback give rise to subsequent choices. For instance, standard RL models (such as “Q learning”; Watkins,
1989) envision that subjects track the expected reward for each slot machine, via some sort of running average over the feedback, and it is only through these aggregated “value” predictions that past feedback
determines future choices.
This example points to another important feature of this approach, which is that the theories purport
to quantify, trial-by-trial, variables such as the reward expected for a choice (and the “prediction error,”
or difference between the received and expected rewards). That is, the theories permit the estimation of
quantities (expectations, expectation violations) that would otherwise be subjective; this, in turn, enables
the search for neural correlates of these estimates (Platt and Glimcher, 1999; Sugrue et al., 2004).
By comparing the model’s predictions to trial-by-trial experimental data, such as choices or BOLD signals, it is possible using a mixture of Bayesian and classical statistical techniques to answer two sorts of
questions about a model, which are discussed in Sections 3 and 4 below. The art is framing questions of
scientific interest in these terms.
The first question is parameter estimation. RL models typically have a number of free parameters
measuring quantities such as the “learning rate,” or the degree to which subjects update their beliefs in
response to feedback. Often, these parameters characterize (or new parameters can be introduced so as to
characterize) factors that are of experimental interest. For instance, Behrens et al. (2007) tested predictions
about how particular task manipulations would affect the learning rate.
The second type of question that can be addressed is model comparison. Different computational models, in effect, constitute different hypotheses about the learning process that gave rise to the data. These hypotheses may be tested against one another on the basis of their fit to the data. For example, Hampton et al. (2008) use this method to compare which of different approaches subjects use for anticipating an
opponent’s behavior in a multiplayer competitive game.
Learning and observation models: In order to appreciate the extent to which the same methods may
be applied to different sets of data, it is useful to separate a computational theory into two parts. The first,
which we will call the learning model, describes the dynamics of the model’s internal variables such as the
reward expected for each slot machine. The second part, which we will call the observation model, describes
how the model’s internal variables are reflected in observed data: for instance, how expected values drive
choice or how prediction errors produce neural spiking. Essentially, the observation model regresses the
learning model’s internal variables onto the observed data; it plays a similar role as (and is often, in fact,
identical to) the “link function” in generalized linear modeling. In this way, a common learning process
(a single learning model) may be viewed as giving rise to distinct observable data streams in a number of
different modalities (e.g., choices and BOLD, through two separate observation models). Thus, although we
describe the methods in this tutorial primarily in terms of choice data, they are directly applicable to other
modalities simply by substituting a different observation model.
Crucially, whereas the learning model is typically deterministic, the observation models are noisy: that
is, given the internal variables produced by the learning model, an observation model assigns some probability to any possible observations. Thus the “fit” of different learning models, or their parameters, to any
observed data can be quantified statistically in terms of the probability they assign to the data, a procedure
at the core of the methods that follow.

3 Parameter estimation
Model parameters can characterize a variety of scientifically interesting quantities, from how quickly subjects learn (Behrens et al., 2007) to how sensitive they are to different rewards and punishments (Tom et al., 2007). Here we consider how to obtain statistical results about parameters’ values from data. We first consider the general statistical rationale underlying the problem; then develop the details for an example RL
model before considering various pragmatic factors of actually performing these analyses on data. Finally,
having discussed these details in terms of choice data, we discuss how the same methods may be applied
to other sorts of data.
Suppose we have some model M, with a vector of free parameters θM. The model (here, the composite of our learning and observation models) describes a probability distribution, or likelihood function, P(D | M, θM) over possible data sets D. Then, Bayes’ rule tells us that having observed a data set D,

P(θM | D, M) ∝ P(D | M, θM) · P(θM | M)    (1)

That is, the posterior probability distribution over the free parameters, given the data, is proportional to the product of two factors: (1) the likelihood of the data, given the free parameters, and (2) the prior probability of the parameters. This equation famously shows how to start with a theory of how parameters (noisily) produce data, and invert it into a theory by which data (noisily) reveal the parameters that produced them. Classically, we seek a point estimate of the parameters θM rather than a posterior distribution over all possible values; if we neglect (or treat as flat) the prior over the parameters P(θM | M), then the most probable value for θM is the maximum likelihood estimate: the setting of the parameters that maximizes the likelihood function, P(D | M, θM). We denote this θ̂M.
3.1 Maximum likelihood estimation for RL
An RL model: We may see how the general ideas play out in a simple reinforcement learning setting. Consider a simple game in which on each trial t, a subject makes a choice ct (= L or R) between a left and a right slot machine, and receives a reward rt (= $1 or $0) stochastically. According to a simple Q-learning model (Watkins, 1989), on each trial the subject assigns an expected value to each machine: Qt(L) and Qt(R). We initialize these values to (say) 0, and then on each trial, the value for the chosen machine is updated as

Qt+1(ct) = Qt(ct) + α · δt    (2)

where 0 ≤ α ≤ 1 is a free learning rate parameter, and δt = rt − Qt(ct) is the prediction error. Equation 2 is our learning model. To explain the choices ct in terms of the values Qt we assume an observation model. In RL, it is often assumed that subjects choose probabilistically according to a softmax distribution:

P(ct = L | Qt(L), Qt(R)) = exp(β · Qt(L)) / [exp(β · Qt(R)) + exp(β · Qt(L))]    (3)

Here, β is a free parameter known in RL as the inverse temperature parameter. However, note that Equation 3 is also equivalent to standard logistic regression where the dependent variable is the binary choice variable ct and there is one predictor variable, the difference in values Qt(L) − Qt(R). Therefore, β can also be viewed as the regression weight connecting the Qs to the choices. More generally, when there are more than two choice options, the softmax model corresponds to a generalization of logistic regression known as conditional logit regression (McFadden, 1974).
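To make Equations 2 and 3 concrete, the following minimal sketch (in Python, our own choice of language; all function and variable names are ours, not the chapter's) implements the learning model and the observation model for the two-machine task:

import numpy as np

def softmax_prob_left(q_left, q_right, beta):
    # Equation 3: probability of choosing the left machine given the current Q values.
    # Subtracting the larger value first is the under/overflow trick noted in Section 3.2.
    q = np.array([q_left, q_right]) - max(q_left, q_right)
    expq = np.exp(beta * q)
    return expq[0] / expq.sum()

def q_update(q, choice, reward, alpha):
    # Equation 2: move the chosen value toward the obtained reward.
    delta = reward - q[choice]            # prediction error
    q = dict(q)                           # copy, so the original values stay untouched
    q[choice] += alpha * delta
    return q, delta

# One illustrative trial: equal values, choose L, win $1.
q = {'L': 0.0, 'R': 0.0}
p_left = softmax_prob_left(q['L'], q['R'], beta=1.0)   # 0.5 when the values are equal
q, delta = q_update(q, 'L', 1.0, alpha=0.25)           # Q(L) becomes 0.25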
The model of Equations 2 and 3 is only a representative example of the sorts of algorithms used to
study reinforcement learning. Since our focus here is on the methodology for estimation given a model,
a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and
Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments). That said, most models in the literature
are variants on the example shown here. Another commonly used (Daw et al., 2006; Behrens et al., 2007) and seemingly rather different family of learning methods is Bayesian models such as the Kalman filter (Kakade and Dayan, 2002). In fact, the Q-learning rule of Equation 2 can be seen as a simplified case of the Kalman filter: the Bayesian model uses the same learning rule but has additional machinery that determines the learning rate parameter α on a trial-by-trial basis (Kakade and Dayan, 2002; Behrens et al., 2007; Daw et al., 2008).

Figure 1: Likelihood surface for simulated reinforcement learning data, as a function of the two free parameters (β on the horizontal axis, from 0 to 2; α on the vertical axis, from 0 to 1). Lighter colors denote higher data likelihood. The maximum likelihood estimate is shown as an “o” surrounded by an ellipse of one standard error (a region of about 90% confidence); the true parameters from which the data were generated are denoted by an “x”.
Data likelihood: Given the model described above, the probability of a whole dataset D (i.e., a whole sequence of choices c = c1...T, given the rewards r = r1...T) is just the product of their probabilities from Equation 3,

P(D | M, θM) = Πt P(ct | Qt(L), Qt(R))    (4)

Note that the terms Qt in the softmax are determined (via Equation 2) by the rewards r1...t−1 and choices c1...t−1 on trials prior to t.
Together, Equations 2 and 3 constitute a full likelihood function P(D | M, θM), and we can estimate the free parameters (θM = ⟨α, β⟩) by maximum likelihood. Figure 1 illustrates the process. 1,000 choice trials were simulated according to the model (with parameters α = .25 and β = 1, red “x”). The likelihood of the observed data was then computed for a range of parameters, and plotted (with brighter colors for higher likelihood) on a 2-D grid. In this case, the maximum likelihood point (α̂ = .34 and β̂ = .93, blue circle) was near the true parameters.
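Building on the sketch above (again an illustration of ours, not the chapter's own code), a function that loops over a choice/reward sequence and returns the log of the data likelihood in Equation 4 can be written as follows; evaluating such a function over a range of α and β values is what produces a surface like the one in Figure 1:

def data_log_likelihood(choices, rewards, alpha, beta):
    # choices: sequence of 'L'/'R'; rewards: sequence of 0/1 payoffs, one per trial.
    q = {'L': 0.0, 'R': 0.0}
    loglik = 0.0
    for c, r in zip(choices, rewards):
        p_left = softmax_prob_left(q['L'], q['R'], beta)
        p_choice = p_left if c == 'L' else 1.0 - p_left
        loglik += np.log(p_choice)       # summing logs rather than multiplying probabilities
        q, _ = q_update(q, c, r, alpha)  # values are updated only after the choice is observed
    return loglik

For example, data_log_likelihood(choices, rewards, alpha=0.34, beta=0.93) evaluates the fit at the maximum likelihood point shown in Figure 1.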
Confidence intervals: Of course, in order actually to test a hypothesis about the parameters’ values, we need to be able to make statistical claims about the quality of the estimate θ̂M. Intuitively, the degree to which our estimate can be trusted depends on how much better it accounts for the data than other nearby parameter estimates, that is, on how sharply peaked is the “hill” of data likelihoods in the space of parameters. Such peakiness is characterized by the second derivative (the Hessian) of the likelihood function with respect to the parameters. The Hessian is a square matrix (here, 2x2) with a row and column for each parameter. Evaluated at the peak point θ̂M, the elements of the Hessian are larger the more rapidly the likelihood function is dropping off away from it in different directions, which corresponds to a more reliable estimate of the parameters. Conversely, the matrix inverse of the Hessian (like the reciprocal of a scalar) is larger for poorer estimates, like error bars. More precisely, if H is the Hessian of the negative log of the likelihood function at the maximum likelihood point θ̂M, then a standard estimator for the covariance of the parameter estimates is its matrix inverse H−1 (MacKay, 2003).

The diagonal terms of H−1 correspond to variances for each parameter separately, and their square roots measure one standard error on the parameter. Thus, for instance, 95% confidence intervals around the maximum likelihood estimate may be estimated as θ̂ plus or minus 1.96 standard errors.
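An analytic Hessian is rarely available for models like these; one workaround, sketched here with a hand-rolled finite-difference helper (ours, not from any particular package), is to differentiate the negative log likelihood numerically at the maximum likelihood point and invert the result. The sketch assumes choices, rewards, and a fitted parameter array theta_hat (holding α̂ and β̂) are already in hand:

def numerical_hessian(f, theta, eps=1e-4):
    # Central-difference approximation to the matrix of second derivatives of f at theta.
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            def f_shift(di, dj):
                t = theta.copy()
                t[i] += di
                t[j] += dj
                return f(t)
            H[i, j] = (f_shift(eps, eps) - f_shift(eps, -eps)
                       - f_shift(-eps, eps) + f_shift(-eps, -eps)) / (4 * eps ** 2)
    return H

negloglik = lambda th: -data_log_likelihood(choices, rewards, th[0], th[1])
H = numerical_hessian(negloglik, theta_hat)   # Hessian of the negative log likelihood at the MLE
cov = np.linalg.inv(H)                        # estimated covariance of the parameter estimates
se = np.sqrt(np.diag(cov))                    # one standard error per parameter
ci95 = np.c_[theta_hat - 1.96 * se, theta_hat + 1.96 * se]   # 95% confidence intervals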
Covariance between parameters: The off-diagonal terms of H−1 measure covariance between the parameters, and are useful for diagnosing model fit. In general, large off-diagonal terms are a symptom of a poorly specified model or some kinds of bad data. In the worst case, two parameters may be redundant, so that there is no unique optimum. The Q-learning model has a more moderate coupling between the parameters. As can be seen from the elongated, tilted shape of the “ridge” in Figure 1, estimates of α and β tend to be inversely coupled in this model. By increasing β while decreasing α (or vice versa: moving northwest or southeast in the figure), a similar likelihood is obtained. This is because the reward rt is multiplied by both α (in Equation 2, to update Qt) and then by β (in Equation 3) before affecting the choice likelihood on the next trial. As a result, neither parameter individually can be estimated very tightly by itself (the “ridge” is a bit wide if you cross it horizontally in β or vertically in α), but their product is well estimated (the hill is most narrow when crossed from northeast to southwest). The blue oval in the figure traces out a one-standard-error ellipse in the two parameters jointly, derived from H−1; its tilt follows the contour of the ridge.
Often in applications such as logistic regression, a corrected covariance estimator is used that is thought to be more robust to problems such as mismatch between the true and assumed models. This “Huber-White” or “sandwich” estimator (Huber, 1967; Freedman, 2006) is H−1 B H−1, where B = Σt g(ct)ᵀ g(ct), and g(ct), in turn, is the gradient (vector of first partial derivatives with respect to the parameters) of the negative log likelihood of the tth data point ct, evaluated at θ̂M. This is harder to compute in practice, since it involves keeping track of g, which is laborious. However, as discussed below, g can also be useful when searching for the maximum likelihood point.
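A rough sketch of how this estimator can be assembled from the ingredients above (ours; the per-trial gradients g(ct) are taken by finite differences rather than analytically, and the helpers from the earlier sketches are reused):

def per_trial_negloglik(theta, choices, rewards):
    # Vector of -log P(ct | Qt) values, one entry per trial, at theta = [alpha, beta].
    alpha, beta = theta
    q = {'L': 0.0, 'R': 0.0}
    out = []
    for c, r in zip(choices, rewards):
        p_left = softmax_prob_left(q['L'], q['R'], beta)
        out.append(-np.log(p_left if c == 'L' else 1.0 - p_left))
        q, _ = q_update(q, c, r, alpha)
    return np.array(out)

def sandwich_covariance(theta_hat, choices, rewards, eps=1e-4):
    # Huber-White estimator H^-1 B H^-1, with B the sum over trials of the outer products g(ct) g(ct)^T.
    theta_hat = np.asarray(theta_hat, dtype=float)
    G = np.zeros((len(choices), len(theta_hat)))          # row t holds the per-trial gradient g(ct)
    for k in range(len(theta_hat)):
        up, down = theta_hat.copy(), theta_hat.copy()
        up[k] += eps
        down[k] -= eps
        G[:, k] = (per_trial_negloglik(up, choices, rewards)
                   - per_trial_negloglik(down, choices, rewards)) / (2 * eps)
    B = G.T @ G
    Hinv = np.linalg.inv(numerical_hessian(
        lambda th: per_trial_negloglik(th, choices, rewards).sum(), theta_hat))
    return Hinv @ B @ Hinv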
3.2 Pragmatics
Above, we developed the general equations for maximum likelihood parameter estimation in an RL model;
how can these be implemented in practice for data analysis?
First, although we have noted an equivalence between Equation 3 and logistic regression, it is not possible simply to use an off-the-shelf regression package to estimate the parameters. This is because although the observation stage of the model represents a logistic regression from values Qt to choices ct, the values are not fixed but themselves depend on the free parameters (here, α) of the learning process. As these do not enter the likelihood linearly they cannot be estimated by a generalized linear model. Thus, we must search for the full set of free parameters that optimize the likelihood function.
Likelihood function: At the heart of optimizing the likelihood function is computing it. It is straightforward to write a function that takes in a dataset (a sequence of choices and rewards) and a candidate setting of the free parameters, loops over the data computing Equations 2 and 3, and returns the aggregate likelihood of the data. Importantly, the product in Equation 4 is often an exceptionally small number; it is thus numerically more stable to compute its log, i.e. the sum over trials of the log of the choice probability from Equation 3, which is β · Qt(ct) − log(exp(β · Qt(L)) + exp(β · Qt(R))). Since log is a monotonic function, this quantity has the same optimum but is less likely to underflow the minimum floating point value representable by a computer. (Another numerical trick is to note that Equation 3 is invariant to the addition or subtraction of any constant to all of the Q values. The chance of the exponential under- or overflowing can thus be reduced by evaluating the log probability for Q values after first subtracting their mean.)
How, then, to find the optimal likelihood? In general, it may be tempting to discretize the space of
free parameters, compute the likelihood everywhere, and simply search for the best, much as is illustrated
in Figure 1. We recommend against this approach. First, most models of interest have more than two
parameters, and exhaustively testing all combinations in a higher dimensional space becomes intractable.
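An alternative in keeping with this advice is to hand the negative log likelihood to a general-purpose nonlinear optimizer. The following sketch (our own illustration, reusing data_log_likelihood from above) uses SciPy's minimize with box constraints and a few random restarts to reduce the risk of stopping in a local optimum:

import numpy as np
from scipy.optimize import minimize

def fit_subject(choices, rewards, n_starts=10, seed=0):
    # Maximize the data likelihood over theta = [alpha, beta] from several random starting points.
    rng = np.random.default_rng(seed)
    negloglik = lambda th: -data_log_likelihood(choices, rewards, th[0], th[1])
    best = None
    for _ in range(n_starts):
        theta0 = [rng.uniform(0, 1), rng.uniform(0, 5)]         # random initial alpha and beta
        result = minimize(negloglik, theta0, method='L-BFGS-B',
                          bounds=[(0, 1), (0, 20)])             # e.g., constrain alpha to [0, 1]
        if best is None or result.fun < best.fun:
            best = result                                       # keep the run with the best likelihood
    return best   # best.x holds the estimates; best.fun the minimized negative log likelihood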

Citations
Journal Article
24 Mar 2011-Neuron
TL;DR: A multistep decision task designed to challenge the notion of a separate model-free learner and suggest a more integrated computational architecture for high-level human decision-making.

1,411 citations

Journal Article
TL;DR: This work reviews recent advances in data driven and theory driven Computational psychiatry, with an emphasis on clinical applications, and highlights the utility of combining them.
Abstract: Translating advances in neuroscience into benefits for patients with mental illness presents enormous challenges because it involves both the most complex organ, the brain, and its interaction with a similarly complex environment. Dealing with such complexities demands powerful techniques. Computational psychiatry combines multiple levels and types of computation with multiple types of data in an effort to improve understanding, prediction and treatment of mental illness. Computational psychiatry, broadly defined, encompasses two complementary approaches: data driven and theory driven. Data-driven approaches apply machine-learning methods to high-dimensional data to improve classification of disease, predict treatment outcomes or improve treatment selection. These approaches are generally agnostic as to the underlying mechanisms. Theory-driven approaches, in contrast, use models that instantiate prior knowledge of, or explicit hypotheses about, such mechanisms, possibly at multiple levels of analysis and abstraction. We review recent advances in both approaches, with an emphasis on clinical applications, and highlight the utility of combining them.

672 citations

Journal Article
01 Jun 2011-Brain
TL;DR: The findings support the suggestion that a disruption in the encoding of prediction error signals contributes to anhedonia symptoms in depression and support proposals that psychiatric syndromes reflect different disorders of neural valuation and incentive salience formation.
Abstract: The dopamine system has been linked to anhedonia in depression and both the positive and negative symptoms of schizophrenia, but it remains unclear how dopamine dysfunction could mechanistically relate to observed symptoms. There is considerable evidence that phasic dopamine signals encode prediction error (differences between expected and actual outcomes), with reinforcement learning theories being based on prediction error-mediated learning of associations. It has been hypothesized that abnormal encoding of neural prediction error signals could underlie anhedonia in depression and negative symptoms in schizophrenia by disrupting learning and blunting the salience of rewarding events, and contribute to psychotic symptoms by promoting aberrant perceptions and the formation of delusions. To test this, we used model based functional magnetic resonance imaging and an instrumental reward-learning task to investigate the neural correlates of prediction errors and expected-reward values in patients with depression (n=15), patients with schizophrenia (n=14) and healthy controls (n=17). Both patient groups exhibited abnormalities in neural prediction errors, but the spatial pattern of abnormality differed, with the degree of abnormality correlating with syndrome severity. Specifically, reduced prediction errors in the striatum and midbrain were found in depression, with the extent of signal reduction in the bilateral caudate, nucleus accumbens and midbrain correlating with increased anhedonia severity. In schizophrenia, reduced prediction error signals were observed in the caudate, thalamus, insula and amygdala-hippocampal complex, with a trend for reduced prediction errors in the midbrain, and the degree of blunting in the encoding of prediction errors in the insula, amygdala-hippocampal complex and midbrain correlating with increased severity of psychotic symptoms. Schizophrenia was also associated with disruption in the encoding of expected-reward values in the bilateral amygdala-hippocampal complex and parahippocampal gyrus, with the degree of disruption correlating with psychotic symptom severity. Neural signal abnormalities did not correlate with negative symptom severity in schizophrenia. These findings support the suggestion that a disruption in the encoding of prediction error signals contributes to anhedonia symptoms in depression. In schizophrenia, the findings support the postulate of an abnormality in error-dependent updating of inferences and beliefs driving psychotic symptoms. Phasic dopamine abnormalities in depression and schizophrenia are suggested by our observation of prediction error abnormalities in dopamine-rich brain areas, given the evidence for dopamine encoding prediction errors. The findings are consistent with proposals that psychiatric syndromes reflect different disorders of neural valuation and incentive salience formation, which helps bridge the gap between biological and phenomenological levels of understanding.

410 citations


Cites background from "Trial-by-trial data analysis using ..."

  • ...…parameters were re-estimated applying prior information about the likely range of parameters (the prior being derived from the previous stage) to regularize estimates and avoid extreme (implausible) - or -values due to the inherent noisiness of the maximum likelihood estimation (Daw, 2009)....


  • ...In addition, three patients with schizophrenia received long-term anti-depressant medication because of previous episodes of depressive illness: sertraline 50 mg, paroxetine 50 mg and citalopram 20 mg, per day....


  • ...…(Pessiglione et al., 2006; Murray et al., 2007), a single set of parameters was fitted across all groups and subjects since it has been noticed that multi-subject functional MRI results are more robust if a single set of parameters is used to generate regressors for all subjects (Daw, 2009)....


Journal Article
TL;DR: This article used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk and found that the learned values of cues that predict rewards of equal mean but different variance are indeed modulated by experienced risk.
Abstract: Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric–psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.

319 citations

Journal Article
TL;DR: This study proposes a new computational model that accounts for the dynamic integration of RL and WM processes observed in subjects’ behavior, and specifies distinct influences of the high‐level and low‐level cognitive functions on instrumental learning, beyond the possibilities offered by simple RL models.
Abstract: Instrumental learning involves corticostriatal circuitry and the dopaminergic system. This system is typically modeled in the reinforcement learning (RL) framework by incrementally accumulating reward values of states and actions. However, human learning also implicates prefrontal cortical mechanisms involved in higher level cognitive functions. The interaction of these systems remains poorly understood, and models of human behavior often ignore working memory (WM) and therefore incorrectly assign behavioral variance to the RL system. Here we designed a task that highlights the profound entanglement of these two processes, even in simple learning problems. By systematically varying the size of the learning problem and delay between stimulus repetitions, we separately extracted WM-specific effects of load and delay on learning. We propose a new computational model that accounts for the dynamic integration of RL and WM processes observed in subjects' behavior. Incorporating capacity-limited WM into the model allowed us to capture behavioral variance that could not be captured in a pure RL framework even if we (implausibly) allowed separate RL systems for each set size. The WM component also allowed for a more reasonable estimation of a single RL process. Finally, we report effects of two genetic polymorphisms having relative specificity for prefrontal and basal ganglia functions. Whereas the COMT gene coding for catechol-O-methyl transferase selectively influenced model estimates of WM capacity, the GPR6 gene coding for G-protein-coupled receptor 6 influenced the RL learning rate. Thus, this study allowed us to specify distinct influences of the high-level and low-level cognitive functions on instrumental learning, beyond the possibilities offered by simple RL models.

316 citations


Cites result from "Trial-by-trial data analysis using ..."

  • ...…we also confirmed that the results reported below hold for a simple hierarchical fitting procedure in which summary statistics for each parameter were estimated across the entire group of subjects and which then acted as priors for the estimation of individual subject parameters (Daw, 2011)....


References
Journal Article
TL;DR: In this article, a new estimate minimum information theoretical criterion estimate (MAICE) is introduced for the purpose of statistical identification, which is free from the ambiguities inherent in the application of conventional hypothesis testing procedure.
Abstract: The history of the development of statistical hypothesis testing in time series analysis is reviewed briefly and it is pointed out that the hypothesis testing procedure is not adequately defined as the procedure for statistical model identification. The classical maximum likelihood estimation procedure is reviewed and a new estimate minimum information theoretical criterion (AIC) estimate (MAICE) which is designed for the purpose of statistical identification is introduced. When there are several competing models the MAICE is defined by the model and the maximum likelihood estimates of the parameters which give the minimum of AIC defined by AIC = (-2)log-(maximum likelihood) + 2(number of independently adjusted parameters within the model). MAICE provides a versatile procedure for statistical model identification which is free from the ambiguities inherent in the application of conventional hypothesis testing procedure. The practical utility of MAICE in time series analysis is demonstrated with some numerical examples.

47,133 citations


"Trial-by-trial data analysis using ..." refers background in this paper

  • ...The most common is the Akaike information criterion (AIC; Akaike, 1974), log(P(D | M, θ̂M))− n....


Journal Article
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.

38,681 citations

Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations


"Trial-by-trial data analysis using ..." refers background or methods in this paper

  • ...…the striatum; Delgado et al., 2000; Knutson et al., 2000; McClure et al., 2003; O’Doherty et al., 2003) resembles a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL, i.e. trial and error learning in decision problems; Sutton and Barto, 1998)....


  • ..., 2003) resembles a “prediction error” signal used in a number of computational algorithms for reinforcement learning (RL, i.e. trial and error learning in decision problems; Sutton and Barto, 1998)....


  • ...Since our focus here is on the methodology for estimation given a model, a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments)....


01 Jan 2005
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.

36,760 citations


"Trial-by-trial data analysis using ..." refers methods in this paper

  • ...BIC and cousins: A simpler approximation, which can be obtained from Equation 17 in a limit of large data, is the Bayesian Information Criterion (BIC; Schwarz, 1978)....


Book
Christopher M. Bishop
17 Aug 2006
TL;DR: Probability Distributions, linear models for Regression, Linear Models for Classification, Neural Networks, Graphical Models, Mixture Models and EM, Sampling Methods, Continuous Latent Variables, Sequential Data are studied.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

22,840 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Trial-by-trial data analysis using computational models"?

As these techniques are spreading rapidly, but have been developed and documented somewhat sporadically alongside the studies themselves, the present review aims to clarify the toolbox (see also O’Doherty et al., 2007). In particular, the authors discuss the rationale for these methods and the questions they are suited to address. Throughout, the authors flag interpretational and technical pitfalls of which they believe authors, reviewers, and readers should be aware. The authors focus on cataloging the particular, admittedly somewhat idiosyncratic, combination of techniques frequently used in this literature, but also on exposing these techniques as instances of a general set of tools that can be applied to analyze behavioral and neural data of many sorts. The authors then offer a relatively practical tutorial about the basic statistical methods for their answer and how they can be applied to data analysis.