Trial-by-trial data analysis using computational models
Summary
1 Introduction
- The authors then offer a relatively practical tutorial on the basic statistical methods for answering such questions and how they can be applied to data analysis.
- The techniques are illustrated with fits of simple models to simulated datasets.
- Throughout, the authors flag interpretational and technical pitfalls of which they believe authors, reviewers, and readers should be aware.
2 Background
- Whereas earlier work compared the mean behavior of computational simulations to aggregate reports, a more central issue in learning is how behavior (or the underlying neural activity) changes trial by trial in response to feedback.
- This example points to another important feature of this approach, which is that the theories purport to quantify, trial-by-trial, variables such as the reward expected for a choice (and the “prediction error,” or difference between the received and expected rewards).
- These hypotheses may be tested against one another on the basis of their fit to the data.
- The second part, which the authors will call the observation model, describes how the model’s internal variables are reflected in observed data: for instance, how expected values drive choice or how prediction errors produce neural spiking.
3 Parameter estimation
- Model parameters can characterize a variety of scientifically interesting quantities, from how quickly subjects learn (Behrens et al., 2007) to how sensitive they are to different rewards and punishments (Tom et al., 2007).
- Here the authors consider how to obtain statistical results about parameters’ values from data.
- The authors first consider the general statistical rationale underlying the problem; then develop the details for an example RL model before considering various pragmatic factors of actually performing these analyses on data.
- That is, the posterior probability distribution over the free parameters, given the data, is proportional to the product of two factors: (1) the likelihood of the data given the free parameters, and (2) the prior probability of the parameters (written out after this list).
- This equation famously shows how to start with a theory of how parameters produce data, and invert it into a theory by which data reveal the parameters that produced them.
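For concreteness, the proportionality described above can be written out as Bayes' rule, with θM denoting the free parameters of model M and D the data (a restatement in standard notation, not a quotation from the paper):

```latex
% Bayes' rule for parameter estimation: posterior \propto likelihood x prior
P(\theta_M \mid D, M) \;\propto\; P(D \mid \theta_M, M)\, P(\theta_M \mid M)
```

Maximum likelihood estimation maximizes the first factor alone, ignoring the prior; maximum a posteriori estimation maximizes the full product.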
3.1 Maximum likelihood estimation for RL
- The authors show how the general ideas play out in a simple reinforcement learning setting.
- Since their focus here is on the methodology for estimation given a model, a full review of the many candidate models is beyond the scope of the present article (see Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998 for exhaustive treatments).
- Evaluated at the peak point θ̂M, the elements of the Hessian are larger the more rapidly the likelihood function is dropping off away from it in different directions, which corresponds to a more reliable estimate of the parameters.
- The diagonal terms of H⁻¹ correspond to variances for each parameter separately, and their square roots give one standard error for each parameter (see the sketch after this list).
- The Q learning model has a more moderate coupling between the parameters.
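As a concrete illustration of maximum likelihood fitting with Hessian-based standard errors, here is a minimal Python sketch, not the paper's own code: it fits the learning rate α and softmax inverse temperature β of a two-option Q-learning model, then reads standard errors off the inverse Hessian of the objective. The simulated data lines are purely hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards):
    """Negative log likelihood of a two-option Q-learning model with
    softmax choice. params = (alpha, beta): learning rate and inverse
    temperature. choices: ints in {0, 1}; rewards: floats."""
    alpha, beta = params
    Q = np.zeros(2)                  # initial action values
    nll = 0.0
    for c, r in zip(choices, rewards):
        logits = beta * Q
        p = np.exp(logits - logits.max())
        p /= p.sum()                 # softmax choice probabilities
        nll -= np.log(p[c] + 1e-12)  # accumulate -log P(observed choice)
        Q[c] += alpha * (r - Q[c])   # prediction-error update
    return nll

# Purely hypothetical simulated data, for illustration only
rng = np.random.default_rng(0)
choices = rng.integers(0, 2, size=200)
rewards = rng.binomial(1, np.where(choices == 1, 0.7, 0.3)).astype(float)

res = minimize(neg_log_likelihood, x0=[0.5, 1.0],
               args=(choices, rewards), method="BFGS")
H_inv = res.hess_inv              # inverse Hessian of the objective at the MLE
se = np.sqrt(np.diag(H_inv))      # one standard error per parameter
print("alpha, beta =", res.x, "SEs =", se)
```

Because the objective is the negative log likelihood, its Hessian at the optimum plays the role of H above, so the diagonal of the inverse gives approximate parameter variances directly.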
3.2 Pragmatics
- At the heart of optimizing the likelihood function is computing it.
- Second, and not unrelated, discretizing the parameters too coarsely, or searching within an inappropriate range, can lead to poor results; worse yet, since the parameters are typically coupled, a poor search on one will also corrupt the estimates for other parameters.
- Thus, it makes some sense to constrain a parameter such as the learning rate to its meaningful range (e.g., between 0 and 1), as in the sketch below.
- There may, for instance, be features of the data that can only be captured within the model in question by adopting seemingly nonsensical parameters.
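One common way to respect such constraints without a bounded optimizer is to reparameterize and search an unconstrained space; a minimal sketch building on the neg_log_likelihood function above (the particular transformations are our illustrative choices, not prescribed by the paper):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def constrained_nll(u, choices, rewards):
    """The optimizer searches unconstrained u, while the model sees
    alpha in (0, 1) and beta > 0."""
    alpha = sigmoid(u[0])   # learning rate squashed into (0, 1)
    beta = np.exp(u[1])     # inverse temperature kept positive
    return neg_log_likelihood([alpha, beta], choices, rewards)

res_c = minimize(constrained_nll, x0=[0.0, 0.0], args=(choices, rewards))
alpha_hat = sigmoid(res_c.x[0])   # map back to the constrained scale
beta_hat = np.exp(res_c.x[1])
```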
3.3 Intersubject variability and random effects
- For population-level questions, treating parameters as fixed effects and thereby conflating within- and between-subject variability can lead to serious problems such as overstating the true significance of results.
- Estimating population parameters in a hierarchical model: by Bayes' rule at the population level, P(µα, µβ, σα, σβ | C1, …, CN) ∝ P(C1, …, CN | µα, µβ, σα, σβ) P(µα, µβ, σα, σβ) (Equation 7). This puts us in a position, in principle, to estimate the population parameters from the set of all subjects' choices, using maximum likelihood or maximum a posteriori methods exactly as discussed for individual subjects in the previous section.
- Moreover, assuming the distributions P(αi | µα, σα) and P(βi | µβ, σβ) are Gaussian, then finding the population parameters for these expressions is just the familiar problem of estimating a Gaussian distribution from samples.
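Under the Gaussian assumption, the simplest summary-statistics version of this idea fits each subject separately and then estimates the population Gaussian from the per-subject point estimates. A sketch, assuming all_subjects_data is a hypothetical list of (choices, rewards) pairs and reusing neg_log_likelihood from above; the full hierarchical treatment would instead integrate over the per-subject parameters:

```python
import numpy as np
from scipy.optimize import minimize

subject_estimates = []
for choices_i, rewards_i in all_subjects_data:
    res_i = minimize(neg_log_likelihood, x0=[0.5, 1.0],
                     args=(choices_i, rewards_i), method="BFGS")
    subject_estimates.append(res_i.x)

est = np.array(subject_estimates)     # rows: subjects; columns: (alpha_i, beta_i)
mu_hat = est.mean(axis=0)             # estimates of (mu_alpha, mu_beta)
sigma_hat = est.std(axis=0, ddof=1)   # estimates of (sigma_alpha, sigma_beta)
```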
3.5 Extensions
- The basic procedures outlined above admit of many extensions.
- In fact, this observation model (augmented with a hierarchical random effects model over the regression weights, such as β1, across the population) is identical to the general linear model used in standard fMRI analysis packages such as SPM.
- Now the authors may also compute a second timeseries ∂δt/∂α: the partial derivative of the prediction error timeseries with respect to α (illustrated in the sketch after this list).
- Parameters µα1 and so on can be estimated to determine the modes that best fit the data; π1 controls the predominance of subject type 1; and the question of how many types of subjects the data support is a model selection question, answerable by the methods discussed in Section 4.
- In all, fast acquisition followed by stable choice of the better option might be modeled with a decrease over trials in the learning rate, perhaps combined with an increase in the softmax inverse temperature (i.e., increasingly deterministic choice).
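To make the regressor idea concrete, the sketch below (our illustration, reusing the Q-learning fit from Section 3.1) generates the prediction-error timeseries δt for use in a GLM, and approximates the second timeseries ∂δt/∂α by finite differences:

```python
import numpy as np

def prediction_errors(alpha, choices, rewards):
    """Trial-by-trial prediction errors delta_t under Q-learning, e.g. for
    use as a parametric regressor in an fMRI GLM (after HRF convolution)."""
    Q = np.zeros(2)
    deltas = np.empty(len(choices))
    for t, (c, r) in enumerate(zip(choices, rewards)):
        deltas[t] = r - Q[c]        # prediction error on trial t
        Q[c] += alpha * deltas[t]   # value update
    return deltas

alpha_hat = res.x[0]  # fitted learning rate from the Section 3.1 sketch
delta_t = prediction_errors(alpha_hat, choices, rewards)

# Finite-difference approximation to d(delta_t)/d(alpha):
eps = 1e-4
d_delta_d_alpha = (prediction_errors(alpha_hat + eps, choices, rewards)
                   - prediction_errors(alpha_hat - eps, choices, rewards)) / (2 * eps)
```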
4 Model comparison
- So far, the authors have assumed a fixed model of the data and sought to estimate its free parameters.
- In particular, models like that of Equation 2 and its alternatives constitute different hypotheses about the mechanisms or algorithms that the brain uses to solve RL problems.
- The methods discussed below address this problem.
- In some cases, questions of interest might be framed either in terms of parameter estimation or model selection, and thus addressed using either the methods of the previous section or those of the current one.
4.1 Examples from RL
- The authors illustrate the issues of model selection using some simple alternatives to the model of choice behavior discussed thus far.
- In fact, the practical ingredients for model evaluation are basically the same as those for parameter estimation; as before, what is needed is simply to compute data likelihoods under a model, optimize parameters, and estimate Hessians.
- Depending on β (assumed to be fixed), these asymptotic learned values may imply less-than-complete preference for the better option over the worse.
- As already noted, this introduces some difficulty in comparing them.
4.2 Classical model comparison
- Conversely, a model fits well exactly to the extent that it captures the repeatable aspects of the data, allowing good predictions of additional datasets.
- In contrast, it has rarely been used in studies of reinforcement learning (though see Camerer and Ho, 1999).
- Let us consider again the likelihood of a single dataset, using best-fitting parameters.
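For nested models, the classical likelihood ratio test compares twice the log-likelihood gain against a chi-squared distribution with degrees of freedom equal to the number of extra parameters. A minimal sketch with hypothetical values:

```python
from scipy.stats import chi2

# Hypothetical minimized negative log likelihoods of two *nested* models,
# where the full model has k extra free parameters:
nll_reduced, nll_full, k = 412.3, 405.9, 1

lr_stat = 2.0 * (nll_reduced - nll_full)   # likelihood ratio statistic
p_value = chi2.sf(lr_stat, df=k)           # P(chi-squared with k df >= lr_stat)
print(lr_stat, p_value)
```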
4.3 Bayesian model comparison in theory
- The key quantity here is P(D | M), known as the model evidence: the probability of the data under the model (written as an integral over the parameters in the block after this list).
- Importantly, this expression does not make reference to any particular parameter settings such as θ̂M, since in asking how well a model predicts data, one is not given any particular parameters.
- This means that a more flexible model (one with more parameters that is able to achieve good fit to many data sets with different particular parameter settings) must correspondingly assign lower P(D | M) to all of them since a fixed probability of 1 is divided among them all.
- The result of a Bayesian model comparison is a statistical claim about the relative fit of one model over another.
- Kass and Raftery (1995) present a table of conventions for interpreting Bayes factors; note that their logs are taken in base-10 rather than base-e.
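The model evidence marginalizes, rather than maximizes, over the parameters; a standard statement of the integral and the resulting Bayes factor, written in notation consistent with the summary above:

```latex
% Model evidence: marginalize (not maximize) over the parameters
P(D \mid M) = \int d\theta_M \, P(D \mid \theta_M, M) \, P(\theta_M \mid M)

% Bayes factor comparing models M_1 and M_2
\mathrm{BF}_{12} = \frac{P(D \mid M_1)}{P(D \mid M_2)}
```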
4.4 Bayesian model comparison in practice
- The theory of Bayesian model selection is a very useful conceptual framework; for instance, it clarifies why the maximum likelihood score is an inappropriate metric for model comparison.
- The authors have mostly ignored priors thus far, because their subjective nature arguably makes them problematic in the context of objective scientific communication.
- Nevertheless, most of the methods discussed below do require assuming a prior over parameters.
- See MacKay (2003) and Bishop (2006) for discussion of more elaborate sampling techniques that attempt to cope with this situation.
- Similarly, H is the Hessian of the function being optimized (minus the sum of the first two terms of Equation 17), evaluated at the MAP point, not the Hessian of just the log likelihood.
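A minimal sketch of two approximations to the log evidence discussed here, with our own function names; hessian is the Hessian of the negative log joint (likelihood plus prior) at the MAP point, per the caveat above:

```python
import numpy as np

def laplace_log_evidence(log_joint_at_map, hessian):
    """Laplace approximation to log P(D | M). log_joint_at_map is
    log P(D | theta_MAP, M) + log P(theta_MAP | M); hessian is the Hessian
    of the *negative* log joint at the MAP point, not of the likelihood alone."""
    n = hessian.shape[0]                       # number of free parameters
    sign, logdet = np.linalg.slogdet(hessian)  # stable log-determinant
    return log_joint_at_map + 0.5 * n * np.log(2.0 * np.pi) - 0.5 * logdet

def bic_score(log_lik_at_mle, n_params, n_obs):
    """BIC in a larger-is-better convention, as an approximate log evidence:
    log P(D | theta_hat, M) - (n/2) log T, with T observations."""
    return log_lik_at_mle - 0.5 * n_params * np.log(n_obs)
```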
4.5 Summary and recommendations
- Models may be compared to one another on the basis of the likelihood they assign to data; however, if this likelihood is computed at parameters chosen to optimize it, the measure must be corrected for overfitting to allow a fair comparison between models with different numbers of parameters.
- In practice, when models are nested, the authors suggest using a likelihood ratio test, since this permits reporting a classical p-value and is well accepted.
- If one is willing to define a prior, and defend it, the authors suggest exploring the Laplace approximation, which is almost as simple but far better founded.
- Even if two candidate models have the same number of parameters — and thus scores like BIC are equivalent to just comparing raw likelihoods — the complexity penalty implied by Equation 15 may not actually be the same between them if the two sets of parameters are differently constrained, either a priori or by the data.
- As in Figure 5, this more accurate assessment can have salutary effects.
4.6 Model comparison and populations
- So far the authors have described model comparison mostly in the abstract, with applications to choice data at the single subject level.
- There are a number of possibilities, of which the simplest will often suffice.
- These aggregates can then be compared between two models to compute a Bayes factor over the population (see the sketch after this list).
- Finally, one could take the identity of the model as varying over subjects, i.e. as a random effect (Stephan et al., 2009).
- This involves adding another level to the hierarchy of Figure 2, according to which, for each subject, one of a set of models is drawn with some probability according to a multinomial distribution (given by new free parameters), and then the model's parameters and the data are drawn as before.
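For the fixed-effect aggregate, the arithmetic is simply summing per-subject log evidences before differencing; a sketch with hypothetical values (e.g., Laplace or BIC approximations from the previous section):

```python
import numpy as np

# Hypothetical per-subject log evidences under models A and B, one per subject:
log_ev_A = np.array([-101.2, -98.7, -110.4, -95.0])
log_ev_B = np.array([-103.5, -99.1, -108.9, -97.2])

# Fixed effect: sum across subjects, then difference the sums to obtain
# the group log Bayes factor favoring A over B.
group_log_bf = log_ev_A.sum() - log_ev_B.sum()
print("group log Bayes factor (A vs. B):", group_log_bf)
```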
5 Pitfalls and alternatives
- The authors close this tutorial by identifying some pitfalls, caveats, and concerns with these methods that they think it is important for readers to appreciate.
- As mentioned, even if learning parameters such as temperature and learning rate are actually independent from one another (i.e., in terms of their distribution across a population), the estimates of those parameters from data may be correlated due to their having similar expected effects on observable data.
- For choice data, exponentiating the average log likelihood, exp(L/T) (where L is the total log likelihood over the T trials), produces a per-trial probability that is easily interpreted relative to the chance level (see the sketch after this list).
- This points again to the fact that these methods are suited to drawing relative conclusions comparing multiple hypotheses (the data support model A over model B).
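A quick illustration of this diagnostic, reusing the fit from the Section 3.1 sketch; the chance level of 0.5 assumes two options:

```python
import numpy as np

# Total log likelihood L at the fitted parameters, over T trials:
L = -neg_log_likelihood(res.x, choices, rewards)
T = len(choices)
per_trial_prob = np.exp(L / T)   # geometric mean of per-trial choice probabilities
print(per_trial_prob, "vs. chance = 0.5 for two options")
```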