Using connectome-based predictive modeling to predict individual behavior from brain connectivity

Xilin Shen1, Emily S. Finn2, Dustin Scheinost1, Monica D. Rosenberg3, Marvin M. Chun2,3,4, Xenophon Papademetris1,5, and R. Todd Constable1,2,6,*

1 Department of Radiology and Biomedical Imaging, Yale School of Medicine, New Haven CT, USA
2 Interdepartmental Neuroscience Program, Yale School of Medicine, New Haven CT, USA
3 Department of Psychology, Yale University, New Haven CT, USA
4 Department of Psychology, Yale University, New Haven CT, USA
5 Department of Biomedical Engineering, Yale University, New Haven CT, USA
6 Department of Neurosurgery, Yale School of Medicine, New Haven CT, USA
Abstract
Neuroimaging is a rapidly developing research area in which anatomical and functional images of
human brains are collected using techniques such as functional magnetic resonance imaging
(fMRI), diffusion tensor imaging (DTI), and electroencephalography (EEG). Technical advances
and large-scale datasets have allowed for the development of models capable of predicting
individual differences in traits and behavior using brain connectivity measures derived from
neuroimaging data. Here, we present connectome-based predictive modeling (CPM), a data-driven
protocol for developing predictive models of brain-behavior relationships from connectivity data
using cross-validation. This protocol includes the following steps: 1) feature selection, 2) feature
summarization, 3) model building, and 4) assessment of prediction significance. We also include
suggestions for visualizing the most predictive features (i.e., brain connections). The final result
should be a generalizable model that takes brain connectivity data as input and generates
predictions of behavioral measures in novel subjects, accounting for a significant amount of the
variance in these measures. The CPM protocol has been shown to perform as well as or better than most existing approaches to brain-behavior prediction. Moreover, because CPM relies on linear modeling and a purely data-driven approach, neuroscientists with limited or no experience in machine learning or optimization will find the protocol easy to implement. Depending on the volume of data to be processed, the protocol can take 10–100 minutes
for model building, 1–48 hours for permutation testing, and 10–20 minutes for visualization of results.

*Corresponding Author: R. Todd Constable, todd.constable@yale.edu.

Author contributions statement: XS, ESF, DS, XP, and RTC conceptualized the study. XS developed this protocol with help from ESF and DS. ESF developed the prediction framework with help from XS and MDR. ESF, XP, and XS contributed previously unpublished tools. XP developed the online visualization tools with help from XS and DS. XP, MMC, and RTC provided support and guidance with data interpretation. All authors made significant comments on the manuscript.

Supplementary Information: Supplementary Table 1; Supplementary Table 2.

HHS Public Access author manuscript; available in PMC 2018 March 01. Published in final edited form as: Nat Protoc. 2017 March; 12(3): 506–518. doi:10.1038/nprot.2016.178.
INTRODUCTION
Establishing the relationship between individual differences in brain structure and function
and individual differences in behavior is a major goal of modern neuroscience. Historically,
many neuroimaging studies of individual differences have focused on establishing
correlational relationships between brain measurements and cognitive traits such as
intelligence, memory, and attention, or disease symptoms.
Note, however, that the term "predicts" is often used loosely as a synonym for "correlates with": for example, it is common to say that brain property x "predicts" behavioral variable y, where x may be an fMRI-derived measure of univariate activity or functional connectivity, and y may be a measure of task performance, symptom severity, or another continuous variable. Yet, in the strict sense of the word, this is not prediction but rather correlation.
Correlation or similar regression models tend to overfit the data and, as a result, often fail to
generalize to novel data. The vast majority of brain-behavior studies do not perform cross-validation, which makes it difficult to evaluate the generalizability of the results. In the worst case, Kriegeskorte et al.1 demonstrated that circularity in selection and selective analyses leads to completely erroneous results. Proper cross-validation is key to ensuring independence between feature selection and prediction/classification, thus eliminating spurious effects and incorrect population-level inferences2. There are at least two important reasons to test the predictive power of brain-behavior correlations discovered in the course of basic neuroimaging research:
1. From the standpoint of scientific rigor, cross-validation is a more conservative way to infer the presence of a brain-behavior relationship than correlation. Cross-validation is designed to protect against overfitting by testing the strength of the relationship in a novel sample, increasing the likelihood of replication in future studies.

2. From a practical standpoint, establishing predictive power is necessary to translate neuroimaging findings into tools with practical utility3. In part, fMRI has struggled as a diagnostic tool due to low generalizability of results to novel subjects. Testing and reporting performance in independent samples will facilitate evaluation of a result's generalizability and eventual development of useful neuroimaging-based biomarkers with real-world applicability.
Nevertheless, the design and construction of predictive models remains a challenge.
Recently, we have developed connectome-based predictive modeling (CPM), a method with built-in cross-validation for extracting and summarizing the most relevant features from brain connectivity data in order to construct predictive models4. Using both resting-state functional magnetic resonance imaging (fMRI) and task-based fMRI, we have shown that cognitive traits, such as fluid intelligence and sustained attention, can be successfully predicted in novel subjects using this method4,5. Although CPM was developed with fMRI-derived functional connectivity as the input, we believe it could be adapted to work with structural connectivity data measured with diffusion tensor imaging (DTI) or related methods, or with functional connectivity data derived from other modalities such as electroencephalography (EEG).
Here, we present a protocol for developing predictive models of brain-behavior relationships
from connectivity data using CPM, which includes the following steps: 1) feature selection,
2) feature summarization, 3) model building and application, and 4) assessment of prediction
significance. We also include suggestions for visualization of results. This protocol is
designed to serve as a framework illustrating how to construct and test predictive models,
and to encourage investigators to perform these types of analyses.
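To make the listed steps concrete, the following is a minimal Python sketch of leave-one-subject-out CPM on synthetic data. It is an illustration of ours, not the published Matlab implementation: all function and variable names are invented, and for brevity it selects only positively correlated edges, whereas the full protocol also builds a network from negatively correlated edges.

```python
import numpy as np
from scipy import stats

def cpm_loocv(edges, behavior, p_thresh=0.05):
    """Leave-one-subject-out CPM sketch (positive network only).
    edges: (n_subjects, n_edges) array; behavior: (n_subjects,) array."""
    n = len(behavior)
    preds = np.empty(n)
    for i in range(n):
        train = [s for s in range(n) if s != i]
        X, y = edges[train], behavior[train]
        # Step 1: feature selection -- correlate every edge with behavior
        r = np.empty(X.shape[1])
        p = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            r[j], p[j] = stats.pearsonr(X[:, j], y)
        mask = (p < p_thresh) & (r > 0)
        # Step 2: feature summarization -- one summary statistic per subject
        train_scores = X[:, mask].sum(axis=1)
        # Step 3: model building -- a simple linear fit on the training set,
        # then application to the held-out subject
        slope, intercept = np.polyfit(train_scores, y, 1)
        preds[i] = slope * edges[i, mask].sum() + intercept
    return preds

# Synthetic demo: 40 subjects, 50 edges, the first 5 carrying real signal
rng = np.random.default_rng(0)
behavior = rng.normal(size=40)
edges = rng.normal(size=(40, 50))
edges[:, :5] += behavior[:, None]   # inject a brain-behavior relationship
preds = cpm_loocv(edges, behavior)
r_obs, _ = stats.pearsonr(preds, behavior)
print(f"cross-validated r = {r_obs:.2f}")
```

Step 4, assessing significance, would compare the cross-validated correlation against a permutation-based null distribution rather than a parametric p-value, since predictions across cross-validation folds are not independent.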
Development of the protocol
In this protocol, we describe an algorithm to build predictive models based on a set of
single-subject connectivity matrices, and test these models using cross-validation on novel
data (shown as a schematic in Figure 1). We also discuss a number of options in model
building, including selecting features from pre-defined networks rather than from the whole
brain. We address the issue of how to assess the significance of the predictive power using
permutation tests. Finally, we provide examples of how to visualize the features—in this
case, brain connections—that contribute the most predictive power. This protocol has been
designed for users familiar with connectivity analysis and neuroimaging data processing.
Data preprocessing and related issues are beyond the scope of this protocol, as the methods presented in Finn et al.4 and Rosenberg et al.5 generalize to any set of connectivity matrices. We therefore assume that individual data have been fully preprocessed and that the input to this protocol is a set of M by M connectivity matrices, where M represents the number of distinct brain regions, or nodes, under consideration, and each element of the matrix is a continuous value representing the strength of the connection between two nodes.
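As a concrete illustration of this input format (a sketch of ours, not part of the protocol itself), such symmetric M by M matrices are typically reduced to their unique off-diagonal elements before modeling:

```python
import numpy as np

M = 300                          # e.g. nodes in a ~300-region parcellation
n_edges = M * (M - 1) // 2       # unique edges in a symmetric matrix
print(n_edges)                   # 44850

# Stack of subject connectivity matrices -> (n_subjects, n_edges) features
rng = np.random.default_rng(1)
n_subjects = 10
mats = rng.normal(size=(n_subjects, M, M))
mats = (mats + mats.transpose(0, 2, 1)) / 2   # enforce symmetry
iu = np.triu_indices(M, k=1)                  # upper triangle, no diagonal
edges = mats[:, iu[0], iu[1]]
print(edges.shape)                            # (10, 44850)
```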
Applications of the method
Human neuroimaging studies routinely collect behavioral variables along with structural and
functional imaging. Additionally, open-source datasets, including the Human Connectome Project (HCP)6, the NKI-Rockland sample7, the ADHD-200 dataset8, and the Philadelphia Neurodevelopmental Cohort (PNC)9, include large samples of subjects (N > 500) with both imaging data and many behavioral variables. Therefore, vast amounts of data exist with which to explore which brain connections predict individual differences in behavior. Further, as demonstrated in Rosenberg et al.5, these open-source datasets can be pooled or combined with local datasets to test whether a predictive model generalizes across different scanners, different subject populations, and even different measures of the underlying phenotype of interest.
We have applied the CPM protocol in our research and demonstrated robust relationships between brain connectivity and fluid intelligence in Finn et al.4 and between brain connectivity and sustained attention in Rosenberg et al.5. Here, we aim to provide a user-friendly guide for predicting a behavioral variable in novel subjects using connectivity data. The models described in this protocol offer a rigorous way to establish a brain-behavior relationship using cross-validation.
Comparison with other methods
The strengths of CPM include its use of linear operations and its purely data-driven
approach. Linear operations allow for fast computation (for example, roughly 60 seconds to
run leave-one-subject-out cross-validation on 100 subjects), easy software implementation
(<100 lines of Matlab code), and straightforward interpretation of feature weights. Although
state-of-the-art brain parcellation methods typically divide the brain into ~300 regions, resulting in ~45,000 unique connections, or edges10–13, many hypothesis-driven approaches focus on a single edge, region, or network of interest. These approaches ignore a large number of connections and may limit predictive power. In contrast, CPM searches for the most relevant features (edges) across the whole brain and summarizes the selected features for prediction.
The simplest and most popular method for establishing brain-behavior relationships using neuroimaging data is a correlation or regression model14. As mentioned in the introduction, these methods often overfit the data and limit generalizability to novel data. Often these correlational relationships are tested on a priori regions of interest, but they may also be tested in a whole-brain, data-driven manner. Importantly, using a cross-validated approach helps guard against the potential for false positives inherent in a whole-brain, data-driven analysis, and eschews the need for traditional correction for multiple comparisons.
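The permutation test used in step 4 of the protocol can be sketched as follows. This is a Python illustration of ours: a single-feature linear model stands in for the full CPM pipeline, which would be re-run in its entirety for each shuffle of the behavioral scores.

```python
import numpy as np
from scipy import stats

def loocv_r(x, y):
    """Cross-validated correlation for a one-predictor linear model
    (a stand-in for a full prediction pipeline)."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        tr = [s for s in range(n) if s != i]
        slope, intercept = np.polyfit(x[tr], y[tr], 1)
        preds[i] = slope * x[i] + intercept
    return stats.pearsonr(preds, y)[0]

rng = np.random.default_rng(2)
y = rng.normal(size=30)
x = y + rng.normal(scale=0.8, size=30)   # synthetic predictive feature

r_true = loocv_r(x, y)
# Null distribution: shuffle behavior and repeat the *entire* cross-validation
n_perm = 500
r_null = np.array([loocv_r(x, rng.permutation(y)) for _ in range(n_perm)])
# One-tailed p-value: how often a shuffled run does at least as well
p_perm = (1 + np.sum(r_null >= r_true)) / (1 + n_perm)
print(f"observed r = {r_true:.2f}, permutation p = {p_perm:.3f}")
```

The "+1" terms make the p-value conservative and never exactly zero, since the observed result is counted as one member of the null distribution.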
The method most directly comparable to CPM may be the multivariate-prediction and univariate-regression method used by the HCP Netmats MegaTrawl release15 (https://db.humanconnectome.org/megatrawl/index.html). This set of algorithms uses independent component analysis and partial correlation to generate connectivity matrices from resting-state fMRI data. These matrices are then related to behavior using elastic-net feature selection and prediction with 10-fold cross-validation (an inner loop for parameter optimization and an outer loop for prediction evaluation). The main differences between this approach and the proposed CPM approach are: (1) use of group-wise ICA to derive subject-specific functional brain subunits (and associated time courses) versus use of an existing functional brain atlas registered to each subject; (2) use of partial correlation versus Pearson correlation to measure connectivity; (3) use of the elastic-net algorithm versus Pearson correlation with the behavioral measure to select meaningful edges; and (4) use of the elastic-net algorithm for prediction versus use of a linear model on mean connectivity strength. The MegaTrawl approach is computationally more complex and requires substantial expertise in optimization. We focus on the use of a purely linear model that can be easily implemented with basic programming skills. While no direct comparison has been made, both methods perform similarly for predicting fluid intelligence (see Finn et al.4 and Smith et al.16).
Another alternative method for developing predictive models from brain connectivity data is support vector regression (SVR)17, an extension of the support vector machine classification framework to continuous data. In this approach, rather than performing mass univariate calculations to select relevant features (edges) and combining these into a single statistic for each subject, a supervised learning algorithm considers all features simultaneously and generates a model that assigns different weights to different features in order to best approximate each observation (distinct behavioral measurement) in the training set. Features from the test subject(s) are then combined using the same weights, and the trained model
outputs a predicted behavioral score. See Dosenbach et al.18 for an example of SVR applied to functional connectivity data to predict subject age. A comparison between CPM and SVR in terms of performance and running time is provided in the Supplementary Information and Supplementary Table S1.
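For comparison, here is a hedged sketch of the SVR alternative on the same kind of synthetic data, assuming scikit-learn is available (the cited studies used their own implementations and settings):

```python
import numpy as np
from scipy import stats
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic data: 40 subjects, 50 edges, the first 5 carrying real signal
rng = np.random.default_rng(3)
behavior = rng.normal(size=40)
edges = rng.normal(size=(40, 50))
edges[:, :5] += behavior[:, None]

# Unlike CPM's select-then-summarize strategy, SVR fits one supervised
# model that weighs all edges simultaneously
model = SVR(kernel="linear", C=1.0)
preds = cross_val_predict(model, edges, behavior, cv=LeaveOneOut())
r, _ = stats.pearsonr(preds, behavior)
print(f"leave-one-out SVR: r = {r:.2f}")
```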
Finally, many studies have used similar machine-learning techniques in a classification framework to distinguish healthy control participants from patients using connectivity data. Reliable classification of patients has been shown in several disorders, including ADHD19, autism20,21, schizophrenia22, Alzheimer's disease23, and depression24. A fundamental difference between most classification methods and CPM (or the multivariate-prediction and univariate-regression method or the SVR-based methods described above) is that in classification the outcome variable is discrete (often binary) rather than continuous. Prediction of individual differences in a continuous measure across a healthy sample is considerably more challenging than binary classification of disease state. Variations in behavior among healthy participants generally have substantially smaller effect sizes than differences due to pathology. In addition, accurate prediction of continuous variables requires accurate modeling over the whole range of the variable, whereas accurate binary classification largely requires accurate grouping of participants near the margin. When subsets of participants are distributed far from the margin, their correct classification is often guaranteed.
While SVR and related multivariate methods can provide good predictive power, in our
experience, predictions generated using CPM are often as good or better than those
generated using SVR, and CPM has at least two advantages over multivariate methods. First,
from a practical standpoint, CPM is simpler to implement and requires less expertise in
machine learning. This makes it more accessible to the general neuroimaging community. It
is our hope that in providing this protocol, we can encourage researchers to perform cross-
validated analyses of the brain-behavior relationships they discover, which will set more
rigorous statistical standards for the field and improve replicability across studies.
The second major advantage of the CPM approach compared with multivariate methods is that the predictive networks obtained by CPM can be clearly interpreted. It is a frequently overlooked problem in the literature that interpreting the weights generated by multivariate regression models, even linear ones, is not straightforward25. For example, researchers often erroneously equate large weights with greater importance, and nonlinear models are harder still to interpret. CPM allows researchers to rigorously test the predictive value of a brain-behavior relationship while still providing a one-to-one mapping back to the original feature space, so that researchers can visualize and investigate the underlying brain connections contributing to the model. This is critical for comparing results with the existing literature, generating new hypotheses about network structure and function, and advancing our understanding of functional brain organization in general.
Limitations
CPM is based on linear relationships, typically with a slope and an intercept (i.e., y = mx + b). These models may not be optimal for capturing complex, non-linear relationships between connectivity and behavior. Higher-order polynomial terms could be added to the model (i.e.,