Using connectome-based predictive modeling to predict individual behavior from brain connectivity

Xilin Shen1, Emily S. Finn2, Dustin Scheinost1, Monica D. Rosenberg3, Marvin M. Chun2,3,4, Xenophon Papademetris1,5, and R. Todd Constable1,2,6,*

1 Department of Radiology and Biomedical Imaging, Yale School of Medicine, New Haven CT, USA
2 Interdepartmental Neuroscience Program, Yale School of Medicine, New Haven CT, USA
3 Department of Psychology, Yale University, New Haven CT, USA
4 Department of Psychology, Yale University, New Haven CT, USA
5 Department of Biomedical Engineering, Yale University, New Haven CT, USA
6 Department of Neurosurgery, Yale School of Medicine, New Haven CT, USA
Abstract
Neuroimaging is a rapidly developing research area in which anatomical and functional images of
human brains are collected using techniques such as functional magnetic resonance imaging
(fMRI), diffusion tensor imaging (DTI), and electroencephalography (EEG). Technical advances
and large-scale datasets have allowed for the development of models capable of predicting
individual differences in traits and behavior using brain connectivity measures derived from
neuroimaging data. Here, we present connectome-based predictive modeling (CPM), a data-driven
protocol for developing predictive models of brain-behavior relationships from connectivity data
using cross-validation. This protocol includes the following steps: 1) feature selection, 2) feature
summarization, 3) model building, and 4) assessment of prediction significance. We also include
suggestions for visualizing the most predictive features (i.e., brain connections). The final result
should be a generalizable model that takes brain connectivity data as input and generates
predictions of behavioral measures in novel subjects, accounting for a significant amount of the
variance in these measures. The CPM protocol has been shown to perform as well as or better than most existing approaches to brain-behavior prediction. Moreover, because CPM relies on linear modeling and a purely data-driven approach, neuroscientists with limited or no experience in machine learning or optimization will find the protocol easy to implement. Depending on the volume of data to be processed, the protocol can take 10–100 minutes
for model building, 1–48 hours for permutation testing, and 10–20 minutes for visualization of results.

*Corresponding Author: R. Todd Constable, todd.constable@yale.edu.

Author contributions statement: XS, ESF, DS, XP, and RTC conceptualized the study. XS developed this protocol with help from ESF and DS. ESF developed the prediction framework with help from XS and MDR. ESF, XP, and XS contributed previously unpublished tools. XP developed the online visualization tools with help from XS and DS. XP, MMC, and RTC provided support and guidance with data interpretation. All authors made significant comments on the manuscript.

Supplementary Information: Supplementary Table 1; Supplementary Table 2.

HHS Public Access author manuscript; available in PMC 2018 March 01. Published in final edited form as: Nat Protoc. 2017 March; 12(3): 506–518. doi:10.1038/nprot.2016.178.
INTRODUCTION
Establishing the relationship between individual differences in brain structure and function
and individual differences in behavior is a major goal of modern neuroscience. Historically,
many neuroimaging studies of individual differences have focused on establishing
correlational relationships between brain measurements and cognitive traits such as
intelligence, memory, and attention, or disease symptoms.
Note, however, that the term "predicts" is often used loosely as a synonym for "correlates with": for example, it is common to say that brain property x "predicts" behavioral variable y, where x may be an fMRI-derived measure of univariate activity or functional connectivity, and y may be a measure of task performance, symptom severity, or another continuous variable. Yet, in the strict sense of the word, this is not prediction but rather correlation.
Correlation or similar regression models tend to overfit the data and, as a result, often fail to
generalize to novel data. The vast majority of brain-behavior studies do not perform cross-validation, which makes it difficult to evaluate the generalizability of the results. In the worst case, Kriegeskorte et al.1 demonstrated that circularity in selection and selective analyses leads to completely erroneous results. Proper cross-validation is key to ensuring independence between feature selection and prediction/classification, thus eliminating spurious effects and incorrect population-level inferences2. There are at least two important reasons to test the predictive power of brain-behavior correlations discovered in the course of basic neuroimaging research:
1. From the standpoint of scientific rigor, cross-validation is a more conservative way to infer the presence of a brain-behavior relationship than correlation. Cross-validation is designed to protect against overfitting by testing the strength of the relationship in a novel sample, increasing the likelihood of replication in future studies.

2. From a practical standpoint, establishing predictive power is necessary to translate neuroimaging findings into tools with practical utility3. In part, fMRI has struggled as a diagnostic tool due to low generalizability of results to novel subjects. Testing and reporting performance in independent samples will facilitate evaluation of a result's generalizability and eventual development of useful neuroimaging-based biomarkers with real-world applicability.
Nevertheless, the design and construction of predictive models remains a challenge.
Recently, we have developed connectome-based predictive modeling (CPM), a method with built-in cross-validation for extracting and summarizing the most relevant features from brain connectivity data in order to construct predictive models4. Using both resting-state functional magnetic resonance imaging (fMRI) and task-based fMRI, we have shown that cognitive traits, such as fluid intelligence and sustained attention, can be successfully predicted in novel subjects using this method4,5. Although CPM was developed with fMRI-derived functional connectivity as the input, we believe it could be adapted to work with structural connectivity data measured with diffusion tensor imaging (DTI) or related methods, or with functional connectivity data derived from other modalities such as electroencephalography (EEG).
Here, we present a protocol for developing predictive models of brain-behavior relationships
from connectivity data using CPM, which includes the following steps: 1) feature selection,
2) feature summarization, 3) model building and application, and 4) assessment of prediction
significance. We also include suggestions for visualization of results. This protocol is
designed to serve as a framework illustrating how to construct and test predictive models,
and to encourage investigators to perform these types of analyses.
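To make the listed steps concrete, the following is a minimal Python sketch of leave-one-subject-out CPM on synthetic data. It is an illustration of ours, not the published Matlab implementation: all function and variable names are invented, and for brevity it selects only positively correlated edges, whereas the full protocol also builds a network from negatively correlated edges.

```python
import numpy as np
from scipy import stats

def cpm_loocv(edges, behavior, p_thresh=0.05):
    """Leave-one-subject-out CPM sketch (positive network only).
    edges: (n_subjects, n_edges) array; behavior: (n_subjects,) array."""
    n = len(behavior)
    preds = np.empty(n)
    for i in range(n):
        train = [s for s in range(n) if s != i]
        X, y = edges[train], behavior[train]
        # Step 1: feature selection -- correlate every edge with behavior
        r = np.empty(X.shape[1])
        p = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            r[j], p[j] = stats.pearsonr(X[:, j], y)
        mask = (p < p_thresh) & (r > 0)
        # Step 2: feature summarization -- one summary statistic per subject
        train_scores = X[:, mask].sum(axis=1)
        # Step 3: model building -- a simple linear fit on the training set,
        # then application to the held-out subject
        slope, intercept = np.polyfit(train_scores, y, 1)
        preds[i] = slope * edges[i, mask].sum() + intercept
    return preds

# Synthetic demo: 40 subjects, 50 edges, the first 5 carrying real signal
rng = np.random.default_rng(0)
behavior = rng.normal(size=40)
edges = rng.normal(size=(40, 50))
edges[:, :5] += behavior[:, None]   # inject a brain-behavior relationship
preds = cpm_loocv(edges, behavior)
r_obs, _ = stats.pearsonr(preds, behavior)
print(f"cross-validated r = {r_obs:.2f}")
```

Step 4, assessing significance, would compare the cross-validated correlation against a permutation-based null distribution rather than a parametric p-value, since predictions across cross-validation folds are not independent.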
Development of the protocol
In this protocol, we describe an algorithm to build predictive models based on a set of
single-subject connectivity matrices, and test these models using cross-validation on novel
data (shown as a schematic in Figure 1). We also discuss a number of options in model
building, including selecting features from pre-defined networks rather than from the whole
brain. We address the issue of how to assess the significance of the predictive power using
permutation tests. Finally, we provide examples of how to visualize the features—in this
case, brain connections—that contribute the most predictive power. This protocol has been
designed for users familiar with connectivity analysis and neuroimaging data processing.
Data preprocessing and related issues are beyond the scope of this protocol, as the methods presented in Finn et al.4 and Rosenberg et al.5 generalize to any set of connectivity matrices. We therefore assume that individual data have been fully preprocessed and that the input to this protocol is a set of M by M connectivity matrices, where M represents the number of distinct brain regions, or nodes, under consideration, and each element of the matrix is a continuous value representing the strength of the connection between two nodes.
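As a concrete illustration of this input format (a sketch of ours, not part of the protocol itself), such symmetric M by M matrices are typically reduced to their unique off-diagonal elements before modeling:

```python
import numpy as np

M = 300                          # e.g. nodes in a ~300-region parcellation
n_edges = M * (M - 1) // 2       # unique edges in a symmetric matrix
print(n_edges)                   # 44850

# Stack of subject connectivity matrices -> (n_subjects, n_edges) features
rng = np.random.default_rng(1)
n_subjects = 10
mats = rng.normal(size=(n_subjects, M, M))
mats = (mats + mats.transpose(0, 2, 1)) / 2   # enforce symmetry
iu = np.triu_indices(M, k=1)                  # upper triangle, no diagonal
edges = mats[:, iu[0], iu[1]]
print(edges.shape)                            # (10, 44850)
```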
Applications of the method
Human neuroimaging studies routinely collect behavioral variables along with structural and
functional imaging. Additionally, open-source datasets, including the Human Connectome Project (HCP)6, the NKI-Rockland sample7, the ADHD-200 dataset8, and the Philadelphia Neurodevelopmental Cohort (PNC)9, include large samples of subjects (N > 500) with both imaging data and many behavioral variables. Therefore, vast amounts of data exist with which to explore which brain connections predict individual differences in behavior. Further, as demonstrated in Rosenberg et al.5, these open-source datasets can be pooled or combined with local datasets to test whether a predictive model generalizes across different scanners, different subject populations, and even different measures of the underlying phenotype of interest.
We have applied the CPM protocol in our research and demonstrated robust relationships between brain connectivity and fluid intelligence in Finn et al.4 and between brain connectivity and sustained attention in Rosenberg et al.5. Here, we aim to provide a user-friendly guide for predicting a behavioral variable in novel subjects using connectivity data. The models described in this protocol offer a rigorous way to establish a brain-behavior relationship using cross-validation.
Comparison with other methods
The strengths of CPM include its use of linear operations and its purely data-driven
approach. Linear operations allow for fast computation (for example, roughly 60 seconds to
run leave-one-subject-out cross-validation on 100 subjects), easy software implementation
(<100 lines of Matlab code), and straightforward interpretation of feature weights. Although
state-of-the-art brain parcellation methods typically divide the brain into ~300 regions, resulting in ~45,000 unique connections, or edges10–13, many hypothesis-driven approaches focus on a single edge, region, or network of interest. These approaches ignore a large number of connections and may limit predictive power. In contrast, CPM searches for the most relevant features (edges) across the whole brain and summarizes the selected features for prediction.
The simplest and most popular method for establishing brain-behavior relationships using neuroimaging data is a correlation or regression model14. As mentioned in the introduction, these methods often overfit the data and limit generalizability to novel data. Often these correlational relationships are tested on a priori regions of interest, but they may also be tested in a whole-brain, data-driven manner. Importantly, using a cross-validated approach helps guard against the potential for false positives inherent in a whole-brain, data-driven analysis, and eschews the need for traditional correction for multiple comparisons.
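The permutation test used in step 4 of the protocol can be sketched as follows. This is a Python illustration of ours: a single-feature linear model stands in for the full CPM pipeline, which would be re-run in its entirety for each shuffle of the behavioral scores.

```python
import numpy as np
from scipy import stats

def loocv_r(x, y):
    """Cross-validated correlation for a one-predictor linear model
    (a stand-in for a full prediction pipeline)."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        tr = [s for s in range(n) if s != i]
        slope, intercept = np.polyfit(x[tr], y[tr], 1)
        preds[i] = slope * x[i] + intercept
    return stats.pearsonr(preds, y)[0]

rng = np.random.default_rng(2)
y = rng.normal(size=30)
x = y + rng.normal(scale=0.8, size=30)   # synthetic predictive feature

r_true = loocv_r(x, y)
# Null distribution: shuffle behavior and repeat the *entire* cross-validation
n_perm = 500
r_null = np.array([loocv_r(x, rng.permutation(y)) for _ in range(n_perm)])
# One-tailed p-value: how often a shuffled run does at least as well
p_perm = (1 + np.sum(r_null >= r_true)) / (1 + n_perm)
print(f"observed r = {r_true:.2f}, permutation p = {p_perm:.3f}")
```

The "+1" terms make the p-value conservative and never exactly zero, since the observed result is counted as one member of the null distribution.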
The method most directly comparable to CPM may be the multivariate-prediction and univariate-regression method used by the HCP Netmats MegaTrawl release15 (https://db.humanconnectome.org/megatrawl/index.html). This set of algorithms uses independent component analysis and partial correlation to generate connectivity matrices from resting-state fMRI data. These matrices are then related to behavior using elastic-net feature selection and prediction with 10-fold cross-validation (an inner loop for parameter optimization and an outer loop for prediction evaluation). The main differences between this approach and the proposed CPM approach are: (1) use of group-wise ICA to derive subject-specific functional brain subunits (and associated time courses) versus use of an existing functional brain atlas registered to each subject; (2) use of partial correlation versus Pearson correlation to measure connectivity; (3) use of the elastic-net algorithm versus Pearson correlation with the behavioral measure to select meaningful edges; and (4) use of the elastic-net algorithm for prediction versus use of a linear model on mean connectivity strength. The MegaTrawl approach is computationally more complex and requires substantial expertise in optimization. We focus on the use of a purely linear model that can be easily implemented with basic programming skills. While no direct comparison has been made, both methods perform similarly for predicting fluid intelligence (see Finn et al.4 and Smith et al.16).
Another alternative method for developing predictive models from brain connectivity data is support vector regression (SVR)17, an extension of the support vector machine classification framework to continuous data. In this approach, rather than performing mass univariate calculations to select relevant features (edges) and combining these into a single statistic for each subject, a supervised learning algorithm considers all features simultaneously and generates a model that assigns different weights to different features in order to best approximate each observation (distinct behavioral measurement) in the training set. Features from the test subject(s) are then combined using the same weights, and the trained model
outputs a predicted behavioral score. See Dosenbach et al.18 for an example of SVR applied to functional connectivity data to predict subject age. A comparison between CPM and SVR in terms of performance and running time is provided in the Supplementary Information and Supplementary Table S1.
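For comparison, here is a hedged sketch of the SVR alternative on the same kind of synthetic data, assuming scikit-learn is available (the cited studies used their own implementations and settings):

```python
import numpy as np
from scipy import stats
from sklearn.svm import SVR
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic data: 40 subjects, 50 edges, the first 5 carrying real signal
rng = np.random.default_rng(3)
behavior = rng.normal(size=40)
edges = rng.normal(size=(40, 50))
edges[:, :5] += behavior[:, None]

# Unlike CPM's select-then-summarize strategy, SVR fits one supervised
# model that weighs all edges simultaneously
model = SVR(kernel="linear", C=1.0)
preds = cross_val_predict(model, edges, behavior, cv=LeaveOneOut())
r, _ = stats.pearsonr(preds, behavior)
print(f"leave-one-out SVR: r = {r:.2f}")
```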
Finally, many studies have used similar machine-learning techniques in a classification framework to distinguish healthy control participants from patients using connectivity data. Reliable classification of patients has been shown in several disorders, including ADHD19, autism20,21, schizophrenia22, Alzheimer's disease23, and depression24. A fundamental difference between most classification methods and CPM (or the multivariate-prediction and univariate-regression method or the SVR-based methods described above) is that in classification the outcome variable is discrete (often binary) rather than continuous. Prediction of individual differences in a continuous measure across a healthy sample is considerably more challenging than binary classification of disease state. Variations in behavior among healthy participants generally have substantially smaller effect sizes than differences due to pathology. In addition, accurate prediction of continuous variables requires accurate modeling over the whole range of the variable, whereas accurate binary classification largely requires accurate grouping of participants near the margin. When subsets of participants are distributed far from the margin, their correct classification is often guaranteed.
While SVR and related multivariate methods can provide good predictive power, in our
experience, predictions generated using CPM are often as good or better than those
generated using SVR, and CPM has at least two advantages over multivariate methods. First,
from a practical standpoint, CPM is simpler to implement and requires less expertise in
machine learning. This makes it more accessible to the general neuroimaging community. It
is our hope that in providing this protocol, we can encourage researchers to perform cross-
validated analyses of the brain-behavior relationships they discover, which will set more
rigorous statistical standards for the field and improve replicability across studies.
The second major advantage of the CPM approach compared with multivariate methods is that the predictive networks obtained by CPM can be clearly interpreted. It is a frequently overlooked problem in the literature that interpreting the weights generated by multivariate regression models, even linear ones, is not straightforward25. For example, researchers often erroneously equate large weights with greater importance, and nonlinear models are harder still to interpret. CPM allows researchers to rigorously test the predictive value of a brain-behavior relationship while still providing a one-to-one mapping back to the original feature space, so that researchers can visualize and investigate the underlying brain connections contributing to the model. This is critical for comparing results with the existing literature, generating new hypotheses about network structure and function, and advancing our understanding of functional brain organization in general.
Limitations
CPM is based on linear relationships, typically with a slope and an intercept (i.e., y = mx + b). These models may not be optimal for capturing complex, non-linear relationships between connectivity and behavior. Higher-order polynomial terms could be added to the model (i.e.,