A Guide for Population-based Analysis of the Adolescent Brain Cognitive Development (ABCD) Study Baseline Data

doi:10.1101/2020.02.10.942011



Steven G. Heeringa¹, Patricia A. Berglund¹

10 Feb 2020 - bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: This guide will present results of an empirical investigation of the ABCD baseline data that compares the statistical efficiency of multi-level modeling and distribution-free design-based approaches (both weighted and unweighted) to analyses of the ABCD baseline data.


Abstract: ABCD is a longitudinal, observational study of U.S. children, ages 9-10 at baseline, recruited at random from the household populations in defined catchment areas for each of 21 study sites. The 21 geographic locations that comprise the ABCD research sites are nationally distributed and generally represent the range of demographic and socio-economic diversity of the U.S. birth cohorts that comprise the ABCD study population. The clustering of participants and the potential for selection bias in study site selection and enrollment are features of the ABCD observational study design that are informative for statistical estimation and inference. Both multi-level modeling and robust survey design-based methods can be used to account for clustering of sampled ABCD children in the 21 study sites. Covariate controls in analytical models and propensity weighting methods that calibrate ABCD weighted distributions to nationally-representative controls from the American Community Survey (ACS) can be employed in analysis to account for known informative sample design features or to attenuate potential demographic and socio-economic selection bias in the national sampling and recruitment of eligible children. This guide will present results of an empirical investigation of the ABCD baseline data that compares the statistical efficiency of multi-level modeling and distribution-free design-based approaches (both weighted and unweighted) to analyses of the ABCD baseline data. Specific recommendations will be provided for researchers on robust, efficient approaches to both descriptive and multivariate analyses of the ABCD baseline data.


Summary


I. Introduction

The Adolescent Brain Cognitive Development Study (ABCD) is a prospective cohort study of a baseline sample of U.S. children born during the period 2006-2008.
Eligible children, ages 9-10, were recruited from the household populations in defined catchment areas for each of 21 study sites during the roughly two year period beginning September 2016 and ending in October of 2018.
This methodological paper describes alternative approaches to analysis of the rich array of social, behavioral, environmental, genetic and summary-level neuroimaging data that is collected in the ABCD study.
Features of the ABCD design and data that are statistically "informative" and complicate population estimation and inference are the subject of Section 3.

2. Population orientation to ABCD analysis

As described in Garavan et al. (2018), within each of the 21 ABCD study sites, a probability sample of the public and private schools was selected as the basis for the recruitment of the majority of eligible children to the ABCD baseline cohort.
The process of obtaining school cooperation and then parental consent could selectively impact the final characteristics of the sample that was actually observed.
The following sections will describe two approaches, propensity-based weighting and use of appropriate covariate controls in modeling, that aim to address potential selectivity that may have entered the ABCD cohort through the site selection or school/parental consent gateways to actual study participation.

5. Properties of the ABCD Baseline Sample Cohort in Comparison to ACS

The unweighted distribution of reported annual family incomes for the ABCD baseline cohort differs from the nationally representative ACS estimates for the U.S. population of 9- and 10-year-olds.
In nominal dollars, the family incomes of the ABCD children are higher on average than ACS estimates for the comparable U.S. population.
The ABCD Passive Data Work Group is currently in the process of acquiring external data on school, community and environmental characteristics that can be linked to individual child data and used to analyze the role that these contextual effects may have on the current status and development trajectories of the children in the ABCD baseline cohort.
At this stage, the propensity-based population weighting methodology described in the next section does not incorporate calibration based on detailed characteristics of children's residences, schools or communities.

6. Weighting the ABCD Sample to ACS Population Controls

Following the step of trimming the extremes of the weight distribution, the R Rake iterative proportional fitting algorithm was used to "rake" the trimmed initial weights to exact ACS population counts for the marginal categories of age (9, 10), sex (female, male), and race/ethnicity (Hispanic, Black, White, Asian and all Other persons); see Table 3.
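The raking step can be illustrated with a minimal sketch of iterative proportional fitting. This is not the authors' code (the paper used an R routine); the function name `rake`, the toy records and the population totals below are hypothetical, and the starting weights are assumed to have been trimmed already.

```python
from collections import defaultdict


def rake(records, weights, margins, max_iter=500, tol=1e-8):
    """Iterative proportional fitting (raking), illustrative only.

    records: list of dicts giving each case's category for every raking variable
    weights: starting (already trimmed) weights, one per record
    margins: {variable: {category: population_total}}
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total in each category of this variable.
            totals = defaultdict(float)
            for rec, wi in zip(records, w):
                totals[rec[var]] += wi
            # Scale every weight so the category totals hit the targets.
            for i, rec in enumerate(records):
                factor = targets[rec[var]] / totals[rec[var]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:  # all margins matched within tolerance
            break
    return w


# Hypothetical toy sample and population totals (not ACS figures).
records = [
    {"age": 9, "sex": "F"},
    {"age": 9, "sex": "M"},
    {"age": 10, "sex": "F"},
    {"age": 10, "sex": "M"},
    {"age": 10, "sex": "M"},
]
start_weights = [1.0] * len(records)
margins = {
    "age": {9: 40.0, 10: 60.0},
    "sex": {"F": 45.0, "M": 55.0},
}
raked = rake(records, start_weights, margins)
```

Each pass rescales the weights to one margin at a time; the margins must sum to the same population total, and every category needs at least one sampled case, for the iteration to settle.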
Figure 1 is a histogram display of the frequency distribution of the final population weights for the ABCD baseline children.
Figure 2 provides a boxplot comparison of the distribution of weights separately for boys and girls.
The Figure 3 boxplots of weights by family income category show a very different pattern.
Compared to the national population, children from families with lower incomes are underrepresented and the population weights for children in these lower income categories have higher average values and a greater variance than the weights for the children from higher income families.

7. Comparison of Analysis Methods

The comparative results for these two regression models suggest that, when the special ABCD twin sample data are pooled with the general population sample and an LMM approach is used, it is important to apply the three-level DEAP model that includes a level-two contribution for clustering within family unit.
When a two-level model is applied to these pooled data and family-level clustering is ignored, the results of these example analyses suggest the parameter estimates will be attenuated and estimated standard errors will be seriously overestimated.
If the two-level LMM is fitted using only data for the general population sample (excluding the special twin sample cases), the resulting parameter estimates and standard errors are more consistent with those for the three-level model.

7.B.3 Three-level LMM vs. Design-based Population weighted LS and Robust SEs

As noted above, the unweighted LMM and weighted design-based approaches compared in Table 7 aim to capture/model the complex variance structure of clustering and non-independence of the baseline observations for the ABCD child cohort.
The design-based estimation approaches employ the population weights described in Section 6 above and use a weighted least squares (WLS) methodology to estimate the population regression parameters.
Unlike the LMM approach, the components of variance associated with each level of clustering are estimated as a single weighted aggregate for the residual variance and not as individual components of variance attributable to each level of the clustering.
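As a rough illustration of the design-based side of this comparison, the sketch below fits a weighted least squares line and computes a cluster-robust (sandwich) standard error. It is an assumption-laden toy, not the analysis in Table 7: it handles one predictor plus an intercept, treats the clusters as a single stratum, and the names `wls_cluster_robust`, `x`, `y`, `w` and `site` are hypothetical.

```python
import math
from collections import defaultdict


def wls_cluster_robust(x, y, w, cluster):
    """Weighted least squares for y = b0 + b1*x with a cluster-robust sandwich variance."""
    # "Bread": A = X'WX for design rows (1, x_i), inverted in closed form.
    a00 = sum(w)
    a01 = sum(wi * xi for wi, xi in zip(w, x))
    a11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    c0 = sum(wi * yi for wi, yi in zip(w, y))
    c1 = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = a00 * a11 - a01 * a01
    inv = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]
    b0 = inv[0][0] * c0 + inv[0][1] * c1
    b1 = inv[1][0] * c0 + inv[1][1] * c1

    # "Meat": weighted score contributions summed within each cluster.
    scores = defaultdict(lambda: [0.0, 0.0])
    for xi, yi, wi, g in zip(x, y, w, cluster):
        resid = yi - b0 - b1 * xi
        scores[g][0] += wi * resid
        scores[g][1] += wi * resid * xi
    n_clusters = len(scores)  # needs at least two clusters
    adj = n_clusters / (n_clusters - 1)  # small-sample correction
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for s in scores.values():
        for r in range(2):
            for c in range(2):
                meat[r][c] += adj * s[r] * s[c]

    # Sandwich: V = A^-1 * meat * A^-1.
    tmp = [[sum(inv[r][k] * meat[k][c] for k in range(2)) for c in range(2)]
           for r in range(2)]
    var = [[sum(tmp[r][k] * inv[k][c] for k in range(2)) for c in range(2)]
           for r in range(2)]
    # max(..., 0.0) guards against tiny negative values from rounding.
    se = (math.sqrt(max(var[0][0], 0.0)), math.sqrt(max(var[1][1], 0.0)))
    return (b0, b1), se


# Hypothetical toy data: six children in three sites.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.9, 5.2, 6.8, 9.1, 11.0]
w = [1.5, 1.5, 0.8, 0.8, 1.2, 1.2]
site = ["a", "a", "b", "b", "c", "c"]
demo_coef, demo_se = wls_cluster_robust(x, y, w, site)
```

Note that the residual variation is absorbed into one aggregate robust variance; no separate site-level or family-level variance components are produced, which is the contrast with the LMM drawn above.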
7.C Generalized Linear Model: Poisson Regression, comparison of model fitting methods.
Here again, as in the previous comparisons based on the linear regression model, the three-level DEAP LMM and the design-based estimation for the pooled data show minor differences in the estimated relative risks and confidence intervals but the magnitude of these differences would not be judged to be substantively important.

8. Summary: Recommendations for research analysts

Researchers are encouraged to consider each of the informative features of the ABCD (clustering, sample selectivity, twin sample pooling) as they may apply to their analytic aims.
Sensitivity analyses such as those underlying the comparisons in Section 7 should provide good insight into the degree to which results for descriptive estimates or fitted models are influenced by clustering, weighting and twin sample pooling.


Figures (12)

Table 2: Logistic Regression Model for ABCD Sample Propensity Scores.

Table 3: U.S. Population Totals for Final Raking Step in ABCD Population Weight Calculation.

Figure 1: Distribution of ABCD Baseline Population Weights.

Table 7 aim to capture/model the complex variance structure of clustering and non-independence of the baseline observations for the ABCD child cohort. The design-based estimation approaches employ the population weights described in Section 6 above and use a weighted least squares (WLS) methodology to estimate the population regression parameters. Under the design-based approaches, a Taylor Series Approximation (or sandwich estimator) is used to compute robust estimates of standard errors. However, unlike the LMM approach, the components of variance associated with each level of clustering are estimated as a single weighted aggregate for the residual variance and not as individual components of variance attributable to each level of the clustering. The three-level LMM used here does not include population weighting in estimating the regression parameters. The three-level LMM does produce estimates of the variance

Figure 3: Distributions of ABCD Analysis Weights by Family Income Category

Figure 2: Distribution of ABCD Baseline Population Weights by Sex of Child

Table 9: Poisson Regression for Count of Lifetime ER Visits. Source: ABCD Baseline.

Table 4: ABCD Baseline Weighted and Propensity Weighted Estimates of Population Demographics.

Table 1: ABCD Baseline Cohort Demographic and Socio-Economic Characteristics (Unweighted).

Table 8: Poisson Regression of Lifetime ER Visit Counts. Source: ABCD Baseline.







A Guide for Population-based Analysis of the Adolescent Brain Cognitive Development (ABCD) Study Baseline Data

Steven G. Heeringa and Patricia A. Berglund.

Institute for Social Research, University of Michigan

June, 2019


I. Introduction

The Adolescent Brain Cognitive Development Study (ABCD) is a prospective cohort study of a baseline sample of U.S. children born during the period 2006-2008. Eligible children, ages 9-10, were recruited from the household populations in defined catchment areas for each of 21 study sites during the roughly two-year period beginning September 2016 and ending in October of 2018. Within study sites, consenting parents and assenting children were primarily recruited through a probability sample of public and private schools augmented to a small extent by special recruitment through summer camp programs and community volunteers. Approximately 9500 eligible, single-born children and 1600 eligible twins completed the ABCD baseline imaging studies and assessments. The sample design and procedures employed in the recruitment of the baseline sample are described in detail in Garavan et al. (2018).

This methodological paper describes alternative approaches to analysis of the rich array of social, behavioral, environmental, genetic and summary-level neuroimaging data that is collected in the ABCD study. Section 2 will attempt to frame a response to the broad question: why should ABCD analysts be concerned about estimation and inference for the population of U.S. children, and does external validity matter? Features of the ABCD design and data that are statistically “informative” and complicate population estimation and inference are the subject of Section 3. Section 4 will attempt to address the specific methodological question, “If inference to the U.S. population is important, what are the appropriate choices of methods for estimating population characteristics and relationships based on the ABCD data?”, describing both model-based and design-based approaches to ABCD estimation and inference. A summary of the general demographic and socio-economic characteristics for the ABCD baseline cohort before any weighting adjustments are applied is presented in Section 5. Section 6 describes the propensity-based weighting adjustment methodology that is used to calibrate the baseline sample cohort to key demographic and socio-economic distributions for U.S. children ages 9 and 10 estimated from the American Community Survey (ACS). Section 7 presents results of an empirical investigation of the ABCD baseline data that compares the statistical efficiency of multi-level modeling and distribution-free design-based approaches (both weighted and unweighted) to analyses of the ABCD baseline data. The paper concludes with specific recommendations for researchers on approaches to both descriptive and multivariate analyses of the ABCD baseline data. Appendices to this paper will contain illustrations of recommended command syntax for analysis of the ABCD data using the major software packages.

bioRxiv preprint doi: https://doi.org/10.1101/2020.02.10.942011; this version posted February 10, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

2. Population orientation to ABCD analysis

As defined in Garavan et al. (2018), the label “population neuroscience” when applied to observational studies such as ABCD refers to the application of epidemiological research practices including large-scale representative samples to assessments of target populations. It is a study in neuroscience in that it focuses on brain and neurological system development, morphology and function. It is a population study in that observational data are gathered in such a way that they can be used to understand real population distributions and the biological, familial, social and environmental factors that can govern how individuals actually live and grow in today’s society.

From the outset, ABCD’s primary sponsor, the National Institute on Drug Abuse (NIDA), and the ABCD scientific investigators were motivated to develop a baseline sample that reflected the sociodemographic variation present in the U.S. population of 9- and 10-year-old children. ABCD is an observational study sharing many aspects of its longitudinal design with existing population-based survey programs such as the National Longitudinal Study of Adolescent to Adult Health (Add Health, https://www.cpc.unc.edu/projects/addhealth), the Early Childhood Longitudinal Surveys (ECLS, https://nces.ed.gov/ecls/) or the Child Development Supplement (CDS, http://src.isr.umich.edu/src/child-development/home.html) to the Panel Study of Income Dynamics (PSID).




Population representativeness or, more correctly, absence of uncorrected selective or informative bias in the subject pool is important in achieving external validity, that is, the ability to generalize specific results of the study to the world at large. However, even with good, representative samples of populations, failure to measure or control key factors or to recognize important moderating and/or mediating relationships can impact external validity of study findings. The ABCD data are observational and although propensity-based methods may be used to control for characteristics of “treated” and “control” participants, in the strictest sense insights gained from the data, even in longitudinal studies such as ABCD, will be associative.

The ABCD baseline recruitment effort worked very hard to maintain a nationally distributed set of controls on the age, sex and race/ethnicity of the children in the study. In year 2, additional monitoring and targeted recruitment were put in place to raise the proportion of children from lower income families. The predominantly probability sampling methodology for recruiting children within each study site was intended to randomize over confounding factors that were not explicitly controlled (or subsequently reflected in the propensity weighting). Nevertheless, school consent and parental consent were strong forces that may well have altered the effectiveness of the randomization over these uncontrolled confounders.

The purpose of covariate adjustments in models or the propensity weighting described below in

Section 6 is in fact to control specific sources of selection bias and restore unbiasedness to

descriptive and analytical estimates of the population characteristics and relationships. For many

measures of substantive interest, the success of this effort will never be fully known except in

rare cases where comparative national benchmarks exist (e.g. children's height) from

administrative records or very large surveys or population censuses. The effectiveness of

weighting adjustments to eliminate bias in population estimates depends of course on the

relationship of the substantive variable of interest (e.g. amygdala volume) to the variables that

were explicitly used to derive the propensity weights, namely age, sex, race/ethnicity, family

type, parental employment status, family size and Census region. These are the types of

variables that are available and are identically measured in a national source (American

Community Survey) and ABCD. It would have been ideal to have detailed population level data

on many other characteristics that may be highly correlated with the ABCD variable of interest

(e.g. the child's parents' amygdala volume when mom and dad were age 9,10). Only rarely and

in large two-phase studies will we ever have population level statistical controls of this nature for

a small group such as 9,10 year olds.

"Representative" is a strong adjective to apply to any data set. The accuracy of the descriptor

will vary by variable, by subpopulation and by the extent to which the weighting methodology or

model covariates capture factors that truly affect the outcome of interest (both in terms of the

variables and their functional relationship to the outcome). All forms of statistical estimation

and inference make assumptions. No study gets an uncontestable stamp of approval on the

unbiasedness of its survey estimates. In both approaches (propensity weighting or covariate

adjustment in modeling) it is easy to overlook a selective factor that influences the outcome or

modifies the effect of other variables. That is an inherent challenge in population inference from

a national study such as ABCD. The position that we take here is that multilevel models that

include appropriate statistical controls for demographic and socio-economic factors or propensity

weighted estimates of descriptive statistics from the ABCD baseline are in fact publishable

estimates for the population of U.S. children so long as authors acknowledge the design and

accurately describe the underlying methodology and its assumptions.

3. Properties of the ABCD design and data to consider in analysis.

This section describes three features of the ABCD design that must be considered in any analysis

of the baseline data.

Clustering and non-independence of observations: Cohort recruitment for the ABCD study

design was distinguished by the constraint that eligible children must live within reasonable

travel distance (e.g. 50 miles) of a major medical center or research facility where MRI and

fMRI imaging could be performed. The geographically-clustered observations on individual

children are not independent and the intraclass (“intra-site”) correlations for the many variables

must be accounted for to correctly estimate variances of descriptive estimates and analytical

model parameters. Correlations among the ABCD observations for individual children are also

introduced by other sources of clustering in the ABCD recruitment and measurement protocols:

selection of multiple students from schools, multiple children (including twins) recruited from

the same family, multiple children imaged on the same MRI scanner.
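A standard way to gauge how much such clustering inflates variances is Kish's approximate design effect. It is not part of the paper's derivations; the symbols $\bar{m}$ (average cluster size) and $\rho$ (intraclass correlation) and the value $\rho = 0.01$ are illustrative assumptions, while the counts of 11,874 children and 21 sites come from the text.

```latex
\mathrm{deff} \approx 1 + (\bar{m} - 1)\,\rho,
\qquad
\bar{m} = \frac{11874}{21} \approx 565,
\qquad
\rho = 0.01 \;\Rightarrow\; \mathrm{deff} \approx 1 + 564 \times 0.01 = 6.64
```

Under these assumed values, even a small intra-site correlation would multiply the variance of a mean several times over relative to simple random sampling, which is why the correlations cannot be ignored. The approximation treats sites as equal in size, so it is only a rough guide.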

Selection bias in site choice and within-site subject enrollment: While the 21 geographic

locations that comprise the ABCD research sites are nationally distributed and generally

represent the range of demographic and socio-economic diversity of the U.S. birth cohorts that

comprise the ABCD study population, in the restricted sense they do not constitute the primary

stage of a multi-stage probability sample such as those employed in major population-based

epidemiological surveys. To achieve population representativeness for statistical analyses, a

mechanism (e.g. modeling site characteristics, assuming pseudo-randomization) is needed to

calibrate the broader geographic, demographic and socio-economic characteristics of the set of

21 sites to the larger U.S. population framework (Olsen et al., 2013).

As described in Garavan et al. (2018), within each of the 21 ABCD study sites, a probability sample of the public and private schools was selected as the basis for the recruitment of the majority of eligible children to the ABCD baseline cohort. Although this school-based recruitment approach within each site introduced randomization to the sample of students who could be recruited to ABCD, the process of obtaining school cooperation and then parental consent could selectively impact the final characteristics of the sample that was actually observed. The following sections will describe two approaches, propensity-based weighting and use of appropriate covariate controls in modeling, that aim to address potential selectivity that may have entered the ABCD cohort through the site selection or school/parental consent gateways to actual study participation.

Special twin supplement: A final feature of the ABCD design that deserves attention in the analysis of the baseline cohort data is the special oversample of twin pairs in four of the 21 ABCD sites. Although twins were eligible to be recruited in all sites that used the school-based recruitment sampling methodology, in the four special twin sites supplemental samples of 150-250 twin pairs per site were enrolled in ABCD using samples selected from state registries (Garavan et al., 2018). These special samples of twin pairs can be distinguished in the final baseline cohort of n=11,874 children; however, the study has chosen not to explicitly segregate these twin data from the general population sample of single births and incidental twins recruited through the school-based sampling protocol.

By a default decision of the study team, the propensity-based population weighting methodology

described in Section 6 and incorporated in the ABCD Data Exploration and Analysis Portal

(DEAP) descriptive estimation does assume a pooled analysis of the general and special twin

samples. Section 7 will apply multiple analytic approaches to investigate this assumption that

the special twin samples are in fact “exchangeable” with the ABCD general population sample.

4. Design-based and model-based approaches to ABCD analysis

Analysts may choose several approaches to estimation and inference that address the challenges posed by the clustering, selection bias and special twin sample properties of the ABCD data.

The first approach is to assume that the multi-stage sample selection for ABCD follows a quasi-probability design and employ design-based methodology similar to that typically used to analyze large probability sample epidemiological surveys such as the U.S. National Health and Nutrition Examination Survey (NHANES). Design-based analysis will employ population weighting to estimate population statistics and model parameters and non-parametric methods (Taylor Series Linearization, Jackknife, and Bootstrap) to compute robust estimates of standard errors. Any quasi-probability approach to analysis of the ABCD data requires a minimum of two things: 1) assignment of cases to ultimate cluster (UC) groupings to account for non-independence of observations; and 2) modeling to derive case-specific analysis weights that account for differential selection factors and permit the observed sample to be “mapped” to the U.S. population of interest (Heeringa et al., 2017).
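The two ingredients named above, ultimate cluster assignments and case-specific weights, are enough to sketch a design-based estimate of a population mean with a Taylor series linearization standard error. The sketch below is illustrative only: it assumes a single stratum with clusters treated as sampled with replacement, and the function name `weighted_mean_linearized` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def weighted_mean_linearized(y, w, cluster):
    """Weighted mean with a Taylor series linearization SE over ultimate clusters."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized variate for the ratio sum(w*y) / sum(w), totaled within each cluster.
    z = defaultdict(float)
    for yi, wi, g in zip(y, w, cluster):
        z[g] += wi * (yi - mean) / total_w

    n_clusters = len(z)  # needs at least two ultimate clusters
    z_bar = sum(z.values()) / n_clusters
    # Between-cluster variance of the linearized totals.
    var = n_clusters / (n_clusters - 1) * sum((zg - z_bar) ** 2 for zg in z.values())
    return mean, math.sqrt(var)


# Hypothetical toy data: outcome, analysis weight and ultimate cluster per child.
y = [10.0, 12.0, 9.0, 15.0, 11.0, 13.0]
w = [200.0, 180.0, 350.0, 320.0, 150.0, 160.0]
uc = ["site1", "site1", "site2", "site2", "site3", "site3"]
est_mean, est_se = weighted_mean_linearized(y, w, uc)
```

With equal weights and every case in its own cluster, this reduces to the familiar standard error of a simple random sample mean, which is a convenient sanity check on the formula.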
