
Showing papers in "Genetic Epidemiology in 2017"


Journal ArticleDOI
TL;DR: Pseudovalidation often resulted in prediction accuracy comparable to using a dataset with a validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value; lassosum itself was also substantially faster and more accurate than the recently proposed LDpred.
Abstract: Polygenic scores (PGS) summarize the genetic contribution of a person's genotype to a disease or phenotype. They can be used to group participants into different risk categories for diseases, and are also used as covariates in epidemiological analyses. A number of possible ways of calculating PGS have been proposed, and recently there is much interest in methods that incorporate information available in published summary statistics. As there is no inherent information on linkage disequilibrium (LD) in summary statistics, a pertinent question is how we can use LD information available elsewhere to supplement such analyses. To answer this question, we propose a method for constructing PGS using summary statistics and a reference panel in a penalized regression framework, which we call lassosum. We also propose a general method for choosing the value of the tuning parameter in the absence of validation data. In our simulations, we showed that pseudovalidation often resulted in prediction accuracy comparable to using a dataset with a validation phenotype and was clearly superior to the conservative option of setting the tuning parameter of lassosum to its lowest value. We also showed that lassosum achieved better prediction accuracy than simple clumping and P-value thresholding in almost all scenarios. It was also substantially faster and more accurate than the recently proposed LDpred.
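To make the penalized-regression idea concrete, here is a minimal sketch of lasso on summary statistics, the flavor of what lassosum does. The objective, toy data, and fixed tuning parameter are illustrative simplifications, not the published implementation (which additionally shrinks the reference LD matrix and chooses the tuning parameter by pseudovalidation).

```python
import numpy as np

def soft_threshold(u, lam):
    """Soft-thresholding operator used in lasso coordinate descent."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def summary_stat_lasso(r, R, lam, n_iter=100):
    """Coordinate descent for: min_beta beta'R beta - 2 r'beta + 2*lam*||beta||_1,
    where r holds marginal SNP-phenotype correlations (from summary statistics)
    and R is an LD correlation matrix from a reference panel (diag(R) = 1)."""
    beta = np.zeros(len(r))
    for _ in range(n_iter):
        for j in range(len(r)):
            # partial residual: marginal signal minus LD-induced contributions
            u = r[j] - R[j] @ beta + beta[j]  # adds back R[j, j] * beta[j]
            beta[j] = soft_threshold(u, lam)
    return beta

# toy example: 5 SNPs in a block of LD, one causal variant
R = 0.3 * np.ones((5, 5)) + 0.7 * np.eye(5)
r = R @ np.array([0.2, 0.0, 0.0, 0.0, 0.0])  # implied marginal correlations
print(summary_stat_lasso(r, R, lam=0.05))    # recovers a sparse estimate
```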

251 citations


Journal ArticleDOI
TL;DR: A multiethnic polygenic risk score that combines training data from European samples and training data from the target population is introduced to reduce the gap in polygenic risk prediction accuracy between European and non-European target populations.
Abstract: Methods for genetic risk prediction have been widely investigated in recent years. However, most available training data involves European samples, and it is currently unclear how to accurately predict disease risk in other populations. Previous studies have used either training data from European samples in large sample size or training data from the target population in small sample size, but not both. Here, we introduce a multiethnic polygenic risk score that combines training data from European samples and training data from the target population. We applied this approach to predict type 2 diabetes (T2D) in a Latino cohort using both publicly available European summary statistics in large sample size (Neff = 40k) and Latino training data in small sample size (Neff = 8k). We attained a >70% relative improvement in prediction accuracy (from R² = 0.027 to 0.047) compared to methods that use only one source of training data, consistent with large relative improvements in simulations. We observed a systematically lower load of T2D risk alleles in Latino individuals with more European ancestry, which could be explained by polygenic selection in ancestral European and/or Native American populations. We also predicted T2D in a South Asian UK Biobank cohort using European (Neff = 40k) and South Asian (Neff = 16k) training data and again attained a >70% relative improvement in prediction accuracy, and an application predicting height in an African UK Biobank cohort using European (N = 113k) and African (N = 2k) training data attained a 30% relative improvement. Our work reduces the gap in polygenic risk prediction accuracy between European and non-European target populations.
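The combination step itself is simple. Below is a sketch under the assumption that the mixing weights are fit by least squares in a small validation set from the target population; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def fit_mixing_weights(prs_eur, prs_target, y_val):
    """Fit y ~ a0 + a1*PRS_EUR + a2*PRS_target in a target-population
    validation cohort; the fitted (a1, a2) then weight the two scores
    when predicting in new samples from that population."""
    X = np.column_stack([np.ones_like(prs_eur), prs_eur, prs_target])
    a, *_ = np.linalg.lstsq(X, y_val, rcond=None)
    return a[1], a[2]
```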

234 citations


Journal ArticleDOI
TL;DR: Two novel IV methods, a fractional polynomial method and a piecewise linear method, were used to investigate the shape of the relationship of body mass index with systolic blood pressure and diastolic blood pressure.
Abstract: Mendelian randomization, the use of genetic variants as instrumental variables (IV), can test for and estimate the causal effect of an exposure on an outcome. Most IV methods assume that the function relating the exposure to the expected value of the outcome (the exposure-outcome relationship) is linear. However, in practice, this assumption may not hold. Indeed, often the primary question of interest is to assess the shape of this relationship. We present two novel IV methods for investigating the shape of the exposure-outcome relationship: a fractional polynomial method and a piecewise linear method. We divide the population into strata using the exposure distribution, and estimate a causal effect, referred to as a localized average causal effect (LACE), in each stratum of the population. The fractional polynomial method performs metaregression on these LACE estimates. The piecewise linear method estimates a continuous piecewise linear function, the gradient of which is the LACE estimate in each stratum. Both methods were demonstrated in a simulation study to estimate the true exposure-outcome relationship well, particularly when the relationship was a fractional polynomial (for the fractional polynomial method) or was piecewise linear (for the piecewise linear method). The methods were used to investigate the shape of the relationship of body mass index with systolic blood pressure and diastolic blood pressure.
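A stripped-down sketch of the stratified (LACE) step: stratify on the "IV-free" exposure rather than the exposure itself (stratifying on the exposure would induce collider bias), then compute a Wald ratio per stratum. The fractional polynomial method would then meta-regress these estimates against the stratum exposure means; the stratification rule and names below are simplifications, not the authors' code.

```python
import numpy as np

def lace_estimates(g, x, y, n_strata=5):
    """Stratum-specific IV (Wald ratio) estimates of the localized average
    causal effect (LACE) -- a simplified sketch.

    g : genetic instrument (e.g., an allele score); x : exposure; y : outcome
    """
    # IV-free exposure: remove the instrument's contribution to the exposure
    bx_all = np.polyfit(g, x, 1)[0]
    x0 = x - bx_all * g
    cuts = np.quantile(x0, np.linspace(0, 1, n_strata + 1))
    estimates = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        idx = (x0 >= lo) & (x0 <= hi)
        bx = np.polyfit(g[idx], x[idx], 1)[0]   # G -> X within the stratum
        by = np.polyfit(g[idx], y[idx], 1)[0]   # G -> Y within the stratum
        estimates.append(by / bx)               # Wald ratio = LACE estimate
    return np.array(estimates)
```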

146 citations


Journal ArticleDOI
TL;DR: An approach based on summarized data only (genetic association and correlation estimates) uses principal components analysis to form instruments; it gives estimates that are less precise than those from variable selection approaches, but more robust to seemingly arbitrary choices in the variable selection step.
Abstract: Mendelian randomization uses genetic variants to make causal inferences about the effect of a risk factor on an outcome. With fine-mapped genetic data, there may be hundreds of genetic variants in a single gene region any of which could be used to assess this causal relationship. However, using too many genetic variants in the analysis can lead to spurious estimates and inflated Type 1 error rates. But if only a few genetic variants are used, then the majority of the data is ignored and estimates are highly sensitive to the particular choice of variants. We propose an approach based on summarized data only (genetic association and correlation estimates) that uses principal components analysis to form instruments. This approach has desirable theoretical properties: it takes the totality of data into account and does not suffer from numerical instabilities. It also has good properties in simulation studies: it is not particularly sensitive to varying the genetic variants included in the analysis or the genetic correlation matrix, and it does not have greatly inflated Type 1 error rates. Overall, the method gives estimates that are less precise than those from variable selection approaches (such as using a conditional analysis or pruning approach to select variants), but are more robust to seemingly arbitrary choices in the variable selection step. Methods are illustrated by an example using genetic associations with testosterone for 320 genetic variants to assess the effect of sex hormone related pathways on coronary artery disease risk, in which variable selection approaches give inconsistent inferences.
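The instrument-construction step can be sketched in a few lines: eigen-decompose the LD matrix and retain the leading components; summary-level associations with exposure and outcome can then be projected onto these loadings and fed into standard inverse-variance-weighted formulas. The 99% explained-variance cutoff below is an illustrative choice, not a recommendation from the paper.

```python
import numpy as np

def pca_instruments(R, var_explained=0.99):
    """Return eigenvector loadings W (p x k) of the LD matrix R whose
    leading components explain the requested share of variance; these
    components serve as instruments instead of individually selected,
    correlated variants."""
    vals, vecs = np.linalg.eigh(R)            # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]    # reorder to descending
    k = int(np.searchsorted(np.cumsum(vals) / vals.sum(), var_explained)) + 1
    return vecs[:, :k]
```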

84 citations


Journal ArticleDOI
TL;DR: It is determined that future sequencing efforts in >2,000 samples of European, Asian, or admixed ancestry should set genome-wide significance at approximately P = 5 × 10⁻⁹, and studies of African samples should apply a more stringent genome-wide significance threshold of P = 1 × 10⁻⁹.
Abstract: Genome-wide association studies (GWAS) of common disease have been hugely successful in implicating loci that modify disease risk. The bulk of these associations have proven robust and reproducible, in part due to community adoption of statistical criteria for claiming significant genotype-phenotype associations. As the cost of sequencing continues to drop, assembling large samples in global populations is becoming increasingly feasible. Sequencing studies interrogate not only common variants, as was true for genotyping-based GWAS, but variation across the full allele frequency spectrum, yielding many more (independent) statistical tests. We sought to empirically determine genome-wide significance thresholds for various analysis scenarios. Using whole-genome sequence data, we simulated sequencing-based disease studies of varying sample size and ancestry. We determined that future sequencing efforts in >2,000 samples of European, Asian, or admixed ancestry should set genome-wide significance at approximately P = 5 × 10⁻⁹, and studies of African samples should apply a more stringent genome-wide significance threshold of P = 1 × 10⁻⁹. Adoption of a revised multiple test correction will be crucial in avoiding irreproducible claims of association.
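The arithmetic behind such thresholds is plain Bonferroni: a 0.05 family-wise error rate divided by the effective number of independent tests. The effective test counts below are back-calculated for illustration only, not figures from the paper.

```python
# 0.05 FWER over ~10 million effective independent tests -> 5e-9;
# over ~50 million (more independent variation, e.g., African-ancestry
# samples) -> 1e-9
for n_eff in (1e7, 5e7):
    print(f"alpha = {0.05 / n_eff:.0e}")
```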

61 citations


Journal ArticleDOI
TL;DR: This work evaluated the performance of commonly used mediation testing methods for the indirect effect in genome‐wide mediation studies using simulation studies, and performed an epigenome-wide mediation association study in the Normative Aging Study, analyzing DNAm as a mediator of the effect of pack‐years on FEV1.
Abstract: Mediation analysis helps researchers assess whether part or all of an exposure's effect on an outcome is due to an intermediate variable. The indirect effect can help in designing interventions on the mediator as opposed to the exposure and better understanding the outcome's mechanisms. Mediation analysis has seen increased use in genome-wide epidemiological studies to test for an exposure of interest being mediated through a genomic measure such as gene expression or DNA methylation (DNAm). Testing for the indirect effect is challenged by the fact that the null hypothesis is composite. We examined the performance of commonly used mediation testing methods for the indirect effect in genome-wide mediation studies. When there is no association between the exposure and the mediator and no association between the mediator and the outcome, we show that these common tests are overly conservative. This is a case that will arise frequently in genome-wide mediation studies. Caution is hence needed when applying the commonly used mediation tests in genome-wide mediation studies. We evaluated the performance of these methods using simulation studies, and performed an epigenome-wide mediation association study in the Normative Aging Study, analyzing DNAm as a mediator of the effect of pack-years on FEV1.
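The conservativeness under the double null is easy to reproduce: simulate no exposure-mediator effect and no mediator-outcome effect, apply the classical Sobel test (one of the common tests in this family), and the empirical type I error falls far below the nominal 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sim = 500, 2000
pvals = np.empty(n_sim)
for s in range(n_sim):
    x = rng.normal(size=n)
    m = rng.normal(size=n)             # no exposure -> mediator effect
    y = 0.3 * x + rng.normal(size=n)   # no mediator -> outcome effect
    a = stats.linregress(x, m)                    # alpha-hat: X -> M
    Xd = np.column_stack([np.ones(n), x, m])      # beta-hat: M -> Y | X
    coef = np.linalg.lstsq(Xd, y, rcond=None)[0]
    resid = y - Xd @ coef
    sigma2 = resid @ resid / (n - 3)
    cov = sigma2 * np.linalg.inv(Xd.T @ Xd)
    b, se_b = coef[2], np.sqrt(cov[2, 2])
    # Sobel statistic for the indirect effect a*b
    z = (a.slope * b) / np.sqrt(a.slope**2 * se_b**2 + b**2 * a.stderr**2)
    pvals[s] = 2 * stats.norm.sf(abs(z))
print("empirical type I error at 0.05:", np.mean(pvals < 0.05))  # far below 0.05
```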

61 citations


Journal ArticleDOI
TL;DR: A formal statistical framework for quantifying the evidence of generalization is provided that accounts for the (in)consistency between the directions of associations in the discovery and follow‐up studies and finds that it is often beneficial to use a more lenient P‐value threshold than the genome‐wide significance threshold.
Abstract: In genome-wide association studies (GWAS), "generalization" is the replication of genotype-phenotype association in a population with different ancestry than the population in which it was first identified. Current practices for declaring generalizations rely on testing associations while controlling the family-wise error rate (FWER) in the discovery study, then separately controlling error measures in the follow-up study. This approach does not guarantee control over the FWER or false discovery rate (FDR) of the generalization null hypotheses. It also fails to leverage the two-stage design to increase power for detecting generalized associations. We provide a formal statistical framework for quantifying the evidence of generalization that accounts for the (in)consistency between the directions of associations in the discovery and follow-up studies. We develop the directional generalization FWER (FWERg) and FDR (FDRg) controlling r-values, which are used to declare associations as generalized. This framework extends to generalization testing when applied to a published list of single nucleotide polymorphism (SNP)-trait associations. Our methods control FWERg or FDRg under various SNP selection rules based on P-values in the discovery study. We find that it is often beneficial to use a more lenient P-value threshold than the genome-wide significance threshold. In a GWAS of total cholesterol in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), when testing all SNPs with P-values < 5 × 10⁻⁸ (15 genomic regions) for generalization in a large GWAS of whites, we generalized SNPs from 15 regions. But when testing all SNPs with P-values < 6.6 × 10⁻⁵ (89 regions), we generalized SNPs from 27 regions.

41 citations


Journal ArticleDOI
TL;DR: The multivariate microbiome regression-based kernel association test (MMiRKAT) is proposed for testing association between multiple continuous outcomes and overall microbiome composition, where the kernel used in MMiRKAT is based on Bray-Curtis or UniFrac distance.
Abstract: High-throughput sequencing technologies have enabled large-scale studies of the role of the human microbiome in health conditions and diseases. Microbial community-level association testing, a critical step in establishing the connection between overall microbiome composition and an outcome of interest, is now routinely performed in many studies. However, current microbiome association tests all focus on a single outcome. It has become increasingly common for a microbiome study to collect multiple, possibly related, outcomes to maximize the power of discovery. As these outcomes may share common mechanisms, jointly analyzing these outcomes can amplify the association signal and improve statistical power to detect potential associations. We propose the multivariate microbiome regression-based kernel association test (MMiRKAT) for testing association between multiple continuous outcomes and overall microbiome composition, where the kernel used in MMiRKAT is based on Bray-Curtis or UniFrac distance. MMiRKAT directly regresses all outcomes on the microbiome profiles via a semiparametric kernel machine regression framework, which allows for covariate adjustment and evaluates the association via a variance-component score test. Because most of the current microbiome studies have small sample sizes, a novel small-sample correction procedure is implemented in MMiRKAT to correct for the conservativeness of the association test when the sample size is small or moderate. The proposed method is assessed via simulation studies and an application to a real data set examining the association between host gene expression and mucosal microbiome composition. We demonstrate that MMiRKAT is more powerful than the large-sample-based multivariate kernel association test, while controlling the type I error. A free implementation of MMiRKAT in R language is available at http://research.fhcrc.org/wu/en.html.
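A single-outcome sketch of the kernel machinery: Gower-center a Bray-Curtis (or UniFrac) distance matrix into a kernel and evaluate a variance-component-style score statistic. Significance below is by permutation for simplicity; MMiRKAT instead handles multivariate outcomes and covariates, and applies an analytic small-sample correction.

```python
import numpy as np

def distance_to_kernel(D):
    """Gower-center a distance matrix into a kernel: K = -0.5 * H D^2 H,
    then project to positive semi-definite (Bray-Curtis distances need
    not yield a PSD kernel directly)."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * H @ (D ** 2) @ H
    vals, vecs = np.linalg.eigh(K)
    return (vecs * np.maximum(vals, 0)) @ vecs.T

def kernel_score_test(K, y, n_perm=999, seed=0):
    """Score-type statistic Q = r'Kr with centered trait residuals r;
    p-value by permutation in this sketch."""
    rng = np.random.default_rng(seed)
    r = y - y.mean()
    q_obs = r @ K @ r
    exceed = 0
    for _ in range(n_perm):
        rp = rng.permutation(r)
        exceed += (rp @ K @ rp) >= q_obs
    return (1 + exceed) / (n_perm + 1)
```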

38 citations


Journal ArticleDOI
TL;DR: This article looks at the different sources of data and the importance of unstructured data, current and potential future uses in drug discovery, development, and monitoring as well as in public and personal healthcare; including examples of good practice and recent developments.
Abstract: The use of data analytics across the entire healthcare value chain, from drug discovery and development through epidemiology to informed clinical decisions for patients or policy making for public health, has seen an explosion in recent years. The increase in the quantity and variety of data available, together with improved storage capabilities and analytical tools, offers numerous possibilities to all stakeholders (manufacturers, regulators, payers, healthcare providers, decision makers, researchers); most importantly, it has the potential to improve general health outcomes if we learn how to exploit it in the right way. This article looks at the different sources of data and the importance of unstructured data. It goes on to summarize current and potential future uses in drug discovery, development, and monitoring as well as in public and personal healthcare, including examples of good practice and recent developments. Finally, we discuss the main practical and ethical challenges to unravel the full potential of big data in healthcare and conclude that all stakeholders need to work together towards the common goal of making sense of the available data for the common good.

38 citations


Journal ArticleDOI
TL;DR: A multivariate distance-based test is proposed to evaluate the association between key phenotypic variables and microbial interdependence, utilizing repeatedly measured microbiome data to probe the interdependent relationships among microbial species through longitudinal study.
Abstract: The human microbiome is the collection of microbes living in and on the various parts of our body. These microbes do not live alone: they act as an integrated microbial community, competing and cooperating extensively, and contribute to human health in important ways. Most current analyses focus on examining microbial differences at a single time point, which does not adequately capture the dynamic nature of microbiome data. With the advent of high-throughput sequencing and analytical tools, we are able to probe the interdependent relationships among microbial species through longitudinal studies. Here, we propose a multivariate distance-based test to evaluate the association between key phenotypic variables and microbial interdependence utilizing repeatedly measured microbiome data. Extensive simulations were performed to evaluate the validity and efficiency of the proposed method. We also demonstrate the utility of the proposed test using a well-designed longitudinal murine experiment and a longitudinal human study. The proposed methodology has been implemented in a freely distributed open-source R package and in Python code.

37 citations


Journal ArticleDOI
TL;DR: This work proposes a conditional analysis for genome-wide association study (GWAS) consortium studies, offering formulas for the calculations necessary to fit a joint linear regression model for multiple quantitative traits, and illustrates the possible usefulness of conditional analysis by contrasting its results with those of standard marginal analyses.
Abstract: There has been an increasing interest in joint association testing of multiple traits for possible pleiotropic effects. However, even in the presence of pleiotropy, most of the existing methods cannot distinguish direct and indirect effects of a genetic variant, say a single-nucleotide polymorphism (SNP), on multiple traits, and a conditional analysis of a trait adjusting for other traits is perhaps the simplest and most common approach to addressing this question. However, without individual-level genotypic and phenotypic data but with only genome-wide association study (GWAS) summary statistics, as is typical with most large-scale GWAS consortium studies, we are not aware of any existing method for such a conditional analysis. We propose such a conditional analysis, offering formulas for the calculations necessary to fit a joint linear regression model for multiple quantitative traits. Furthermore, our method can also accommodate conditional analysis on multiple SNPs in addition to multiple quantitative traits, which is expected to be useful for fine mapping. We provide numerical examples based on both simulated and real GWAS data to demonstrate the effectiveness of our proposed approach, and illustrate the possible usefulness of conditional analysis by contrasting its results with those of standard marginal analyses.
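The core identity behind summary-statistic conditional analyses of this kind, phrased here for conditioning on multiple SNPs (the paper applies analogous algebra to multiple traits): with standardized genotypes and phenotype, joint effects are approximately the LD-matrix inverse times the marginal effects.

```python
import numpy as np

def joint_from_marginal(beta_marginal, R, ridge=1e-6):
    """Joint (conditional) effects from marginal summary statistics,
    assuming standardized genotypes and phenotype:
        beta_joint ~= R^{-1} beta_marginal,
    with R the SNP correlation (LD) matrix from a reference panel.
    A small ridge term guards against near-singular LD."""
    p = len(beta_marginal)
    return np.linalg.solve(R + ridge * np.eye(p), np.asarray(beta_marginal))

# toy example: the second marginal signal is purely an LD shadow of the first
R = np.array([[1.0, 0.8], [0.8, 1.0]])
print(joint_from_marginal([0.10, 0.08], R))  # ~[0.1, 0.0]
```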

Journal ArticleDOI
TL;DR: A new likelihood-based genotype-calling approach exploits all reads and estimates per-base error rates by incorporating phred scores through a logistic regression model; results demonstrate that PhredEM performs better than either GATK or SeqEM, and is an improved, robust, and widely applicable genotype-calling approach for NGS studies.
Abstract: A fundamental challenge in analyzing next-generation sequencing (NGS) data is to determine an individual's genotype accurately, as the accuracy of the inferred genotype is essential to downstream analyses. Correctly estimating the base-calling error rate is critical to accurate genotype calls. Phred scores that accompany each call can be used to decide which calls are reliable. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic, as too high a threshold may discard data, while too low a threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The approach, which we call PhredEM, uses the expectation-maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. It also includes a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be nonmonomorphic require application of the EM algorithm. Like GATK, PhredEM can be used together with a linkage-disequilibrium-based method such as Beagle, which can further improve genotype calling as a refinement step. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project and the 1000 Genomes project. The results demonstrate that PhredEM performs better than either GATK or SeqEM, and that PhredEM is an improved, robust, and widely applicable genotype-calling approach for NGS studies. The relevant software is freely available.
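A sketch of the two ingredients described above, with made-up logistic coefficients (PhredEM estimates them by EM jointly with genotype frequencies): a phred-dependent error rate, and the resulting biallelic read likelihood for each candidate genotype. Genotype calls would then combine these likelihoods with estimated genotype frequencies as priors.

```python
import numpy as np

def base_error_rate(phred, a=-1.0, b=-0.1):
    """Per-base error rate via logistic regression on the phred score:
    eps = sigmoid(a + b * phred). The coefficients here are illustrative."""
    return 1.0 / (1.0 + np.exp(-(a + b * np.asarray(phred, float))))

def genotype_loglik(read_is_alt, phred, g):
    """Log-likelihood of the reads at a biallelic site for genotype
    g in {0, 1, 2} (count of alternate alleles): a read shows the
    alternate allele with probability g/2 if called correctly, and
    flips with the base-specific error rate."""
    eps = base_error_rate(phred)
    p_alt = (g / 2) * (1 - eps) + (1 - g / 2) * eps
    read_is_alt = np.asarray(read_is_alt, float)
    return float(np.sum(read_is_alt * np.log(p_alt) +
                        (1 - read_is_alt) * np.log(1 - p_alt)))

# 10 reads, 7 alternate, uniform quality: the heterozygote is most likely
reads = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
phred = np.full(10, 30.0)
print([round(genotype_loglik(reads, phred, g), 2) for g in (0, 1, 2)])
```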

Journal ArticleDOI
TL;DR: It is observed that testing association for all variants imputed from any panel results in higher power to detect association than the alternative strategy of including only one version of each genetic variant, selected for having the highest imputation quality metric.
Abstract: The accuracy of genotype imputation depends upon two factors: the sample size of the reference panel and the genetic similarity between the reference panel and the target samples. When multiple reference panels are not consented to combine together, it is unclear how to combine the imputation results to optimize the power of genetic association studies. We compared the accuracy of 9,265 Norwegian genomes imputed from three reference panels: 1000 Genomes phase 3 (1000G), Haplotype Reference Consortium (HRC), and a panel containing 2,201 Norwegian participants from the population-based Nord-Trøndelag Health Study (HUNT), built from low-pass genome sequencing. We observed that the population-matched reference panel allowed for imputation of more population-specific variants with lower frequency (minor allele frequency (MAF) between 0.05% and 0.5%). The overall imputation accuracy from the population-specific panel was substantially higher than 1000G and was comparable with HRC, despite HRC being 15-fold larger. These results recapitulate the value of population-specific reference panels for genotype imputation. We also evaluated different strategies to utilize multiple sets of imputed genotypes to increase the power of association studies. We observed that testing association for all variants imputed from any panel results in higher power to detect association than the alternative strategy of including only one version of each genetic variant, selected for having the highest imputation quality metric. This was particularly true for lower frequency variants (MAF < 1%), even after adjusting for the additional multiple testing burden.

Journal ArticleDOI
TL;DR: The most significant interaction was observed between rs6029315 in MAFB and rs6681355 in IRF6 in case-parent trios of European ancestry, which remained significant after correcting for multiple comparisons; however, no significant interaction was detected in trios of Asian ancestry.
Abstract: Nonsyndromic cleft lip with or without cleft palate (NSCL/P) is the most common craniofacial birth defect in humans, affecting 1 in 700 live births. This malformation has a complex etiology where multiple genes and several environmental factors influence risk. At least a dozen different genes have been confirmed to be associated with risk of NSCL/P in previous studies. However, all the known genetic risk factors cannot fully explain the observed heritability of NSCL/P, and several authors have suggested gene-gene (G × G) interaction may be important in the etiology of this complex and heterogeneous malformation. We tested for G × G interactions using common single nucleotide polymorphism (SNP) markers from targeted sequencing in 13 regions identified by previous studies spanning 6.3 Mb of the genome in a study of 1,498 NSCL/P case-parent trios. We used the R package trio to assess interactions between polymorphic markers in different genes, using a 1 degree of freedom (1df) test for screening, and a 4 degree of freedom (4df) test to assess statistical significance of epistatic interactions. To adjust for multiple comparisons, we performed permutation tests. The most significant interaction was observed between rs6029315 in MAFB and rs6681355 in IRF6 (4df P = 3.8 × 10⁻⁸) in case-parent trios of European ancestry, which remained significant after correcting for multiple comparisons. However, no significant interaction was detected in trios of Asian ancestry.

Journal ArticleDOI
TL;DR: This study proposes Multivariate Association Analysis using Score Statistics (MAAUSS) to identify rare variants associated with multiple phenotypes, based on the widely used sequence kernel association test (SKAT) for a single phenotype; MAAUSS successfully controlled type I error rates and in many cases had higher power than existing methods.
Abstract: Although genome-wide association studies (GWAS) have now discovered thousands of genetic variants associated with common traits, such variants cannot explain the large degree of "missing heritability," likely due to rare variants. The advent of next generation sequencing technology has allowed rare variant detection and association with common traits, often by investigating specific genomic regions for rare variant effects on a trait. Although multiple correlated phenotypes are often concurrently observed in GWAS, most studies analyze only single phenotypes, which may lessen statistical power. To increase power, multivariate analyses, which consider correlations between multiple phenotypes, can be used. However, few existing multivariate analyses can identify rare variants for assessing multiple phenotypes. Here, we propose Multivariate Association Analysis using Score Statistics (MAAUSS), to identify rare variants associated with multiple phenotypes, based on the widely used sequence kernel association test (SKAT) for a single phenotype. We applied MAAUSS to whole exome sequencing (WES) data from a Korean population of 1,058 subjects to discover genes associated with multiple traits of liver function. We then assessed validation of those genes by a replication study, using an independent dataset of 3,445 individuals. Notably, we detected the gene ZNF620 among five significant genes. We then performed a simulation study to compare MAAUSS's performance with existing methods. Overall, MAAUSS successfully controlled type I error rates and in many cases had higher power than the existing methods. This study illustrates a feasible and straightforward approach for identifying rare variants correlated with multiple phenotypes, with likely relevance to missing heritability.
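For orientation, the single-phenotype SKAT building block that MAAUSS extends can be written compactly. This sketch omits covariates (residuals are taken around the mean) and the p-value computation (a mixture of chi-squares in SKAT); the Beta(1, 25) weighting is the common convention for upweighting rare variants.

```python
import numpy as np
from scipy import stats

def skat_q(G, y, weights=None):
    """SKAT-style score statistic Q = r' G W W' G' r for a genotype
    matrix G (n x p, coded 0/1/2) and trait y, with per-variant
    weights W (default: Beta(MAF; 1, 25) density weights)."""
    if weights is None:
        maf = G.mean(axis=0) / 2
        weights = stats.beta.pdf(maf, 1, 25)
    r = y - y.mean()
    u = (G * weights).T @ r   # weighted per-variant score contributions
    return u @ u
```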

Journal ArticleDOI
TL;DR: The results demonstrate the existence of modifiers for which type of OFC develops and suggest plausible elements responsible for phenotypic heterogeneity, further elucidating the complex genetic architecture of OFCs.
Abstract: Orofacial clefts (OFCs) are common, complex birth defects with extremely heterogeneous phenotypic presentations. Two common subtypes, cleft lip alone (CL) and CL plus cleft palate (CLP), are typically grouped into a single phenotype for genetic analysis (i.e., CL with or without cleft palate, CL/P). However, mounting evidence suggests there may be unique underlying pathophysiology and/or genetic modifiers influencing expression of these two phenotypes. To this end, we performed a genome-wide scan for genetic modifiers by directly comparing 450 CL cases with 1,692 CLP cases from 18 recruitment sites across 13 countries from North America, Central or South America, Asia, Europe, and Africa. We identified a region on 16q21 that is strongly associated with different cleft type (P = 5.611 × 10⁻⁸). We also identified significant evidence of gene-gene interactions between this modifier locus and two recognized CL/P risk loci: 8q21 and 9q22 (FOXE1) (P = 0.012 and 0.023, respectively). Single nucleotide polymorphisms (SNPs) in the 16q21 modifier locus demonstrated significant association with CL over CLP. The marker alleles on 16q21 that increased risk for CL were found at highest frequencies among individuals with a family history of CL (P = 0.003). Our results demonstrate the existence of modifiers for which type of OFC develops and suggest plausible elements responsible for phenotypic heterogeneity, further elucidating the complex genetic architecture of OFCs.

Journal ArticleDOI
TL;DR: This study demonstrates the suitability of various methods for performing causal inference under several biologically plausible scenarios and presents a simulation study assessing the performance of the methods under different conditions.
Abstract: Genome-wide association studies (GWAS) have been very successful over the last decade at identifying genetic variants associated with disease phenotypes. However, interpretation of the results obtained can be challenging. Incorporation of further relevant biological measurements (e.g., 'omics' data) measured in the same individuals for whom we have genotype and phenotype data may help us to learn more about the mechanism and pathways through which causal genetic variants affect disease. We review various methods for causal inference that can be used for assessing the relationships between genetic variables, other biological measures, and phenotypic outcome, and present a simulation study assessing the performance of the methods under different conditions. In general, the methods we considered did well at inferring the causal structure for data simulated under simple scenarios. However, the presence of an unknown and unmeasured common environmental effect could lead to spurious inferences, with the methods we considered displaying varying degrees of robustness to this confounder. The use of causal inference techniques to integrate omics and GWAS data has the potential to improve biological understanding of the pathways leading to disease. Our study demonstrates the suitability of various methods for performing causal inference under several biologically plausible scenarios.

Journal ArticleDOI
TL;DR: The method can substantially improve power while controlling the type I error rate, and is based on the insight that batch effects on a given variant can be assessed by comparing odds ratio estimates using internal controls only vs. using combined control samples of internal and external controls.
Abstract: Due to the drop in sequencing cost, the number of sequenced genomes is increasing rapidly. To improve power of rare-variant tests, these sequenced samples could be used as external control samples in addition to control samples from the study itself. However, when using external controls, possible batch effects due to the use of different sequencing platforms or genotype calling pipelines can dramatically increase type I error rates. To address this, we propose novel summary-statistics-based single-variant and gene- or region-based rare-variant tests that allow the integration of external controls while controlling for type I error. Our approach is based on the insight that batch effects on a given variant can be assessed by comparing odds ratio estimates using internal controls only vs. using combined control samples of internal and external controls. From simulation experiments and the analysis of data from age-related macular degeneration and type 2 diabetes studies, we demonstrate that our method can substantially improve power while controlling the type I error rate.
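The diagnostic at the heart of the approach can be sketched directly: compare the log odds ratio computed with internal controls only against the one computed with internal plus external controls; a large discrepancy at a variant suggests a batch effect rather than a true signal. The z-statistic below is a crude version that ignores the correlation induced by the shared cases, so it illustrates the idea rather than the paper's calibrated tests.

```python
import numpy as np
from scipy import stats

def batch_effect_z(case_alt, case_n, int_alt, int_n, ext_alt, ext_n):
    """Compare log odds ratios (cases vs. internal-only controls, and
    cases vs. internal+external controls) at one variant, using allele
    counts with a Haldane 0.5 correction."""
    def log_or_se(a_alt, a_n, b_alt, b_n):
        t = np.array([a_alt, a_n - a_alt, b_alt, b_n - b_alt], float) + 0.5
        return np.log(t[0] * t[3] / (t[1] * t[2])), np.sqrt((1 / t).sum())
    lo_int, se_int = log_or_se(case_alt, case_n, int_alt, int_n)
    lo_all, se_all = log_or_se(case_alt, case_n,
                               int_alt + ext_alt, int_n + ext_n)
    # crude discrepancy z-score (ignores shared-case correlation)
    z = (lo_int - lo_all) / np.sqrt(se_int**2 + se_all**2)
    return z, 2 * stats.norm.sf(abs(z))
```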

Journal ArticleDOI
TL;DR: The key idea is to data-adaptively weight each variable pair based on its empirical association, alleviating the effects of noise accumulation in high-dimensional data and thus maintaining power for both dense and sparse alternative hypotheses.
Abstract: Testing for association between two random vectors is a common and important task in many fields; however, existing tests, such as Escoufier's RV test, are suitable only for low-dimensional data, not for high-dimensional data. In moderate to high dimensions, it is necessary to consider sparse signals, which are often expected with only a few, but not many, variables associated with each other. We generalize the RV test to moderate-to-high dimensions. The key idea is to data-adaptively weight each variable pair based on its empirical association. As a consequence, the proposed test is adaptive, alleviating the effects of noise accumulation in high-dimensional data, and thus maintaining power for both dense and sparse alternative hypotheses. We show the connections between the proposed test and several existing tests, such as a generalized estimating equations-based adaptive test, multivariate kernel machine regression (KMR), and kernel distance methods. Furthermore, we modify the proposed adaptive test so that it can be powerful for nonlinear or nonmonotonic associations. We use both real data and simulated data to demonstrate the advantages and usefulness of the proposed new test. The new test is freely available in R package aSPC on CRAN at https://cran.r-project.org/web/packages/aSPC/index.html and https://github.com/jasonzyx/aSPC.
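A sketch of the adaptive weighting idea, in the spirit of aSPC rather than its exact implementation: sum powered absolute cross-correlations, where small powers spread weight over many pairs (dense signals) and large powers concentrate on the strongest pairs (sparse signals), then adaptively take the best power, calibrated by one set of permutations.

```python
import numpy as np

def adaptive_powered_corr_test(X, Y, gammas=(1, 2, 4, 8), n_perm=499, seed=0):
    """Adaptive sum of powered cross-correlations between X (n x p)
    and Y (n x q); the minimum per-power permutation p-value is
    recalibrated against the same permutations."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]

    def stats_all(Ymat):
        R = np.corrcoef(np.hstack([X, Ymat]), rowvar=False)[:p, p:]
        return np.array([np.sum(np.abs(R) ** g) for g in gammas])

    S = np.vstack([stats_all(Y)] +
                  [stats_all(Y[rng.permutation(len(Y))])
                   for _ in range(n_perm)])
    # per-power p-value of every row against the pooled set of statistics
    p_rows = np.array([(S >= S[i]).mean(axis=0) for i in range(len(S))])
    min_p = p_rows.min(axis=1)   # adaptive step: best power per row
    return (1 + np.sum(min_p[1:] <= min_p[0])) / (n_perm + 1)
```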

Journal ArticleDOI
Chao Xu, Kehao Wu, Ji-Gang Zhang, Hui Shen, Hong-Wen Deng
TL;DR: The results show that with appropriate allocation of sequencing effort, two-stage sequencing is an effective approach for conducting GAS, and practical guidelines are provided for investigators to plan an optimal sequencing-based GAS, including two-stage designs, given their specific constraints on sequencing investment.
Abstract: Next-generation sequencing-based genetic association study (GAS) is a powerful tool to identify candidate disease variants and genomic regions. Low-coverage sequencing offers low cost but is inadequate for calling rare variants, whereas high coverage detects essentially every variant but at a high cost. Two-stage sequencing may be an economical way to conduct a GAS without losing power. In two-stage sequencing, an affordable number of samples is sequenced at high coverage to serve as a reference panel, which is then used to impute genotypes in a larger sample sequenced at low coverage. As unit sequencing costs continue to decrease, investigators can now conduct GAS with more flexible sequencing depths. Here, we systematically evaluate the effect of read depth and sample size on variant discovery power and association power for study designs using low-coverage, high-coverage, and two-stage sequencing. We consider 12 low-coverage, 12 high-coverage, and 51 two-stage design scenarios with read depth varying from 0.5× to 80×. With state-of-the-art simulation and analysis packages and in-house scripts, we simulate the complete study process from DNA sequencing to SNP (single nucleotide polymorphism) calling and association testing. Our results show that with appropriate allocation of sequencing effort, two-stage sequencing is an effective approach for conducting GAS. We provide practical guidelines for investigators to plan an optimal sequencing-based GAS, including two-stage designs, given their specific constraints on sequencing investment.

Journal ArticleDOI
TL;DR: Two approaches to formally test the pleiotropic relationship in multiple scenarios are provided, evaluated under various simulation scenarios, and applied to the COPDGene study, a case-control study of chronic obstructive pulmonary disease in current and former smokers.
Abstract: Through genome-wide association studies, numerous genes have been shown to be associated with multiple phenotypes. To determine the overlap of genetic susceptibility of correlated phenotypes, one can apply multivariate regression or dimension reduction techniques, such as principal components analysis, and test for the association with the principal components of the phenotypes rather than the individual phenotypes. However, as these approaches test whether there is a genetic effect for at least one of the phenotypes, a significant test result does not necessarily imply pleiotropy. Recently, a method called Pleiotropy Estimation and Test Bootstrap (PET-B) has been proposed to specifically test for pleiotropy (i.e., that two normally distributed phenotypes are both associated with the single nucleotide polymorphism of interest). Although the method examines the genetic overlap between the two quantitative phenotypes, the extension to binary phenotypes, three or more phenotypes, and rare variants is not straightforward. We provide two approaches to formally test this pleiotropic relationship in multiple scenarios. These approaches depend on permuting the phenotypes of interest and comparing the set of observed P-values to the set of permuted P-values in relation to the origin (e.g., a vector of zeros) either using the Hausdorff metric or a cutoff-based approach. These approaches are appropriate for categorical and quantitative phenotypes, more than two phenotypes, common variants and rare variants. We evaluate these approaches under various simulation scenarios and apply them to the COPDGene study, a case-control study of chronic obstructive pulmonary disease in current and former smokers.

Journal ArticleDOI
TL;DR: A high‐dimensional robust regression approach to infer the regulatory relationships between GEs and CNAs is developed and has competitive performance compared to the nonrobust benchmark and the robust LAD (least absolute deviation) approach.
Abstract: Gene expression (GE) levels have important biological and clinical implications. They are regulated by copy number alterations (CNAs). Modeling the regulatory relationships between GEs and CNAs facilitates understanding disease biology and can also have values in translational medicine. The expression level of a gene can be regulated by its cis-acting as well as trans-acting CNAs, and the set of trans-acting CNAs is usually not known, which poses a high-dimensional selection and estimation problem. Most of the existing studies share a common limitation in that they cannot accommodate long-tailed distributions or contamination of GE data. In this study, we develop a high-dimensional robust regression approach to infer the regulatory relationships between GEs and CNAs. A high-dimensional regression model is used to accommodate the effects of both cis-acting and trans-acting CNAs. A density power divergence loss function is used to accommodate long-tailed GE distributions and contamination. Penalization is adopted for regularized estimation and selection of relevant CNAs. The proposed approach is effectively realized using a coordinate descent algorithm. Simulation shows that it has competitive performance compared to the nonrobust benchmark and the robust LAD (least absolute deviation) approach. We analyze TCGA (The Cancer Genome Atlas) data on cutaneous melanoma and study GE-CNA regulations in the RAP (regulation of apoptosis) pathway, which further demonstrates the satisfactory performance of the proposed approach.

Journal ArticleDOI
TL;DR: This work investigates multiple linear combination (MLC) test statistics for analysis of common variants under realistic trait models with linkage disequilibrium (LD) based on HapMap Asian haplotypes and demonstrates that MLC is a well‐powered and robust choice among existing methods across a broad range of gene structures.
Abstract: By jointly analyzing multiple variants within a gene, instead of one at a time, gene-based multiple regression can improve power, robustness, and interpretation in genetic association analysis. We investigate multiple linear combination (MLC) test statistics for analysis of common variants under realistic trait models with linkage disequilibrium (LD) based on HapMap Asian haplotypes. MLC is a directional test that exploits LD structure in a gene to construct clusters of closely correlated variants recoded such that the majority of pairwise correlations are positive. It combines variant effects within the same cluster linearly, and aggregates cluster-specific effects in a quadratic sum of squares and cross-products, producing a test statistic with reduced degrees of freedom (df) equal to the number of clusters. By simulation studies of 1000 genes from across the genome, we demonstrate that MLC is a well-powered and robust choice among existing methods across a broad range of gene structures. Compared to minimum P-value, variance-component, and principal-component methods, the mean power of MLC is never much lower than that of other methods, and can be higher, particularly with multiple causal variants. Moreover, the variation in gene-specific MLC test size and power across 1000 genes is less than that of other methods, suggesting it is a complementary approach for discovery in genome-wide analysis. The cluster construction of the MLC test statistics helps reveal within-gene LD structure, allowing interpretation of clustered variants as haplotypic effects, while multiple regression helps to distinguish direct and indirect associations.

Journal ArticleDOI
TL;DR: A gene-based segregation test that quantifies the uncertainty of the filtering approach is proposed, constructed using the probability of segregation events under the null hypothesis of Mendelian transmission, and applied to whole-exome sequencing data from 49 extended pedigrees with severe, early-onset chronic obstructive pulmonary disease (COPD) in the Boston Early-Onset COPD study (BEOCOPD).
Abstract: Whole-exome sequencing using family data has identified rare coding variants in Mendelian diseases or complex diseases with Mendelian subtypes, using filters based on variant novelty, functionality, and segregation with the phenotype within families. However, formal statistical approaches are limited. We propose a gene-based segregation test (GESE) that quantifies the uncertainty of the filtering approach. It is constructed using the probability of segregation events under the null hypothesis of Mendelian transmission. This test takes into account different degrees of relatedness in families, the number of functional rare variants in the gene, and their minor allele frequencies in the corresponding population. In addition, a weighted version of this test allows incorporating additional subject phenotypes to improve statistical power. We show via simulations that the GESE and weighted GESE tests maintain appropriate type I error rate, and have greater power than several commonly used region-based methods. We apply our method to whole-exome sequencing data from 49 extended pedigrees with severe, early-onset chronic obstructive pulmonary disease (COPD) in the Boston Early-Onset COPD study (BEOCOPD) and identify several promising candidate genes. Our proposed methods show great potential for identifying rare coding variants of large effect and high penetrance for family-based sequencing data. The proposed tests are implemented in an R package that is available on CRAN (https://cran.r-project.org/web/packages/GESE/).
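A toy version of the segregation probability that drives such a test: under Mendelian transmission from a heterozygous parent, each child inherits a rare variant independently with probability 1/2, so full co-segregation with disease across a sibship is increasingly unlikely under the null. (GESE generalizes this to arbitrary pedigree structures and aggregates over the functional rare variants in a gene.)

```python
# P(all k affected siblings carry the variant by chance) = (1/2)^k
for k in range(1, 7):
    print(k, 0.5 ** k)
```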

Journal ArticleDOI
TL;DR: An ARMI (assisted robust marker identification) approach for analyzing cancer studies with measurements on GEs as well as regulators is developed and can be more effective than analyzing GE data alone.
Abstract: Gene expression (GE) studies have been playing a critical role in cancer research. Despite tremendous effort, the analysis results are still often unsatisfactory, because of the weak signals and high data dimensionality. Analysis is often further challenged by the long-tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on GEs as well as their regulators (e.g., copy number alterations (CNAs), methylation, and microRNAs), which can provide additional information on the associations between GEs and cancer outcomes. In this study, we develop an ARMI (assisted robust marker identification) approach for analyzing cancer studies with measurements on GEs as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing GE data alone. A robust objective function is adopted to accommodate long-tailed distributions. Marker identification is effectively realized using penalization. The proposed approach has an intuitive formulation and is computationally much affordable. Simulation shows its satisfactory performance under a variety of settings. TCGA (The Cancer Genome Atlas) data on melanoma and lung cancer are analyzed, which leads to biologically plausible marker identification and superior prediction.

Journal ArticleDOI
TL;DR: A modified version of the Bayesian information criterion is developed for building a multilocus model that accounts for the differential correlation structure due to linkage disequilibrium (LD) and admixture LD.
Abstract: In genome-wide association studies (GWAS) genetic loci that influence complex traits are localized by inspecting associations between genotypes of genetic markers and the values of the trait of interest. On the other hand, admixture mapping, which is performed in case of populations consisting of a recent mix of two ancestral groups, relies on the ancestry information at each locus (locus-specific ancestry). Recently it has been proposed to jointly model genotype and locus-specific ancestry within the framework of single marker tests. Here, we extend this approach for population-based GWAS in the direction of multimarker models. A modified version of the Bayesian information criterion is developed for building a multilocus model that accounts for the differential correlation structure due to linkage disequilibrium (LD) and admixture LD. Simulation studies and a real data example illustrate the advantages of this new approach compared to single-marker analysis or modern model selection strategies based on separately analyzing genotype and ancestry data, as well as to single-marker analysis combining genotypic and ancestry information. Depending on the signal strength, our procedure automatically chooses whether genotypic or locus-specific ancestry markers are added to the model. This results in a good compromise between the power to detect causal mutations and the precision of their localization. The proposed method has been implemented in R and is available at http://www.math.uni.wroc.pl/~mbogdan/admixtures/.

Journal ArticleDOI
TL;DR: A two‐stage statistical framework to assess skewed XCI and evaluate gene‐level patterns of XCI for an individual sample by integration of RNA sequence, copy number alteration, and genotype data is developed and applied to data from tumors of ovarian cancer patients.
Abstract: X-chromosome inactivation (XCI) epigenetically silences transcription of an X chromosome in females; patterns of XCI are thought to be aberrant in women's cancers, but are understudied due to statistical challenges. We develop a two-stage statistical framework to assess skewed XCI and evaluate gene-level patterns of XCI for an individual sample by integration of RNA sequence, copy number alteration, and genotype data. Our method relies on allele-specific expression (ASE) to directly measure XCI and does not rely on male samples or paired normal tissue for comparison. We model ASE using a two-component mixture of beta distributions, allowing estimation for a given sample of the degree of skewness (based on a composite likelihood ratio test) and the posterior probability that a given gene escapes XCI (using a Bayesian beta-binomial mixture model). To illustrate the utility of our approach, we applied these methods to data from tumors of ovarian cancer patients. Among 99 patients, 45 tumors were informative for analysis and showed evidence of XCI skewed toward a particular parental chromosome. For 397 X-linked genes, we observed tumor XCI patterns largely consistent with previously identified consensus states based on multiple normal tissue types. However, 37 genes differed in XCI state between ovarian tumors and the consensus state; 17 genes aberrantly escaped XCI in ovarian tumors (including many oncogenes), whereas 20 genes were unexpectedly inactivated in ovarian tumors (including many tumor suppressor genes). These results provide evidence of the importance of XCI in ovarian cancer and demonstrate the utility of our two-stage analysis.
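The gene-level classification step can be sketched as a two-component beta-binomial posterior on allele-specific read counts; the component parameters and prior below are illustrative choices, not the paper's fitted values.

```python
from scipy import stats

def escape_posterior(alt_reads, total_reads, skew=0.9, prior_escape=0.25):
    """Posterior probability that a gene escapes XCI from its
    allele-specific expression: an 'inactivated' component draws reads
    mostly from the active allele (fraction ~ skew), while an 'escape'
    component is closer to balanced expression."""
    lik_inact = stats.betabinom.pmf(alt_reads, total_reads,
                                    skew * 20, (1 - skew) * 20)
    lik_esc = stats.betabinom.pmf(alt_reads, total_reads, 10, 10)
    num = prior_escape * lik_esc
    return num / (num + (1 - prior_escape) * lik_inact)

print(escape_posterior(alt_reads=48, total_reads=100))  # balanced -> high
print(escape_posterior(alt_reads=92, total_reads=100))  # skewed   -> low
```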

Journal ArticleDOI
TL;DR: Novel statistical tests are proposed for the association between rare and common variants in a genomic region and a complex trait of interest based on cross-validation prediction error (PE); PE-TOW and PE-WS are shown to be consistently more powerful than TOW and WS, respectively, and PE is the most powerful test when causal variants contain both common and rare variants.
Abstract: Despite the extensive discovery of disease-associated common variants, much of the genetic contribution to complex traits remains unexplained. Rare variants may explain additional disease risk or trait variability. Although sequencing technology provides a supreme opportunity to investigate the roles of rare variants in complex diseases, detection of these variants in sequencing-based association studies presents substantial challenges. In this article, we propose novel statistical tests of association between rare and common variants in a genomic region and a complex trait of interest based on cross-validation prediction error (PE). We first propose a PE method based on Ridge regression. Building on PE, we also propose two further tests, PE-WS and PE-TOW, which test a weighted combination of variants under two different weighting schemes: PE-WS is the PE version of the test based on the weighted sum statistic (WS), and PE-TOW is the PE version of the test based on the optimally weighted combination of variants (TOW). Using extensive simulation studies, we are able to show that (1) PE-TOW and PE-WS are consistently more powerful than TOW and WS, respectively, and (2) PE is the most powerful test when causal variants contain both common and rare variants.
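A sketch of the PE device for the Ridge-based version: compute the k-fold cross-validation prediction error of the trait on the region's genotypes, and calibrate it against phenotype permutations. PE-WS and PE-TOW apply the same device to weighted variant combinations; the names and centering shortcut below are illustrative.

```python
import numpy as np

def cv_pe(G, y, lam=1.0, k=5, seed=0):
    """k-fold cross-validated prediction error of ridge regression of
    trait y on region genotypes G (sketch: y is centered once instead
    of fitting an intercept)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float) - np.mean(y)
    idx = rng.permutation(len(y))
    pe = 0.0
    for f in np.array_split(idx, k):
        tr = np.setdiff1d(idx, f)
        beta = np.linalg.solve(G[tr].T @ G[tr] + lam * np.eye(G.shape[1]),
                               G[tr].T @ y[tr])
        pe += np.sum((y[f] - G[f] @ beta) ** 2)
    return pe

def pe_test(G, y, n_perm=199, seed=1):
    """Association p-value: is the observed cross-validated prediction
    error smaller than under permuted phenotypes?"""
    rng = np.random.default_rng(seed)
    pe_obs = cv_pe(G, y)
    pe_perm = np.array([cv_pe(G, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(pe_perm <= pe_obs)) / (n_perm + 1)
```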

Journal ArticleDOI
TL;DR: An efficient and fast spatial‐clustering algorithm is proposed for the association analysis of whole‐genome sequencing studies and a region in the ITGB3 gene that potentially harbors disease susceptibility loci for Alzheimer's disease is identified.
Abstract: For the association analysis of whole-genome sequencing (WGS) studies, we propose an efficient and fast spatial-clustering algorithm. Compared to existing analysis approaches for WGS data, that define the tested regions either by sliding or consecutive windows of fixed sizes along variants, a meaningful grouping of nearby variants into consecutive regions has the advantage that, compared to sliding window approaches, the number of tested regions is likely to be smaller. In comparison to consecutive, fixed-window approaches, our approach is likely to group nearby variants together. Given existing biological evidence that disease-associated mutations tend to physically cluster in specific regions along the chromosome, the identification of meaningful groups of nearby located variants could thus lead to a potential power gain for association analysis. Our algorithm defines consecutive genomic regions based on the physical positions of the variants, assuming an inhomogeneous Poisson process and groups together nearby variants. As parameters are estimated locally, the algorithm takes the differing variant density along the chromosome into account and provides locally optimal partitioning of variants into consecutive regions. An R-implementation of the algorithm is provided. We discuss the theoretical advances of our algorithm compared to existing, window-based approaches and show the performance and advantage of our introduced algorithm in a simulation study and by an application to Alzheimer's disease WGS data. Our analysis identifies a region in the ITGB3 gene that potentially harbors disease susceptibility loci for Alzheimer's disease. The region-based association signal of ITGB3 replicates in an independent data set and achieves formally genome-wide significance. Software Implementation: An implementation of the algorithm in R is available at: https://github.com/heidefier/cluster_wgs_data.
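A rough stand-in for the clustering step: group sorted variant positions greedily, starting a new region whenever the gap to the previous variant is large relative to the local gap scale. This mimics partitioning under locally estimated Poisson intensities; the window and threshold are illustrative, not the paper's estimator.

```python
import numpy as np

def cluster_variants(pos, window=50, gap_factor=3.0):
    """Greedily group sorted variant positions into consecutive regions:
    start a new region when the gap to the previous variant exceeds
    gap_factor times the median of the last `window` gaps (a crude
    proxy for the locally estimated Poisson intensity)."""
    pos = np.asarray(pos)
    gaps = np.diff(pos)
    labels = np.zeros(len(pos), dtype=int)
    for i in range(1, len(pos)):
        local = np.median(gaps[max(0, i - 1 - window):i])
        labels[i] = labels[i - 1] + (gaps[i - 1] > gap_factor * local)
    return labels
```

Region-based association tests would then be run once per resulting label, rather than over fixed or sliding windows.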

Journal ArticleDOI
TL;DR: This work proposes Longitudinal SNP‐set/sequence kernel association test (LSKAT), a robust, mixed‐effects method for association testing of rare and common variants with longitudinal quantitative phenotypes and applies the LSKAT and LBT methods to detect association with longitudinally measured body mass index in the Framingham Heart Study.
Abstract: Many genetic epidemiological studies collect repeated measurements over time. This design not only provides a more accurate assessment of disease condition, but allows us to explore the genetic influence on disease development and progression. Thus, it is of great interest to study the longitudinal contribution of genes to disease susceptibility. Most association testing methods for longitudinal phenotypes are developed for single variants, and may have limited power to detect association, especially for variants with low minor allele frequency. We propose the Longitudinal SNP-set/sequence kernel association test (LSKAT), a robust, mixed-effects method for association testing of rare and common variants with longitudinal quantitative phenotypes. LSKAT uses several random effects to account for the within-subject correlation in longitudinal data, and allows for adjustment for both static and time-varying covariates. We also present a longitudinal trait burden test (LBT), where we test association between the trait and the burden score in linear mixed models. In simulation studies, we demonstrate that LBT achieves high power when variants are almost all deleterious or all protective, while LSKAT performs well in a wide range of genetic models. By making full use of trait values from repeated measures, LSKAT is more powerful than several tests applied to a single measurement or an average over all time points. Moreover, LSKAT is robust to misspecification of the covariance structure. We apply the LSKAT and LBT methods to detect association with longitudinally measured body mass index in the Framingham Heart Study, where we are able to replicate association with the circadian gene NR1D2.