
Showing papers by "Xiang Zhou" published in 2018


Journal ArticleDOI
TL;DR: A key feature of the VIPER method is its ability to preserve gene expression variability across cells after imputation, which helps to facilitate accurate transcriptome quantification at the single-cell level.
Abstract: We develop a method, VIPER, to impute the zero values in single-cell RNA sequencing studies to facilitate accurate transcriptome quantification at the single-cell level. VIPER is based on nonnegative sparse regression models and is capable of progressively inferring a sparse set of local neighborhood cells that are most predictive of the expression levels of the cell of interest for imputation. A key feature of our method is its ability to preserve gene expression variability across cells after imputation. We illustrate the advantages of our method through several well-designed real data-based analytical experiments.

100 citations
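
The nonnegative sparse regression idea in the abstract above can be illustrated with a small sketch. This is not the VIPER implementation: the function name `nonneg_lasso`, the coordinate-descent solver, the penalty value, and the toy expression values are all assumptions made for illustration. The sketch fits nonnegative, L1-penalized weights that predict a target cell from candidate neighbor cells on the genes where the target is observed, then uses those weights to fill in one missing gene.

```python
def nonneg_lasso(columns, target, penalty, n_iter=200):
    """Coordinate descent for
    min_b 0.5 * ||y - sum_j b_j * x_j||^2 + penalty * sum_j b_j,  with b_j >= 0.
    """
    coefs = [0.0] * len(columns)
    residual = list(target)  # y minus the current fit
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            norm_sq = sum(v * v for v in col)
            if norm_sq == 0.0:
                continue
            # Inner product of column j with the residual that excludes column j.
            rho = sum(c * r for c, r in zip(col, residual)) + coefs[j] * norm_sq
            # Shrink by the penalty, then clip at zero to enforce nonnegativity.
            new = max(0.0, (rho - penalty) / norm_sq)
            delta = new - coefs[j]
            if delta:
                residual = [r - delta * c for r, c in zip(residual, col)]
                coefs[j] = new
    return coefs


# Hypothetical toy data: expression of 4 observed genes in 3 candidate neighbor cells.
neighbors = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 0.0, 1.0],
    [4.0, 3.0, 2.0, 1.0],
]
# Target cell on the same 4 genes (constructed as the average of the first two cells).
target = [1.5, 1.5, 1.5, 2.5]

weights = nonneg_lasso(neighbors, target, penalty=0.1)

# Impute a gene that is zero in the target cell from the neighbors' values for it.
neighbor_values_for_missing_gene = [3.0, 1.0, 0.0]
imputed = sum(w * v for w, v in zip(weights, neighbor_values_for_missing_gene))
print(weights, imputed)
```

Clipping at zero is what keeps every neighbor's contribution nonnegative, and the penalty is what pushes weakly informative neighbors toward a weight of exactly zero. Setting `penalty=0.0` reduces the sketch to plain nonnegative least squares.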


Journal ArticleDOI
TL;DR: In this article, the authors used deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a).
Abstract: Lipoprotein(a), Lp(a), is a modified low-density lipoprotein particle that contains apolipoprotein(a), encoded by LPA, and is a highly heritable, causal risk factor for cardiovascular diseases that varies in concentrations across ancestries. Here, we use deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a). We observe that genetic determinants between Europeans and Africans have several unique determinants. The common variant rs12740374 associated with Lp(a) cholesterol is an eQTL for SORT1 and independent of LDL cholesterol. Observed associations of aggregates of rare non-coding variants are largely explained by LPA structural variation, namely the LPA kringle IV 2 (KIV2)-CN. Finally, we find that LPA risk genotypes confer greater relative risk for incident atherosclerotic cardiovascular diseases compared to directly measured Lp(a), and are significantly associated with measures of subclinical atherosclerosis in African Americans.

72 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a framework that provides an effect size analog for each explanatory variable in Bayesian kernel regression models when the kernel is shift-invariant, for example, the Gaussian kernel.
Abstract: Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propose a novel framework that provides an effect size analog for each explanatory variable in Bayesian kernel regression models when the kernel is shift-invariant — for example, the Gaussian kernel. We use function analytic properties of shift-invariant reproducing kernel Hilbert spaces (RKHS) to define a linear vector space that: (i) captures nonlinear structure, and (ii) can be projected onto the original explanatory variables. This projection onto the original explanatory variables serves as an analog of effect sizes. The specific function analytic property we use is that shift-invariant kernel functions can be approximated via random Fourier bases. Based ...

43 citations
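
The property the abstract above relies on, that a shift-invariant kernel can be approximated by random Fourier bases, can be checked numerically. The sketch below covers only that approximation for the Gaussian kernel, not the paper's effect size analog. The function names, the bandwidth `sigma`, the number of features, and the two test points are assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff_map(dim, n_features, sigma=1.0, seed=0):
    """Return a random Fourier feature map z with z(x) . z(y) ~= k(x, y)."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, sigma^-2 I).
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(ws, x)) + b)
                for ws, b in zip(freqs, phases)]

    return feature_map


phi = make_rff_map(dim=3, n_features=5000, sigma=1.0, seed=0)
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, 0.1]

approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = gaussian_kernel(x, y)
print(approx, exact)
```

The inner product of the two feature vectors is a Monte Carlo estimate of the kernel value, so its error shrinks roughly as one over the square root of the number of random features.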


Journal ArticleDOI
TL;DR: The widely used linear mixed model is extended to incorporate multiple SNP functional annotations from omics studies with GWAS summary statistics to facilitate the identification of trait-relevant tissues, with which to further construct powerful association tests.
Abstract: Genome-wide association studies (GWASs) have identified many disease associated loci, the majority of which have unknown biological functions. Understanding the mechanism underlying trait associations requires identifying trait-relevant tissues and investigating associations in a trait-specific fashion. Here, we extend the widely used linear mixed model to incorporate multiple SNP functional annotations from omics studies with GWAS summary statistics to facilitate the identification of trait-relevant tissues, with which to further construct powerful association tests. Specifically, we rely on a generalized estimating equation based algorithm for parameter inference, a mixture modeling framework for trait-tissue relevance classification, and a weighted sequence kernel association test constructed based on the identified trait-relevant tissues for powerful association analysis. We refer to our analytic procedure as the Scalable Multiple Annotation integration for trait-Relevant Tissue identification and usage (SMART). With extensive simulations, we show how our method can make use of multiple complementary annotations to improve the accuracy for identifying trait-relevant tissues. In addition, our procedure allows us to make use of the inferred trait-relevant tissues, for the first time, to construct more powerful SNP set tests. We apply our method for an in-depth analysis of 43 traits from 28 GWASs using tissue-specific annotations in 105 tissues derived from ENCODE and Roadmap. Our results reveal new trait-tissue relevance, pinpoint important annotations that are informative of trait-tissue relationship, and illustrate how we can use the inferred trait-relevant tissues to construct more powerful association tests in the Wellcome trust case control consortium study.

30 citations


Journal ArticleDOI
TL;DR: iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association pattern using mixture modeling and has the potential to reveal causal relationship between traits.
Abstract: Motivation: Genome-wide association studies (GWASs) have identified many genetic loci associated with complex traits. A substantial fraction of these identified loci is associated with multiple traits, a phenomenon known as pleiotropy. Identification of pleiotropic associations can help characterize the genetic relationship among complex traits and can facilitate our understanding of disease etiology. Effective pleiotropic association mapping requires the development of statistical methods that can jointly model multiple traits with genome-wide single nucleotide polymorphisms (SNPs) together. Results: We develop a joint modeling method, which we refer to as the integrative MApping of Pleiotropic association (iMAP). iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association pattern using mixture modeling and has the potential to reveal causal relationship between traits. Importantly, iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and, with a sparsity-inducing penalty, is capable of selecting informative annotations from a large, potentially non-informative set. To enable scalable inference of iMAP to association studies with hundreds of thousands of individuals and millions of SNPs, we develop an efficient expectation maximization algorithm based on an approximate penalized regression algorithm. With simulations and comparisons to existing methods, we illustrate the benefits of iMAP in terms of both high association mapping power and accurate estimation of genome-wide SNP association patterns. Finally, we apply iMAP to perform a joint analysis of 48 traits from 31 GWAS consortia together with 40 tissue-specific SNP annotations generated from the Roadmap Project. Availability and implementation: iMAP is freely available at http://www.xzlab.org/software.html. Supplementary information: Supplementary data are available at Bioinformatics online.

29 citations


Posted ContentDOI
14 Nov 2018-bioRxiv
TL;DR: This work develops a Bayesian inference method using continuous shrinkage priors to extend previous causal mediation analysis techniques to a high-dimensional setting and identified DNA methylation regions that may actively mediate the effect of socioeconomic status (SES) on cardiometabolic outcome.
Abstract: Causal mediation analysis aims to examine the role of a mediator or a group of mediators that lie in the pathway between an exposure and an outcome. Recent biomedical studies often involve a large number of potential mediators based on high-throughput technologies. Most of the current analytic methods focus on settings with one or a moderate number of potential mediators. With the expanding growth of omics data, joint analysis of molecular-level genomics data with epidemiological data through mediation analysis is becoming more common. However, such joint analysis requires methods that can simultaneously accommodate high-dimensional mediators and that are currently lacking. To address this problem, we develop a Bayesian inference method using continuous shrinkage priors to extend previous causal mediation analysis techniques to a high-dimensional setting. Simulations demonstrate that our method improves the power of global mediation analysis compared to simpler alternatives and has decent performance to identify true non-null mediators. We also construct tests for natural indirect effects using a permutation procedure. The Bayesian method helps us to understand the structure of the composite null hypotheses. We applied our method to Multi-Ethnic Study of Atherosclerosis (MESA) and identified DNA methylation regions that may actively mediate the effect of socioeconomic status (SES) on cardiometabolic outcome.

23 citations


Posted ContentDOI
31 Jan 2018-bioRxiv
TL;DR: iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and is capable of selecting informative annotations from a large, potentially noninformative set, and develops an efficient expectation maximization algorithm based on an approximate penalized regression algorithm.
Abstract: Motivation: Genome-wide association studies (GWASs) have identified many genetic loci associated with complex traits. A substantial fraction of these identified loci are associated with multiple traits, a phenomenon known as pleiotropy. Identification of pleiotropic associations can help characterize the genetic relationship among complex traits and can facilitate our understanding of disease etiology. Effective pleiotropic association mapping requires the development of statistical methods that can jointly model multiple traits with genome-wide SNPs together. Results: We develop a joint modeling method, which we refer to as the integrative MApping of Pleiotropic association (iMAP). iMAP models summary statistics from GWASs, uses a multivariate Gaussian distribution to account for phenotypic correlation, simultaneously infers genome-wide SNP association pattern using mixture modeling, and has the potential to reveal causal relationship between traits. Importantly, iMAP integrates a large number of SNP functional annotations to substantially improve association mapping power, and, with a sparsity-inducing penalty, is capable of selecting informative annotations from a large, potentially noninformative set. To enable scalable inference of iMAP to association studies with hundreds of thousands of individuals and millions of SNPs, we develop an efficient expectation maximization algorithm based on an approximate penalized regression algorithm. With simulations and comparisons to existing methods, we illustrate the benefits of iMAP both in terms of high association mapping power and in terms of accurate estimation of genome-wide SNP association patterns. Finally, we apply iMAP to perform a joint analysis of 48 traits from 31 GWAS consortia together with 40 tissue-specific SNP annotations generated from the Roadmap Project.

9 citations


Journal ArticleDOI
TL;DR: A pvalue-assisted subset testing for associations (pASTA) framework that generalizes the previously proposed association analysis based on subsets (ASSET) method by incorporating gene-environment (G-E) interactions into the testing procedure and allows researchers to determine the most probable subset of traits that exhibit genetic associations in addition to the enhancement of power.
Abstract: Objectives: Classical methods for combining summary data from genome-wide association studies only use marginal genetic effects, and power can be compromised in the presence of heterogeneity. We aim to enhance the discovery of novel associated loci in the presence of heterogeneity of genetic effects in subgroups defined by an environmental factor. Methods: We present a pvalue-assisted subset testing for associations (pASTA) framework that generalizes the previously proposed association analysis based on subsets (ASSET) method by incorporating gene-environment (G-E) interactions into the testing procedure. We conduct simulation studies and provide two data examples. Results: Simulation studies show that our proposal is more powerful than methods based on marginal associations in the presence of G-E interactions and maintains comparable power even in their absence. Both data examples demonstrate that our method can increase power to detect overall genetic associations and identify novel studies/phenotypes that contribute to the association. Conclusions: Our proposed method can be a useful screening tool to identify candidate single nucleotide polymorphisms that are potentially associated with the trait(s) of interest for further validation. It also allows researchers to determine the most probable subset of traits that exhibit genetic associations in addition to the enhancement of power.

8 citations


Posted ContentDOI
04 Jan 2018-bioRxiv
TL;DR: The widely used linear mixed model is extended to incorporate multiple SNP functional annotations from omics studies with GWAS summary statistics to facilitate the identification of trait-relevant tissues, with which to further construct powerful association tests.
Abstract: Genome-wide association studies (GWASs) have identified many disease associated loci, the majority of which have unknown biological functions. Understanding the mechanism underlying trait associations requires identifying trait-relevant tissues and investigating associations in a trait-specific fashion. Here, we extend the widely used linear mixed model to incorporate multiple SNP functional annotations from omics studies with GWAS summary statistics to facilitate the identification of trait-relevant tissues, with which to further construct powerful association tests. Specifically, we rely on a generalized estimating equation based algorithm for parameter inference, a mixture modeling framework for trait-tissue relevance classification, and a weighted sequence kernel association test constructed based on the identified trait-relevant tissues for powerful association analysis. We refer to our analytic procedure as the Scalable Multiple Annotation integration for trait-Relevant Tissue identification and usage (SMART). With extensive simulations, we show how our method can make use of multiple complementary annotations to improve the accuracy for identifying trait-relevant tissues. In addition, our procedure allows us to make use of the inferred trait-relevant tissues, for the first time, to construct more powerful SNP set tests. We apply our method for an in-depth analysis of 43 traits from 28 GWASs using tissue-specific annotations in 105 tissues derived from ENCODE and Roadmap. Our results reveal new trait-tissue relevance, pinpoint important annotations that are informative of trait-tissue relationship, and illustrate how we can use the inferred trait-relevant tissues to construct more powerful association tests in the Wellcome trust case control consortium study.

8 citations


Journal ArticleDOI
TL;DR: A novel two-step approach using omics data to conduct genome-wide searches for gene-arsenic interactions that will enhance the understanding of disease etiology and the ability to develop interventions targeted at susceptible sub-populations is described.
Abstract: Identifying gene-environment interactions is a central challenge in the quest to understand susceptibility to complex, multi-factorial diseases. Developing an understanding of how inter-individual variability in inherited genetic variation alters the effects of environmental exposures will enhance our knowledge of disease mechanisms and improve our ability to predict disease and target interventions to high-risk sub-populations. Limited progress has been made identifying gene-environment interactions in the epidemiological setting using existing statistical approaches for genome-wide searches for interaction. In this paper, we describe a novel two-step approach using omics data to conduct genome-wide searches for gene-environment interactions. Using existing genome-wide SNP data from a large Bangladeshi cohort study specifically designed to assess the effect of arsenic exposure on health, we evaluated gene-arsenic interactions by first conducting genome-wide searches for SNPs that modify the effect of arsenic on molecular phenotypes (gene expression and DNA methylation features). Using this set of SNPs showing evidence of interaction with arsenic in relation to molecular phenotypes, we then tested SNP-arsenic interactions in relation to skin lesions, a hallmark characteristic of arsenic toxicity. With the emergence of additional omics data in the epidemiologic setting, our approach may have the potential to boost power for genome-wide interaction research, enabling the identification of interactions that will enhance our understanding of disease etiology and our ability to develop interventions targeted at susceptible sub-populations.

6 citations


Posted Content
TL;DR: In this paper, the authors proposed a scalable algorithm for learning high-dimensional linear mixed models with sublinear computational complexity dependence on the dimension of the covariates, and provided theoretical guarantees for their learning algorithms.
Abstract: Linear mixed models (LMMs) are used extensively to model dependencies among observations in linear regression and appear in many application areas. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimension $p$ of the covariates, and often use heuristics that do not offer theoretical guarantees. We present scalable algorithms for learning high-dimensional LMMs with sublinear computational complexity dependence on $p$. Key to our approach are novel dual estimators which use only kernel functions of the data, and fast computational techniques based on the subsampled randomized Hadamard transform. We provide theoretical guarantees for our learning algorithms, demonstrating the robustness of parameter estimation. Finally, we complement the theory with experiments on large synthetic and real data.

Proceedings Article
01 Mar 2018
TL;DR: Key to the approach are novel dual estimators which use only kernel functions of the data, and fast computational techniques based on the subsampled randomized Hadamard transform which provide theoretical guarantees for the learning algorithms, demonstrating the robustness of parameter estimation.
Abstract: Linear mixed models (LMMs) are used extensively to model dependencies among observations in linear regression and appear in many application areas. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimension $p$ of the covariates, and often use heuristics that do not offer theoretical guarantees. We present scalable algorithms for learning high-dimensional LMMs with sublinear computational complexity dependence on $p$. Key to our approach are novel dual estimators which use only kernel functions of the data, and fast computational techniques based on the subsampled randomized Hadamard transform. We provide theoretical guarantees for our learning algorithms, demonstrating the robustness of parameter estimation. Finally, we complement the theory with experiments on large synthetic and real data.
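
The subsampled randomized Hadamard transform named in the abstract above can be sketched on a single vector. This shows only the transform as a building block, not the paper's LMM estimators. The function names `fwht` and `srht`, the seed handling, and the example vector are assumptions for illustration, and the input length must be a power of two.

```python
import math
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def srht(vector, sketch_size, seed=0):
    """Sketch `vector` as sqrt(n/m) * P * (H / sqrt(n)) * D * vector.

    D flips signs at random, H is the Hadamard matrix, and P keeps
    `sketch_size` coordinates sampled without replacement.
    """
    rng = random.Random(seed)
    n = len(vector)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    mixed = fwht([s * v for s, v in zip(signs, vector)])
    rows = rng.sample(range(n), sketch_size)
    scale = 1.0 / math.sqrt(sketch_size)  # sqrt(n/m) / sqrt(n) simplifies to 1/sqrt(m)
    return [scale * mixed[r] for r in rows]


x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
sketch = srht(x, sketch_size=4, seed=0)
print(sketch)
```

The random sign flips followed by the Hadamard mixing spread the vector's energy across all coordinates, which is why keeping a random subset of them preserves the squared norm in expectation. The fast transform costs O(n log n) rather than the O(n^2) of a dense matrix product.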

Posted ContentDOI
19 Oct 2018-bioRxiv
TL;DR: The results suggest that lower birth weight is causally associated with an increased risk of CAD, MI and T2D in later life, supporting the fetal origins of adult diseases hypothesis.
Abstract: Background: It has long been hypothesized that birth weight has a profound long-term impact on individual predisposition to various diseases at adulthood: a hypothesis commonly referred to as the fetal origins of adult diseases. However, it is not fully clear to what extent the fetal origins of adult diseases hypothesis holds and it is also not completely known what types of adult diseases are causally affected by birth weight. Determining the causal impact of birth weight on various adult diseases through traditional randomised intervention studies is a challenging task. Methods: Mendelian randomisation was employed and multiple genetic variants associated with birth weight were used as instruments to explore the relationship between 21 adult diseases and 38 other complex traits from 37 large-scale genome-wide association studies up to 340,000 individuals of European ancestry. Causal effects of birth weight were estimated using inverse-variance weighted methods. The identified causal relationships between birth weight and adult diseases were further validated through extensive sensitivity analyses and simulations. Results: Among the 21 adult diseases, three were identified to be inversely causally affected by birth weight with a statistical significance level passing the Bonferroni corrected significance threshold. The measurement unit of birth weight was defined as its standard deviation (i.e. 488 grams), and one unit lower birth weight was causally related to an increased risk of coronary artery disease (CAD), myocardial infarction (MI), type 2 diabetes (T2D) and BMI-adjusted T2D, with the estimated odds ratios of 1.34 [95% confidence interval (CI) 1.17 - 1.53, p = 1.54E-5], 1.30 (95% CI 1.13 - 1.51, p = 3.31E-4), 1.41 (95% CI 1.15 - 1.73, p = 1.11E-3) and 1.54 (95% CI 1.25 - 1.89, p = 6.07E-5), respectively. 
All these identified causal associations were robust across various sensitivity analyses that guard against various confounding due to pleiotropy or maternal effects as well as inverse causation. In addition, analysis on 38 additional complex traits found that the inverse causal association between birth weight and CAD/MI/T2D was not likely to be mediated by other risk factors such as blood-pressure related traits and adult weight. Conclusions: The results suggest that lower birth weight is causally associated with an increased risk of CAD, MI and T2D in later life, supporting the fetal origins of adult diseases hypothesis. Keywords: Birth weight, Adult diseases, Mendelian randomisation, Causal association, Genome wide association study, Type 2 diabetes, Coronary artery disease, Myocardial infarction
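
The inverse-variance weighted estimation step mentioned in the abstract above can be sketched as follows. This is a generic fixed-effect IVW estimator, not the authors' analysis pipeline. The function names and the three instruments' summary statistics are made-up assumptions, and the sketch reports the effect per unit increase in the exposure, whereas the abstract reports odds ratios per one unit lower birth weight.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate from per-instrument summary statistics.

    Each instrument contributes the ratio beta_outcome / beta_exposure,
    weighted by beta_exposure^2 / se_outcome^2.
    """
    weights = [bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    return beta, se


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds effect and its standard error to an OR with a 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical SNP effects on the exposure (in SD units) and on the outcome (log odds).
bx = [0.10, 0.08, 0.12]
by = [-0.030, -0.020, -0.040]
se_y = [0.010, 0.012, 0.015]

beta, se = ivw_estimate(bx, by, se_y)
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta, se)
print(beta, se, odds_ratio, ci_low, ci_high)
```

A negative `beta` here means that a higher exposure lowers the odds of the outcome. Negating `beta` before exponentiating gives the odds ratio per unit decrease instead.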

Posted ContentDOI
29 Jun 2018-bioRxiv
TL;DR: It is shown that PQLseq is the only method currently available that can produce unbiased heritability estimates for sequencing count data and is well suited for differential analysis in large sequencing studies, providing calibrated type I error control and more power compared to the standard linear mixed model methods.
Abstract: Motivation: Genomic sequencing studies, including RNA sequencing and bisulfite sequencing studies, are becoming increasingly common and increasingly large. Large genomic sequencing studies open doors for accurate molecular trait heritability estimation and powerful differential analysis. Heritability estimation and differential analysis in sequencing studies requires the development of statistical methods that can properly account for the count nature of the sequencing data and that are computationally efficient for large data sets. Results: Here, we develop such a method, PQLseq (Penalized Quasi-Likelihood for sequencing count data), to enable effective and efficient heritability estimation and differential analysis using the generalized linear mixed model framework. With extensive simulations and comparisons to previous methods, we show that PQLseq is the only method currently available that can produce unbiased heritability estimates for sequencing count data. In addition, we show that PQLseq is well suited for differential analysis in large sequencing studies, providing calibrated type I error control and more power compared to the standard linear mixed model methods. Finally, we apply PQLseq to perform gene expression heritability estimation and differential expression analysis in a large RNA sequencing study in the Hutterites.

Posted ContentDOI
19 Nov 2018-bioRxiv
TL;DR: This Mendelian randomization study provides no evidence for the fetal origins of diseases hypothesis for adult asthma, implying that the impact of birth weight on asthma in childhood and adolescence does not persist into adulthood and that previous findings may be biased by confounders.
Abstract: The association between lower birth weight and childhood asthma is well established by observational studies. However, it remains unclear whether the influence of lower birth weight on asthma can persist into adulthood. Here, we conducted a Mendelian randomization analysis to assess the causal relationship of birth weight on the risk of adult asthma. Specifically, we carefully selected genetic instruments based on summary statistics obtained from large-scale genome-wide association meta-analyses of birth weight (up to ~160,000 individuals) and adult asthma (up to ~62,000 individuals). We performed Mendelian randomization using two separate approaches: a genetic risk score approach and a two-sample inverse-variance weighted (IVW) approach. With 37 genetic instruments for birth weight, we estimated the causal effect per one standard deviation (SD) change of birth weight to be an odds ratio (OR) of 1.00 (95% CI 0.98~1.03, p=0.737) using the genetic risk score method. We did not observe a nonlinear relationship or a gender difference for the estimated causal effect. In addition, with the IVW method, we estimated the causal effect of birth weight on adult asthma to be an OR of 1.02 (95% CI 0.84~1.24, p=0.813). Additionally, the iMAP method provides no additional genome-wide evidence supporting the causal effects of birth weight on adult asthma. The result of the IVW method was robust against various sensitivity analyses, and MR-PRESSO and the Egger regression showed that no instrument outliers and no horizontal pleiotropy were likely to bias the results. Overall, this Mendelian randomization study provides no evidence for the fetal origins of diseases hypothesis for adult asthma, implying that the impact of birth weight on asthma in childhood and adolescence does not persist into adulthood and that previous findings may be biased by confounders.

Posted ContentDOI
23 May 2018-bioRxiv
TL;DR: A p-value Assisted Subset Testing for Associations (pASTA) framework that generalizes the previously proposed association analysis based on subsets (ASSET) method by incorporating gene-environment (G-E) interactions into the testing procedure and allows researchers to determine the most probable subset of traits that exhibit genetic associations in addition to the enhancement of power.
Abstract: Objectives: Classical methods for combining summary data from genome-wide association studies (GWAS) only use marginal genetic effects and power can be compromised in the presence of heterogeneity. We aim to enhance the discovery of novel associated loci in the presence of heterogeneity of genetic effects in sub-groups defined by an environmental factor. Methods: We present a p-value Assisted Subset Testing for Associations (pASTA) framework that generalizes the previously proposed association analysis based on subsets (ASSET) method by incorporating gene-environment (G-E) interactions into the testing procedure. We conduct simulation studies and provide two data examples. Results: Simulation studies show that our proposal is more powerful than methods based on marginal associations in the presence of G-E interactions and maintains comparable power even in their absence. Both data examples demonstrate that our method can increase power to detect overall genetic associations and identify novel studies/phenotypes that contribute to the association. Conclusions: Our proposed method can be a useful screening tool to identify candidate single nucleotide polymorphisms (SNPs) that are potentially associated with the trait(s) of interest for further validation. It also allows researchers to determine the most probable subset of traits that exhibit genetic associations in addition to the enhancement of power.

Posted ContentDOI
19 Oct 2018-bioRxiv
TL;DR: This study provides important evidence supporting the causal role of higher LDL on increasing the risk of ALS, paving ways for the development of preventative strategies for reducing the disease burden of ALS across multiple nations.
Abstract: Amyotrophic lateral sclerosis (ALS) is a late-onset fatal neurodegenerative disorder that is predicted to increase across the globe by ~70% in the following decades. Understanding the disease causal mechanism underlying ALS and identifying modifiable risk factors for ALS hold the key for the development of effective preventative and treatment strategies. Here, we investigate the causal effects of four blood lipid traits that include high density lipoprotein (HDL), low density lipoprotein (LDL), total cholesterol (TC), and triglycerides (TG) on the risk of ALS. By leveraging instrumental variables from multiple large-scale genome-wide association studies in both European and East Asian populations, we carry out one of the largest and most comprehensive Mendelian randomization analyses performed to date on the causal relationship between lipids and ALS. Among the four lipids, we found that only LDL is causally associated with ALS and that higher LDL level increases the risk of ALS in both the European and East Asian populations. Specifically, the odds ratio of ALS per one standard deviation (i.e. 39.0 mg/dL) increase of LDL is estimated to be 1.14 (95% CI 1.05 - 1.24, p = 1.38E-3) in the European population and 1.06 (95% CI 1.00 - 1.12, p = 0.044) in the East Asian population. The identified causal relationship between LDL and ALS is robust with respect to the choice of statistical methods and is validated through extensive sensitivity analyses that guard against various model assumption violations. Our study provides important evidence supporting the causal role of higher LDL on increasing the risk of ALS, paving ways for the development of preventative strategies for reducing the disease burden of ALS across multiple nations.