Showing papers by "Xiang Zhou" published in 2019


Journal ArticleDOI
TL;DR: This work compares 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample sizes and provides important guidelines for choosing dimensionality reduction methods for scRNA-seq data analysis.
Abstract: Dimensionality reduction is an indispensable analytic component for many areas of single-cell RNA sequencing (scRNA-seq) data analysis. Proper dimensionality reduction can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of dimensionality reduction in scRNA-seq analysis and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods in scRNA-seq. We aim to fill this critical knowledge gap by providing a comparative evaluation of a variety of commonly used dimensionality reduction methods for scRNA-seq studies. Specifically, we compare 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample sizes. We evaluate the performance of different dimensionality reduction methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness. We also evaluate the computational scalability of different dimensionality reduction methods by recording their computational cost. Based on the comprehensive evaluation results, we provide important guidelines for choosing dimensionality reduction methods for scRNA-seq data analysis. We also provide all analysis scripts used in the present study at www.xzlab.org/reproduce.html.

120 citations
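
The neighborhood-preserving criterion mentioned in the abstract above can be illustrated with a small sketch. It assumes PCA as the dimensionality reduction method and a mean Jaccard index between k-nearest-neighbour sets as the score; neither choice is stated in the abstract, and the count matrix, `k`, and the function names are hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X (cells x genes) onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def knn_sets(X, k):
    """For each row, return the set of indices of its k nearest rows (Euclidean)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                    # a cell is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]
    return [set(row.tolist()) for row in idx]

def mean_jaccard(X_full, X_low, k):
    """Average overlap between neighbour sets before and after reduction."""
    full = knn_sets(X_full, k)
    low = knn_sets(X_low, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(full, low)]))

# Hypothetical toy data: 60 cells x 200 genes of Poisson counts, log-transformed.
rng = np.random.default_rng(0)
X = np.log1p(rng.poisson(2.0, size=(60, 200)).astype(float))
Z = pca_reduce(X, n_components=10)
score = mean_jaccard(X, Z, k=5)   # 1.0 would mean every neighbourhood is preserved
```

A score near 1 indicates that cells keep the same neighbours after reduction; comparing scores across methods and numbers of components is one way such a benchmark could be organised.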


Journal ArticleDOI
TL;DR: This study provides important evidence supporting the causal role of higher LDL on increasing the risk of ALS, paving ways for the development of preventative strategies for reducing the disease burden of ALS across multiple nations.
Abstract: Amyotrophic lateral sclerosis (ALS) is a late-onset fatal neurodegenerative disorder that is predicted to increase across the globe by ~70% in the following decades. Understanding the causal mechanism underlying ALS and identifying modifiable risk factors for ALS hold the key to the development of effective preventative and treatment strategies. Here, we investigate the causal effects of four blood lipid traits that include high-density lipoprotein, low-density lipoprotein (LDL), total cholesterol and triglycerides on the risk of ALS. By leveraging instrumental variables from multiple large-scale genome-wide association studies in both European and East Asian populations, we carry out one of the largest and most comprehensive Mendelian randomization analyses performed to date on the causal relationship between lipids and ALS. Among the four lipids, we found that only LDL is causally associated with ALS and that higher LDL level increases the risk of ALS in both the European and East Asian populations. Specifically, the odds ratio of ALS per 1 standard deviation (i.e. 39.0 mg/dL) increase of LDL is estimated to be 1.14 [95% confidence interval (CI), 1.05-1.24; P = 1.38E-3] in the European population and 1.06 (95% CI, 1.00-1.12; P = 0.044) in the East Asian population. The identified causal relationship between LDL and ALS is robust with respect to the choice of statistical methods and is validated through extensive sensitivity analyses that guard against various model assumption violations. Our study provides important evidence supporting the causal role of higher LDL in increasing the risk of ALS, paving the way for the development of preventative strategies for reducing the disease burden of ALS across multiple nations.

81 citations


Journal ArticleDOI
TL;DR: New evidence is provided supporting the causal neuroprotective role of T2D on ALS in the European population, along with empirically suggestive evidence that T2D increases the risk of ALS in the East Asian population.
Abstract: Associations between type 2 diabetes (T2D) and amyotrophic lateral sclerosis (ALS) were discovered in observational studies in both European and East Asian populations. However, whether such associations are causal remains largely unknown. We employed a two-sample Mendelian randomization approach to evaluate the causal relationship of T2D with the risk of ALS in both European and East Asian populations. Our analysis was implemented using summary statistics obtained from large-scale genome-wide association studies with ~660,000 individuals for T2D and ~81,000 individuals for ALS in the European population, and ~191,000 individuals for T2D and ~4100 individuals for ALS in the East Asian population. The causal relationship between T2D and ALS in both populations was estimated using the inverse-variance-weighted methods and was further validated through extensive complementary and sensitivity analyses. Using multiple instruments that were strongly associated with T2D, a negative association between T2D and ALS was identified in the European population with the odds ratio (OR) estimated to be 0.93 (95% CI 0.88–0.99, p = 0.023), while a positive association between T2D and ALS was observed in the East Asian population with OR = 1.28 (95% CI 0.99–1.62, p = 0.058). These results were robust against instrument selection, various modeling misspecifications, and estimation biases, with the Egger regression and MR-PRESSO ruling out the possibility of horizontal pleiotropic effects of instruments. However, no causal association was found between T2D-related exposures (including glycemic traits) and ALS in the European population. Our results provide new evidence supporting the causal neuroprotective role of T2D on ALS in the European population and provide empirically suggestive evidence of increasing risk of T2D on ALS in the East Asian population. 
Our results have an important implication on ALS pathology, paving ways for developing therapeutic strategies across multiple populations.

56 citations
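
The inverse-variance-weighted method named in the abstract above can be sketched from summary statistics as follows. This is a minimal fixed-effect version assuming independent instruments; the per-variant effect sizes are hypothetical toy values, not data from the study, and the function name is illustrative.

```python
import math

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted MR estimate.

    Equivalent to a weighted regression of variant-outcome effects on
    variant-exposure effects through the origin, with weights 1 / se_outcome^2.
    Assumes the instruments are independent of one another.
    """
    num = sum(bx * by / s ** 2
              for bx, by, s in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / s ** 2
              for bx, s in zip(beta_exposure, se_outcome))
    beta = num / den
    se = math.sqrt(1.0 / den)
    return beta, se

# Hypothetical summary statistics for five variants (log-odds scale).
bx = [0.10, 0.08, 0.12, 0.05, 0.09]          # variant -> exposure
by = [-0.008, -0.005, -0.010, -0.002, -0.007]  # variant -> outcome
se = [0.004, 0.005, 0.004, 0.006, 0.005]     # standard errors of by

beta, beta_se = ivw_estimate(bx, by, se)
odds_ratio = math.exp(beta)
ci_low = math.exp(beta - 1.96 * beta_se)
ci_high = math.exp(beta + 1.96 * beta_se)
```

Exponentiating the estimate and its 95% interval gives an odds ratio in the same form as the values reported in the abstract.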


Journal ArticleDOI
TL;DR: It is shown that PQLseq is the only method currently available that can produce unbiased heritability estimates for sequencing count data and is well suited for differential analysis in large sequencing studies, providing calibrated type I error control and more power compared to the standard linear mixed model methods.
Abstract: Motivation: Genomic sequencing studies, including RNA sequencing and bisulfite sequencing studies, are becoming increasingly common and increasingly large. Large genomic sequencing studies open doors for accurate molecular trait heritability estimation and powerful differential analysis. Heritability estimation and differential analysis in sequencing studies require the development of statistical methods that can properly account for the count nature of the sequencing data and that are computationally efficient for large datasets. Results: Here, we develop such a method, PQLseq (Penalized Quasi-Likelihood for sequencing count data), to enable effective and efficient heritability estimation and differential analysis using the generalized linear mixed model framework. With extensive simulations and comparisons to previous methods, we show that PQLseq is the only method currently available that can produce unbiased heritability estimates for sequencing count data. In addition, we show that PQLseq is well suited for differential analysis in large sequencing studies, providing calibrated type I error control and more power compared to the standard linear mixed model methods. Finally, we apply PQLseq to perform gene expression heritability estimation and differential expression analysis in a large RNA sequencing study in the Hutterites. Availability and implementation: PQLseq is implemented as an R package with source code freely available at www.xzlab.org/software.html and https://cran.r-project.org/web/packages/PQLseq/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

53 citations


Posted ContentDOI
21 Oct 2019-bioRxiv
TL;DR: SPARK is up to ten times more powerful than existing approaches in real data applications, and its high power allows the identification of new genes and pathways that reveal new biology in the data that otherwise cannot be revealed by existing approaches.
Abstract: Recent development of various spatially resolved transcriptomic techniques has enabled gene expression profiling on complex tissues with spatial localization information. Identifying genes that display spatial expression patterns in these studies is an important first step towards characterizing the spatial transcriptomic landscape. Detecting spatially expressed genes requires the development of statistical methods that can properly model spatial count data, provide effective type I error control, have sufficient statistical power, and are computationally efficient. Here, we developed such a method, SPARK. SPARK directly models count data generated from various spatially resolved transcriptomic techniques through generalized linear spatial models. With a new efficient penalized quasi-likelihood based algorithm, SPARK is scalable to data sets with tens of thousands of genes measured on tens of thousands of samples. Importantly, SPARK relies on newly developed statistical formulas for hypothesis testing, producing well-calibrated p-values and yielding high statistical power. We illustrate the benefits of SPARK through extensive simulations and in-depth analysis of four published spatially resolved transcriptomic data sets. In the real data applications, SPARK is up to ten times more powerful than existing approaches. The high power of SPARK allows us to identify new genes and pathways that reveal new biology in the data that otherwise cannot be revealed by existing approaches.

46 citations


Journal ArticleDOI
TL;DR: It is suggested that lower birth weight is causally associated with an increased risk of CAD, MI, and T2D in later life, supporting the fetal origins of adult diseases hypothesis.
Abstract: Purpose: Birth weight has a profound long-term impact on an individual’s predisposition to various diseases in adulthood, a hypothesis commonly referred to as the fetal origins of adult diseases. However, it is not fully clear to what extent the fetal origins of adult diseases hypothesis holds, and it is also not completely known what types of adult diseases are causally affected by birth weight. Materials and methods: Mendelian randomisation using multiple genetic instruments associated with birth weight was performed to explore the causal relationship between birth weight and adult diseases. The causal relationships between birth weight and 21 adult diseases as well as 38 other complex traits were examined based on data collected from 37 large-scale genome-wide association studies with up to 340,000 individuals of European ancestry. Causal effects of birth weight were estimated using inverse-variance weighted methods. The identified causal relationships between birth weight and adult diseases were further validated through extensive sensitivity analyses, bias calculation and simulations. Results: Among the 21 adult diseases, three were identified to be inversely causally affected by birth weight after the Bonferroni correction. The measurement unit of birth weight was defined as its standard deviation (i.e. 488 grams), and one unit lower birth weight was causally related to an increased risk of coronary artery disease (CAD), myocardial infarction (MI), type 2 diabetes (T2D) and BMI-adjusted T2D, with the estimated odds ratios of 1.34 [95% confidence interval (CI) 1.17-1.53], 1.30 (95% CI 1.13-1.51), 1.41 (95% CI 1.15-1.73) and 1.54 (95% CI 1.25-1.89), respectively. All these identified causal associations were robust across various sensitivity analyses that guard against confounding due to pleiotropy or maternal effects as well as reverse causation. 
In addition, analysis on 38 additional complex traits did not identify candidate traits that may mediate the causal association between birth weight and CAD/MI/T2D. Conclusions: The results suggest that lower birth weight is causally associated with an increased risk of CAD, MI and T2D in later life, supporting the fetal origins of adult diseases hypothesis.

38 citations
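
The Bonferroni step in the abstract above can be made concrete with a short calculation. The sketch assumes a family-wise level of 0.05 divided over the 21 diseases (the abstract does not state the level) and recovers approximate two-sided p-values from the rounded odds ratios and 95% intervals using a normal approximation on the log scale; these are rough reconstructions, not the authors' reported p-values.

```python
import math

def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate two-sided p-value from an odds ratio and its 95% CI.

    Works on the log-odds scale and assumes a symmetric normal interval,
    so the result is only as precise as the rounded inputs.
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = log_or / se
    return math.erfc(abs(z) / math.sqrt(2.0))

n_tests = 21                 # number of adult diseases examined (from the abstract)
alpha = 0.05                 # assumed family-wise error level
threshold = alpha / n_tests  # Bonferroni-adjusted per-test threshold

# Odds ratios and 95% CIs as reported in the abstract.
reported = {
    "CAD": (1.34, 1.17, 1.53),
    "MI": (1.30, 1.13, 1.51),
    "T2D": (1.41, 1.15, 1.73),
    "BMI-adjusted T2D": (1.54, 1.25, 1.89),
}
approx_p = {name: p_from_or_ci(*vals) for name, vals in reported.items()}
passes = {name: p < threshold for name, p in approx_p.items()}
```

Under these assumptions each reconstructed p-value falls below the adjusted threshold, which is consistent with the associations being described as significant after correction.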


Journal ArticleDOI
TL;DR: A collaborative mixed model (CoMM) is proposed to investigate the mechanistic role of associated variants in complex traits and indicates that by leveraging regulatory information, CoMM can effectively improve the power of prioritizing risk variants.
Abstract: Motivation: Genome-wide association studies (GWASs) have been successful in identifying many genetic variants associated with complex traits. However, the mechanistic links between these variants and complex traits remain elusive. A scientific hypothesis is that genetic variants influence complex traits at the organismal level via affecting cellular traits, such as regulating gene expression and altering protein abundance. Although earlier works have already presented some scientific insights about this hypothesis and their findings are very promising, statistical methods that effectively harness multilayered data (e.g. genetic variants, cellular traits and organismal traits) on a large scale for functional and mechanistic exploration are in high demand. Results: In this study, we propose a collaborative mixed model (CoMM) to investigate the mechanistic role of associated variants in complex traits. The key idea is built upon the emerging scientific evidence that genetic effects at the cellular level are much stronger than those at the organismal level. Briefly, CoMM combines two models: the first model relates gene expression to genotype, and the second model relates phenotype to the gene expression predicted by the first model. The two models are fitted jointly in CoMM, such that the uncertainty in predicting gene expression is fully accounted for. To demonstrate the advantages of CoMM over existing methods, we conducted extensive simulation studies, and also applied CoMM to analyze 25 traits in NFBC1966 and Genetic Epidemiology Research on Aging (GERA) studies by integrating transcriptome information from the Genetic European Variation in Health and Disease (GEUVADIS) Project. The results indicate that by leveraging regulatory information, CoMM can effectively improve the power of prioritizing risk variants. Regarding computational efficiency, CoMM can complete the analysis of the NFBC1966 and GERA datasets in 2 and 18 min, respectively. 
Availability and implementation: The developed R package is available at https://github.com/gordonliu810822/CoMM. Supplementary information: Supplementary data are available at Bioinformatics online.

36 citations
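
The two-model structure described in the CoMM abstract above can be illustrated with a simplified sequential version. Note the difference: this sketch fits the expression model first and then plugs its predictions into the phenotype model, whereas CoMM fits both jointly to account for prediction uncertainty. The simulated data, the ridge penalty, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_eqtl, n_gwas, n_snps = 300, 1000, 20
true_alpha = 0.5                                  # simulated expression -> phenotype effect

# Simulated eQTL panel: genotypes and expression of one gene.
w = rng.normal(0.0, 0.3, size=n_snps)             # simulated SNP -> expression effects
G1 = rng.binomial(2, 0.3, size=(n_eqtl, n_snps)).astype(float)
expr = G1 @ w + rng.normal(size=n_eqtl)

# Simulated GWAS cohort: genotypes and phenotype, no measured expression.
G2 = rng.binomial(2, 0.3, size=(n_gwas, n_snps)).astype(float)
pheno = true_alpha * (G2 @ w) + rng.normal(size=n_gwas)

def fit_expression_model(G, y, ridge=1.0):
    """Model 1: ridge regression of expression on centered genotypes."""
    mu = G.mean(axis=0)
    Gc = G - mu
    yc = y - y.mean()
    w_hat = np.linalg.solve(Gc.T @ Gc + ridge * np.eye(G.shape[1]), Gc.T @ yc)
    return w_hat, mu

def fit_phenotype_model(pred_expr, y):
    """Model 2: least-squares slope of phenotype on predicted expression."""
    xc = pred_expr - pred_expr.mean()
    yc = y - y.mean()
    return float(xc @ yc / (xc @ xc))

w_hat, mu = fit_expression_model(G1, expr)
pred_expr = (G2 - mu) @ w_hat                     # predicted expression in the GWAS cohort
alpha_hat = fit_phenotype_model(pred_expr, pheno)
```

Because the second step treats predicted expression as if it were observed, its standard errors would ignore the uncertainty from the first step, which is the gap the joint fit is described as closing.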


Posted ContentDOI
03 Jul 2019-bioRxiv
TL;DR: PMR-Egger is presented, which relies on an MR likelihood framework that unifies many existing TWAS and MR methods, accommodates multiple correlated instruments, tests the causal effect of gene on trait in the presence of horizontal pleiotropic effects, and is scalable to hundreds of thousands of individuals.
Abstract: Integrating association results from both genome-wide association studies (GWASs) and expression quantitative trait locus (eQTL) mapping studies has the potential to shed light on the molecular mechanism underlying disease etiology. Several statistical methods have been recently developed to integrate GWASs with eQTL studies in the form of transcriptome-wide association studies (TWASs). These existing methods can all be viewed as two-sample Mendelian randomization (MR) methods, which are also widely used in various GWASs for inferring the causal relationship among complex traits. Unfortunately, most existing TWAS and MR methods make an unrealistic modeling assumption that the instrumental variables do not exhibit horizontal pleiotropic effects. However, horizontal pleiotropic effects have been recently discovered to be widespread across complex traits, and as we will show here, are also widespread across gene expression traits. Therefore, allowing for no horizontal pleiotropic effects can be overly restrictive, and, as we will show here, can lead to a substantial inflation of test statistics and subsequently false discoveries in TWAS applications. Here, we present a probabilistic MR method, which we refer to as PMR-Egger, for testing and controlling for horizontal pleiotropic effects in TWAS applications. PMR-Egger relies on a new MR likelihood framework that unifies many existing TWAS and MR methods, accommodates multiple correlated instruments, tests the causal effect of gene on trait in the presence of horizontal pleiotropy, and, with a newly developed parameter expansion version of the expectation maximization algorithm, is scalable to hundreds of thousands of individuals. 
With extensive simulations, we show that PMR-Egger provides calibrated type I error control for causal effect testing in the presence of horizontal pleiotropic effects, is reasonably robust for various types of horizontal pleiotropic effect mis-specifications, is more powerful than existing MR approaches, and, as a by-product, can directly test for horizontal pleiotropy. We illustrate the benefits of our method in applications to 39 diseases and complex traits obtained from three GWASs including the UK Biobank and show how PMR-Egger can lead to new biological discoveries through integrative analysis.

32 citations
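
The Egger idea behind the method above, allowing a non-zero intercept to absorb directional horizontal pleiotropy, can be sketched with the classic summary-statistic MR-Egger regression. This is not PMR-Egger itself, which the abstract describes as a likelihood framework handling correlated instruments; the sketch assumes independent variants oriented so that exposure effects are positive, and the input values are hypothetical.

```python
def egger_regression(beta_exposure, beta_outcome, se_outcome):
    """Weighted least squares of outcome effects on exposure effects with an intercept.

    Returns (intercept, slope). The slope is the causal effect estimate; an
    intercept away from zero suggests directional horizontal pleiotropy.
    Weights are 1 / se_outcome^2.
    """
    weights = [1.0 / s ** 2 for s in se_outcome]
    total = sum(weights)
    mean_x = sum(wt * x for wt, x in zip(weights, beta_exposure)) / total
    mean_y = sum(wt * y for wt, y in zip(weights, beta_outcome)) / total
    sxx = sum(wt * (x - mean_x) ** 2 for wt, x in zip(weights, beta_exposure))
    sxy = sum(wt * (x - mean_x) * (y - mean_y)
              for wt, x, y in zip(weights, beta_exposure, beta_outcome))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical summary statistics: every variant carries the same pleiotropic
# shift of 0.02 on top of a causal effect of 0.4.
bx = [0.05, 0.10, 0.15, 0.20, 0.25]
by = [0.02 + 0.4 * x for x in bx]
se = [0.01, 0.02, 0.01, 0.02, 0.01]

intercept, slope = egger_regression(bx, by, se)
```

In this toy case a regression forced through the origin would attribute the constant shift to the causal effect, while the intercept term separates the two.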


Posted ContentDOI
17 May 2019-bioRxiv
TL;DR: This work compared 18 different DR methods on 30 publicly available scRNAseq data sets that cover a range of sequencing techniques and sample sizes and evaluated the performance of different DR methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix.
Abstract: Background: Dimensionality reduction (DR) is an indispensable analytic component for many areas of single cell RNA sequencing (scRNAseq) data analysis. Proper DR can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of DR in scRNAseq analysis and the vast number of DR methods developed for scRNAseq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different DR methods in scRNAseq. Results: Here, we aim to fill this critical knowledge gap by providing a comparative evaluation of a variety of commonly used DR methods for scRNAseq studies. Specifically, we compared 18 different DR methods on 30 publicly available scRNAseq data sets that cover a range of sequencing techniques and sample sizes. We evaluated the performance of different DR methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness. We also evaluated the computational scalability of different DR methods by recording their computational cost. Conclusions: Based on the comprehensive evaluation results, we provide important guidelines for choosing DR methods for scRNAseq data analysis. We also provide all analysis scripts used in the present study at www.xzlab.org/reproduce.html. Together, we hope that our results will serve as an important practical reference for practitioners to choose DR methods in the field of scRNAseq analysis.

28 citations


Journal ArticleDOI
TL;DR: It is found that combining all reads from the single cells and following GATK Best Practices resulted in the highest number of SNVs identified with a high concordance, and that SNV calling quality varies across different functional genomic regions.
Abstract: Integrating single-cell RNA sequencing (scRNA-seq) data with genotypes obtained from DNA sequencing studies facilitates the detection of functional genetic variants underlying cell type-specific gene expression variation. Unfortunately, most existing scRNA-seq studies do not come with DNA sequencing data; thus, being able to call single nucleotide variants (SNVs) from scRNA-seq data alone can provide crucial and complementary information for the detection of functional SNVs, maximizing the potential of existing scRNA-seq studies. Here, we perform extensive analyses to evaluate the utility of two SNV calling pipelines (GATK and Monovar), originally designed for SNV calling in either bulk or single-cell DNA sequencing data. In both pipelines, we examined various parameter settings to determine the accuracy of the final SNV call set and provide practical recommendations for applied analysts. We found that combining all reads from the single cells and following GATK Best Practices resulted in the highest number of SNVs identified with a high concordance. In individual single cells, Monovar resulted in better quality SNVs even though none of the pipelines analyzed is capable of calling a reasonable number of SNVs with high accuracy. In addition, we found that SNV calling quality varies across different functional genomic regions. Our results open doors for novel ways to leverage the use of scRNA-seq for the future investigation of SNV function.

22 citations


Journal ArticleDOI
TL;DR: A statistical method is developed, IMAGE, for mQTL mapping in sequencing-based methylation studies that properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery.
Abstract: Identifying genetic variants that are associated with methylation variation—an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping—is important for understanding the epigenetic mechanisms underlying genotype-trait associations. Here, we develop a statistical method, IMAGE, for mQTL mapping in sequencing-based methylation studies. IMAGE properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery. We compare IMAGE with existing approaches through extensive simulation. We also apply IMAGE to analyze two bisulfite sequencing studies, in which IMAGE identifies more mQTL than existing approaches.

Posted ContentDOI
29 May 2019-bioRxiv
TL;DR: A novel probabilistic model is proposed, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data.
Abstract: Motivation: Although genome-wide association studies (GWAS) have deepened our understanding of the genetic architecture of complex traits, the mechanistic links that underlie how genetic variants cause complex traits remain elusive. To advance our understanding of the underlying mechanistic links, various consortia have collected a vast volume of genomic data that enable us to investigate the role that genetic variants play in gene expression regulation. Recently, a collaborative mixed model (CoMM) [42] was proposed to jointly interrogate the genome on complex traits by integrating both the GWAS dataset and the expression quantitative trait loci (eQTL) dataset. Although CoMM is a powerful approach that leverages regulatory information while accounting for the uncertainty in using an eQTL dataset, it requires individual-level GWAS data and cannot fully make use of widely available GWAS summary statistics. Therefore, statistically efficient methods that leverage transcriptome information using only summary statistics from GWAS data are required. Results: In this study, we propose a novel probabilistic model, CoMM-S2, to examine the mechanistic role that genetic variants play, by using only GWAS summary statistics instead of individual-level GWAS data. Similar to CoMM, which uses individual-level GWAS data, CoMM-S2 combines two models: the first model examines the relationship between gene expression and genotype, while the second model examines the relationship between the phenotype and the predicted gene expression from the first model. Distinct from CoMM, CoMM-S2 requires only GWAS summary statistics. Using both simulation studies and real data analysis, we demonstrate that even though CoMM-S2 utilizes GWAS summary statistics, its performance is comparable to that of CoMM, which uses individual-level GWAS data. 
Contact: jin.liu@duke-nus.edu.sg Availability and implementation: The implementation of CoMM-S2 is included in the CoMM package that can be downloaded from https://github.com/gordonliu8100822/CoMM. Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: The proposed efficient genome-wide multivariate association algorithms (GMA) for longitudinal data have improved statistical power for association detection and computational speed, and can analyze unbalanced longitudinal data with thousands of individuals and more than ten thousand records within a few hours.
Abstract: Motivation: Current dynamic phenotyping systems introduce time as an extra dimension to genome-wide association studies (GWAS), which helps to explore the mechanism of dynamic genetic control for complex longitudinal traits. However, existing methods for longitudinal GWAS either ignore the covariance among observations at different time points or encounter computational efficiency issues. Results: We herein developed efficient genome-wide multivariate association algorithms for longitudinal data. In contrast to existing univariate linear mixed model analyses, the proposed method has improved statistical power for association detection and computational speed. In addition, the new method can analyze unbalanced longitudinal data with thousands of individuals and more than ten thousand records within a few hours. The corresponding time for balanced longitudinal data is just a few minutes. Availability and implementation: A software package implementing the efficient algorithm, named GMA (https://github.com/chaoning/GMA), is freely available for interested users in relevant fields. Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This study provides no evidence for the fetal origins of diseases hypothesis for adult asthma, implying that the impact of birth weight on asthma in childhood and adolescence does not persist into adulthood and that previous findings may be biased by confounders.
Abstract: The association between lower birth weight and childhood asthma is well established. However, it remains unclear whether the influence of lower birth weight on asthma can persist into adulthood. We conducted a Mendelian randomization analysis to assess the causal relationship between birth weight (~140,000 individuals) and the risk of adult asthma (~62,000 individuals). We estimated the causal effect of birth weight to be 1.00 (95% CI 0.98–1.03, p = 0.737) using the genetic risk score method. We did not observe a nonlinear relationship or gender difference for the estimated causal effect. With the inverse-variance weighted method, the causal effect of birth weight on adult asthma was estimated to be 1.02 (95% CI 0.84–1.24, p = 0.813). Additionally, the iMAP method provides no additional genome-wide evidence supporting the causal effects of birth weight on adult asthma. Our results were robust against various sensitivity analyses, and MR-PRESSO and MR-Egger regression showed that no instrument outliers and no horizontal pleiotropy were likely to bias the results. Overall, our study provides no evidence for the fetal origins of diseases hypothesis for adult asthma, implying that the impact of birth weight on asthma in childhood and adolescence does not persist into adulthood and that previous findings may be biased by confounders.

Posted ContentDOI
17 Jul 2019-bioRxiv
TL;DR: The results provide empirical evidence supporting one hypothesis of the omnigenic model: that trait-relevant gene co-expression networks underlie disease etiology.
Abstract: Genome-wide association studies (GWASs) have identified many SNPs associated with various common diseases. Understanding the biological functions of these identified SNP associations requires identifying disease/trait relevant tissues or cell types. Here, we develop a network method, CoCoNet, to facilitate the identification of trait-relevant tissues or cell types. Different from existing approaches, CoCoNet incorporates tissue-specific gene co-expression networks constructed from either bulk or single cell RNA sequencing (RNAseq) studies with GWAS data for trait-tissue inference. In particular, CoCoNet relies on a covariance regression network model to express gene-level effect sizes for the given GWAS trait as a function of the tissue-specific co-expression adjacency matrix. With a composite likelihood-based inference algorithm, CoCoNet is scalable to tens of thousands of genes. We validate the performance of CoCoNet through extensive simulations. We apply CoCoNet for an in-depth analysis of four neurological disorders and four autoimmune diseases, where we integrate the corresponding GWASs with bulk RNAseq data from 38 tissues and single cell RNAseq data from 10 cell types. In the real data applications, we show how CoCoNet can help identify specific glial cell types relevant for neurological disorders and identify disease-targeted colon tissues as relevant for autoimmune diseases. Our results also provide empirical evidence supporting one hypothesis of the omnigenic model: that trait-relevant gene co-expression networks underlie disease etiology.

Journal ArticleDOI
TL;DR: In conclusion, epigenetic signatures for cigarette smoking that may have been missed in cross-sectional analyses are identified, providing insight into the epigenetic effect of smoking and highlighting the importance of longitudinal analysis in understanding the dynamic human epigenome.
Abstract: Changes in DNA methylation may be a potential mechanism that mediates the effects of smoking on physiological function and subsequent disease risk. Given the dynamic nature of the epigenome, longit...

Posted ContentDOI
22 Apr 2019-bioRxiv
TL;DR: A statistical method is developed, IMAGE, for mQTL mapping in sequencing-based methylation studies that properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery.
Abstract: DNA methylation is an important gene regulatory mechanism that contributes to the genotype-phenotype relationship. Identifying genetic variants that are associated with methylation variation, an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping, is therefore important for understanding the biological mechanisms underlying genotype-trait associations, and for investigating the potential causal or mediating effects of DNA methylation on phenotypic outcomes. However, existing approaches for mQTL mapping do not fare well in high-throughput sequencing-based data sets, as these approaches do not directly model the count generating process in sequencing studies and fail to take advantage of allele-specific methylation patterns. Here, we develop a new statistical method, IMAGE, together with a scalable computational inference algorithm, for mQTL mapping in sequencing-based studies. Our method properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery. We compare IMAGE with existing approaches through extensive simulation. We also apply IMAGE to analyze two large-scale bisulfite sequencing studies of wild baboons and wild wolves, in which IMAGE identifies 50%-64% more mQTL than existing approaches. In both cases, mQTL are significantly depleted in CpG islands but enriched in shelf and open sea regions, suggesting that genetic variation is most likely to contribute to DNA methylation variation in regions of the genome with dynamic methylation patterns.

Posted ContentDOI
29 Nov 2019-bioRxiv
TL;DR: This study serves as a biological reference for future single cell studies of toxicant or neuronal complications, to ultimately characterize the molecular basis by which Pb influences cognition and behavior.
Abstract: Background: Lead (Pb) exposure is ubiquitous and has permanent developmental effects on childhood intelligence and behavior and adulthood risk of dementia. The hippocampus is a key brain region involved in learning and memory, and its cellular composition is highly heterogeneous. Pb acts on the hippocampus by altering gene expression, but the cell type-specific responses are unknown. Objective: Examine the effects of perinatal Pb treatment on adult hippocampus gene expression, at the level of individual cells, in mice. Methods: In mice perinatally exposed to control water (n=4) or a human physiologically-relevant level (32 ppm in maternal drinking water) of Pb (n=4), two weeks prior to mating through weaning, we tested for gene expression and cellular differences in the hippocampus at 5 months of age. Analysis was performed using single cell RNA-sequencing of 5,258 cells from the hippocampus by 10x Genomics Chromium to 1) test for gene expression differences averaged across all cells by treatment; 2) compare cell cluster composition by treatment; and 3) test for gene expression and pathway differences within cell clusters by treatment. Results: Gene expression patterns revealed 12 cell clusters in the hippocampus, mapping to major expected cell types (e.g. microglia, astrocytes, neurons, oligodendrocytes). Perinatal Pb treatment was associated with 12.4% more oligodendrocytes (P = 4.4E-21) in adult mice. Across all cells, differential gene expression analysis by Pb treatment revealed cluster marker genes. Within cell clusters, differential gene expression with Pb treatment (q < 0.05) was observed in endothelial, microglial, pericyte, and astrocyte cells. Pathways up-regulated with Pb treatment were protein folding in microglia (P = 3.4E-9) and stress response in oligodendrocytes (P = 3.2E-5). Conclusion: Bulk tissue analysis may be confounded by changes in cell type composition and may obscure effects within vulnerable cell types. 
This study serves as a biological reference for future single cell studies of toxicant or neuronal complications, to ultimately characterize the molecular basis by which Pb influences cognition and behavior.