Showing papers by "David R. Kelley" published in 2021


Journal Article
TL;DR: In this article, a semisupervised, adversarial neural network is proposed to transfer cell identity annotations from one experiment to another by taking advantage of information in both labeled data sets and new, unlabeled data sets.
Abstract: Annotating cell identities is a common bottleneck in the analysis of single-cell genomics experiments. Here, we present scNym, a semisupervised, adversarial neural network that learns to transfer cell identity annotations from one experiment to another. scNym takes advantage of information in both labeled data sets and new, unlabeled data sets to learn rich representations of cell identity that enable effective annotation transfer. We show that scNym effectively transfers annotations across experiments despite biological and technical differences, achieving performance superior to existing methods. We also show that scNym models can synthesize information from multiple training and target data sets to improve performance. We show that in addition to high accuracy, scNym models are well calibrated and interpretable with saliency methods.

43 citations


Journal Article
TL;DR: In this paper, the expression modifier score (EMS) is used as a prior for statistical fine-mapping of eQTLs to identify an additional 20,913 putatively causal eQTLs, and is incorporated into co-localization analysis to identify 310 additional candidate genes across UK Biobank phenotypes.
Abstract: The large majority of variants identified by GWAS are non-coding, motivating detailed characterization of the function of non-coding variants. Experimental methods to assess variants' effect on gene expressions in native chromatin context via direct perturbation are low-throughput. Existing high-throughput computational predictors thus have lacked large gold standard sets of regulatory variants for training and validation. Here, we leverage a set of 14,807 putative causal eQTLs in humans obtained through statistical fine-mapping, and we use 6121 features to directly train a predictor of whether a variant modifies nearby gene expression. We call the resulting prediction the expression modifier score (EMS). We validate EMS by comparing its ability to prioritize functional variants with other major scores. We then use EMS as a prior for statistical fine-mapping of eQTLs to identify an additional 20,913 putatively causal eQTLs, and we incorporate EMS into co-localization analysis to identify 310 additional candidate genes across UK Biobank phenotypes.

29 citations


Journal Article
TL;DR: In this article, the authors examined a dataset comprising ~2 million nuclei spanning E9.5-E13.5 of mouse embryonic development to quantify transcriptome-wide changes in alternative polyadenylation (APA).
Abstract: 3′ untranslated regions (3′ UTRs) post-transcriptionally regulate mRNA stability, localization, and translation rate. While 3′-UTR isoforms have been globally quantified in limited cell types using bulk measurements, their differential usage among cell types during mammalian development remains poorly characterized. In this study, we examine a dataset comprising ~2 million nuclei spanning E9.5–E13.5 of mouse embryonic development to quantify transcriptome-wide changes in alternative polyadenylation (APA). We observe a global lengthening of 3′ UTRs across embryonic stages in all cell types, although we detect shorter 3′ UTRs in hematopoietic lineages and longer 3′ UTRs in neuronal cell types within each stage. An analysis of RNA-binding protein (RBP) dynamics identifies ELAV-like family members, which are concomitantly induced in neuronal lineages and developmental stages experiencing 3′-UTR lengthening, as putative regulators of APA. By measuring 3′-UTR isoforms in an expansive single cell dataset, our work provides a transcriptome-wide and organism-wide map of the dynamic landscape of alternative polyadenylation during mammalian organogenesis. Alternative polyadenylation regulates localization, half-life and translation of mRNA isoforms. Here the authors investigate alternative polyadenylation using single cell RNA sequencing data from mouse embryos and identify 3’-UTR isoforms that are regulated across cell types and developmental time.

23 citations


Posted Content
08 Apr 2021-bioRxiv
TL;DR: In this article, a new deep learning architecture called Enformer is proposed that integrates long-range interactions (up to 100 kb away) in the genome to predict gene expression from DNA sequence.
Abstract: The next phase of genome biology research requires understanding how DNA sequence encodes phenotypes, from the molecular to organismal levels. How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequence through the use of a new deep learning architecture called Enformer that is able to integrate long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Notably, Enformer outperformed the best team on the critical assessment of genome interpretation (CAGI5) challenge for noncoding variant interpretation with no additional training. Furthermore, Enformer learned to predict promoter-enhancer interactions directly from DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of growing human disease associations to cell-type-specific gene regulatory mechanisms and provide a framework to interpret cis-regulatory evolution. To foster these downstream applications, we have made the pre-trained Enformer model openly available, and provide pre-computed effect predictions for all common variants in the 1000 Genomes dataset. One-sentence summary: Improved noncoding variant effect prediction and candidate enhancer prioritization from a more accurate sequence to expression model driven by extended long-range interaction modelling.

20 citations


Journal Article
TL;DR: In this article, the authors performed single-cell RNA sequencing on muscle mononuclear cells from young and aged mice and profiled muscle stem cells (MuSCs) and fibro-adipogenic progenitors (FAPs) after differentiation.

14 citations


Journal Article
TL;DR: In this article, a deep learning architecture called Enformer was proposed to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input.
Abstract: How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer–promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution. By using a new deep learning architecture, Enformer leverages long-range information to improve prediction of gene expression on the basis of DNA sequence.

9 citations


Posted Content
10 Sep 2021-bioRxiv
TL;DR: In this article, a sequence-based convolutional neural network method (scBasset) was proposed to model scATAC data, leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model.
Abstract: Single cell ATAC-seq (scATAC) shows great promise for studying cellular heterogeneity in epigenetic landscapes, but there remain significant challenges in the analysis of scATAC data due to the inherent high dimensionality and sparsity. Here we introduce scBasset, a sequence-based convolutional neural network method to model scATAC data. We show that by leveraging the DNA sequence information underlying accessibility peaks and the expressiveness of a neural network model, scBasset achieves state-of-the-art performance across a variety of tasks on scATAC and single cell multiome datasets, including cell type identification, scATAC profile denoising, data integration across assays, and transcription factor activity inference.

4 citations


Posted Content
22 Jan 2021-bioRxiv
TL;DR: In this article, the authors examined a dataset comprising ~2 million cells spanning E9.5-E13.5 of mouse embryonic development to quantify transcriptome-wide changes in alternative polyadenylation (APA).
Abstract: 3′ untranslated regions (3′ UTRs) post-transcriptionally regulate mRNA stability, localization, and translation rate. While 3′-UTR isoforms have been globally quantified in limited cell types using bulk measurements, their differential usage among cell types during mammalian development remains poorly characterized. In this study, we examined a dataset comprising ~2 million cells spanning E9.5-E13.5 of mouse embryonic development to quantify transcriptome-wide changes in alternative polyadenylation (APA). We observe a global lengthening of 3′ UTRs across embryonic stages in all cell types, although we detect shorter 3′ UTRs in hematopoietic lineages and longer 3′ UTRs in neuronal cell types within each stage. While the majority of individual genes possess 3′ UTRs that lengthen with time, a subset appear to be spatiotemporally regulated through APA. By measuring 3′-UTR isoforms in an expansive single cell dataset, our work provides a transcriptome-wide and organism-wide map of the dynamic landscape of alternative polyadenylation during mammalian organogenesis.

1 citation


Posted Content
04 May 2021-bioRxiv
TL;DR: In this article, the authors revisited Hayflick's original observation of RS in human fetal lung fibroblasts equipped with a battery of high dimensional modern techniques and analytical methods to deeply profile the process of RS across each aspect of the central dogma.
Abstract: Replicative senescence (RS) as a model has become the central focus of research into cellular aging in vitro. Despite decades of study, this process through which cells cease dividing is not fully understood in culture, and even much less so in vivo during development and with aging. Here, we revisit Hayflick’s original observation of RS in WI-38 human fetal lung fibroblasts equipped with a battery of high dimensional modern techniques and analytical methods to deeply profile the process of RS across each aspect of the central dogma and beyond. We applied and integrated RNA-seq, proteomics, metabolomics, and ATAC-seq to a high resolution RS time course. We found that the transcriptional changes that underlie RS manifest early, gradually increase, and correspond to a concomitant global increase in accessibility in nucleolar and lamin associated domains. During RS WI-38 fibroblast gene expression patterns acquire a striking resemblance to those of myofibroblasts in a process similar to the epithelial to mesenchymal transition (EMT). This observation is supported at the transcriptional, proteomic, and metabolomic levels of cellular biology. In addition, we provide evidence suggesting that this conversion is regulated by the transcription factors YAP1/TEAD1 and the signaling molecule TGF-β2.

1 citation