
Showing papers in "arXiv: Genomics in 2019"


Posted Content
TL;DR: Novel CNN designs for accurate, simultaneous cancer/normal and cancer-type prediction from gene expression profiles are presented, along with a model interpretation scheme that elucidates the biological relevance of cancer marker genes after eliminating the effects of tissue of origin.
Abstract: Background: Precise prediction of cancer types is vital for cancer diagnosis and therapy. Important cancer marker genes can be inferred through a predictive model. Several studies have attempted to build machine learning models for this task; however, none has taken into consideration the effects of tissue of origin, which can potentially bias the identification of cancer markers. Results: In this paper, we introduced several Convolutional Neural Network (CNN) models that take unstructured gene expression inputs to classify tumor and non-tumor samples into their designated cancer types or as normal. Based on different designs of gene embeddings and convolution schemes, we implemented three CNN models: 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN. The models were trained and tested on a combined 10,340 samples of 33 cancer types and 731 matched normal tissues from The Cancer Genome Atlas (TCGA). Our models achieved excellent prediction accuracies (93.9-95.0%) among 34 classes (33 cancers and normal). Furthermore, we interpreted one of the models, the 1D-CNN model, with a guided saliency technique and identified a total of 2,090 cancer markers (108 per class). The concordance of differential expression of these markers between the cancer type they represent and others is confirmed. In breast cancer, for instance, our model identified well-known markers, such as GATA3 and ESR1. Finally, we extended the 1D-CNN model for prediction of breast cancer subtypes and achieved an average accuracy of 88.42% among 5 subtypes. The code can be found at this https URL.

64 citations


Journal Article
TL;DR: A comprehensive review of published genomic data visualization tools can be found in this article, where the authors propose taxonomies for data, visualization, and tasks involved in genomics data visualization.
Abstract: Genomic data visualization is essential for interpretation and hypothesis generation as well as a valuable aid in communicating discoveries. Visual tools bridge the gap between algorithmic approaches and the cognitive skills of investigators. Addressing this need has become crucial in genomics, as biomedical research is increasingly data-driven and many studies lack well-defined hypotheses. A key challenge in data-driven research is to discover unexpected patterns and to formulate hypotheses in an unbiased manner in vast amounts of genomic and other associated data. Over the past two decades, this has driven the development of numerous data visualization techniques and tools for visualizing genomic data. Based on a comprehensive literature survey, we propose taxonomies for data, visualization, and tasks involved in genomic data visualization. Furthermore, we provide a comprehensive review of published genomic visualization tools in the context of the proposed taxonomies.

33 citations


Posted Content
TL;DR: SneakySnake is introduced, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment and is efficient to implement on CPUs, GPUs, and FPGAs.
Abstract: Motivation: We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs, and FPGAs. Results: SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper, and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers's bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7x and 43.9x (>12x on average), respectively, with its CPU implementation, and by up to 413x and 689x (>400x on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979x (276.9x on average) and 91.7x (31.7x on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g., configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities. Availability: this https URL
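The filtering idea described in this abstract can be sketched in a few lines. The following is a simplified greedy illustration, not the authors' implementation: it assumes equal-length read and reference strings, treats each diagonal shift within the edit threshold as one row of the grid, and counts each mismatch that must be crossed as one obstacle. The function name `passes_filter` is hypothetical.

```python
def passes_filter(read, ref, max_edits):
    """Return True if the pair may be within max_edits and should be aligned.

    Simplified greedy sketch: from the current column, follow the longest
    run of matches on any diagonal shift in [-max_edits, max_edits], then
    pay one obstacle to move past the next column.
    """
    m = len(read)
    col = 0
    obstacles = 0
    while col < m:
        longest = 0
        for shift in range(-max_edits, max_edits + 1):
            run = 0
            # Extend the run of matching characters along this diagonal.
            while col + run < m:
                j = col + run + shift
                if 0 <= j < len(ref) and read[col + run] == ref[j]:
                    run += 1
                else:
                    break
            longest = max(longest, run)
        col += longest
        if col < m:
            # No diagonal continues here: cross one obstacle.
            obstacles += 1
            col += 1
            if obstacles > max_edits:
                return False  # too many obstacles, skip alignment
    return True


if __name__ == "__main__":
    print(passes_filter("ACGTACGT", "ACGAACGT", 1))
    print(passes_filter("AAAAAAAA", "CCCCCCCC", 2))
```

Pairs that pass would still go to the aligner of choice; the filter only decides whether alignment is worth running.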

28 citations


Posted Content
TL;DR: The ways in which issues of causality, stratification, gene-by-environment interactions, and divergence among groups all complicate the interpretation of among-population polygenic score differences are outlined.
Abstract: In the past decade, Genome-Wide Association Studies (GWAS) have delivered an increasingly broad view of the genetic basis of human phenotypic variation. One of the major developments from GWAS is polygenic scores, a genetic predictor of an individual's genetic predisposition towards a trait constructed from GWAS. The success of GWAS and polygenic scores seems to suggest that we will soon be able to settle debates about whether phenotypic differences among groups are driven in part by genetics. However, answering these questions is more complicated than it seems at first glance and touches on many old issues about the interpretation of human genetic variation. In this perspective piece, I outline the ways in which issues of causality, stratification, gene-by-environment interactions, and divergence among groups all complicate the interpretation of among-population polygenic score differences.

22 citations


Posted Content
TL;DR: A novel approach for differential abundance testing of compositional data with a non-negligible amount of zeros is presented; it uses a set of non-differentially abundant reference taxa to identify which taxa differ in the microbiome community across groups.
Abstract: Identifying which taxa in our microbiota are associated with traits of interest is important for advancing science and health. However, the identification is challenging because the measured vector of taxa counts (by amplicon sequencing) is compositional, so a change in the abundance of one taxon in the microbiota induces a change in the number of sequenced counts across all taxa. The data is typically sparse, with zero counts present either due to biological variance or limited sequencing depth (technical zeros). For low abundance taxa, the chance for technical zeros is non-negligible. We show that existing methods designed to identify differential abundance for compositional data may have an inflated number of false positives due to improper handling of the zero counts. We introduce a novel non-parametric approach which provides valid inference even when the fraction of zero counts is substantial. Our approach uses a set of reference taxa that are non-differentially abundant, which can be estimated from the data or from outside information. We show the usefulness of our approach via simulations, as well as on three different data sets: a Crohn's disease study, the Human Microbiome Project, and an experiment with 'spiked-in' bacteria.

20 citations


Journal Article
TL;DR: In this paper, the authors describe guidelines for a minimum set of metadata to sufficiently describe single-cell RNA-Seq experiments, ensuring reproducibility of data analyses.
Abstract: Single-cell RNA-Sequencing (scRNA-Seq) has undergone major technological advances in recent years, enabling the conception of various organism-level cell atlassing projects. With increasing numbers of datasets being deposited in public archives, there is a need to address the challenges of enabling the reproducibility of such data sets. Here, we describe guidelines for a minimum set of metadata to sufficiently describe scRNA-Seq experiments, ensuring reproducibility of data analyses.

19 citations


Journal Article
TL;DR: In this paper, the authors carried out an integrated meta-analysis of the mutations including single-nucleotide variations (SNVs), the copy number variations (CNVs), RNA-seq and clinical data of patients with lung adenocarcinoma downloaded from The Cancer Genome Atlas (TCGA).
Abstract: Lung cancer is the leading cause of cancer death worldwide, and lung adenocarcinoma (LUAD) is the most common form of lung cancer. In this study, we carried out an integrated meta-analysis of mutations, including single-nucleotide variations (SNVs) and copy number variations (CNVs), together with RNA-seq and clinical data of patients with LUAD downloaded from The Cancer Genome Atlas (TCGA). We integrated significant SNV and CNV genes, differentially expressed genes (DEGs) and the DEGs in active subnetworks to construct a prognosis signature. A Cox proportional hazards model (LOOCV) with Lasso penalty was used to identify the best gene signature among different gene categories. The patients in both training and test data were clustered into high-risk and low-risk groups using risk scores calculated from the selected gene signature. We generated a 12-gene signature (DEPTOR, ZBTB16, BCHE, MGLL, MASP2, TNNI2, RAPGEF3, SGK2, MYO1A, CYP24A1, PODXL2, CCNA1) for overall survival prediction. The survival time of the high-risk and low-risk groups was significantly different. This 12-gene signature could predict prognosis, and its genes are potential predictors of survival for patients with LUAD.
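The grouping step described above (a per-patient risk score computed from a gene signature, then a split into high-risk and low-risk groups) can be sketched as follows. This is an illustration under stated assumptions: the coefficients and expression values are made-up placeholders rather than results from the study, only two of the twelve genes are used, and the median cutoff is an assumed convention that the abstract does not specify.

```python
from statistics import median


def risk_scores(expression, coefficients):
    """Linear risk score per patient: sum of coefficient * expression."""
    return {
        patient: sum(coefficients[g] * values[g] for g in coefficients)
        for patient, values in expression.items()
    }


def split_by_median(scores):
    """Label patients above the median score 'high', the rest 'low'."""
    cutoff = median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


if __name__ == "__main__":
    # Hypothetical coefficients and expression values, for illustration only.
    coefficients = {"DEPTOR": -0.5, "CCNA1": 0.8}
    expression = {
        "p1": {"DEPTOR": 2.0, "CCNA1": 1.0},
        "p2": {"DEPTOR": 1.0, "CCNA1": 3.0},
        "p3": {"DEPTOR": 4.0, "CCNA1": 0.5},
        "p4": {"DEPTOR": 0.0, "CCNA1": 2.0},
    }
    print(split_by_median(risk_scores(expression, coefficients)))
```

In a real analysis the coefficients would come from the fitted penalized Cox model, and the two groups would then be compared with a survival test.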

18 citations


Posted Content
TL;DR: This work improves on the state-of-the-art for ncRNA classification with a graph convolutional network model which achieves an accuracy of 85.73% and an F1-score of 85.61% over 13 classes and represents the first successful application of graph convolutional networks to RNA folding data.
Abstract: Non-coding RNA (ncRNA) are RNA sequences which don't code for a gene but instead carry important biological functions. The task of ncRNA classification consists in classifying a given ncRNA sequence into its family. While it has been shown that the graph structure of an ncRNA sequence folding is of great importance for the prediction of its family, current methods make use of machine learning classifiers on hand-crafted graph features. We improve on the state-of-the-art for this task with a graph convolutional network model which achieves an accuracy of 85.73% and an F1-score of 85.61% over 13 classes. Moreover, our model learns in an end-to-end fashion from the raw RNA graphs and removes the need for expensive feature extraction. To the best of our knowledge, this also represents the first successful application of graph convolutional networks to RNA folding data.

17 citations


Posted Content
TL;DR: GeNet, a method for shotgun metagenomic classification from raw DNA sequences, exploits the known hierarchical structure between labels for training and obtains competitive precision and good recall with orders of magnitude lower memory requirements; a linear model trained on its learned representations achieves over 90% accuracy in a challenging pathogen detection problem.
Abstract: We introduce GeNet, a method for shotgun metagenomic classification from raw DNA sequences that exploits the known hierarchical structure between labels for training. We provide a comparison with state-of-the-art methods Kraken and Centrifuge on datasets obtained from several sequencing technologies, in which dataset shift occurs. We show that GeNet obtains competitive precision and good recall, with orders of magnitude less memory requirements. Moreover, we show that a linear model trained on top of representations learned by GeNet achieves recall comparable to state-of-the-art methods on the aforementioned datasets, and achieves over 90% accuracy in a challenging pathogen detection problem. This provides evidence of the usefulness of the representations learned by GeNet for downstream biological tasks.

14 citations


Journal Article
TL;DR: The first omic analysis applied to one of the flagship Levantine rock art sites: the Valltorta ravine (Castellón, Spain) is presented, providing the first description of the bacterial communities colonizing the rock art patina, which proved to be dominated by Firmicutes species and might have a protective effect on the paintings.
Abstract: The Iberian Mediterranean Basin is home to one of the largest groups of prehistoric rock art sites in Europe. Despite the cultural relevance of prehistoric Spanish Levantine rock art, pigment composition remains partially unknown, and the nature of the binders used for painting has yet to be disclosed. In this work, we present the first omic analysis applied to one of the flagship Levantine rock art sites: the Valltorta ravine (Castellón, Spain). We used high-throughput sequencing to provide the first description of the bacterial communities colonizing the rock art patina, which proved to be dominated by Firmicutes species and might have a protective effect on the paintings. Proteomic analysis was also performed on rock art microsamples in order to determine the organic binders present in Levantine prehistoric rock art pigments. This information could shed light on the controversial dating of this UNESCO Cultural Heritage, and contribute to defining the chrono-cultural framework of the societies responsible for these paintings.

14 citations


Posted Content
TL;DR: This chapter describes methods for the statistical analysis of high-throughput phenotyping (HTP) data with the goal of enhancing the prediction accuracy of genomic selection (GS).
Abstract: The advent of plant phenomics, coupled with the wealth of genotypic data generated by next-generation sequencing technologies, provides exciting new resources for investigations into and improvement of complex traits. However, these new technologies also bring new challenges in quantitative genetics, namely, a need for the development of robust frameworks that can accommodate these high-dimensional data. In this chapter, we describe methods for the statistical analysis of high-throughput phenotyping (HTP) data with the goal of enhancing the prediction accuracy of genomic selection (GS). Following the Introduction in Section 1, Section 2 discusses field-based HTP, including the use of unmanned aerial vehicles and light detection and ranging, as well as how we can achieve increased genetic gain by utilizing image data derived from HTP. Section 3 considers extending commonly used GS models to integrate HTP data as covariates associated with the principal trait response, such as yield. Particular focus is placed on single-trait, multi-trait, and genotype by environment interaction models. One unique aspect of HTP data is that phenomics platforms often produce large-scale data with high spatial and temporal resolution for capturing dynamic growth, development, and stress responses. Section 4 discusses the utility of a random regression model for performing longitudinal GS. The chapter concludes with a discussion of some standing issues.

Posted Content
TL;DR: A ReRAM-based process-in-memory architecture, FindeR, is proposed to enhance the FM-Index EPM search throughput in genomic sequences and builds a reliable and energy-efficient Hamming distance unit to accelerate the computing kernel of FM-Index search using commodity ReRAM chips without introducing extra CMOS logic.
Abstract: Genomics is the critical key to enabling precision medicine, ensuring global food security and enforcing wildlife conservation. The massive genomic data produced by various genome sequencing technologies presents a significant challenge for genome analysis. Because of errors from sequencing machines and genetic variations, approximate pattern matching (APM) is a must for practical genome analysis. Recent work proposes FPGA, ASIC and even process-in-memory-based accelerators to boost the APM throughput by accelerating dynamic-programming-based algorithms (e.g., Smith-Waterman). However, existing accelerators lack efficient hardware acceleration for exact pattern matching (EPM), an even more critical and essential function widely used in almost every step of genome analysis including assembly, alignment, annotation and compression. State-of-the-art genome analysis adopts the FM-Index, which augments the space-efficient BWT with additional data structures permitting fast EPM operations. But the FM-Index is notorious for poor spatial locality and massive random memory accesses. In this paper, we propose a ReRAM-based process-in-memory architecture, FindeR, to enhance the FM-Index EPM search throughput in genomic sequences. We build a reliable and energy-efficient Hamming distance unit to accelerate the computing kernel of FM-Index search using commodity ReRAM chips without introducing extra CMOS logic. We further architect a full-fledged FM-Index search pipeline and improve its search throughput by lightweight scheduling on the NVDIMM. We also create a system library for programmers to invoke FindeR to perform EPMs in genome analysis. Compared to state-of-the-art accelerators, FindeR improves the FM-Index search throughput by 83% to 30Kx and throughput per Watt by 3.5x to 42.5Kx.
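For readers unfamiliar with the FM-Index, the exact pattern matching it supports can be illustrated with the textbook backward search over the BWT. The sketch below is a plain software version for intuition only; it says nothing about FindeR's hardware pipeline, it builds the suffix array naively, and the function names `build_fm_index` and `locate` are made up for this example.

```python
def build_fm_index(text):
    """Build a suffix array, C table, and Occ table for text + '$'."""
    text += "$"  # sentinel, assumed smaller than every other symbol
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the sentinel

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch]: number of symbols in the text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i]: occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, sym in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == sym else 0)
    return sa, c_table, occ


def locate(pattern, sa, c_table, occ):
    """Return sorted start positions of pattern via backward search."""
    lo, hi = 0, len(sa)  # half-open suffix array interval
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    index = build_fm_index("ACGTACGA")
    print(locate("ACG", *index))
```

Each pattern symbol triggers two Occ lookups at unrelated offsets, which is the random-access behavior the abstract describes as a poor fit for conventional memory hierarchies.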

Posted Content
TL;DR: A method is introduced to generate synthetic protein sequences predicted to be resistant to certain antibiotics, using a Wasserstein generative adversarial network (W-GAN), a variant of the original generative adversarial network.
Abstract: We introduce a method to generate synthetic protein sequences which are predicted to be resistant to certain antibiotics. We used 6,023 genes predicted to be resistant to antibiotics in the intestinal region of the human gut as input to a Wasserstein generative adversarial network (W-GAN), a variant of the original generative adversarial model that is known to perform efficiently at mimicking the distribution of real data in order to generate new data similar in style to the training data.

Posted Content
TL;DR: This work presents a class-conditional VAE-GAN to generate new human genomic sequences that can be used to train local ancestry inference (LAI) algorithms and evaluates the quality of the generated data by comparing the performance of a state-of-the-art LAI method when trained with generated versus real data.
Abstract: Local ancestry inference (LAI) allows identification of the ancestry of all chromosomal segments in admixed individuals, and it is a critical step in the analysis of human genomes with applications from pharmacogenomics and precision medicine to genome-wide association studies. In recent years, many LAI techniques have been developed in both industry and academic research. However, these methods require large training data sets of human genomic sequences from the ancestries of interest. Such reference data sets are usually limited, proprietary, protected by privacy restrictions, or otherwise not accessible to the public. Techniques to generate training samples that resemble real haploid sequences from ancestries of interest can be useful tools in such scenarios, since a generalized model can often be shared, but the unique human sample sequences cannot. In this work we present a class-conditional VAE-GAN to generate new human genomic sequences that can be used to train local ancestry inference (LAI) algorithms. We evaluate the quality of our generated data by comparing the performance of a state-of-the-art LAI method when trained with generated versus real data.

Posted Content
TL;DR: This work presents NO-BEARS, a novel algorithm for estimating gene regulatory networks, built on the basis of the NOTEARS algorithm with two improvements, and introduces a polynomial regression loss to handle non-linearity in gene expressions.
Abstract: Constructing gene regulatory networks is a critical step in revealing disease mechanisms from transcriptomic data. In this work, we present NO-BEARS, a novel algorithm for estimating gene regulatory networks. The NO-BEARS algorithm is built on the basis of the NOTEARS algorithm with two improvements. First, we propose a new constraint and its fast approximation to reduce the computational cost of the NOTEARS algorithm. Next, we introduce a polynomial regression loss to handle non-linearity in gene expressions. Our implementation utilizes modern GPU computation that can decrease the time of hours-long CPU computation to seconds. Using synthetic data, we demonstrate improved performance, both in processing time and accuracy, on inferring gene regulatory networks from gene expression data.

Posted Content
TL;DR: The single-cell eQTLGen consortium is founded, aimed at pinpointing disease-causing genetic variants and identifying the cellular contexts in which they affect gene expression, which can enable development of personalized medicine.
Abstract: In recent years, functional genomics approaches combining genetic information with bulk RNA-sequencing data have identified the downstream expression effects of disease-associated genetic risk factors through so-called expression quantitative trait locus (eQTL) analysis. Single-cell RNA-sequencing creates enormous opportunities for mapping eQTLs across different cell types and in dynamic processes, many of which are obscured when using bulk methods. The enormous increase in throughput and reduction in cost per cell now allow this technology to be applied to large-scale population genetics studies. Therefore, we have founded the single-cell eQTLGen consortium (sc-eQTLGen), aimed at pinpointing disease-causing genetic variants and identifying the cellular contexts in which they affect gene expression. Ultimately, this information can enable development of personalized medicine. Here, we outline the goals, approach, potential utility and early proofs-of-concept of the sc-eQTLGen consortium. We also provide a set of study design considerations for future single-cell eQTL studies.

Posted Content
TL;DR: This work uses a sequence-to-sequence autoencoder model to learn a latent representation of a fixed dimension for long and variable length DNA sequences in an unsupervised manner, and evaluates both quantitatively and qualitatively the learned latent representation for a supervised task of splice site classification.
Abstract: Recently, several deep learning models have been used for DNA sequence based classification tasks. Often such tasks require long and variable length DNA sequences in the input. In this work, we use a sequence-to-sequence autoencoder model to learn a latent representation of a fixed dimension for long and variable length DNA sequences in an unsupervised manner. We evaluate both quantitatively and qualitatively the learned latent representation for a supervised task of splice site classification. The quantitative evaluation is done under two different settings. Our experiments show that these representations can be used as features or priors in closely related tasks such as splice site classification. Further, in our qualitative analysis, we use a model attribution technique, Integrated Gradients, to infer significant sequence signatures influencing the classification accuracy. We show that the identified splice signatures agree well with existing knowledge.

Posted Content
TL;DR: Using systems biology approaches for the analysis of metabolomics and genetic data, several biological processes are integrated, which lead to findings that may functionally connect genetic variants with complex diseases.
Abstract: Background: Many genome-wide association studies have detected genomic regions associated with traits, yet understanding the functional causes of association often remains elusive. Utilizing systems approaches and focusing on intermediate molecular phenotypes might facilitate biologic understanding. Results: The availability of exome sequencing of two populations of African-Americans and European-Americans from the Atherosclerosis Risk in Communities study allowed us to investigate the effects of annotated loss-of-function (LoF) mutations on 122 serum metabolites. To assess the findings, we built metabolomic causal networks for each population separately and utilized structural equation modeling. We then validated our findings with a set of independent samples. By use of methods based on concepts of Mendelian randomization of genetic variants, we showed that some of the affected metabolites are risk predictors in the causal pathway of disease. For example, LoF mutations in the gene KIAA1755 were identified to elevate the levels of eicosapentaenoate (p-value=5E-14), an essential fatty acid clinically identified to increase essential hypertension. We showed that this gene is in the pathway to triglycerides, where both triglycerides and essential hypertension are risk factors of metabolomic disorder and heart attack. We also identified that the gene CLDN17, harboring loss-of-function mutations, had pleiotropic actions on metabolites from amino acid and lipid pathways. Conclusion: Using systems biology approaches for the analysis of metabolomics and genetic data, we integrated several biological processes, which lead to findings that may functionally connect genetic variants with complex diseases.

Posted Content
TL;DR: In this article, a Partially-observed Boolean dynamical systems (POBDS) model is used for modeling gene regulatory networks observed through noisy gene-expression data, and the exact optimal Bayesian classifier (OBC) is derived for binary classification of single-cell trajectories.
Abstract: Single-cell gene expression measurements offer opportunities in deriving mechanistic understanding of complex diseases, including cancer. However, due to the complex regulatory machinery of the cell, gene regulatory network (GRN) model inference based on such data still manifests significant uncertainty. The goal of this paper is to develop optimal classification of single-cell trajectories accounting for potential model uncertainty. Partially-observed Boolean dynamical systems (POBDS) are used for modeling gene regulatory networks observed through noisy gene-expression data. We derive the exact optimal Bayesian classifier (OBC) for binary classification of single-cell trajectories. The application of the OBC becomes impractical for large GRNs, due to computational and memory requirements. To address this, we introduce a particle-based single-cell classification method that is highly scalable for large GRNs with much lower complexity than the optimal solution. The performance of the proposed particle-based method is demonstrated through numerical experiments using a POBDS model of the well-known T-cell large granular lymphocyte (T-LGL) leukemia network with noisy time-series gene-expression data.

Journal Article
TL;DR: It is proposed that a plausible adaptor for this code is the gene domain, that is, the genome segment delimited by topological insulators and comprising the gene and its enhancer regulatory sequences.
Abstract: We revisit the notion of gene regulatory code in embryonic development in the light of recent findings about genome spatial organisation. By analogy with the genetic code, we posit that the concept of code can only be used if the corresponding adaptor can clearly be identified. An adaptor is here defined as an intermediary physical entity mediating the correspondence between codewords and objects in a gratuitous and evolvable way. In the context of the gene regulatory code, the encoded objects are the gene expression levels, while the concentrations of specific transcription factors in the cell nucleus provide the codewords. The notion of code is meaningful in the absence of direct physicochemical relationships between the objects and the codewords, when the mediation by an adaptor is required. We propose that a plausible adaptor for this code is the gene domain, that is, the genome segment delimited by topological insulators and comprising the gene and its enhancer regulatory sequences. We review recent evidence, based on genome-wide chromosome conformation capture experiments, showing that preferential contact domains found in metazoan genomes are the physical traces of gene domains. Accordingly, genome 3D folding plays a direct role in shaping the developmental gene regulatory code.

Posted Content
TL;DR: The Factor Graph Neural Network model is developed that is interpretable and predictable by combining probabilistic graphical models with deep learning, and an attention mechanism is devised that can capture multi-scale hierarchical interactions among biological entities such as genes and Gene Ontology terms.
Abstract: While deep learning has achieved great success in many fields, one common criticism about deep learning is its lack of interpretability. In most cases, the hidden units in a deep neural network do not have a clear semantic meaning or correspond to any physical entities. However, model interpretability and explainability are crucial in many biomedical applications. To address this challenge, we developed the Factor Graph Neural Network model that is interpretable and predictable by combining probabilistic graphical models with deep learning. We directly encode biological knowledge such as Gene Ontology as a factor graph into the model architecture, making the model transparent and interpretable. Furthermore, we devised an attention mechanism that can capture multi-scale hierarchical interactions among biological entities such as genes and Gene Ontology terms. With parameter sharing mechanism, the unrolled Factor Graph Neural Network model can be trained with stochastic depth and generalize well. We applied our model to two cancer genomic datasets to predict target clinical variables and achieved better results than other traditional machine learning and deep learning models. Our model can also be used for gene set enrichment analysis and selecting Gene Ontology terms that are important to target clinical variables.

Posted Content
TL;DR: MinCall, an end2end basecaller model for the MinION, is presented; the deep learning model is based on convolutional neural networks and achieves a 91.4% median match rate on an E. coli dataset using R9 pore chemistry and 1D reads.
Abstract: Oxford Nanopore Technologies' MinION is the first portable DNA sequencing device. It is capable of producing long reads; reads of over 100 kbp have been reported. However, it has a significantly higher error rate than other methods. In this study, we present MinCall, an end2end basecaller model for the MinION. The model is based on deep learning and uses convolutional neural networks (CNN) in its implementation. For extra performance, it uses cutting-edge deep learning techniques and architectures, batch normalization and Connectionist Temporal Classification (CTC) loss. The best performing deep learning model achieves a 91.4% median match rate on an E. coli dataset using R9 pore chemistry and 1D reads.
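A network trained with CTC loss emits one label per signal frame, and the final base sequence is obtained by merging consecutive repeats and then dropping a blank symbol. A minimal sketch of that collapse rule follows, assuming `-` as the blank symbol; the function name `ctc_collapse` is hypothetical and this is not MinCall's code.

```python
from itertools import groupby


def ctc_collapse(frame_labels, blank="-"):
    """Merge consecutive duplicate labels, then remove blanks."""
    return "".join(
        label for label, _ in groupby(frame_labels) if label != blank
    )


if __name__ == "__main__":
    # The blank between the two runs of 'A' keeps them as separate bases.
    print(ctc_collapse("AA-AC--GG-T"))
```

The blank is what lets the model emit the same base twice in a row, which matters for homopolymer stretches in nanopore reads.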

Posted Content
TL;DR: AirLift is proposed, a fast and comprehensive technique for moving alignments from one genome to another that can reduce the number of reads that need to be mapped from the entire read set by up to 99.9% and the overall execution time to remap the reads between the two most recent reference versions by 6.94x.
Abstract: As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping). However, if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7x, 6.6x, and 2.8x for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set.

Posted Content
TL;DR: In this paper, the authors adopt a transcriptome-based taxonomy of the cell types in the adult mammalian neocortex, which is configured to flexibly incorporate new data from multiple approaches, developmental stages and a growing number of species, enabling improvement and revision of the classification.
Abstract: To understand the function of cortical circuits it is necessary to classify their underlying cellular diversity. Traditional attempts based on comparing anatomical or physiological features of neurons and glia, while productive, have not resulted in a unified taxonomy of neural cell types. The recent development of single-cell transcriptomics has enabled, for the first time, systematic high-throughput profiling of large numbers of cortical cells and the generation of datasets that hold the promise of being complete, accurate and permanent. Statistical analyses of these data have revealed the existence of clear clusters, many of which correspond to cell types defined by traditional criteria, and which are conserved across cortical areas and species. To capitalize on these innovations and advance the field, we, the Copenhagen Convention Group, propose that the community adopt a transcriptome-based taxonomy of the cell types in the adult mammalian neocortex. This core classification should be ontological, hierarchical and use a standardized nomenclature. It should be configured to flexibly incorporate new data from multiple approaches, developmental stages and a growing number of species, enabling improvement and revision of the classification. This community-based strategy could serve as a common foundation for future detailed analysis and reverse engineering of cortical circuits and serve as an example for cell type classification in other parts of the nervous system and other organs.

Posted Content
TL;DR: The hypothesis that via gene-environment interactions, fetal neuroinflammation and PS may reprogram glial immunometabolic phenotypes which impact neurodevelopment and neurobehavior is tested and the implications for ASD etiology, early detection, and novel therapeutic approaches are discussed.
Abstract: Fetal neuroinflammation and prenatal stress (PS) may contribute to lifelong neurological disabilities. Astrocytes and microglia play a pivotal role, but the mechanisms are poorly understood. Here, we test the hypothesis that via gene-environment interactions, fetal neuroinflammation and PS may reprogram glial immunometabolic phenotypes which impact neurodevelopment and neurobehavior. This glial-neuronal interplay increases the risk for clinical manifestation of autism spectrum disorder (ASD) in at-risk children. Drawing on genomic data from the recently published series of ovine and rodent glial transcriptome analyses with fetuses exposed to neuroinflammation or PS, we conducted a secondary analysis against the Simons Foundation Autism Research Initiative (SFARI) Gene database. We confirmed 21 gene hits. Using unsupervised statistical network analysis, we then identified six clusters of probable protein-protein interactions mapping onto the immunometabolic and stress response networks and epigenetic memory. These findings support our hypothesis. We discuss the implications for ASD etiology, early detection, and novel therapeutic approaches.

Posted Content
TL;DR: A greedy algorithm that is fast and scalable in detecting a nested partition extracted from a dendrogram obtained from hierarchical clustering of a multivariate series; it provides a hierarchically nested partition in much less time than widely used algorithms, allowing statistically validated cluster detection in very large systems.
Abstract: We develop a greedy algorithm that is fast and scalable in detecting a nested partition extracted from a dendrogram obtained from hierarchical clustering of a multivariate series. Our algorithm provides a $p$-value for each clade observed in the hierarchical tree. The $p$-value is obtained by computing a number of bootstrap replicas of the dissimilarity matrix and by performing a statistical test on each difference between the dissimilarity associated with a given clade and the dissimilarity of the clade of its parent node. We demonstrate the efficacy of our algorithm with a set of benchmarks generated by using a hierarchical factor model. We compare the results obtained by our algorithm with those of Pvclust, a widely used algorithm developed with a global approach originally motivated by phylogenetic studies. In our numerical experiments we focus on the role of multiple hypothesis test correction and on the robustness of the algorithms to inaccuracies and errors in the datasets. We also apply our algorithm to a reference empirical dataset. We verify that our algorithm is much faster than the Pvclust algorithm and scales better both in the number of elements and in the number of records of the investigated multivariate set. Our algorithm provides a hierarchically nested partition in much less time than widely used algorithms, allowing statistically validated cluster detection in very large systems.
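
To make the bootstrap step more concrete, here is a minimal sketch of one way a per-clade $p$-value could be estimated from replicas. The abstract does not specify the statistical test, so everything here is an assumption for illustration: the function name `clade_p_value`, the one-sided comparison (a clade is supported when its parent's dissimilarity exceeds its own), and the add-one correction to the empirical estimate. Multiple hypothesis test correction, which the abstract mentions, is not shown.

```python
def clade_p_value(child_replicas, parent_replicas):
    """Empirical one-sided bootstrap p-value for a single clade.

    child_replicas[b] and parent_replicas[b] are the dissimilarities of the
    clade and of its parent clade in bootstrap replica b. The test used here
    (fraction of replicas where parent - child <= 0, with add-one smoothing)
    is an assumption of this sketch, not the paper's stated procedure.
    """
    if not child_replicas or len(child_replicas) != len(parent_replicas):
        raise ValueError("need two non-empty replica lists of equal length")
    # Count replicas in which the clade is NOT tighter than its parent.
    violations = sum(
        1 for child, parent in zip(child_replicas, parent_replicas)
        if parent - child <= 0
    )
    # Add-one smoothing keeps the estimate strictly positive.
    return (violations + 1) / (len(child_replicas) + 1)


# Usage: four toy replicas, one of which contradicts the clade.
child = [0.20, 0.30, 0.25, 0.40]
parent = [0.50, 0.60, 0.20, 0.70]
print(clade_p_value(child, parent))
```

With real data the number of replicas would be far larger than four; the smallest attainable value of this estimate is 1 / (B + 1) for B replicas.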

Posted Content
TL;DR: A novel method to classify human cells is presented in this work based on the transform-domain method on DNA methylation data that can speed up the computation time by more than one order of magnitude compared with whole original sequence classification, while maintaining comparable classification accuracy by the selected machine learning classifiers.
Abstract: A novel method to classify human cells is presented in this work based on the transform-domain method on DNA methylation data. DNA methylation profile variations are observed in human cells with the progression of disease stages, and the proposal uses this DNA methylation variation to classify normal and disease cells, including cancer cells. The cancer cell types investigated in this work cover hepatocellular (sample size n = 40), colorectal (n = 44), lung (n = 70) and endometrial (n = 87) cancer cells. A new pipeline is proposed that integrates the DNA methylation intensity measurements on all the CpG islands using the Walsh-Hadamard Transform (WHT). The study reveals the three-step properties of the DNA methylation transform-domain data and the association of the step values with the cell status. Further assessments have been carried out on the proposed machine learning pipeline to classify normal and cancer tissue cells. A number of machine learning classifiers are compared for whole-sequence and WHT-sequence classification based on public Whole-Genome Bisulfite Sequencing (WGBS) DNA methylation datasets. The WHT-based method can reduce the computation time by more than one order of magnitude compared with whole original sequence classification, while maintaining comparable classification accuracy with the selected machine learning classifiers. The proposed method has broad applications in expedited classification of disease and normal human cells from epigenome and genome datasets.
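
For readers unfamiliar with the transform named above, the following is a minimal sketch of the fast Walsh-Hadamard transform in natural (Hadamard) ordering, unnormalized. It only illustrates the transform itself; how the paper orders, normalizes, or selects coefficients from methylation intensities is not stated in the abstract, and the function name `fwht` and the toy input vector are assumptions for this example.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering).

    The input length must be a power of two. Runs in O(n log n) using
    in-place butterflies on a copy of the input.
    """
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    half = 1
    while half < n:
        # Combine blocks of size `half` pairwise into blocks of size 2*half.
        for start in range(0, n, half * 2):
            for j in range(start, start + half):
                a, b = data[j], data[j + half]
                data[j], data[j + half] = a + b, a - b
        half *= 2
    return data


# Usage: a toy binary vector standing in for per-site methylation calls.
signal = [1, 0, 1, 0, 0, 1, 1, 0]
print(fwht(signal))  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying `fwht` twice returns the original vector scaled by its length, so the transform is its own inverse up to a factor of n.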

Posted Content
TL;DR: Assessing whether graphs capture dependencies seen in gene expression data better than random finds that dependencies can be captured almost as well at random which suggests that, in terms of gene expression levels, the relevant information about the state of the cell is spread across many genes.
Abstract: Gene interaction graphs aim to capture various relationships between genes and represent decades of biology research. When trying to make predictions from genomic data, those graphs could be used to overcome the curse of dimensionality by making machine learning models sparser and more consistent with biological common knowledge. In this work, we focus on assessing whether those graphs capture dependencies seen in gene expression data better than random. We formulate a condition that graphs should satisfy to provide good prior knowledge and propose to test it using a 'Single Gene Inference' (SGI) task. We compare random graphs with seven major gene interaction graphs published by different research groups, aiming to measure the true benefit of using biologically relevant graphs in this context. Our analysis finds that dependencies can be captured almost as well at random, which suggests that, in terms of gene expression levels, the relevant information about the state of the cell is spread across many genes.

Posted Content
TL;DR: The authors' analysis with random graphs finds that dependencies can be captured almost as well at random which suggests that, in terms of gene expression levels, the relevant information about the state of the cell is spread across many genes.
Abstract: Gene interaction graphs aim to capture various relationships between genes and can represent decades of biology research. When trying to make predictions from genomic data, those graphs could be used to overcome the curse of dimensionality by making machine learning models sparser and more consistent with biological common knowledge. In this work, we focus on assessing how well those graphs capture dependencies seen in gene expression data to evaluate the adequacy of the prior knowledge provided by those graphs. We propose a condition graphs should satisfy to provide good prior knowledge and test it using 'Single Gene Inference' tasks. We also compare with randomly generated graphs, aiming to measure the true benefit of using biologically relevant graphs in this context, and validate our findings with five clinical tasks. We find some graphs capture relevant dependencies for most genes while being very sparse. Our analysis with random graphs finds that dependencies can be captured almost as well at random, which suggests that, in terms of gene expression levels, the relevant information about the state of the cell is spread across many genes.

Posted Content
TL;DR: In this article, a hidden Markov model is proposed to describe the cycle of DNA methylation and demethylation over short time scales, and the model is applied to mouse embryonic stem cells.
Abstract: Understanding the mechanisms that control epigenetic changes is an important research area in modern functional biology. Epigenetic modifications such as DNA methylation are in general very stable over many cell divisions. DNA methylation can, however, be subject to specific and fast changes over a short time scale even in non-dividing (i.e. non-replicating) cells. Such dynamic DNA methylation changes are caused by a combination of active demethylation and de novo methylation processes which have not been investigated in integrated models. Here we present a hybrid (hidden) Markov model to describe the cycle of methylation and demethylation over (short) time scales. Our hybrid model describes several molecular events, some happening at deterministic points (i.e. describing mechanisms that occur only during cell division) and others occurring at random time points. We test our model on mouse embryonic stem cells using time-resolved data. We predict methylation changes and estimate the efficiencies of the different modification steps related to DNA methylation and demethylation.