scispace - formally typeset
Search or ask a question

Showing papers in "Genome Biology in 2019"


Journal ArticleDOI
TL;DR: This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics.
Abstract: Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics. Each output is benchmarked on appropriate real or simulated datasets, and where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at https://github.com/davidemms/OrthoFinder.

2,376 citations


Journal ArticleDOI
TL;DR: Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold.
Abstract: Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

2,261 citations


Journal ArticleDOI
TL;DR: It is proposed that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.
Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.

1,898 citations


Journal ArticleDOI
TL;DR: The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance, and users should consider producing a custom model using a larger neural network and/or training data from the same species.
Abstract: Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences (‘polishing’) with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last 2 years. The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

1,488 citations


Journal ArticleDOI
TL;DR: Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions, which preserves the global topology of data, allow analyzing data at different resolutions, and result in much higher computational efficiency of the typical exploratory data analysis workflow.
Abstract: Single-cell RNA-seq quantifies biological heterogeneity across both discrete cell types and continuous cell transitions. Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions ( https://github.com/theislab/paga ). PAGA maps preserve the global topology of data, allow analyzing data at different resolutions, and result in much higher computational efficiency of the typical exploratory data analysis workflow. We demonstrate the method by inferring structure-rich cell maps with consistent topology across four hematopoietic datasets, adult planaria and the zebrafish embryo and benchmark computational performance on one million neurons.

827 citations


Journal ArticleDOI
TL;DR: Cytoscape Automation (CA), which marries CyToscape to highly productive workflow systems, for example, Python/R in Jupyter/RStudio, is described, which exposes over 270 Cytoscapes core functions and 34 apps as REST-callable functions with standardized JSON interfaces backed by Swagger documentation.
Abstract: Cytoscape is one of the most successful network biology analysis and visualization tools, but because of its interactive nature, its role in creating reproducible, scalable, and novel workflows has been limited. We describe Cytoscape Automation (CA), which marries Cytoscape to highly productive workflow systems, for example, Python/R in Jupyter/RStudio. We expose over 270 Cytoscape core functions and 34 Cytoscape apps as REST-callable functions with standardized JSON interfaces backed by Swagger documentation. Independent projects to create and publish Python/R native CA interface libraries have reached an advanced stage, and a number of automation workflows are already published.

721 citations


Journal ArticleDOI
TL;DR: StringTie2 is a reference-guided transcriptome assembler that works with both short and long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies.
Abstract: RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

635 citations


Journal ArticleDOI
TL;DR: The utility of PrimalSeq is demonstrated by measuring Zika and West Nile virus diversity from varied sample types and the accumulation of genetic diversity is influenced by experimental and biological systems.
Abstract: How viruses evolve within hosts can dictate infection outcomes; however, reconstructing this process is challenging. We evaluate our multiplexed amplicon approach, PrimalSeq, to demonstrate how virus concentration, sequencing coverage, primer mismatches, and replicates influence the accuracy of measuring intrahost virus diversity. We develop an experimental protocol and computational tool, iVar, for using PrimalSeq to measure virus diversity using Illumina and compare the results to Oxford Nanopore sequencing. We demonstrate the utility of PrimalSeq by measuring Zika and West Nile virus diversity from varied sample types and show that the accumulation of genetic diversity is influenced by experimental and biological systems.

612 citations


Journal ArticleDOI
TL;DR: This work describes a new statistical method, EmptyDrops, based on detecting significant deviations from the expression profile of the ambient solution that retains distinct cell types that would have been discarded by existing methods in several real data sets.
Abstract: Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets. Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that EmptyDrops has greater power than existing approaches while controlling the false discovery rate among detected cells. Our method also retains distinct cell types that would have been discarded by existing methods in several real data sets.

499 citations


Journal ArticleDOI
TL;DR: A large-scale RNA sequencing study is performed to experimentally identify genes that are downregulated by 25 miRNAs and an improved computational model for genome-wide miRNA target prediction is developed and validated.
Abstract: We perform a large-scale RNA sequencing study to experimentally identify genes that are downregulated by 25 miRNAs. This RNA-seq dataset is combined with public miRNA target binding data to systematically identify miRNA targeting features that are characteristic of both miRNA binding and target downregulation. By integrating these common features in a machine learning framework, we develop and validate an improved computational model for genome-wide miRNA target prediction. All prediction data can be accessed at miRDB ( http://mirdb.org ).

478 citations


Journal ArticleDOI
TL;DR: A comprehensive landscape of different modes of gene duplication across the plant kingdom is identified by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.
Abstract: The sharp increase of plant genome and transcriptome data provide valuable resources to investigate evolutionary consequences of gene duplication in a range of taxa, and unravel common principles underlying duplicate gene retention. We survey 141 sequenced plant genomes to elucidate consequences of gene and genome duplication, processes central to the evolution of biodiversity. We develop a pipeline named DupGen_finder to identify different modes of gene duplication in plants. Genes derived from whole-genome, tandem, proximal, transposed, or dispersed duplication differ in abundance, selection pressure, expression divergence, and gene conversion rate among genomes. The number of WGD-derived duplicate genes decreases exponentially with increasing age of duplication events—transposed duplication- and dispersed duplication-derived genes declined in parallel. In contrast, the frequency of tandem and proximal duplications showed no significant decrease over time, providing a continuous supply of variants available for adaptation to continuously changing environments. Moreover, tandem and proximal duplicates experienced stronger selective pressure than genes formed by other modes and evolved toward biased functional roles involved in plant self-defense. The rate of gene conversion among WGD-derived gene pairs declined over time, peaking shortly after polyploidization. To provide a platform for accessing duplicated gene pairs in different plants, we constructed the Plant Duplicate Gene Database. We identify a comprehensive landscape of different modes of gene duplication across the plant kingdom by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.

Journal ArticleDOI
TL;DR: Key challenges to understand clock mechanisms and biomarker utility are discussed, including dissecting the drivers and regulators of age-related changes in single-cell, tissue- and disease-specific models, as well as exploring other epigenomic marks, longitudinal and diverse population studies, and non-human models.
Abstract: Epigenetic clocks comprise a set of CpG sites whose DNA methylation levels measure subject age. These clocks are acknowledged as a highly accurate molecular correlate of chronological age in humans and other vertebrates. Also, extensive research is aimed at their potential to quantify biological aging rates and test longevity or rejuvenating interventions. Here, we discuss key challenges to understand clock mechanisms and biomarker utility. This requires dissecting the drivers and regulators of age-related changes in single-cell, tissue- and disease-specific models, as well as exploring other epigenomic marks, longitudinal and diverse population studies, and non-human models. We also highlight important ethical issues in forensic age determination and predicting the trajectory of biological aging in an individual.

Journal ArticleDOI
Changdai Gu1, Gi Bae Kim1, Won Jun Kim1, Hyun Uk Kim1, Sang Yup Lee1 
TL;DR: Current reconstructed GEMs are reviewed and discussed, including strain development for chemicals and materials production, drug targeting in pathogens, prediction of enzyme functions, pan-reactome analysis, modeling interactions among multiple cells or organisms, and understanding human diseases.
Abstract: Genome-scale metabolic models (GEMs) computationally describe gene-protein-reaction associations for entire metabolic genes in an organism, and can be simulated to predict metabolic fluxes for various systems-level metabolic studies. Since the first GEM for Haemophilus influenzae was reported in 1999, advances have been made to develop and simulate GEMs for an increasing number of organisms across bacteria, archaea, and eukarya. Here, we review current reconstructed GEMs and discuss their applications, including strain development for chemicals and materials production, drug targeting in pathogens, prediction of enzyme functions, pan-reactome analysis, modeling interactions among multiple cells or organisms, and understanding human diseases.

Journal ArticleDOI
TL;DR: A comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) is created that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements and will greatly facilitate TE annotation in eukaryotic genomes.
Abstract: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.

Journal ArticleDOI
TL;DR: This work presents RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes and demonstrates the scalability and utility of the tool.
Abstract: We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO .

Journal ArticleDOI
TL;DR: This work benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers and found that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations.
Abstract: Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification. Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods’ sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments. We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub ( https://github.com/tabdelaal/scRNAseq_Benchmark ). Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.

Journal ArticleDOI
TL;DR: These approaches are reviewed with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.
Abstract: Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution—giving rise to the differences within populations and among species. Nevertheless, characterizing SVs and determining the optimal approach for a given experimental design remains a computational and scientific challenge. Multiple approaches have emerged to target various SV classes, zygosities, and size ranges. Here, we review these approaches with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.

Journal ArticleDOI
TL;DR: The lower accuracy of de novo assembly-based methods notwithstanding, they are useful for reconstructing fusion isoforms and tumor viruses, both of which are important in cancer research.
Abstract: Accurate fusion transcript detection is essential for comprehensive characterization of cancer transcriptomes. Over the last decade, multiple bioinformatic tools have been developed to predict fusions from RNA-seq, based on either read mapping or de novo fusion transcript assembly. We benchmark 23 different methods including applications we develop, STAR-Fusion and TrinityFusion, leveraging both simulated and real RNA-seq. Overall, STAR-Fusion, Arriba, and STAR-SEQR are the most accurate and fastest for fusion detection on cancer transcriptomes. The lower accuracy of de novo assembly-based methods notwithstanding, they are useful for reconstructing fusion isoforms and tumor viruses, both of which are important in cancer research.

Journal ArticleDOI
TL;DR: A non-canonical function of circRNA in modulating liver cancer cell growth through the Wnt pathway is illustrated, which can provide novel mechanistic insights into the underlying mechanisms of hepatocellular carcinoma.
Abstract: Circular RNAs are a class of regulatory RNA transcripts, which are ubiquitously expressed in eukaryotes. In the current study, we evaluate the function of a novel circRNA derived from the β-catenin gene locus, circβ-catenin. Circβ-catenin is predominantly localized in the cytoplasm and displays resistance to RNase-R treatment. We find that circβ-catenin is highly expressed in liver cancer tissues when compared to adjacent normal tissues. Silencing of circβ-catenin significantly suppresses malignant phenotypes in vitro and in vivo, and knockdown of this circRNA reduces the protein level of β-catenin without affecting its mRNA level. We show that circβ-catenin affects a wide spectrum of Wnt pathway-related genes, and furthermore, circβ-catenin produces a novel 370-amino acid β-catenin isoform that uses the start codon as the linear β-catenin mRNA transcript and translation is terminated at a new stop codon created by circularization. We find that this novel isoform can stabilize full-length β-catenin by antagonizing GSK3β-induced β-catenin phosphorylation and degradation, leading to activation of the Wnt pathway. Our findings illustrate a non-canonical function of circRNA in modulating liver cancer cell growth through the Wnt pathway, which can provide novel mechanistic insights into the underlying mechanisms of hepatocellular carcinoma.

Journal ArticleDOI
TL;DR: This work comprehensively evaluates the performance of 69 existing SV detection algorithms and enumerates potential good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham are better algorithms in deletion or duplication categories.
Abstract: Structural variations (SVs) or copy number variations (CNVs) greatly impact the functions of the genes encoded in the genome and are responsible for diverse human diseases. Although a number of existing SV detection algorithms can detect many types of SVs using whole genome sequencing (WGS) data, no single algorithm can call every type of SVs with high precision and high recall. We comprehensively evaluate the performance of 69 existing SV detection algorithms using multiple simulated and real WGS datasets. The results highlight a subset of algorithms that accurately call SVs depending on specific types and size ranges of the SVs and that accurately determine breakpoints, sizes, and genotypes of the SVs. We enumerate potential good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham are better algorithms in deletion or duplication categories. To improve the accuracy of SV calling, we systematically evaluate the accuracy of overlapping calls between possible combinations of algorithms for every type and size range of SVs. The results demonstrate that both the precision and recall for overlapping calls vary depending on the combinations of specific algorithms rather than the combinations of methods used in the algorithms. These results suggest that careful selection of the algorithms for each type and size range of SVs is required for accurate calling of SVs. The selection of specific pairs of algorithms for overlapping calls promises to effectively improve the SV detection accuracy.

Journal ArticleDOI
TL;DR: The Network of Cancer Genes is a manually curated repository of 2372 genes whose somatic modifications have known or predicted cancer driver roles, and annotates properties of cancer genes, such as duplicability, evolutionary origin, RNA and protein expression, miRNA and protein interactions, and protein function and essentiality.
Abstract: The Network of Cancer Genes (NCG) is a manually curated repository of 2372 genes whose somatic modifications have known or predicted cancer driver roles. These genes were collected from 275 publications, including two sources of known cancer genes and 273 cancer sequencing screens of more than 100 cancer types from 34,905 cancer donors and multiple primary sites. This represents a more than 1.5-fold content increase compared to the previous version. NCG also annotates properties of cancer genes, such as duplicability, evolutionary origin, RNA and protein expression, miRNA and protein interactions, and protein function and essentiality. NCG is accessible at http://ncg.kcl.ac.uk/ .

Journal ArticleDOI
TL;DR: Simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance are proposed, which outperform the current practice in a downstream clustering assessment using ground truth datasets.
Abstract: Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.

Journal ArticleDOI
TL;DR: HINT-ATAC uses a position dependency model to learn the cleavage preferences of the transposase, and observes strand-specific cleavage patterns around transcription factor binding sites, which are determined by local nucleosome architecture.
Abstract: Transposase-Accessible Chromatin followed by sequencing (ATAC-seq) is a simple protocol for detection of open chromatin. Computational footprinting, the search for regions with depletion of cleavage events due to transcription factor binding, is poorly understood for ATAC-seq. We propose the first footprinting method considering ATAC-seq protocol artifacts. HINT-ATAC uses a position dependency model to learn the cleavage preferences of the transposase. We observe strand-specific cleavage patterns around transcription factor binding sites, which are determined by local nucleosome architecture. By incorporating all these biases, HINT-ATAC is able to significantly outperform competing methods in the prediction of transcription factor binding sites with footprints.

Journal ArticleDOI
TL;DR: The enzymatic machineries and modifications that are involved in cancer development and progression are reviewed, and how to apply currently available small molecule inhibitors for histone modifiers as tool compounds to study the functional significance of histone modifications and their clinical implications is reviewed.
Abstract: The epigenetic modifications of histones are versatile marks that are intimately connected to development and disease pathogenesis including human cancers. In this review, we will discuss the many different types of histone modifications and the biological processes with which they are involved. Specifically, we review the enzymatic machineries and modifications that are involved in cancer development and progression, and how to apply currently available small molecule inhibitors for histone modifiers as tool compounds to study the functional significance of histone modifications and their clinical implications.

Journal ArticleDOI
TL;DR: It is found that CAF-derived exosomal miR-196a confers cisplatin resistance in HNC by targeting CDKN1B and ING5, indicating miR -196a may serve as a promising predictor of and potential therapeutic target for cisplatoon resistance in head and neck cancer.
Abstract: Cisplatin resistance is a major challenge for advanced head and neck cancer (HNC). Understanding the underlying mechanisms and developing effective strategies against cisplatin resistance are highly desired in the clinic. However, how tumor stroma modulates HNC growth and chemoresistance is unclear. We show that cancer-associated fibroblasts (CAFs) are intrinsically resistant to cisplatin and have an active role in regulating HNC cell survival and proliferation by delivering functional miR-196a from CAFs to tumor cells via exosomes. Exosomal miR-196a then binds novel targets, CDKN1B and ING5, to endow HNC cells with cisplatin resistance. Exosome or exosomal miR-196a depletion from CAFs functionally restored HNC cisplatin sensitivity. Importantly, we found that miR-196a packaging into CAF-derived exosomes might be mediated by heterogeneous nuclear ribonucleoprotein A1 (hnRNPA1). Moreover, we also found that high levels of plasma exosomal miR-196a are clinically correlated with poor overall survival and chemoresistance. The present study finds that CAF-derived exosomal miR-196a confers cisplatin resistance in HNC by targeting CDKN1B and ING5, indicating miR-196a may serve as a promising predictor of and potential therapeutic target for cisplatin resistance in HNC.

Journal ArticleDOI
TL;DR: Powell et al. as discussed by the authors presented scPred, a new generalizable method that is able to provide highly accurate classification of single cells, using a combination of unbiased feature selection from a reduced-dimension space, and machine-learning probability-based prediction method.
Abstract: Single-cell RNA sequencing has enabled the characterization of highly specific cell types in many tissues, as well as both primary and stem cell-derived cell lines. An important facet of these studies is the ability to identify the transcriptional signatures that define a cell type or state. In theory, this information can be used to classify an individual cell based on its transcriptional profile. Here, we present scPred, a new generalizable method that is able to provide highly accurate classification of single cells, using a combination of unbiased feature selection from a reduced-dimension space, and machine-learning probability-based prediction method. We apply scPred to scRNA-seq data from pancreatic tissue, mononuclear cells, colorectal tumor biopsies, and circulating dendritic cells and show that scPred is able to classify individual cells with high accuracy. The generalized method is available at https://github.com/powellgenomicslab/scPred/.

Journal ArticleDOI
TL;DR: It is shown that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages, and a classification method is presented that integrates multiple signals to classify sequences and metagenome-assembled genomes.
Abstract: Current-day metagenomics analyses increasingly involve de novo taxonomic classification of long DNA sequences and metagenome-assembled genomes. Here, we show that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages. We present a classification method that integrates multiple signals to classify sequences (Contig Annotation Tool, CAT) and metagenome-assembled genomes (Bin Annotation Tool, BAT). Classifications are automatically made at low taxonomic ranks if closely related organisms are present in the reference database and at higher ranks otherwise. The result is a high classification precision even for sequences from considerably unknown organisms.

Journal ArticleDOI
Naihui Zhou1, Yuxiang Jiang2, Timothy Bergquist3, Alexandra J. Lee4  +185 moreInstitutions (71)
TL;DR: The third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed, concluded that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not.
Abstract: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aureginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

Journal ArticleDOI
TL;DR: A benchmarking framework is presented that is applied to 10 computational methods for scATAC-seq on 13 synthetic and real datasets from different assays, profiling cell types from diverse tissues and organisms.
Abstract: Recent innovations in single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) enable profiling of the epigenetic landscape of thousands of individual cells. scATAC-seq data analysis presents unique methodological challenges. scATAC-seq experiments sample DNA, which, due to low copy numbers (diploid in humans), lead to inherent data sparsity (1–10% of peaks detected per cell) compared to transcriptomic (scRNA-seq) data (10–45% of expressed genes detected per cell). Such challenges in data generation emphasize the need for informative features to assess cell heterogeneity at the chromatin level. We present a benchmarking framework that is applied to 10 computational methods for scATAC-seq on 13 synthetic and real datasets from different assays, profiling cell types from diverse tissues and organisms. Methods for processing and featurizing scATAC-seq data were compared by their ability to discriminate cell types when combined with common unsupervised clustering approaches. We rank evaluated methods and discuss computational challenges associated with scATAC-seq analysis including inherently sparse data, determination of features, peak calling, the effects of sequencing coverage and noise, and clustering performance. Running times and memory requirements are also discussed. This reference summary of scATAC-seq methods offers recommendations for best practices with consideration for both the non-expert user and the methods developer. Despite variation across methods and datasets, SnapATAC, Cusanovich2018, and cisTopic outperform other methods in separating cell populations of different coverages and noise levels in both synthetic and real datasets. Notably, SnapATAC is the only method able to analyze a large dataset (> 80,000 cells).

Journal ArticleDOI
TL;DR: A new version of RnBeads software - an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing, provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, and improved usability with a novel graphical user interface.
Abstract: DNA methylation is a widely investigated epigenetic mark with important roles in development and disease. High-throughput assays enable genome-scale DNA methylation analysis in large numbers of samples. Here, we describe a new version of our RnBeads software - an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing. RnBeads 2.0 (https://rnbeads.org/) provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, improved usability with a novel graphical user interface, and better use of computational resources. We demonstrate RnBeads 2.0 in four re-runnable use cases focusing on cell differentiation and cancer.