Showing papers in "Genome Biology in 2019"

Open Access

Journal Article•DOI•

OrthoFinder: phylogenetic orthology inference for comparative genomics

David M. Emms¹, Steven L. Kelly¹•Institutions (1)

14 Nov 2019-Genome Biology

TL;DR: A major update of the OrthoFinder method extends its high-accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics.

Abstract: Here, we present a major advance of the OrthoFinder method. This extends OrthoFinder’s high accuracy orthogroup inference to provide phylogenetic inference of orthologs, rooted gene trees, gene duplication events, the rooted species tree, and comparative genomics statistics. Each output is benchmarked on appropriate real or simulated datasets, and where comparable methods exist, OrthoFinder is equivalent to or outperforms these methods. Furthermore, OrthoFinder is the most accurate ortholog inference method on the Quest for Orthologs benchmark test. Finally, OrthoFinder’s comprehensive phylogenetic analysis is achieved with equivalent speed and scalability to the fastest, score-based heuristic methods. OrthoFinder is available at https://github.com/davidemms/OrthoFinder.

2,376 citations

Journal Article•DOI•

Improved metagenomic analysis with Kraken 2.

Derrick E. Wood¹, Jennifer Lu¹, Ben Langmead¹•Institutions (1)

Johns Hopkins University¹

28 Nov 2019-Genome Biology

TL;DR: Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold.

Abstract: Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.
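
The general idea of k-mer-based classification mentioned in this abstract can be illustrated with a toy sketch. This is not Kraken's or Kraken 2's algorithm: the voting rule, the function names (`kmers`, `build_index`, `classify`), and the two short reference sequences are all hypothetical, chosen only to show how shared k-mers can link a read to a reference.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of taxon labels whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon supported by the most k-mers of the read.

    Simplification (an assumption of this sketch): only k-mers found in
    exactly one taxon cast a vote. Returns None when nothing matches.
    """
    votes = Counter()
    for km in kmers(read, k):
        taxa = index.get(km, ())
        if len(taxa) == 1:
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy references; real databases hold whole genomes.
references = {
    "taxon_A": "ACGTACGTGGAACCTT",
    "taxon_B": "TTGACCATGCAATCGG",
}
index = build_index(references, k=5)
label = classify("ACGTGGAAC", index, k=5)
```

A plain dictionary of sets like this grows quickly with reference size, which gives a rough feel for why memory use is the limiting factor the abstract describes.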

2,261 citations

Journal Article•DOI•

Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

Christoph Hafemeister, Rahul Satija¹•Institutions (1)

New York University¹

23 Dec 2019-Genome Biology

TL;DR: It is proposed that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity.

Abstract: Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.
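
The Pearson residual described in this abstract can be sketched for a single gene in a single cell. The sketch assumes the standard negative binomial variance, mu + mu²/theta, and a log-link model with log10 sequencing depth as the only covariate. The function names and all parameter values are hypothetical, and the regularization step (pooling information across genes) is not shown.

```python
import math


def expected_count(beta0, beta1, depth):
    """Mean of a log-link GLM with log10 sequencing depth as covariate:
    log(mu) = beta0 + beta1 * log10(depth)."""
    return math.exp(beta0 + beta1 * math.log10(depth))


def nb_pearson_residual(count, mu, theta):
    """Pearson residual of an observed count under a negative binomial
    with mean mu and inverse dispersion theta (variance = mu + mu**2 / theta)."""
    variance = mu + mu ** 2 / theta
    return (count - mu) / math.sqrt(variance)


# Hypothetical fitted parameters for one gene, and one cell's depth and count.
mu = expected_count(beta0=-2.0, beta1=1.0, depth=10_000)
residual = nb_pearson_residual(count=12, mu=mu, theta=5.0)
```

Because the residual divides by the model's standard deviation, a count that is high only because the cell was sequenced deeply ends up near zero.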

1,898 citations

Journal Article•DOI•

Performance of neural network basecalling tools for Oxford Nanopore sequencing.

Ryan R. Wick¹, Louise M. Judd¹, Kathryn E. Holt¹,²•Institutions (2)

Monash University¹, University of London²

24 Jun 2019-Genome Biology

TL;DR: The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance; if higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.

Abstract: Basecalling, the computational process of translating raw electrical signal to nucleotide sequence, is of critical importance to the sequencing platforms produced by Oxford Nanopore Technologies (ONT). Here, we examine the performance of different basecalling tools, looking at accuracy at the level of bases within individual reads and at majority-rule consensus basecalls in an assembly. We also investigate some additional aspects of basecalling: training using a taxon-specific dataset, using a larger neural network model and improving consensus basecalls in an assembly by additional signal-level analysis with Nanopolish. Training basecallers on taxon-specific data results in a significant boost in consensus accuracy, mostly due to the reduction of errors in methylation motifs. A larger neural network is able to improve both read and consensus accuracy, but at a cost to speed. Improving consensus sequences (‘polishing’) with Nanopolish somewhat negates the accuracy differences in basecallers, but pre-polish accuracy does have an effect on post-polish accuracy. Basecalling accuracy has seen significant improvements over the last 2 years. The current version of ONT’s Guppy basecaller performs well overall, with good accuracy and fast performance. If higher accuracy is required, users should consider producing a custom model using a larger neural network and/or training data from the same species.
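
The distinction between read-level accuracy and majority-rule consensus accuracy can be illustrated with a toy sketch. It assumes the reads are already aligned to equal length with `-` as the gap symbol. The helper names and the example reads are hypothetical, and real evaluations work from full alignments rather than position-by-position comparison.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length aligned reads.

    Columns where the gap symbol wins the vote are dropped.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


def identity(seq, truth):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(seq) != len(truth):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(seq, truth)) / len(truth)


# Hypothetical reads, each with at most one random error.
truth = "ACGTAC"
reads = ["ACGTAC", "ACGAAC", "ACGTAC", "TCGTAC"]
consensus = majority_consensus(reads)
read_identities = [identity(r, truth) for r in reads]
consensus_identity = identity(consensus, truth)
```

Random errors in individual reads are outvoted in the consensus, whereas an error shared by most reads at the same position would survive the vote. That is consistent with the abstract's note that the consensus gains from taxon-specific training come mostly from methylation motifs.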

1,488 citations

Journal Article•DOI•

PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells

F. Alexander Wolf, Fiona K. Hamey¹, Mireya Plass², Jordi Solana², Joakim S. Dahlin¹,³, Berthold Göttgens¹, Nikolaus Rajewsky², Lukas M. Simon, Fabian J. Theis⁴•Institutions (4)

University of Cambridge¹, Max Delbrück Center for Molecular Medicine², Karolinska University Hospital³, Technische Universität München⁴

19 Mar 2019-Genome Biology

TL;DR: Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions; PAGA maps preserve the global topology of data, allow analyzing data at different resolutions, and result in much higher computational efficiency of the typical exploratory data analysis workflow.

Abstract: Single-cell RNA-seq quantifies biological heterogeneity across both discrete cell types and continuous cell transitions. Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions ( https://github.com/theislab/paga ). PAGA maps preserve the global topology of data, allow analyzing data at different resolutions, and result in much higher computational efficiency of the typical exploratory data analysis workflow. We demonstrate the method by inferring structure-rich cell maps with consistent topology across four hematopoietic datasets, adult planaria and the zebrafish embryo and benchmark computational performance on one million neurons.

827 citations

Journal Article•DOI•

Cytoscape Automation: empowering workflow-based network analysis.

David Otasek¹, John H. Morris², Jorge Boucas³, Alexander R. Pico⁴, Barry Demchak⁵•Institutions (5)

University of California, Los Angeles¹, University of California, San Francisco², Max Planck Society³, Gladstone Institutes⁴, University of California, San Diego⁵

02 Sep 2019-Genome Biology

TL;DR: Cytoscape Automation (CA), which marries Cytoscape to highly productive workflow systems such as Python/R in Jupyter/RStudio, is described; it exposes over 270 Cytoscape core functions and 34 Cytoscape apps as REST-callable functions with standardized JSON interfaces backed by Swagger documentation.

Abstract: Cytoscape is one of the most successful network biology analysis and visualization tools, but because of its interactive nature, its role in creating reproducible, scalable, and novel workflows has been limited. We describe Cytoscape Automation (CA), which marries Cytoscape to highly productive workflow systems, for example, Python/R in Jupyter/RStudio. We expose over 270 Cytoscape core functions and 34 Cytoscape apps as REST-callable functions with standardized JSON interfaces backed by Swagger documentation. Independent projects to create and publish Python/R native CA interface libraries have reached an advanced stage, and a number of automation workflows are already published.

721 citations

Journal Article•DOI•

Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Sam Kovaka¹, Aleksey V. Zimin¹, Geo Pertea¹, Roham Razaghi¹, Steven L. Salzberg, Mihaela Pertea¹•Institutions (1)

Johns Hopkins University¹

16 Dec 2019-Genome Biology

TL;DR: StringTie2 is a reference-guided transcriptome assembler that works with both short and long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies.

Abstract: RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

635 citations

Journal Article•DOI•

An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

Nathan D. Grubaugh¹,², Karthik Gangavarapu¹, Joshua Quick³, Nathaniel L. Matteson¹, Jaqueline Goes de Jesus³,⁴, Bradley J. Main⁵, Amanda L. Tan⁶, Lauren M. Paul⁶, Doug E. Brackney⁷, Saran Grewal, Nikos Gurfield, Koen K. A. Van Rompay⁸, Sharon Isern⁶, Scott F. Michael⁶, Lark L. Coffey⁵, Nicholas J. Loman³, Kristian G. Andersen¹,⁹•Institutions (9)

Scripps Research Institute¹, Yale University², University of Birmingham³, Oswaldo Cruz Foundation⁴, University of California, Davis⁵, Florida State University College of Arts and Sciences⁶, Connecticut Agricultural Experiment Station⁷, California National Primate Research Center⁸, Scripps Health⁹

08 Jan 2019-Genome Biology

TL;DR: The utility of PrimalSeq is demonstrated by measuring Zika and West Nile virus diversity from varied sample types, and the accumulation of genetic diversity is shown to be influenced by experimental and biological systems.

Abstract: How viruses evolve within hosts can dictate infection outcomes; however, reconstructing this process is challenging. We evaluate our multiplexed amplicon approach, PrimalSeq, to demonstrate how virus concentration, sequencing coverage, primer mismatches, and replicates influence the accuracy of measuring intrahost virus diversity. We develop an experimental protocol and computational tool, iVar, for using PrimalSeq to measure virus diversity using Illumina and compare the results to Oxford Nanopore sequencing. We demonstrate the utility of PrimalSeq by measuring Zika and West Nile virus diversity from varied sample types and show that the accumulation of genetic diversity is influenced by experimental and biological systems.

612 citations

Journal Article•DOI•

EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data

Aaron T. L. Lun¹, Samantha J. Riesenfeld², Tallulah S. Andrews³, Tomás Gomes³, John C. Marioni¹,³,⁴•Institutions (4)

University of Cambridge¹, Broad Institute², Wellcome Trust Sanger Institute³, European Bioinformatics Institute⁴

22 Mar 2019-Genome Biology

TL;DR: This work describes a new statistical method, EmptyDrops, based on detecting significant deviations from the expression profile of the ambient solution; in several real data sets, the method retains distinct cell types that would have been discarded by existing methods.

Abstract: Droplet-based single-cell RNA sequencing protocols have dramatically increased the throughput of single-cell transcriptomics studies. A key computational challenge when processing these data is to distinguish libraries for real cells from empty droplets. Here, we describe a new statistical method for calling cells from droplet-based data, based on detecting significant deviations from the expression profile of the ambient solution. Using simulations, we demonstrate that EmptyDrops has greater power than existing approaches while controlling the false discovery rate among detected cells. Our method also retains distinct cell types that would have been discarded by existing methods in several real data sets.

499 citations

Journal Article•DOI•

Prediction of functional microRNA targets by integrative modeling of microRNA binding and target expression data

Weijun Liu¹, Xiaowei Wang¹•Institutions (1)

Washington University in St. Louis¹

22 Jan 2019-Genome Biology

TL;DR: A large-scale RNA sequencing study is performed to experimentally identify genes that are downregulated by 25 miRNAs and an improved computational model for genome-wide miRNA target prediction is developed and validated.

Abstract: We perform a large-scale RNA sequencing study to experimentally identify genes that are downregulated by 25 miRNAs. This RNA-seq dataset is combined with public miRNA target binding data to systematically identify miRNA targeting features that are characteristic of both miRNA binding and target downregulation. By integrating these common features in a machine learning framework, we develop and validate an improved computational model for genome-wide miRNA target prediction. All prediction data can be accessed at miRDB ( http://mirdb.org ).

478 citations

Journal Article•DOI•

Gene duplication and evolution in recurring polyploidization-diploidization cycles in plants.

Xin Qiao¹, Qionghou Li¹, Hao Yin¹, Kaijie Qi¹, Leiting Li¹, Runze Wang¹, Shaoling Zhang¹, Andrew H. Paterson²•Institutions (2)

Nanjing Agricultural University¹, Plant Genome Mapping Laboratory²

21 Feb 2019-Genome Biology

TL;DR: A comprehensive landscape of different modes of gene duplication across the plant kingdom is identified by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.

Abstract: The sharp increase of plant genome and transcriptome data provides valuable resources to investigate evolutionary consequences of gene duplication in a range of taxa, and to unravel common principles underlying duplicate gene retention. We survey 141 sequenced plant genomes to elucidate consequences of gene and genome duplication, processes central to the evolution of biodiversity. We develop a pipeline named DupGen_finder to identify different modes of gene duplication in plants. Genes derived from whole-genome, tandem, proximal, transposed, or dispersed duplication differ in abundance, selection pressure, expression divergence, and gene conversion rate among genomes. The number of WGD-derived duplicate genes decreases exponentially with increasing age of duplication events; transposed duplication- and dispersed duplication-derived genes declined in parallel. In contrast, the frequency of tandem and proximal duplications showed no significant decrease over time, providing a continuous supply of variants available for adaptation to continuously changing environments. Moreover, tandem and proximal duplicates experienced stronger selective pressure than genes formed by other modes and evolved toward biased functional roles involved in plant self-defense. The rate of gene conversion among WGD-derived gene pairs declined over time, peaking shortly after polyploidization. To provide a platform for accessing duplicated gene pairs in different plants, we constructed the Plant Duplicate Gene Database. We identify a comprehensive landscape of different modes of gene duplication across the plant kingdom by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.

Journal Article•DOI•

DNA methylation aging clocks: challenges and recommendations

Christopher G. Bell¹, Robert Lowe¹, Peter D. Adams²,³, Andrea A. Baccarelli⁴, Stephan Beck⁵, Jordana T. Bell⁶, Brock C. Christensen⁷, Vadim N. Gladyshev⁸, Bastiaan T. Heijmans⁹, Steve Horvath¹⁰, Trey Ideker¹¹, Jean Pierre J. Issa¹², Karl T. Kelsey¹³, Riccardo E. Marioni¹⁴, Wolf Reik¹⁵,¹⁶, Caroline L. Relton¹⁷, Leonard C. Schalkwyk¹⁸, Andrew E. Teschendorff⁵,¹⁹, Wolfgang Wagner²⁰, Kang Zhang²¹, Vardhman K. Rakyan¹•Institutions (21)

25 Nov 2019-Genome Biology

TL;DR: Key challenges to understand clock mechanisms and biomarker utility are discussed, including dissecting the drivers and regulators of age-related changes in single-cell, tissue- and disease-specific models, as well as exploring other epigenomic marks, longitudinal and diverse population studies, and non-human models.

Abstract: Epigenetic clocks comprise a set of CpG sites whose DNA methylation levels measure subject age. These clocks are acknowledged as a highly accurate molecular correlate of chronological age in humans and other vertebrates. Also, extensive research is aimed at their potential to quantify biological aging rates and test longevity or rejuvenating interventions. Here, we discuss key challenges to understand clock mechanisms and biomarker utility. This requires dissecting the drivers and regulators of age-related changes in single-cell, tissue- and disease-specific models, as well as exploring other epigenomic marks, longitudinal and diverse population studies, and non-human models. We also highlight important ethical issues in forensic age determination and predicting the trajectory of biological aging in an individual.

Journal Article•DOI•

Current status and applications of genome-scale metabolic models.

Changdai Gu¹, Gi Bae Kim¹, Won Jun Kim¹, Hyun Uk Kim¹, Sang Yup Lee¹•Institutions (1)

KAIST¹

13 Jun 2019-Genome Biology

TL;DR: Current reconstructed GEMs are reviewed and their applications discussed, including strain development for chemicals and materials production, drug targeting in pathogens, prediction of enzyme functions, pan-reactome analysis, modeling interactions among multiple cells or organisms, and understanding human diseases.

Abstract: Genome-scale metabolic models (GEMs) computationally describe gene-protein-reaction associations for entire metabolic genes in an organism, and can be simulated to predict metabolic fluxes for various systems-level metabolic studies. Since the first GEM for Haemophilus influenzae was reported in 1999, advances have been made to develop and simulate GEMs for an increasing number of organisms across bacteria, archaea, and eukarya. Here, we review current reconstructed GEMs and discuss their applications, including strain development for chemicals and materials production, drug targeting in pathogens, prediction of enzyme functions, pan-reactome analysis, modeling interactions among multiple cells or organisms, and understanding human diseases.

Journal Article•DOI•

Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline

Shujun Ou¹, Weija Su¹, Yi Liao², Kapeel Chougule³, Jireh Agda⁴, Adam J. Hellinga⁴, Carlos Santiago Blanco Lugo⁴, Tyler A. Elliott⁴, Doreen Ware³,⁵, Thomas Peterson¹, Ning Jiang⁶, Candice N. Hirsch⁷, Matthew B. Hufford¹•Institutions (7)

Iowa State University¹, University of California, Irvine², Cold Spring Harbor Laboratory³, University of Guelph⁴, Cornell University⁵, Michigan State University⁶, University of Minnesota⁷

16 Dec 2019-Genome Biology

TL;DR: A comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) is created that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements and will greatly facilitate TE annotation in eukaryotic genomes.

Abstract: Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and provide an opportunity for comprehensive annotation of TEs. Numerous methods exist for annotation of each class of TEs, but their relative performances have not been systematically compared. Moreover, a comprehensive pipeline is needed to produce a non-redundant library of TEs for species lacking this resource to generate whole-genome TE annotations. We benchmark existing programs based on a carefully curated library of rice TEs. We evaluate the performance of methods annotating long terminal repeat (LTR) retrotransposons, terminal inverted repeat (TIR) transposons, short TIR transposons known as miniature inverted transposable elements (MITEs), and Helitrons. Performance metrics include sensitivity, specificity, accuracy, precision, FDR, and F1. Using the most robust programs, we create a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a filtered non-redundant TE library for annotation of structurally intact and fragmented elements. EDTA also deconvolutes nested TE insertions frequently found in highly repetitive genomic regions. Using other model species with curated TE libraries (maize and Drosophila), EDTA is shown to be robust across both plant and animal species. The benchmarking results and pipeline developed here will greatly facilitate TE annotation in eukaryotic genomes. These annotations will promote a much more in-depth understanding of the diversity and evolution of TEs at both intra- and inter-species levels. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.
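
The six performance metrics named in this abstract all follow from the four cells of a confusion matrix. The sketch below uses the standard textbook definitions. The function name, the choice of counting unit (for example, annotated base pairs), and the example counts are assumptions for illustration, not values from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Ratios with a zero denominator are reported as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),          # recall, true positive rate
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "fdr": ratio(fp, tp + fp),                  # equals 1 - precision
        "f1": ratio(2 * tp, 2 * tp + fp + fn),      # harmonic mean of precision and recall
    }


# Hypothetical counts, not taken from the benchmark.
metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)
```

Reporting several metrics together matters here because true negatives usually dominate, so accuracy and specificity can look high even when precision is modest.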

Journal Article•DOI•

RaGOO: fast and accurate reference-guided scaffolding of draft genomes

Michael Alonge¹, Sebastian Soyk², Srividya Ramakrishnan¹, Xingang Wang², Sara Goodwin², Fritz J. Sedlazeck³, Zachary B. Lippman²,⁴, Michael C. Schatz¹,²•Institutions (4)

Johns Hopkins University¹, Cold Spring Harbor Laboratory², Baylor College of Medicine³, Howard Hughes Medical Institute⁴

28 Oct 2019-Genome Biology

TL;DR: This work presents RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes and demonstrates the scalability and utility of the tool.

Abstract: We present RaGOO, a reference-guided contig ordering and orienting tool that leverages the speed and sensitivity of Minimap2 to accurately achieve chromosome-scale assemblies in minutes. After the pseudomolecules are constructed, RaGOO identifies structural variants, including those spanning sequencing gaps. We show that RaGOO accurately orders and orients 3 de novo tomato genome assemblies, including the widely used M82 reference cultivar. We then demonstrate the scalability and utility of RaGOO with a pan-genome analysis of 103 Arabidopsis thaliana accessions by examining the structural variants detected in the newly assembled pseudomolecules. RaGOO is available open source at https://github.com/malonge/RaGOO .

Journal Article•DOI•

A comparison of automatic cell identification methods for single-cell RNA sequencing data

Tamim Abdelaal¹,², Lieke Michielsen¹,², Davy Cats¹, Dylan Hoogduin¹, Hailiang Mei¹, Marcel J. T. Reinders¹,², Ahmed Mahfouz¹,²•Institutions (2)

Leiden University Medical Center¹, Delft University of Technology²

09 Sep 2019-Genome Biology

TL;DR: This work benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers and found that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations.

Abstract: Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification. Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods’ sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments. We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub ( https://github.com/tabdelaal/scRNAseq_Benchmark ). Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.

Journal Article•DOI•

Structural variant calling: the long and the short of it

Medhat Mahmoud¹, Nastassia Gobet²,³, Diana Ivette Cruz-Dávalos²,³, Ninon Mounier², Christophe Dessimoz, Fritz J. Sedlazeck¹•Institutions (3)

Baylor College of Medicine¹, Swiss Institute of Bioinformatics², University of Lausanne³

20 Nov 2019-Genome Biology

TL;DR: Approaches for detecting structural variants (SVs) are reviewed with respect to their ability to infer SVs across the full spectrum of large, complex variations, and computational methods for each approach are presented.

Abstract: Recent research into structural variants (SVs) has established their importance to medicine and molecular biology, elucidating their role in various diseases, regulation of gene expression, ethnic diversity, and large-scale chromosome evolution—giving rise to the differences within populations and among species. Nevertheless, characterizing SVs and determining the optimal approach for a given experimental design remains a computational and scientific challenge. Multiple approaches have emerged to target various SV classes, zygosities, and size ranges. Here, we review these approaches with respect to their ability to infer SVs across the full spectrum of large, complex variations and present computational methods for each approach.

Journal Article•DOI•

Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods

Brian J. Haas¹, Alexander Dobin², Bo Li¹,³, Nicolas Stransky, Nathalie Pochet¹,⁴, Aviv Regev¹,⁵•Institutions (5)

Broad Institute¹, Cold Spring Harbor Laboratory², Harvard University³, Brigham and Women's Hospital⁴, Massachusetts Institute of Technology⁵

21 Oct 2019-Genome Biology

TL;DR: The lower accuracy of de novo assembly-based methods notwithstanding, they are useful for reconstructing fusion isoforms and tumor viruses, both of which are important in cancer research.

Abstract: Accurate fusion transcript detection is essential for comprehensive characterization of cancer transcriptomes. Over the last decade, multiple bioinformatic tools have been developed to predict fusions from RNA-seq, based on either read mapping or de novo fusion transcript assembly. We benchmark 23 different methods including applications we develop, STAR-Fusion and TrinityFusion, leveraging both simulated and real RNA-seq. Overall, STAR-Fusion, Arriba, and STAR-SEQR are the most accurate and fastest for fusion detection on cancer transcriptomes. The lower accuracy of de novo assembly-based methods notwithstanding, they are useful for reconstructing fusion isoforms and tumor viruses, both of which are important in cancer research.

Journal Article•DOI•

Translation of the circular RNA circβ-catenin promotes liver cancer cell growth through activation of the Wnt pathway

Wei-Cheng Liang¹,², Cheuk-Wa Wong², Puping Liang¹, Mai Shi², Ye Cao², Shitao Rao², Stephen Kwok-Wing Tsui², Mary M. Y. Waye², Qi Zhang¹, Weiming Fu³, Jinfang Zhang⁴•Institutions (4)

Sun Yat-sen University¹, The Chinese University of Hong Kong², Southern Medical University³, Guangzhou University of Chinese Medicine⁴

26 Apr 2019-Genome Biology

TL;DR: A non-canonical function of circRNA in modulating liver cancer cell growth through the Wnt pathway is illustrated, which can provide novel mechanistic insights into the underlying mechanisms of hepatocellular carcinoma.

Abstract: Circular RNAs are a class of regulatory RNA transcripts, which are ubiquitously expressed in eukaryotes. In the current study, we evaluate the function of a novel circRNA derived from the β-catenin gene locus, circβ-catenin. Circβ-catenin is predominantly localized in the cytoplasm and displays resistance to RNase-R treatment. We find that circβ-catenin is highly expressed in liver cancer tissues when compared to adjacent normal tissues. Silencing of circβ-catenin significantly suppresses malignant phenotypes in vitro and in vivo, and knockdown of this circRNA reduces the protein level of β-catenin without affecting its mRNA level. We show that circβ-catenin affects a wide spectrum of Wnt pathway-related genes, and furthermore, circβ-catenin produces a novel 370-amino acid β-catenin isoform that uses the same start codon as the linear β-catenin mRNA transcript, with translation terminated at a new stop codon created by circularization. We find that this novel isoform can stabilize full-length β-catenin by antagonizing GSK3β-induced β-catenin phosphorylation and degradation, leading to activation of the Wnt pathway. Our findings illustrate a non-canonical function of circRNA in modulating liver cancer cell growth through the Wnt pathway, which can provide novel insights into the mechanisms underlying hepatocellular carcinoma.

Journal Article•DOI•

Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing

Shunichi Kosugi, Yukihide Momozawa, Xiaoxi Liu, Chikashi Terao, Michiaki Kubo, Yoichiro Kamatani

03 Jun 2019-Genome Biology

TL;DR: This work comprehensively evaluates the performance of 69 existing SV detection algorithms and enumerates potential good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham are better algorithms in deletion or duplication categories.

Abstract: Structural variations (SVs) or copy number variations (CNVs) greatly impact the functions of the genes encoded in the genome and are responsible for diverse human diseases. Although a number of existing SV detection algorithms can detect many types of SVs using whole genome sequencing (WGS) data, no single algorithm can call every type of SVs with high precision and high recall. We comprehensively evaluate the performance of 69 existing SV detection algorithms using multiple simulated and real WGS datasets. The results highlight a subset of algorithms that accurately call SVs depending on specific types and size ranges of the SVs and that accurately determine breakpoints, sizes, and genotypes of the SVs. We enumerate potential good algorithms for each SV category, among which GRIDSS, Lumpy, SVseq2, SoftSV, Manta, and Wham are better algorithms in deletion or duplication categories. To improve the accuracy of SV calling, we systematically evaluate the accuracy of overlapping calls between possible combinations of algorithms for every type and size range of SVs. The results demonstrate that both the precision and recall for overlapping calls vary depending on the combinations of specific algorithms rather than the combinations of methods used in the algorithms. These results suggest that careful selection of the algorithms for each type and size range of SVs is required for accurate calling of SVs. The selection of specific pairs of algorithms for overlapping calls promises to effectively improve the SV detection accuracy.
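
The idea of keeping only the calls shared by two algorithms, then scoring them for precision and recall, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SV` record, the function names, and the 50% reciprocal-overlap threshold are hypothetical and are not taken from the paper, whose exact matching criteria are not given in the abstract.

```python
from typing import List, NamedTuple, Tuple


class SV(NamedTuple):
    """Hypothetical minimal SV record (not the paper's data model)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL" or "DUP"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two overlap fractions; 0.0 if type or chromosome differ."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def overlapping_calls(calls_a: List[SV], calls_b: List[SV],
                      min_ro: float = 0.5) -> List[SV]:
    """Calls from algorithm A that are supported by at least one call from B."""
    return [a for a in calls_a
            if any(reciprocal_overlap(a, b) >= min_ro for b in calls_b)]


def precision_recall(calls: List[SV], truth: List[SV],
                     min_ro: float = 0.5) -> Tuple[float, float]:
    """Precision over the calls and recall over the truth set."""
    hit_calls = sum(1 for c in calls
                    if any(reciprocal_overlap(c, t) >= min_ro for t in truth))
    hit_truth = sum(1 for t in truth
                    if any(reciprocal_overlap(c, t) >= min_ro for c in calls))
    precision = hit_calls / len(calls) if calls else 0.0
    recall = hit_truth / len(truth) if truth else 0.0
    return precision, recall
```

Intersecting two call sets in this way typically trades recall for precision, which is consistent with the abstract's point that the choice of algorithm pair matters for each SV type and size range.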

Journal Article•DOI•

The Network of Cancer Genes (NCG): a comprehensive catalogue of known and candidate cancer genes from cancer sequencing screens

Dimitra Repana¹,², Joel Nulsen¹,², Lisa Dressler¹,², Michele Bortolomeazzi¹,², Santhilata Kuppili Venkata¹,², Aikaterini Tourna¹,², Anna Yakovleva¹,², Tommaso Palmieri¹,², Francesca D. Ciccarelli¹,²•Institutions (2)

King's College London¹, Francis Crick Institute²

03 Jan 2019-Genome Biology

TL;DR: The Network of Cancer Genes is a manually curated repository of 2372 genes whose somatic modifications have known or predicted cancer driver roles, and annotates properties of cancer genes, such as duplicability, evolutionary origin, RNA and protein expression, miRNA and protein interactions, and protein function and essentiality.

Abstract: The Network of Cancer Genes (NCG) is a manually curated repository of 2372 genes whose somatic modifications have known or predicted cancer driver roles. These genes were collected from 275 publications, including two sources of known cancer genes and 273 cancer sequencing screens of more than 100 cancer types from 34,905 cancer donors and multiple primary sites. This represents a more than 1.5-fold content increase compared to the previous version. NCG also annotates properties of cancer genes, such as duplicability, evolutionary origin, RNA and protein expression, miRNA and protein interactions, and protein function and essentiality. NCG is accessible at http://ncg.kcl.ac.uk/ .

Journal Article•DOI•

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

F. William Townes¹,², Stephanie C. Hicks³, Martin J. Aryee, Rafael A. Irizarry¹•Institutions (3)

Harvard University¹, Princeton University², Johns Hopkins University³

23 Dec 2019-Genome Biology

TL;DR: Simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance are proposed, which outperform the current practice in a downstream clustering assessment using ground truth datasets.

Abstract: Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
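
Feature selection by deviance can be sketched as follows. This is an illustrative approximation, not the authors' implementation: it assumes a binomial deviance per gene against a null model in which each gene takes a constant proportion of every cell's total UMI count, and the function names and input layout (a list of per-cell count rows) are hypothetical.

```python
import math
from typing import List, Sequence


def _x_log_ratio(x: float, mu: float) -> float:
    """x * log(x / mu), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x / mu)


def binomial_deviance(counts: Sequence[Sequence[int]]) -> List[float]:
    """Per-gene deviance; counts[i][j] is the UMI count of gene j in cell i."""
    totals = [sum(row) for row in counts]   # total UMIs per cell
    grand_total = sum(totals)
    n_genes = len(counts[0])
    deviances = []
    for j in range(n_genes):
        # Null model: gene j has the same proportion pi in every cell.
        pi = sum(row[j] for row in counts) / grand_total
        if pi == 0.0 or pi == 1.0:
            deviances.append(0.0)           # no variation is possible
            continue
        d = 0.0
        for row, n in zip(counts, totals):
            y = row[j]
            d += _x_log_ratio(y, n * pi) + _x_log_ratio(n - y, n * (1.0 - pi))
        deviances.append(2.0 * d)
    return deviances


def top_genes(counts: Sequence[Sequence[int]], k: int) -> List[int]:
    """Indices of the k genes with the largest deviance."""
    dev = binomial_deviance(counts)
    return sorted(range(len(dev)), key=lambda j: dev[j], reverse=True)[:k]
```

A gene whose share of each cell's counts never changes gets a deviance near zero and is ranked last, while genes that depart from the constant-proportion null are ranked first. This works directly on raw counts, so no log transformation or counts-per-million normalization is needed before selection.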

Journal Article•DOI•

Identification of transcription factor binding sites using ATAC-seq

Zhijian Li¹, Marcel H. Schulz, Thomas Look¹, Matthias Begemann¹, Martin Zenke¹, Ivan Gesteira Costa Filho¹•Institutions (1)

RWTH Aachen University¹

26 Feb 2019-Genome Biology

TL;DR: HINT-ATAC uses a position dependency model to learn the cleavage preferences of the transposase, and observes strand-specific cleavage patterns around transcription factor binding sites, which are determined by local nucleosome architecture.

Abstract: Transposase-Accessible Chromatin followed by sequencing (ATAC-seq) is a simple protocol for detection of open chromatin. Computational footprinting, the search for regions with depletion of cleavage events due to transcription factor binding, is poorly understood for ATAC-seq. We propose the first footprinting method considering ATAC-seq protocol artifacts. HINT-ATAC uses a position dependency model to learn the cleavage preferences of the transposase. We observe strand-specific cleavage patterns around transcription factor binding sites, which are determined by local nucleosome architecture. By incorporating all these biases, HINT-ATAC is able to significantly outperform competing methods in the prediction of transcription factor binding sites with footprints.

Journal Article•DOI•

Epigenetic modifications of histones in cancer

Zibo Zhao¹, Ali Shilatifard¹•Institutions (1)

Northwestern University¹

20 Nov 2019-Genome Biology

TL;DR: This review covers the enzymatic machineries and modifications involved in cancer development and progression, and how currently available small molecule inhibitors of histone modifiers can be applied as tool compounds to study the functional significance of histone modifications and their clinical implications.

Abstract: The epigenetic modifications of histones are versatile marks that are intimately connected to development and disease pathogenesis including human cancers. In this review, we will discuss the many different types of histone modifications and the biological processes with which they are involved. Specifically, we review the enzymatic machineries and modifications that are involved in cancer development and progression, and how to apply currently available small molecule inhibitors for histone modifiers as tool compounds to study the functional significance of histone modifications and their clinical implications.

Journal Article•DOI•

Exosomal miR-196a derived from cancer-associated fibroblasts confers cisplatin resistance in head and neck cancer through targeting CDKN1B and ING5

Xing Qin¹, Haiyan Guo¹, Xiaoning Wang¹, Xueqin Zhu¹, Ming Yan¹, Xu Wang¹, Qin Xu¹, Jianbo Shi¹, Eryi Lu¹, Wantao Chen¹, Jianjun Zhang¹•Institutions (1)

Shanghai Jiao Tong University¹

14 Jan 2019-Genome Biology

TL;DR: It is found that CAF-derived exosomal miR-196a confers cisplatin resistance in HNC by targeting CDKN1B and ING5, indicating miR-196a may serve as a promising predictor of and potential therapeutic target for cisplatin resistance in head and neck cancer.

Abstract: Cisplatin resistance is a major challenge for advanced head and neck cancer (HNC). Understanding the underlying mechanisms and developing effective strategies against cisplatin resistance are highly desired in the clinic. However, how tumor stroma modulates HNC growth and chemoresistance is unclear. We show that cancer-associated fibroblasts (CAFs) are intrinsically resistant to cisplatin and have an active role in regulating HNC cell survival and proliferation by delivering functional miR-196a from CAFs to tumor cells via exosomes. Exosomal miR-196a then binds novel targets, CDKN1B and ING5, to endow HNC cells with cisplatin resistance. Exosome or exosomal miR-196a depletion from CAFs functionally restored HNC cisplatin sensitivity. Importantly, we found that miR-196a packaging into CAF-derived exosomes might be mediated by heterogeneous nuclear ribonucleoprotein A1 (hnRNPA1). Moreover, we also found that high levels of plasma exosomal miR-196a are clinically correlated with poor overall survival and chemoresistance. The present study finds that CAF-derived exosomal miR-196a confers cisplatin resistance in HNC by targeting CDKN1B and ING5, indicating miR-196a may serve as a promising predictor of and potential therapeutic target for cisplatin resistance in HNC.

Journal Article•DOI•

scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data

José Alquicira-Hernandez¹,², Anuja Sathe³, Hanlee P. Ji³, Quan Nguyen², Joseph E. Powell¹,⁴•Institutions (4)

Garvan Institute of Medical Research¹, University of Queensland², Stanford University³, University of New South Wales⁴

12 Dec 2019-Genome Biology

TL;DR: This work presents scPred, a new generalizable method that provides highly accurate classification of single cells, using a combination of unbiased feature selection from a reduced-dimension space and a machine-learning probability-based prediction method.

Abstract: Single-cell RNA sequencing has enabled the characterization of highly specific cell types in many tissues, as well as both primary and stem cell-derived cell lines. An important facet of these studies is the ability to identify the transcriptional signatures that define a cell type or state. In theory, this information can be used to classify an individual cell based on its transcriptional profile. Here, we present scPred, a new generalizable method that is able to provide highly accurate classification of single cells, using a combination of unbiased feature selection from a reduced-dimension space and a machine-learning probability-based prediction method. We apply scPred to scRNA-seq data from pancreatic tissue, mononuclear cells, colorectal tumor biopsies, and circulating dendritic cells and show that scPred is able to classify individual cells with high accuracy. The generalized method is available at https://github.com/powellgenomicslab/scPred/.

Journal Article•DOI•

Robust taxonomic classification of uncharted microbial sequences and bins with CAT and BAT

F. A. Bastiaan von Meijenfeldt¹, Ksenia Arkhipova¹, Diego D. Cambuy¹, Felipe H. Coutinho²,³,⁴, Bas E. Dutilh¹,⁴•Institutions (4)

Utrecht University¹, Universidad Miguel Hernández de Elche², Federal University of Rio de Janeiro³, Radboud University Nijmegen⁴

22 Oct 2019-Genome Biology

TL;DR: It is shown that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages, and a classification method is presented that integrates multiple signals to classify sequences and metagenome-assembled genomes.

Abstract: Current-day metagenomics analyses increasingly involve de novo taxonomic classification of long DNA sequences and metagenome-assembled genomes. Here, we show that the conventional best-hit approach often leads to classifications that are too specific, especially when the sequences represent novel deep lineages. We present a classification method that integrates multiple signals to classify sequences (Contig Annotation Tool, CAT) and metagenome-assembled genomes (Bin Annotation Tool, BAT). Classifications are automatically made at low taxonomic ranks if closely related organisms are present in the reference database and at higher ranks otherwise. The result is a high classification precision even for sequences from considerably unknown organisms.

Journal Article•DOI•

The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Naihui Zhou¹, Yuxiang Jiang², Timothy Bergquist³, Alexandra J. Lee⁴ +185 more•Institutions (71)

19 Nov 2019-Genome Biology

TL;DR: The third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed, concluded that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not.

Abstract: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

Journal Article•DOI•

Assessment of computational methods for the analysis of single-cell ATAC-seq data

Huidong Chen, Caleb A. Lareau¹,², Tommaso Andreani, Michael E. Vinyard, Sara P. Garcia¹, Kendell Clement, Miguel A. Andrade-Navarro³, Jason D. Buenrostro¹,², Luca Pinello•Institutions (3)

Harvard University¹, Broad Institute², University of Mainz³

18 Nov 2019-Genome Biology

TL;DR: A benchmarking framework is presented that is applied to 10 computational methods for scATAC-seq on 13 synthetic and real datasets from different assays, profiling cell types from diverse tissues and organisms.

Abstract: Recent innovations in single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) enable profiling of the epigenetic landscape of thousands of individual cells. scATAC-seq data analysis presents unique methodological challenges. scATAC-seq experiments sample DNA, which, due to low copy numbers (diploid in humans), leads to inherent data sparsity (1–10% of peaks detected per cell) compared to transcriptomic (scRNA-seq) data (10–45% of expressed genes detected per cell). Such challenges in data generation emphasize the need for informative features to assess cell heterogeneity at the chromatin level. We present a benchmarking framework that is applied to 10 computational methods for scATAC-seq on 13 synthetic and real datasets from different assays, profiling cell types from diverse tissues and organisms. Methods for processing and featurizing scATAC-seq data were compared by their ability to discriminate cell types when combined with common unsupervised clustering approaches. We rank evaluated methods and discuss computational challenges associated with scATAC-seq analysis including inherently sparse data, determination of features, peak calling, the effects of sequencing coverage and noise, and clustering performance. Running times and memory requirements are also discussed. This reference summary of scATAC-seq methods offers recommendations for best practices with consideration for both the non-expert user and the methods developer. Despite variation across methods and datasets, SnapATAC, Cusanovich2018, and cisTopic outperform other methods in separating cell populations of different coverages and noise levels in both synthetic and real datasets. Notably, SnapATAC is the only method able to analyze a large dataset (> 80,000 cells).

Journal Article•DOI•

RnBeads 2.0: comprehensive analysis of DNA methylation data

Fabian Müller¹,², Michael Scherer²,³, Yassen Assenov⁴, Pavlo Lutsik⁴, Jörn Walter³, Thomas Lengauer², Christoph Bock²,⁵,⁶•Institutions (6)

Stanford University¹, Max Planck Society², Saarland University³, German Cancer Research Center⁴, Medical University of Vienna⁵, Austrian Academy of Sciences⁶

14 Mar 2019-Genome Biology

TL;DR: A new version of the RnBeads software, an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing, provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, and improved usability with a novel graphical user interface.

Abstract: DNA methylation is a widely investigated epigenetic mark with important roles in development and disease. High-throughput assays enable genome-scale DNA methylation analysis in large numbers of samples. Here, we describe a new version of our RnBeads software - an R/Bioconductor package that implements start-to-finish analysis workflows for Infinium microarrays and various types of bisulfite sequencing. RnBeads 2.0 (https://rnbeads.org/) provides additional data types and analysis methods, new functionality for interpreting DNA methylation differences, improved usability with a novel graphical user interface, and better use of computational resources. We demonstrate RnBeads 2.0 in four re-runnable use cases focusing on cell differentiation and cancer.
