
Showing papers by "Zhong Wang published in 2019"


Journal ArticleDOI
26 Jul 2019-PeerJ
TL;DR: Comparing MetaBAT 2 to alternative software tools on over 100 real-world metagenome assemblies shows superior accuracy and computing speed; the authors recommend the community adopt MetaBAT 2 for their metagenome binning experiments.
Abstract: We previously reported on MetaBAT, an automated metagenome binning software tool to reconstruct single genomes from microbial communities for subsequent analyses of uncultivated microbial species. MetaBAT has become one of the most popular binning tools largely due to its computational efficiency and ease of use, especially in binning experiments with a large number of samples and a large assembly. MetaBAT requires users to choose parameters to fine-tune its sensitivity and specificity. If those parameters are not chosen properly, binning accuracy can suffer, especially on assemblies of poor quality. Here, we developed MetaBAT 2 to overcome this problem. MetaBAT 2 uses a new adaptive binning algorithm to eliminate manual parameter tuning. We also performed extensive software engineering optimization to increase both computational and memory efficiency. Comparing MetaBAT 2 to alternative software tools on over 100 real-world metagenome assemblies shows superior accuracy and computing speed. Binning a typical metagenome assembly takes only a few minutes on a single commodity workstation. We therefore recommend that the community adopt MetaBAT 2 for their metagenome binning experiments. MetaBAT 2 is open source software and available at https://bitbucket.org/berkeleylab/metabat.
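The core idea behind composition-based binning like MetaBAT's is that contigs from the same genome share a similar oligonucleotide signature. The sketch below is an illustrative reduction, not MetaBAT 2's actual algorithm: it bins contigs greedily by cosine similarity of tetranucleotide frequency vectors, whereas MetaBAT 2 additionally uses per-sample coverage and an adaptive distance model. All names here (`tnf`, `greedy_bin`, the 0.9 threshold) are hypothetical.

```python
from collections import Counter
from itertools import product
import math

def tnf(seq):
    """Tetranucleotide frequency vector: a crude stand-in for the
    composition signal that MetaBAT 2 combines with coverage."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(sum(counts[k] for k in kmers), 1)
    return [counts[k] / total for k in kmers]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_bin(contigs, threshold=0.9):
    """Assign each contig to the first bin whose seed contig has a
    similar composition; otherwise start a new bin."""
    bins = []  # list of (seed_tnf, member_names)
    for name, seq in contigs.items():
        v = tnf(seq)
        for seed, members in bins:
            if cosine(seed, v) >= threshold:
                members.append(name)
                break
        else:
            bins.append((v, [name]))
    return [members for _, members in bins]
```

On a toy input, AT-rich contigs cluster away from GC-rich ones; real binners face far noisier signals, which is why MetaBAT 2's adaptive scoring matters.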

1,334 citations


Journal ArticleDOI
TL;DR: An Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization, and produces high clustering performance on transcriptomes and metagenomes from both short- and long-read sequencing technologies.
Author(s): Shi, Lizhen; Meng, Xiandong; Tseng, Elizabeth; Mascagni, Michael; Wang, Zhong
Abstract: Motivation: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes. Results: Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short- and long-read sequencing technologies. It achieves near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems. Availability and implementation: https://bitbucket.org/berkeleylab/jgi-sparc.
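Partitioning reads "by molecule of origin" can be pictured as finding connected components among reads linked by shared k-mers. The serial sketch below illustrates only that idea; SpaRC's contribution is doing this at scale with Spark's distributed primitives, and the union-find approach, `k=11`, and function names here are all assumptions for illustration.

```python
def kmers(seq, k):
    """Set of all k-length substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class DSU:
    """Minimal union-find (disjoint set union) over read indices."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]  # path halving
            x = self.p[x]
        return x
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def cluster_reads(reads, k=11):
    """Group reads into connected components linked by any shared
    k-mer, a serial sketch of read partitioning by origin."""
    dsu = DSU(len(reads))
    seen = {}  # k-mer -> first read index that contained it
    for i, r in enumerate(reads):
        for km in kmers(r, k):
            if km in seen:
                dsu.union(i, seen[km])
            else:
                seen[km] = i
    groups = {}
    for i in range(len(reads)):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())
```

In a distributed setting the `seen` map becomes a shuffle key (k-mer to read IDs), which is where Spark's scalability pays off.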

20 citations


Journal ArticleDOI
TL;DR: MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly.
Abstract: Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality, and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
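Once per-position quality has been predicted, the "scrubbing" step itself is a simple segmentation: drop stretches predicted to be low quality and keep the surviving subreads. The sketch below assumes the per-position scores are already given (in MiniScrub they come from a CNN over overlap-derived images); the function name, threshold, and `min_len` are illustrative choices, not the tool's interface.

```python
def scrub(read, qual, threshold=0.5, min_len=5):
    """Split a read at positions whose predicted quality falls below
    `threshold`, keeping only segments of at least `min_len` bases."""
    segments, start = [], None
    for i, q in enumerate(qual):
        if q >= threshold:
            if start is None:
                start = i  # entering a high-quality run
        else:
            if start is not None and i - start >= min_len:
                segments.append(read[start:i])
            start = None  # inside a low-quality run
    # flush a trailing high-quality run
    if start is not None and len(read) - start >= min_len:
        segments.append(read[start:])
    return segments
```

Downstream assemblers then see several clean subreads instead of one chimeric read whose low-quality middle would seed mis-assemblies.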

11 citations


Journal ArticleDOI
06 Dec 2019-Genes
TL;DR: This review surveys some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics in the context of ease of development, robustness, scalability, and efficiency.
Abstract: The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for a computer science audience with an interest in genomics applications.

9 citations


Posted ContentDOI
21 Oct 2019-bioRxiv
TL;DR: An efficient software suite that estimates similarities between genomes based on their k-mer matches, and subsequently uses these similarities for classification, clustering, and visualization, and demonstrates that Genome Constellation can tackle the computational and algorithmic challenges in large-scale taxonomy analyses in metagenomics.
Abstract: Classifying taxa, including those that have not previously been identified, is a key task in characterizing the microbial communities of under-described habitats, including permanently ice-covered lakes in the dry valleys of the Antarctic. Current supervised phylogeny-based methods fall short on recognizing species assembled from metagenomic datasets from such habitats, as they are often incomplete or lack close known relatives. Here, we report an efficient software suite, "Genome Constellation", that is capable of rapidly characterizing a large number of metagenome-assembled genomes. Genome Constellation estimates similarities between genomes based on their k-mer matches, and subsequently uses these similarities for classification, clustering, and visualization. The clusters of reference genomes formed by Genome Constellation closely resemble known phylogenetic relationships while simultaneously revealing unexpected connections. In a dataset containing 1,693 draft genomes assembled from the Antarctic lake communities where only 40% could be placed in a phylogenetic tree, Genome Constellation improves taxa assignment to 61%. The clustering-based analysis revealed several novel taxa groups, including six clusters that may represent new bacterial phyla. Remarkably, we discovered 63 new giant viruses, 3 of which could not be found using the traditional marker-based approach. In summary, we demonstrate that Genome Constellation provides an unbiased option to rapidly analyze a large number of microbial genomes and visually explore their relatedness. The software is available under BSD license at: https://bitbucket.org/berkeleylab/jgi-genomeconstellation/.
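A similarity "based on k-mer matches" between two genomes can be illustrated with the Jaccard index of their k-mer sets. This is only a conceptual reduction: Genome Constellation works at the scale of thousands of genomes and its exact similarity measure is not specified here, so treat `jaccard` and the choice of `k=8` as assumptions.

```python
def kmer_set(seq, k=8):
    """All distinct k-mers of a genome sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=8):
    """Genome-to-genome similarity as the Jaccard index of k-mer
    sets: |A ∩ B| / |A ∪ B|, in [0, 1]."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 0.0
```

Pairwise similarities computed this way can feed directly into clustering or a force-directed layout for visualization, which is the kind of downstream use the abstract describes.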

5 citations


Posted ContentDOI
29 Apr 2019-bioRxiv
TL;DR: This paper extends SpaRC, a previously developed scalable read clustering method on Apache Spark, with a new method that further merges small clusters; it exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem.
Abstract: Motivation Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false-negative (under-clustering) or false-positive (over-clustering) problems. Results Building on SpaRC, a previously developed scalable read clustering method on Apache Spark that has very low false positives, here we extend its capability by adding a new method to further cluster small clusters. This method exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using a synthetic dataset from mouse gut microbiomes we show that this method has the potential to cluster almost all of the reads from genomes with sufficient sequencing coverage. We also explored several clustering parameters that differentially affect genomes with various sequencing coverage. Availability https://bitbucket.org/berkeleylab/jgi-sparc/. Contact zhongwang@lbl.gov
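One plausible form of a "statistic derived from multiple samples" is co-abundance: fragments of the same genome should rise and fall together across samples. The sketch below merges clusters whose per-sample read-count profiles correlate strongly; the Pearson criterion, the 0.95 cutoff, and all names are assumptions for illustration, not the paper's actual method.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def merge_small_clusters(profiles, min_r=0.95):
    """Greedily merge clusters whose per-sample abundance profiles
    are highly correlated, reducing under-clustering."""
    merged = []  # list of groups; group[0] is the seed cluster
    for name in profiles:
        for group in merged:
            if pearson(profiles[name], profiles[group[0]]) >= min_r:
                group.append(name)
                break
        else:
            merged.append([name])
    return merged
```

Clusters "c1" and "c2" below track each other across three samples and merge, while "c3" has an anti-correlated profile and stays separate.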

1 citation


Journal ArticleDOI
Author(s): Yin, Hengfu; Guo, Hao-Bo; Weston, David J; Borland, Anne M; Ranjan, Priya; Abraham, Paul E; Jawdy, Sara S; Wachira, James; Tuskan, Gerald A; Tschaplinski, Timothy J; Wullschleger, Stan D; Guo, Hong; Hettich, Robert L; Gross, Stephen M; Wang, Zhong; Visel, Axel; Yang, Xiaohan
Abstract: Following publication of the original article.