
Showing papers by "Zhong Wang published in 2019"


Journal ArticleDOI
26 Jul 2019-PeerJ
TL;DR: Comparing MetaBAT 2 to alternative software tools on over 100 real-world metagenome assemblies shows superior accuracy and computing speed; the authors recommend the community adopt MetaBAT 2 for their metagenome binning experiments.
Abstract: We previously reported on MetaBAT, an automated metagenome binning software tool to reconstruct single genomes from microbial communities for subsequent analyses of uncultivated microbial species. MetaBAT has become one of the most popular binning tools largely due to its computational efficiency and ease of use, especially in binning experiments with a large number of samples and a large assembly. MetaBAT requires users to choose parameters to fine-tune its sensitivity and specificity. If those parameters are not chosen properly, binning accuracy can suffer, especially on assemblies of poor quality. Here, we developed MetaBAT 2 to overcome this problem. MetaBAT 2 uses a new adaptive binning algorithm to eliminate manual parameter tuning. We also performed extensive software engineering optimization to increase both computational and memory efficiency. Comparing MetaBAT 2 to alternative software tools on over 100 real-world metagenome assemblies shows superior accuracy and computing speed. Binning a typical metagenome assembly takes only a few minutes on a single commodity workstation. We therefore recommend that the community adopt MetaBAT 2 for their metagenome binning experiments. MetaBAT 2 is open source software and available at https://bitbucket.org/berkeleylab/metabat.
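The core idea behind composition-based binning like MetaBAT's is that contigs from the same genome share a similar oligonucleotide signature. The sketch below is an illustrative reduction, not MetaBAT 2's actual algorithm: it bins contigs greedily by cosine similarity of tetranucleotide frequency vectors, whereas MetaBAT 2 additionally uses per-sample coverage and an adaptive distance model. All names here (`tnf`, `greedy_bin`, the 0.9 threshold) are hypothetical.

```python
from collections import Counter
from itertools import product
import math

def tnf(seq):
    """Tetranucleotide frequency vector: a crude stand-in for the
    composition signal that MetaBAT 2 combines with coverage."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(sum(counts[k] for k in kmers), 1)
    return [counts[k] / total for k in kmers]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_bin(contigs, threshold=0.9):
    """Assign each contig to the first bin whose seed contig has a
    similar composition; otherwise start a new bin."""
    bins = []  # list of (seed_tnf, member_names)
    for name, seq in contigs.items():
        v = tnf(seq)
        for seed, members in bins:
            if cosine(seed, v) >= threshold:
                members.append(name)
                break
        else:
            bins.append((v, [name]))
    return [members for _, members in bins]
```

On a toy input, AT-rich contigs cluster away from GC-rich ones; real binners face far noisier signals, which is why MetaBAT 2's adaptive scoring matters.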

1,334 citations


Journal ArticleDOI
TL;DR: An Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization, and produces high clustering performance on transcriptomes and metagenomes from both short- and long-read sequencing technologies.
Author(s): Shi, Lizhen; Meng, Xiandong; Tseng, Elizabeth; Mascagni, Michael; Wang, Zhong
Abstract: Motivation: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembly of these data sets requires tradeoffs between scalability and accuracy. Current assembly methods optimized for scalability often sacrifice accuracy and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes. Results: Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short- and long-read sequencing technologies. It achieves near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems. Availability and implementation: https://bitbucket.org/berkeleylab/jgi-sparc.
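Partitioning reads "by molecule of origin" can be pictured as finding connected components among reads linked by shared k-mers. The serial sketch below illustrates only that idea; SpaRC's contribution is doing this at scale with Spark's distributed primitives, and the union-find approach, `k=11`, and function names here are all assumptions for illustration.

```python
def kmers(seq, k):
    """Set of all k-length substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class DSU:
    """Minimal union-find (disjoint set union) over read indices."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]  # path halving
            x = self.p[x]
        return x
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def cluster_reads(reads, k=11):
    """Group reads into connected components linked by any shared
    k-mer, a serial sketch of read partitioning by origin."""
    dsu = DSU(len(reads))
    seen = {}  # k-mer -> first read index that contained it
    for i, r in enumerate(reads):
        for km in kmers(r, k):
            if km in seen:
                dsu.union(i, seen[km])
            else:
                seen[km] = i
    groups = {}
    for i in range(len(reads)):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())
```

In a distributed setting the `seen` map becomes a shuffle key (k-mer to read IDs), which is where Spark's scalability pays off.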

20 citations


Journal ArticleDOI
TL;DR: MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly.
Abstract: Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. The limited sensitivity of existing read-based error correction methods can cause large-scale mis-assemblies in the assembled genomes, motivating further innovation in this area. Here we developed a Convolutional Neural Network (CNN) based method, called MiniScrub, for identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments to minimize their interference in the downstream assembly process. MiniScrub first generates read-to-read overlaps via MiniMap2, then encodes the overlaps into images, and finally builds CNN models to predict low-quality segments. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality, and improves read error correction in the metagenome setting. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. MiniScrub is able to robustly improve read quality of Oxford Nanopore reads, especially in the metagenome setting, making it useful for downstream applications such as de novo assembly. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub.
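Once per-position quality has been predicted, the "scrubbing" step itself is a simple segmentation: drop stretches predicted to be low quality and keep the surviving subreads. The sketch below assumes the per-position scores are already given (in MiniScrub they come from a CNN over overlap-derived images); the function name, threshold, and `min_len` are illustrative choices, not the tool's interface.

```python
def scrub(read, qual, threshold=0.5, min_len=5):
    """Split a read at positions whose predicted quality falls below
    `threshold`, keeping only segments of at least `min_len` bases."""
    segments, start = [], None
    for i, q in enumerate(qual):
        if q >= threshold:
            if start is None:
                start = i  # entering a high-quality run
        else:
            if start is not None and i - start >= min_len:
                segments.append(read[start:i])
            start = None  # inside a low-quality run
    # flush a trailing high-quality run
    if start is not None and len(read) - start >= min_len:
        segments.append(read[start:])
    return segments
```

Downstream assemblers then see several clean subreads instead of one chimeric read whose low-quality middle would seed mis-assemblies.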

11 citations


Journal ArticleDOI
06 Dec 2019-Genes
TL;DR: This review surveys some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics in the context of ease of development, robustness, scalability, and efficiency.
Abstract: The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for a computer science audience with an interest in genomics applications.

9 citations


Posted ContentDOI
21 Oct 2019-bioRxiv
TL;DR: An efficient software suite that estimates similarities between genomes based on their k-mer matches, and subsequently uses these similarities for classification, clustering, and visualization, and demonstrates that Genome Constellation can tackle the computational and algorithmic challenges in large-scale taxonomy analyses in metagenomics.
Abstract: Classifying taxa, including those that have not previously been identified, is a key task in characterizing the microbial communities of under-described habitats, including permanently ice-covered lakes in the dry valleys of the Antarctic. Current supervised phylogeny-based methods fall short on recognizing species assembled from metagenomic datasets from such habitats, as they are often incomplete or lack close known relatives. Here, we report an efficient software suite, "Genome Constellation", that is capable of rapidly characterizing a large number of metagenome-assembled genomes. Genome Constellation estimates similarities between genomes based on their k-mer matches, and subsequently uses these similarities for classification, clustering, and visualization. The clusters of reference genomes formed by Genome Constellation closely resemble known phylogenetic relationships while simultaneously revealing unexpected connections. In a dataset containing 1,693 draft genomes assembled from the Antarctic lake communities where only 40% could be placed in a phylogenetic tree, Genome Constellation improves taxa assignment to 61%. The clustering-based analysis revealed several novel taxa groups, including six clusters that may represent new bacterial phyla. Remarkably, we discovered 63 new giant viruses, 3 of which could not be found using the traditional marker-based approach. In summary, we demonstrate that Genome Constellation provides an unbiased option to rapidly analyze a large number of microbial genomes and visually explore their relatedness. The software is available under BSD license at: https://bitbucket.org/berkeleylab/jgi-genomeconstellation/.
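A similarity "based on k-mer matches" between two genomes can be illustrated with the Jaccard index of their k-mer sets. This is only a conceptual reduction: Genome Constellation works at the scale of thousands of genomes and its exact similarity measure is not specified here, so treat `jaccard` and the choice of `k=8` as assumptions.

```python
def kmer_set(seq, k=8):
    """All distinct k-mers of a genome sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=8):
    """Genome-to-genome similarity as the Jaccard index of k-mer
    sets: |A ∩ B| / |A ∪ B|, in [0, 1]."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 0.0
```

Pairwise similarities computed this way can feed directly into clustering or a force-directed layout for visualization, which is the kind of downstream use the abstract describes.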

5 citations


Posted ContentDOI
29 Apr 2019-bioRxiv
TL;DR: This paper extends SpaRC, a previously developed scalable read clustering method on Apache Spark, with a new method that further merges small clusters; it exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem.
Abstract: Motivation Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false-negative (under-clustering) or false-positive (over-clustering) problems. Results Building on SpaRC, a previously developed scalable read clustering method on Apache Spark that has very low false positives, here we extend its capability by adding a new method to further cluster small clusters. This method exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using a synthetic dataset from mouse gut microbiomes we show that this method has the potential to cluster almost all of the reads from genomes with sufficient sequencing coverage. We also explored several clustering parameters that differentially affect genomes with various sequencing coverage. Availability https://bitbucket.org/berkeleylab/jgi-sparc/. Contact zhongwang@lbl.gov
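One plausible form of a "statistic derived from multiple samples" is co-abundance: fragments of the same genome should rise and fall together across samples. The sketch below merges clusters whose per-sample read-count profiles correlate strongly; the Pearson criterion, the 0.95 cutoff, and all names are assumptions for illustration, not the paper's actual method.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def merge_small_clusters(profiles, min_r=0.95):
    """Greedily merge clusters whose per-sample abundance profiles
    are highly correlated, reducing under-clustering."""
    merged = []  # list of groups; group[0] is the seed cluster
    for name in profiles:
        for group in merged:
            if pearson(profiles[name], profiles[group[0]]) >= min_r:
                group.append(name)
                break
        else:
            merged.append([name])
    return merged
```

Clusters "c1" and "c2" below track each other across three samples and merge, while "c3" has an anti-correlated profile and stays separate.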

1 citation


Journal ArticleDOI
Author(s): Yin, Hengfu; Guo, Hao-Bo; Weston, David J; Borland, Anne M; Ranjan, Priya; Abraham, Paul E; Jawdy, Sara S; Wachira, James; Tuskan, Gerald A; Tschaplinski, Timothy J; Wullschleger, Stan D; Guo, Hong; Hettich, Robert L; Gross, Stephen M; Wang, Zhong; Visel, Axel; Yang, Xiaohan
Abstract: Following publication of the original article.