
Showing papers by "Zhong Wang" published in 2018


Journal ArticleDOI
TL;DR: It is proposed that the accelerated evolution of key proteins, together with a diel re-programming of gene expression, was required for CAM evolution from C3 ancestors in Agave, providing evidence of adaptive evolution of CAM-related pathways.
Abstract: Crassulacean acid metabolism (CAM) enhances plant water-use efficiency through an inverse day/night pattern of stomatal closure/opening that facilitates nocturnal CO2 uptake. CAM has evolved independently in over 35 plant lineages, accounting for ~6% of all higher plants. Agave species are highly heat- and drought-tolerant, and have been domesticated as model CAM crops for beverage, fiber, and biofuel production in semi-arid and arid regions. However, the genomic basis of the evolutionary innovation of CAM in the genus Agave is largely unknown. Using an approach that integrated genomics, gene co-expression networks, comparative genomics, and protein structure analyses, we investigated the molecular evolution of CAM as exemplified in Agave. Comparative genomics analyses among C3, C4, and CAM species revealed that the core metabolic components required for CAM have ancient genomic origins traceable to non-vascular plants, while the regulatory proteins required for diel re-programming of metabolism have a more recent origin shared among C3, C4, and CAM species. We showed that accelerated evolution of key functional domains in proteins responsible for primary metabolism and signaling, together with a diel re-programming of the transcription of genes involved in carbon fixation, carbohydrate processing, redox homeostasis, and circadian control, was required for the evolution of CAM in Agave. Furthermore, we highlighted potential candidates contributing to the adaptation of CAM functional modules. This work provides evidence of adaptive evolution of CAM-related pathways. We showed that the core metabolic components required for CAM are shared by non-vascular plants, but regulatory proteins involved in re-programming of carbon fixation and metabolite transportation appeared more recently. We propose that the accelerated evolution of key proteins, together with a diel re-programming of gene expression, was required for CAM evolution from C3 ancestors in Agave.
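
One building block of the analysis described above is a gene co-expression network computed from diel expression profiles. The following Python sketch illustrates the general idea with made-up expression values and hypothetical gene labels; it is not code or data from the Agave study.

# Minimal sketch of building a gene co-expression network from diel
# expression data. Gene names and values below are placeholders, not
# data from the Agave study.
import numpy as np

# rows = genes, columns = time points sampled over a day/night cycle
genes = ["PEPC", "PPCK", "MDH", "CLOCK_like"]          # hypothetical labels
expr = np.array([
    [1.0, 3.2, 8.5, 9.1, 4.0, 1.2],
    [0.9, 2.8, 8.0, 9.4, 3.5, 1.0],
    [5.0, 5.1, 4.9, 5.2, 5.0, 4.8],
    [8.7, 6.0, 2.1, 1.5, 5.5, 8.9],
])

# Pearson correlation between every pair of genes
corr = np.corrcoef(expr)

# Keep edges above an arbitrary correlation threshold to form the network
threshold = 0.8
edges = [(genes[i], genes[j], round(corr[i, j], 2))
         for i in range(len(genes)) for j in range(i + 1, len(genes))
         if abs(corr[i, j]) >= threshold]
print(edges)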

41 citations


Posted ContentDOI
11 Jan 2018-bioRxiv
TL;DR: An Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), partitions reads based on their molecule of origin to enable downstream assembly optimization; the results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments.
Abstract: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems. The software is available under the Apache 2.0 license at https://bitbucket.org/LizhenShi/sparc.
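
The clustering idea described above, grouping reads that likely originate from the same molecule, can be illustrated with a toy PySpark job that buckets reads by shared k-mers. This is only a minimal sketch of the general approach; it is not SpaRC's actual algorithm, and the k-mer size and reads are arbitrary placeholders.

# Toy PySpark sketch: reads that share a k-mer end up in the same group.
# Illustration only; not SpaRC's implementation or API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-clustering-sketch").getOrCreate()
sc = spark.sparkContext

K = 5
reads = [("r1", "ACGTACGTAC"), ("r2", "CGTACGTACG"), ("r3", "TTTTGGGGCC")]

def kmers(read):
    rid, seq = read
    return [(seq[i:i + K], rid) for i in range(len(seq) - K + 1)]

# (k-mer, read_id) pairs -> groups of read ids that share a k-mer
shared = (sc.parallelize(reads)
            .flatMap(kmers)
            .groupByKey()
            .mapValues(lambda rids: sorted(set(rids)))
            .filter(lambda kv: len(kv[1]) > 1))

for kmer, rids in shared.collect():
    print(kmer, rids)

spark.stop()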

3 citations


Journal ArticleDOI
TL;DR: The results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
Abstract: Metagenomics, the study of all microbial species cohabiting in an environment, often produces large amounts of sequence data, varying from several gigabytes (GB) to a few terabytes (TB). Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomic sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytics toolkit with MPI to provide scalability with respect to both data and compute. We also made improvements to the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
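
One of the optimizations mentioned above is compressed k-mer storage. A common way to achieve this is 2-bit packing of DNA bases; the Python sketch below is an illustrative re-implementation of that idea, not code from the BioPig/MPI pipeline.

# Minimal sketch of 2-bit k-mer packing (compressed k-mer storage).
# Illustrative only; not code from the BioPig/MPI pipeline.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"

def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its 2-bit packed integer."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))

kmer = "ACGTTGCA"
packed = pack_kmer(kmer)
assert unpack_kmer(packed, len(kmer)) == kmer
print(kmer, "->", packed)  # 8 bases stored in a single small integer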

3 citations


Posted ContentDOI
03 Oct 2018-bioRxiv
TL;DR: This work developed a novel Convolutional Neural Network (CNN)-based method, called MiniScrub, for de novo identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments, which robustly improves read quality.
Abstract: Long-read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes or high-fidelity short reads, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in the resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN)-based method, called MiniScrub, for de novo identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments with MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that can be scrubbed based on a customized quality cutoff. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces far fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub
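
To make the CNN step concrete, the following Keras sketch shows a small binary classifier over fixed-size image windows, standing in for the alignment-derived images described above. The input shape, layer sizes, and training data are assumptions for illustration; this is not MiniScrub's published architecture.

# Minimal Keras sketch of a CNN that labels image windows as "keep" vs
# "scrub". Shapes, layers, and data are placeholders, not MiniScrub's model.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW_H, WINDOW_W, CHANNELS = 48, 64, 3   # assumed pileup-image window size

model = keras.Sequential([
    keras.Input(shape=(WINDOW_H, WINDOW_W, CHANNELS)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability the segment is low quality
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train on random placeholder data just to show the calling convention
x = np.random.rand(32, WINDOW_H, WINDOW_W, CHANNELS).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, batch_size=8, verbose=0)

# Segments whose predicted probability exceeds a chosen cutoff would be scrubbed
print(model.predict(x[:4], verbose=0).ravel())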

2 citations