
Showing papers by "Zhong Wang" published in 2018


Journal ArticleDOI
TL;DR: It is proposed that the accelerated evolution of key proteins, together with a diel re-programming of gene expression, was required for CAM evolution from C3 ancestors in Agave, providing evidence of adaptive evolution of CAM-related pathways.
Abstract: Crassulacean acid metabolism (CAM) enhances plant water-use efficiency through an inverse day/night pattern of stomatal closure/opening that facilitates nocturnal CO2 uptake. CAM has evolved independently in over 35 plant lineages, accounting for ~6% of all higher plants. Agave species are highly heat- and drought-tolerant, and have been domesticated as model CAM crops for beverage, fiber, and biofuel production in semi-arid and arid regions. However, the genomic basis of the evolutionary innovation of CAM in the genus Agave is largely unknown. Using an approach that integrated genomics, gene co-expression networks, comparative genomics, and protein structure analyses, we investigated the molecular evolution of CAM as exemplified in Agave. Comparative genomics analyses among C3, C4, and CAM species revealed that the core metabolic components required for CAM have ancient genomic origins traceable to non-vascular plants, while the regulatory proteins required for diel re-programming of metabolism have a more recent origin shared among C3, C4, and CAM species. We showed that accelerated evolution of key functional domains in proteins responsible for primary metabolism and signaling, together with a diel re-programming of the transcription of genes involved in carbon fixation, carbohydrate processing, redox homeostasis, and circadian control, was required for the evolution of CAM in Agave. Furthermore, we highlighted potential candidates contributing to the adaptation of CAM functional modules. This work provides evidence of adaptive evolution of CAM-related pathways. We showed that the core metabolic components required for CAM are shared by non-vascular plants, but regulatory proteins involved in re-programming of carbon fixation and metabolite transportation appeared more recently. We propose that the accelerated evolution of key proteins, together with a diel re-programming of gene expression, was required for CAM evolution from C3 ancestors in Agave.
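
One building block of the analysis described above is a gene co-expression network computed from diel expression profiles. The following Python sketch illustrates the general idea with made-up expression values and hypothetical gene labels; it is not code or data from the Agave study.

# Minimal sketch of building a gene co-expression network from diel
# expression data. Gene names and values below are placeholders, not
# data from the Agave study.
import numpy as np

# rows = genes, columns = time points sampled over a day/night cycle
genes = ["PEPC", "PPCK", "MDH", "CLOCK_like"]          # hypothetical labels
expr = np.array([
    [1.0, 3.2, 8.5, 9.1, 4.0, 1.2],
    [0.9, 2.8, 8.0, 9.4, 3.5, 1.0],
    [5.0, 5.1, 4.9, 5.2, 5.0, 4.8],
    [8.7, 6.0, 2.1, 1.5, 5.5, 8.9],
])

# Pearson correlation between every pair of genes
corr = np.corrcoef(expr)

# Keep edges above an arbitrary correlation threshold to form the network
threshold = 0.8
edges = [(genes[i], genes[j], round(corr[i, j], 2))
         for i in range(len(genes)) for j in range(i + 1, len(genes))
         if abs(corr[i, j]) >= threshold]
print(edges)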

41 citations


Posted ContentDOI
11 Jan 2018-bioRxiv
TL;DR: An Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), partitions reads based on their molecule of origin to enable downstream assembly optimization; the results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments.
Abstract: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and the number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems. The software is available under the Apache 2.0 license at https://bitbucket.org/LizhenShi/sparc.
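
The clustering idea described above, grouping reads that likely originate from the same molecule, can be illustrated with a toy PySpark job that buckets reads by shared k-mers. This is only a minimal sketch of the general approach; it is not SpaRC's actual algorithm, and the k-mer size and reads are arbitrary placeholders.

# Toy PySpark sketch: reads that share a k-mer end up in the same group.
# Illustration only; not SpaRC's implementation or API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-clustering-sketch").getOrCreate()
sc = spark.sparkContext

K = 5
reads = [("r1", "ACGTACGTAC"), ("r2", "CGTACGTACG"), ("r3", "TTTTGGGGCC")]

def kmers(read):
    rid, seq = read
    return [(seq[i:i + K], rid) for i in range(len(seq) - K + 1)]

# (k-mer, read_id) pairs -> groups of read ids that share a k-mer
shared = (sc.parallelize(reads)
            .flatMap(kmers)
            .groupByKey()
            .mapValues(lambda rids: sorted(set(rids)))
            .filter(lambda kv: len(kv[1]) > 1))

for kmer, rids in shared.collect():
    print(kmer, rids)

spark.stop()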

3 citations


Journal ArticleDOI
TL;DR: The results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
Abstract: Metagenomics, the study of all microbial species cohabiting in an environment, often produces large amounts of sequence data, varying from several gigabytes (GB) to a few terabytes (TB). Analyzing metagenomics data involves both data-intensive and compute-intensive steps, making the entire process hard to scale. Here we aim to optimize a metagenomics application that partitions shotgun metagenomic sequences based on their species of origin. Our solution combines the MapReduce-based BioPig analytics toolkit with MPI to provide scalability with respect to both data and compute. We also made improvements to the existing BioPig toolkit by using simplified data types and compressed k-mer storage. These optimizations lead to up to a 193× speedup for the compute-intensive step and a 9.6× speedup over the entire pipeline. Our optimized application is also capable of processing datasets that are 16 times larger on the same hardware platform. These results suggest that integrating heterogeneous technologies such as Hadoop and MPI is an efficient way to solve large genomics problems that are both data-intensive and compute-intensive.
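
One of the optimizations mentioned above is compressed k-mer storage. A common way to achieve this is 2-bit packing of DNA bases; the Python sketch below is an illustrative re-implementation of that idea, not code from the BioPig/MPI pipeline.

# Minimal sketch of 2-bit k-mer packing (compressed k-mer storage).
# Illustrative only; not code from the BioPig/MPI pipeline.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"

def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

def unpack_kmer(value: int, k: int) -> str:
    """Recover the k-mer string from its 2-bit packed integer."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))

kmer = "ACGTTGCA"
packed = pack_kmer(kmer)
assert unpack_kmer(packed, len(kmer)) == kmer
print(kmer, "->", packed)  # 8 bases stored in a single small integer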

3 citations


Posted ContentDOI
03 Oct 2018-bioRxiv
TL;DR: This work developed a novel Convolutional Neural Network (CNN)-based method, called MiniScrub, for de novo identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments, which robustly improves read quality.
Abstract: Long-read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes or high-fidelity short reads, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in the resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN)-based method, called MiniScrub, for de novo identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments with MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that can be scrubbed based on a customized quality cutoff. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces far fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub
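
To make the CNN step concrete, the following Keras sketch shows a small binary classifier over fixed-size image windows, standing in for the alignment-derived images described above. The input shape, layer sizes, and training data are assumptions for illustration; this is not MiniScrub's published architecture.

# Minimal Keras sketch of a CNN that labels image windows as "keep" vs
# "scrub". Shapes, layers, and data are placeholders, not MiniScrub's model.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW_H, WINDOW_W, CHANNELS = 48, 64, 3   # assumed pileup-image window size

model = keras.Sequential([
    keras.Input(shape=(WINDOW_H, WINDOW_W, CHANNELS)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability the segment is low quality
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train on random placeholder data just to show the calling convention
x = np.random.rand(32, WINDOW_H, WINDOW_W, CHANNELS).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, batch_size=8, verbose=0)

# Segments whose predicted probability exceeds a chosen cutoff would be scrubbed
print(model.predict(x[:4], verbose=0).ravel())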

2 citations