Home
/
Authors
/
Zhong Wang

Author

Zhong Wang

Other affiliations: Joint Genome Institute, Yale University, Duke University ...read more

Bio: Zhong Wang is an academic researcher from Lawrence Berkeley National Laboratory. The author has contributed to research in topics: Genome & RNA-Seq. The author has an hindex of 29, co-authored 61 publications receiving 21060 citations. Previous affiliations of Zhong Wang include Joint Genome Institute & Yale University.

Topics: Genome, RNA-Seq, Transcriptome, Metagenomics, Gene ...read more

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2001
1993

Papers

PDF

Open Access

More filters

Journal Article•DOI•

RNA polymerase II dynamics shape enhancer-promoter interactions.

[...]

Gilad Barshad, James J. Lewis, Alexandra G. Chivu, Abderhman Abuhashem, Nils Krietenstein, Edward J. Rice, Yitian Ma, Zhong Wang, Oliver J. Rando, Anna-Katerina Hadjantonakis, Charles G. Danko - Show less +7 more

10 Jul 2023-Nature Genetics

2 citations

Proceedings Article•DOI•

A Scalable Pipeline for Transcriptome Profiling Tasks with On-Demand Computing Clouds

[...]

Shayan Shams¹, Nayong Kim¹, Xiandong Meng, Ming Tai Ha², Shantenu Jha², Zhong Wang, Joohyun Kim¹ - Show less +3 more•Institutions (2)

Louisiana State University¹, Rutgers University²

23 May 2016

TL;DR: A pilot-based approach with which scalable data analytics essential for a large RNA-seq data set are efficiently carried out, targeting cloud environments with on-demand computing and maximizing merits of Infrastructure as a Service (IaaS) clouds is introduced.

...read moreread less

Abstract: We introduce a pilot-based approach with which scalable data analytics essential for a large RNA-seq data set are efficiently carried out. Major development mechanisms, designed in order to achieve the required scalability, in particular, targeting cloud environments with on-demand computing, are presented. With an example of Amazon EC2, by harnessing distributed and parallel computing implementations, our pipeline is able to allocate optimally computing resources to tasks of a target workflow in an efficient manner. Consequently, decreasing time-to-completion (TTC) or cost, avoiding failures due to a limited resource of a single node, and enabling scalable data analysis with multiple options can be achieved. Our developed pipeline benefits from the underlying pilot system, Radical Pilot, being readily amenable to scalable solutions over distributed heterogeneous computing resources and suitable for advanced workflows of dynamically adaptive executions. In order to provide insights on such features, benchmark experiments, using two real data sets, were carried out. The benchmark experiments focus on the most computationally expensive transcript assembly step. Evaluation and comparison of transcript assembly accuracy using a single de novo assembler or the combination of multiple assemblers are also presented, underscoring its potential as a platform to support multi-assembler multi-parameter methods or ensemble methods which are statistically attractive and easily feasible with our scalable pipeline. The developed pipeline, as manifested by results presented in this work, is built upon effective strategies that address major challenging issues and viable solutions toward an integrative and scalable method for large-scale RNA-seq data analysis, particularly maximizing merits of Infrastructure as a Service (IaaS) clouds

...read moreread less

2 citations

Posted Content•DOI•

MiniScrub: de novo long read scrubbing using approximate alignment and deep learning

[...]

Nathan LaPierre¹, Nathan LaPierre², Rob Egan², Wei Wang¹, Zhong Wang³, Zhong Wang², Zhong Wang⁴ - Show less +3 more•Institutions (4)

University of California, Los Angeles¹, Joint Genome Institute², Lawrence Berkeley National Laboratory³, University of California, Merced⁴

03 Oct 2018-bioRxiv

TL;DR: This work developed a novel Convolutional Neu-ral Network (CNN) based method, called MiniScrub, for de novo identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments, which robustly improves read quality.

...read moreread less

Abstract: Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes, high-fidelity short reads, or reference genomes, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN) based method, called MiniScrub, for de novo identification and subsequent "scrubbing" (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments by MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that could be scrubbed based on a customized quality cutoff. Applying MiniScrub to real world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub

...read moreread less

2 citations

Proceedings Article•DOI•

Performance evaluation and tuning of BioPig for genomic analysis

[...]

Lizhen Shi¹, Zhong Wang², Weikuan Yu¹, Xiandong Meng²•Institutions (2)

Florida State University¹, Joint Genome Institute²

15 Nov 2015

TL;DR: The overall job execution time of k-mer counting on BioPig was reduced by 50% using an optimized set of parameters using Hadoop parameters from five different perspectives based on a baseline configuration.

...read moreread less

Abstract: In this study, we aim to optimize Hadoop parameters to improve the performance of BioPig on Amazon Web Service (AWS). BioPig is a toolkit for large-scale sequencing data analysis and is built on Hadoop and Pig that enables easy parallel programming and scaling to datasets of terabyte sizes. AWS is the most popular cloud-computing platform offered by Amazon. When running BioPig jobs on AWS, the default configuration parameters may lead to high computational costs. We select the k-mer counting as it is used in a large number of next generation sequence (NGS) data analysis tools. We tuned Hadoop parameters from five different perspectives based on a baseline configuration. We found tuning different Hadoop parameters led to various performance improvements. The overall job execution time of k-mer counting on BioPig was reduced by 50% using an optimized set of parameters. This paper documents our tuning experiments as a valuable reference for future Hadoop-based analytics applications on genomics datasets.

...read moreread less

1 citations

Posted Content•DOI•

Deconvolute individual genomes from metagenome sequences through read clustering

[...]

Kexue Li¹, Lili Wang¹, Lizhen Shi², Li Deng¹, Li Deng³, Zhong Wang⁴, Zhong Wang⁵, Zhong Wang³ - Show less +4 more•Institutions (5)

Shanghai University¹, Florida State University², Joint Genome Institute³, Lawrence Berkeley National Laboratory⁴, University of California, Merced⁵

29 Apr 2019-bioRxiv

TL;DR: This paper extends a previously developed scalable read clustering method on Apache Spark, SpaRC, by adding a new method to further cluster small clusters that exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem.

...read moreread less

Abstract: Motivation Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer either false negative (under-clustering) or false positive (over-clustering) problems. Results Based on a previously developed scalable read clustering method on Apache Spark, SpaRC, that has very low false positives, here we extended its capability by adding a new method to further cluster small clusters. This method exploits statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using a synthetic dataset from mouse gut microbiomes we show that this method has the potential to cluster almost all of the reads from genomes with sufficient sequencing coverage. We also explored several clustering parameters that deferentially affect genomes with various sequencing coverage. Availability https://bitbucket.org/berkeleylab/jgi-sparc/. Contact zhongwang@lbl.gov

...read moreread less

1 citations

1
2
3
4
5
6
7
…
8
9
10
11
12
13
14
…

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

[...]

Ben Langmead¹, Cole Trapnell¹, Mihai Pop¹, Steven L. Salzberg¹•Institutions (1)

University of Maryland, College Park¹

04 Mar 2009-Genome Biology

TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches and can be used simultaneously to achieve even greater alignment speeds.

...read moreread less

Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

...read moreread less

20,335 citations

疟原虫var基因转换速率变化导致抗原变异[英]／Paul H, Robert P, Christodoulou Z, et al//Proc Natl Acad Sci U S A

[...]

宁北芳, 朱淮民

28 Jul 2005

TL;DR: PfPMP1）与感染红细胞、树突状组胞以及胎盘的单个或多个受体作用，在黏附及免疫逃避中起关键的作�ly.

...read moreread less

Abstract: 抗原变异可使得多种致病微生物易于逃避宿主免疫应答。表达在感染红细胞表面的恶性疟原虫红细胞表面蛋白1（PfPMP1）与感染红细胞、内皮细胞、树突状细胞以及胎盘的单个或多个受体作用，在黏附及免疫逃避中起关键的作用。每个单倍体基因组var基因家族编码约60种成员，通过启动转录不同的var基因变异体为抗原变异提供了分子基础。

...read moreread less

18,940 citations

Journal Article•DOI•

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

[...]

Bo Li¹, Colin N. Dewey¹•Institutions (1)

University of Wisconsin-Madison¹

04 Aug 2011-BMC Bioinformatics

TL;DR: It is shown that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads, and estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired- end reads, depending on the number of possible splice forms for each gene.

...read moreread less

Abstract: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.

...read moreread less

14,524 citations

Journal Article•DOI•

Differential expression analysis for sequence count data.

[...]

Simon Anders, Wolfgang Huber

27 Oct 2010-Genome Biology

TL;DR: A method based on the negative binomial distribution, with variance and mean linked by local regression, is proposed and an implementation, DESeq, as an R/Bioconductor package is presented.

...read moreread less

Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

...read moreread less

13,356 citations

Journal Article•DOI•

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation

[...]

Cole Trapnell¹, Cole Trapnell², Brian A. Williams³, Geo Pertea², Ali Mortazavi³, Gordon Kwan³, Marijke J. van Baren⁴, Steven L. Salzberg², Barbara J. Wold³, Lior Pachter¹ - Show less +6 more•Institutions (4)

University of California, Berkeley¹, University of Maryland, College Park², California Institute of Technology³, Washington University in St. Louis⁴

01 May 2010-Nature Biotechnology

TL;DR: The results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.

...read moreread less

Abstract: High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.

...read moreread less

13,337 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse