scispace - formally typeset
Search or ask a question
Author

Zhong Wang

Other affiliations: Joint Genome Institute, Yale University, Duke University  ...read more
Bio: Zhong Wang is an academic researcher from Lawrence Berkeley National Laboratory. The author has contributed to research in topics: Genome & RNA-Seq. The author has an hindex of 29, co-authored 61 publications receiving 21060 citations. Previous affiliations of Zhong Wang include Joint Genome Institute & Yale University.
Topics: Genome, RNA-Seq, Transcriptome, Metagenomics, Gene


Papers
More filters
Journal ArticleDOI
TL;DR: This review attempts to give an in-depth review of these statistical methods, with the goal of providing a comprehensive guide when choosing appropriate metrics for RNA-Seq statistical analyses.
Abstract: RNA sequencing (RNA-Seq) is rapidly replacing microarrays for profiling gene expression with much improved accuracy and sensitivity. One of the most common questions in a typical gene profiling experiment is how to identify a set of transcripts that are differentially expressed between different experimental conditions. Some of the statistical methods developed for microarray data analysis can be applied to RNA-Seq data with or without modifications. Recently several additional methods have been developed specifically for RNA-Seq data sets. This review attempts to give an in-depth review of these statistical methods, with the goal of providing a comprehensive guide when choosing appropriate metrics for RNA-Seq statistical analyses.

52 citations

Journal ArticleDOI
TL;DR: In this article , a Bayesian cell proportion reconstruction inferred using statistical marginalization (BayesPrism) was developed to predict cellular composition and gene expression in individual cell types from bulk RNA-seq, using patient-derived, scRNA-seq as prior information.
Abstract: Abstract Inferring single-cell compositions and their contributions to global gene expression changes from bulk RNA sequencing (RNA-seq) datasets is a major challenge in oncology. Here we develop Bayesian cell proportion reconstruction inferred using statistical marginalization (BayesPrism), a Bayesian method to predict cellular composition and gene expression in individual cell types from bulk RNA-seq, using patient-derived, scRNA-seq as prior information. We conduct integrative analyses in primary glioblastoma, head and neck squamous cell carcinoma and skin cutaneous melanoma to correlate cell type composition with clinical outcomes across tumor types, and explore spatial heterogeneity in malignant and nonmalignant cell states. We refine current cancer subtypes using gene expression annotation after exclusion of confounding nonmalignant cells. Finally, we identify genes whose expression in malignant cells correlates with macrophage infiltration, T cells, fibroblasts and endothelial cells across multiple tumor types. Our work introduces a new lens to accurately infer cellular composition and expression in large cohorts of bulk RNA-seq data.

50 citations

Journal ArticleDOI
TL;DR: It is proposed that the accelerated evolution of key proteins together with a diel re-programming of gene expression were required for CAM evolution from C3 ancestors in Agave, providing evidence of adaptive evolution of CAM related pathways.
Abstract: Crassulacean acid metabolism (CAM) enhances plant water-use efficiency through an inverse day/night pattern of stomatal closure/opening that facilitates nocturnal CO2 uptake. CAM has evolved independently in over 35 plant lineages, accounting for ~ 6% of all higher plants. Agave species are highly heat- and drought-tolerant, and have been domesticated as model CAM crops for beverage, fiber, and biofuel production in semi-arid and arid regions. However, the genomic basis of evolutionary innovation of CAM in genus Agave is largely unknown. Using an approach that integrated genomics, gene co-expression networks, comparative genomics and protein structure analyses, we investigated the molecular evolution of CAM as exemplified in Agave. Comparative genomics analyses among C3, C4 and CAM species revealed that core metabolic components required for CAM have ancient genomic origins traceable to non-vascular plants while regulatory proteins required for diel re-programming of metabolism have a more recent origin shared among C3, C4 and CAM species. We showed that accelerated evolution of key functional domains in proteins responsible for primary metabolism and signaling, together with a diel re-programming of the transcription of genes involved in carbon fixation, carbohydrate processing, redox homeostasis, and circadian control is required for the evolution of CAM in Agave. Furthermore, we highlighted the potential candidates contributing to the adaptation of CAM functional modules. This work provides evidence of adaptive evolution of CAM related pathways. We showed that the core metabolic components required for CAM are shared by non-vascular plants, but regulatory proteins involved in re-reprogramming of carbon fixation and metabolite transportation appeared more recently. We propose that the accelerated evolution of key proteins together with a diel re-programming of gene expression were required for CAM evolution from C3 ancestors in Agave.

41 citations

Journal ArticleDOI
TL;DR: In this article , the authors examined the relationship between histone modifications and transcription using experimental perturbations combined with sensitive machine-learning tools and found that transcription is required for active histone modification.
Abstract: The role of histone modifications in transcription remains incompletely understood. Here, we examine the relationship between histone modifications and transcription using experimental perturbations combined with sensitive machine-learning tools. Transcription predicted the variation in active histone marks and complex chromatin states, like bivalent promoters, down to single-nucleosome resolution and at an accuracy that rivaled the correspondence between independent ChIP-seq experiments. Blocking transcription rapidly removed two punctate marks, H3K4me3 and H3K27ac, from chromatin indicating that transcription is required for active histone modifications. Transcription was also required for maintenance of H3K27me3, consistent with a role for RNA in recruiting PRC2. A subset of DNase-I-hypersensitive sites were refractory to prediction, precluding models where transcription initiates pervasively at any open chromatin. Our results, in combination with past literature, support a model in which active histone modifications serve a supportive, rather than an essential regulatory, role in transcription. A machine-learning tool can predict the distribution of histone post-translational modifications using nascent transcription data. Inhibiting transcription impacts H3K4me3, H3K27ac and H3K27me3 dynamics.

37 citations

Journal ArticleDOI
TL;DR: It is shown that avidin, streptavidin, or polyclonal anti-biotin (but not a monoclonal anti- biotin) antibody is capable of specifically capturing in vivo biotinylated beta-galactosidase and c-Jun and that this capture is dependent upon the presence of both avidin and the BCCP moiety.
Abstract: We describe an in vivo approach for the isolation of proteins interacting with a protein of interest. The protein of interest is "tagged" with a portion of the biotin carboxylase carrier protein (BCCP), encoded on a specially constructed plasmid, so that it becomes biotinylated in vivo. The "query" proteins (e.g., those in a cDNA library) are tagged by fusing them to the 3' end of the lacZ gene on a lambda vector in such a way that the beta-galactosidase activity is not disrupted. These phage are transfected into cells containing the plasmid encoding the BCCP-tagged protein. The infection lyses the cells and exposes the protein complexes. The BCCP-tagged protein and any associated protein(s) are "captured" by using avidin, streptavidin, or anti-biotin antibody-coated filters. The detection of bound protein is accomplished by directly assaying for beta-galactosidase activity on the filters. Positive plaques can be plaque-purified for DNA sequencing. We have tested this approach by using c-Fos and c-Jun as our model system. We show that avidin, streptavidin, or polyclonal anti-biotin (but not a monoclonal anti-biotin) antibody is capable of specifically capturing in vivo biotinylated beta-galactosidase and c-Jun and that this capture is dependent upon the presence of both avidin and the BCCP moiety. Further, complexes containing c-Jun and c-Fos can also be isolated in this manner, and the isolation of this complex is dependent on the presence of c-Fos, c-Jun, avidin, and the BCCP moiety. We discuss the possible uses and limitations of this technique for isolating proteins that interact with a known protein.

33 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches and can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

20,335 citations

28 Jul 2005
TL;DR: PfPMP1)与感染红细胞、树突状组胞以及胎盘的单个或多个受体作用,在黏附及免疫逃避中起关键的作�ly.
Abstract: 抗原变异可使得多种致病微生物易于逃避宿主免疫应答。表达在感染红细胞表面的恶性疟原虫红细胞表面蛋白1(PfPMP1)与感染红细胞、内皮细胞、树突状细胞以及胎盘的单个或多个受体作用,在黏附及免疫逃避中起关键的作用。每个单倍体基因组var基因家族编码约60种成员,通过启动转录不同的var基因变异体为抗原变异提供了分子基础。

18,940 citations

Journal ArticleDOI
TL;DR: It is shown that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads, and estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired- end reads, depending on the number of possible splice forms for each gene.
Abstract: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.

14,524 citations

Journal ArticleDOI
TL;DR: A method based on the negative binomial distribution, with variance and mean linked by local regression, is proposed and an implementation, DESeq, as an R/Bioconductor package is presented.
Abstract: High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

13,356 citations

Journal ArticleDOI
TL;DR: The results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.
Abstract: High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.

13,337 citations