Showing papers by "Zhengdong D. Zhang" published in 2008


Journal Article
Zhengdong D. Zhang, Joel Rozowsky, Michael Snyder, Jin Chang, Mark Gerstein
TL;DR: The results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution.
Abstract: ChIP sequencing (ChIP-seq) is a new method for genomewide mapping of protein binding sites on DNA. It has generated much excitement in functional genomics. To score data and determine adequate sequencing depth, both the genomic background and the binding sites must be properly modeled. To develop a computational foundation to tackle these issues, we first performed a study to characterize the observed statistical nature of this new type of high-throughput data. By linking sequence tags into clusters, we show that there are two components to the distribution of tag counts observed in a number of recent experiments: an initial power-law distribution and a subsequent long right tail. Then we develop in silico ChIP-seq, a computational method to simulate the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence. In contrast to current assumptions, our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution. On the basis of these results, we extend an existing scoring approach by using a more realistic genomic-background model. This enables us to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion.

105 citations
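
The contrast drawn in the abstract above, between a uniform and a markedly nonuniform (gamma) genomic background, can be illustrated with a minimal sketch. This is not the authors' in silico ChIP-seq method: the function name `simulate_background`, the bin and tag counts, and the gamma shape of 0.5 are all assumptions chosen for illustration. Each bin's relative tag rate is drawn from a gamma distribution, tags are then placed in proportion to those rates, and the resulting per-bin counts are compared against a uniform placement.

```python
import random
from collections import Counter
from statistics import pvariance


def simulate_background(n_bins, n_tags, shape=None, scale=1.0, seed=0):
    """Place n_tags background tags into n_bins genomic bins (hypothetical helper).

    shape=None gives a uniform background; otherwise each bin's relative
    tag rate is drawn from a gamma(shape, scale) distribution.
    """
    rng = random.Random(seed)
    if shape is None:
        weights = [1.0] * n_bins
    else:
        # Assumed parameters; the abstract does not give fitted values.
        weights = [rng.gammavariate(shape, scale) for _ in range(n_bins)]
    # Each tag lands in a bin with probability proportional to its weight.
    hits = Counter(rng.choices(range(n_bins), weights=weights, k=n_tags))
    return [hits.get(i, 0) for i in range(n_bins)]


if __name__ == "__main__":
    uniform = simulate_background(1000, 20000)
    gamma = simulate_background(1000, 20000, shape=0.5)
    print("uniform variance:", pvariance(uniform))
    print("gamma variance:  ", pvariance(gamma))
```

Under these assumed settings the gamma background should give a much wider spread of per-bin tag counts than the uniform one, which is the kind of overdispersion a more realistic background model would need to capture.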


Journal Article
TL;DR: RACE sequencing is an efficient, sensitive, and highly accurate method for characterization of the transcriptome of specific cell/tissue types, and it appears that much of the genome is represented in polyA+ RNA.
Abstract: Background: Recent studies of the mammalian transcriptome have revealed a large number of additional transcribed regions and extraordinary complexity in transcript diversity. However, there is still much uncertainty regarding precisely what portion of the genome is transcribed, the exact structures of these novel transcripts, and the levels of the transcripts produced. Results: We have interrogated the transcribed loci in 420 selected ENCyclopedia Of DNA Elements (ENCODE) regions using rapid amplification of cDNA ends (RACE) sequencing. We analyzed annotated known gene regions, but primarily we focused on novel transcriptionally active regions (TARs), which were previously identified by high-density oligonucleotide tiling arrays and on random regions that were not believed to be transcribed. We found RACE sequencing to be very sensitive and were able to detect low levels of transcripts in specific cell types that were not detectable by microarrays. We also observed many instances of sense-antisense transcripts; further analysis suggests that many of the antisense transcripts (but not all) may be artifacts generated from the reverse transcription reaction. Our results show that the majority of the novel TARs analyzed (60%) are connected to other novel TARs or known exons. Of previously unannotated random regions, 17% were shown to produce overlapping transcripts. Furthermore, it is estimated that 9% of the novel transcripts encode proteins. Conclusion: We conclude that RACE sequencing is an efficient, sensitive, and highly accurate method for characterization of the transcriptome of specific cell/tissue types. Using this method, it appears that much of the genome is represented in polyA+ RNA. Moreover, a fraction of the novel RNAs can encode protein and are likely to be functional.

59 citations


Journal Article
TL;DR: The selective pressure acting on CD4 is shown to be highly variable between regions of the protein, and codon sites under strong positive selection are identified. This positive selection may reflect forces driven by SIV infection and may provide a link between changes in the sequence and structure of CD4 during evolution and its interaction with the immunodeficiency virus.
Abstract: CD4, an integral membrane glycoprotein, plays a critical role in the immune response and in the life cycle of simian and human immunodeficiency virus (SIV and HIV). Pairwise comparisons of orthologous human and mouse genes show that CD4 is evolving much faster than the majority of mammalian genes. The acceleration is too great to be attributed to a simple relaxation of the action of purifying selection alone. Here we show that the selective pressure acting on CD4 is highly variable between regions in the protein and identify codon sites under strong positive selection. We reconstruct the coding sequences for ancestral primate CD4s and model tertiary structures of all ancestral and extant sequences. Structural mapping of positively selected sites shows they distribute on the surface of the D1 domain of CD4, where the exogenous SIV gp120 protein binds. Moreover, structural models of the ancestral sequences show substantially larger variation in the interfacial electrostatic charge on CD4 and in the surface complementarity between CD4 and gp120 in CD4 lineages from primates with natural SIV infections than in those without. Thus, positive selection on CD4 among primates may reflect forces driven by SIV infection and could provide a link between changes in sequence and structure of CD4 during evolution and the interaction with the immunodeficiency virus.

35 citations