Showing papers by "Michael Snyder" published in 2011


Journal Article
TL;DR: Provides an overview of the ENCODE Project and the resources it is generating, and illustrates the application of ENCODE data to interpreting the human genome.
Abstract: The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

1,446 citations


Journal Article
TL;DR: By genotyping CNVs in the CEPH, Yoruba, and Chinese-Japanese populations, it is estimated that at least 11% of all CNV loci involve complex, multi-allelic events, a considerably higher estimate than reported earlier.
Abstract: Copy number variation (CNV) in the genome is a complex phenomenon, and not completely understood. We have developed a method, CNVnator, for CNV discovery and genotyping from read-depth (RD) analysis of personal genome sequencing. Our method is based on combining the established mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) to broaden the range of discovered CNVs. We calibrated CNVnator using the extensive validation performed by the 1000 Genomes Project. Because of this, we could use CNVnator for CNV discovery and genotyping in a population and characterization of atypical CNVs, such as de novo and multi-allelic events. Overall, for CNVs accessible by RD, CNVnator has high sensitivity (86%-96%), low false-discovery rate (3%-20%), high genotyping accuracy (93%-95%), and high resolution in breakpoint discovery (<200 bp in 90% of cases with high sequencing coverage). Furthermore, CNVnator is complementary in a straightforward way to split-read and read-pair approaches: It misses CNVs created by retrotransposable elements, but more than half of the validated CNVs that it identifies are not detected by split-read or read-pair. By genotyping CNVs in the CEPH, Yoruba, and Chinese-Japanese populations, we estimated that at least 11% of all CNV loci involve complex, multi-allelic events, a considerably higher estimate than reported earlier. Moreover, among these events, we observed cases with allele distribution strongly deviating from Hardy-Weinberg equilibrium, possibly implying selection on certain complex loci. Finally, by combining discovery and genotyping, we identified six potential de novo CNVs in two family trios.
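The GC correction mentioned in the abstract above can be illustrated with a minimal sketch. It assumes read depth has already been counted in fixed-size bins and that each bin carries a GC percentage; the function name `gc_correct` and the rescaling rule (bin depth times the genome-wide mean, divided by the mean depth of bins with the same GC content) are illustrative assumptions, not CNVnator's actual code.

```python
from collections import defaultdict
from statistics import mean


def gc_correct(read_depth, gc_percent):
    """Rescale per-bin read depth so that bins of every GC class share the global mean.

    Hypothetical helper for illustration only; not taken from CNVnator.
    read_depth: list of read counts per genomic bin.
    gc_percent: list of (rounded) GC percentages, one per bin.
    """
    global_mean = mean(read_depth)

    # Group depths by GC class and compute the mean depth of each class.
    by_gc = defaultdict(list)
    for rd, gc in zip(read_depth, gc_percent):
        by_gc[gc].append(rd)
    gc_mean = {gc: mean(values) for gc, values in by_gc.items()}

    corrected = []
    for rd, gc in zip(read_depth, gc_percent):
        if gc_mean[gc] > 0:
            corrected.append(rd * global_mean / gc_mean[gc])
        else:
            # A GC class with zero mean depth carries no signal to rescale.
            corrected.append(float(rd))
    return corrected


if __name__ == "__main__":
    depth = [10, 20, 30, 60]
    gc = [40, 40, 60, 60]
    print(gc_correct(depth, gc))
```

After correction, a systematic depth difference between GC classes is removed while relative differences within a class, which are the CNV signal, are preserved.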

1,376 citations


Journal Article
Ryan E. Mills, Klaudia Walter, Chip Stewart, Robert E. Handsaker, and 371 more authors (21 institutions)
03 Feb 2011-Nature
TL;DR: A map of unbalanced SVs is constructed based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations, and serves as a resource for sequencing-based association studies.
Abstract: Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.

1,085 citations


Journal Article
TL;DR: The results suggest that the Nimblegen platform, which is the only one to use high-density overlapping baits, covers fewer genomic regions than the other platforms but requires the least amount of sequencing to sensitively detect small variants.
Abstract: Whole exome sequencing by high-throughput sequencing of target-enriched genomic DNA (exome-seq) has become common in basic and translational research as a means of interrogating the interpretable part of the human genome at relatively low cost. We present a comparison of three major commercial exome sequencing platforms from Agilent, Illumina and Nimblegen applied to the same human blood sample. Our results suggest that the Nimblegen platform, which is the only one to use high-density overlapping baits, covers fewer genomic regions than the other platforms but requires the least amount of sequencing to sensitively detect small variants. Agilent and Illumina are able to detect a greater total number of variants with additional sequencing. Illumina captures untranslated regions, which are not targeted by the Nimblegen and Agilent platforms. We also compare exome sequencing and whole genome sequencing (WGS) of the same sample, demonstrating that exome sequencing can detect additional small variants missed by WGS.

504 citations


Journal Article
TL;DR: This Review concentrates on the technology behind the third- and fourth-generation sequencing methods: their challenges, current limitations, and tantalizing promise.
Abstract: DNA sequencing is in the throes of an enormous technological shift marked by dramatic throughput increases, a precipitously dropping per-base cost of raw sequence, and an accompanying requirement for substantial investment in large capital equipment in order to utilize the technology. Investigations that were, for most, unreachable luxuries just a few years ago (individual genome sequencing, metagenomics studies, and the sequencing of myriad organisms of interest) are being increasingly enabled, at a rapid pace. This Review concentrates on the technology behind the third- and fourth-generation sequencing methods: their challenges, current limitations, and tantalizing promise.
First-generation sequencing encompasses the chain termination method pioneered by Sanger and Coulson [1] in 1975 or the chemical method of Maxam and Gilbert in 1976–1977 [2]. In 1977, Sanger sequenced the first genome, bacteriophage ΦX 174, which is 5375 bases in length [3]. These methods and their early history [4] have been reviewed in detail previously [5]. Four-color fluorescent Sanger sequencing, where each color corresponds to one of the four DNA bases, is the method used by the automated capillary electrophoresis (CE) systems marketed by Applied Biosystems Inc., now integrated into Life Technologies, and by Beckman Coulter Inc. (Table 1) [6]. The first composite human genome sequence, reported in 2001, was obtained largely using CE, at great cost and with intense human effort over more than a decade [7,8]. While the genome reported in 2001 was a work in progress, the availability of an ever-improving “reference” genome is the basis of an ongoing transformation of biological science and remains fundamental to investigations of genotype–phenotype relationships.
Considering reports that have appeared (and not appeared) in the literature to date, it could well be that medically meaningful (actionable) insights into complex diseases will require additional types of “personal” genomic data, for instance, tissue-specific mRNA expression profiling and mRNA sequencing, individualized analysis of gene regulatory regions, epigenetic profiling, and high-quality, long-range chromosome mapping to catalog significant deletions, insertions, rearrangements, etc. Correlation of such integrated genomic data sets with comprehensive medical histories for hundreds or thousands of individuals may be what it takes to reach an era of personalized medicine [9–11]. Large-scale sequencing centers are now completing the conversion to next-generation sequencers; the Joint Genome Institute (JGI) has retired all of their Sanger sequencing instruments [12]. At the other extreme, until small-scale next-generation sequencers can outperform CE on a cost per accurate base called as well as read length, CE systems will likely remain in heavy use for benchtop-scale, targeted sequencing for directed investigations such as quantitative gene expression, biomarker identification, and pathway analysis.
Table 1. First- and Second-Generation Sequencing Technologies

346 citations



Journal Article
TL;DR: A computational pipeline is developed that identifies allele‐specific events with significant differences in the number of mapped reads between maternal and paternal alleles, and investigates the coordination between ASE and ASB from multiple transcription factors events using a regulatory network framework.
Abstract: To study allele-specific expression (ASE) and binding (ASB), that is, differences between the maternally and paternally derived alleles, we have developed a computational pipeline (AlleleSeq). Our pipeline initially constructs a diploid personal genome sequence (and corresponding personalized gene annotation) using genomic sequence variants (SNPs, indels, and structural variants), and then identifies allele-specific events with significant differences in the number of mapped reads between maternal and paternal alleles. There are many technical challenges in the construction and alignment of reads to a personal diploid genome sequence that we address, for example, bias of reads mapping to the reference allele. We have applied AlleleSeq to variation data for NA12878 from the 1000 Genomes Project as well as matched, deeply sequenced RNA-Seq and ChIP-Seq data sets generated for this purpose. In addition to observing fairly widespread allele-specific behavior within individual functional genomic data sets (including results consistent with X-chromosome inactivation), we can study the interaction between ASE and ASB. Furthermore, we investigate the coordination between ASE and ASB from multiple transcription factors events using a regulatory network framework. Correlation analyses and network motifs show mostly coordinated ASB and ASE.
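To make the idea of "significant differences in the number of mapped reads between maternal and paternal alleles" concrete, here is a minimal sketch of an exact two-sided binomial test on the two read counts at one heterozygous site. The function name `binomial_two_sided_p`, the null hypothesis of an equal 0.5 split, and the choice of test are assumptions made for illustration; the abstract does not specify the statistic, and this is not presented as the AlleleSeq implementation.

```python
from math import comb


def binomial_two_sided_p(maternal: int, paternal: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value for an allelic read-count imbalance.

    Hypothetical helper for illustration. Under the null hypothesis each read
    comes from the maternal allele with probability p (0.5 by assumption).
    """
    n = maternal + paternal
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance

    # Probability of every possible maternal count 0..n under the null.
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[maternal]

    # Sum the outcomes that are no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    p_value = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    print(binomial_two_sided_p(10, 0))
    print(binomial_two_sided_p(5, 5))
```

In a genome-wide scan, p-values from many sites would additionally need multiple-testing correction, and reads would first need to be aligned in a way that avoids the reference-allele mapping bias the abstract mentions.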

330 citations


Journal Article
TL;DR: A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations, and a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data is presented.
Abstract: As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.

322 citations


Journal Article
TL;DR: An in vivo response of the mature RPE to diverse stressors that prolongs RPE cell survival at the expense of epithelial attributes and photoreceptor function is revealed.
Abstract: Retinal pigment epithelial (RPE) cell dysfunction plays a central role in various retinal degenerative diseases, but knowledge is limited regarding the pathways responsible for adult RPE stress responses in vivo. RPE mitochondrial dysfunction has been implicated in the pathogenesis of several forms of retinal degeneration. Here we have shown that postnatal ablation of RPE mitochondrial oxidative phosphorylation in mice triggers gradual epithelium dedifferentiation, typified by reduction of RPE-characteristic proteins and cellular hypertrophy. The electrical response of the retina to light decreased and photoreceptors eventually degenerated. Abnormal RPE cell behavior was associated with increased glycolysis and activation of, and dependence upon, the hepatocyte growth factor/met proto-oncogene pathway. RPE dedifferentiation and hypertrophy arose through stimulation of the AKT/mammalian target of rapamycin (AKT/mTOR) pathway. Administration of an oxidant to wild-type mice also caused RPE dedifferentiation and mTOR activation. Importantly, treatment with the mTOR inhibitor rapamycin blunted key aspects of dedifferentiation and preserved photoreceptor function for both insults. These results reveal an in vivo response of the mature RPE to diverse stressors that prolongs RPE cell survival at the expense of epithelial attributes and photoreceptor function. Our findings provide a rationale for mTOR pathway inhibition as a therapeutic strategy for retinal degenerative diseases involving RPE stress.

270 citations


Journal Article
TL;DR: Examination of the binding targets of three related HOX factors--LIN-39, MAB-5, and EGL-5--indicates that these factors regulate genes involved in cellular migration, neuronal function, and vulval differentiation, consistent with their known roles in these developmental processes.
Abstract: Regulation of gene expression by sequence-specific transcription factors is central to developmental programs and depends on the binding of transcription factors with target sites in the genome. To date, most such analyses in Caenorhabditis elegans have focused on the interactions between a single transcription factor with one or a few select target genes. As part of the modENCODE Consortium, we have used chromatin immunoprecipitation coupled with high-throughput DNA sequencing (ChIP-seq) to determine the genome-wide binding sites of 22 transcription factors (ALR-1, BLMP-1, CEH-14, CEH-30, EGL-27, EGL-5, ELT-3, EOR-1, GEI-11, HLH-1, LIN-11, LIN-13, LIN-15B, LIN-39, MAB-5, MDL-1, MEP-1, PES-1, PHA-4, PQM-1, SKN-1, and UNC-130) at diverse developmental stages. For each factor we determined candidate gene targets, both coding and non-coding. The typical binding sites of almost all factors are within a few hundred nucleotides of the transcript start site. Most factors target a mixture of coding and non-coding target genes, although one factor preferentially binds to non-coding RNA genes. We built a regulatory network among the 22 factors to determine their functional relationships to each other and found that some factors appear to act preferentially as regulators and others as target genes. Examination of the binding targets of three related HOX factors--LIN-39, MAB-5, and EGL-5--indicates that these factors regulate genes involved in cellular migration, neuronal function, and vulval differentiation, consistent with their known roles in these developmental processes. Ultimately, the comprehensive mapping of transcription factor binding sites will identify features of transcriptional networks that regulate C. elegans developmental processes.

244 citations


Journal Article
TL;DR: The results from the ChIP and immunoprecipitation experiments suggest that SWI/SNF facilitates gene regulation and genome function more broadly and through a greater diversity of interactions than previously appreciated.
Abstract: A systems understanding of nuclear organization and events is critical for determining how cells divide, differentiate, and respond to stimuli and for identifying the causes of diseases. Chromatin remodeling complexes such as SWI/SNF have been implicated in a wide variety of cellular processes including gene expression, nuclear organization, centromere function, and chromosomal stability, and mutations in SWI/SNF components have been linked to several types of cancer. To better understand the biological processes in which chromatin remodeling proteins participate, we globally mapped binding regions for several components of the SWI/SNF complex throughout the human genome using ChIP-Seq. SWI/SNF components were found to lie near regulatory elements integral to transcription (e.g. 5′ ends, RNA Polymerases II and III, and enhancers) as well as regions critical for chromosome organization (e.g. CTCF, lamins, and DNA replication origins). Interestingly, we also find that certain configurations of SWI/SNF subunits are associated with transcripts that have higher levels of expression, whereas other configurations of SWI/SNF factors are associated with transcripts that have lower levels of expression. To further elucidate the association of SWI/SNF subunits with each other as well as with other nuclear proteins, we also analyzed SWI/SNF immunoprecipitated complexes by mass spectrometry. Individual SWI/SNF factors are associated with their own family members, as well as with cellular constituents such as nuclear matrix proteins, key transcription factors, and centromere components, implying a ubiquitous role in gene regulation and nuclear function. We find an overrepresentation of both SWI/SNF-associated regions and proteins in cell cycle and chromosome organization. Taken together, the results from our ChIP and immunoprecipitation experiments suggest that SWI/SNF facilitates gene regulation and genome function more broadly and through a greater diversity of interactions than previously appreciated.

Journal Article
TL;DR: A novel synthetic human reference sequence is developed that is ethnically concordant and used for the analysis of genomes from a nuclear family with history of familial thrombophilia, demonstrating that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci.
Abstract: Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (<1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.

Journal Article
TL;DR: The Mapped Read Format (MRF) is developed, a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies.
Abstract: Summary: The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools (RSEQtools) that use this format for the analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by MRF, this format also facilitates the decoupling of the alignment of reads from downstream analyses. Availability and implementation: RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/. Contact: lukas.habegger@yale.edu; mark.gerstein@yale.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article
TL;DR: A formalism based on analogy to simple models of sequence evolution was developed and used to conduct a systematic study of network rewiring on all the currently available biological networks, and it was found that, similar to sequences, biological networks show a decreased rate of change at large time divergences.
Abstract: We have accumulated a large amount of biological network data and expect even more to come. Soon, we anticipate being able to compare many different biological networks as we commonly do for molecular sequences. It has long been believed that many of these networks change, or "rewire", at different rates. It is therefore important to develop a framework to quantify the differences between networks in a unified fashion. We developed such a formalism based on analogy to simple models of sequence evolution, and used it to conduct a systematic study of network rewiring on all the currently available biological networks. We found that, similar to sequences, biological networks show a decreased rate of change at large time divergences, because of saturation in potential substitutions. However, different types of biological networks consistently rewire at different rates. Using comparative genomics and proteomics data, we found a consistent ordering of the rewiring rates: transcription regulatory, phosphorylation regulatory, genetic interaction, miRNA regulatory, protein interaction, and metabolic pathway network, from fast to slow. This ordering was found in all comparisons we did of matched networks between organisms. To gain further intuition on network rewiring, we compared our observed rewirings with those obtained from simulation. We also investigated how readily our formalism could be mapped to other network contexts; in particular, we showed how it could be applied to analyze changes in a range of "commonplace" networks such as family trees, co-authorships and linux-kernel function dependencies.

Journal Article
TL;DR: A network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets is presented, finding that transcription factors downstream in the hierarchy distinguish themselves by being expressed more uniformly across tissues, having more interacting partners, and being more likely to be essential.
Abstract: We present a network framework for analyzing multi-level regulation in higher eukaryotes based on systematic integration of various high-throughput datasets. The network, namely the integrated regulatory network, consists of three major types of regulation: TF→gene, TF→miRNA and miRNA→gene. We identified the target genes and target miRNAs for a set of TFs based on the ChIP-Seq binding profiles, and the predicted targets of miRNAs using annotated 3′UTR sequences and conservation information. Making use of the system-wide RNA-Seq profiles, we classified transcription factors into positive and negative regulators and assigned a sign for each regulatory interaction. Other types of edges such as protein-protein interactions and potential intra-regulations between miRNAs based on the embedding of miRNAs in their host genes were further incorporated. We examined the topological structures of the network, including its hierarchical organization and motif enrichment. We found that transcription factors downstream in the hierarchy distinguish themselves by being expressed more uniformly across tissues, having more interacting partners, and being more likely to be essential. We found an over-representation of notable network motifs, including an FFL in which a miRNA cost-effectively shuts down a transcription factor and its target. We used data of C. elegans from the modENCODE project as a primary model to illustrate our framework, but further verified the results using two other data sets. As more and more genome-wide ChIP-Seq and RNA-Seq data become available in the near future, our methods of data integration have various potential applications.

Journal Article
TL;DR: In this article, a review of differential gene regulation focusing on evolutionary-developmental biology, global comparison of genomic sequences, whole-genome gene expression, and transcription factor (TF) binding profiles is presented.
Abstract: Understanding how individuals differ from one another and from closely related species is a fundamental problem in biology. Recent evidence suggests that much of the variation both within and between species is due to differential gene regulation. Here we review differential gene regulation focusing on evolutionary-developmental (evo-devo) biology, global comparison of genomic sequences, whole-genome gene expression, and transcription factor (TF) binding profiles. We also explore the relationship between divergence rate of regulatory sequences, coding sequences, and TF binding events using several different measures and discuss their implications in the context of evolution of regulatory networks. Finally, we discuss the current status and future challenges in relating regulatory variation to the divergence across and within species.

Journal Article
TL;DR: The findings suggest the dephosphorylation of the formins may be important for their observed localization change during exit from mitosis and indicate that Cdc14 targets proteins involved in wide-ranging mitotic events.

Journal Article
TL;DR: Compared with the existing read-depth and read-pair approaches for SV identification, this method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions.
Abstract: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.
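The calibration step described in the abstract above can be sketched as follows. The sensitivity and positive predictive value (PPV) formulas are standard; the correction rule (observed calls times PPV, divided by sensitivity) and all function names are assumptions made for illustration, since the abstract does not spell out how SRiC combines the calibrations with observed counts.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of simulated (known) events that the caller recovered."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of the caller's calls that correspond to simulated events."""
    return true_pos / (true_pos + false_pos)


def corrected_event_count(observed_calls: float, sens: float, ppv: float) -> float:
    """Estimate the true number of events from the observed number of calls.

    Assumed correction, for illustration only: keep the fraction of calls
    expected to be real (ppv), then scale up for the events missed (1 / sens).
    """
    return observed_calls * ppv / sens


if __name__ == "__main__":
    # Hypothetical calibration on simulated reads for one event class.
    sens = sensitivity(true_pos=80, false_neg=20)
    ppv = positive_predictive_value(true_pos=80, false_pos=20)
    print(sens, ppv, corrected_event_count(1000, sens, ppv))
```

Because sensitivity and PPV differ between event classes (for example long deletions versus short insertions), a correction like this would be applied per class and per length range before summing.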

Journal Article
TL;DR: The incRNA model is used to separate known C. elegans ncRNAs from coding sequences and other genomic elements with a high level of accuracy, find more than 7000 novel ncRNA candidates, and find that they have distinct expression patterns across developmental stages and tend to use novel RNA structural families.
Abstract: We present an integrative machine learning method, incRNA, for whole-genome identification of noncoding RNAs (ncRNAs). It combines a large amount of expression data, RNA secondary-structure stability, and evolutionary conservation at the protein and nucleic-acid level. Using the incRNA model and data from the modENCODE consortium, we are able to separate known C. elegans ncRNAs from coding sequences and other genomic elements with a high level of accuracy (97% AUC on an independent validation set), and find more than 7000 novel ncRNA candidates, among which more than 1000 are located in the intergenic regions of C. elegans genome. Based on the validation set, we estimate that 91% of the approximately 7000 novel ncRNA candidates are true positives. We then analyze 15 novel ncRNA candidates by RT-PCR, detecting the expression for 14. In addition, we characterize the properties of all the novel ncRNA candidates and find that they have distinct expression patterns across developmental stages and tend to use novel RNA structural families. We also find that they are often targeted by specific transcription factors (∼59% of intergenic novel ncRNA candidates). Overall, our study identifies many new potential ncRNAs in C. elegans and provides a method that can be adapted to other organisms.

Journal Article
30 Nov 2011-PLOS ONE
TL;DR: It is found that sensitivity, total number, size range and breakpoint resolution of CNV calls were highest for CNV focused arrays, which is important for cost effective CNV detection and validation for both basic and clinical applications.
Abstract: Accurate and efficient genome-wide detection of copy number variants (CNVs) is essential for understanding human genomic variation, genome-wide CNV association type studies, cytogenetics research and diagnostics, and independent validation of CNVs identified from sequencing based technologies. Numerous array-based platforms for CNV detection exist, utilizing array Comparative Genome Hybridization (aCGH), Single Nucleotide Polymorphism (SNP) genotyping or both. We have quantitatively assessed the abilities of twelve leading genome-wide CNV detection platforms to accurately detect Gold Standard sets of CNVs in the genome of HapMap CEU sample NA12878, and found significant differences in performance. The technologies analyzed were the NimbleGen 4.2 M, 2.1 M and 3×720 K Whole Genome and CNV focused arrays, the Agilent 1×1 M CGH and High Resolution and 2×400 K CNV and SNP+CGH arrays, the Illumina Human Omni1Quad array and the Affymetrix SNP 6.0 array. The Gold Standards used were a 1000 Genomes Project sequencing-based set of 3997 validated CNVs and an ultra high-resolution aCGH-based set of 756 validated CNVs. We found that sensitivity, total number, size range and breakpoint resolution of CNV calls were highest for CNV focused arrays. Our results are important for cost effective CNV detection and validation for both basic and clinical applications.

Journal ArticleDOI
TL;DR: This global investigation of proteins that interact with the majority of yeast protein kinases using protein microarrays indicates that kinases operate in a highly interconnected network that coordinates many activities of the proteome.
Abstract: Protein kinases are key regulators of cellular processes. In spite of considerable effort, a full understanding of the pathways they participate in remains elusive. We globally investigated the proteins that interact with the majority of yeast protein kinases using protein microarrays. Eighty-five kinases were purified and used to probe yeast proteome microarrays. One-thousand-twenty-three interactions were identified, and the vast majority were novel. Coimmunoprecipitation experiments indicate that many of these interactions occurred in vivo. Many novel links of kinases to previously distinct cellular pathways were discovered. For example, the well-studied Kss1 filamentous pathway was found to bind components of diverse cellular pathways, such as those of the stress response pathway and the Ccr4–Not transcriptional/translational regulatory complex; genetic tests revealed that these different components operate in the filamentation pathway in vivo. Overall, our results indicate that kinases operate in a highly interconnected network that coordinates many activities of the proteome. Our results further demonstrate that protein microarrays uncover a diverse set of interactions not observed previously.

Journal ArticleDOI
Hang Yin1, Sarah Sweeney1, Debasish Raha1, Michael Snyder1, Haifan Lin1 
TL;DR: A modified method for chromatin immunoprecipitation and deep sequencing (ChIP–Seq) is used to construct a high-resolution map of key Drosophila melanogaster histone marks, heterochromatin protein 1a (HP1a) and RNA polymerase II (polII), revealing fundamental features of the chromatin modification landscape shared by major adult Drosophila cell types.
Abstract: Epigenetic research has been focused on cell-type-specific regulation; less is known about common features of epigenetic programming shared by diverse cell types within an organism. Here, we report a modified method for chromatin immunoprecipitation and deep sequencing (ChIP–Seq) and its use to construct a high-resolution map of the Drosophila melanogaster key histone marks, heterochromatin protein 1a (HP1a) and RNA polymerase II (polII). These factors are mapped at 50-bp resolution genome-wide and at 5-bp resolution for regulatory sequences of genes, which reveals fundamental features of chromatin modification landscape shared by major adult Drosophila cell types: the enrichment of both heterochromatic and euchromatic marks in transposons and repetitive sequences, the accumulation of HP1a at transcription start sites with stalled polII, the signatures of histone code and polII level/position around the transcriptional start sites that predict both the mRNA level and functionality of genes, and the enrichment of elongating polII within exons at splicing junctions. These features, likely conserved among diverse epigenomes, reveal general strategies for chromatin modifications.

Journal ArticleDOI
TL;DR: It is found that two reticulon proteins modulate the transport of an immune receptor to the cell membrane and the role of receptor secretion in determining the receptor's forward signaling efficacy and the cell’s response is highlighted.
Abstract: Receptors localized at the plasma membrane are critical for the recognition of pathogens. The molecular determinants that regulate receptor transport to the plasma membrane are poorly understood. In a screen for proteins that interact with the FLAGELLIN-SENSITIVE2 (FLS2) receptor using Arabidopsis thaliana protein microarrays, we identified the reticulon-like protein RTNLB1. We showed that FLS2 interacts in vivo with both RTNLB1 and its homolog RTNLB2 and that a Ser-rich region in the N-terminal tail of RTNLB1 is critical for the interaction with FLS2. Transgenic plants that lack RTNLB1 and RTNLB2 (rtnlb1 rtnlb2) or overexpress RTNLB1 (RTNLB1ox) exhibit reduced activation of FLS2-dependent signaling and increased susceptibility to pathogens. In both rtnlb1 rtnlb2 and RTNLB1ox, FLS2 accumulation at the plasma membrane was significantly affected compared with the wild type. Transient overexpression of RTNLB1 led to FLS2 retention in the endoplasmic reticulum (ER) and affected FLS2 glycosylation but not FLS2 stability. Removal of the critical N-terminal Ser-rich region or either of the two Tyr-dependent sorting motifs from RTNLB1 causes partial reversion of the negative effects of excess RTNLB1 on FLS2 transport out of the ER and accumulation at the membrane. The results are consistent with a model whereby RTNLB1 and RTNLB2 regulate the transport of newly synthesized FLS2 to the plasma membrane.

Journal ArticleDOI
TL;DR: The Allele Binding Cooperativity test, which uses variation in transcription factor binding among individuals to discover combinations of factors and their targets, is described in detail, and the ALPHABIT pipeline, which includes statistical analysis of binding sites followed by experimental validation, is developed.
Abstract: Regulation of gene expression at the transcriptional level is achieved by complex interactions of transcription factors operating at their target genes. Dissecting the specific combination of factors that bind each target is a significant challenge. Here, we describe in detail the Allele Binding Cooperativity test, which uses variation in transcription factor binding among individuals to discover combinations of factors and their targets. We developed the ALPHABIT (a large-scale process to hunt for allele binding interacting transcription factors) pipeline, which includes statistical analysis of binding sites followed by experimental validation, and demonstrate that this method predicts transcription factors that associate with NFκB. Our method successfully identifies factors that have been known to work with NFκB (E2A, STAT1, IRF2), but whose global coassociation and sites of cooperative action were not known. In addition, we identify a unique coassociation (EBF1) that had not been reported previously. We present a general approach for discovering combinatorial models of regulation and advance our understanding of the genetic basis of variation in transcription factor binding.

Journal ArticleDOI
TL;DR: It is proposed that PU.1 is a multifaceted factor with overlapping, as well as distinct, functions in several hematopoietic lineages; in mice with low PU.1 expression, the earliest erythroid committed cells are dramatically reduced in vivo.
Abstract: PU.1 is a hematopoietic transcription factor that is required for the development of myeloid and B cells. PU.1 is also expressed in erythroid progenitors, where it blocks erythroid differentiation by binding to and inhibiting the main erythroid promoting factor, GATA-1. However, other mechanisms by which PU.1 affects the fate of erythroid progenitors have not been thoroughly explored. Here, we used ChIP-Seq analysis for PU.1 and gene expression profiling in erythroid cells to show that PU.1 regulates an extensive network of genes that constitute major pathways for controlling growth and survival of immature erythroid cells. By analyzing fetal liver erythroid progenitors from mice with low PU.1 expression, we also show that the earliest erythroid committed cells are dramatically reduced in vivo. Furthermore, we find that PU.1 also regulates many of the same genes and pathways in other blood cells, leading us to propose that PU.1 is a multifaceted factor with overlapping, as well as distinct, functions in several hematopoietic lineages.

Journal ArticleDOI
TL;DR: The results show that genes with functions in development and transcriptional regulation are activated by ASH2 via H3K4 trimethylation in nearby nucleosomes, and the occupancy of phosphorylated forms of RNA Polymerase II and histone marks associated with activation and repression of transcription are characterized.
Abstract: An important mechanism for gene regulation involves chromatin changes via histone modification. One such modification is histone H3 lysine 4 trimethylation (H3K4me3), which requires histone methyltransferase complexes (HMT) containing the trithorax-group (trxG) protein ASH2. Mutations in ash2 cause a variety of pattern formation defects in the Drosophila wing. We have identified genome-wide binding of ASH2 in wing imaginal discs using chromatin immunoprecipitation combined with sequencing (ChIP-Seq). Our results show that genes with functions in development and transcriptional regulation are activated by ASH2 via H3K4 trimethylation in nearby nucleosomes. We have characterized the occupancy of phosphorylated forms of RNA Polymerase II and histone marks associated with activation and repression of transcription. ASH2 occupancy correlates with phosphorylated forms of RNA Polymerase II and histone activating marks in expressed genes. Additionally, RNA Polymerase II phosphorylation on serine 5 and H3K4me3 are reduced in ash2 mutants in comparison to wild-type flies. Finally, we have identified specific motifs associated with ASH2 binding in genes that are differentially expressed in ash2 mutants. Our data suggest that recruitment of the ASH2-containing HMT complexes is context specific and point to a function of ASH2 and H3K4me3 in transcriptional pausing control.

Journal ArticleDOI
TL;DR: Recent advances in querying metabolite–protein interactions are reviewed, and their potential impact on systems biology and drug development is discussed.
Abstract: Small metabolites represent the group of non-polymer compounds that a cell continually synthesizes, acquires, and utilizes from its surroundings. In any living cell, small metabolites outnumber proteins by at least one order of magnitude, creating a plethora of non-inherited molecular interactions. Recent systematic examination disclosed that 20% of proteins bind to at least one hydrophobic metabolite, suggesting that a large fraction of the proteome may bind metabolites. These interactions can modify protein activities and abundance. Regulation of protein function by metabolites may provide benefits for cells to adapt to ever-changing environmental conditions. The construction of a global small metabolite–protein interactome will provide new insights into functional genomes and small-molecule drug development. The inherited and acquired biomolecules have to work in concert to keep life from collapsing out of the state of dynamic equilibrium in accordance with the second law of thermodynamics [1]. The underlying regulatory mechanisms are traditionally thought to be biochemical reactions where protein enzymes control the abundance and flux of metabolites in feedback and feedforward loops. However, given the fact that most proteins are surrounded by an excessive number of metabolites, it is reasonable to postulate that protein functions can also be modulated by interacting with metabolites that are not their substrates or products. In fact, several classic examples exist in which metabolites regulate protein functions in a nonenzymatic manner, including transcription co-factors, hormone signals or neurotransmitters. A recent study has also noted “strongly bound impurities” in commercial protein samples [2], suggesting undisclosed metabolite–protein interactions. However, the extent to which such types of regulators exist in cells has not been systematically examined until recently.
In this paper, we attempt to review recent advances in querying metabolite–protein interactions, and speculate on their potential impact in systems biology and drug development. Throughout this paper we use small metabolites to denote non-polymer natural compounds found in cells, and small molecules to denote both small metabolites and synthetic pharmaceutical chemicals.

Proceedings ArticleDOI
01 Dec 2011
TL;DR: The Interpretome is presented, a system for private genome interpretation, which contains all genotype information in client-side interpretation scripts, supported by server-side databases, and provides state-of-the-art analyses for teaching clinical implications of personal genomics, including disease risk assessment and pharmacogenomics.
Abstract: The decreasing cost of genotyping and genome sequencing has ushered in an era of genomic personalized medicine. More than 100,000 individuals have been genotyped by direct-to-consumer genetic testing services, which offer a glimpse into the interpretation and exploration of a personal genome. However, these interpretations, which require extensive manual curation, are subject to the preferences of the company and are not customizable by the individual. Academic institutions teaching personalized medicine, as well as genetic hobbyists, may prefer to customize their analysis and have full control over the content and method of interpretation. We present the Interpretome, a system for private genome interpretation, which contains all genotype information in client-side interpretation scripts, supported by server-side databases. We provide state-of-the-art analyses for teaching clinical implications of personal genomics, including disease risk assessment and pharmacogenomics. Additionally, we have implemented client-side algorithms for ancestry inference, demonstrating the power of these methods without excessive computation. Finally, the modular nature of the system allows for plugin capabilities for custom analyses. This system will allow for personal genome exploration without compromising privacy, facilitating hands-on courses in genomics and personalized medicine.

Journal ArticleDOI
01 Oct 2011
TL;DR: This introductory unit provides a description of DNA sequencing with a focus on current and “NextGen” (second and third generation) automated technologies and applications.
Abstract: The process of DNA sequencing has made tremendous strides in throughput, improved accuracy, ease of production, and lowered cost. As the practice of DNA sequencing has improved, so has the downstream data analysis with sophisticated databases and bioinformatics tools. Together, these advances have enlarged the number of applications upon which DNA sequencing can be brought to bear. This introductory unit provides a description of DNA sequencing with a focus on current and “NextGen” (second and third generation) automated technologies and applications. Curr. Protoc. Mol. Biol. 96:7.0.1-7.0.18. © 2011 by John Wiley & Sons, Inc. Keywords: NextGen DNA sequencing; high-throughput sequencing; epigenomics; transcriptome; ChIP Seq; CNV; copy number variation

Journal ArticleDOI
TL;DR: Technologies for globally mapping phosphorylation in yeast are reviewed, and how these mapping efforts have shed light on the understanding of kinase signaling pathways and eukaryotic proteomic networks in general is discussed.
Abstract: Protein phosphorylation continues to be regarded as one of the most important post-translational modifications found in eukaryotes and has been implicated in key roles in the development of a number of human diseases. In order to elucidate roles for the 518 human kinases, phosphorylation has routinely been studied using the budding yeast Saccharomyces cerevisiae as a model system. In recent years, a number of technologies have emerged to globally map phosphorylation in yeast. In this article, we review these technologies and discuss how these phosphorylation mapping efforts have shed light on our understanding of kinase signaling pathways and eukaryotic proteomic networks in general.