Showing papers by "Mark Gerstein" published in 2012


Journal ArticleDOI
Sarah Djebali, Carrie A. Davis, Angelika Merkel, Alexander Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, Felix Schlesinger, Chenghai Xue, Georgi K. Marinov, Jainab Khatun, Brian A. Williams, Chris Zaleski, Joel Rozowsky, Marion S. Röder, Felix Kokocinski, Rehab F. Abdelhamid, Tyler Alioto, Igor Antoshechkin, Michael T. Baer, Nadav Bar, Philippe Batut, Kimberly Bell, Ian Bell, Sudipto K. Chakrabortty, Xian Chen, Jacqueline Chrast, Joao Curado, Thomas Derrien, Jorg Drenkow, Erica Dumais, Jacqueline Dumais, Radha Duttagupta, Emilie Falconnet, Meagan Fastuca, Kata Fejes-Toth, Pedro G. Ferreira, Sylvain Foissac, Melissa J. Fullwood, Hui Gao, David Gonzalez, Assaf Gordon, Harsha P. Gunawardena, Cédric Howald, Sonali Jha, Rory Johnson, Philipp Kapranov, Brandon King, Colin Kingswood, Oscar Junhong Luo, Eddie Park, Kimberly Persaud, Jonathan B. Preall, Paolo Ribeca, Brian A. Risk, Daniel Robyr, Michael Sammeth, Lorian Schaffer, Lei-Hoon See, Atif Shahab, Jørgen Skancke, Ana Maria Suzuki, Hazuki Takahashi, Hagen Tilgner, Diane Trout, Nathalie Walters, Huaien Wang, John A. Wrobel, Yanbao Yu, Xiaoan Ruan, Yoshihide Hayashizaki, Jennifer Harrow, Mark Gerstein, Tim Hubbard, Alexandre Reymond, Stylianos E. Antonarakis, Gregory J. Hannon, Morgan C. Giddings, Yijun Ruan, Barbara J. Wold, Piero Carninci, Roderic Guigó, Thomas R. Gingeras
06 Sep 2012-Nature
TL;DR: Evidence that three-quarters of the human genome is capable of being transcribed is reported, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs that prompt a redefinition of the concept of a gene.
Abstract: Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three-quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations, taken together, prompt a redefinition of the concept of a gene.

4,450 citations


Journal ArticleDOI
TL;DR: This work has examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites; over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.
Abstract: The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

4,281 citations


01 Sep 2012
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.

2,767 citations


Journal ArticleDOI
TL;DR: This work discusses how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data and develops a set of working standards and guidelines for ChIP experiments that are updated routinely.
Abstract: Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.

1,801 citations


Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The combinatorial, co-association of transcription factors is found to be highly context specific: distinct combinations of factors bind at specific genomic locations.
Abstract: Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

1,449 citations


Journal ArticleDOI
17 Feb 2012-Science
TL;DR: Functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies are described.
Abstract: Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

1,186 citations



Journal ArticleDOI
16 Mar 2012-Cell
TL;DR: This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity and reveals extensive heteroallelic changes during healthy and disease states and an unexpected RNA editing mechanism.

1,142 citations


Journal ArticleDOI
20 Dec 2012-Nature
TL;DR: A whole-genome and transcriptome analysis of 20 human iPSC lines derived from the primary skin fibroblasts finds that approximately 30% of the fibroblast cells have somatic CNVs in their genomes, suggesting widespread somatic mosaicism in the human body.
Abstract: A whole-genome and transcriptome analysis of 20 human induced pluripotent stem-cell lines shows that reprogramming does not necessarily add de novo copy number variants to what is already present in the somatic cells from which they originated. The ability to derive induced pluripotent stem cells (iPSCs) from somatic cells raises exciting possibilities for the study of human development and regenerative medicine. These applications require that the clonal cells maintain the genetic background of the individual from whom they are derived, so reports of chromosomal copy number variations (CNVs) in reprogrammed cells carry serious implications for their translational utility. Flora Vaccarino and colleagues now report a whole-genome and transcriptome analysis of 20 human iPSC lines from seven individuals. They found that reprogramming does not necessarily add de novo CNVs to those already present in the somatic genome. Interestingly, they also found a mosaic CNV pattern within individuals, confirming previous findings from cultured human fibroblasts. This work shows that iPSCs can be used as a discovery tool for the investigation of genomic mosaicism due to low-frequency CNVs in human tissues. Reprogramming somatic cells into induced pluripotent stem cells (iPSCs) has been suspected of causing de novo copy number variation. To explore this issue, here we perform a whole-genome and transcriptome analysis of 20 human iPSC lines derived from the primary skin fibroblasts of seven individuals using next-generation sequencing. We find that, on average, an iPSC line manifests two copy number variants (CNVs) not apparent in the fibroblasts from which the iPSC was derived.
Using PCR and digital droplet PCR, we show that at least 50% of those CNVs are present as low-frequency somatic genomic variants in parental fibroblasts (that is, the fibroblasts from which each corresponding human iPSC line is derived), and are manifested in iPSC lines owing to their clonal origin. Hence, reprogramming does not necessarily lead to de novo CNVs in iPSCs, because most of the line-manifested CNVs reflect somatic mosaicism in the human skin. Moreover, our findings demonstrate that clonal expansion, and iPSC lines in particular, can be used as a discovery tool to reliably detect low-frequency CNVs in the tissue of origin. Overall, we estimate that approximately 30% of the fibroblast cells have somatic CNVs in their genomes, suggesting widespread somatic mosaicism in the human body. Our study paves the way to understanding the fundamental question of the extent to which cells of the human body normally acquire structural alterations in their DNA post-zygotically.

353 citations


Journal ArticleDOI
TL;DR: This work sequenced the genome of an individual with both Illumina and Complete Genomics to a high average coverage, and compared their performance with respect to sequence coverage and calling of single-nucleotide variants, insertions and deletions.
Abstract: Whole-genome sequencing is becoming commonplace, but the accuracy and completeness of variant calling by the most widely used platforms from Illumina and Complete Genomics have not been reported. Here we sequenced the genome of an individual with both technologies to a high average coverage of ∼76×, and compared their performance with respect to sequence coverage and calling of single-nucleotide variants (SNVs), insertions and deletions (indels). Although 88.1% of the ∼3.7 million unique SNVs were concordant between platforms, there were tens of thousands of platform-specific calls located in genes and other genomic regions. In contrast, 26.5% of indels were concordant between platforms. Target enrichment validated 92.7% of the concordant SNVs, whereas validation by genotyping array revealed a sensitivity of 99.3%. The validation experiments also suggested that >60% of the platform-specific variants were indeed present in the genome. Our results have important implications for understanding the accuracy and completeness of the genome sequencing platforms.

319 citations


Journal ArticleDOI
TL;DR: This work presents the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines, and determines the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene.
Abstract: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.

Journal ArticleDOI
TL;DR: This study builds a novel quantitative model and finds that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy, and that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq.
Abstract: Background: Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. Results: We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. Conclusions: Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.

Journal ArticleDOI
TL;DR: Three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity, and the machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.
Abstract: Background: Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.

Journal ArticleDOI
TL;DR: A notable difference is revealed in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols, which implies that these features regulate transcription in a highly coordinated manner.
Abstract: Statistical models have been used to quantify the relationship between gene expression and transcription factor (TF) binding signals. Here we apply the models to the large-scale data generated by the ENCODE project to study transcriptional regulation by TFs. Our results reveal a notable difference in the prediction accuracy of expression levels of transcription start sites (TSSs) captured by different technologies and RNA extraction protocols. In general, the expression levels of TSSs with high CpG content are more predictable than those with low CpG content. For genes with alternative TSSs, the expression levels of downstream TSSs are more predictable than those of the upstream ones. Different TF categories and specific TFs vary substantially in their contributions to predicting expression. Between two cell lines, the differential expression of TSS can be precisely reflected by the difference of TF-binding signals in a quantitative manner, arguing against the conventional on-and-off model of TF binding. Finally, we explore the relationships between TF-binding signals and other chromatin features such as histone modifications and DNase hypersensitivity for determining expression. The models imply that these features regulate transcription in a highly coordinated manner.

Journal ArticleDOI
TL;DR: The results establish histone modification profiling as a tool for developmental enhancer discovery, and suggest that enhancers maintain an open chromatin state in multiple embryonic tissues independent of their activity level.
Abstract: The regulatory elements that direct tissue-specific gene expression in the developing mammalian embryo remain largely unknown. Although chromatin profiling has proven to be a powerful method for mapping regulatory sequences in cultured cells, chromatin states characteristic of active developmental enhancers have not been directly identified in embryonic tissues. Here we use whole-transcriptome analysis coupled with genome-wide profiling of H3K27ac and H3K27me3 to map chromatin states and enhancers in mouse embryonic forelimb and hindlimb. We show that gene-expression differences between forelimb and hindlimb, and between limb and other embryonic cell types, are correlated with tissue-specific H3K27ac signatures at promoters and distal sites. Using H3K27ac profiles, we identified 28,377 putative enhancers, many of which are likely to be limb specific based on strong enrichment near genes highly expressed in the limb and comparisons with tissue-specific EP300 sites and known enhancers. We describe a chromatin state signature associated with active developmental enhancers, defined by high levels of H3K27ac marking, nucleosome displacement, hypersensitivity to sonication, and strong depletion of H3K27me3. We also find that some developmental enhancers exhibit components of this signature, including hypersensitivity, H3K27ac enrichment, and H3K27me3 depletion, at lower levels in tissues in which they are not active. Our results establish histone modification profiling as a tool for developmental enhancer discovery, and suggest that enhancers maintain an open chromatin state in multiple embryonic tissues independent of their activity level.

Journal ArticleDOI
TL;DR: This research presents a meta-modelling architecture that automates the labor-intensive, time-consuming and expensive process of designing and implementing nanofiltration systems.
Abstract: Nature Biotechnology, volume 30, number 3, March 2012. Liege, Belgium. 31 The Babraham Institute, Cambridge, UK. 32 Genomatix Software GmbH, Munich, Germany. 33 Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland. 34 Christian-Albrechts-Universitaet Zu Kiel, Kiel, Germany. 35 Cellzome AG, Heidelberg, Germany. 36 Institut National de la Sante et de la Recherche Medicale, Marseille, France. 37 Weizmann Institute of Science, Rehovot, Israel. 38 Barcelona Supercomputing Center, Barcelona, Spain. 39 Centro Nacional de Investigaciones Oncologicas, Madrid, Spain. 40 University Medical Centre Groningen, Groningen, The Netherlands. 41 University of Saarland, Saarbruecken, Germany. 42 Oxford Nanopore Technologies Ltd., Oxford, UK. E-mail: h.stunnenberg@ncmls.ru.nl

Journal ArticleDOI
TL;DR: In this article, the authors constructed statistical models to relate TF binding and histone modification (HM) to gene expression levels in mouse embryonic stem cells and found that TF binding achieved the highest predictive power in a small DNA region centered at the transcription start sites of genes, while HMs exhibited high predictive powers across a wide region around genes.
Abstract: Transcription factor (TF) binding and histone modification (HM) are important for the precise control of gene expression. Hence, we constructed statistical models to relate these to gene expression levels in mouse embryonic stem cells. While both TF binding and HMs are highly ‘predictive’ of gene expression levels (in a statistical, but perhaps not strictly mechanistic, sense), we find they show distinct differences in the spatial patterning of their predictive strength: TF binding achieved the highest predictive power in a small DNA region centered at the transcription start sites of genes, while the HMs exhibited high predictive powers across a wide region around genes. Intriguingly, our results suggest that TF binding and HMs are redundant in strict statistical sense for predicting gene expression. We also show that our TF and HM models are cell line specific; specifically, TF binding and HM are more predictive of gene expression in the same cell line, and the differential gene expression between cell lines is predictable by differential HMs. Finally, we found that the models trained solely on protein-coding genes are predictive of expression levels of microRNAs, suggesting that their regulation by TFs and HMs may share a similar mechanism to that for protein-coding genes.

Journal ArticleDOI
TL;DR: Over the next few years and in collaboration with the global human genetics community, the CMGs hope to facilitate the identification of the genes underlying a very large fraction of all Mendelian disorders.
Abstract: In science and medicine there are occasional major advances that facilitate transformations of a field. The application of next-generation massively parallel sequencing technologies coupled with powerful computational approaches to discover genes for Mendelian disorders is arguably such a major advance [Biesecker, 2010]. Just three years ago, the strategy of exome sequencing (ES) followed by discrete filtering was introduced and shown to be a potential approach to identify the genes underlying Mendelian conditions [Choi et al., 2009; Ng et al., 2010; Ng et al., 2009]. Since then, ES and whole genome sequencing (WGS) [Lupski et al., 2010] have been used to explain the cause of dozens of disorders [Bamshad et al., 2011; Claudia Gonzaga-Jauregui, 2012; Gilissen et al., 2011] including those transmitted as X-linked, autosomal recessive [Bilguvar et al., 2010], and autosomal dominant traits [Choi et al., 2011]; as well as phenotypes caused by de novo dominant mutations [Choate et al., 2010; O'Roak et al., 2011; Vissers et al., 2010] and somatic mosaicism [Lindhurst et al., 2011]. Given the technical and analytical improvements expected over the next several years, the application of ES/WGS-based strategies will enable the identification of the genes underlying a very large fraction of all known Mendelian disorders for which the genetic basis is not yet known, at a small fraction of the current cost for discovery per disorder. Based on these advances, exploring all Mendelian disorders should become an imperative for the worldwide human genetics community. The discoveries made through such exploration would be of enormous service to families, and will provide novel entry points to investigate the mechanisms underlying disease development. Such an effort would, however, be very ambitious, requiring an unprecedented degree of cooperation and coordination in the field of medical genetics and the assistance of patients and families from around the world.
A global initiative to explore all Mendelian conditions is now emerging. The initiative, which includes the International Rare Diseases Research Consortium, the Finding of Rare Disease Genes (FORGE) in Canada, and centers in Europe, East Asia and elsewhere, will establish the necessary collaborative framework and physical infrastructure to achieve this goal. In the United States, the National Human Genome Research Institute (NHGRI) and the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health (NIH) have partnered to support this effort at three Centers for Mendelian Genomics (CMGs): the Center for Mendelian Genomics at the University of Washington; the Center for Mendelian Disorders at Yale University; and the Baylor-Johns Hopkins Center for Mendelian Genomics at Baylor College of Medicine and the Johns Hopkins University. The CMGs have four major goals: (1) to ascertain samples for all Mendelian disorders for which the genetic basis is not yet understood from clinicians and investigators around the world by developing a ‘public list’ of samples and by coordinating submissions to the NIH program with those of other international programs; (2) to improve the efficiency of the sequencing pipeline and quality of exome and genome data through ongoing technology innovation; (3) to determine the genetic basis for as many Mendelian conditions as possible; and (4) to disseminate methods and data to facilitate gene discovery by investigators working independently across the globe. The CMGs will study disorders with age of onset across the entire lifespan including well-delineated, known Mendelian phenotypes as well as novel phenotypes thought to be Mendelian on the basis of their segregation patterns in families. Initially, the “public list” will be a catalog of collected DNA samples, organized by condition, that have entered the sequencing pipeline at any one of the CMGs. 
Next, this list will expand to include all of the Mendelian disorders for which the CMGs have solicited DNA samples, sequencing status, and the results obtained to date. Since the availability of a sufficient number of samples from clinically well-characterized cases and families will be critical to the discovery of genes for all Mendelian disorders, one aim of the public list is to facilitate coordination of sample collection and implementation of ES among clinicians and researchers around the world. Ultimately, the CMGs aim to develop the list into a comprehensive community resource that provides information on samples that are available worldwide for concerted disease gene discovery efforts. With this introduction, we would like to engage the medical genetics community to join with us and collaborate with the CMGs by submitting information about familial conditions, adding samples to the effort from individuals or families with rare Mendelian disorders to stimulate collaborations and eventually new insights. Since many of these conditions are rare, and often with locus heterogeneity, multiple investigators will need to contribute samples from individuals and families with the same diagnosis to improve chances for finding and validating candidate genes and variants. Following sequencing and analysis, the CMGs will return the results to the submitters, and will collaborate with investigators to facilitate further analysis, functional studies, and, ultimately publication. Results will be provided to the collaborating investigators as soon as possible. Collaborating investigators will have data exclusivity for a minimum of six months to ensure ample time to conduct follow up studies and prepare manuscripts for publication. The development of a public list will ensure transparency and facilitate communications about progress within and beyond the medical genetics community; this list will soon be accessible via the CMG website (http://mendelian.org). 
While the workflow will vary among CMGs, several key common practices across the CMGs will enable high-quality data production and analysis with partnering contributors. Phenotypic information associated with each sample and family will be collected and evaluated to increase diagnostic precision, identify phenotypic features that clarify genetic heterogeneity and aid in the identification of previously unrecognized disorders. DNA samples will be genotyped using low-cost, genome-wide marker arrays to provide a unique profile for sample tracking, to identify copy number variations (CNVs) and to provide genetic information for subsequent analysis of sequence variations (large and small insertion-deletions) underlying these conditions. These approaches, when applied to well-characterized pedigrees, will aid in finding the genomic intervals shared among all (or nearly all) cases, and reduce the genomic search space and speed the subsequent identification of candidate gene(s) by the collaborating investigators [Sobreira et al., 2010]. The success of this effort will require collaboration at an unprecedented scale in the field of human genetics. The CMGs and their global partners welcome partnering scientists and clinicians with samples or families affected with a Mendelian condition to collaborate with us by submitting inquiries to gmendel@mendelian.org. The CMGs have partnered with Wiley-Liss and the American Journal of Medical Genetics (AJMG) to advertise in each issue to its worldwide readership of clinical and medical geneticists. The corresponding author of each manuscript accepted by the Journal will be provided with information about the CMGs. This is a welcome and important partnership since the AJMG is a well-known forum for the delineation of new syndromes and for reporting the description of novel rare, Mendelian conditions.
Similarly, the Online Mendelian Inheritance in Man (OMIM; www.OMIM.org) catalog will provide a means for quickly disseminating summaries of newly discovered genes for known Mendelian disorders and adding newly delineated disorders to the OMIM catalog. We welcome the possibilities for other scientists, international journals and websites to link to the CMG site (http://mendelian.org). We ask clinicians and scientists to collaborate with us by submitting cases and families. The key drivers for the CMGs are to serve the scientific community and the individuals and families with rare diseases by improving knowledge about these rare conditions. By making specific diagnoses and exploring on a genome level the relationships between sequence variation and phenotype, collaboratively we can achieve more comprehensive pre-symptomatic or carrier screening. Additionally, the knowledge obtained from these efforts could initiate the exploration of new and/or improved therapeutics for these conditions. The CMG mechanism will eliminate significant financial and technical barriers for clinicians who wish to gain deeper understanding of genetic disorders and will catalyze interactions across the worldwide biomedical community to utilize these genotype/phenotype correlations to drive a deeper understanding of the biology of disease. The application of powerful new genomic approaches in genetics will provide an unprecedented view into the molecular basis of many, if not most, unexplained Mendelian phenotypes. The CMG will provide the community with access to production infrastructure, bioinformatics support, and analytical expertise. The challenge to the human genetics community is to help initiate this new phase of medical genomics and take advantage of this new opportunity by providing the resources for such studies that obviously cannot occur without finding and characterizing the patients and their families. 
Please contact us at gmendel@mendelian.org or through the web portals of the individual centers with questions or with samples to submit.

Journal ArticleDOI
TL;DR: Fundamental cell-intrinsic properties of the switch between self-renewal and differentiation, and valuable insights for manipulating HSCs and other differentiating systems are demonstrated.
Abstract: A critical problem in biology is understanding how cells choose between self-renewal and differentiation. To generate a comprehensive view of the mechanisms controlling early hematopoietic precursor self-renewal and differentiation, we used systems-based approaches and murine EML multipotential hematopoietic precursor cells as a primary model. EML cells give rise to a mixture of self-renewing Lin−SCA+CD34+ cells and partially differentiated non-renewing Lin−SCA−CD34− cells in a cell autonomous fashion. We identified and validated the HMG box protein TCF7 as a regulator in this self-renewal/differentiation switch that operates in the absence of autocrine Wnt signaling. We found that Tcf7 is the most down-regulated transcription factor when CD34+ cells switch into CD34− cells, using RNA–Seq. We subsequently identified the target genes bound by TCF7, using ChIP–Seq. We show that TCF7 and RUNX1 (AML1) bind to each other's promoter regions and that TCF7 is necessary for the production of the short isoforms, but not the long isoforms of RUNX1, suggesting that TCF7 and the short isoforms of RUNX1 function coordinately in regulation. Tcf7 knock-down experiments and Gene Set Enrichment Analyses suggest that TCF7 plays a dual role in promoting the expression of genes characteristic of self-renewing CD34+ cells while repressing genes activated in the partially differentiated CD34− state. Finally, a network of up-regulated transcription factors of CD34+ cells was constructed. Factors that control hematopoietic stem cell (HSC) establishment and development, cell growth, and multipotency were identified. These studies in EML cells demonstrate fundamental cell-intrinsic properties of the switch between self-renewal and differentiation, and yield valuable insights for manipulating HSCs and other differentiating systems.

Journal ArticleDOI
TL;DR: This is one of the highest quality fungal genomes and, to the authors' knowledge, the only thoroughly annotated and transcriptionally profiled fungal endophyte genome currently available, and it provides the genomic foundation for the study of a model endophyte system.
Abstract: The microbial conversion of solid cellulosic biomass to liquid biofuels may provide a renewable energy source for transportation fuels. Endophytes represent a promising group of organisms, as they are a mostly untapped reservoir of metabolic diversity. They are often able to degrade cellulose, and they can produce an extraordinary diversity of metabolites. The filamentous fungal endophyte Ascocoryne sarcoides was shown to produce potential-biofuel metabolites when grown on a cellulose-based medium; however, the genetic pathways needed for this production are unknown and the lack of genetic tools makes traditional reverse genetics difficult. We present the genomic characterization of A. sarcoides and use transcriptomic and metabolomic data to describe the genes involved in cellulose degradation and to provide hypotheses for the biofuel production pathways. In total, almost 80 biosynthetic clusters were identified, including several previously found only in plants. Additionally, many transcriptionally active regions outside of genes showed condition-specific expression, offering more evidence for the role of long non-coding RNA in gene regulation. This is one of the highest quality fungal genomes and, to our knowledge, the only thoroughly annotated and transcriptionally profiled fungal endophyte genome currently available. The analyses and datasets contribute to the study of cellulose degradation and biofuel production and provide the genomic foundation for the study of a model endophyte system.

Journal ArticleDOI
TL;DR: The Variant Annotation Tool (VAT) is developed to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals.
Abstract: Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment. Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org. Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu Supplementary Information: Supplementary data are available at Bioinformatics online.
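The transcript-level complication mentioned in this abstract can be illustrated with a minimal sketch. This is not VAT's implementation (VAT is written in C and PHP); the function name, the isoform names and the coordinates below are hypothetical, and the classification is deliberately reduced to three labels. It only shows why a plain coordinate intersection is not enough: the same variant position can be exonic in one isoform and intronic in another.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive (start, end); an assumed convention


def annotate_variant(pos: int, transcripts: Dict[str, List[Exon]]) -> Dict[str, str]:
    """Classify one variant position against each transcript separately."""
    result = {}
    for name, exons in transcripts.items():
        tx_start = min(start for start, _ in exons)
        tx_end = max(end for _, end in exons)
        if pos < tx_start or pos > tx_end:
            result[name] = "outside"    # not within the transcript span
        elif any(start <= pos <= end for start, end in exons):
            result[name] = "exonic"     # falls inside one of the exons
        else:
            result[name] = "intronic"   # inside the span, between exons
    return result


# Hypothetical gene with two isoforms; isoform_B skips the middle exon.
transcripts = {
    "isoform_A": [(100, 200), (300, 400), (500, 600)],
    "isoform_B": [(100, 200), (500, 600)],
}
annotations = annotate_variant(350, transcripts)
print(annotations)
```

A real annotator would go further than this sketch, for example by distinguishing coding from untranslated exonic sequence and by handling insertions, deletions and structural variants, which the abstract names as additional sources of difficulty.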


Journal ArticleDOI
TL;DR: The main purpose of this workshop was to articulate ways in which the biomedical research community can capitalize on recent technology advances and synergize with ongoing efforts to advance the field of human proteomics.
Abstract: A National Institutes of Health (NIH) workshop was convened in Bethesda, MD on September 26–27, 2011, with representative scientific leaders in the field of proteomics and its applications to clinical settings. The main purpose of this workshop was to articulate ways in which the biomedical research community can capitalize on recent technology advances and synergize with ongoing efforts to advance the field of human proteomics. This executive summary and the following full report describe the main discussions and outcomes of the workshop.

Journal ArticleDOI
TL;DR: It is postulated that CNDs of these conserved sequences fine-tune developmental pathways by altering the levels of RNA.
Abstract: Gene expression differences are shaped by selective pressures and contribute to phenotypic differences between species. We identified 964 copy number differences (CNDs) of conserved sequences across three primate species and examined their potential effects on gene expression profiles. Samples with copy number different genes had significantly different expression than samples with neutral copy number. Genes encoding regulatory molecules differed in copy number and were associated with significant expression differences. Additionally, we identified 127 CNDs that were processed pseudogenes and some of which were expressed. Furthermore, there were copy number-different regulatory regions such as ultraconserved elements and long intergenic noncoding RNAs with the potential to affect expression. We postulate that CNDs of these conserved sequences fine-tune developmental pathways by altering the levels of RNA.

Journal ArticleDOI
TL;DR: This paper investigates the connection between genotype and athletic phenotype in the context of these four genes in various sport fields and across different ethnicities and genders, and does an extensive literature survey on these genes and the polymorphisms found to be associated with athletic performance.
Abstract: Our genes influence our athletic ability. However, the causal genetic factors and mechanisms, and the extent of their effects, remain largely elusive. Many studies investigate this association between specific genes and athletic performance. Such studies have increased in number over the past few years, as recent developments and patents in DNA sequencing have made large amounts of sequencing data available for such analysis. In this paper, we consider four of the most intensively studied genes in relation to athletic ability: angiotensin I-converting enzyme, alpha-actinin 3, peroxisome proliferator-activated receptor alpha and nitric oxide synthase 3. We investigate the connection between genotype and athletic phenotype in the context of these four genes in various sport fields and across different ethnicities and genders. We conduct an extensive literature survey on these genes and the polymorphisms (single nucleotide polymorphisms or indels) found to be associated with athletic performance. We also present, for each of these polymorphisms, the allele frequencies in the different ethnicities reported in the pilot phase of the 1000 Genomes Project - arguably the largest human genome-sequencing endeavor to date. We discuss the considerable success, and significant drawbacks, of past research along these lines, and propose interesting directions for future research.

Journal ArticleDOI
TL;DR: 3D protein structures constitute a valuable conceptual and predictive framework by providing rational and compelling classification schemes for network elements, as well as revealing interesting intrinsic differences between distinct node types, such as disorder and evolutionary features, which may then be rationalized in light of their respective functions within networks.

Journal ArticleDOI
Jiang Du1, Jing Leng1, Lukas Habegger1, Andrea Sboner1, Drew McDermott1, Mark Gerstein1 
06 Jan 2012-PLOS ONE
TL;DR: A statistical solution for isoform quantification in next-generation Sequencing is developed, based on analyzing a set of RNA-Seq reads, and a practical implementation is presented, available from archive.gersteinlab.org/proj/rnaseq/IQSeq.
Abstract: With the recent advances in high-throughput RNA sequencing (RNA-Seq), biologists are able to measure transcription with unprecedented precision. One problem that can now be tackled is that of isoform quantification: here one tries to reconstruct the abundances of isoforms of a gene. We have developed a statistical solution for this problem, based on analyzing a set of RNA-Seq reads, and a practical implementation, available from archive.gersteinlab.org/proj/rnaseq/IQSeq, in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing). Here, we present theoretical results which IQSeq is based on, and then use both simulated and real datasets to illustrate various applications of the tool. In order to measure the accuracy of an isoform-quantification result, one would try to estimate the average variance of the estimated isoform abundances for each gene (based on resampling the RNA-seq reads), and IQSeq has a particularly fast algorithm (based on the Fisher Information Matrix) for calculating this, achieving a speedup of times compared to brute-force resampling. IQSeq also calculates an information theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. IQSeq has many features that are particularly useful in RNA-Seq experimental design, allowing one to optimally model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated from different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows one to have a modular, “plugin-able” read-generation function to support the particularities of the many evolving sequencing technologies.
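The isoform quantification problem described in this abstract can be made concrete with a small sketch. It is a generic expectation-maximization estimate and not IQSeq's algorithm: it assumes that all isoforms have equal effective length, that each read is reduced to the set of isoforms it is compatible with, and that every read is compatible with at least one isoform. The function name and the toy read counts are hypothetical.

```python
from typing import List, Set


def em_isoform_fractions(read_compat: List[Set[int]], n_isoforms: int,
                         n_iter: int = 200) -> List[float]:
    """Estimate isoform fractions from read-to-isoform compatibility sets."""
    theta = [1.0 / n_isoforms] * n_isoforms  # start from uniform fractions
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms in
        # proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[i] for i in compat)
            for i in compat:
                counts[i] += theta[i] / total
        # M-step: new fractions are the normalized expected read counts.
        theta = [c / len(read_compat) for c in counts]
    return theta


# Toy data (hypothetical): 6 reads unique to isoform 0, 2 unique to
# isoform 1, and 4 reads compatible with both.
reads = [{0}] * 6 + [{1}] * 2 + [{0, 1}] * 4
theta = em_isoform_fractions(reads, n_isoforms=2)
print([round(t, 3) for t in theta])
```

In this toy case the ambiguous reads carry no information about the ratio, so the estimate settles at the ratio of the uniquely assigned reads (6 to 2). Estimating the variance of such abundances, which the abstract says IQSeq does through the Fisher Information Matrix, is outside the scope of this sketch.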

Journal ArticleDOI
TL;DR: In the version of this article initially published, the accession code to obtain raw sequence data was given as SRA045736.2; the correct code is SRA045736.
Abstract: Nat Biotechnol 30, 78–82 (2012); published online 18 December 2011; corrected after print 7 June 2012. In the version of this article initially published, the accession code to obtain raw sequence data was given as SRA045736.2; the correct code is SRA045736. The error has been corrected in the HTML and PDF versions of the article.