Author

Kerstin Howe

Other affiliations: Yale University
Bio: Kerstin Howe is an academic researcher from the Wellcome Trust Sanger Institute. The author has contributed to research on topics including Genome and Reference genome. The author has an h-index of 29 and has co-authored 81 publications receiving 8,163 citations. Previous affiliations of Kerstin Howe include Yale University.


Papers
Journal Article
Kerstin Howe, Matthew D. Clark, Carlos Torroja1, Carlos Torroja2, +171 more
Institutions (11)
25 Apr 2013-Nature
TL;DR: A high-quality sequence assembly of the zebrafish genome is generated, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map, providing a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4 and chromosomal regions that influence sex determination.
Abstract: Zebrafish have become a popular organism for the study of vertebrate gene function. The virtually transparent embryos of this species, and the ability to accelerate genetic studies by gene knockdown or overexpression, have led to the widespread use of zebrafish in the detailed investigation of vertebrate gene function and increasingly, the study of human genetic disease. However, for effective modelling of human genetic disease it is important to understand the extent to which zebrafish genes and gene structures are related to orthologous human genes. To examine this, we generated a high-quality sequence assembly of the zebrafish genome, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map. Detailed automatic and manual annotation provides evidence of more than 26,000 protein-coding genes, the largest gene set of any vertebrate so far sequenced. Comparison to the human reference genome shows that approximately 70% of human genes have at least one obvious zebrafish orthologue. In addition, the high quality of this genome assembly provides a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4 and chromosomal regions that influence sex determination.

3,573 citations

Journal Article
Martien A. M. Groenen1, Alan Archibald2, Hirohide Uenishi, Christopher K. Tuggle3, Yasuhiro Takeuchi4, Max F. Rothschild3, Claire Rogel-Gaillard5, Chankyu Park6, Denis Milan7, Hendrik-Jan Megens1, Shengting Li8, Denis M. Larkin9, Heebal Kim10, Laurent A. F. Frantz1, Mario Caccamo11, Hyeonju Ahn10, Bronwen Aken12, Anna Anselmo13, Christian Anthon14, Loretta Auvil15, Bouabid Badaoui13, Craig W. Beattie16, Christian Bendixen8, Daniel Berman17, Frank Blecha18, Jonas Blomberg19, Lars Bolund8, Mirte Bosse1, Sara Botti13, Zhan Bujie8, Megan Bystrom3, Boris Capitanu15, Denise Carvalho-Silva20, Patrick Chardon5, Celine Chen21, Ryan Cheng3, Sang-Haeng Choi, William Chow12, Richard Clark12, C M Clee12, Richard P. M. A. Crooijmans1, Harry D. Dawson21, Patrice Dehais7, Fioravante De Sapio2, Bert Dibbits1, Nizar Drou11, Zhi-Qiang Du3, Kellye Eversole, João Fadista22, João Fadista8, Susan Fairley12, Thomas Faraut7, Geoffrey J. Faulkner22, Geoffrey J. Faulkner2, Katie E. Fowler23, Merete Fredholm14, Eric Fritz3, James G. R. Gilbert12, Elisabetta Giuffra5, Elisabetta Giuffra13, Jan Gorodkin14, Darren K. Griffin23, Jennifer Harrow12, Alexander Hayward24, Kerstin Howe12, Zhi-Liang Hu3, Sean Humphray22, Sean Humphray12, Toby Hunt12, Henrik Hornshøj8, Jin-Tae Jeon25, Patric Jern24, Matthew Jones12, Jerzy Jurka26, Hiroyuki Kanamori, Ronan Kapetanovic2, Jaebum Kim15, Jaebum Kim6, Jae-Hwan Kim, Kyu-Won Kim, Tae-Hun Kim, Greger Larson27, Kyooyeol Lee6, Kyung-Tai Lee, Richard M. Leggett11, Harris A. Lewin28, Yingrui Li, Wan Sheng Liu29, Jane E. Loveland12, Yao Lu, Joan K. Lunney17, Jian Ma15, Ole Madsen1, Katherine M. Mann22, Katherine M. Mann17, Lucy Matthews12, Stuart McLaren12, Takeya Morozumi, Michael P. Murtaugh30, Jitendra Narayan9, Dinh Truong Nguyen6, Peixiang Ni, Song-Jung Oh31, Suneel Kumar Onteru3, Frank Panitz8, Eung-Woo Park, Hong-Seog Park, Géraldine Pascal32, Yogesh Paudel1, Miguel Pérez-Enciso, Ricardo H. Ramirez-Gonzalez11, James M. Reecy3, Sandra L. Rodriguez-Zas15, Gary A. Rohrer17, Lauretta A. Rund15, Yongming Sang18, Kyle M. Schachtschneider15, Joshua G. Schraiber33, John C. Schwartz30, Linda Scobie34, Carol Scott12, Stephen M. J. Searle12, Bertrand Servin7, Bruce R. Southey15, Göran O. Sperber19, Peter F. Stadler35, Jonathan V. Sweedler15, Hakim Tafer35, Bo Thomsen8, Rashmi Wali34, Jian Wang, Jun Wang14, Simon D. M. White12, Xun Xu, Martine Yerle7, Guojie Zhang, Jianguo Zhang, Jie Zhang36, Shuhong Zhao36, Jane Rogers11, Carol Churcher12, Lawrence B. Schook15
15 Nov 2012-Nature
TL;DR: The assembly and analysis of the genome sequence of a female domestic Duroc pig and a comparison with the genomes of wild and domestic pigs from Europe and Asia reveal a deep phylogenetic split between European and Asian wild boars ∼1 million years ago.
Abstract: For 10,000 years pigs and humans have shared a close and complex relationship. From domestication to modern breeding practices, humans have shaped the genomes of domestic pigs. Here we present the assembly and analysis of the genome sequence of a female domestic Duroc pig (Sus scrofa) and a comparison with the genomes of wild and domestic pigs from Europe and Asia. Wild pigs emerged in South East Asia and subsequently spread across Eurasia. Our results reveal a deep phylogenetic split between European and Asian wild boars ∼1 million years ago, and a selective sweep analysis indicates selection on genes involved in RNA processing and regulation. Genes associated with immune response and olfaction exhibit fast evolution. Pigs have the largest repertoire of functional olfactory receptor genes, reflecting the importance of smell in this scavenging animal. The pig genome sequence provides an important resource for further improvements of this important livestock species, and our identification of many putative disease-causing variants extends the potential of the pig as a biomedical model.

1,189 citations

Journal Article
TL;DR: A novel tool, purge_dups, is presented that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps; it can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly.
Abstract: Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. Availability and implementation: The source code is written in C and is available at https://github.com/dfguan/purge_dups. Supplementary information: Supplementary data are available at Bioinformatics online.

728 citations

Journal Article
Arang Rhie1, Shane A. McCarthy2, Shane A. McCarthy3, Olivier Fedrigo4, Joana Damas5, Giulio Formenti4, Sergey Koren1, Marcela Uliano-Silva6, William Chow3, Arkarachai Fungtammasan, J. H. Kim7, Chul Hee Lee7, Byung June Ko7, Mark Chaisson8, Gregory Gedman4, Lindsey J. Cantin4, Françoise Thibaud-Nissen1, Leanne Haggerty9, Iliana Bista3, Iliana Bista2, Michelle Smith3, Bettina Haase4, Jacquelyn Mountcastle4, Sylke Winkler10, Sylke Winkler11, Sadye Paez4, Jason T. Howard, Sonja C. Vernes10, Sonja C. Vernes12, Sonja C. Vernes13, Tanya M. Lama14, Frank Grützner15, Wesley C. Warren16, Christopher N. Balakrishnan17, Dave W Burt18, Jimin George19, Matthew T. Biegler4, David Iorns, Andrew Digby, Daryl Eason, Bruce C. Robertson20, Taylor Edwards21, Mark Wilkinson22, George F. Turner23, Axel Meyer24, Andreas F. Kautt25, Andreas F. Kautt24, Paolo Franchini24, H. William Detrich26, Hannes Svardal27, Hannes Svardal28, Maximilian Wagner29, Gavin J. P. Naylor30, Martin Pippel10, Milan Malinsky3, Milan Malinsky31, Mark Mooney, Maria Simbirsky, Brett T. Hannigan, Trevor Pesout32, Marlys L. Houck33, Ann C Misuraca33, Sarah B. Kingan34, Richard Hall34, Zev N. Kronenberg34, Ivan Sović34, Christopher Dunn34, Zemin Ning3, Alex Hastie, Joyce V. Lee, Siddarth Selvaraj, Richard E. Green32, Nicholas H. Putnam, Ivo Gut35, Jay Ghurye36, Erik Garrison32, Ying Sims3, Joanna Collins3, Sarah Pelan3, James Torrance3, Alan Tracey3, Jonathan Wood3, Robel E. Dagnew8, Dengfeng Guan2, Dengfeng Guan37, Sarah E. London38, David F. Clayton19, Claudio V. Mello39, Samantha R. Friedrich39, Peter V. Lovell39, Ekaterina Osipova10, Farooq O. Al-Ajli40, Farooq O. Al-Ajli41, Simona Secomandi42, Heebal Kim7, Constantina Theofanopoulou4, Michael Hiller43, Yang Zhou, Robert S. Harris44, Kateryna D. Makova44, Paul Medvedev44, Jinna Hoffman1, Patrick Masterson1, Karen Clark1, Fergal J. Martin9, Kevin L. Howe9, Paul Flicek9, Brian P. Walenz1, Woori Kwak, Hiram Clawson32, Mark Diekhans32, Luis R Nassar32, Benedict Paten32, Robert H. S. Kraus24, Robert H. S. Kraus10, Andrew J. Crawford45, M. Thomas P. Gilbert46, M. Thomas P. Gilbert47, Guojie Zhang, Byrappa Venkatesh48, Robert W. Murphy49, Klaus-Peter Koepfli50, Beth Shapiro32, Beth Shapiro51, Warren E. Johnson52, Warren E. Johnson50, Federica Di Palma53, Tomas Marques-Bonet, Emma C. Teeling54, Tandy Warnow55, Jennifer A. Marshall Graves56, Oliver A. Ryder57, Oliver A. Ryder33, David Haussler32, Stephen J. O'Brien58, Jonas Korlach34, Harris A. Lewin5, Kerstin Howe3, Eugene W. Myers10, Eugene W. Myers11, Richard Durbin2, Richard Durbin3, Adam M. Phillippy1, Erich D. Jarvis51, Erich D. Jarvis4
National Institutes of Health1, University of Cambridge2, Wellcome Trust Sanger Institute3, Rockefeller University4, University of California, Davis5, Leibniz Association6, Seoul National University7, University of Southern California8, European Bioinformatics Institute9, Max Planck Society10, Dresden University of Technology11, University of St Andrews12, Radboud University Nijmegen13, University of Massachusetts Amherst14, University of Adelaide15, University of Missouri16, East Carolina University17, University of Queensland18, Clemson University19, University of Otago20, University of Arizona21, Natural History Museum22, Bangor University23, University of Konstanz24, Harvard University25, Northeastern University26, University of Antwerp27, National Museum of Natural History28, University of Graz29, University of Florida30, University of Basel31, University of California, Santa Cruz32, Zoological Society of San Diego33, Pacific Biosciences34, Pompeu Fabra University35, University of Maryland, College Park36, Harbin Institute of Technology37, University of Chicago38, Oregon Health & Science University39, Qatar Airways40, Monash University Malaysia Campus41, University of Milan42, Goethe University Frankfurt43, Pennsylvania State University44, University of Los Andes45, University of Copenhagen46, Norwegian University of Science and Technology47, Agency for Science, Technology and Research48, Royal Ontario Museum49, Smithsonian Institution50, Howard Hughes Medical Institute51, Walter Reed Army Institute of Research52, University of East Anglia53, University College Dublin54, University of Illinois at Urbana–Champaign55, La Trobe University56, University of California, San Diego57, Nova Southeastern University58
28 Apr 2021-Nature
TL;DR: The Vertebrate Genomes Project (VGP) is an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.
Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1-4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

647 citations

Journal Article
TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.
Abstract: The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

643 citations


Cited by
Journal Article
TL;DR: This work examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters, 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.
Abstract: The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

4,281 citations

Journal Article
21 Oct 2004-Nature
TL;DR: The current human genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps and is accurate to an error rate of approximately 1 event per 100,000 bases.
Abstract: The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

3,989 citations

Journal Article
TL;DR: The authors' data provide clues as to how neurons and astrocytes differ in their ability to dynamically regulate glycolytic flux and lactate generation attributable to unique splicing of PKM2, the gene encoding the glycolytic enzyme pyruvate kinase.
Abstract: The major cell classes of the brain differ in their developmental processes, metabolism, signaling, and function. To better understand the functions and interactions of the cell types that comprise these classes, we acutely purified representative populations of neurons, astrocytes, oligodendrocyte precursor cells, newly formed oligodendrocytes, myelinating oligodendrocytes, microglia, endothelial cells, and pericytes from mouse cerebral cortex. We generated a transcriptome database for these eight cell types by RNA sequencing and used a sensitive algorithm to detect alternative splicing events in each cell type. Bioinformatic analyses identified thousands of new cell type-enriched genes and splicing isoforms that will provide novel markers for cell identification, tools for genetic manipulation, and insights into the biology of the brain. For example, our data provide clues as to how neurons and astrocytes differ in their ability to dynamically regulate glycolytic flux and lactate generation attributable to unique splicing of PKM2, the gene encoding the glycolytic enzyme pyruvate kinase. This dataset will provide a powerful new resource for understanding the development and function of the brain. To ensure the widespread distribution of these datasets, we have created a user-friendly website (http://web.stanford.edu/group/barres_lab/brain_rnaseq.html) that provides a platform for analyzing and comparing transcription and alternative splicing profiles for various cell classes in the brain.

3,891 citations

Journal Article
TL;DR: The SWISS-PROT protein knowledgebase connects amino acid sequences with the current knowledge in the Life Sciences by providing an interdisciplinary overview of relevant information by bringing together experimental results, computed features and sometimes even contradictory conclusions.
Abstract: The SWISS-PROT protein knowledgebase (http://www.expasy.org/sprot/ and http://www.ebi.ac.uk/swissprot/) connects amino acid sequences with the current knowledge in the Life Sciences. Each protein entry provides an interdisciplinary overview of relevant information by bringing together experimental results, computed features and sometimes even contradictory conclusions. Detailed expertise that goes beyond the scope of SWISS-PROT is made available via direct links to specialised databases. SWISS-PROT provides annotated entries for all species, but concentrates on the annotation of entries from human (the HPI project) and other model organisms to ensure the presence of high quality annotation for representative members of all protein families. Part of the annotation can be transferred to other family members, as is already done for microbes by the High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP) project. Protein families and groups of proteins are regularly reviewed to keep up with current scientific findings. Complementarily, TrEMBL strives to comprise all protein sequences that are not yet represented in SWISS-PROT, by incorporating a perpetually increasing level of mostly automated annotation. Researchers are welcome to contribute their knowledge to the scientific community by submitting relevant findings to SWISS-PROT at swiss-prot@expasy.org.

3,440 citations

Journal Article
TL;DR: It is found that lincRNA expression is strikingly tissue-specific compared with coding genes, and that lincRNAs are typically coexpressed with their neighboring genes, albeit to an extent similar to that of pairs of neighboring protein-coding genes.
Abstract: Large intergenic noncoding RNAs (lincRNAs) are emerging as key regulators of diverse cellular processes. Determining the function of individual lincRNAs remains a challenge. Recent advances in RNA sequencing (RNA-seq) and computational methods allow for an unprecedented analysis of such transcripts. Here, we present an integrative approach to define a reference catalog of >8000 human lincRNAs. Our catalog unifies previously existing annotation sources with transcripts we assembled from RNA-seq data collected from ~4 billion RNA-seq reads across 24 tissues and cell types. We characterize each lincRNA by a panorama of >30 properties, including sequence, structural, transcriptional, and orthology features. We found that lincRNA expression is strikingly tissue-specific compared with coding genes, and that lincRNAs are typically coexpressed with their neighboring genes, albeit to an extent similar to that of pairs of neighboring protein-coding genes. We distinguish an additional subset of transcripts that have high evolutionary conservation but may include short ORFs and may serve as either lincRNAs or small peptides. Our integrated, comprehensive, yet conservative reference catalog of human lincRNAs reveals the global properties of lincRNAs and will facilitate experimental studies and further functional classification of these genes.

3,114 citations