Author

Wendy Wu

Bio: Wendy Wu is an academic researcher from the National Institutes of Health. The author has contributed to research in topics: RefSeq & Reference genome. The author has an h-index of 4 and has co-authored 4 publications receiving 4017 citations. Previous affiliations of Wendy Wu include University of California, Santa Cruz.

Papers
Journal Article
TL;DR: The approach to utilizing available RNA-Seq and other data types in the authors' manual curation process for vertebrate, plant, and other species is summarized, and a new direction for prokaryotic genomes and protein name management is described.
Abstract: The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

4,104 citations

Journal Article
TL;DR: The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration.
Abstract: The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript and protein sequence records derived from data in public sequence archives and from computation, curation and collaboration (http://www.ncbi.nlm.nih.gov/refseq/). We report here on growth of the mammalian and human subsets, changes to NCBI’s eukaryotic annotation pipeline and modifications affecting transcript and protein records. Recent changes to NCBI’s eukaryotic genome annotation pipeline provide higher throughput, and the addition of RNAseq data to the pipeline results in a significant expansion of the number of transcripts and novel exons annotated on mammalian RefSeq genomes. Recent annotation changes include reporting supporting evidence for transcript records, modification of exon feature annotation and the addition of a structured report of gene and sequence attributes of biological interest. We also describe a revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins and we summarize the current status of the RefSeqGene project.

949 citations

Journal Article
TL;DR: The current status and recent growth in the CCDS dataset is described, as well as recent changes to the web and FTP sites, which include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and the approach to representing genes for which support evidence is incomplete.
Abstract: The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.

157 citations

Journal Article
Gary F. Temple1, Daniela S. Gerhard1, Rebekah S. Rasooly1, Elise A. Feingold1, Peter J. Good1, Cristen Robinson1, Allison Mandich1, Jeffrey G. Derge2, Jeanne Lewis2, Debonny Shoaf2, Francis S. Collins1, Wonhee Jang1, Lukas Wagner1, Carolyn M. Shenmen1, Leonie Misquitta1, Carl F. Schaefer1, Kenneth H. Buetow1, Tom I. Bonner1, Linda Yankie1, Ming Ward1, Lon Phan1, Alex Astashyn1, Garth Brown1, Catherine M. Farrell1, Jennifer Hart1, Melissa J. Landrum1, Bonnie L. Maidak1, Michael R. Murphy1, Terence Murphy1, Bhanu Rajput1, Lillian D. Riddick1, David Webb1, Janet Weber1, Wendy Wu1, Kim D. Pruitt1, Donna Maglott1, Adam Siepel3, Brona Brejova4, Brona Brejova3, Mark Diekhans5, Rachel A. Harte5, Robert Baertsch5, Jim Kent5, David Haussler5, Michael R. Brent6, Laura Langton6, Charles L.G. Comstock6, Michael Stevens6, Chaochun Wei6, Chaochun Wei7, Marijke J. van Baren6, Kourosh Salehi-Ashtiani8, Ryan R. Murray8, Lila Ghamsari8, Elizabeth Mello8, Chenwei Lin8, Chenwei Lin9, Christa Pennacchio10, Christa Pennacchio11, Kirsten Schreiber11, Nicole Shapiro11, Nicole Shapiro12, Amber Marsh11, Elizabeth Pardes11, Troy Moore, Anita Lebeau, Mike Muratet, Blake A. Simmons, David Kloske, Stephanie Sieja, James R. Hudson, Praveen Sethupathy1, Michael J. Brownstein1, Narayan K. Bhat13, Narayan K. Bhat1, Joseph Lazar14, Howard J. Jacob14, Chris E. Gruber, Mark R. Smith, John Douglas Mcpherson15, Angela M. Garcia15, Preethi H. Gunaratne15, Preethi H. Gunaratne16, Jia Qian Wu15, Jia Qian Wu17, Donna M. Muzny15, Richard A. Gibbs15, Alice C. Young1, Gerard G. Bouffard1, Robert W. Blakesley1, Jim C. Mullikin1, Eric D. Green1, Mark Dickson9, Alex Rodriguez9, Alex Rodriguez18, Jane Grimwood9, Jeremy Schmutz9, Richard M. Myers9, Martin Hirst19, Thomas Zeng19, Kane Tse19, Michelle Moksa19, Merinda Deng19, Kevin Ma19, Diana Mah19, Johnson Pang19, Greg Taylor19, Eric Chuah19, Athena Deng19, Keith Fichter19, Anne Go19, Stephanie Lee19, Jing Wang19, Malachi Griffith19, Ryan D. Morin19, Richard A. Moore19, Michael Mayo19, Sarah Munro19, Susan Wagner19, Steven J.M. Jones19, Robert A. Holt19, Marco A. Marra19, Sun Lu, Shuwei Yang, James Hartigan20, Marcus Graf, Ralf Wagner, Stanley Letovksy21, Jacqueline C. Pulido, Keith Robison, Dominic Esposito1, James L. Hartley1, Vanessa Wall1, Ralph F. Hopkins1, Osamu Ohara, Stefan Wiemann22
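TL;DR: The Mammalian Gene Collection now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications.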
Abstract: Since its start, the Mammalian Gene Collection (MGC) has sought to provide at least one full-protein-coding sequence cDNA clone for every human and mouse gene with a RefSeq transcript, and at least 6200 rat genes. The MGC cloning effort initially relied on random expressed sequence tag screening of cDNA libraries. Here, we summarize our recent progress using directed RT-PCR cloning and DNA synthesis. The MGC now contains clones with the entire protein-coding sequence for 92% of human and 89% of mouse genes with curated RefSeq (NM-accession) transcripts, and for 97% of human and 96% of mouse genes with curated RefSeq transcripts that have one or more PubMed publications, in addition to clones for more than 6300 rat genes. These high-quality MGC clones and their sequences are accessible without restriction to researchers worldwide.

140 citations


Cited by
Journal Article
Adam Auton1, Gonçalo R. Abecasis2, David Altshuler3, Richard Durbin4 (+514 more authors; 90 institutions)
01 Oct 2015, Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

Journal Article
Minoru Kanehisa1, Miho Furumichi1, Mao Tanabe1, Yoko Sato2, Kanae Morishima1 
TL;DR: The content has been expanded and the quality improved irrespective of whether or not the KOs appear in the three molecular network databases, and the newly introduced addendum category of the GENES database is a collection of individual proteins whose functions are experimentally characterized and from which an increasing number of KOs are defined.
Abstract: KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an encyclopedia of genes and genomes. Assigning functional meanings to genes and genomes both at the molecular and higher levels is the primary objective of the KEGG database project. Molecular-level functions are stored in the KO (KEGG Orthology) database, where each KO is defined as a functional ortholog of genes and proteins. Higher-level functions are represented by networks of molecular interactions, reactions and relations in the forms of KEGG pathway maps, BRITE hierarchies and KEGG modules. In the past the KO database was developed for the purpose of defining nodes of molecular networks, but now the content has been expanded and the quality improved irrespective of whether or not the KOs appear in the three molecular network databases. The newly introduced addendum category of the GENES database is a collection of individual proteins whose functions are experimentally characterized and from which an increasing number of KOs are defined. Furthermore, the DISEASE and DRUG databases have been improved by systematic analysis of drug labels for better integration of diseases and drugs with the KEGG molecular networks. KEGG is moving towards becoming a comprehensive knowledge base for both functional interpretation and practical application of genomic information.

5,741 citations

Journal Article
TL;DR: The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.
Abstract: The Ensembl Variant Effect Predictor is a powerful toolset for the analysis, annotation, and prioritization of genomic variants in coding and non-coding regions. It provides access to an extensive collection of genomic annotation, with a variety of interfaces to suit different requirements, and simple options for configuring and extending analysis. It is open source, free to use, and supports full reproducibility of results. The Ensembl Variant Effect Predictor can simplify and accelerate variant interpretation in a wide range of study designs.

4,658 citations

Journal Article
TL;DR: The approach to utilizing available RNA-Seq and other data types in the authors' manual curation process for vertebrate, plant, and other species is summarized, and a new direction for prokaryotic genomes and protein name management is described.
Abstract: The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

4,104 citations

Journal Article
TL;DR: Long noncoding RNAs (lncRNAs) form extensive networks of ribonucleoprotein (RNP) complexes with numerous chromatin regulators and then target these enzymatic activities to appropriate locations in the genome.
Abstract: The central dogma of gene expression is that DNA is transcribed into messenger RNAs, which in turn serve as the template for protein synthesis. The discovery of extensive transcription of large RNA transcripts that do not code for proteins, termed long noncoding RNAs (lncRNAs), provides an important new perspective on the centrality of RNA in gene regulation. Here, we discuss genome-scale strategies to discover and characterize lncRNAs. An emerging theme from multiple model systems is that lncRNAs form extensive networks of ribonucleoprotein (RNP) complexes with numerous chromatin regulators and then target these enzymatic activities to appropriate locations in the genome. Consistent with this notion, lncRNAs can function as modular scaffolds to specify higher-order organization in RNP complexes and in chromatin states. The importance of these modes of regulation is underscored by the newly recognized roles of long RNAs for proper gene control across all kingdoms of life.

3,075 citations