Author

Brian Smith-White

Bio: Brian Smith-White is an academic researcher from the National Institutes of Health. The author has contributed to research in topics: Genome & Genome project. The author has an h-index of 5 and has co-authored 5 publications receiving 3,475 citations.

Papers
Journal Article
TL;DR: The approach to utilizing available RNA-Seq and other data types in the authors' manual curation process for vertebrate, plant, and other species is summarized, and a new direction for prokaryotic genomes and protein name management is described.
Abstract: The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

4,104 citations

Journal Article
Tsuyoshi Tanaka1, Baltazar A. Antonio1, Shoshi Kikuchi1, Takashi Matsumoto1, Yoshiaki Nagamura1, Hisataka Numa1, Hiroaki Sakai1, Jianzhong Wu1, Takeshi Itoh1, Takeshi Itoh2, Takuji Sasaki1, Ryo Aono, Yasuyuki Fujii3, Takuya Habara, Erimi Harada, Masako Kanno, Yoshihiro Kawahara4, Hiroaki Kawashima, Hiromi Kubooka, Akihiro Matsuya, Hajime Nakaoka, Naomi Saichi, Ryoko Sanbonmatsu, Yoshiharu Sato, Yuji Shinso, Mami Suzuki, Jun-ichi Takeda, Motohiko Tanino, Fusano Todokoro, Kaori Yamaguchi, Naoyuki Yamamoto, Chisato Yamasaki, Tadashi Imanishi2, Toshihisa Okido, Masahito Tada, Kazuho Ikeo, Yoshio Tateno, Takashi Gojobori, Yao-Cheng Lin5, Fu Jin Wei5, Yue-Ie C. Hsing5, Qiang Zhao, Bin Han, Melissa Kramer6, Richard W. McCombie6, David Lonsdale7, Claire O'Donovan7, Eleanor J. Whitfield7, Rolf Apweiler7, Kanako O. Koyanagi8, Jitendra P. Khurana9, Saurabh Raghuvanshi9, Nagendra K. Singh10, Akhilesh K. Tyagi9, Georg Haberer, Masaki Fujisawa, Satomi Hosokawa, Yukiyo Ito, Hiroshi Ikawa, Michie Shibata, Mayu Yamamoto, Richard Bruskiewich11, Douglas R. Hoen12, Thomas E. Bureau12, Nobukazu Namiki13, Hajime Ohyanagi13, Yasumichi Sakai13, Satoshi Nobushima13, Katsumi Sakata13, Roberto A. Barrero14, Yutaka Sato15, Alexandre Souvorov16, Brian Smith-White16, Tatiana Tatusova16, Suyoung An17, Gynheung An17, Satoshi Oota, Galina Fuks18, Joachim Messing, Karen R. Christie19, Damien Lieberherr20, Hyeran Kim21, Andrea Zuccolo21, Rod A. Wing, Kan Nobuta22, Pamela J. Green22, Cheng Lu22, Blake C. Meyers22, Cristian Chaparro23, Benoît Piégu23, Olivier Panaud23, Manuel Echeverria23 
TL;DR: The latest version of the RAP-DB contains a variety of annotation data as follows: clone positions, structures and functions of 31 439 genes validated by cDNAs, RNA genes detected by massively parallel signature sequencing (MPSS) technology and sequence similarity, flanking sequences of mutant lines, transposable elements, etc.
Abstract: The Rice Annotation Project Database (RAP-DB) was created to provide the genome sequence assembly of the International Rice Genome Sequencing Project (IRGSP), manually curated annotation of the sequence, and other genomics information that could be useful for comprehensive understanding of the rice biology. Since the last publication of the RAP-DB, the IRGSP genome has been revised and reassembled. In addition, a large number of rice-expressed sequence tags have been released, and functional genomics resources have been produced worldwide. Thus, we have thoroughly updated our genome annotation by manual curation of all the functional descriptions of rice genes. The latest version of the RAP-DB contains a variety of annotation data as follows: clone positions, structures and functions of 31 439 genes validated by cDNAs, RNA genes detected by massively parallel signature sequencing (MPSS) technology and sequence similarity, flanking sequences of mutant lines, transposable elements, etc. Other annotation data such as Gnomon can be displayed along with those of RAP for comparison. We have also developed a new keyword search system to allow the user to access useful information. The RAP-DB is available at: http://rapdb.dna.affrc.go.jp/ and http://rapdb.lab.nig.ac.jp/.

342 citations

Journal Article
Takeshi Itoh1, Takeshi Itoh2, Tsuyoshi Tanaka2, Roberto A. Barrero, Chisato Yamasaki1, Yasuyuki Fujii1, Phillip Hilton1, Baltazar A. Antonio2, Hideo Aono, Rolf Apweiler, Richard Bruskiewich3, Thomas E. Bureau4, Frances A. Burr5, Antonio Costa de Oliveira6, Galina Fuks7, Takuya Habara1, Georg Haberer, Bin Han, Erimi Harada1, Aiko T. Hiraki1, Hirohiko Hirochika2, Douglas R. Hoen4, Hiroki Hokari1, Satomi Hosokawa, Yue-Ie C. Hsing8, Hiroshi Ikawa9, Kazuho Ikeo, Tadashi Imanishi10, Tadashi Imanishi1, Yukiyo Ito, Pankaj Jaiswal11, Masako Kanno1, Yoshihiro Kawahara1, Yoshihiro Kawahara12, Toshiyuki Kawamura1, Hiroaki Kawashima1, Jitendra P. Khurana13, Shoshi Kikuchi2, Setsuko Komatsu2, Kanako O. Koyanagi10, Hiromi Kubooka1, Damien Lieberherr14, Yao-Cheng Lin8, David M. Lonsdale, Takashi Matsumoto2, Akihiro Matsuya1, W. Richard McCombie15, Joachim Messing7, Akio Miyao2, Nicola Mulder, Yoshiaki Nagamura2, Jongmin Nam16, Jongmin Nam17, Nobukazu Namiki, Hisataka Numa2, Shin Nurimoto1, Claire O'Donovan, Hajime Ohyanagi9, Toshihisa Okido, Satoshi Oota, Naoki Osato, Lance E. Palmer18, Lance E. Palmer15, Francis Quetier19, Saurabh Raghuvanshi13, Naomi Saichi1, Hiroaki Sakai1, Hiroaki Sakai2, Yasumichi Sakai9, Katsumi Sakata9, Tetsuya Sakurai, Fumihiko Sato1, Yoshiharu Sato1, Heiko Schoof20, Heiko Schoof21, Motoaki Seki, Michie Shibata, Yuji Shimizu9, Kazuo Shinozaki, Yuji Shinso1, Nagendra K. Singh22, Brian Smith-White23, Jun-ichi Takeda1, Motohiko Tanino1, Tatiana Tatusova23, Supat Thongjuea24, Fusano Todokoro1, Mika Tsugane, Akhilesh K. Tyagi13, Apichart Vanavichit24, Aihui Wang25, Rod A. Wing, Kaori Yamaguchi1, Mayu Yamamoto, Naoyuki Yamamoto1, Yeisoo Yu26, Hao Zhang1, Qiang Zhao, Kenichi Higo2, Benjamin Burr5, Takashi Gojobori1, Takuji Sasaki2 
TL;DR: The results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene.
Abstract: We present here the annotation of the complete genome of rice Oryza sativa L. ssp. japonica cultivar Nipponbare. All functional annotations for proteins and non-protein-coding RNA (npRNA) candidates were manually curated. Functions were identified or inferred in 19,969 (70%) of the proteins, and 131 possible npRNAs (including 58 antisense transcripts) were found. Almost 5000 annotated protein-coding genes were found to be disrupted in insertional mutant lines, which will accelerate future experimental validation of the annotations. The rice loci were determined by using cDNA sequences obtained from rice and other representative cereals. Our conservative estimate based on these loci and an extrapolation suggested that the gene number of rice is ∼32,000, which is smaller than previous estimates. We conducted comparative analyses between rice and Arabidopsis thaliana and found that both genomes possessed several lineage-specific genes, which might account for the observed differences between these species, while they had similar sets of predicted functional domains among the protein sequences. A system to control translational efficiency seems to be conserved across large evolutionary distances. Moreover, the evolutionary process of protein-coding genes was examined. Our results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene.

254 citations

Journal Article
TL;DR: The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez, which makes Entrez a powerful system for genomic research.
Abstract: The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core Entrez database, Entrez Nucleotide, includes GenBank and is tightly linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allows maps from all plant genomes covered by the Map Viewer to be searched in tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities, such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences, quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing centralized access to NCBI's genomic resources as well as links to organism-specific Web pages beyond NCBI.

19 citations

Book Chapter
TL;DR: The National Center for Biotechnology Information provides a data-rich environment in support of genomic research by collecting the biological data for genomes, genes, gene expressions, gene variation, gene families, proteins, and protein domains and integrating the data with analytical, search, and retrieval resources through the NCBI Web site.
Abstract: The National Center for Biotechnology Information (NCBI) provides a data-rich environment in support of genomic research by collecting the biological data for genomes, genes, gene expressions, gene variation, gene families, proteins, and protein domains and integrating the data with analytical, search, and retrieval resources through the NCBI Web site. Entrez, an integrated search and retrieval system, enables text searches across various diverse biological databases maintained at NCBI. Map Viewer, the genome browser developed at NCBI, displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allows maps from all plant genomes available in the Map Viewer to be searched to produce a display of aligned maps from several species. Customized Plant Basic Local Alignment Search Tool (PlantBLAST) allows the user to perform sequence similarity searches in a special collection of mapped plant sequence data and to view the resulting alignments within a genomic context using Map Viewer. In addition, pre-computed sequence similarities, such as those for proteins offered by BLAST Link (BLink), enable fluid navigation from un-annotated to annotated sequences, quickening the pace of discovery. Plant Genome Central (PGC) is a Web portal that provides centralized access to all NCBI plant genome resources. Also, there are links to plant-specific Web resources external to NCBI such as organism-specific databases, genome-sequencing project Web pages, and homepages of genomic bioinformatics organizations.

16 citations


Cited by
Journal Article
Minoru Kanehisa1, Miho Furumichi1, Mao Tanabe1, Yoko Sato2, Kanae Morishima1 
TL;DR: The content has been expanded and the quality improved irrespective of whether or not the KOs appear in the three molecular network databases, and the newly introduced addendum category of the GENES database is a collection of individual proteins whose functions are experimentally characterized and from which an increasing number of KOs are defined.
Abstract: KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an encyclopedia of genes and genomes. Assigning functional meanings to genes and genomes both at the molecular and higher levels is the primary objective of the KEGG database project. Molecular-level functions are stored in the KO (KEGG Orthology) database, where each KO is defined as a functional ortholog of genes and proteins. Higher-level functions are represented by networks of molecular interactions, reactions and relations in the forms of KEGG pathway maps, BRITE hierarchies and KEGG modules. In the past the KO database was developed for the purpose of defining nodes of molecular networks, but now the content has been expanded and the quality improved irrespective of whether or not the KOs appear in the three molecular network databases. The newly introduced addendum category of the GENES database is a collection of individual proteins whose functions are experimentally characterized and from which an increasing number of KOs are defined. Furthermore, the DISEASE and DRUG databases have been improved by systematic analysis of drug labels for better integration of diseases and drugs with the KEGG molecular networks. KEGG is moving towards becoming a comprehensive knowledge base for both functional interpretation and practical application of genomic information.

5,741 citations

Journal Article
29 Jan 2009-Nature
TL;DR: An initial analysis of the ∼730-megabase Sorghum bicolor (L.) Moench genome is presented, placing ∼98% of genes in their chromosomal context using whole-genome shotgun sequence validated by genetic, physical and syntenic information.
Abstract: Sorghum, an African grass related to sugar cane and maize, is grown for food, feed, fibre and fuel. We present an initial analysis of the approximately 730-megabase Sorghum bicolor (L.) Moench genome, placing approximately 98% of genes in their chromosomal context using whole-genome shotgun sequence validated by genetic, physical and syntenic information. Genetic recombination is largely confined to about one-third of the sorghum genome with gene order and density similar to those of rice. Retrotransposon accumulation in recombinationally recalcitrant heterochromatin explains the approximately 75% larger genome size of sorghum compared with rice. Although gene and repetitive DNA distributions have been preserved since palaeopolyploidization approximately 70 million years ago, most duplicated gene sets lost one member before the sorghum-rice divergence. Concerted evolution makes one duplicated chromosomal segment appear to be only a few million years old. About 24% of genes are grass-specific and 7% are sorghum-specific. Recent gene and microRNA duplications may contribute to sorghum's drought tolerance.

2,809 citations

Journal Article
TL;DR: The InterPro database integrates together predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs.
Abstract: The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or 'signatures' representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total approximately 58,000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein-protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).

1,834 citations

Journal Article
TL;DR: The new version of the MPI Bioinformatics Toolkit is introduced, focusing on improved features for the comprehensive analysis of proteins, as well as on promoting teaching.

1,757 citations

Journal Article
John P. Vogel1, David F. Garvin2, Todd C. Mockler2, Jeremy Schmutz, Daniel S. Rokhsar3, Michael W. Bevan4, Kerrie Barry5, Susan Lucas5, Miranda Harmon-Smith5, Kathleen Lail5, Hope Tice5, Jane Grimwood, Neil McKenzie4, Naxin Huo6, Yong Q. Gu6, Gerard R. Lazo6, Olin D. Anderson6, Frank M. You7, Ming-Cheng Luo7, Jan Dvorak7, Jonathan M. Wright4, Melanie Febrer4, Dominika Idziak8, Robert Hasterok8, Erika Lindquist5, Mei Wang5, Samuel E. Fox2, Henry D. Priest2, Sergei A. Filichkin2, Scott A. Givan2, Douglas W. Bryant2, Jeff H. Chang2, Haiyan Wu9, Wei Wu10, An-Ping Hsia10, Patrick S. Schnable9, Anantharaman Kalyanaraman11, Brad Barbazuk12, Todd P. Michael, Samuel P. Hazen13, Jennifer N. Bragg6, Debbie Laudencia-Chingcuanco6, Yiqun Weng14, Georg Haberer, Manuel Spannagl, Klaus F. X. Mayer, Thomas Rattei15, Therese Mitros3, Sang-Jik Lee16, Jocelyn K. C. Rose16, Lukas A. Mueller16, Thomas L. York16, Thomas Wicker17, Jan P. Buchmann17, Jaakko Tanskanen18, Alan H. Schulman18, Heidrun Gundlach, Michael W. Bevan4, Antonio Costa de Oliveira19, Luciano da C. Maia19, William R. Belknap6, Ning Jiang, Jinsheng Lai9, Liucun Zhu20, Jianxin Ma20, Cheng Sun21, Ellen J. Pritham21, Jérôme Salse, Florent Murat, Michael Abrouk, Rémy Bruggmann, Joachim Messing, Noah Fahlgren2, Christopher M. Sullivan2, James C. Carrington2, Elisabeth J. Chapman, Greg D. May22, Jixian Zhai23, Matthias Ganssmann23, Sai Guna Ranjan Gurazada23, Marcelo A German23, Blake C. Meyers23, Pamela J. Green23, Ludmila Tyler3, Jiajie Wu7, James A. Thomson6, Shan Chen13, Henrik Vibe Scheller24, Jesper Harholt25, Peter Ulvskov25, Jeffrey A. Kimbrel2, Laura E. Bartley24, Peijian Cao24, Ki-Hong Jung26, Manoj Sharma24, Miguel E. Vega-Sánchez24, Pamela C. Ronald24, Chris Dardick6, Stefanie De Bodt27, Wim Verelst27, Dirk Inzé27, Maren Heese28, Arp Schnittger28, Xiaohan Yang29, Udaya C. Kalluri29, Gerald A. Tuskan29, Zhihua Hua14, Richard D. Vierstra14, Yu Cui9, Shuhong Ouyang9, Qixin Sun9, Zhiyong Liu9, Alper Yilmaz30, Erich Grotewold30, Richard Sibout31, Kian Hématy31, Grégory Mouille31, Herman Höfte31, Todd P. Michael, Jérôme Pelloux32, Devin O'Connor3, James C. Schnable3, Scott C. Rowe3, Frank G. Harmon3, Cynthia L. Cass33, John C. Sedbrook33, Mary E. Byrne4, Sean Walsh4, Janet Higgins4, Pinghua Li16, Thomas P. Brutnell16, Turgay Unver34, Hikmet Budak34, Harry Belcram, Mathieu Charles, Boulos Chalhoub, Ivan Baxter35
11 Feb 2010-Nature
TL;DR: The high-quality genome sequence will help Brachypodium reach its potential as an important model system for developing new energy and food crops and establishes a template for analysis of the large genomes of economically important pooid grasses such as wheat.
Abstract: Three subfamilies of grasses, the Ehrhartoideae, Panicoideae and Pooideae, provide the bulk of human nutrition and are poised to become major sources of renewable energy. Here we describe the genome sequence of the wild grass Brachypodium distachyon (Brachypodium), which is, to our knowledge, the first member of the Pooideae subfamily to be sequenced. Comparison of the Brachypodium, rice and sorghum genomes shows a precise history of genome evolution across a broad diversity of the grasses, and establishes a template for analysis of the large genomes of economically important pooid grasses such as wheat. The high-quality genome sequence, coupled with ease of cultivation and transformation, small size and rapid life cycle, will help Brachypodium reach its potential as an important model system for developing new energy and food crops.

1,603 citations