Author

Kenneth S. Katz

Bio: Kenneth S. Katz is an academic researcher at the National Institutes of Health. The author has contributed to research on topics including RefSeq and Entrez. The author has an h-index of 13 and has co-authored 17 publications receiving 4,668 citations.
Topics: RefSeq, Entrez, Metadata, MinHash, Entrez Gene

Papers
Journal ArticleDOI
TL;DR: ClinVar continues to make improvements to its search and retrieval functions.
Abstract: ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) is a freely available, public archive of human genetic variants and interpretations of their significance to disease, maintained at the National Institutes of Health. Interpretations of the clinical significance of variants are submitted by clinical testing laboratories, research laboratories, expert panels and other groups. ClinVar aggregates data by variant-disease pairs, and by variant (or set of variants). Data aggregated by variant are accessible on the website, in an improved set of variant call format files and as a new comprehensive XML report. ClinVar recently started accepting submissions that are focused primarily on providing phenotypic information for individuals who have had genetic testing. Submissions may come from clinical providers providing their own interpretation of the variant ('provider interpretation') or from groups such as patient registries that primarily provide phenotypic information from patients ('phenotyping only'). ClinVar continues to make improvements to its search and retrieval functions. Several new fields are now indexed for more precise searching, and filters allow the user to narrow down a large set of search results.

2,345 citations

Journal ArticleDOI
TL;DR: ClinVar at the National Center for Biotechnology Information (NCBI) is a freely available archive for interpretations of clinical significance of variants for reported conditions, which includes germline and somatic variants of any size, type or genomic location.
Abstract: ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) at the National Center for Biotechnology Information (NCBI) is a freely available archive for interpretations of clinical significance of variants for reported conditions. The database includes germline and somatic variants of any size, type or genomic location. Interpretations are submitted by clinical testing laboratories, research laboratories, locus-specific databases, OMIM®, GeneReviews™, UniProt, expert panels and practice guidelines. In NCBI's Variation submission portal, submitters upload batch submissions or use the Submission Wizard for single submissions. Each submitted interpretation is assigned an accession number prefixed with SCV. ClinVar staff review validation reports with data types such as HGVS (Human Genome Variation Society) expressions; however, clinical significance is reported directly from submitters. Interpretations are aggregated by variant-condition combination and assigned an accession number prefixed with RCV. Clinical significance is calculated for the aggregate record, indicating consensus or conflict in the submitted interpretations. ClinVar uses data standards, such as HGVS nomenclature for variants and MedGen identifiers for conditions. The data are available on the web as variant-specific views; the entire data set can be downloaded via ftp. Programmatic access for ClinVar records is available through NCBI's E-utilities. Future development includes providing a variant-centric XML archive and a web page for details of SCV submissions.

2,094 citations
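
The abstract above describes submitted interpretations (SCV accessions) being aggregated by variant-condition combination, with the aggregate record indicating consensus or conflict. A minimal Python sketch of that grouping idea follows. It is an illustration only, not ClinVar's actual procedure: the record layout, the placeholder accessions and variant names, and the rule "one distinct significance means consensus" are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical submissions: (SCV accession, variant, condition, significance).
# Every value here is a made-up placeholder, not real ClinVar data.
submissions = [
    ("SCV_example_1", "variant_1", "condition_A", "Pathogenic"),
    ("SCV_example_2", "variant_1", "condition_A", "Pathogenic"),
    ("SCV_example_3", "variant_2", "condition_B", "Benign"),
    ("SCV_example_4", "variant_2", "condition_B", "Pathogenic"),
]


def aggregate(records):
    """Group submissions by (variant, condition) and flag agreement or conflict.

    Assumed simplification: a group is a "consensus" when all submitters
    report the same significance, and a "conflict" otherwise.
    """
    groups = defaultdict(list)
    for scv, variant, condition, significance in records:
        groups[(variant, condition)].append((scv, significance))

    aggregated = {}
    for key, members in groups.items():
        distinct = {significance for _, significance in members}
        aggregated[key] = {
            "submissions": [scv for scv, _ in members],
            "significances": sorted(distinct),
            "status": "consensus" if len(distinct) == 1 else "conflict",
        }
    return aggregated


result = aggregate(submissions)
for (variant, condition), record in result.items():
    print(variant, condition, record["status"])
```

Each key of `result` plays the role of one aggregate (RCV-style) record in this sketch; the real database assigns its own accessions and applies more detailed rules than the single equality check assumed here.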

Journal ArticleDOI
TL;DR: The National Center for Biotechnology Information's (NCBI) Gene database integrates gene-specific information from multiple data sources and represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI.
Abstract: The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities and Entrez Direct) and for bulk transfer by FTP.

489 citations

Journal ArticleDOI
TL;DR: The goal of LocusLink and RefSeq is to include all known genes and their major products and to encourage collaborations with the scientific community to ensure that these resources are as comprehensive and accurate as possible.

258 citations

Journal ArticleDOI
TL;DR: The ‘Human Immunodeficiency Virus Type 1 (HIV-1), Human Protein Interaction Database’, available through the National Library of Medicine at www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions, was created to catalog all interactions between HIV-1 and human proteins published in the peer-reviewed literature.
Abstract: The ‘Human Immunodeficiency Virus Type 1 (HIV-1), Human Protein Interaction Database’, available through the National Library of Medicine at www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions, was created to catalog all interactions between HIV-1 and human proteins published in the peer-reviewed literature. The database serves the scientific community exploring the discovery of novel HIV vaccine candidates and therapeutic targets. To facilitate this discovery approach, the following information for each HIV-1 human protein interaction is provided and can be retrieved without restriction by web-based downloads and ftp protocols: Reference Sequence (RefSeq) protein accession numbers, Entrez Gene identification numbers, brief descriptions of the interactions, searchable keywords for interactions and PubMed identification numbers (PMIDs) of journal articles describing the interactions. Currently, 2589 unique HIV-1 to human protein interactions and 5135 brief descriptions of the interactions, with a total of 14 312 PMID references to the original articles reporting the interactions, are stored in this growing database. In addition, all protein–protein interactions documented in the database are integrated into Entrez Gene records and listed in the ‘HIV-1 protein interactions’ section of Entrez Gene reports. The database is also tightly linked to other databases through Entrez Gene, enabling users to search for an abundance of information related to HIV pathogenesis and replication.

249 citations


Cited by
Journal ArticleDOI
J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li and 269 more authors (12 institutions)
16 Feb 2001-Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations

Journal ArticleDOI
TL;DR: A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu.
Abstract: As vertebrate genome sequences near completion and research refocuses to their analysis, the issue of effective genome annotation display becomes critical. A mature web tool for rapid and reliable display of any requested portion of the genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. This browser displays assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks. Text and sequence-based searches provide quick and precise access to any region of specific interest. Secondary links from individual features lead to sequence details and supplementary off-site databases. One-half of the annotation tracks are computed at the University of California, Santa Cruz from publicly available sequence data; collaborators worldwide provide the rest. Users can stably add their own custom tracks to the browser for educational or research purposes. The conceptual and technical framework of the browser, its underlying MYSQL database, and overall use are described. The web site currently serves over 50,000 pages per day to over 3000 different users.

9,605 citations

Journal ArticleDOI
TL;DR: A significant update is presented to Enrichr, a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries.
Abstract: Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180 184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr including the ability to submit fuzzy sets, upload BED files, improved application programming interface and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr.

6,201 citations

Journal ArticleDOI
TL;DR: The approach to utilizing available RNA-Seq and other data types in the authors' manual curation process for vertebrate, plant, and other species is summarized, and a new direction for prokaryotic genomes and protein name management is described.
Abstract: The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.

4,104 citations

Journal ArticleDOI
04 Sep 2008-Nature
TL;DR: The impact of microRNAs on the proteome indicated that for most interactions microRNAs act as rheostats to make fine-scale adjustments to protein output.
Abstract: MicroRNAs are endogenous ∼23-nucleotide RNAs that can pair to sites in the messenger RNAs of protein-coding genes to downregulate the expression from these messages. MicroRNAs are known to influence the evolution and stability of many mRNAs, but their global impact on protein output had not been examined. Here we use quantitative mass spectrometry to measure the response of thousands of proteins after introducing microRNAs into cultured cells and after deleting mir-223 in mouse neutrophils. The identities of the responsive proteins indicate that targeting is primarily through seed-matched sites located within favourable predicted contexts in 3′ untranslated regions. Hundreds of genes were directly repressed, albeit each to a modest degree, by individual microRNAs. Although some targets were repressed without detectable changes in mRNA levels, those translationally repressed by more than a third also displayed detectable mRNA destabilization, and, for the more highly repressed targets, mRNA destabilization usually comprised the major component of repression. The impact of microRNAs on the proteome indicated that for most interactions microRNAs act as rheostats to make fine-scale adjustments to protein output. MicroRNAs can regulate gene expression by either inhibiting translation of a messenger RNA, or inducing its degradation. While previous studies have measured regulation at the mRNA level, it was unknown how much regulation occurred at the protein level. Now two groups led by David Bartel and Nikolaus Rajewsky have used variants of the technique known as SILAC (stable isotope labelling with amino acids in cell culture) to measure proteome-wide changes in protein level as a function of expression of endogenous and exogenous microRNAs. They find that while microRNAs can directly repress the translation of hundreds of genes, additional indirect effects result in changes in expression of thousands of genes. 
Many of the changes observed are less than twofold in magnitude, however, indicating that, either directly or indirectly, microRNAs can act as rheostats to fine-tune protein synthesis to match the needs of the cell at any given time. In one of two studies, a technique known as SILAC is used to measure, on a large scale, changes in protein level as a function of expression of endogenous and exogenous miRNAs. It is found that although miRNAs directly repress the translation of hundreds of genes, additional indirect effects result in changes in expression of thousands of genes.

3,562 citations