Author

John B. Anderson

Other affiliations: Johns Hopkins University School of Medicine

Bio: John B. Anderson is an academic researcher at the National Institutes of Health. The author has contributed to research on the Conserved Domain Database and Entrez. The author has an h-index of 8 and has co-authored 8 publications receiving 6,807 citations. Previous affiliations of John B. Anderson include Johns Hopkins University School of Medicine.

Papers

Journal Article

CDD: a Conserved Domain Database for the functional annotation of proteins

Aron Marchler-Bauer¹, Shennan Lu¹, John B. Anderson¹, Farideh Chitsaz¹, Myra K. Derbyshire¹, Carol DeWeese-Scott¹, Jessica H. Fong¹, Lewis Y. Geer¹, Renata C. Geer¹, Noreen R. Gonzales¹, Marc Gwadz¹, David I. Hurwitz¹, John D. Jackson¹, Zhaoxi Ke¹, Christopher J. Lanczycki¹, Fu-Ping Lu¹, Gabriele H. Marchler¹, Mikhail Mullokandov¹, Marina V. Omelchenko¹, Cynthia L. Robertson¹, James S. Song¹, Narmada Thanki¹, Roxanne A. Yamashita¹, Dachuan Zhang¹, Naigong Zhang¹, Chanjuan Zheng¹, Stephen H. Bryant¹

National Institutes of Health¹

01 Jan 2011-Nucleic Acids Research

TL;DR: NCBI’s Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints.

Abstract: NCBI's Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. As CDD also imports domain family models from a variety of external sources, it is a partially redundant collection. To simplify protein annotation, redundant models and models describing homologous families are clustered into superfamilies. By default, domain footprints are annotated with the corresponding superfamily designation, on top of which specific annotation may indicate high-confidence assignment of family membership. Pre-computed domain annotation is available for proteins in the Entrez/Protein dataset, and a novel interface, Batch CD-Search, allows the computation and download of annotation for large sets of protein queries. CDD can be accessed via http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

2,934 citations

Journal Article

CDD: a Conserved Domain Database for protein classification

Aron Marchler-Bauer¹, John B. Anderson¹, Praveen F. Cherukuri¹, Carol DeWeese-Scott¹, Lewis Y. Geer¹, Marc Gwadz¹, Siqian He¹, David I. Hurwitz¹, John D. Jackson¹, Zhaoxi Ke¹, Christopher J. Lanczycki¹, Cynthia A. Liebert¹, Chunlei Liu¹, Fu Lu¹, Gabriele H. Marchler¹, Mikhail Mullokandov¹, Benjamin A. Shoemaker¹, Vahan Simonyan¹, James S. Song¹, Paul A. Thiessen¹, Roxanne A. Yamashita¹, Jodie J. Yin¹, Dachuan Zhang¹, Stephen H. Bryant¹

National Institutes of Health¹

17 Dec 2004-Nucleic Acids Research

TL;DR: The progress of the curation effort and associated improvements in the functionality of the CDD information retrieval system are reported.

Abstract: The Conserved Domain Database (CDD) is the protein classification component of NCBI's Entrez query and retrieval system. CDD is linked to other Entrez databases such as Proteins, Taxonomy and PubMed, and can be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. CD-Search, which is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi, is a fast, interactive tool to identify conserved domains in new protein sequences. CD-Search results for protein sequences in Entrez are pre-computed to provide links between proteins and domain models, and computational annotation visible upon request. Protein-protein queries submitted to NCBI's BLAST search service at http://www.ncbi.nlm.nih.gov/BLAST are scanned for the presence of conserved domains by default. While CDD started out as essentially a mirror of publicly available domain alignment collections, such as SMART, Pfam and COG, we have continued an effort to update, and in some cases replace these models with domain hierarchies curated at the NCBI. Here, we report on the progress of the curation effort and associated improvements in the functionality of the CDD information retrieval system.

1,193 citations

Journal Article

CDD: specific functional annotation with the Conserved Domain Database.

Aron Marchler-Bauer¹, John B. Anderson¹, Farideh Chitsaz¹, Myra K. Derbyshire¹, Carol DeWeese-Scott¹, Jessica H. Fong¹, Lewis Y. Geer¹, Renata C. Geer¹, Noreen R. Gonzales¹, Marc Gwadz¹, Siqian He¹, David I. Hurwitz¹, John D. Jackson¹, Zhaoxi Ke¹, Christopher J. Lanczycki¹, Cynthia A. Liebert¹, Chunlei Liu¹, Fu-er Lu¹, Shennan Lu¹, Gabriele H. Marchler¹, Mikhail Mullokandov¹, James S. Song¹, Asba Tasneem¹, Narmada Thanki¹, Roxanne A. Yamashita¹, Dachuan Zhang¹, Naigong Zhang¹, Stephen H. Bryant¹

National Institutes of Health¹

01 Jan 2009-Nucleic Acids Research

TL;DR: NCBI's Conserved Domain Database is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution, and provides annotation of domain footprints and conserved functional sites on protein sequences.

Abstract: NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Starting with the latest version of CDD, v2.14, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either ‘specific’ (identifying molecular function with high confidence) or as ‘non-specific’ (identifying superfamily membership only).

1,115 citations

Journal Article

CDD: a conserved domain database for interactive domain family analysis

Aron Marchler-Bauer¹, John B. Anderson¹, Myra K. Derbyshire¹, Carol DeWeese-Scott¹, Noreen R. Gonzales¹, Marc Gwadz¹, Luning Hao¹, Siqian He¹, David I. Hurwitz¹, John D. Jackson¹, Zhaoxi Ke¹, Dmitri M. Krylov¹, Christopher J. Lanczycki¹, Cynthia A. Liebert¹, Chunlei Liu¹, Fu Lu¹, Shennan Lu¹, Gabriele H. Marchler¹, Mikhail Mullokandov¹, James S. Song¹, Narmada Thanki¹, Roxanne A. Yamashita¹, Jodie J. Yin¹, Dachuan Zhang¹, Stephen H. Bryant¹

National Institutes of Health¹

01 Jan 2007-Nucleic Acids Research

TL;DR: A novel helper application, CDTree, is presented, which enables users of the CDD resource to examine curated hierarchies; used in concert, CDD and CDTree serve as a powerful tool in protein classification, as they allow users to analyze protein sequences in the context of domain family hierarchies.

Abstract: The conserved domain database (CDD) is part of NCBI's Entrez database system and serves as a primary resource for the annotation of conserved domain footprints on protein sequences in Entrez. Entrez's global query interface can be accessed at http://www.ncbi.nlm.nih.gov/Entrez and will search CDD and many other databases. Domain annotation for proteins in Entrez has been pre-computed and is readily available in the form of ‘Conserved Domain’ links. Novel protein sequences can be scanned against CDD using the CD-Search service; this service searches databases of CDD-derived profile models with protein sequence queries using BLAST heuristics, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Protein query sequences submitted to NCBI's protein BLAST search service are scanned for conserved domain signatures by default. The CDD collection contains models imported from Pfam, SMART and COG, as well as domain models curated at NCBI. NCBI curated models are organized into hierarchies of domains related by common descent. Here we report on the status of the curation effort and present a novel helper application, CDTree, which enables users of the CDD resource to examine curated hierarchies. More importantly, CDD and CDTree used in concert, serve as a powerful tool in protein classification, as they allow users to analyze protein sequences in the context of domain family hierarchies.

852 citations

Journal Article

CDD: a curated Entrez database of conserved domain alignments

Aron Marchler-Bauer¹, John B. Anderson¹, Carol DeWeese-Scott¹, Natalie D. Fedorova¹, Lewis Y. Geer¹, Siqian He¹, David I. Hurwitz¹, John D. Jackson¹, Aviva R. Jacobs¹, Christopher J. Lanczycki¹, Cynthia A. Liebert¹, Chunlei Liu¹, Thomas Madej¹, Gabriele H. Marchler¹, Raja Mazumder¹, Anastasia N. Nikolskaya¹, Anna R. Panchenko¹, Bachoti S. Rao¹, Benjamin A. Shoemaker¹, Vahan Simonyan¹, James S. Song¹, Paul A. Thiessen¹, Sona Vasudevan¹, Yanli Wang¹, Roxanne A. Yamashita¹, Jodie J. Yin¹, Stephen H. Bryant¹

National Institutes of Health¹

01 Jan 2003-Nucleic Acids Research

TL;DR: The Conserved Domain Database (CDD), which mirrors the publicly available domain alignment collections SMART and PFAM, and now also contains alignment models curated at NCBI, is now indexed as a separate database within the Entrez system and linked to other Entrez databases such as MEDLINE®.

Abstract: The Conserved Domain Database (CDD) is now indexed as a separate database within the Entrez system and linked to other Entrez databases such as MEDLINE®. This allows users to search for domain types by name, for example, or to view the domain architecture of any protein in Entrez's sequence database. CDD can be accessed on the World Wide Web at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. Users may also employ the CD-Search service to identify conserved domains in new sequences, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. CD-Search results, and pre-computed links from Entrez's protein database, are calculated using the RPS-BLAST algorithm and Position Specific Score Matrices (PSSMs) derived from CDD alignments. CD-Searches are also run by default for protein–protein queries submitted to BLAST® at http://www.ncbi.nlm.nih.gov/BLAST. CDD mirrors the publicly available domain alignment collections SMART and PFAM, and now also contains alignment models curated at NCBI. Structure information is used to identify the core substructure likely to be present in all family members, and to produce sequence alignments consistent with structure conservation. This alignment model allows NCBI curators to annotate ‘columns’ corresponding to functional sites conserved among family members.

765 citations

Cited by

Journal Article

Database resources of the National Center for Biotechnology Information

David L. Wheeler¹, Deanna M. Church¹, Ron Edgar¹, Scott Federhen¹, Wolfgang Helmberg¹, Thomas L. Madden¹, Joan Pontius¹, Gregory D. Schuler¹, Lynn M. Schriml¹, Edwin Sequeira¹, Tugba O. Suzek¹, Tatiana Tatusova¹, Lukas Wagner¹

National Institutes of Health¹

01 Jan 2004-Nucleic Acids Research

TL;DR: In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's website.

Abstract: In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's website. NCBI resources include Entrez, PubMed, PubMed Central, LocusLink, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, SARS Coronavirus Resource, SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD) and the Conserved Domain Architecture Retrieval Tool (CDART). Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at http://www.ncbi.nlm.nih.gov.

9,604 citations

Journal Article

TBtools: An Integrative Toolkit Developed for Interactive Analyses of Big Biological Data.

Chengjie Chen¹, Hao Chen², Yi Zhang, Hannah R. Thomas³, Margaret H. Frank³, Yehua He¹, Rui Xia

South China Agricultural University¹, Hunan Agricultural University², Cornell University³

03 Aug 2020-Molecular Plant

TL;DR: The toolkit incorporates over 130 functions, which are designed to meet the increasing demand for big-data analyses, ranging from bulk sequence processing to interactive data visualization, and a new plotting engine developed to maximize their interactive ability.

5,173 citations

Journal Article

Protein structure prediction on the Web: a case study using the Phyre server.

Lawrence A. Kelley¹, Michael J.E. Sternberg¹

Imperial College London¹

01 Jan 2009-Nature Protocols

TL;DR: This protocol provides a guide to interpreting the output of structure prediction servers in general and one such tool in particular, the protein homology/analogy recognition engine (Phyre), which can reliably detect up to twice as many remote homologies as standard sequence-profile searching.

Abstract: Determining the structure and function of a novel protein is a cornerstone of many aspects of modern biology. Over the past decades, a number of computational tools for structure prediction have been developed. It is critical that the biological community is aware of such tools and is able to interpret their results in an informed way. This protocol provides a guide to interpreting the output of structure prediction servers in general and one such tool in particular, the protein homology/analogy recognition engine (Phyre). New profile–profile matching algorithms have improved structure prediction considerably in recent years. Although the performance of Phyre is typical of many structure prediction systems using such algorithms, all these systems can reliably detect up to twice as many remote homologies as standard sequence-profile searching. Phyre is widely used by the biological community, with >150 submissions per day, and provides a simple interface to results. Phyre takes 30 min to predict the structure of a 250-residue protein.

4,403 citations

Journal Article

NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins

Kim D. Pruitt¹, Tatiana Tatusova¹, Donna Maglott¹

National Institutes of Health¹

17 Dec 2004-Nucleic Acids Research

TL;DR: The National Center for Biotechnology Information Reference Sequence (RefSeq) database provides a non-redundant collection of sequences representing genomic data, transcripts and proteins that pragmatically includes sequence data that are currently publicly available in the archival databases.

Abstract: The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff.

4,229 citations

Journal Article

The COG database: an updated version includes eukaryotes

Roman L. Tatusov¹, Natalie D. Fedorova¹, John D. Jackson¹, Aviva R. Jacobs¹, Boris Kiryutin¹, Eugene V. Koonin¹, Dmitri M. Krylov¹, Raja Mazumder², Sergei L. Mekhedov¹, Anastasia N. Nikolskaya², B Sridhar Rao¹, Sergei Smirnov¹, Alexander V. Sverdlov¹, Sona Vasudevan¹, Yuri I. Wolf¹, Jodie J. Yin¹, Darren A. Natale²

National Institutes of Health¹, Georgetown University Medical Center²

11 Sep 2003-BMC Bioinformatics

TL;DR: A major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes is described and is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.

Abstract: The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies. We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes. The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.

4,167 citations
