Selection of representative protein data sets

doi:10.1002/PRO.5560010313

Open Access Journal Article

Uwe Hobohm, +3 more

- 01 Mar 1992 -

Protein Science

- Vol. 1, Iss: 3, pp 409-417

TLDR

Two algorithms are developed to extract representative sets of protein chains with maximum coverage and minimum redundancy from the Protein Data Bank. Both relate to problems in graph theory and are generally applicable to other data bases in which criteria of similarity can be defined.

Abstract:

The Protein Data Bank currently contains about 600 data sets of three-dimensional protein coordinates determined by X-ray crystallography or NMR. There is considerable redundancy in the data base, as many protein pairs are identical or very similar in sequence. However, statistical analyses of protein sequence-structure relations require nonredundant data. We have developed two algorithms to extract from the data base representative sets of protein chains with maximum coverage and minimum redundancy. The first algorithm focuses on optimizing a particular property of the selected proteins and works by successive selection of proteins from an ordered list and exclusion of all neighbors of each selected protein. The other algorithm aims at maximizing the size of the selected set and works by successive thinning out of clusters of similar proteins. Both algorithms are generally applicable to other data bases in which criteria of similarity can be defined and relate to problems in graph theory. The largest nonredundant set extracted from the current release of the Protein Data Bank has 155 protein chains. In this set, no two proteins have sequence similarity higher than a certain cutoff (30% identical residues for aligned subsequences longer than 80 residues), yet all structurally unique protein families are represented. Periodically updated lists of representative data sets are available by electronic mail from the file server "netserv@embl-heidelberg.de." The selection may be useful in statistical approaches to protein folding as well as in the analysis and documentation of the known spectrum of three-dimensional protein structures.
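The two selection strategies described in the abstract can be read as operations on a similarity graph, in which each protein chain is a node and two chains are neighbors when their sequence similarity exceeds the cutoff. The following is a minimal Python sketch under that reading, not the authors' implementation. The function names, the toy chain labels, and the caller-supplied `is_similar` predicate are all hypothetical. In particular, the rule used in `thin_clusters` (repeatedly drop the chain with the most remaining neighbors) is only one plausible interpretation of "successive thinning out of clusters".

```python
from typing import Callable, Dict, Iterable, List, Set

Graph = Dict[str, Set[str]]


def build_neighbors(chains: Iterable[str],
                    is_similar: Callable[[str, str], bool]) -> Graph:
    """Build adjacency sets; a and b are neighbors when is_similar(a, b) is true."""
    items = list(chains)
    graph: Graph = {c: set() for c in items}
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if is_similar(a, b):
                graph[a].add(b)
                graph[b].add(a)
    return graph


def select_from_ordered_list(ordered: Iterable[str], graph: Graph) -> List[str]:
    """First strategy: walk a list ordered by some preferred property,
    keep every chain not yet excluded, and exclude all of its neighbors."""
    selected: List[str] = []
    excluded: Set[str] = set()
    for chain in ordered:
        if chain in excluded:
            continue
        selected.append(chain)
        excluded |= graph[chain]
    return selected


def thin_clusters(graph: Graph) -> List[str]:
    """Second strategy (assumed rule): repeatedly remove the chain with the
    most remaining neighbors until no two remaining chains are neighbors."""
    remaining: Graph = {c: set(n) for c, n in graph.items()}
    while remaining:
        # Ties are broken by name so the result is deterministic.
        worst = max(remaining, key=lambda c: (len(remaining[c]), c))
        if not remaining[worst]:
            break  # every remaining chain is isolated
        for neighbor in remaining.pop(worst):
            remaining[neighbor].discard(worst)
    return sorted(remaining)


# Toy example: hub "H" is similar to X, Y and Z; "E" is unrelated to all.
similar_pairs = {frozenset(p) for p in [("H", "X"), ("H", "Y"), ("H", "Z")]}
chains = ["H", "X", "Y", "Z", "E"]
graph = build_neighbors(chains, lambda a, b: frozenset((a, b)) in similar_pairs)

print(select_from_ordered_list(chains, graph))  # ['H', 'E']
print(thin_clusters(graph))                     # ['E', 'X', 'Y', 'Z']
```

In the toy example the ordered-list strategy keeps the preferred chain "H" at the cost of its three neighbors, while the thinning strategy drops "H" and retains a larger set. This mirrors the trade-off stated in the abstract between optimizing a property of the selected proteins and maximizing the size of the selected set.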

Citations

SignalP 4.0: discriminating signal peptides from transmembrane regions

Thomas Nordahl Petersen, +5 more

- 01 Oct 2011 -

Nature Methods

TL;DR: SignalP 4.0 was the best signal-peptide predictor for all three organism types but was not in all cases as good as SignalP 3.0 according to cleavage-site sensitivity or signal-peptide correlation when there are no transmembrane proteins present.

MOLMOL: a program for display and analysis of macromolecular structures.

Reto Koradi, +2 more

- 01 Feb 1996 -

Journal of Molecular Graphics

TL;DR: Special efforts were made to allow for appropriate display and analysis of the sets of typically 20-40 conformers that are conventionally used to represent the result of an NMR structure determination, using functions for superimposing sets of conformers, calculation of root mean square distance (RMSD) values, identification of hydrogen bonds, and identification and listing of short distances between pairs of hydrogen atoms.

Improved Prediction of Signal Peptides: SignalP 3.0

Jannick Dyrløv Bendtsen, +3 more

- 16 Jul 2004 -

Journal of Molecular Biology

TL;DR: Improvements of the currently most popular method for prediction of classically secreted proteins, SignalP, which consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated.

RNAmmer: consistent and rapid annotation of ribosomal RNA genes

Karin Lagesen, +5 more

- 01 May 2007 -

Nucleic Acids Research

TL;DR: Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy.

References

Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features

Wolfgang Kabsch, +1 more

- 01 Dec 1983 -

Biopolymers

TL;DR: A set of simple and physically motivated criteria for secondary structure, programmed as a pattern‐recognition process of hydrogen‐bonded and geometrical features extracted from x‐ray coordinates is developed.

Improved tools for biological sequence comparison.

William R. Pearson, +1 more

- 01 Apr 1988 -

Proceedings of the National Academy of S...

TL;DR: Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.

Identification of common molecular subsequences.

Temple F. Smith, +1 more

- 25 Mar 1981 -

Journal of Molecular Biology

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

The Protein Data Bank: a computer-based archival file for macromolecular structures.

Frances C. Bernstein, +8 more

- 25 May 1977 -

Journal of Molecular Biology

TL;DR: The Protein Data Bank is a computer-based archival file for macromolecular structures that stores in a uniform format atomic co-ordinates and partial bond connectivities, as derived from crystallographic studies.

Database of homology-derived protein structures and the structural meaning of sequence alignment.

Chris Sander, +1 more

- 01 Jan 1991 -

Proteins

TL;DR: A database of homology‐derived secondary structure of proteins (HSSP) is produced by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve, effectively increasing the number of known protein structures by a factor of five to more than 1800.

Related Papers (5)

Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features

Wolfgang Kabsch, +1 more

- 01 Dec 1983 -

Biopolymers

The Protein Data Bank: a computer-based archival file for macromolecular structures.

Frances C. Bernstein, +8 more

- 25 May 1977 -

Journal of Molecular Biology

The Protein Data Bank

Helen M. Berman, +7 more

- 01 Jan 2000 -

Nucleic Acids Research

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul, +6 more

- 01 Sep 1997 -

Nucleic Acids Research

SCOP: a structural classification of proteins database for the investigation of sequences and structures.

Alexey G. Murzin, +3 more

- 07 Apr 1995 -

Journal of Molecular Biology
