Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes

doi:10.1006/JMBI.2000.4315

Journal Article•DOI•

Anders Krogh¹, B. Larsson¹, G. von Heijne², Erik L. L. Sonnhammer³•Institutions (3)

Technical University of Denmark¹, Stockholm University², Karolinska Institutet³

19 Jan 2001-Journal of Molecular Biology (J Mol Biol)-Vol. 305, Iss: 3, pp 567-580

TL;DR: A new membrane protein topology prediction method, TMHMM, based on a hidden Markov model is described and validated, and it is discovered that proteins with N(in)-C(in) topologies are strongly preferred in all examined organisms, except Caenorhabditis elegans, where the large number of 7TM receptors increases the counts for N(out)-C(in) topologies.

About: This article was published in the Journal of Molecular Biology on 2001-01-19. It has received 11,453 citations to date. The article focuses on the topics: Integral membrane protein & Membrane protein.

Citations

Journal Article•DOI•

The Pfam protein families database

[...]

Marco Punta¹, Penny Coggill¹, Ruth Y. Eberhardt¹, Jaina Mistry¹, John Tate¹, Chris Boursnell¹, Ningze Pang¹, Kristoffer Forslund¹, Goran Ceric¹, Jody Clements¹, Andreas Heger¹, Liisa Holm¹, Erik L. L. Sonnhammer¹, Sean R. Eddy¹, Alex Bateman¹, Robert D. Finn¹•Institutions (1)

Wellcome Trust Sanger Institute¹

01 Jan 2000-Nucleic Acids Research

TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.

Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations

Cites methods from "Predicting transmembrane protein to..."

...These predictions are pre-computed over the sequence database by the following third party programs: TMHMM (10) (transmembrane regions), SignalP (11) (signal peptide regions), ncoils (12) (coiled-coil regions) and SEG (9) (low complexity regions)....
[...]

Journal Article•DOI•

Improved Prediction of Signal Peptides: SignalP 3.0

[...]

Jannick Dyrløv Bendtsen¹, Henrik Nielsen¹, Gunnar von Heijne², Søren Brunak¹•Institutions (2)

Technical University of Denmark¹, Stockholm University²

16 Jul 2004-Journal of Molecular Biology

TL;DR: Improvements of the currently most popular method for prediction of classically secreted proteins, SignalP, which consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated.

6,492 citations

Journal Article•DOI•

InterProScan 5: genome-scale protein function classification

[...]

Philip Jones¹, David Binns¹, Hsin-Yu Chang¹, Matthew Fraser¹, Weizhong Li¹, Craig McAnulla¹, Hamish McWilliam¹, John Maslen¹, Alex L. Mitchell¹, Gift Nuka¹, Sebastien Pesseat¹, Antony F. Quinn¹, Amaia Sangrador-Vegas¹, Maxim Scheremetjew¹, Siew-Yit Yong¹, Rodrigo Lopez¹, Sarah Hunter¹•Institutions (1)

Wellcome Trust Sanger Institute¹

01 May 2014-Bioinformatics

TL;DR: A new Java-based architecture for the widely used protein function prediction software package InterProScan is described, resulting in a flexible and stable system that is able to use both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis.

Abstract: Motivation: Robust, large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterise many millions of sequences. Here we describe a new Java-based architecture for the widely-used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete re-implementation of the software framework, resulting in a flexible and stable system that is able to utilise both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBL-EBI FTP site and the (open) source code is hosted at Google Code. Availability: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/. Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk

5,434 citations

Cites background from "Predicting transmembrane protein to..."

...…(Haft et al., 2012), SMART (Letunic et al., 2012), PIRSF (Wu et al., 2004), Panther (Mi et al., 2012), HAMAP (Pedruzzi et al., 2012), Prosite (Sigrist et al., 2012), ProDom (Bru et al., 2005), PRINTS (Attwood et al., 2012), CATHGene3D (Lees et al., 2012) and SUPERFAMILY (De Lima…...
[...]

Book•

Graphical Models, Exponential Families, and Variational Inference

[...]

Martin J. Wainwright¹, Michael I. Jordan¹•Institutions (1)

University of California, Berkeley¹

16 Dec 2008

TL;DR: The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.

Abstract: The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Graphical models have become a focus of research in many statistical, computational and mathematical fields, including bioinformatics, communication theory, statistical physics, combinatorial optimization, signal and image processing, information retrieval and statistical machine learning. Many problems that arise in specific instances — including the key problems of computing marginals and modes of probability distributions — are best studied in the general setting. Working with exponential family representations, and exploiting the conjugate duality between the cumulant function and the entropy for exponential families, we develop general variational representations of the problems of computing likelihoods, marginal probabilities and most probable configurations. We describe how a wide variety of algorithms — among them sum-product, cluster variational methods, expectation-propagation, mean field methods, max-product and linear programming relaxation, as well as conic programming relaxations — can all be understood in terms of exact or approximate forms of these variational representations. The variational approach provides a complementary alternative to Markov chain Monte Carlo as a general source of approximation methods for inference in large-scale statistical models.

4,335 citations

Cites methods from "Predicting transmembrane protein to..."

...These and other biological facts are used to design the states and state transition matrix of the transmembrane hidden Markov model, an HMM for modeling membrane proteins [138]....
[...]

Journal Article•DOI•

Genome sequence of the human malaria parasite Plasmodium falciparum

[...]

Malcolm J. Gardner¹, Neil Hall¹, Eula Fung¹, Owen White¹, Matthew Berriman¹, Richard W. Hyman¹, Jane M. Carlton¹, Arnab Pain¹, Karen E. Nelson¹, Sharen Bowman¹, Ian T. Paulsen¹, Keith D. James¹, Jonathan A. Eisen¹, Kim Rutherford¹, Steven L. Salzberg¹, Alister Craig¹, Sue Kyes¹, Man Suen Chan¹, Vishvanath Nene¹, Shamira J. Shallom¹, Bernard B. Suh¹, Jeremy Peterson¹, Samuel V. Angiuoli¹, Mihaela Pertea¹, Jonathan E. Allen¹, Jeremy D. Selengut¹, Daniel H. Haft¹, Michael W. Mather¹, Akhil B. Vaidya¹, David M. A. Martin¹, Alan H. Fairlamb¹, Martin Fraunholz¹, David S. Roos¹, Stuart A. Ralph¹, Geoffrey I. McFadden¹, Leda M. Cummings¹, G. Mani Subramanian¹, Christopher J. Mungall¹, J. Craig Venter¹, Daniel J. Carucci¹, Stephen L. Hoffman¹, Chris I. Newbold¹, Ronald W. Davis¹, Claire M. Fraser¹, Bart Barrell¹•Institutions (1)

J. Craig Venter Institute¹

03 Oct 2002-Nature

TL;DR: The genome sequence of P. falciparum clone 3D7 is reported, which is the most (A + T)-rich genome sequenced to date and is being exploited in the search for new drugs and vaccines to fight malaria.

Abstract: The parasite Plasmodium falciparum is responsible for hundreds of millions of cases of malaria, and kills more than one million African children annually. Here we report an analysis of the genome sequence of P. falciparum clone 3D7. The 23-megabase nuclear genome consists of 14 chromosomes, encodes about 5,300 genes, and is the most (A + T)-rich genome sequenced to date. Genes involved in antigenic variation are concentrated in the subtelomeric regions of the chromosomes. Compared to the genomes of free-living eukaryotic microbes, the genome of this intracellular parasite encodes fewer enzymes and transporters, but a large proportion of genes are devoted to immune evasion and host-parasite interactions. Many nuclear-encoded proteins are targeted to the apicoplast, an organelle involved in fatty-acid and isoprenoid metabolism. The genome sequence provides the foundation for future studies of this organism, and is being exploited in the search for new drugs and vaccines to fight malaria.

4,312 citations

References

Journal Article•DOI•

A tutorial on hidden Markov models and selected applications in speech recognition

[...]

Lawrence R. Rabiner¹•Institutions (1)

Bell Labs¹

01 Feb 1989

TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.

Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.

21,819 citations

Journal Article•DOI•

Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.

[...]

Henrik Nielsen¹, Jacob Engelbrecht², Søren Brunak, G. von Heijne³•Institutions (3)

Technical University of Denmark¹, Novo Nordisk², Stockholm University³

01 Jan 1997-Protein Engineering

TL;DR: A new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence that performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets.

Abstract: We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence. The method performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.

5,480 citations

"Predicting transmembrane protein to..." refers methods in this paper

...A set of signal peptides used for training of SignalP (Nielsen et al., 1997) was used to test the discrimination between signal peptides and membrane helices....
[...]
...These proteins were analyzed with SignalP-HMM (Nielsen & Krogh, 1998), and if a signal peptide was predicted, it was removed from the protein....
[...]
...Such proteins were sent to SignalP-HMM (http://www.cbs.dtu.dk/services/SignalP-2.0/), and if a cleavage site was predicted with a probability of more than 0.5, the predicted signal peptide was cleaved off....
[...]
...This was done only for the eukaryotes and the Gram-positive and Gram-negative bacteria because SignalP is only developed for these groups of organisms (see Materials and Methods for details)....
[...]
...A preliminary test of the accuracy of SignalP-HMM reveals that about 80% of the true signal peptides are found, and 20% of transmembrane helices are mistaken for signal peptides in eukaryotes....
[...]
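
The pre-processing step quoted in the excerpts above (cleave the predicted signal peptide when the cleavage site has a probability of more than 0.5) can be sketched as follows. This is a minimal illustration, not the SignalP interface: the function name, its parameters, and the convention that `cleavage_site` is the number of residues in the predicted signal peptide are all assumptions made for the example.

```python
from typing import Optional


def strip_signal_peptide(
    sequence: str,
    cleavage_site: Optional[int],
    probability: float,
    threshold: float = 0.5,
) -> str:
    """Return the sequence with its predicted signal peptide removed.

    cleavage_site is a hypothetical convention: the number of N-terminal
    residues that make up the predicted signal peptide (None if no site
    was predicted). The peptide is removed only when the prediction
    probability is strictly greater than the threshold.
    """
    if cleavage_site is not None and probability > threshold:
        return sequence[cleavage_site:]
    return sequence


# A confident prediction removes the first four residues.
mature = strip_signal_peptide("MKKLLAAAQV", cleavage_site=4, probability=0.9)
# A prediction at exactly 0.5 is not "more than 0.5", so nothing is removed.
unchanged = strip_signal_peptide("MKKLLAAAQV", cleavage_site=4, probability=0.5)
```

The strict inequality mirrors the wording "more than 0.5" in the excerpt; the remaining sequence would then be passed on to the topology predictor.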

SHORT COMMUNICATION Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites

[...]

Henrik Nielsen, Jacob Engelbrecht, Søren Brunak

01 Jan 1997

TL;DR: In this paper, a new method for the identification of signal peptides and their cleavage sites, based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence, is presented.

Abstract: We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence. The method performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server. [...] applicable prediction methods with significant improvements in performance compared with the weight matrix method (Arrigo et al., 1991; Ladunga et al., 1991; Schneider and Wrede, 1993). Materials and methods: The data were taken from SWISS-PROT version 29 (Bairoch and Boeckmann, 1994). The data sets were divided into prokaryotic and eukaryotic entries and the prokaryotic data sets [...]

5,191 citations

Proceedings Article•

A Hidden Markov Model for Predicting Transmembrane Helices in Protein Sequences

[...]

Erik L. L. Sonnhammer¹, Gunnar von Heijne², Anders Krogh³•Institutions (3)

National Institutes of Health¹, Stockholm University², Technical University of Denmark³

01 Jul 1998

TL;DR: The transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of the sequences in a standard dataset of 83 proteins with known topology, and the same accuracy was achieved on a larger dataset of 160 proteins.

Abstract: A novel method to model and predict the location and orientation of alpha helices in membrane-spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, helix caps on either side, loop on the cytoplasmic side, two loops for the non-cytoplasmic side, and a globular domain state in the middle of each loop. The two loop paths on the non-cytoplasmic side are used to model short and long loops separately, which corresponds biologically to the two known different membrane insertion mechanisms. The close mapping between the biological and computational states allows us to infer which parts of the model architecture are important to capture the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. Models were estimated both by maximum likelihood and a discriminative method, and a method for reassignment of the membrane helix boundaries was developed. In a cross validated test on single sequences, our transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of the sequences in a standard dataset of 83 proteins with known topology. The same accuracy was achieved on a larger dataset of 160 proteins. These results compare favourably with existing methods.
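
To make the idea of a cyclic topology HMM concrete, here is a small Viterbi decoder over a deliberately simplified model. It is a toy sketch, not TMHMM: the four states (`i`, `H`, `o`, `G`), the three residue classes, and every probability below are invented for illustration, whereas the model described in the abstract has 7 state types with separate helix caps, short and long loops, and globular states.

```python
import math

# Hypothetical toy states: i = cytoplasmic loop, H = helix going in-to-out,
# o = non-cytoplasmic loop, G = helix going out-to-in. The cycle
# i -> H -> o -> G -> i forces the two sides of the membrane to alternate.
STATES = ("i", "H", "o", "G")
LABEL = {"i": "i", "H": "M", "o": "o", "G": "M"}
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "H": 0.1},
    "H": {"H": 0.9, "o": 0.1},
    "o": {"o": 0.9, "G": 0.1},
    "G": {"G": 0.9, "i": 0.1},
}
# Emissions over three residue classes (illustrative numbers only):
# h = hydrophobic, p = positively charged, x = everything else.
HELIX = {"h": 0.8, "p": 0.02, "x": 0.18}
EMIT = {
    "i": {"h": 0.2, "p": 0.3, "x": 0.5},   # loops inside are richer in K/R
    "o": {"h": 0.2, "p": 0.05, "x": 0.75},
    "H": HELIX,
    "G": HELIX,
}


def residue_class(aa):
    if aa in "AILMFVWC":
        return "h"
    if aa in "KR":
        return "p"
    return "x"


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable label string (i, M, o) for seq."""
    if not seq:
        return ""
    classes = [residue_class(aa) for aa in seq]
    score = {s: log(START.get(s, 0.0)) + log(EMIT[s][classes[0]]) for s in STATES}
    back = []
    for c in classes[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log(TRANS[r].get(s, 0.0)))
            new_score[s] = score[prev] + log(TRANS[prev].get(s, 0.0)) + log(EMIT[s][c])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join(LABEL[s] for s in path)


labels = viterbi("KRKR" + "L" * 20 + "SSSS")
```

With these toy parameters, a positively charged N-terminal stretch followed by a 20-residue hydrophobic run is labelled as an inside loop, a membrane helix, and an outside loop. The two helix states share emissions and differ only in direction, which is what lets the decoder report orientation as well as helix location.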

2,518 citations

"Predicting transmembrane protein to..." refers methods or result in this paper

...In the third and final stage of estimation, the model from stage two was further optimized using a discriminative method of estimation as described by Sonnhammer et al. (1998)....
[...]
...This compares well to other methods, the best of which use multiple alignments to achieve the same level of accuracy (Rost et al., 1996; Tusnady & Simon, 1998), see Sonnhammer et al. (1998) for comparisons....
[...]
...Here we describe a new method, TMHMM, based on a hidden Markov model (HMM) approach (a preliminary description of TMHMM has been published by Sonnhammer et al., 1998)....
[...]

Journal Article•DOI•

Membrane protein structure prediction: Hydrophobicity analysis and the positive-inside rule

[...]

Gunnar von Heijne¹•Institutions (1)

Karolinska Institutet¹

20 May 1992-Journal of Molecular Biology

TL;DR: In this paper, a new strategy for predicting the topology of bacterial inner membrane proteins is proposed on the basis of hydrophobicity analysis, automatic generation of a set of possible topologies and ranking of these according to the positive inside rule.
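
The ranking step mentioned in this summary can be illustrated with a simplified version of the positive-inside rule: loops on the cytoplasmic side tend to carry more lysine and arginine residues than loops on the other side. The sketch below only chooses between the two orientations of a fixed set of helices. The function names, the half-open `(start, end)` helix coordinates, and the tie-breaking choice are assumptions for the example, and the published method also generates and ranks alternative helix sets, which is not shown here.

```python
def loops_from_helices(seq, helices):
    """Split seq into the loops around the helices.

    helices is a sorted list of half-open (start, end) index pairs.
    Loop 0 is the N-terminal tail; successive loops alternate sides.
    """
    loops = []
    prev = 0
    for start, end in helices:
        loops.append(seq[prev:start])
        prev = end
    loops.append(seq[prev:])
    return loops


def positive_charge(segment):
    """Count lysine (K) and arginine (R) residues in a segment."""
    return sum(segment.count(aa) for aa in "KR")


def orient(seq, helices):
    """Pick the orientation that puts more K/R residues on the inside."""
    loops = loops_from_helices(seq, helices)
    same_side_as_n_term = sum(positive_charge(l) for l in loops[0::2])
    opposite_side = sum(positive_charge(l) for l in loops[1::2])
    # Ties are resolved as N-in here; that choice is arbitrary.
    return "N-in" if same_side_as_n_term >= opposite_side else "N-out"


example = "MKRK" + "L" * 20 + "SDSE" + "L" * 20 + "RRKK"
orientation = orient(example, [(4, 24), (28, 48)])
```

In the example, both terminal tails are rich in K and R while the central loop has none, so the N-terminus is placed on the inside.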

1,661 citations