An Introduction to Sequence Similarity (“Homology”) Searching

doi:10.1002/0471250953.BI0301S42

Journal Article•DOI•

William R. Pearson¹

University of Virginia¹

01 Jun 2013-Current Protocols in Bioinformatics (NIH Public Access)-Vol. 42, Iss: 1

TL;DR: This unit provides an overview of the inference of homology from significant similarity, and introduces other units in this chapter that provide more details on effective strategies for identifying homologs.

Abstract: Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify "homologous" proteins or genes by detecting excess similarity, that is, statistically significant similarity that reflects common ancestry. This unit provides an overview of the inference of homology from significant similarity, and introduces other units in this chapter that provide more details on effective strategies for identifying homologs.

Citations

Journal Article•DOI•

DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data.

Gustavo Arango-Argoty¹, Emily Garner¹, Amy Pruden¹, Lenwood S. Heath¹, Peter J. Vikesland¹, Liqing Zhang¹

Virginia Tech¹

01 Feb 2018-Microbiome

TL;DR: The deep learning models developed here offer more accurate antimicrobial resistance annotation relative to current bioinformatics practice, and DeepARG does not require strict cutoffs, which enables identification of a much broader diversity of ARGs.

Abstract: Growing concerns about increasing rates of antibiotic resistance call for expanded and comprehensive global monitoring. Advancing methods for monitoring of environmental media (e.g., wastewater, agricultural waste, food, and water) is especially needed for identifying potential resources of novel antibiotic resistance genes (ARGs), hot spots for gene exchange, and as pathways for the spread of ARGs and human exposure. Next-generation sequencing now enables direct access and profiling of the total metagenomic DNA pool, where ARGs are typically identified or predicted based on the “best hits” of sequence searches against existing databases. Unfortunately, this approach produces a high rate of false negatives. To address such limitations, we propose here a deep learning approach, taking into account a dissimilarity matrix created using all known categories of ARGs. Two deep learning models, DeepARG-SS and DeepARG-LS, were constructed for short read sequences and full gene length sequences, respectively. Evaluation of the deep learning models over 30 antibiotic resistance categories demonstrates that the DeepARG models can predict ARGs with both high precision (> 0.97) and recall (> 0.90). The models displayed an advantage over the typical best hit approach, yielding consistently lower false negative rates and thus higher overall recall (> 0.9). As more data become available for under-represented ARG categories, the DeepARG models’ performance can be expected to be further enhanced due to the nature of the underlying neural networks. Our newly developed ARG database, DeepARG-DB, encompasses ARGs predicted with a high degree of confidence and extensive manual inspection, greatly expanding current ARG repositories. The deep learning models developed here offer more accurate antimicrobial resistance annotation relative to current bioinformatics practice. DeepARG does not require strict cutoffs, which enables identification of a much broader diversity of ARGs. 
The DeepARG models and database are available as a command line version and as a Web service at http://bench.cs.vt.edu/deeparg.

402 citations

Journal Article•DOI•

Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota

Simone Rampelli¹, Stephanie L. Schnorr², Clarissa Consolandi³, Silvia Turroni¹, Marco Severgnini³, Clelia Peano³, Patrizia Brigidi¹, Alyssa N. Crittenden⁴, Amanda G. Henry², Marco Candela¹

University of Bologna¹, Max Planck Society², National Research Council³, University of Nevada, Las Vegas⁴

29 Jun 2015-Current Biology

TL;DR: The first metagenomic analysis of the gut microbiota (GM) of Hadza hunter-gatherers of Tanzania shows a unique enrichment in metabolic pathways that aligns with the dietary and environmental factors characteristic of their foraging lifestyle, providing a better understanding of the versatility of human life and subsistence.

310 citations

Journal Article•DOI•

Consensus protein design

Benjamin T. Porebski¹, Ashley M. Buckle²

Laboratory of Molecular Biology¹, Monash University, Clayton campus²

01 Jul 2016-Protein Engineering Design & Selection

TL;DR: This paper reviews the consensus design approach, including its theoretical underpinnings, successes, limitations and challenges, and provides a detailed guide to its application in protein engineering.

Abstract: A popular and successful strategy in semi-rational design of protein stability is the use of evolutionary information encapsulated in homologous protein sequences. Consensus design is based on the hypothesis that at a given position, the respective consensus amino acid contributes more than average to the stability of the protein than non-conserved amino acids. Here, we review the consensus design approach, its theoretical underpinnings, successes, limitations and challenges, as well as providing a detailed guide to its application in protein engineering.
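The core step of consensus design, picking the most common residue at each position of a set of aligned homologs, can be sketched as follows. This is only an illustration under stated assumptions, not the procedure from the review: the function name `consensus_sequence`, the toy alignment, the `-` gap character, and the alphabetical tie-breaking rule are all hypothetical choices, and a real workflow would start from a curated multiple sequence alignment.

```python
from collections import Counter


def consensus_sequence(aligned, gap="-"):
    """Return the most common non-gap residue in each alignment column.

    `aligned` is a list of equal-length strings (rows of an alignment).
    The gap symbol and the tie-breaking rule are illustrative assumptions.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            # Column contains only gaps: keep the gap symbol.
            consensus.append(gap)
            continue
        # Highest count wins; ties are broken alphabetically so the
        # result is deterministic.
        best = min(counts, key=lambda res: (-counts[res], res))
        consensus.append(best)
    return "".join(consensus)


# Hypothetical toy alignment of four short homologous fragments.
toy_alignment = ["MKTAY", "MKSAY", "MRTA-", "MKTGY"]
print(consensus_sequence(toy_alignment))
```

A plain majority vote like this ignores sequence weighting and phylogenetic bias in the input set, so it should be read as a starting point rather than a complete design method.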

155 citations

Cites background from "An Introduction to Sequence Similar..."

...Unfortunately, MSA methods tend to vary significantly, and there is currently no quantitative measure for the quality of alignment (Nuin et al., 2006; Kemena and Notredame, 2009; Pearson, 2013)....
[...]
...Difficulties arise with MSAs containing sequences of varying length, or when there are clusters of sequences that are locally, but not globally, homologous (Rost, 1999; Pearson, 2013)....
[...]

Journal Article•DOI•

Engineering a Rugged Nanoscaffold to Enhance Plug-and-Display Vaccination

Theodora U. J. Bruun¹, Anne-Marie C. Andersson¹, Simon J. Draper¹, Mark Howarth¹

University of Oxford¹

20 Jul 2018-ACS Nano

TL;DR: SpyCatcher-mi3 nanoparticles showed high stability to temperature, freeze–thaw, lyophilization, and storage over time, and should facilitate broad application for nanobiotechnology and vaccine development.

Abstract: Nanoscale organization is crucial to stimulating an immune response. Using self-assembling proteins as multimerization platforms provides a safe and immunogenic system to vaccinate against otherwise weakly immunogenic antigens. Such multimerization platforms are generally based on icosahedral viruses and have led to vaccines given to millions of people. It is unclear whether synthetic protein nanoassemblies would show similar potency. Here we take the computationally designed porous dodecahedral i301 60-mer and rationally engineer this particle, giving a mutated i301 (mi3) with improved particle uniformity and stability. To simplify the conjugation of this nanoparticle, we employ a SpyCatcher fusion of mi3, such that an antigen of interest linked to the SpyTag peptide can spontaneously couple through isopeptide bond formation (Plug-and-Display). SpyCatcher-mi3 expressed solubly to high yields in Escherichia coli, giving more than 10-fold greater yield than a comparable phage-derived icosahedral nanopartic...

148 citations

Book Chapter•DOI•

Phage Lysis: Multiple Genes for Multiple Barriers.

Jesse Cahill¹, Ry Young¹

Texas A&M University¹

01 Jan 2019-Advances in Virus Research

TL;DR: Evidence is reviewed supporting a model in which the spanins function by fusing the inner membrane and outer membrane, and it is proposed that spanin function is inhibited by the meshwork of the peptidoglycan, thus coupling the spanin step to the first two steps mediated by the holin and endolysin.

Abstract: The first steps in phage lysis involve a temporally controlled permeabilization of the cytoplasmic membrane followed by enzymatic degradation of the peptidoglycan. For Caudovirales of Gram-negative hosts, there are two different systems: the holin-endolysin and pinholin-SAR endolysin pathways. In the former, lysis is initiated when the holin forms micron-scale holes in the inner membrane, releasing active endolysin into the periplasm to degrade the peptidoglycan. In the latter, lysis begins when the pinholin causes depolarization of the membrane, which activates the secreted SAR endolysin. Historically, the disruption of the first two barriers of the cell envelope was thought to be necessary and sufficient for lysis of Gram-negative hosts. However, recently a third functional class of lysis proteins, the spanins, has been shown to be required for outer membrane disruption. Spanins are so named because they form a protein bridge that connects both membranes. Most phages produce a two-component spanin complex, composed of an outer membrane lipoprotein (o-spanin) and an inner membrane protein (i-spanin) with a predominantly coiled-coil periplasmic domain. Some phages have a different type of spanin which spans the periplasm as a single molecule, by virtue of an N-terminal lipoprotein signal and a C-terminal transmembrane domain. Evidence is reviewed supporting a model in which the spanins function by fusing the inner membrane and outer membrane. Moreover, it is proposed that spanin function is inhibited by the meshwork of the peptidoglycan, thus coupling the spanin step to the first two steps mediated by the holin and endolysin.

140 citations

References

Journal Article•DOI•

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul¹, Thomas L. Madden, Alejandro A. Schäffer¹, Jinghui Zhang, Zheng Zhang², Webb Miller², David J. Lipman

National Institutes of Health¹, Pennsylvania State University²

01 Sep 1997-Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal Article•DOI•

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Robert C. Edgar

01 Mar 2004-Nucleic Acids Research

TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using k-mer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.

Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using k-mer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

37,524 citations

"An Introduction to Sequence Similar..." refers methods in this paper

...More recent multiple sequence alignment methods, like MAFFT (Katoh et al., 2002) and MUSCLE (Edgar, 2004), use iterative approaches that allow gaps to be re-positioned....
[...]
...MUSCLE: Multiple sequence alignment with high accuracy and high throughput....
[...]

Journal Article•DOI•

Clustal W and Clustal X version 2.0

Mark A. Larkin¹, Gordon Blackshields², Nigel P. Brown², R. Chenna², Paul A. McGettigan², Hamish McWilliam², Franck Valentin², Iain M. Wallace², Andreas Wilm², Rodrigo Lopez², J.D. Thompson², Toby J. Gibson², Desmond G. Higgins²

University College Dublin¹, European Bioinformatics Institute²

01 Nov 2007-Bioinformatics

TL;DR: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++, which facilitates further development of the alignment algorithms and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems.

Abstract: Summary: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. Availability: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/ Contact: clustalw@ucd.ie

25,325 citations

"An Introduction to Sequence Similar..." refers methods in this paper

...During the 1980s, progressive alignment strategies, like ClustalW (Larkin et al., 2007; UNIT 2.3) were developed that simplified the problem to O(n²l²), where n is the number of sequences, and l is their average length....
[...]

Journal Article•DOI•

Improved tools for biological sequence comparison.

William R. Pearson¹, David J. Lipman

University of Virginia¹

01 Apr 1988-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.

Abstract: We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.

12,432 citations

"An Introduction to Sequence Similar..." refers background in this paper

...Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms....
[...]
...BLAST, FASTA, SSEARCH, and other commonly used similarity searching programs produce accurate statistical estimates that can be used to reliably infer homology....
[...]
...However, the probability p(b) is not what is reported by BLAST, FASTA, or SSEARCH, because it reflects the probability of the score in a single pairwise alignment....
[...]
...BLAST, SSEARCH, FASTA, and HMMER calculate local sequence alignments; local alignments identify the most similar region between two sequences....
[...]
...Homology (common ancestry and similar structure) can be reliably inferred from statistically significant similarity in a BLAST, FASTA, SSEARCH, or HMMER search, but to infer that two proteins are homologous does not guarantee that every part of one protein has a homolog in the other....
[...]
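The excerpts above distinguish the probability p(b) of a bit score in a single pairwise alignment from the database-corrected value that search programs report. A rough sketch of that correction follows. It assumes the commonly used extreme-value approximation p(b) = 1 - exp(-mn·2^-b) for two sequences of lengths m and n, and an expectation value E = D·p(b) for a database of D sequences. These formulas do not appear in the excerpts, and the function names and example numbers are hypothetical.

```python
import math


def pairwise_probability(bit_score, query_len, subject_len):
    """Approximate p(b): chance of a score >= bit_score in ONE alignment.

    Assumes the extreme-value form 1 - exp(-m * n * 2**-b); this is an
    approximation chosen for illustration, not a value printed by any tool.
    """
    search_space = query_len * subject_len
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-search_space * 2.0 ** (-bit_score))


def expectation_value(bit_score, query_len, subject_len, db_size):
    """Scale the single-alignment probability by the number of database
    sequences compared (assumed correction: E = D * p(b))."""
    return db_size * pairwise_probability(bit_score, query_len, subject_len)


# Hypothetical example: a 40-bit score between two 200-residue proteins,
# searched against a database of one million sequences.
p = pairwise_probability(40, 200, 200)
e = expectation_value(40, 200, 200, 1_000_000)
print(f"p(b) = {p:.3g}, E = {e:.3g}")
```

The point of the sketch is that the same alignment score becomes less significant as the database grows, because every additional comparison is another chance for a high score to arise by chance.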

Journal Article•DOI•

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

Kazutaka Katoh¹, Kazuharu Misawa, Kei-ichi Kuma¹, Takashi Miyata¹

Kyoto University¹

15 Jul 2002-Nucleic Acids Research

TL;DR: A simplified scoring system is proposed that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length.

Abstract: A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.

12,003 citations

"An Introduction to Sequence Similar..." refers methods in this paper

...More recent multiple sequence alignment methods, like MAFFT (Katoh et al., 2002) and MUSCLE (Edgar, 2004), use iterative approaches that allow gaps to be re-positioned....
[...]
...MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform....
[...]