scispace - formally typeset

Showing papers on "Smith–Waterman algorithm" published in 2011


Journal ArticleDOI
TL;DR: Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before.
Abstract: The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.

223 citations
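
The GCUPS figure quoted in the SWIPE abstract above is a throughput measure: the number of dynamic-programming cells filled (query length times total database residues) divided by the run time, in billions per second. The following is a minimal sketch of that arithmetic. The function name `gcups` and the numbers in the usage line are hypothetical and are not taken from the paper.

```python
def gcups(query_length: int, database_residues: int, seconds: float) -> float:
    """Return billions of cell updates per second for one database search."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # One dynamic-programming cell is filled per (query residue, database residue) pair.
    cells = query_length * database_residues
    return cells / seconds / 1e9


# Hypothetical figures: a 375-residue query against 1e9 database residues in 3.75 s.
print(gcups(375, 1_000_000_000, 3.75))
```

Because the cell count scales with query length, GCUPS values are only comparable between tools when the query length and database are stated, as the abstract does.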


Journal ArticleDOI
TL;DR: A dynamic-programming algorithm, called AGE for Alignment with Gap Excision, finds the optimal solution by simultaneously aligning the 5′ and 3′ ends of two given sequences and introducing a ‘large-gap jump’ between the local end alignments to maximize the total alignment score.
Abstract: Motivation: Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is an important problem, as it is a prerequisite for classifying SVs, evaluating their functional impact and reconstructing personal genome sequences. Given approximate breakpoint locations and a bridging assembly or split read, the problem essentially reduces to finding a correct sequence alignment. Classical algorithms for alignment and their generalizations guarantee finding the optimal (in terms of scoring) global or local alignment of two sequences. However, they cannot generally be applied to finding the biologically correct alignment of genomic sequences containing SVs because of the need to simultaneously span the SV (e.g. make a large gap) and perform precise local alignments at the flanking ends. Results: Here, we formulate the computations involved in this problem and describe a dynamic-programming algorithm for its solution. Specifically, our algorithm, called AGE for Alignment with Gap Excision, finds the optimal solution by simultaneously aligning the 5′ and 3′ ends of two given sequences and introducing a ‘large-gap jump’ between the local end alignments to maximize the total alignment score. We also describe extensions allowing the application of AGE to tandem duplications, inversions and complex events involving two large gaps. We develop a memory-efficient implementation of AGE (allowing application to long contigs) and make it available as a downloadable software package. Finally, we applied AGE for breakpoint determination and standardization in the 1000 Genomes Project by aligning locally assembled contigs to the human genome. Availability and Implementation: AGE is freely available at http://sv.gersteinlab.org/age. Contact: pi@gersteinlab.org Supplementary information: Supplementary data are available at Bioinformatics online.

99 citations


Journal ArticleDOI
TL;DR: Criteria allowing to specify conditions of preferred applicability for the local and the global alignment algorithms depending on positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned are revealed.
Abstract: Algorithms of sequence alignment are the key instruments for computer-assisted studies of biopolymers. Obviously, it is important to take into account the "quality" of the obtained alignments, i.e. how closely the algorithms manage to restore the "gold standard" alignment (GS-alignment), which superimposes positions originating from the same position in the common ancestor of the compared sequences. As an approximation of the GS-alignment, a 3D-alignment is commonly used not quite reasonably. Among the currently used algorithms of a pair-wise alignment, the best quality is achieved by using the algorithm of optimal alignment based on affine penalties for deletions (the Smith-Waterman algorithm). Nevertheless, the expedience of using local or global versions of the algorithm has not been studied. Using model series of amino acid sequence pairs, we studied the relative "quality" of results produced by local and global alignments versus (1) the relative length of similar parts of the sequences (their "cores") and their nonhomologous parts, and (2) relative positions of the core regions in the compared sequences. We obtained numerical values of the average quality (measured as accuracy and confidence) of the global alignment method and the local alignment method for evolutionary distances between homologous sequence parts from 30 to 240 PAM and for the core length making from 10% to 70% of the total length of the sequences for all possible positions of homologous sequence parts relative to the centers of the sequences. We revealed criteria allowing to specify conditions of preferred applicability for the local and the global alignment algorithms depending on positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned. 
It was demonstrated that when the core part of one sequence was positioned above the core of the other sequence, the global algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the local algorithm. On the contrary, when the cores were positioned asymmetrically, the local algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the global algorithm. This opens a possibility for creation of a combined method allowing generation of more accurate alignments.

62 citations


Proceedings ArticleDOI
16 May 2011
TL;DR: This paper proposes and evaluates a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity, and proposes optimizations that are able to reduce significantly the amount of data processed and that enforce full parallelism most of the time.
Abstract: Cross-species chromosome alignments can reveal ancestral relationships and may be used to identify the peculiarities of the species. It is thus an important problem in Bioinformatics. So far, aligning huge sequences, such as whole chromosomes, with exact methods has been regarded as unfeasible, due to huge computing and memory requirements. However, high performance computing platforms such as GPUs are changing this scenario, making it possible to obtain the exact result for huge sequences in reasonable time. In this paper, we propose and evaluate a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations that are able to reduce significantly the amount of data processed and that enforce full parallelism most of the time. Using the GTX 285 Board, our algorithm was able to produce the optimal alignment between sequences composed of 33 million base pairs (MBP) and 47 MBP in 18.5 hours.

52 citations
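
The linear space complexity mentioned in the abstract above rests on the observation that each row of the Smith-Waterman matrix depends only on the row before it. The following is a minimal CPU sketch of that idea, which computes only the best local score while keeping two rows in memory. It is not the authors' GPU algorithm and does not include the Myers-Miller step that recovers the alignment itself. The function name and the scoring values (match +1, mismatch −1, linear gap −2) are assumptions chosen for illustration.

```python
def sw_score_linear_space(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b, keeping only two DP rows."""
    previous = [0] * (len(b) + 1)  # row i-1 of the scoring matrix
    best = 0
    for ch_a in a:
        current = [0]  # column 0 is always 0 in local alignment
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            score = max(0, diagonal, up, left)  # 0 restarts the local alignment
            current.append(score)
            best = max(best, score)
        previous = current  # discard the older row
    return best


print(sw_score_linear_space("TTACGTTT", "GGACGGG"))
```

Memory use grows with the length of `b` only, rather than with the product of both lengths, which is what makes chromosome-scale inputs conceivable at all.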


Proceedings ArticleDOI
16 May 2011
TL;DR: The development of the kernel is described as a series of incremental changes that provide insight into a number of issues that must be considered when developing any algorithm for the CUDA architecture and shows that the use of the intra-task kernel substantially improves the overall performance of CUDASW++.
Abstract: CUDASW++ is a parallelization of the Smith-Waterman algorithm for CUDA graphical processing units that computes the similarity scores of a query sequence paired with each sequence in a database. The algorithm uses one of two kernel functions to compute the score between a given pair of sequences: the inter-task kernel or the intra-task kernel. We have identified the intra-task kernel as a major bottleneck in the CUDASW++ algorithm. We have developed a new intra-task kernel that is faster than the original intra-task kernel used in CUDASW++. We describe the development of our kernel as a series of incremental changes that provide insight into a number of issues that must be considered when developing any algorithm for the CUDA architecture. We analyze the performance of our kernel compared to the original and show that the use of our intra-task kernel substantially improves the overall performance of CUDASW++ on the order of three to four giga-cell updates per second on various benchmark databases.

38 citations


Journal ArticleDOI
TL;DR: This paper presents a high performance protein sequence alignment implementation for Graphics Processing Units (GPUs) and it achieves a performance of 21.4 Giga Cell Updates Per Second (GCUPS), which is 1.13 times better than the fastest GPU implementation to date.
Abstract: The Smith-Waterman (S-W) algorithm is an optimal sequence alignment method for biological databases, but its computational complexity makes it too slow for practical purposes. Heuristic-based approximate methods like FASTA and BLAST provide faster solutions but at the cost of reduced accuracy. Also, the expanding volume and varying lengths of sequences necessitate performance-efficient restructuring of these databases. Thus, to come up with an accurate and fast solution, it is highly desirable to speed up the S-W algorithm. This paper presents a high performance protein sequence alignment implementation for Graphics Processing Units (GPUs). The new implementation improves performance by optimizing the database organization and reducing the number of memory accesses to eliminate bandwidth bottlenecks. The implementation is called Database Optimized Protein Alignment (DOPA) and it achieves a performance of 21.4 Giga Cell Updates Per Second (GCUPS), which is 1.13 times better than the fastest GPU implementation to date. In the new GPU-based implementation for protein sequence alignment (DOPA), the database is organized in equal length sequence sets. This equally distributes the workload among all the threads on the GPU's multiprocessors. The result is an improved performance which is better than the fastest available GPU implementation.

27 citations
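
The DOPA abstract above attributes its speed-up to organizing the database into sets of sequences of equal length, so that threads working in lock-step finish together. The abstract does not give the actual layout, so the following is only a sketch of the general idea: sort sequences by length, cut them into fixed-size sets, and count how many cells a lock-step batch would process if every set were padded to its longest member. The function names and the `set_size` parameter are hypothetical.

```python
from typing import List


def build_length_sorted_sets(sequences: List[str], set_size: int) -> List[List[str]]:
    """Sort sequences by length and cut them into consecutive sets of set_size."""
    if set_size <= 0:
        raise ValueError("set_size must be positive")
    ordered = sorted(sequences, key=len)
    return [ordered[i:i + set_size] for i in range(0, len(ordered), set_size)]


def padded_cells(sets: List[List[str]]) -> int:
    """Residues processed if each set is padded to the length of its longest member."""
    return sum(len(group) * max(len(seq) for seq in group) for group in sets)


database = ["A" * 10, "A" * 2, "A" * 9, "A" * 1]
unsorted_sets = [database[i:i + 2] for i in range(0, len(database), 2)]
sorted_sets = build_length_sorted_sets(database, 2)
print(padded_cells(unsorted_sets), padded_cells(sorted_sets))
```

In this toy database, grouping by length reduces the padded work, since short sequences no longer wait on long ones within the same set.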


Journal ArticleDOI
TL;DR: This article presents a novel accurate and efficient greedy, graph-based algorithm for the alignment of multiple homologous genomic segments, represented as ordered gene lists, and proves to be sufficiently fast for large datasets including a few dozen eukaryotic genomes.
Abstract: Motivation: Many comparative genomics studies rely on the correct identification of homologous genomic regions using accurate alignment tools. In such cases, the alphabet of the input sequences consists of complete genes, rather than nucleotides or amino acids. As optimal multiple sequence alignment is computationally impractical, a progressive alignment strategy is often employed. However, such an approach is susceptible to the propagation of alignment errors in early pairwise alignment steps, especially when dealing with strongly diverged genomic regions. In this article, we present a novel accurate and efficient greedy, graph-based algorithm for the alignment of multiple homologous genomic segments, represented as ordered gene lists. Results: Based on provable properties of the graph structure, several heuristics are developed to resolve local alignment conflicts that occur due to gene duplication and/or rearrangement events on the different genomic segments. The performance of the algorithm is assessed by comparing the alignment results of homologous genomic segments in Arabidopsis thaliana to those obtained by using both a progressive alignment method and an earlier graph-based implementation. Especially for datasets that contain strongly diverged segments, the proposed method achieves a substantially higher alignment accuracy, and proves to be sufficiently fast for large datasets including a few dozen eukaryotic genomes. Availability: http://bioinformatics.psb.ugent.be/software. The algorithm is implemented as a part of the i-ADHoRe 3.0 package. Contact: yves.vandepeer@psb.vib-ugent.be Supplementary information: Supplementary data are available at Bioinformatics online.

26 citations


Proceedings ArticleDOI
02 May 2011
TL;DR: This work introduces a novel and efficient technique to improve the performance of database applications by using a Hybrid GPU/CPU platform, and solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU.
Abstract: Many database applications, such as sequence comparison, sequence searching, and sequence matching, process large database sequences. We introduce a novel and efficient technique to improve the performance of database applications by using a Hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm. The experimental results show that our Hybrid GPU/CPU technique improves the average performance by a factor of 2.2, and improves the peak performance by a factor of 2.8 when compared to earlier implementations.

22 citations


Proceedings ArticleDOI
10 May 2011
TL;DR: In this article, the authors propose a hybrid Smith-Waterman algorithm that integrates the state-of-the-art CPU and GPU solutions for accelerating the Smith-Waterman algorithm, in which the GPU acts as a co-processor and shares the workload with the CPU, enabling them to realize remarkable performance of over 70 GCUPS resulting from simultaneous CPU-GPU execution.
Abstract: This paper describes the approach and the speedup obtained in performing Smith-Waterman database searches on heterogeneous platforms comprising multi-core CPU and multi-GPU systems. Most of the advanced and optimized Smith-Waterman algorithm versions have demonstrated remarkable speedup over NCBI BLAST versions, viz., SWPS3 based on x86 SSE2 instructions and the CUDASW++ v2.0 CUDA implementation on GPU. This work proposes a hybrid Smith-Waterman algorithm that integrates the state-of-the-art CPU and GPU solutions for accelerating the Smith-Waterman algorithm, in which the GPU acts as a co-processor and shares the workload with the CPU, enabling us to realize remarkable performance of over 70 GCUPS resulting from simultaneous CPU-GPU execution. In this work, both CPU and GPU are graded equally in performance for Smith-Waterman, in contrast to previous approaches of porting the computationally intensive portions onto the GPUs or a naive multi-core CPU approach.

15 citations


Proceedings ArticleDOI
21 Feb 2011
TL;DR: The adaptive local alignment is more sensitive than that of the previous local alignments that used a fixed similarity matrix, and the performance of the adaptive local alignment is superior to Greedy-String Tiling for detecting various plagiarism cases.
Abstract: This paper proposes a new method for detecting plagiarized pairs of source codes among a large set of source codes. The typical algorithms for detecting code plagiarism, which are largely exploited up to now, are based on Greedy-String Tiling or on local alignments of the two strings. This paper introduces a variant of the local alignment, namely, the adaptive local alignment, which exploits an adaptive similarity matrix. Each entry of the adaptive similarity matrix is the logarithm of the probabilities of the keywords based on the frequencies in a given set of programs. We experimented with this method using a set of programs submitted to more than 10 real programming contests. According to the experimental results, the distribution of the adaptive local alignment is more sensitive than that of the previous local alignments that used a fixed similarity matrix (+1 for match, −1 for mismatch, and −2 for gap), and the performance of the adaptive local alignment is superior to Greedy-String Tiling for detecting various plagiarism cases.

14 citations
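
The fixed similarity matrix that the plagiarism-detection abstract above uses as its baseline (+1 for match, −1 for mismatch, −2 for gap) can be illustrated with a plain Smith-Waterman local alignment over token sequences. The sketch below shows that baseline only, not the paper's adaptive similarity matrix. The function name, the use of `None` for gaps, and the token lists in the usage lines are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Aligned = List[Optional[str]]


def smith_waterman(a: Sequence[str], b: Sequence[str], match: int = 1,
                   mismatch: int = -1, gap: int = -2) -> Tuple[int, Aligned, Aligned]:
    """Return (best score, aligned part of a, aligned part of b); None marks a gap."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,
                              score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a: Aligned = []
    out_b: Aligned = []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return best, out_a[::-1], out_b[::-1]


tokens_a = ["int", "i", "=", "0", ";"]
tokens_b = ["x", "int", "i", "=", "1", ";"]
print(smith_waterman(tokens_a, tokens_b))
```

An adaptive scheme of the kind the abstract describes would replace the constant `match` and `mismatch` values with per-token weights derived from token frequencies.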


Book ChapterDOI
01 Jan 2011
TL;DR: The introduction of the new Fermi architecture significantly improved performance of the naive version of the Smith–Waterman algorithm, which suggests that automatic porting of applications to CUDA will have a better chance of success than for the previous generations of CUDA-enabled chips.
Abstract: Publisher Summary This chapter presents how the dynamic programming-based Smith–Waterman (SW) algorithm for protein sequence database scanning can be optimized on GPUs. Starting from a basic CUDA implementation, discussions are presented on several optimization techniques using shared memory, registers, loop unrolling, and CPU/GPU partitioning. The combination of these techniques leads to a fivefold performance improvement on the same hardware. Smith–Waterman is one of the most popular algorithms in bioinformatics, and therefore, the optimization techniques presented in this chapter are beneficial and instructive to researchers in this area. Because of the importance of SW in bioinformatics, there have been several attempts to improve its performance using a variety of parallel architectures. The highest performance of the multithreaded SSE2-vectorized CPU version is about 15 GCUPS on a modern quad-core CPU. This is similar to the performance of the best-optimized version of the algorithm described in this chapter. Even though the optimal alignment scores of the SW algorithm can be used to detect related sequences, the scores are biased by sequence length and composition. The Z-value has been proposed to estimate the statistical significance of these scores. The conclusion from this little experiment is that the introduction of the new Fermi architecture significantly improved performance of the naive version, which suggests that automatic porting of applications to CUDA will have a better chance of success than for the previous generations of CUDA-enabled chips. Nevertheless, the code optimized by hand still achieves more than a five-fold speedup in comparison with a naive port.

Book ChapterDOI
12 Sep 2011
TL;DR: The SITEBLAST algorithm (Michael et al., 2005) employs the Aho-Corasick algorithm to retrieve all motif anchors for a local alignment procedure for genomic sequences that makes use of prior knowledge.
Abstract: Multiple pattern matching is the computationally intensive kernel of many applications including information retrieval and intrusion detection systems, web and spam filters and virus scanners. The use of multiple pattern matching is very important in genomics where the algorithms are frequently used to locate nucleotide or amino acid sequence patterns in biological sequence databases. For example, when proteomics data is used for genome annotation in a process called proteogenomic mapping (Jaffe et al., 2004), a set of peptide identifications obtained using mass spectrometry is matched against a target genome translated in all six reading frames. Given a sequence database (or text) $T = t_1 t_2 \ldots t_n$ of length $n$ and a finite set of $r$ patterns $P = \{p^1, p^2, \ldots, p^r\}$, where each $p^i$ is a string $p^i = p^i_1 p^i_2 \ldots p^i_m$ of length $m$ over a finite character set $\Sigma$, the multiple pattern matching problem can be defined as the way to locate all the occurrences of any of the patterns in the sequence database. The naive solution to this problem is to perform $r$ separate searches with one of the sequential algorithms (Navarro & Raffinot, 2002). While frequently used in the past, this technique is not efficient when a large pattern set is involved. The aim of all multiple pattern matching algorithms is to locate the occurrences of all patterns with a single pass of the sequence database. These algorithms are based on single-pattern matching algorithms, with some of their functions generalized to process multiple patterns simultaneously during the preprocessing phase, generally with the use of trie structures or hashing. Multiple pattern matching is widely used in computational biology for a variety of pattern matching tasks. Brudno and Morgenstern used a simplified version of the Aho-Corasick algorithm to identify anchor points in their CHAOS algorithm for fast alignment of large genomic sequences (Brudno & Morgenstern, 2002; Brudno et al., 2004). Hyyro et al. demonstrated that Aho-Corasick outperforms other algorithms for locating unique oligonucleotides in the yeast genome (Hyyro et al., 2005). The SITEBLAST algorithm (Michael et al., 2005) employs the Aho-Corasick algorithm to retrieve all motif anchors for a local alignment procedure for genomic sequences that makes use of prior knowledge. Buhler Parallel Processing of Multiple Pattern Matching Algorithms for Biological Sequences: Methods and Performance Results
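
The single-pass, trie-based approach described in the abstract above can be sketched with a compact Aho-Corasick automaton. This is a minimal illustration, not the implementation used by CHAOS, SITEBLAST, or the chapter itself. The function names are hypothetical, and unlike the formal definition above, the sketch accepts patterns of differing lengths.

```python
from collections import deque
from typing import Dict, List, Tuple


def build_automaton(patterns: List[str]):
    """Build the trie (goto), failure links, and per-state output lists."""
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    output: List[List[str]] = [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first pass: a state's failure link points to the longest proper
    # suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text: str, patterns: List[str]) -> List[Tuple[int, str]]:
    """Return (start index, pattern) for every occurrence, in one pass over text."""
    goto, fail, output = build_automaton(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # fall back to a shorter matching suffix
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            hits.append((pos - len(pattern) + 1, pattern))
    return hits


print(sorted(find_all("GATTACA", ["ATT", "TA", "ACA", "A"])))
```

The text is scanned once regardless of how many patterns are supplied, which is the property that distinguishes this approach from running r separate single-pattern searches.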

Proceedings Article
15 Sep 2011
TL;DR: A new parallelization strategy (HI-M) of Smith-Waterman algorithm on a multi-core cluster is presented, configuring a pipeline with a hybrid communication model and compared with two previously presented parallel solutions.
Abstract: DNA sequence alignment is one of the most important operations of computational biology. In 1981, Smith and Waterman developed a method for local sequence alignment. Due to its computational power and memory requirements, various heuristics have been developed to reduce execution time at the expense of a loss of accuracy in the result. As a consequence, heuristics do not ensure that the best alignment is found. For this reason, it is interesting to study how to apply the computing power of different parallel platforms to speed up the sequence alignment process without losing result accuracy. In this article, a new parallelization strategy (HI-M) of the Smith-Waterman algorithm on a multi-core cluster is presented, configuring a pipeline with a hybrid communication model. Additionally, a performance analysis is carried out and compared with two previously presented parallel solutions. Finally, experimental results are presented, as well as future research lines.

Proceedings ArticleDOI
Fang Zheng, Xianbin Xu, Yuanhua Yang, Shuibing He, Yuping Zhang
21 Oct 2011
TL;DR: A multi-threaded parallel design and implementation of the Smith-Waterman (SW) algorithm on CUDA to reduce execution time; results show that this implementation achieves better performance than other parallel implementations on the Graphics Processing Unit.
Abstract: In this paper, we have used a Compute Unified Device Architecture (CUDA) GPU to accelerate pairwise sequence alignment using the Smith-Waterman (SW) algorithm. Smith-Waterman is by far the best algorithm for accuracy in similarity scoring, but its execution time in sequence alignment is too long. We therefore describe a multi-threaded parallel design and implementation of Smith-Waterman on CUDA to reduce execution time. Following the architecture of CUDA, we divided the computation of a whole pairwise sequence alignment scoring matrix into multiple sub-matrices, using 32 threads to process one sub-matrix; moreover, we optimized the memory distribution scheme and used reduction to find the maximum element of the alignment scoring matrix. We ran the algorithm on a GeForce 9600 GT connected to a Windows XP 64-bit system. The results show that this implementation achieves better performance than other parallel implementations on the Graphics Processing Unit.

Proceedings ArticleDOI
16 Jun 2011
TL;DR: Investigation is made of the performance parameters of computing similarity indexes between query sequences and a reference sequence using the suggested parallel programming models and experimental analyses are aimed at searching for similarities of the human gamma interferon protein and influenza virus.
Abstract: The paper presents parallel computational models of Smith-Waterman algorithm for CPU and GPU. An investigation is made of the performance parameters of computing similarity indexes between query sequences and a reference sequence using the suggested parallel programming models. Implementations for GPU based sequence alignment using nVIDIA CUDA and OpenCL as well as CPU based sequence alignment using OpenMP multithreaded implementation are presented. The experimental analyses are aimed at searching for similarities of the human gamma interferon protein and influenza virus.

11 Jun 2011
TL;DR: The two basic alignment algorithms i.e. Smith Waterman for local alignment and Needleman Wunsch for global alignment have been developed and simulated using MATLAB for genome analysis and sequence alignment.
Abstract: Biological sequence alignment is a widely used operation in the field of Bioinformatics and computational biology, as it is used to determine the similarity between biological sequences. The two basic alignment algorithms, i.e. Smith-Waterman for local alignment and Needleman-Wunsch for global alignment, have been used in this paper. The algorithms have been developed and simulated using MATLAB for genome analysis and sequence alignment. The local and global alignments are presented and the results are shown in the form of dot plots and local and global scores for the sequences. The proposed work is a useful tool that can aid in the exploration, interpretation and visualization of data in the field of molecular biology.

Proceedings ArticleDOI
16 Nov 2011
TL;DR: A novel approach to and analysis of High Performance and Low Power Matrix Filling for a DNA Sequence Alignment Accelerator using an ASIC design flow, which provides a greater speed-up than the traditional sequential implementation while maintaining the level of sensitivity.
Abstract: Efficient sequence alignment is one of the most important and challenging activities in bioinformatics. Many algorithms have been proposed to perform and accelerate sequence alignment activities. Among them Smith-Waterman (S-W) is the most sensitive (accurate) algorithm. This paper presents a novel approach to and analysis of High Performance and Low Power Matrix Filling for a DNA Sequence Alignment Accelerator using an ASIC design flow. The objective of this paper is to improve the performance of DNA sequence alignment and to reduce the power of the existing technique using the Smith-Waterman (SW) algorithm. The scope of the study is the matrix filling method, which is a parallel implementation of the Smith-Waterman algorithm. This method provides a greater speed-up than the traditional sequential implementation while maintaining the level of sensitivity. The methodology of this paper uses FPGA and Synopsys. This technique is used to implement the massive parallelism. The design was developed in Verilog HDL and synthesized using Linux tools. The matrix cell design with an area of 8808.307 mm² at a 40 ns clock period is the best design. The power required at this clock period is also smaller: dynamic power of 111.1415 µW and leakage power of 212.9538 nW. This is a large improvement over existing designs and improves data throughput by using an ASIC design flow.

Proceedings ArticleDOI
20 Sep 2011
TL;DR: Evaluating both the serial and parallel BLAST algorithms onto a large Infiniband-based diskless High Performance Cluster that offers lower hardware cost and improved reliability, as opposed to traditional disk full clusters shows that BLAST runtime can still be retained with the use of the diskless clusters, while improving the runtime reliability.
Abstract: The Basic Local Alignment Search Tool (BLAST) is one of the most widely used bioinformatics programs for searching all available sequence databases for similarities between a protein or DNA query and predefined sequences, using a sequence alignment technique. Recently, many attempts have been made to make the algorithm practical to run against the publicly available genome databases on large parallel clusters. This paper presents our experience in evaluating both the serial and parallel BLAST algorithms on a large Infiniband-based diskless High Performance Cluster (HPC) that offers lower hardware cost and improved reliability, as opposed to traditional disk full clusters. The paper also presents the evaluation methodology along with the experimental results to illustrate the scalability of the BLAST algorithm on our HPC system. For our measurement and comparison, we considered cluster sizes up to 32 compute nodes. Our results show that BLAST runtime can still be retained with the use of the diskless clusters, while improving the runtime reliability.

Book ChapterDOI
12 Sep 2011
TL;DR: Alternative approaches to the standard approach to the alignment and string matching problems as dealt with in computer science might be explored in biology, provided one is able to give a positive answer to the following question: can one exhibit a sequence distance which is at the same time easily computed and non-trivial?
Abstract: In general, when a new DNA sequence is given, the first step taken by a biologist would be to compare the new sequence with sequences that are already well studied and annotated. Sequences that are similar would probably have the same function, or, if two sequences from different organisms are similar, there may be a common ancestor sequence. Traditionally, this is done by using a distance function between the DNA chains, which implies in most cases that we apply it between two DNA sequences and try to interpret the obtained score. The standard method for sequence comparison is sequence alignment. Sequence alignment is the procedure of comparing two sequences (pairwise alignment) or more sequences (multiple alignment) by searching for a series of individual characters or character patterns that are in the same order in the sequences. Algorithmically, the standard pairwise alignment method is based on dynamic programming; the method compares every pair of characters of the two sequences and generates an alignment and a score, which is dependent on the scoring scheme used, i.e. a scoring matrix for the different base-pair combinations, match and mismatch scores, or a scheme for insertion or deletion (gap) penalties. The underlying string distance is called edit distance or Levenshtein distance. Although dynamic programming for sequence alignment is mathematically optimal, it is far too slow for comparing a large number of bases. A typical DNA database today contains billions of bases, and the number is still increasing rapidly. To enable sequence search and comparison to be performed in a reasonable time, fast heuristic local alignment algorithms have been developed, e.g. BLAST, freely available at http://www.ncbi.nlm.nih.gov/BLAST. With respect to the standard approach to the alignment and string matching problems as dealt with in computer science, alternative approaches might be explored in biology, provided one is able to give a positive answer to the following question: can one exhibit a sequence distance which is at the same time easily computed and non-trivial? The ranking of this problem on the first position in two lists of major open problems in bioinformatics (J.C. Wooley. Trends in computational biology: a summary based on a RECOMB plenary lecture. J. Comput. Biology, 6, 459-474, 1999 and E.V. Koonin. The emerging paradigm and open problems in comparative

Journal ArticleDOI
TL;DR: A novel integrated approach to predicting PPI based on sequence alignment by jointly using a k-Nearest Neighbor classifier (SA-kNN) and a Support Vector Machine (SVM), a machine learning technique used in a wide range of Bioinformatics applications, thanks to the ability to alleviate the overfitting problems.
Abstract: Protein-Protein Interaction (PPI) prediction is a well-known problem in Bioinformatics, for which a large number of techniques have been proposed in the past. However, prediction results have not been sufficiently satisfactory for guiding biologists in wet-lab experiments. One reason is that not all useful information, such as pairwise protein interaction information based on sequence alignment, has been integrated in PPI prediction. Alignment is a basic concept for measuring sequence similarity in Proteomics that has been used in a number of applications ranging from protein recognition to protein subcellular localization. In this article, we propose a novel integrated approach to predicting PPI based on sequence alignment by jointly using a k-Nearest Neighbor classifier (SA-kNN) and a Support Vector Machine (SVM). SVM is a machine learning technique used in a wide range of Bioinformatics applications, thanks to its ability to alleviate overfitting problems. We demonstrate that in our approach the two methods, SA-kNN and SVM, are complementary and can be combined in an ensemble to overcome their respective limitations. While the SVM is trained on Amino Acid (AA) compositions and protein signatures mined from the literature, the SA-kNN makes use of the similarity of two protein pairs through alignment. Experimentally, our technique leads to a significant gain in accuracy, precision and sensitivity, of ∼5%, 16% and 10% respectively.

Proceedings ArticleDOI
01 Nov 2011
TL;DR: An algorithm based on level set with a novel similarity constraint term for identical objects segmentation is presented to embed the similarity constraint into curve evolution, where the evolving speed is high in regions of similar appearance and becomes low in areas with distinct contents.
Abstract: Unsupervised identical object segmentation remains a challenging problem in vision research due to the difficulties in obtaining high-level structural knowledge about the scene. In this paper, we present an algorithm based on level sets with a novel similarity constraint term for identical object segmentation. The key component of the proposed algorithm is to embed the similarity constraint into curve evolution, where the evolving speed is high in regions of similar appearance and becomes low in areas with distinct contents. The algorithm starts with a pair of seed matches (e.g. SIFT) and evolves the small initial circle to form large similar regions under the similarity constraint. The similarity constraint is related to local alignment, with the assumption that the warp between identical objects is an affine transformation. The right warp aligns the identical objects and promotes the growth of the similar regions. The alignment and expansion alternate until the curve reaches the boundaries of similar objects. Experiments on real images validate the efficiency and effectiveness of the proposed algorithm.

Proceedings ArticleDOI
27 Feb 2011
TL;DR: A comprehensive study of a systolic design for the Smith-Waterman algorithm is presented, with specific focus on enhancing parallelism and on optimizing the total size of memory and circuits; in particular, efficient realizations for compressing score matrices and for reducing affine gap cost functions are developed.
Abstract: The Smith-Waterman algorithm is a key technique for comparing genetic sequences. This paper presents a comprehensive study of a systolic design for the Smith-Waterman algorithm. It is parameterized in terms of the sequence length, the amount of parallelism, and the number of FPGAs. Two methods of organizing the parallelism, the line-based and the lattice-based methods, are introduced. Our analytical treatment reveals how these two methods perform relative to peak performance when the level of parallelism varies. A novel systolic design is then described, showing how the parametric description can be effectively implemented, with specific focus on enhancing parallelism and on optimizing the total size of memory and circuits; in particular, we develop efficient realizations for compressing score matrices and for reducing affine gap cost functions. Promising results have been achieved, showing, for example, that a single XC5VLX330 FPGA at 131 MHz can be three times faster than a platform with two NVIDIA GTX295 at 1242 MHz.
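The affine gap cost functions mentioned in this abstract can be illustrated in software with the standard three-state (Gotoh-style) recurrence. The sketch below is not the paper's hardware design; the function name and the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions, and a gap of length k is assumed to cost `gap_open + (k - 1) * gap_extend`.

```python
def local_affine_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gap costs (score only)."""
    neg_inf = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * (n + 1)  # same, but ending with a gap in b (vertical)
    best = 0
    for i in range(1, len(a) + 1):
        h_curr = [0] * (n + 1)
        f_curr = [neg_inf] * (n + 1)
        e = neg_inf               # ending with a gap in a (horizontal), current row
        for j in range(1, n + 1):
            # Either open a new gap from H or extend an existing one.
            e = max(h_curr[j - 1] + gap_open, e + gap_extend)
            f_curr[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart at any cell.
            h_curr[j] = max(0, h_prev[j - 1] + sub, e, f_curr[j])
            best = max(best, h_curr[j])
        h_prev, f_prev = h_curr, f_curr
    return best
```

With these assumed scores, aligning `"CCCCCAACCCCC"` against `"CCCCCCCCCC"` skips the two `A` residues as a single gap of length two, so the gap costs 4 in total rather than 6.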

01 Jan 2011
TL;DR: Experimental evaluations show that the proposed GPU accelerated scheme for the PK-means gene clustering algorithm can attain an order of magnitude speedup as compared with the original PK-means algorithm.
Abstract: In this paper, a novel GPU accelerated scheme for the PK-means gene clustering algorithm is proposed. According to the native particle-pair structure of the PK-means algorithm, a fragment shader program is tailor-made to process a pair of particles in one pass for the computation-intensive portion. As the output channel of a fragment consisting of 4 floating-point values is fully utilized, the overhead for each data point in searching for its nearest centroid throughout the particle-pair is reduced. Experimental evaluations on three popular gene expression datasets show that the proposed GPU accelerated scheme can attain an order of magnitude speedup as compared with the original PK-means algorithm.

Journal ArticleDOI
TL;DR: The algorithm generated is based on designing matrices in such a way that score matrix contains the maximum scores for alignment of the DNA sequences and the aligned sequences are generated by trace matrix generated based on the score matrix.
Abstract: The algorithm generated is based on designing matrices in such a way that the score matrix contains the maximum scores for alignment of the DNA sequences and the aligned sequences are generated by a trace matrix generated based on the score matrix. The score matrix is initialized by using the Smith-Waterman algorithm and the scores used for filling up the score matrix are calculated using the Needleman-Wunsch algorithm.
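The pairing of a score matrix with a trace matrix can be sketched with a textbook Smith-Waterman implementation. This is not the hybrid scheme the abstract describes; the function name, the linear gap penalty and the scoring values are assumptions made for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Trace codes: 0 = stop, 1 = diagonal, 2 = up (gap in b), 3 = left (gap in a).
    trace = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            s = max(0, diag, up, left)
            score[i][j] = s
            if s == 0:
                trace[i][j] = 0
            elif s == diag:
                trace[i][j] = 1
            elif s == up:
                trace[i][j] = 2
            else:
                trace[i][j] = 3
            if s > best:
                best, best_pos = s, (i, j)
    # Follow the trace matrix back from the best cell until a stop code.
    i, j = best_pos
    out_a, out_b = [], []
    while trace[i][j] != 0:
        if trace[i][j] == 1:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif trace[i][j] == 2:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With the assumed scores, `smith_waterman("TTTACGTTT", "GGACGGG")` returns `(6, "ACG", "ACG")`: the score matrix locates the best cell, and the trace matrix recovers the aligned substrings.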

Journal ArticleDOI
TL;DR: It is shown that there are ncRNA families in which considering local structural alignment with gap penalty model can identify real hits more effectively than using global alignment or local alignment without gap penalty model.
Abstract: Predicting new non-coding RNAs (ncRNAs) of a family can be done by aligning the potential candidate with a member of the family with known sequence and secondary structure. Existing tools either only consider the sequence similarity or cannot handle local alignment with gaps. In this paper, we consider the problem of finding the optimal local structural alignment between a query RNA sequence (with known secondary structure) and a target sequence (with unknown secondary structure) with the affine gap penalty model. We provide the algorithm to solve the problem. Based on an experiment, we show that there are ncRNA families in which considering local structural alignment with gap penalty model can identify real hits more effectively than using global alignment or local alignment without gap penalty model.

Book ChapterDOI
05 Aug 2011
TL;DR: The proposed Coding Region Sequence Analysis (CRSA) algorithm reduces both time and space complexity by meaningfully reducing the size of sequences, removing less significant exons using wavelet transforms.
Abstract: Discovering the functions of proteins in living organisms is an important tool for understanding cellular processes. The source data for such analysis are commonly the peptide sequences. The most common algorithms used to compare a pair of nucleotide sequences are the global alignment algorithm (Needleman-Wunsch algorithm) and the local alignment algorithm (Smith-Waterman algorithm). Analysis of these algorithms shows that both the time complexity and the space complexity are O(mn), where m is the size of one sequence and n is the size of the other. This is one of the major bottlenecks, as most sequences are very large. The proposed Coding Region Sequence Analysis (CRSA) algorithm presents a method to reduce both time and space complexity by meaningfully reducing the size of sequences by removing less significant exons using wavelet transforms. DSP techniques supply a strong basis for identifying regions with three-base periodicity.
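The three-base periodicity mentioned at the end of this abstract is commonly measured with a Fourier transform of base-indicator sequences. The sketch below uses that DFT-based measure, not the wavelet transform of the CRSA algorithm; the function name and the normalization by sequence length are assumptions.

```python
import cmath

def period3_power(seq: str) -> float:
    """Spectral power at frequency 1/3, summed over the four base indicators."""
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT coefficient at frequency 1/3 of the 0/1 indicator sequence of `base`:
        # only positions holding this base contribute a unit phasor.
        coeff = sum(cmath.exp(-2j * cmath.pi * k / 3)
                    for k, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / n  # assumed normalization by sequence length
```

A strictly codon-periodic string such as `"ATG" * 10` gives a large value, while a homopolymer such as `"A" * 30` gives a value near zero, because its phasors cancel over each full period.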

DOI
01 Jun 2011
TL;DR: An approach leading to significant acceleration of the execution of the Smith-Waterman algorithm, which finds the best local alignment of two sequences, such as amino acid or nucleotide sequences, is presented.
Abstract: CUDA is a technology introduced by NVIDIA Corporation, which allows software developers to take advantage of GPU resources relatively easily. This paper presents an approach leading to significant acceleration of the execution of the Smith-Waterman algorithm. The algorithm finds the best local alignment of two sequences, such as amino acid or nucleotide sequences. The results show that it is possible to search bio-informatics databases accurately within a reasonable time.
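The abstract does not say how the computation is mapped to the GPU. One common way to expose parallelism in Smith-Waterman, shown here in plain Python rather than CUDA, is to process the score matrix one anti-diagonal at a time. The function name, the linear gap penalty and the scoring values are assumptions.

```python
def sw_score_wavefront(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score computed anti-diagonal by anti-diagonal."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    # d = i + j indexes an anti-diagonal; cells on it depend only on
    # diagonals d - 1 and d - 2, never on each other.
    for d in range(2, m + n + 1):
        # This inner loop is the part a GPU could run with one thread per cell.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,  # match or mismatch
                          h[i - 1][j] + gap,      # gap in b
                          h[i][j - 1] + gap)      # gap in a
            best = max(best, h[i][j])
    return best
```

The result is the same as with row-by-row filling; only the order in which cells are visited changes, which is what makes the independent cells of each anti-diagonal visible.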

Journal ArticleDOI
TL;DR: An efficient on-line handwritten digit recognition method is proposed, based on a Convex-Concave curves feature that is extracted from a chain code sequence and classified using the Smith-Waterman alignment algorithm.
Abstract: In this paper, we propose an efficient on-line handwritten digit recognition method based on a Convex-Concave curves feature, which is extracted from a chain code sequence, using the Smith-Waterman alignment algorithm. The time-sequential signal from mouse movement on the writing pad is described as a sequence of consecutive points on the x-y plane. Preprocessing therefore yields a data set of successive, time-sequential pixel positions. The preprocessed data are used for Convex-Concave curves feature extraction. This feature is scale-, translation-, and rotation-invariant. The extracted feature is fed to a Smith-Waterman alignment algorithm, which in turn classifies it as one of the nine digits. In comparison with a backpropagation neural network, Smith-Waterman alignment performs better.

Journal ArticleDOI
TL;DR: CORAL-M as discussed by the authors adopts a codon-based probabilistic filtration model and the local optimal alignment solution to align multiple genome sequences in linear time, and finds more potential functional sites than other commonly used tools.
Abstract: Multiple Sequence Alignment (MSA) is the computational biology tool for facilitating the study of DNA homology, phylogeny determinations and conserved motifs. Many MSA methods have been presented to align protein, DNA, and RNA sequences successfully, but not for coding region sequences. Therefore, we propose a heuristic alignment method, CORAL-M, for multiple genome sequences, especially for coding regions. CORAL-M adopts a codon-based probabilistic filtration model and the local optimal alignment solution to align multiple genome sequences in linear time. Experimental results from aligning Enterovirus strains show that CORAL-M can find more potential functional sites than other commonly used tools.

01 Jan 2011
TL;DR: Criteria are revealed that specify the conditions under which the local or the global alignment algorithm is preferable, depending on the positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned.
Abstract: Background: Algorithms of sequence alignment are the key instruments for computer-assisted studies of biopolymers. It is important to take into account the “quality” of the obtained alignments, i.e. how closely the algorithms manage to restore the “gold standard” alignment (GS-alignment), which superimposes positions originating from the same position in the common ancestor of the compared sequences. A 3D-alignment is commonly used as an approximation of the GS-alignment, although this choice is not fully justified. Among the currently used pairwise alignment algorithms, the best quality is achieved by the algorithm of optimal alignment based on affine penalties for deletions (the Smith-Waterman algorithm). Nevertheless, the expedience of using local or global versions of the algorithm has not been studied. Results: Using model series of amino acid sequence pairs, we studied the relative “quality” of results produced by local and global alignments versus (1) the relative length of similar parts of the sequences (their “cores”) and their nonhomologous parts, and (2) the relative positions of the core regions in the compared sequences. We obtained numerical values of the average quality (measured as accuracy and confidence) of the global alignment method and the local alignment method for evolutionary distances between homologous sequence parts from 30 to 240 PAM, for core lengths of 10% to 70% of the total length of the sequences, and for all possible positions of homologous sequence parts relative to the centers of the sequences. Conclusion: We revealed criteria that specify the conditions under which the local or the global alignment algorithm is preferable, depending on the positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned.
It was demonstrated that when the core part of one sequence was positioned above the core of the other sequence, the global algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the local algorithm. On the contrary, when the cores were positioned asymmetrically, the local algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the global algorithm. This opens a possibility for creation of a combined method allowing generation of more accurate alignments.
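The difference between the local and global versions of the algorithm can be seen on a toy pair in which a shared core sits at opposite ends of the two sequences. The sketch below is a simplification under assumptions: it uses a linear gap penalty rather than the affine penalties discussed in the abstract, and the function name and scoring values are hypothetical.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Score of the best local (Smith-Waterman) or global (Needleman-Wunsch) alignment."""
    # Global alignment pays for leading gaps; local alignment starts for free.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            if local:
                s = max(s, 0)       # a local alignment may restart anywhere
                best = max(best, s)
            curr.append(s)
        prev = curr
    # Local: best cell anywhere. Global: the bottom-right cell.
    return best if local else prev[-1]
```

For `a = "TTTTGATTACA"` and `b = "GATTACACCCC"` the local score is 7, the full `GATTACA` core, while the global score is lower because the nonhomologous flanks must also be aligned or gapped.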