Showing papers on "Smith–Waterman algorithm" published in 2011

Journal Article•DOI•

Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation

Torbjørn Rognes¹ ²•Institutions (2)

University of Oslo¹, Oslo University Hospital²

01 Jun 2011-BMC Bioinformatics

TL;DR: Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before.

Abstract: The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.
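
The GCUPS figures quoted throughout this listing can be related to wall-clock time with a short calculation. The sketch below assumes the usual definition (query length times total database residues, divided by elapsed seconds, in units of 10^9); the function name `gcups` and the example database size and timing are hypothetical and are not taken from the paper.

```python
def gcups(query_length: int, database_residues: int, seconds: float) -> float:
    """Billions of dynamic-programming cell updates per second for one search."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # One matrix cell is updated for every (query residue, database residue) pair.
    cells = query_length * database_residues
    return cells / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical numbers: a 375-residue query against a made-up database
    # of 10**9 residues, with a made-up elapsed time of 3.75 seconds.
    print(gcups(375, 10**9, 3.75))
```

Because the measure scales with query length, comparisons between tools are only meaningful for the same query and database.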

223 citations

Journal Article•DOI•

AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision

Alexej Abyzov¹, Mark Gerstein¹•Institutions (1)

Yale University¹

01 Mar 2011-Bioinformatics

TL;DR: A dynamic-programming algorithm, called AGE for Alignment with Gap Excision, finds the optimal solution by simultaneously aligning the 5′ and 3′ ends of two given sequences and introducing a ‘large-gap jump’ between the local end alignments to maximize the total alignment score.

Abstract: Motivation: Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is an important problem, as it is a prerequisite for classifying SVs, evaluating their functional impact and reconstructing personal genome sequences. Given approximate breakpoint locations and a bridging assembly or split read, the problem essentially reduces to finding a correct sequence alignment. Classical algorithms for alignment and their generalizations guarantee finding the optimal (in terms of scoring) global or local alignment of two sequences. However, they cannot generally be applied to finding the biologically correct alignment of genomic sequences containing SVs because of the need to simultaneously span the SV (e.g. make a large gap) and perform precise local alignments at the flanking ends. Results: Here, we formulate the computations involved in this problem and describe a dynamic-programming algorithm for its solution. Specifically, our algorithm, called AGE for Alignment with Gap Excision, finds the optimal solution by simultaneously aligning the 5′ and 3′ ends of two given sequences and introducing a ‘large-gap jump’ between the local end alignments to maximize the total alignment score. We also describe extensions allowing the application of AGE to tandem duplications, inversions and complex events involving two large gaps. We develop a memory-efficient implementation of AGE (allowing application to long contigs) and make it available as a downloadable software package. Finally, we applied AGE for breakpoint determination and standardization in the 1000 Genomes Project by aligning locally assembled contigs to the human genome. Availability and Implementation: AGE is freely available at http://sv.gersteinlab.org/age. Contact: pi@gersteinlab.org Supplementary information: Supplementary data are available at Bioinformatics online.

99 citations

Journal Article•DOI•

Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences

Valery O Polyanovsky¹, Mikhail A. Roytberg, Vladimir G. Tumanyan¹•Institutions (1)

Engelhardt Institute of Molecular Biology¹

27 Oct 2011-Algorithms for Molecular Biology

TL;DR: Criteria are identified that specify when the local or the global alignment algorithm is preferable, depending on the positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned.

Abstract: Algorithms of sequence alignment are the key instruments for computer-assisted studies of biopolymers. Obviously, it is important to take into account the "quality" of the obtained alignments, i.e. how closely the algorithms manage to restore the "gold standard" alignment (GS-alignment), which superimposes positions originating from the same position in the common ancestor of the compared sequences. As an approximation of the GS-alignment, a 3D-alignment is commonly used not quite reasonably. Among the currently used algorithms of a pair-wise alignment, the best quality is achieved by using the algorithm of optimal alignment based on affine penalties for deletions (the Smith-Waterman algorithm). Nevertheless, the expedience of using local or global versions of the algorithm has not been studied. Using model series of amino acid sequence pairs, we studied the relative "quality" of results produced by local and global alignments versus (1) the relative length of similar parts of the sequences (their "cores") and their nonhomologous parts, and (2) relative positions of the core regions in the compared sequences. We obtained numerical values of the average quality (measured as accuracy and confidence) of the global alignment method and the local alignment method for evolutionary distances between homologous sequence parts from 30 to 240 PAM and for the core length making from 10% to 70% of the total length of the sequences for all possible positions of homologous sequence parts relative to the centers of the sequences. We revealed criteria allowing to specify conditions of preferred applicability for the local and the global alignment algorithms depending on positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned. 
It was demonstrated that when the core part of one sequence was positioned above the core of the other sequence, the global algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the local algorithm. On the contrary, when the cores were positioned asymmetrically, the local algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the global algorithm. This opens a possibility for creation of a combined method allowing generation of more accurate alignments.

62 citations

Proceedings Article•DOI•

Smith-Waterman Alignment of Huge Sequences with GPU in Linear Space

Edans Flavius de Oliveira Sandes¹, Alba Cristina Magalhaes Alves de Melo¹•Institutions (1)

University of Brasília¹

16 May 2011

TL;DR: This paper proposes and evaluates a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity, and proposes optimizations that are able to reduce significantly the amount of data processed and that enforce full parallelism most of the time.

Abstract: Cross-species chromosome alignments can reveal ancestral relationships and may be used to identify the peculiarities of the species. It is thus an important problem in bioinformatics. So far, aligning huge sequences, such as whole chromosomes, with exact methods has been regarded as infeasible because of the huge computing and memory requirements. However, high-performance computing platforms such as GPUs are able to change this scenario, making it possible to obtain the exact result for huge sequences in reasonable time. In this paper, we propose and evaluate a parallel algorithm that uses a GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. To achieve that, we propose optimizations that significantly reduce the amount of data processed and that enforce full parallelism most of the time. Using the GTX 285 board, our algorithm was able to produce the optimal alignment between sequences composed of 33 million base pairs (MBP) and 47 MBP in 18.5 hours.
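
The linear-space idea mentioned in the abstract can be illustrated with a score-only forward pass that keeps just two rows of the dynamic-programming matrix. This is a minimal CPU sketch under assumed parameters (linear gap penalty, +1/-1/-2 scoring, a hypothetical function name); it is not the authors' GPU algorithm and it omits the Myers-Miller divide-and-conquer step that recovers the alignment itself.

```python
def sw_score_linear_space(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2):
    """Return (best_score, end_i, end_j) of the best local alignment.

    Only two rows are stored, so memory is O(len(b)) rather than
    O(len(a) * len(b)). Positions are 1-based; (0, 0, 0) means no
    positive-scoring alignment exists.
    """
    prev = [0] * (len(b) + 1)
    best, end_i, end_j = 0, 0, 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                 # start a new local alignment
                          prev[j - 1] + s,   # extend diagonally
                          prev[j] + gap,     # gap in b
                          curr[j - 1] + gap) # gap in a
            if curr[j] > best:
                best, end_i, end_j = curr[j], i, j
        prev = curr  # the older row is no longer needed
    return best, end_i, end_j


if __name__ == "__main__":
    print(sw_score_linear_space("GATTACA", "TTAC"))
```

The end coordinates are what a later stage would need: a reverse pass from that cell can locate the start of the alignment without ever storing the full matrix.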

52 citations

Proceedings Article•DOI•

Improving CUDASW++, a Parallelization of Smith-Waterman for CUDA Enabled Devices

Doug Hains¹, Zach Cashero¹, Mark Ottenberg¹, Wim Böhm¹, Sanjay Rajopadhye¹•Institutions (1)

Colorado State University¹

16 May 2011

TL;DR: The development of the kernel is described as a series of incremental changes that provide insight into a number of issues that must be considered when developing any algorithm for the CUDA architecture and shows that the use of the intra-task kernel substantially improves the overall performance of CUDASW++.

Abstract: CUDASW++ is a parallelization of the Smith-Waterman algorithm for CUDA graphical processing units that computes the similarity scores of a query sequence paired with each sequence in a database. The algorithm uses one of two kernel functions to compute the score between a given pair of sequences: the inter-task kernel or the intra-task kernel. We have identified the intra-task kernel as a major bottleneck in the CUDASW++ algorithm. We have developed a new intra-task kernel that is faster than the original intra-task kernel used in CUDASW++. We describe the development of our kernel as a series of incremental changes that provide insight into a number of issues that must be considered when developing any algorithm for the CUDA architecture. We analyze the performance of our kernel compared to the original and show that the use of our intra-task kernel substantially improves the overall performance of CUDASW++ on the order of three to four giga-cell updates per second on various benchmark databases.

38 citations

Journal Article•DOI•

DOPA: GPU-based protein alignment using database and memory access optimizations.

Laiq Hasan¹, Marijn Kentie¹, Zaid Al-Ars¹•Institutions (1)

Delft University of Technology¹

28 Jul 2011-BMC Research Notes

TL;DR: This paper presents a high performance protein sequence alignment implementation for Graphics Processing Units (GPUs) and it achieves a performance of 21.4 Giga Cell Updates Per Second (GCUPS), which is 1.13 times better than the fastest GPU implementation to date.

Abstract: Smith-Waterman (S-W) algorithm is an optimal sequence alignment method for biological databases, but its computational complexity makes it too slow for practical purposes. Heuristics based approximate methods like FASTA and BLAST provide faster solutions but at the cost of reduced accuracy. Also, the expanding volume and varying lengths of sequences necessitate performance efficient restructuring of these databases. Thus to come up with an accurate and fast solution, it is highly desired to speed up the S-W algorithm. This paper presents a high performance protein sequence alignment implementation for Graphics Processing Units (GPUs). The new implementation improves performance by optimizing the database organization and reducing the number of memory accesses to eliminate bandwidth bottlenecks. The implementation is called Database Optimized Protein Alignment (DOPA) and it achieves a performance of 21.4 Giga Cell Updates Per Second (GCUPS), which is 1.13 times better than the fastest GPU implementation to date. In the new GPU-based implementation for protein sequence alignment (DOPA), the database is organized in equal length sequence sets. This equally distributes the workload among all the threads on the GPU's multiprocessors. The result is an improved performance which is better than the fastest available GPU implementation.

27 citations

Journal Article•DOI•

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

Jan Fostier¹, Sebastian Proost¹, Bart Dhoedt¹, Yvan Saeys¹, Piet Demeester¹, Yves Van de Peer¹, Klaas Vandepoele¹•Institutions (1)

Ghent University¹

01 Mar 2011-Bioinformatics

TL;DR: This article presents a novel accurate and efficient greedy, graph-based algorithm for the alignment of multiple homologous genomic segments, represented as ordered gene lists, and proves to be sufficiently fast for large datasets including a few dozens of eukaryotic genomes.

Abstract: Motivation: Many comparative genomics studies rely on the correct identification of homologous genomic regions using accurate alignment tools. In such case, the alphabet of the input sequences consists of complete genes, rather than nucleotides or amino acids. As optimal multiple sequence alignment is computationally impractical, a progressive alignment strategy is often employed. However, such an approach is susceptible to the propagation of alignment errors in early pairwise alignment steps, especially when dealing with strongly diverged genomic regions. In this article, we present a novel accurate and efficient greedy, graph-based algorithm for the alignment of multiple homologous genomic segments, represented as ordered gene lists. Results: Based on provable properties of the graph structure, several heuristics are developed to resolve local alignment conflicts that occur due to gene duplication and/or rearrangement events on the different genomic segments. The performance of the algorithm is assessed by comparing the alignment results of homologous genomic segments in Arabidopsis thaliana to those obtained by using both a progressive alignment method and an earlier graph-based implementation. Especially for datasets that contain strongly diverged segments, the proposed method achieves a substantially higher alignment accuracy, and proves to be sufficiently fast for large datasets including a few dozens of eukaryotic genomes. Availability: http://bioinformatics.psb.ugent.be/software. The algorithm is implemented as a part of the i-ADHoRe 3.0 package. Contact: yves.vandepeer@psb.vib-ugent.be Supplementary information:Supplementary data are available at Bioinformatics online.

26 citations

Proceedings Article•DOI•

High performance technique for database applications using a hybrid GPU/CPU platform

M. Affan Zidan¹, Talal Bonny¹, Khaled N. Salama¹•Institutions (1)

King Abdullah University of Science and Technology¹

02 May 2011

TL;DR: This work introduces a novel and efficient technique to improve the performance of database applications by using a Hybrid GPU/CPU platform, and solves the problem of the low efficiency resulting from running short-length sequences in a database on a GPU.

Abstract: Many database applications, such as sequence comparison, sequence searching, and sequence matching, process large database sequences. We introduce a novel and efficient technique to improve the performance of database applications by using a hybrid GPU/CPU platform. In particular, our technique solves the problem of the low efficiency that results from running short sequences in a database on a GPU. To verify our technique, we applied it to the widely used Smith-Waterman algorithm. The experimental results show that our hybrid GPU/CPU technique improves the average performance by a factor of 2.2 and the peak performance by a factor of 2.8 when compared to earlier implementations.

22 citations

Proceedings Article•DOI•

Accelerating Smith-Waterman on Heterogeneous CPU-GPU Systems

Jaideep Singh¹, Ipseeta Aruni¹•Institutions (1)

Indian Institute of Technology Roorkee¹

10 May 2011

TL;DR: The authors propose a hybrid Smith-Waterman algorithm that integrates state-of-the-art CPU and GPU solutions: the GPU acts as a co-processor and shares the workload with the CPU, achieving over 70 GCUPS through simultaneous CPU-GPU execution.

Abstract: This paper describes the approach and the speedup obtained in performing Smith-Waterman database searches on heterogeneous platforms comprising multi-core CPU and multi-GPU systems. Most of the advanced and optimized Smith-Waterman algorithm versions have demonstrated remarkable speedup over NCBI BLAST versions, viz., SWPS3 based on x86 SSE2 instructions and the CUDASW++ v2.0 CUDA implementation on GPU. This work proposes a hybrid Smith-Waterman algorithm that integrates the state-of-the-art CPU and GPU solutions for accelerating the Smith-Waterman algorithm, in which the GPU acts as a co-processor and shares the workload with the CPU, enabling us to realize remarkable performance of over 70 GCUPS resulting from simultaneous CPU-GPU execution. In this work, both CPU and GPU are graded equally in performance for Smith-Waterman, in contrast to previous approaches of porting the computationally intensive portions onto the GPUs or a naive multi-core CPU approach.

15 citations

Proceedings Article•DOI•

Plagiarism detection among source codes using adaptive local alignment of keywords

Jin-Su Lim¹, Jeong-Hoon Ji², Hwan-Gue Cho¹, Gyun Woo¹•Institutions (2)

Pusan National University¹, Korean Intellectual Property Office²

21 Feb 2011

TL;DR: The adaptive local alignment is more sensitive than previous local alignments that used a fixed similarity matrix, and its performance is superior to Greedy-String Tiling for detecting various plagiarism cases.

Abstract: This paper proposes a new method for detecting plagiarized pairs of source codes among a large set of source codes. The typical algorithms for detecting code plagiarism, which are largely exploited up to now, are based on Greedy-String Tiling or on local alignments of the two strings. This paper introduces a variant of the local alignment, namely, the adaptive local alignment, which exploits an adaptive similarity matrix. Each entry of the adaptive similarity matrix is the logarithm of the probabilities of the keywords based on the frequencies in a given set of programs. We experimented with this method using a set of programs submitted to more than 10 real programming contests. According to the experimental results, the distribution of the adaptive local alignment is more sensitive than that of the previous local alignments that used a fixed similarity matrix (+1 for match, −1 for mismatch, and −2 for gap), and the performance of the adaptive local alignment is superior to Greedy-String Tiling for detecting various plagiarism cases.
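
The fixed-matrix baseline named in the abstract (+1 for match, −1 for mismatch, −2 for gap) can be sketched as a plain Smith-Waterman local alignment over keyword tokens. The function name, the token lists, and the choice of a linear gap penalty are assumptions made for illustration; the paper's adaptive matrix, which derives scores from keyword frequencies, is not reproduced here.

```python
def local_align(xs, ys, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman over two token sequences with a fixed scoring scheme.

    Returns (best_score, pairs), where pairs lists the aligned tokens and
    None marks a gap.
    """
    n, m = len(xs), len(ys)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if xs[i - 1] == ys[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the best cell until the score drops to zero.
    pairs, i, j = [], bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if xs[i - 1] == ys[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            pairs.append((xs[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ys[j - 1]))
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    # Hypothetical keyword streams extracted from two programs.
    a = ["int", "for", "if", "return"]
    b = ["while", "for", "if", "return", "break"]
    print(local_align(a, b))
```

With a fixed matrix every keyword match is worth the same, which is the limitation the adaptive variant is described as addressing.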

14 citations

Book Chapter•DOI•

Accurate Scanning of Sequence Databases with the Smith-Waterman Algorithm

Łukasz Ligowski¹, Witold R. Rudnicki¹, Yongchao Liu, Bertil Schmidt•Institutions (1)

University of Warsaw¹

01 Jan 2011

TL;DR: The introduction of the new Fermi architecture significantly improved performance of the naive version of the Smith–Waterman algorithm, which suggests that automatic porting of applications to CUDA will have a better chance of success than for the previous generations of CUDA-enabled chips.

Abstract: Publisher Summary This chapter presents how the dynamic programming-based Smith–Waterman (SW) algorithm for protein sequence database scanning can be optimized on GPUs. Starting from a basic CUDA implementation, discussions are presented on several optimization techniques using shared memory, registers, loop unrolling, and CPU/GPU partitioning. The combination of these techniques leads to a fivefold performance improvement on the same hardware. Smith–Waterman is one of the most popular algorithms in bioinformatics, and therefore, the optimization techniques presented in this chapter are beneficial and instructive to researchers in this area. Because of the importance of SW in bioinformatics, there have been several attempts to improve its performance using a variety of parallel architectures. The highest performance of the multithreaded SSE2-vectorized CPU version is about 15 GCUPS on a modern quad-core CPU. This is similar to the performance of the best-optimized version of the algorithm described in this chapter. Even though the optimal alignment scores of the SW algorithm can be used to detect related sequences, the scores are biased by sequence length and composition. The Z-value has been proposed to estimate the statistical significance of these scores. The conclusion from this little experiment is that the introduction of the new Fermi architecture significantly improved performance of the naive version, which suggests that automatic porting of applications to CUDA will have a better chance of success than for the previous generations of CUDA-enabled chips. Nevertheless, the code optimized by hand still achieves more than a five-fold speedup in comparison with a naive port.

Book Chapter•DOI•

Parallel Processing of Multiple Pattern Matching Algorithms for Biological Sequences: Methods and Performance Results

Charalampos S. Kouzinopoulos, Panagiotis D. Michailidis, Konstantinos G. Margaritis

12 Sep 2011

TL;DR: The SITEBLAST algorithm (Michael et al., 2005) employs the Aho-Corasick algorithm to retrieve all motif anchors for a local alignment procedure for genomic sequences that makes use of prior knowledge.

Abstract: Multiple pattern matching is the computationally intensive kernel of many applications including information retrieval and intrusion detection systems, web and spam filters and virus scanners. The use of multiple pattern matching is very important in genomics where the algorithms are frequently used to locate nucleotide or amino acid sequence patterns in biological sequence databases. For example, when proteomics data is used for genome annotation in a process called proteogenomic mapping (Jaffe et al., 2004), a set of peptide identifications obtained using mass spectrometry is matched against a target genome translated in all six reading frames. Given a sequence database (or text) T = t_1 t_2 ... t_n of length n and a finite set of r patterns P = {p_1, p_2, ..., p_r}, where each p_i is a string p_i = p_{i,1} p_{i,2} ... p_{i,m} of length m over a finite character set Σ, the multiple pattern matching problem can be defined as the way to locate all the occurrences of any of the patterns in the sequence database. The naive solution to this problem is to perform r separate searches with one of the sequential algorithms (Navarro & Raffinot, 2002). While frequently used in the past, this technique is not efficient when a large pattern set is involved. The aim of all multiple pattern matching algorithms is to locate the occurrences of all patterns with a single pass of the sequence database. These algorithms are based on single-pattern matching algorithms, with some of their functions generalized to process multiple patterns simultaneously during the preprocessing phase, generally with the use of trie structures or hashing. Multiple pattern matching is widely used in computational biology for a variety of pattern matching tasks. Brudno and Morgenstern used a simplified version of the Aho-Corasick algorithm to identify anchor points in their CHAOS algorithm for fast alignment of large genomic sequences (Brudno & Morgenstern, 2002; Brudno et al., 2004). Hyyro et al. demonstrated that Aho-Corasick outperforms other algorithms for locating unique oligonucleotides in the yeast genome (Hyyro et al., 2005). The SITEBLAST algorithm (Michael et al., 2005) employs the Aho-Corasick algorithm to retrieve all motif anchors for a local alignment procedure for genomic sequences that makes use of prior knowledge. Buhler ...
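
The naive solution described in the abstract, r separate single-pattern searches over the same text, can be sketched as follows. The function name and the sample sequence are hypothetical, Python's built-in `str.find` stands in for "one of the sequential algorithms", and patterns are assumed to be non-empty.

```python
def naive_multi_search(text: str, patterns: list[str]) -> dict[str, list[int]]:
    """Run one independent scan of `text` per pattern.

    Returns, for each pattern, the 0-based start positions of all
    (possibly overlapping) occurrences.
    """
    hits = {}
    for p in patterns:  # r patterns means r passes over the text
        positions = []
        start = text.find(p)
        while start != -1:
            positions.append(start)
            start = text.find(p, start + 1)  # step by 1 to allow overlaps
        hits[p] = positions
    return hits


if __name__ == "__main__":
    print(naive_multi_search("GATTACAGATTA", ["GATT", "TA", "CCC"]))
```

The text is rescanned once per pattern, which is the inefficiency that single-pass methods such as Aho-Corasick are designed to avoid.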

Proceedings Article•

DNA sequence alignment: hybrid parallel programming on a multicore cluster

Enzo Rucci¹, Armando Eduardo De Giusti¹, Franco Chichizola¹, Marcelo Naiouf¹, Laura Cristina De Giusti¹•Institutions (1)

National University of La Plata¹

15 Sep 2011

TL;DR: A new parallelization strategy (HI-M) of Smith-Waterman algorithm on a multi-core cluster is presented, configuring a pipeline with a hybrid communication model and compared with two previously presented parallel solutions.

Abstract: DNA sequence alignment is one of the most important operations of computational biology. In 1981, Smith and Waterman developed a method for local sequence alignment. Because of its computational and memory requirements, various heuristics have been developed to reduce execution time at the expense of accuracy in the result, so heuristics do not ensure that the best alignment is found. For this reason, it is interesting to study how to apply the computing power of different parallel platforms to speed up the sequence alignment process without losing result accuracy. In this article, a new parallelization strategy (HI-M) of the Smith-Waterman algorithm on a multi-core cluster is presented, configuring a pipeline with a hybrid communication model. Additionally, a performance analysis is carried out and compared with two previously presented parallel solutions. Finally, experimental results are presented, as well as future research lines.

Proceedings Article•DOI•

Accelerating Biological Sequence Alignment Algorithm on GPU with CUDA

Fang Zheng, Xianbin Xu¹, Yuanhua Yang¹, Shuibing He¹, Yuping Zhang¹•Institutions (1)

Wuhan University¹

21 Oct 2011

TL;DR: A multi-threaded parallel design and implementation of the Smith-Waterman (SW) algorithm on CUDA is described to reduce execution time; results show that this implementation achieves better performance than the other parallel implementation on the Graphics Processing Unit.

Abstract: In this paper, we use a Compute Unified Device Architecture (CUDA) GPU to accelerate pairwise sequence alignment with the Smith-Waterman (SW) algorithm. SW is by far the best algorithm in terms of similarity-scoring accuracy, but its execution time for sequence alignment is too long. We therefore describe a multi-threaded parallel design and implementation of SW on CUDA to reduce execution time. Following the architecture of CUDA, we divide the computation of the whole pairwise alignment scoring matrix into multiple sub-matrices, using 32 threads to process one sub-matrix; moreover, we optimize the memory distribution scheme and use a reduction to find the maximum element of the alignment scoring matrix. We ran the algorithm on a GeForce 9600 GT connected to a Windows XP 64-bit system. The results show that this implementation achieves better performance than the other parallel implementation on the Graphics Processing Unit.

Proceedings Article•DOI•

Parallel models for sequence alignment on CPU and GPU

Plamenka Borovska¹, Milena Lazarova¹•Institutions (1)

Technical University of Sofia¹

16 Jun 2011

TL;DR: Investigation is made of the performance parameters of computing similarity indexes between query sequences and a reference sequence using the suggested parallel programming models and experimental analyses are aimed at searching for similarities of the human gamma interferon protein and influenza virus.

Abstract: The paper presents parallel computational models of Smith-Waterman algorithm for CPU and GPU. An investigation is made of the performance parameters of computing similarity indexes between query sequences and a reference sequence using the suggested parallel programming models. Implementations for GPU based sequence alignment using nVIDIA CUDA and OpenCL as well as CPU based sequence alignment using OpenMP multithreaded implementation are presented. The experimental analyses are aimed at searching for similarities of the human gamma interferon protein and influenza virus.

Biological Sequence Alignment for Bioinformatics Applications Using MATLAB

Sonali Vijan

11 Jun 2011

TL;DR: The two basic alignment algorithms i.e. Smith Waterman for local alignment and Needleman Wunsch for global alignment have been developed and simulated using MATLAB for genome analysis and sequence alignment.

Abstract: Biological sequence alignment is a widely used operation in bioinformatics and computational biology, as it determines the similarity between biological sequences. The two basic alignment algorithms, Smith-Waterman for local alignment and Needleman-Wunsch for global alignment, are used in this paper. The algorithms have been developed and simulated using MATLAB for genome analysis and sequence alignment. The local and global alignments are presented, and the results are shown in the form of dot plots and local and global scores for the sequences. The proposed work is a useful tool that can aid in the exploration, interpretation and visualization of data in the field of molecular biology.

Proceedings Article•DOI•

Design and Analysis of High Performance and Low Power Matrix Filling for DNA Sequence Alignment Accelerator Using ASIC Design Flow

Norhazlin Khairudin¹, M.A. Haron¹, S.A.M. Al Junid¹, Abdul Karimi Halim¹, M. F. M. Idros¹, N.F. Abdul Razak¹•Institutions (1)

Universiti Teknologi MARA¹

16 Nov 2011

TL;DR: A novel approach to, and analysis of, high-performance and low-power matrix filling for a DNA sequence alignment accelerator using an ASIC design flow is presented; it provides a greater speed-up than the traditional sequential implementation while maintaining the level of sensitivity.

Abstract: Efficient sequence alignment is one of the most important and challenging activities in bioinformatics. Many algorithms have been proposed to perform and accelerate sequence alignment. Among them, Smith-Waterman (S-W) is the most sensitive (accurate) algorithm. This paper presents a novel approach to, and analysis of, high-performance and low-power matrix filling for a DNA sequence alignment accelerator using an ASIC design flow. The objective of this paper is to improve the performance of DNA sequence alignment and to reduce the power consumption of the existing technique using the Smith-Waterman algorithm. The scope of the study is the matrix filling method, which is a parallel implementation of the Smith-Waterman algorithm. This method provides a greater speed-up than the traditional sequential implementation while maintaining the level of sensitivity. The methodology of this paper uses FPGA and Synopsys tools. This technique is used to implement massive parallelism. The design was developed in Verilog HDL and synthesized using Linux tools. The matrix cell design with an area of 8808.307 mm2 at a 40 ns clock period is the best design. The power required at this clock period is also smaller: dynamic power of 111.1415 µW and leakage power of 212.9538 nW. This is a large improvement over existing designs and improves data throughput by using the ASIC design flow.

Proceedings Article•DOI•

Evaluating BLAST Runtime Using NAS-Based High Performance Clusters

Sadiq M. Sait¹, Muhammed Al-Mulhem¹, Raed Abdullah AlShaikh¹•Institutions (1)

King Fahd University of Petroleum and Minerals¹

20 Sep 2011

TL;DR: Evaluating both the serial and parallel BLAST algorithms onto a large Infiniband-based diskless High Performance Cluster that offers lower hardware cost and improved reliability, as opposed to traditional disk full clusters shows that BLAST runtime can still be retained with the use of the diskless clusters, while improving the runtime reliability.

Abstract: The Basic Local Alignment Search Tool (BLAST) is one of the most widely used bioinformatics programs for searching all available sequence databases for similarities between a protein or DNA query and predefined sequences, using a sequence alignment technique. Recently, many attempts have been made to make the algorithm practical to run against the publicly available genome databases on large parallel clusters. This paper presents our experience in evaluating both the serial and parallel BLAST algorithms on a large Infiniband-based diskless High Performance Cluster (HPC) that offers lower hardware cost and improved reliability, as opposed to traditional disk full clusters. The paper also presents the evaluation methodology along with the experimental results to illustrate the scalability of the BLAST algorithm on our HPC system. For our measurement and comparison, we considered cluster sizes up to 32 compute nodes. Our results show that BLAST runtime can still be retained with the use of the diskless clusters, while improving the runtime reliability.

Book Chapter•DOI•

Liviu P. Dinu, Andrea Sgarro

12 Sep 2011

TL;DR: Alternative approaches to the standard approach to the alignment and string matching problems as dealt with in computer science might be explored in biology, provided one is able to give a positive answer to the following question: can one exhibit a sequence distance which is at the same time easily computed and non-trivial?

Abstract: In general, when a new DNA sequence is given, the first step taken by a biologist would be to compare the new sequence with sequences that are already well studied and annotated. Sequences that are similar would probably have the same function, or, if two sequences from different organisms are similar, there may be a common ancestor sequence. Traditionally, this is done by using a distance function between the DNA chains, which implies in most cases that we apply it between two DNA sequences and try to interpret the obtained score. The standard method for sequence comparison is by sequence alignment. Sequence alignment is the procedure of comparing two sequences (pairwise alignment) or more sequences (multiple alignment) by searching for a series of individual characters or character patterns that are in the same order in the sequences. Algorithmically, the standard pairwise alignment method is based on dynamic programming; the method compares every pair of characters of the two sequences and generates an alignment and a score, which is dependent on the scoring scheme used, i.e. a scoring matrix for the different base-pair combinations, match and mismatch scores, or a scheme for insertion or deletion (gap) penalties. The underlying string distance is called edit distance or also Levenshtein distance. Although dynamic programming for sequence alignment is mathematically optimal, it is far too slow for comparing a large number of bases. A typical DNA database today contains billions of bases, and the number is still increasing rapidly. To enable sequence search and comparison to be performed in a reasonable time, fast heuristic local alignment algorithms have been developed, e.g. BLAST, freely available at http://www.ncbi.nlm.nih.gov/BLAST.
With respect to the standard approach to the alignment and string matching problems as dealt with in computer science, alternative approaches might be explored in biology, provided one is able to give a positive answer to the following question: can one exhibit a sequence distance which is at the same time easily computed and non-trivial? The ranking of this problem on the first position in two lists of major open problems in bioinformatics (J.C. Wooley. Trends in computational biology: a summary based on a RECOMB plenary lecture. J. Comput. Biology, 6, 459-474, 1999 and E.V. Koonin. The emerging paradigm and open problems in comparative 6
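
The edit (Levenshtein) distance named in this abstract can be computed with the same dynamic-programming idea that underlies pairwise alignment. The following is a minimal sketch, assuming unit costs for insertion, deletion and substitution; the function name is hypothetical and the code is not taken from the chapter.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs, keeping only two rows of the DP table."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("GATTACA", "GCATGCU"))
```

The two nested loops visit every pair of characters once, which is the quadratic cost that makes this approach too slow for database-scale comparison.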

Journal Article•DOI•

In Silico Protein-Protein Interaction Prediction with Sequence Alignment and Classifier Stacking

Simone Marini¹, Qian Xu¹, Qiang Yang¹•Institutions (1)

Hong Kong University of Science and Technology¹

31 Oct 2011-Current Protein & Peptide Science

TL;DR: A novel integrated approach to predicting PPI based on sequence alignment by jointly using a k-Nearest Neighbor classifier (SA-kNN) and a Support Vector Machine (SVM), a machine learning technique used in a wide range of Bioinformatics applications, thanks to the ability to alleviate the overfitting problems.

Abstract: Protein-Protein Interaction (PPI) prediction is a well known problem in Bioinformatics, for which a large number of techniques have been proposed in the past. However, prediction results have not been sufficiently satisfactory for guiding biologists in wet-lab experiments. One reason is that not all useful information, such as pairwise protein interaction information based on sequence alignment, has been integrated together in PPI prediction. Alignment is a basic concept to measure sequence similarity in Proteomics that has been used in a number of applications ranging from protein recognition to protein subcellular localization. In this article, we propose a novel integrated approach to predicting PPI based on sequence alignment by jointly using a k-Nearest Neighbor classifier (SA-kNN) and a Support Vector Machine (SVM). SVM is a machine learning technique used in a wide range of Bioinformatics applications, thanks to the ability to alleviate the overfitting problems. We demonstrate that in our approach the two methods, SA-kNN and SVM, are complementary and are combined in an ensemble to overcome their respective limitations. While the SVM is trained on Amino Acid (AA) compositions and protein signatures mined from literature, the SA-kNN makes use of the similarity of two protein pairs through alignment. Experimentally, our technique leads to a significant gain in accuracy, precision and sensitivity measures of ∼5%, 16% and 10%, respectively.

Proceedings Article•DOI•

Identical object segmentation through level sets with similarity constraint

Hongbin Xie¹, Gang Zeng¹, Rui Gan¹, Hongbin Zha¹•Institutions (1)

Peking University¹

01 Nov 2011

TL;DR: An algorithm based on level set with a novel similarity constraint term for identical objects segmentation is presented to embed the similarity constraint into curve evolution, where the evolving speed is high in regions of similar appearance and becomes low in areas with distinct contents.

Abstract: Unsupervised identical object segmentation remains a challenging problem in vision research due to the difficulties in obtaining high-level structural knowledge about the scene. In this paper, we present an algorithm based on level sets with a novel similarity constraint term for identical object segmentation. The key component of the proposed algorithm is to embed the similarity constraint into curve evolution, where the evolving speed is high in regions of similar appearance and becomes low in areas with distinct contents. The algorithm starts with a pair of seed matches (e.g. SIFT) and evolves the small initial circle to form large similar regions under the similarity constraint. The similarity constraint is related to local alignment under the assumption that the warp between identical objects is an affine transformation. The right warp aligns the identical objects and promotes the growth of similar regions. The alignment and expansion alternate until the curve reaches the boundaries of similar objects. Real experiments validate the efficiency and effectiveness of the proposed algorithm.

Proceedings Article•DOI•

A comparison of FPGAs, GPUs and CPUs for Smith-Waterman algorithm (abstract only)

Yoshiki Yamaguchi¹, Kuen Hung Tsoi², Wayne Luk²•Institutions (2)

University of Tsukuba¹, Imperial College London²

27 Feb 2011

TL;DR: A comprehensive study of a systolic design for Smith-Waterman algorithm is presented, with specific focus on enhancing parallelism and on optimizing the total size of memory and circuits; in particular, efficient realizations for compressing score matrices and for reducing affine gap cost functions are developed.

Abstract: The Smith-Waterman algorithm is a key technique for comparing genetic sequences. This paper presents a comprehensive study of a systolic design for Smith-Waterman algorithm. It is parameterized in terms of the sequence length, the amount of parallelism, and the number of FPGAs. Two methods of organizing the parallelism, the line-based and the lattice-based methods, are introduced. Our analytical treatment reveals how these two methods perform relative to peak performance when the level of parallelism varies. A novel systolic design is then described, showing how the parametric description can be effectively implemented, with specific focus on enhancing parallelism and on optimizing the total size of memory and circuits; in particular, we develop efficient realizations for compressing score matrices and for reducing affine gap cost functions. Promising results have been achieved showing, for example, a single XC5VLX330 FPGA at 131MHz can be three times faster than a platform with two NVIDIA GTX295 at 1242MHz.

GPU Accelerated PK-means Algorithm for Gene Clustering

Wuchao Situ, Yau-King Lam, Yi Xiao, Peter Wai Ming Tsang, Chi-Sing Leung

01 Jan 2011

TL;DR: Experimental evaluations show that the proposed GPU accelerated scheme for the PK-means gene clustering algorithm can attain an order of magnitude speedup as compared with the original PK-means algorithm.

Abstract: In this paper, a novel GPU accelerated scheme for the PK-means gene clustering algorithm is proposed. According to the native particle-pair structure of the PK-means algorithm, a fragment shader program is tailor-made to process a pair of particles in one pass for the computation-intensive portion. As the output channel of a fragment consisting of 4 floating-point values is fully utilized, the overhead for each data point in searching for its nearest centroid throughout the particle-pair is reduced. Experimental evaluations on three popular gene expression datasets show that the proposed GPU accelerated scheme can attain an order of magnitude speedup as compared with the original PK-means algorithm.

Journal Article•DOI•

Alignment of DNA Sequence Using the Features of Global and Local Algorithms along with Matrices

Kavita Sharma, Amit Saxena, Praveen Kumar

01 Nov 2011-Advanced Materials Research

TL;DR: The algorithm is based on designing matrices in such a way that the score matrix contains the maximum scores for alignment of the DNA sequences, and the aligned sequences are generated from a trace matrix derived from the score matrix.

Abstract: The algorithm is based on designing matrices in such a way that the score matrix contains the maximum scores for alignment of the DNA sequences, and the aligned sequences are generated from a trace matrix derived from the score matrix. The score matrix is initialized using the Smith-Waterman algorithm, and the scores used for filling the score matrix are calculated using the Needleman-Wunsch algorithm.
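
To make the roles of a score matrix and a trace matrix concrete, here is a minimal sketch of plain Smith-Waterman local alignment with traceback. It does not reproduce the hybrid scheme described in the abstract, whose details are not given here; the function name, the direction codes and the scoring values (match=2, mismatch=-1, gap=-1) are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Trace codes: 0 = stop, 1 = diagonal, 2 = up (gap in b), 3 = left (gap in a).
    trace = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            s = max(0, diag, up, left)
            score[i][j] = s
            if s == 0:
                trace[i][j] = 0
            elif s == diag:
                trace[i][j] = 1
            elif s == up:
                trace[i][j] = 2
            else:
                trace[i][j] = 3
            if s > best:
                best, best_pos = s, (i, j)
    # Walk the trace matrix back from the best cell until a stop code.
    i, j = best_pos
    out_a, out_b = [], []
    while trace[i][j] != 0:
        if trace[i][j] == 1:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif trace[i][j] == 2:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Storing a direction per cell costs extra memory but makes the traceback a simple pointer walk instead of recomputing which neighbour produced each score.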

Journal Article•DOI•

Local structural alignment of RNA with affine gap model

Thomas K. F. Wong¹, Brenda W. Y. Cheung¹, Tak-Wah Lam¹, Siu-Ming Yiu¹•Institutions (1)

University of Hong Kong¹

28 Apr 2011-BMC Proceedings

TL;DR: It is shown that there are ncRNA families in which considering local structural alignment with a gap penalty model can identify real hits more effectively than using global alignment or local alignment without a gap penalty model.

Abstract: Predicting new non-coding RNAs (ncRNAs) of a family can be done by aligning the potential candidate with a member of the family with known sequence and secondary structure. Existing tools either only consider the sequence similarity or cannot handle local alignment with gaps. In this paper, we consider the problem of finding the optimal local structural alignment between a query RNA sequence (with known secondary structure) and a target sequence (with unknown secondary structure) with the affine gap penalty model. We provide the algorithm to solve the problem. Based on an experiment, we show that there are ncRNA families in which considering local structural alignment with gap penalty model can identify real hits more effectively than using global alignment or local alignment without gap penalty model.
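
For readers unfamiliar with the affine gap penalty model mentioned above, the sketch below computes a local alignment score for two plain sequences using the well-known three-state recurrence (often attributed to Gotoh). It ignores secondary structure entirely, so it is not the structural alignment algorithm of this paper. The function name and all scoring values are assumptions; a gap of length k is charged gap_open + (k - 1) * gap_extend.

```python
NEG_INF = float("-inf")


def local_affine_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score under an affine gap penalty (score only)."""
    cols = len(b) + 1
    h_prev = [0] * cols        # H: best score of an alignment ending at (i-1, j)
    f_prev = [NEG_INF] * cols  # F: best score ending with a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_curr = [0] * cols
        f_curr = [NEG_INF] * cols
        e = NEG_INF  # E: best score ending with a horizontal gap in this row
        for j in range(1, cols):
            # Either open a new gap from H or extend the gap already running.
            e = max(h_curr[j - 1] + gap_open, e + gap_extend)
            f_curr[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h_curr[j] = max(0, h_prev[j - 1] + sub, e, f_curr[j])
            best = max(best, h_curr[j])
        h_prev, f_prev = h_curr, f_curr
    return best


if __name__ == "__main__":
    print(local_affine_score("AAAATTTT", "AAAACCTTTT"))
```

With these values, one gap of length two costs 4, whereas two separate gaps of length one would cost 6, which is the behaviour that distinguishes the affine model from a linear one.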

Book Chapter•DOI•

Computational Methods to Locate and Reconstruct Genes for Complexity Reduction in Comparative Genomics

A Vidya¹, D Usha¹, B M Rashma¹, P. Deepa Shenoy¹, K. B. Raja¹, K. R. Venugopal¹, S. Sitharama Iyengar², Lalit M. Patnaik³•Institutions (3)

University Visvesvaraya College of Engineering¹, Louisiana State University², Defence Institute of Advanced Technology³

05 Aug 2011

TL;DR: The proposed Coding Region Sequence Analysis (CRSA) algorithm presents a method to reduce both time and space complexity by meaningfully reducing the size of sequences by removing not so significant exons using wavelet transforms.

Abstract: Discovering the functions of proteins in living organisms is an important tool for understanding cellular processes. The source data for such analysis are commonly the peptide sequences. The most common algorithms used to compare a pair of nucleotide sequences are the global alignment algorithm (Needleman-Wunsch algorithm) and the local alignment algorithm (Smith-Waterman algorithm). Analysis of these algorithms shows that their time complexity is O(mn) and their space complexity is O(mn), where m is the size of one sequence and n is the size of the other sequence. This is one of the major bottlenecks, as most of the sequences are very large. The proposed Coding Region Sequence Analysis (CRSA) algorithm presents a method to reduce both time and space complexity by meaningfully reducing the size of sequences by removing not so significant exons using wavelet transforms. DSP techniques supply a strong basis for identifying regions with three-base periodicity.

DOI•

Accelerating Smith-Waterman algorithm with the use of graphics processing unit

Robert Pawlowski¹, Dariusz Mrozek¹•Institutions (1)

Silesian University of Technology¹

01 Jun 2011

TL;DR: An approach leading to significant acceleration of the execution of the Smith-Waterman algorithm, which finds the best local alignment of two sequences, such as amino acid or nucleotide sequences, is presented.

Abstract: CUDA is a technology introduced by NVIDIA Corporation, which allows software developers to take advantage of GPU resources relatively easily. This paper presents an approach leading to significant acceleration of the execution of the Smith-Waterman algorithm. The algorithm finds the best local alignment of two sequences, such as amino acid or nucleotide sequences. The results show that it is possible to search bio-informatics databases accurately within a reasonable time.

Journal Article•DOI•

Online Handwritten Digit Recognition by Smith-Waterman Alignment

Won-Ho Mun, Yeon-Seok Choi, Sang-Geol Lee, Eui-Young Cha

30 Sep 2011-Journal of the Korea Society of Computer and Information

TL;DR: An efficient on-line handwritten digit recognition method is proposed, based on a Convex-Concave curve feature extracted from a chain code sequence and classified using the Smith-Waterman alignment algorithm.

Abstract: In this paper, we propose an efficient on-line handwritten digit recognition method based on a Convex-Concave curve feature, which is extracted from a chain code sequence, using the Smith-Waterman alignment algorithm. The time-sequential signal from mouse movement on the writing pad is described as a sequence of consecutive points on the x-y plane. Preprocessing therefore yields a data set of successive, time-sequential pixel positions. The preprocessed data are used for Convex-Concave curve feature extraction. This feature is scale-, translation-, and rotation-invariant. The extracted feature is fed to a Smith-Waterman alignment algorithm, which in turn classifies it as one of the nine digits. In comparison with a backpropagation neural network, Smith-Waterman alignment performs better.

Journal Article•DOI•

Multiple genome sequences alignment algorithm based on coding regions

Che Lun Hung¹, Chun-Yuan Lin², Shih Cheng Chang², Yeh-Ching Chung³, Shu Ju Hsieh, Chuan Yi Tang¹, Yaw-Ling Lin¹•Institutions (3)

Providence College¹, Chang Gung University², National Tsing Hua University³

29 Jun 2011-International Journal of Computational Biology and Drug Design

TL;DR: CORAL-M adopts a codon-based probabilistic filtration model and the local optimal alignment solution to align multiple genome sequences in linear time, and finds more potential function sites than other commonly used tools.

Abstract: Multiple Sequence Alignment (MSA) is the computational biology tool for facilitating the study of DNA homology, phylogeny determinations and conserved motifs. Many MSA methods have been presented to align protein, DNA, and RNA sequences successfully but not for coding region sequences. Therefore, we propose a heuristic alignment method, CORAL-M, for multiple genome sequences, especially for coding regions. CORAL-M adopts a codon-based probabilistic filtration model and the local optimal alignment solution to align multiple genome sequences in linear time. Experimental results from aligning Enterovirus strains show that CORAL-M can find more potential function sites than other commonly used tools.

Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of

Valery O Polyanovsky, Mikhail A. Roytberg, Vladimir G. Tumanyan

01 Jan 2011

Abstract: Background: Sequence alignment algorithms are the key instruments for computer-assisted studies of biopolymers. Obviously, it is important to take into account the “quality” of the obtained alignments, i.e. how closely the algorithms manage to restore the “gold standard” alignment (GS-alignment), which superimposes positions originating from the same position in the common ancestor of the compared sequences. A 3D-alignment is commonly used as an approximation of the GS-alignment, although this is not entirely justified. Among the currently used pairwise alignment algorithms, the best quality is achieved by the optimal alignment algorithm based on affine penalties for deletions (the Smith-Waterman algorithm). Nevertheless, the expediency of using local or global versions of the algorithm has not been studied. Results: Using model series of amino acid sequence pairs, we studied the relative “quality” of results produced by local and global alignments versus (1) the relative length of similar parts of the sequences (their “cores”) and their nonhomologous parts, and (2) relative positions of the core regions in the compared sequences. We obtained numerical values of the average quality (measured as accuracy and confidence) of the global alignment method and the local alignment method for evolutionary distances between homologous sequence parts from 30 to 240 PAM and for core lengths from 10% to 70% of the total length of the sequences, for all possible positions of homologous sequence parts relative to the centers of the sequences. Conclusion: We identified criteria that specify the conditions under which the local or the global alignment algorithm is preferable, depending on the positions and relative lengths of the cores and nonhomologous parts of the sequences to be aligned.
It was demonstrated that when the core part of one sequence was positioned above the core of the other sequence, the global algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the local algorithm. On the contrary, when the cores were positioned asymmetrically, the local algorithm was more stable at longer evolutionary distances and larger nonhomologous parts than the global algorithm. This opens a possibility for creation of a combined method allowing generation of more accurate alignments.
