
Showing papers by "Costas S. Iliopoulos" published in 2009


Journal Article
TL;DR: The current result on the different matching problems is extended to handle the presence of “don't care” symbols, and efficient algorithms are presented that calculate Iδ, Iγ, and I(δ,γ) = Iδ ∩ Iγ, for a pattern P with occurrences of “don't cares”.
Abstract: Here we consider string matching problems that arise naturally in applications to music retrieval. The δ-Matching problem calculates, for a given text $T_{1..n}$ and a pattern $P_{1..m}$ on an alphabet of integers, the list of all indices $I_\delta = \{1 \le i \le n-m+1 : \max_{j=1}^{m} |P_j - T_{i+j-1}| \le \delta\}$. The γ-Matching problem computes, for given T and P, the list of all indices $I_\gamma = \{1 \le i \le n-m+1 : \sum_{j=1}^{m} |P_j - T_{i+j-1}| \le \gamma\}$. In this paper, we extend the current result on the different matching problems to handle the presence of “don't care” symbols. We present efficient algorithms that calculate $I_\delta$, $I_\gamma$, and $I_{(\delta,\gamma)} = I_\delta \cap I_\gamma$, for a pattern P with occurrences of “don't cares”.
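
For concreteness, a minimal brute-force sketch in Python of the two matching conditions, without “don't care” handling and without the paper's efficiency guarantees (the function name and the quadratic scan are ours, not the paper's):

def delta_gamma_match(T, P, delta, gamma):
    # Naive O(nm) scan; T and P are sequences of integers.
    n, m = len(T), len(P)
    I_delta, I_gamma = [], []
    for i in range(n - m + 1):                     # 0-based candidate positions
        diffs = [abs(P[j] - T[i + j]) for j in range(m)]
        if max(diffs) <= delta:
            I_delta.append(i)
        if sum(diffs) <= gamma:
            I_gamma.append(i)
    I_dg = sorted(set(I_delta) & set(I_gamma))     # I_(delta,gamma) = intersection
    return I_delta, I_gamma, I_dg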

53 citations


Journal ArticleDOI
10 Jun 2009
TL;DR: This paper presents a new and efficient algorithm for solving the Longest Common Subsequence problem for two strings in O(ℛ log log n + n) time, where ℛ is the total number of ordered pairs of positions at which the two strings match.
Abstract: The Longest Common Subsequence (LCS) problem is a classic and well-studied problem in computer science. The LCS problem is a common task in DNA sequence analysis with many applications to genetics and molecular biology. In this paper, we present a new and efficient algorithm for solving the LCS problem for two strings. Our algorithm runs in O(ℛ log log n + n) time, where ℛ is the total number of ordered pairs of positions at which the two strings match.
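
The match-pair formulation can be made concrete with the classic Hunt–Szymanski reduction of LCS to a longest increasing subsequence over the ℛ matching pairs. The sketch below runs in O((ℛ + n) log n); the paper's algorithm sharpens the log factor to log log n using faster predecessor structures. This is our illustration of the underlying reduction, not the paper's code:

from bisect import bisect_left
from collections import defaultdict

def lcs_via_match_pairs(A, B):
    # Positions of each symbol of B, so matching pairs are enumerated cheaply.
    pos_in_B = defaultdict(list)
    for j, c in enumerate(B):
        pos_in_B[c].append(j)
    # tails[k] = smallest B-index that ends a common subsequence of length k+1.
    tails = []
    for a in A:
        # Descending j prevents chaining two matches that share an A-position.
        for j in reversed(pos_in_B[a]):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)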

45 citations


Book ChapterDOI
10 Nov 2009
TL;DR: Efficient algorithms are presented for storing past segments of a text in the LPF table, computed from two previously computed read-only arrays composing the Suffix Array of the text, including an O(n log n) strong in-place computation of the LPF table.
Abstract: We present efficient algorithms for storing past segments of a text. They are computed using two previously computed read-only arrays (SUF and LCP) composing the Suffix Array of the text. They compute the maximal length of the previous factor (subword) occurring at each position of the text in a table called LPF. This notion is central both in many conservative text compression techniques and in the most efficient algorithms for detecting motifs and repetitions occurring in a text. The main results are: a linear-time algorithm that computes explicitly the permutation that transforms the LCP table into the LPF table; a time-space optimal computation of the LPF table; and an O(n log n) strong in-place computation of the LPF table.
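
The LPF table itself is easy to state. The following quadratic Python sketch only illustrates the definition; the paper's algorithms derive the table from the read-only SUF and LCP arrays in linear time, or strong in-place in O(n log n):

def lpf_naive(t):
    # LPF[i] = maximal length of a factor starting at i that also starts
    # at some earlier position j < i (overlapping copies allowed).
    n = len(t)
    lpf = [0] * n
    for i in range(n):
        for j in range(i):
            l = 0
            while i + l < n and t[j + l] == t[i + l]:
                l += 1
            lpf[i] = max(lpf[i], l)
    return lpf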

34 citations


Book ChapterDOI
08 Dec 2009
TL;DR: Two new tables storing different types of previous factors (past segments) of a string are computed efficiently in linear time on any integer alphabet, helpful to improve, for example, gapped palindrome detection and text compression using reverse factors.
Abstract: Suffix arrays provide a powerful data structure to solve several questions related to the structure of all the factors of a string. We show how they can be used to compute efficiently two new tables storing different types of previous factors (past segments) of a string. The concept of a longest previous factor is inherent to Ziv-Lempel factorization of strings in text compression, as well as in statistics of repetitions and symmetries. The longest previous reverse factor for a given position i is the longest factor starting at i, such that its reverse copy occurs before, while the longest previous non-overlapping factor is the longest factor v starting at i which has an exact copy occurring before. The previous copies of the factors are required to occur in the prefix ending at position i − 1. We design algorithms computing the table of longest previous reverse factors (LPrF table) and the table of longest previous non-overlapping factors (LPnF table). The latter table is useful to compute repetitions while the former is a useful tool for extracting symmetries. These tables are computed, using two previously computed read-only arrays (SUF and LCP) composing the suffix array, in linear time on any integer alphabet. The tables have not been explicitly considered before, but they have several applications and they are natural extensions of the LPF table which has been studied thoroughly before. Our results improve on the previous ones in several ways. The running time of the computation no longer depends on the size of the alphabet, which drops a log factor. Moreover the newly introduced tables store additional information on the structure of the string, helpful to improve, for example, gapped palindrome detection and text compression using reverse factors.
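
A brute-force Python sketch of the two definitions (the paper computes both tables in linear time from the SUF and LCP arrays; this slow version only pins down what is stored):

def lprf_lpnf_naive(t):
    # LPrF[i]: longest factor starting at i whose *reverse* occurs entirely
    #          inside the prefix t[0:i].
    # LPnF[i]: longest factor starting at i with an exact earlier copy
    #          ending at or before position i-1 (non-overlapping).
    n = len(t)
    lprf, lpnf = [0] * n, [0] * n
    for i in range(n):
        for l in range(1, n - i + 1):
            factor = t[i:i + l]
            if factor[::-1] in t[:i]:
                lprf[i] = l
            if factor in t[:i]:
                lpnf[i] = l
    return lprf, lpnf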

28 citations


Journal ArticleDOI
TL;DR: A new improved indexing scheme for the gapped-factors is presented, which generalizes the indexing data structure in the sense that, unlike GFT, it is independent of the parameters k and k′.
Abstract: Indexing of factors or substrings is a widely used and useful technique in stringology and can be seen as a tool in solving diverse text algorithmic problems. A gapped-factor is a concatenation of a factor of length k, a gap of length d and another factor of length k′. Such a gapped-factor is called a (k−d−k′)-gapped-factor. The problem of indexing the gapped-factors was considered recently by Peterlongo et al. (In: Stringology, pp. 182–196, 2006). In particular, Peterlongo et al. devised a data structure, namely a gapped factor tree (GFT), to index the gapped-factors. Given a text $\mathcal{T}$ of length n over the alphabet Σ and the values of the parameters k, d and k′, the construction of GFT requires O(n|Σ|) time. Once GFT is constructed, a given (k−d−k′)-gapped-factor can be reported in O(k+k′+Occ) time, where Occ is the number of occurrences of that factor in $\mathcal{T}$. In this paper, we present a new improved indexing scheme for the gapped-factors. The improvements we achieve come from two aspects. Firstly, we generalize the indexing data structure in the sense that, unlike GFT, it is independent of the parameters k and k′. Secondly, our data structure can be constructed in $O(n\log^{1+\varepsilon} n)$ time and space, where 0 < ε < 1.
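
The query such an index answers can be stated with a naive Python scan (our illustration; the indexing structures report the occurrences in O(k + k′ + Occ) time after preprocessing):

def gapped_occurrences(text, u, v, d):
    # Occurrences of the (k-d-k')-gapped-factor: u of length k at position i,
    # a gap of length d, then v of length k' at position i + k + d.
    k, k2 = len(u), len(v)
    total = k + d + k2
    return [i for i in range(len(text) - total + 1)
            if text[i:i + k] == u and text[i + k + d:i + total] == v]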

27 citations


Proceedings ArticleDOI
01 Nov 2009
TL;DR: This paper defines and solves the Massive Exact Unique Pattern Matching problem in genomes, and presents a practical algorithm for efficiently mapping uniquely occurring short reads to a reference genome.
Abstract: Novel high throughput sequencing technology methods have redefined the way genome sequencing is performed. They are able to produce tens of millions of short sequences (reads) in a single experiment and with a much lower cost than previous sequencing methods. Due to this massive amount of data generated by the above systems, efficient algorithms for mapping short sequences to a reference genome are in great demand. In this paper, we present a practical algorithm for addressing the problem of efficiently mapping uniquely occurring short reads to a reference genome. This requires the classification of these short reads into unique and duplicate matches. In particular, we define and solve the Massive Exact Unique Pattern Matching problem in genomes.
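
A minimal sketch of the classification task, using a Python hash table over all genome windows of the read length; this stands in for the paper's algorithm and is practical only for moderate genome sizes (names and representation are our assumptions):

from collections import Counter

def classify_reads(genome, reads):
    # Reads are assumed equal-length; count every genome window of that length.
    m = len(reads[0])
    window_counts = Counter(genome[i:i + m] for i in range(len(genome) - m + 1))
    unique    = [r for r in reads if window_counts[r] == 1]
    duplicate = [r for r in reads if window_counts[r] > 1]
    unmatched = [r for r in reads if window_counts[r] == 0]
    return unique, duplicate, unmatched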

20 citations


24 Jun 2009
TL;DR: A new, combinatorial model for analyzing and interpreting an electrocardiogram (ECG) is presented and an application of the model is QRS peak detection, demonstrated with an online algorithm, which is shown to be space as well as time efficient.
Abstract: A new, combinatorial model for analyzing and interpreting an electrocardiogram (ECG) is presented. An application of the model is QRS peak detection. This is demonstrated with an online algorithm, which is shown to be space as well as time efficient. Experimental results on the MIT-BIH Arrhythmia database show that this novel approach is promising. Further uses for this approach are discussed, such as taking advantage of its small memory requirements and interpreting large amounts of pre-recorded ECG data.
Keywords: Combinatorics, ECG analysis, MIT-BIH Arrhythmia Database, QRS Detection, String Algorithms

9 citations


Journal ArticleDOI
TL;DR: A family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints are described, which are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem.
Abstract: A repeat in a string is a substring that occurs more than once. A repeat is extendible if every occurrence of the repeat has an identical letter either on the left or on the right; otherwise, it is maximal. A multirepeat is a repeat that occurs at least $m_{\min}$ times ($m_{\min} \ge 2$) in each of at least $q \ge 1$ strings in a given set of strings. In this paper, we describe a family of efficient algorithms based on suffix arrays to compute maximal multirepeats under various constraints. Our algorithms are faster, more flexible and much more space-efficient than algorithms recently proposed for this problem. The results extend recent work by two of the authors computing all maximal repeats in a single string.
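
The multirepeat condition itself can be checked naively (a Python illustration of the definition, not the paper's suffix-array machinery):

def is_multirepeat(w, strings, m_min, q):
    # w qualifies if it occurs (overlaps allowed) at least m_min times
    # in each of at least q of the given strings.
    def occurrences(s):
        return sum(1 for i in range(len(s) - len(w) + 1) if s[i:i + len(w)] == w)
    return sum(occurrences(s) >= m_min for s in strings) >= q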

8 citations


Journal ArticleDOI
TL;DR: This paper proposes a general framework for polyphonic music using the substitution score scheme set for monophonic music, which allows new operations by extending the operations proposed by Mongeau and Sankoff [15].
Abstract: Existing symbolic music comparison systems generally consider monophonic music or monophonic reductions of polyphonic music. Adapting alignment algorithms to music leads to accurate systems, but their extension to polyphonic music raises new problems. Indeed, a chord may match several consecutive notes, or the difference between two similar motifs may be a few swapped notes. Moreover, the substitution scores between chords are difficult to set up. In this paper, we propose a general framework for polyphonic music that uses the substitution score scheme set for monophonic music and allows new operations, extending the operations proposed by Mongeau and Sankoff [15]. From a practical point of view, limiting the chord sizes and the number of notes that can be merged consecutively keeps the complexity quadratic.

7 citations


Journal ArticleDOI
TL;DR: This paper addresses the problem of efficiently mapping and classifying millions of short sequences to a reference genome, based on whether they occur exactly once in the genome or not, and by taking into consideration probability scores.
Abstract: Novel high-throughput (Deep) sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences in a single experiment and with a much lower cost than previous methods. In this paper, we address the problem of efficiently mapping and classifying millions of short sequences to a reference genome, based on whether they occur exactly once in the genome or not, and by taking into consideration probability scores. In particular, we design algorithms for Massive Exact and Approximate Pattern Matching of short degenerate and weighted sequences, derived from Deep sequencing technologies, to a reference genome.
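
For the weighted case, one common formulation scores a genome window by the product of the per-position probabilities of its letters in the weighted read, accepting windows whose score reaches a cutoff. The Python sketch below assumes that formulation and a representation as a list of letter-to-probability dicts; both are our modeling assumptions, not the paper's data structures:

def weighted_match_positions(genome, weighted_read, cutoff):
    # weighted_read[j] maps each letter to its probability at position j.
    m = len(weighted_read)
    hits = []
    for i in range(len(genome) - m + 1):
        p = 1.0
        for j in range(m):
            p *= weighted_read[j].get(genome[i + j], 0.0)
            if p < cutoff:        # probabilities only shrink, so prune early
                break
        if p >= cutoff:
            hits.append(i)
    return hits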

6 citations


Book ChapterDOI
07 Jul 2009
TL;DR: This paper shows how to implement the algorithms of Iliopoulos and Rytter in the MPI environment, adapting them to the lack of shared memory, the small number of processors, and the communication costs between processors.
Abstract: Suffix trees and suffix arrays are two well-known index data structures for strings. It is known that the latter can be easily transformed into the former: Iliopoulos and Rytter [5] showed two simple transformation algorithms on the CREW PRAM model. However, the PRAM model is a theoretical one and we need a practical parallel model. The Message Passing Interface (MPI) is a standard widely used on both massively parallel machines and on clusters. In this paper, we show how to implement the algorithms of Iliopoulos and Rytter in the MPI environment. Our contribution includes adapting the algorithms to the lack of shared memory, the small number of processors, and the communication costs between processors.
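
Sequentially, the transformation being parallelized is the classic stack-based scan that builds the suffix tree from the suffix array and its LCP table in linear time. A Python sketch under the usual conventions (s ends with a unique sentinel; lcp[i] is the length of the longest common prefix of the suffixes at suf[i−1] and suf[i]); this is our rendering of the standard sequential algorithm, not the paper's MPI code:

def suffix_tree_from_sa(s, suf, lcp):
    n = len(s)
    nodes = [{'depth': 0, 'children': []}]          # node 0 is the root
    def new_node(depth):
        nodes.append({'depth': depth, 'children': []})
        return len(nodes) - 1
    stack = [0]                                      # rightmost path of the tree
    leaf = new_node(n - suf[0])
    nodes[0]['children'].append(leaf)
    stack.append(leaf)
    for i in range(1, n):
        l, last = lcp[i], None
        while nodes[stack[-1]]['depth'] > l:         # pop the too-deep part
            last = stack.pop()
        top = stack[-1]
        if nodes[top]['depth'] < l:                  # split: new internal node
            mid = new_node(l)
            nodes[top]['children'].remove(last)
            nodes[top]['children'].append(mid)
            nodes[mid]['children'].append(last)
            stack.append(mid)
            top = mid
        leaf = new_node(n - suf[i])                  # leaf for suffix suf[i]
        nodes[top]['children'].append(leaf)
        stack.append(leaf)
    return nodes                    # edge labels are implicit in the depths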

Proceedings Article
01 Dec 2009
TL;DR: This paper addresses the problem of efficiently mapping and classifying millions of degenerate and weighted sequences to a reference genome, based on whether they occur exactly once in the genome or not, and by taking into consideration probability scores.
Abstract: Novel high throughput sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences in a single experiment and with a much lower cost than previous methods. In this paper, we address the problem of efficiently mapping and classifying millions of degenerate and weighted sequences to a reference genome, based on whether they occur exactly once in the genome or not, and by taking into consideration probability scores. In particular, we design parallel algorithms for Massive Exact and Approximate Unique Pattern Matching for degenerate and weighted sequences derived from high throughput sequencing technologies.

Proceedings Article
01 May 2009
TL;DR: This work introduces combinatorial problems involving overlays (non-overlapping substrings) and the covering of a text t by them and shows that decision problems of this type can be solved using an Aho-Corasick keyword automaton.
Abstract: Motivated by the identification of the musical structure of pop songs, we introduce combinatorial problems involving overlays (non-overlapping substrings) and the covering of a text t by them. We present four problems and suggest solutions based on string pattern matching techniques. We show that decision problems of this type can be solved using an Aho-Corasick keyword automaton. We conjecture that one general optimization problem of this type is NP-complete and introduce a simpler, more pragmatic optimization problem. We solve the latter using suffix trees and, finally, we suggest other open problems for further investigation.
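
One decision problem of this flavor — can a text t be exactly covered by non-overlapping occurrences of patterns from a set? — admits a simple dynamic program over prefix lengths. The Python sketch below is our illustration of that decision question under an exact-tiling reading of covering; the paper's solutions use an Aho-Corasick keyword automaton instead:

def tileable(t, patterns):
    # cover[i] is True iff the prefix t[0:i] can be tiled by the patterns.
    n = len(t)
    cover = [True] + [False] * n
    for i in range(1, n + 1):
        cover[i] = any(len(p) <= i and cover[i - len(p)] and t.endswith(p, 0, i)
                       for p in patterns)
    return cover[n]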

Proceedings ArticleDOI
03 Aug 2009
TL;DR: This paper defines and solves the Massive Exact and Approximate Unique Pattern Matching problem for degenerate and weighted sequences derived from high throughput sequencing technologies.
Abstract: High-throughput (or next-generation) sequencing technologies have opened new and exciting opportunities in the use of DNA sequences. The new emerging technologies mark the beginning of a new era of high throughput short read sequencing: they have the potential to assemble a bacterial genome during a single experiment and at a moderate cost. In this paper, we address the problem of efficiently mapping millions of degenerate and weighted sequences to a reference genome with respect to whether they occur exactly once in the genome or not, and by taking probability scores into consideration. In particular, we define and solve the Massive Exact and Approximate Unique Pattern Matching problem for degenerate and weighted sequences derived from high throughput sequencing technologies.
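
For the degenerate case, each read position admits a set of letters (e.g. IUPAC codes expanded to letter sets). A naive Python scan illustrates the exact-matching component (representation and names are our assumptions, not the paper's):

def degenerate_match_positions(genome, degen_read):
    # degen_read[j] is the set of letters allowed at position j.
    m = len(degen_read)
    return [i for i in range(len(genome) - m + 1)
            if all(genome[i + j] in degen_read[j] for j in range(m))]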

01 Jan 2009
TL;DR: In this article, it was shown that any digraph with out-degree at most d ≥ 2, diameter k ≥ 2 and order one or two less than the Moore bound must have all vertices of out-degree d.
Abstract: Since Moore digraphs do not exist for k ≠ 1 and d ≠ 1, the problem of finding digraphs of out-degree d ≥ 2, diameter k ≥ 2 and order close to the Moore bound becomes an interesting problem. To prove the non-existence of such digraphs or to assist in their construction (if they exist), we first may wish to establish some properties that such digraphs must possess. In this paper we consider the diregularity of such digraphs. It is easy to show that any digraph with out-degree at most d ≥ 2, diameter k ≥ 2 and order one or two less than the Moore bound must have all vertices of out-degree d. However, establishing the regularity or otherwise of the in-degree of such a digraph is not easy. In this paper we prove that all digraphs of defect two are either diregular or almost diregular. Additionally, in the case of defect one we present a new, simpler and shorter, proof that a digraph of defect one must be diregular, and in the case of defect two and for d = 2 and k ≥ 3, we present an alternative proof that a digraph of defect two must be diregular. This research was partly supported by the Leverhulme Visiting Professorship of the second author.
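
For reference (a standard fact, not stated in the abstract): the directed Moore bound for out-degree d and diameter k is $M_{d,k} = 1 + d + d^2 + \cdots + d^k$, and digraphs of defect one and defect two are those of order $M_{d,k} - 1$ and $M_{d,k} - 2$, respectively.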

Proceedings Article
31 Aug 2009
TL;DR: The algorithm presented here validates a one-dimensional image x of length n, over a given set of objects all of equal length and each composed of two parts separated by a transparent hole.
Abstract: A partially occluded image consists of a set of objects where some may be partially occluded by others. Validating occluded images distinguishes whether a given image can be covered by the members of a finite set of objects, where both the image and the objects range over an identical alphabet. The algorithm presented here validates a one-dimensional image x of length n, over a given set of objects all of equal length and each composed of two parts separated by a transparent hole.

Posted Content
TL;DR: An upper bound of 0.5n on the maximal number of highly periodic runs in a string of length n is shown, and a sequence of words achieving a lower bound of 0.406n is constructed.
Abstract: A run is a maximal occurrence of a repetition $v$ with a period $p$ such that $2p \le |v|$. The maximal number of runs in a string of length $n$ was studied by several authors and it is known to be between $0.944 n$ and $1.029 n$. We investigate highly periodic runs, in which the shortest period $p$ satisfies $3p \le |v|$. We show the upper bound $0.5n$ on the maximal number of such runs in a string of length $n$ and construct a sequence of words for which we obtain the lower bound $0.406 n$.
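
The definitions can be checked with a brute-force Python scan that, for each candidate period p, extends maximal periodic fragments and keeps those of length at least ratio · p (ratio 2 for runs, 3 for highly periodic runs); this is our illustration, not the paper's combinatorial argument:

def runs(s, ratio=2):
    # Returns the set of intervals [start, end) that are maximal fragments
    # with some period p and length >= ratio * p; for ratio=2 these are
    # exactly the runs, for ratio=3 the highly periodic runs.
    n, found = len(s), set()
    for p in range(1, n // ratio + 1):
        k = p
        while k < n:
            if s[k] == s[k - p]:
                start = k - p
                while k < n and s[k] == s[k - p]:
                    k += 1
                if k - start >= ratio * p:
                    found.add((start, k))
            k += 1
    return found

For example, runs("aabaabaa") yields the three squares of period 1 together with the interval (0, 8) of period 3, while runs(s, 3) restricts the output to the highly periodic runs studied here.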