Showing papers by "William F. Smyth" published in 2006


Book Chapter
15 Aug 2006
TL;DR: New algorithms are presented for the pattern matching problem where the pattern can contain variable-length gaps, and the techniques they use are shown to be useful in many other contexts.
Abstract: In this paper we present new algorithms to handle the pattern matching problem where the pattern can contain variable-length gaps. Given a pattern P with variable-length gaps and a text T, our algorithm works in $O(n + m + \alpha \log(\max_{1 \le i \le l}(b_i - a_i)))$ time, where n is the length of the text, m is the sum of the lengths of the component subpatterns, $\alpha$ is the total number of occurrences of the component subpatterns in the text, and $a_i$ and $b_i$ are, respectively, the minimum and maximum number of don't cares allowed between the ith and (i+1)st components of the pattern. We also present another algorithm which, given a suffix array of the text, can report whether P occurs in T in $O(m + \alpha \log\log n)$ time. Both algorithms record information to report all the occurrences of P in T. Furthermore, the techniques used in our algorithms are shown to be useful in many other contexts.

54 citations
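
To make the problem statement in the abstract above concrete, here is a minimal brute-force sketch of matching a pattern with variable-length gaps. It is not the algorithm from the chapter and does not achieve its time bounds; the function names `find_all` and `gapped_match_ends`, and the representation of the pattern as a list of components plus a list of `(a_i, b_i)` gap bounds, are assumptions made for illustration.

```python
from bisect import bisect_left


def find_all(text, sub):
    """Return every start index of sub in text, in increasing order (overlaps allowed)."""
    starts = []
    i = text.find(sub)
    while i != -1:
        starts.append(i)
        i = text.find(sub, i + 1)
    return starts


def gapped_match_ends(text, parts, gaps):
    """Return the sorted end indices (exclusive) at which the gapped pattern can finish.

    parts: non-empty component subpatterns P_1, ..., P_l
    gaps:  (a_i, b_i) bounds on the number of don't-care characters
           allowed between P_i and P_{i+1}
    """
    assert len(gaps) == len(parts) - 1
    # End positions (exclusive) of partial matches of P_1; sorted by construction.
    ends = [s + len(parts[0]) for s in find_all(text, parts[0])]
    for part, (lo, hi) in zip(parts[1:], gaps):
        next_ends = []
        for s in find_all(text, part):
            # A previous end e is compatible when lo <= s - e <= hi,
            # i.e. s - hi <= e <= s - lo. Binary search for the smallest such e.
            k = bisect_left(ends, s - hi)
            if k < len(ends) and ends[k] <= s - lo:
                next_ends.append(s + len(part))
        ends = next_ends  # still sorted, because starts are visited in order
    return ends
```

For example, `gapped_match_ends("abxxcdxab", ["ab", "cd"], [(1, 3)])` returns `[6]`, since `cd` starts two positions after the end of the first `ab`.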


Book Chapter
11 Oct 2006
TL;DR: The resource requirements of compressed suffix array algorithms are compared with those of compressed inverted file data structures for general pattern matching in genomic and English texts; with equivalent memory, inverted files are faster at reporting the locations of patterns when the number of occurrences is high.
Abstract: Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we examine the resource requirements of compressed suffix array algorithms against compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams, thus allowing full pattern matching capabilities, rather than simple word based search, making their functionality equivalent to the compressed suffix array structures. When using equivalent memory for the two structures, inverted files are faster at reporting the location of patterns when the number of occurrences of the patterns is high.

45 citations
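
The abstract above notes that an inverted file over q-grams supports full pattern matching rather than word-based search only. The following is a minimal, uncompressed sketch of that idea, assuming a plain in-memory dictionary as the index; the names `build_qgram_index` and `search` are hypothetical, and none of the compression studied in the chapter is attempted.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of text to the sorted list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def search(index, q, pattern):
    """Return all start positions of pattern (length >= q) using only the index."""
    if len(pattern) < q:
        raise ValueError("this sketch only handles patterns of length >= q")
    candidates = None
    for j in range(len(pattern) - q + 1):
        # Positions of the j-th q-gram of the pattern, shifted back by j so
        # that they refer to the start of a potential pattern occurrence.
        shifted = {p - j for p in index.get(pattern[j:j + q], ())}
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            return []
    # Every overlapping q-gram of the pattern matched at the right offset,
    # so the surviving candidates are exact occurrences.
    return sorted(candidates)
```

Because consecutive q-grams of the pattern overlap and together cover every character, intersecting the shifted posting lists needs no further verification against the text.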


Journal Article
TL;DR: A periodicity lemma is presented that establishes limitations on the number and range of periodicities that can occur over a specified range of positions in $\mathbf{x}$ and is applied to specify corresponding limitations on the occurrence of runs.
Abstract: Given a string $\mathbf{x} = \mathbf{x}[1..n]$, a repetition of period $p$ in $\mathbf{x}$ is a substring $\mathbf{u}^r = \mathbf{x}[i..i+rp-1]$, $p = |\mathbf{u}|$, $r \ge 2$, where neither $\mathbf{u} = \mathbf{x}[i..i+p-1]$ nor $\mathbf{x}[i..i+(r+1)p-1]$ is a repetition. The maximum number of repetitions in any string $\mathbf{x}$ is well known to be $\Theta(n \log n)$. A run or maximal periodicity of period $p$ in $\mathbf{x}$ is a substring $\mathbf{u}^r\mathbf{t} = \mathbf{x}[i..i+rp+|\mathbf{t}|-1]$ of $\mathbf{x}$, where $\mathbf{u}^r$ is a repetition, $\mathbf{t}$ is a proper prefix of $\mathbf{u}$, and no repetition of period $p$ begins at position $i-1$ of $\mathbf{x}$ or ends at position $i+rp+|\mathbf{t}|$. In 2000 Kolpakov and Kucherov [J. Discrete Algorithms, 1 (2000), pp. 159-186] showed that the maximum number $\rho(n)$ of runs in any string $\mathbf{x}$ is $O(n)$, but their proof was nonconstructive and provided no specific constant of proportionality. At the same time, they presented experimental data strongly suggesting that $\rho(n) < n$. Related work by Fraenkel and Simpson [J. Combin. Theory Ser. A., 82 (1998), pp. 112-120] showed that the maximum number $\sigma(n)$ of distinct squares in any string $\mathbf{x}$ satisfies $\sigma(n) < 2n$, while experiment again encourages the belief that in fact $\sigma(n) < n$. In this paper, as a first step toward proving these conjectures, we present a periodicity lemma that establishes limitations on the number and range of periodicities that can occur over a specified range of positions in $\mathbf{x}$. We then apply this result to specify corresponding limitations on the occurrence of runs.

34 citations
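
As an illustration of the definition of a run given in the abstract above, the sketch below enumerates all runs of a string by brute force in quadratic time. It is only meant for experimenting with small inputs (for example, counting runs to compare against $n$) and is not a method from the article; the function names are hypothetical, and positions are 0-indexed, unlike the 1-indexed notation of the abstract.

```python
def is_primitive(u):
    """True if u is not a power v^k (k >= 2) of a shorter string v."""
    return (u + u).find(u, 1) == len(u)


def runs(x):
    """Return all runs of x as sorted triples (start, end, period), end inclusive."""
    n = len(x)
    found = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if x[i] != x[i + p]:
                i += 1
                continue
            # Extend the maximal stretch i..j-1 on which x[k] == x[k + p].
            j = i
            while j + p < n and x[j] == x[j + p]:
                j += 1
            # x[i..j+p-1] has period p and length (j - i) + p; it is a run
            # when that length is at least 2p and p is its smallest period,
            # which holds exactly when the generator x[i:i+p] is primitive.
            if j - i >= p and is_primitive(x[i:i + p]):
                found.append((i, j + p - 1, p))
            i = j + 1
    return sorted(found)
```

For instance, `runs("mississippi")` reports the run `ississi` of period 3 together with the three period-1 runs `ss`, `ss`, and `pp`.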


Journal Article
01 Dec 2006
TL;DR: In this article, the authors consider the "reconstruction" of a suffix array based on a given reordering of the alphabet, and describe simple time- and space-efficient algorithms that accomplish it.
Abstract: For certain problems (for example, computing repetitions and repeats, or data compression applications) it is not necessary that the suffixes of a string represented in a suffix tree or suffix array occur in lexicographical order (lexorder). It thus becomes of interest to study possible alternate orderings of the suffixes in these data structures that may be easier to construct or more efficient to use. In this paper we consider the "reconstruction" of a suffix array based on a given reordering of the alphabet, and we describe simple time- and space-efficient algorithms that accomplish it.

11 citations
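
To show what a suffix array under a reordered alphabet looks like, here is a naive sketch that simply re-sorts all suffixes using a caller-supplied character order. It rebuilds the array from scratch and is not the time- and space-efficient reconstruction the article describes; the function name and the `alphabet_order` parameter are assumptions for illustration.

```python
def suffix_array(text, alphabet_order):
    """Return the suffix array of text under the given ordering of the alphabet.

    alphabet_order: a string listing every character of text exactly once,
                    from smallest to largest.
    """
    rank = {c: r for r, c in enumerate(alphabet_order)}
    # Compare suffixes as lists of character ranks; a suffix that is a proper
    # prefix of another sorts first, as in ordinary lexicographical order.
    return sorted(range(len(text)), key=lambda i: [rank[c] for c in text[i:]])
```

With the usual order, `suffix_array("banana", "abn")` gives `[5, 3, 1, 0, 4, 2]`; with the reordering n < b < a, `suffix_array("banana", "nba")` gives `[4, 2, 0, 5, 3, 1]`.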


Proceedings Article
01 Jan 2006
TL;DR: This talk investigates whether suffix arrays can indeed replace inverted files, as suggested in recent literature on suffix arrays.
Abstract: Recently the theoretical community has displayed a flurry of interest in suffix arrays and compressed suffix arrays. New, asymptotically optimal algorithms for construction, search, and compression of suffix arrays have been proposed. In this talk we will present our investigations into the practicalities of these latest developments. In particular, we investigate whether suffix arrays can indeed replace inverted files, as suggested in recent literature on suffix arrays.

2 citations


Journal Article
TL;DR: In this paper, it was shown that if G is a k-optimum summable graph of order n, k ≥ 3 and k ≥ 2, then (1) n ≥ 2k; (2) the complete bipartite graph $K_{k,n-k}$ is not a spanning subgraph of G.

1 citation


Proceedings Article
01 Aug 2006
TL;DR: This paper presents a linear algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r; such an algorithm can then be used to find which rhythm, from a given set of rhythms, covers the largest part of the music sequence in question and thus best describes that sequence.
Abstract: A fundamental problem in music is to classify songs according to their rhythm. A rhythm is represented by a sequence of Quick (Q) and Slow (S) symbols, which correspond to the (relative) duration of notes, such that S=QQ. In this paper we present a linear algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r. An efficient algorithm for this problem can then be used to find which rhythm, from a given set of such rhythms, covers the largest part of the music sequence in question, and thus best describes that sequence.

1 citation