Showing papers by "William F. Smyth" published in 2006

Open Access

Book Chapter

Finding patterns with variable length gaps or don’t cares

M. Sohel Rahman¹, Costas S. Iliopoulos¹, In-Bok Lee², Manal Mohamed¹, William F. Smyth³

King's College London¹, Seoul National University², McMaster University³

15 Aug 2006

TL;DR: New algorithms are presented for the pattern matching problem where the pattern can contain variable length gaps, and the techniques they use are shown to be useful in many other contexts.

Abstract: In this paper we have presented new algorithms to handle the pattern matching problem where the pattern can contain variable length gaps. Given a pattern P with variable length gaps and a text T our algorithm works in $O(n + m + \alpha \log(\max_{1 \le i \le l}(b_i - a_i)))$ time where n is the length of the text, m is the summation of the lengths of the component subpatterns, α is the total number of occurrences of the component subpatterns in the text and $a_i$ and $b_i$ are, respectively, the minimum and maximum number of don’t cares allowed between the ith and (i+1)st component of the pattern. We also present another algorithm which, given a suffix array of the text, can report whether P occurs in T in $O(m + \alpha \log\log n)$ time. Both the algorithms record information to report all the occurrences of P in T. Furthermore, the techniques used in our algorithms are shown to be useful in many other contexts.

54 citations

Book Chapter

Inverted files versus suffix arrays for locating patterns in primary memory

Simon J. Puglisi¹, William F. Smyth¹, Andrew Turpin²

Curtin University¹, RMIT University²

11 Oct 2006

TL;DR: The resource requirements of compressed suffix array algorithms are examined against compressed inverted file data structures for general pattern matching in genomic and English texts, finding that inverted files are faster at reporting the location of patterns when the number of occurrences of the patterns is high.

Abstract: Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we examine the resource requirements of compressed suffix array algorithms against compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams, thus allowing full pattern matching capabilities, rather than simple word based search, making their functionality equivalent to the compressed suffix array structures. When using equivalent memory for the two structures, inverted files are faster at reporting the location of patterns when the number of occurrences of the patterns is high.
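
As a rough illustration of how an inverted file over q-grams can support full pattern matching, the sketch below builds an uncompressed in-memory index and locates a pattern by looking up its first q-gram and verifying each candidate position. The names `build_qgram_index` and `locate` are hypothetical, and the compression that the paper evaluates is deliberately left out.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of text to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate(text, index, q, pattern):
    """Return all start positions of pattern in text (pattern length >= q).

    Candidates come from the posting list of the pattern's first q-gram;
    each candidate is then verified against the text.
    """
    if len(pattern) < q:
        raise ValueError("pattern must be at least q characters long")
    candidates = index.get(pattern[:q], [])
    return [i for i in candidates if text.startswith(pattern, i)]
```

A short usage example: with `text = "acgtacgtac"` and `q = 3`, calling `locate(text, build_qgram_index(text, 3), 3, "acgt")` reports positions 0 and 4.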

45 citations

Journal Article

A New Periodicity Lemma

Kangmin Fan, Simon J. Puglisi, William F. Smyth, Andrew Turpin

01 Mar 2006-SIAM Journal on Discrete Mathematics

TL;DR: A periodicity lemma is presented that establishes limitations on the number and range of periodicities that can occur over a specified range of positions in a string $\boldsymbol{x}$, and is applied to specify corresponding limitations on the occurrence of runs.

Abstract: Given a string $\boldsymbol{x} = \boldsymbol{x}[1..n]$, a repetition of period $p$ in $\boldsymbol{x}$ is a substring $\boldsymbol{u}^r = \boldsymbol{x}[i..i+rp-1]$, $p = |\boldsymbol{u}|$, $r \ge 2$, where neither $\boldsymbol{u} = \boldsymbol{x}[i..i+p-1]$ nor $\boldsymbol{x}[i..i+(r+1)p-1]$ is a repetition. The maximum number of repetitions in any string $\boldsymbol{x}$ is well known to be $\Theta(n \log n)$. A run or maximal periodicity of period $p$ in $\boldsymbol{x}$ is a substring $\boldsymbol{u}^r\boldsymbol{t} = \boldsymbol{x}[i..i+rp+|\boldsymbol{t}|-1]$ of $\boldsymbol{x}$, where $\boldsymbol{u}^r$ is a repetition, $\boldsymbol{t}$ is a proper prefix of $\boldsymbol{u}$, and no repetition of period $p$ begins at position $i-1$ of $\boldsymbol{x}$ or ends at position $i+rp+|\boldsymbol{t}|$. In 2000 Kolpakov and Kucherov [J. Discrete Algorithms, 1 (2000), pp. 159-186] showed that the maximum number $\rho(n)$ of runs in any string $\boldsymbol{x}$ is $O(n)$, but their proof was nonconstructive and provided no specific constant of proportionality. At the same time, they presented experimental data strongly suggesting that $\rho(n) < n$. Related work by Fraenkel and Simpson [J. Combin. Theory Ser. A., 82 (1998), pp. 112-120] showed that the maximum number $\sigma(n)$ of distinct squares in any string $\boldsymbol{x}$ satisfies $\sigma(n) < 2n$, while experiment again encourages the belief that in fact $\sigma(n) < n$. In this paper, as a first step toward proving these conjectures, we present a periodicity lemma that establishes limitations on the number and range of periodicities that can occur over a specified range of positions in $\boldsymbol{x}$. We then apply this result to specify corresponding limitations on the occurrence of runs.
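
The definition of a run can be checked on small strings with a brute-force enumeration such as the sketch below. It is an illustration of the definition only, not a method from the paper; the function names are hypothetical, positions are 0-indexed (the abstract uses 1-indexed positions), and the running time is far from the linear-time algorithms known for this task.

```python
def smallest_period(s):
    """Smallest p >= 1 with s[k] == s[k + p] for all valid k (0 for an empty string)."""
    n = len(s)
    for p in range(1, n + 1):
        if all(s[k] == s[k + p] for k in range(n - p)):
            return p
    return n


def runs(x):
    """Return all runs of x as sorted (start, end, period) triples, 0-indexed, inclusive."""
    n = len(x)
    found = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if x[k] != x[k + p]:
                k += 1
                continue
            # Extend the stretch on which x[k] == x[k + p] as far as possible,
            # so the resulting substring cannot be extended left or right.
            start = k
            while k + p < n and x[k] == x[k + p]:
                k += 1
            end = k + p - 1
            # Keep it only if it holds at least two full periods and p is the
            # smallest period, i.e. the generator u is not itself a repetition.
            if end - start + 1 >= 2 * p and smallest_period(x[start:end + 1]) == p:
                found.append((start, end, p))
    return sorted(found)
```

On the string `"aabaab"` this reports three runs: `aa` at positions 0..1 and 3..4 with period 1, and the whole string with period 3, which is consistent with the conjectured bound $\rho(n) < n$ for $n = 6$.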

34 citations

Journal Article

Reconstructing a suffix array

Frantisek Franek¹, William F. Smyth¹

McMaster University¹

01 Dec 2006

TL;DR: In this article, the authors consider the "reconstruction" of a suffix array based on a given reordering of the alphabet, and describe simple time- and space-efficient algorithms that accomplish it.

Abstract: For certain problems (for example, computing repetitions and repeats, data compression applications) it is not necessary that the suffixes of a string represented in a suffix tree or suffix array should occur in lexicographical order (lexorder). It thus becomes of interest to study possible alternate orderings of the suffixes in these data structures, that may be easier to construct or more efficient to use. In this paper we consider the "reconstruction" of a suffix array based on a given reordering of the alphabet, and we describe simple time- and space-efficient algorithms that accomplish it.
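
To show what a suffix array looks like under a reordered alphabet, the sketch below sorts the suffixes directly using a caller-supplied character order. This is a naive construction from scratch, not the reconstruction algorithm described in the paper, which is not reproduced here; the function name `suffix_array` and the `alphabet_order` parameter are assumptions for illustration.

```python
def suffix_array(text, alphabet_order):
    """Suffix array of text where characters compare by their index in alphabet_order.

    Naive construction that compares whole suffixes; every character of text
    must appear in alphabet_order.
    """
    rank = {c: r for r, c in enumerate(alphabet_order)}
    return sorted(
        range(len(text)),
        key=lambda i: [rank[c] for c in text[i:]],
    )
```

With the usual order `"abn"` the suffix array of `"banana"` is `[5, 3, 1, 0, 4, 2]`; with the reordered alphabet `"nba"` it becomes `[4, 2, 0, 5, 3, 1]`.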

11 citations

Proceedings Article

Suffix arrays: what are they good for?

Simon J. Puglisi¹, William F. Smyth¹, Andrew Turpin²

Curtin University¹, RMIT University²

01 Jan 2006

TL;DR: This talk investigates whether suffix arrays can indeed replace inverted files, as suggested in recent literature on suffix arrays.

Abstract: Recently the theoretical community has displayed a flurry of interest in suffix arrays, and compressed suffix arrays. New, asymptotically optimal algorithms for construction, search, and compression of suffix arrays have been proposed. In this talk we will present our investigations into the practicalities of these latest developments. In particular, we investigate whether suffix arrays can indeed replace inverted files, as suggested in recent literature on suffix arrays.

2 citations

Journal Article

On optimal summable graphs

K.M. Koh, Mirka Miller, William F. Smyth, Y. Wang

01 Jan 2006-AKCE International Journal of Graphs and Combinatorics

TL;DR: In this paper, it was shown that if G is a k-optimum summable graph of order n, k ≥ 3 and k ≥ 2, then (1) $n \ge 2k$; (2) the complete bipartite graph $K_{k,n-k}$ is not a spanning subgraph of G.

1 citation

Proceedings Article

Song Classifications for Dancing

Manolis Christodoulakis, Costas S. Iliopoulos, Sohel Rahman, William F. Smyth¹

King's College London¹

01 Aug 2006

TL;DR: This paper presents a linear algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r; the algorithm can then be used to find which rhythm, from a given set of such rhythms, covers the largest part of the music sequence in question, and thus best describes that sequence.

Abstract: A fundamental problem in music is to classify songs according to their rhythm. A rhythm is represented by a sequence of Quick (Q) and Slow (S) symbols, which correspond to the (relative) duration of notes, such that S=QQ. In this paper we present a linear algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r. An efficient algorithm to solve this problem can then be used to find which rhythm, from a given set of such rhythms, covers the largest part of the music sequence in question, and thus best describes that sequence.

1 citation