Showing papers by "William F. Smyth" published in 2008

Open Access

Proceedings Article•DOI•

A Simple Algorithm for Computing the Lempel Ziv Factorization

Maxime Crochemore¹, Lucian Ilie², William F. Smyth•Institutions (2)

King's College London¹, University of Western Ontario²

25 Mar 2008

TL;DR: A simple, space-efficient algorithm for computing the Lempel-Ziv factorization of a string; for a string of length n over an integer alphabet, it runs in O(n) time independently of alphabet size and uses o(n) additional space.

Abstract: We give a space-efficient simple algorithm for computing the Lempel-Ziv factorization of a string. For a string of length n over an integer alphabet, it runs in O(n) time independently of alphabet size and uses o(n) additional space.
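
For readers unfamiliar with the term, the following is a minimal sketch of what an LZ factorization is, assuming one common definition: each factor is either a letter that has not occurred before, or the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed). It is a naive, super-linear illustration, not the O(n)-time algorithm of the paper, and the function name `lz_factorize` is hypothetical.

```python
def lz_factorize(x):
    """Naive LZ factorization of x (illustrative only, not linear time).

    Assumed definition: each factor is a new letter, or the longest
    prefix of x[i:] that also starts at some earlier position j < i.
    """
    factors = []
    i, n = 0, len(x)
    while i < n:
        best_len = 0
        for j in range(i):  # try every earlier starting position
            k = 0
            # Extend the match; the earlier copy may overlap position i.
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            best_len = max(best_len, k)
        step = max(best_len, 1)  # a new letter forms a factor of length 1
        factors.append(x[i:i + step])
        i += step
    return factors


print(lz_factorize("abaababa"))  # ['a', 'b', 'a', 'aba', 'ba']
```

Concatenating the factors always reproduces the input, which is a convenient sanity check for any implementation.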

85 citations

Journal Article•DOI•

How many runs can a string contain

Simon J. Puglisi¹, Jamie Simpson², William F. Smyth²•Institutions (2)

RMIT University¹, Curtin University²

01 Jul 2008-Theoretical Computer Science

TL;DR: In 2000 Kolpakov and Kucherov showed that the maximum number ρ(n) of runs in any string x[1..n] is O(n), but their proof was nonconstructive and provided no specific constant of proportionality.

75 citations

Journal Article•DOI•

Lempel-Ziv Factorization Using Less Time & Space

Gang Chen¹, Simon J. Puglisi², William F. Smyth¹,³•Institutions (3)

McMaster University¹, RMIT University², Curtin University³

11 Apr 2008-Mathematics in Computer Science

TL;DR: Introduces a collection of fast, space-efficient algorithms for LZ factorization, also based on suffix arrays, that in theory as well as in many practical circumstances are superior to those previously proposed; one family achieves true Θ(n)-time alphabet-independent processing in the worst case by avoiding tree structures altogether.

Abstract: For 30 years the Lempel–Ziv factorization LZ_x of a string x = x[1..n] has been a fundamental data structure of string processing, especially valuable for string compression and for computing all the repetitions (runs) in x. Traditionally the standard method for computing LZ_x was based on Θ(n)-time (or, depending on the measure used, O(n log n)-time) processing of the suffix tree ST_x of x. Recently Abouelhoda et al. proposed an efficient Lempel–Ziv factorization algorithm based on an “enhanced” suffix array – that is, a suffix array SA_x together with supporting data structures, principally an “interval tree”. In this paper we introduce a collection of fast space-efficient algorithms for LZ factorization, also based on suffix arrays, that in theory as well as in many practical circumstances are superior to those previously proposed; one family out of this collection achieves true Θ(n)-time alphabet-independent processing in the worst case by avoiding tree structures altogether.

64 citations

Journal Article•DOI•

Fast pattern-matching on indeterminate strings

Jan Holub¹, William F. Smyth², Shu Wang²•Institutions (2)

Czech Technical University in Prague¹, McMaster University²

01 Mar 2008-Journal of Discrete Algorithms

TL;DR: In a string x on an alphabet Σ, a position i is said to be indeterminate iff x[i] may be any one of a specified subset {λ_1, λ_2, ..., λ_j} of Σ.
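
To make the definition concrete, here is a minimal sketch that models each position of an indeterminate string as a set of letters. It assumes the usual convention that two positions match iff their sets share at least one letter, which the summary above does not state explicitly. The brute-force search and the names `letters` and `find_all` are hypothetical and are not the fast algorithm of the paper.

```python
def letters(s):
    """Turn a regular string into a list of singleton letter sets."""
    return [frozenset(c) for c in s]


def find_all(text, pattern):
    """Brute-force matching on indeterminate strings.

    Both arguments are lists of sets of letters. Assumption: two
    positions match iff their sets have a non-empty intersection.
    """
    n, m = len(text), len(pattern)
    return [i for i in range(n - m + 1)
            if all(text[i + j] & pattern[j] for j in range(m))]


# Position 1 of the text is indeterminate: it may be 'a' or 'c'.
text = [frozenset("a"), frozenset("ac"), frozenset("g"),
        frozenset("c"), frozenset("g")]
print(find_all(text, letters("cg")))  # [1, 3]
```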

60 citations

Proceedings Article•

An adaptive hybrid pattern-matching algorithm on indeterminate strings

William F. Smyth, Shu Wang¹, Mao Yu•Institutions (1)

McMaster University¹

01 Jan 2008

TL;DR: A hybrid pattern-matching algorithm that works on both regular and indeterminate strings, avoids using the border array, and is superior in overall performance to its two component algorithms, perhaps a general advantage of hybrid algorithms.

Abstract: We describe a hybrid pattern-matching algorithm that works on both regular and indeterminate strings. This algorithm is inspired by the recently proposed hybrid algorithm FJS [11] and its indeterminate successor [15]. However, as discussed in this paper, because of the special properties of indeterminate strings, it is not straightforward to directly migrate FJS to an indeterminate version. Our new algorithm combines two fast pattern-matching algorithms, Shift-And and BMS (the Sunday variant of the Boyer-Moore algorithm), and is highly adaptive to the nature of the text being processed. It avoids using the border array, therefore avoids some of the cases that are awkward for indeterminate strings. Although not always the fastest in individual test cases, our new algorithm is superior in overall performance to its two component algorithms — perhaps a general advantage of hybrid algorithms.
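
As background on one of the two components named in the abstract, the following is a minimal sketch of the textbook Shift-And algorithm on regular strings. It is not the hybrid algorithm of the paper and does not handle indeterminate letters; the function name `shift_and` is hypothetical. Python integers are unbounded, so the pattern length is not limited by a machine word here.

```python
def shift_and(text, pattern):
    """Textbook Shift-And: return all start positions of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit j set iff pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0              # bit j set iff pattern[:j+1] ends at current char
    hits = []
    for i, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(i - m + 1)
    return hits


print(shift_and("abracadabra", "abra"))  # [0, 7]
```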

28 citations

Proceedings Article•

Fast Optimal Algorithms for Computing All the Repeats in a String

Simon J. Puglisi¹, William F. Smyth, Munina Yusufu²•Institutions (2)

RMIT University¹, McMaster University²

01 Jan 2008

TL;DR: Describes a new algorithm PSY1 that, based on suffix array construction, computes all the complete nonextendible repeats in x of length p ≥ pmin, and that is an order of magnitude faster than the two other algorithms previously proposed for this problem.

Abstract: Given a string x = x[1..n] on an alphabet of size α, and a threshold pmin ≥ 1, we first describe a new algorithm PSY1 that, based on suffix array construction, computes all the complete nonextendible repeats in x of length p ≥ pmin. PSY1 executes in Θ(n) time independent of alphabet size and is an order of magnitude faster than the two other algorithms previously proposed for this problem. Second, we describe a new fast algorithm PSY2 for computing all complete supernonextendible repeats in x that also executes in Θ(n) time independent of alphabet size, thus asymptotically faster than methods previously proposed. Both algorithms require 9n bytes of storage, including preprocessing (with a minor caveat for PSY1). We conclude with a brief discussion of applications to bioinformatics and data compression.

26 citations

Book Chapter•DOI•

New Perspectives on the Prefix Array

William F. Smyth¹, Shu Wang¹•Institutions (1)

McMaster University¹

10 Nov 2008

TL;DR: Describes two Θ(n)-time algorithms, PL1 and PL2, that compute POS/LEN for regular strings using only 8m bytes of storage in addition to the n bytes required for x, and an extension IPL of PL1 that computes POS/LEN for indeterminate strings in O(n²) worst-case time (though generally much faster), still using only 8m bytes of additional storage.

Abstract: In this paper we consider the prefix array π = π[1..n] of a string x = x[1..n] in which π[1] = 0 and, for i > 1, π[i] = k iff k is the largest integer such that x[1..k] = x[i..i+k-1]. The prefix array π is closely related to the border array β: an integer array [1..n] such that β[i] = k iff the length of the longest border of x[1..i] is k. Border arrays or their variants are used in many string algorithms and prefix arrays can be used directly for pattern-matching. It is well known that for regular strings π provides all the information that β does; we show however that for indeterminate strings (those containing entries that match a subset of the alphabet) π actually provides more information, in fact still enabling all the borders of every prefix of x to be specified. Since a lot of the entries of π are expected to be zeros, it is natural to represent π in compressed form using integer arrays POS[1..m] and LEN[1..m], where m is the number of nonzero entries in π and π[POS[j]] = LEN[j] iff the j-th nonzero entry in π occurs in position POS[j] and takes the value LEN[j]. The expected value of m is n/σ - 1, where σ is the alphabet size. The straightforward way of computing POS/LEN requires computing π first, therefore requires O(n) extra space. We describe two Θ(n)-time algorithms PL1 & PL2 to compute POS/LEN for regular strings using only 8m bytes of storage in addition to the n bytes required for x. PL1 requires about one-third the time of the standard border array algorithm MP on English-language strings; PL2 executes faster than MP on both English-language and highly periodic strings on {a, b}. For indeterminate strings, we describe an extension IPL of PL1 that computes POS/LEN in O(n²) worst-case time (though generally much faster), still using only 8m bytes of additional storage. For both regular and indeterminate strings, the compressed form of π can be used for efficient pattern-matching.
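
The definitions in this abstract can be illustrated with a minimal sketch that takes the "straightforward" route the abstract mentions: compute π first, then compress it into POS/LEN. This is a naive quadratic illustration for regular strings, not PL1 or PL2, and the function names are hypothetical. The list `pi` is 0-indexed (so `pi[0]` corresponds to π[1]), while the positions stored in POS are 1-based to follow the abstract's notation.

```python
def prefix_array(x):
    """Naive prefix array: pi[0] = 0 and, for i > 0, pi[i] is the length
    of the longest prefix of x that also occurs starting at index i."""
    n = len(x)
    pi = [0] * n
    for i in range(1, n):
        k = 0
        while i + k < n and x[k] == x[i + k]:
            k += 1
        pi[i] = k
    return pi


def compress(pi):
    """Keep only the nonzero entries of pi as parallel arrays POS/LEN.
    POS holds 1-based positions, matching the abstract's indexing."""
    pos = [i + 1 for i, v in enumerate(pi) if v > 0]
    length = [v for v in pi if v > 0]
    return pos, length


pi = prefix_array("abaababa")
print(pi)            # [0, 0, 1, 3, 0, 3, 0, 1]
print(compress(pi))  # ([3, 4, 6, 8], [1, 3, 3, 1])
```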

17 citations

Journal Article•DOI•

Identifying rhythms in musical texts

Manolis Christodoulakis¹, Costas S. Iliopoulos, Mohammad Sohel Rahman¹, William F. Smyth•Institutions (1)

King's College London¹

01 Feb 2008

TL;DR: This paper presents an efficient algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r.

Abstract: A fundamental problem in music is to classify songs according to their rhythm. A rhythm is represented by a sequence of “Quick” (Q) and “Slow” (S) symbols, which correspond to the (relative) duration of notes, such that S = 2Q. In this paper, we present an efficient algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r.

8 citations