
Showing papers by "William F. Smyth" published in 2008


Proceedings ArticleDOI
25 Mar 2008
TL;DR: A simple, space-efficient algorithm for computing the Lempel-Ziv factorization of a string of length n over an integer alphabet; it runs in O(n) time independently of alphabet size and uses o(n) additional space.
Abstract: We give a space-efficient simple algorithm for computing the Lempel-Ziv factorization of a string. For a string of length n over an integer alphabet, it runs in O(n) time independently of alphabet size and uses o(n) additional space.

85 citations


Journal ArticleDOI
TL;DR: In 2000 Kolpakov and Kucherov showed that the maximum number ρ(n) of runs in any string x[1..n] is O(n), but their proof was nonconstructive and provided no specific constant of proportionality.

75 citations


Journal ArticleDOI
TL;DR: Introduces a collection of fast, space-efficient algorithms for LZ factorization, also based on suffix arrays, that in theory as well as in many practical circumstances are superior to those previously proposed; one family achieves true Θ(n)-time alphabet-independent processing in the worst case by avoiding tree structures altogether.
Abstract: For 30 years the Lempel–Ziv factorization LZ_x of a string x = x[1..n] has been a fundamental data structure of string processing, especially valuable for string compression and for computing all the repetitions (runs) in x. Traditionally the standard method for computing LZ_x was based on Θ(n)-time (or, depending on the measure used, O(n log n)-time) processing of the suffix tree ST_x of x. Recently Abouelhoda et al. proposed an efficient Lempel–Ziv factorization algorithm based on an “enhanced” suffix array, that is, a suffix array SA_x together with supporting data structures, principally an “interval tree”. In this paper we introduce a collection of fast space-efficient algorithms for LZ factorization, also based on suffix arrays, that in theory as well as in many practical circumstances are superior to those previously proposed; one family out of this collection achieves true Θ(n)-time alphabet-independent processing in the worst case by avoiding tree structures altogether.
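
To make the object being computed concrete, the following is a minimal sketch of a Lempel-Ziv factorization in Python. It is a naive quadratic-time illustration, not any of the suffix-array-based algorithms the abstract describes. The function name `lz_factorize` and the convention that a previously unseen letter forms a factor of length 1 are assumptions made for this example.

```python
def lz_factorize(x):
    """Naive LZ factorization (illustrative only, worst case quadratic or worse).

    Each factor is either a single letter with no earlier occurrence, or the
    longest substring starting at the current position that also starts at
    some earlier position (the two occurrences may overlap).
    """
    factors = []
    i, n = 0, len(x)
    while i < n:
        best = 0
        # Try every earlier starting position j < i.
        for j in range(i):
            k = 0
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            best = max(best, k)
        # A letter that has not occurred before forms a factor of length 1.
        length = max(best, 1)
        factors.append(x[i:i + length])
        i += length
    return factors


if __name__ == "__main__":
    print(lz_factorize("abaababa"))
```

Concatenating the returned factors always reproduces the input string, which is a convenient sanity check when comparing against a faster implementation.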

64 citations


Journal ArticleDOI
TL;DR: In a string x on an alphabet Σ, a position i is said to be indeterminate iff x[i] may be any one of a specified subset {λ_1, λ_2, ..., λ_j} of Σ.

60 citations


Proceedings Article
01 Jan 2008
TL;DR: A hybrid pattern-matching algorithm that works on both regular and indeterminate strings, avoids using the border array, and is superior in overall performance to its two component algorithms, perhaps a general advantage of hybrid algorithms.
Abstract: We describe a hybrid pattern-matching algorithm that works on both regular and indeterminate strings. This algorithm is inspired by the recently proposed hybrid algorithm FJS [11] and its indeterminate successor [15]. However, as discussed in this paper, because of the special properties of indeterminate strings, it is not straightforward to directly migrate FJS to an indeterminate version. Our new algorithm combines two fast pattern-matching algorithms, Shift-And and BMS (the Sunday variant of the Boyer-Moore algorithm), and is highly adaptive to the nature of the text being processed. It avoids using the border array and therefore avoids some of the cases that are awkward for indeterminate strings. Although not always the fastest in individual test cases, our new algorithm is superior in overall performance to its two component algorithms, perhaps a general advantage of hybrid algorithms.
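
As an illustration of one of the two components named in the abstract, here is a minimal Python sketch of Shift-And matching extended to indeterminate strings. It is not the hybrid algorithm itself and omits the BMS component entirely. Representing each position as a Python set of letters, and the function name `shift_and_indeterminate`, are assumptions made for this example; two positions are taken to match when their letter sets intersect.

```python
def shift_and_indeterminate(pattern, text):
    """Shift-And over indeterminate strings (illustrative sketch).

    pattern, text: sequences of sets of letters. A regular letter is a
    one-element set. Returns the 0-based start positions of all matches.
    """
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit j set iff pattern position j admits letter c.
    masks = {}
    for j, letters in enumerate(pattern):
        for c in letters:
            masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)
    state = 0
    hits = []
    for i, letters in enumerate(text):
        # A text position matches pattern position j if they share a letter,
        # so its mask is the OR of the masks of all letters it admits.
        t_mask = 0
        for c in letters:
            t_mask |= masks.get(c, 0)
        state = ((state << 1) | 1) & t_mask
        if state & accept:
            hits.append(i - m + 1)
    return hits


if __name__ == "__main__":
    pattern = [{"a"}, {"b", "c"}]
    text = [{"a"}, {"c"}, {"a", "b"}, {"b"}, {"a"}]
    print(shift_and_indeterminate(pattern, text))
```

Because each pattern position is compared directly with each text position, this approach does not rely on matching being transitive, which is the property that fails for indeterminate strings and makes border-array methods awkward.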

28 citations


Proceedings Article
01 Jan 2008
TL;DR: A new algorithm PSY1 is described that, based on suffix array construction, computes all the complete nonextendible repeats in x of length p ≥ p_min and is an order of magnitude faster than the two other algorithms previously proposed for this problem.
Abstract: Given a string x = x[1..n] on an alphabet of size α, and a threshold p_min ≥ 1, we first describe a new algorithm PSY1 that, based on suffix array construction, computes all the complete nonextendible repeats in x of length p ≥ p_min. PSY1 executes in Θ(n) time independent of alphabet size and is an order of magnitude faster than the two other algorithms previously proposed for this problem. Second, we describe a new fast algorithm PSY2 for computing all complete supernonextendible repeats in x that also executes in Θ(n) time independent of alphabet size, thus asymptotically faster than methods previously proposed. Both algorithms require 9n bytes of storage, including preprocessing (with a minor caveat for PSY1). We conclude with a brief discussion of applications to bioinformatics and data compression.

26 citations


Book ChapterDOI
10 Nov 2008
TL;DR: Describes two Θ(n)-time algorithms PL1 and PL2 to compute POS/LEN for regular strings using only 8m bytes of storage in addition to the n bytes required for x, and an extension IPL of PL1 that computes POS/LEN in O(n^2) worst-case time (though generally much faster), still using only 8m bytes of additional storage.
Abstract: In this paper we consider the prefix array π = π[1..n] of a string x = x[1..n] in which π[1] = 0 and, for i > 1, π[i] = k iff k is the largest integer such that x[1..k] = x[i..i+k-1]. The prefix array π is closely related to the border array β: an integer array β[1..n] such that β[i] = k iff the length of the longest border of x[1..i] is k. Border arrays or their variants are used in many string algorithms and prefix arrays can be used directly for pattern-matching. It is well known that for regular strings π provides all the information that β does; we show however that for indeterminate strings (those containing entries that match a subset of the alphabet) π actually provides more information, in fact still enabling all the borders of every prefix of x to be specified. Since many of the entries of π are expected to be zeros, it is natural to represent π in compressed form using integer arrays POS[1..m] and LEN[1..m], where m is the number of nonzero entries in π and π[POS[j]] = LEN[j] iff the j-th nonzero entry in π occurs in position POS[j] and takes the value LEN[j]. The expected value of m is n/σ - 1, where σ is the alphabet size. The straightforward way of computing POS/LEN requires computing π first and therefore requires O(n) extra space. We describe two Θ(n)-time algorithms PL1 & PL2 to compute POS/LEN for regular strings using only 8m bytes of storage in addition to the n bytes required for x. PL1 requires about one-third the time of the standard border array algorithm MP on English-language strings; PL2 executes faster than MP on both English-language and highly periodic strings on {a, b}. For indeterminate strings, we describe an extension IPL of PL1 that computes POS/LEN in O(n^2) worst-case time (though generally much faster), still using only 8m bytes of additional storage. For both regular and indeterminate strings, the compressed form of π can be used for efficient pattern-matching.
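
The "straightforward way" mentioned in the abstract, computing π first and then compressing it, can be sketched in Python as follows. This is not PL1, PL2, or IPL: it uses O(n) extra space, handles regular strings only, and computes π with the standard linear-time Z-algorithm technique. The function names `prefix_array` and `compress` are assumptions made for this example; POS is reported 1-based to match the abstract's indexing.

```python
def prefix_array(x):
    """Prefix array of a regular string x, returned as a 0-based Python list.

    pi[i] is the length of the longest substring starting at i that is also
    a prefix of x; pi[0] = 0 follows the abstract's convention pi[1] = 0.
    """
    n = len(x)
    pi = [0] * n
    l = r = 0  # x[l:r] is the rightmost known substring matching a prefix
    for i in range(1, n):
        if i < r:
            # Reuse the value already computed inside the matched window.
            pi[i] = min(r - i, pi[i - l])
        while i + pi[i] < n and x[pi[i]] == x[i + pi[i]]:
            pi[i] += 1
        if i + pi[i] > r:
            l, r = i, i + pi[i]
    return pi


def compress(pi):
    """Compress pi into POS/LEN arrays holding only its nonzero entries."""
    pos = [i + 1 for i, v in enumerate(pi) if v > 0]  # 1-based positions
    length = [v for v in pi if v > 0]
    return pos, length


if __name__ == "__main__":
    pi = prefix_array("abaababa")
    print(pi)
    print(compress(pi))
```

The compressed pair needs storage proportional to m, the number of nonzero entries, rather than to n; the intermediate list `pi` built here is exactly the overhead that the algorithms in the abstract are designed to avoid.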

17 citations


Journal ArticleDOI
01 Feb 2008
TL;DR: This paper presents an efficient algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r.
Abstract: A fundamental problem in music is to classify songs according to their rhythm. A rhythm is represented by a sequence of “Quick” (Q) and “Slow” (S) symbols, which correspond to the (relative) duration of notes, such that S = 2Q. In this paper, we present an efficient algorithm for locating the maximum-length substring of a music text t that can be covered by a given rhythm r.

8 citations