scispace - formally typeset
Author

William F. Smyth

Other affiliations: University of Western Australia, Murdoch University, IBM
Bio: William F. Smyth is an academic researcher from McMaster University. The author has contributed to research in topics: String (computer science) & Substring. The author has an h-index of 32, has co-authored 177 publications receiving 3547 citations. Previous affiliations of William F. Smyth include University of Western Australia & Murdoch University.


Papers
Journal ArticleDOI
TL;DR: A survey of suffix array construction algorithms can be found in this article, with a comparison of the algorithms' worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
Abstract: In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms' worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
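To illustrate the data structure the survey covers, a suffix array can be built with a deliberately naive sketch (sort all suffixes lexicographically). This is illustrative only: it runs in roughly O(n² log n) time, whereas the surveyed construction algorithms achieve O(n log n) or even O(n) with far less working space.

```python
def suffix_array(s):
    """Naive suffix-array construction: return the starting indices of all
    suffixes of s, sorted in lexicographic order of the suffixes.

    Purely illustrative; real construction algorithms avoid materializing
    and comparing whole suffixes.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The sorted indices correspond to the suffixes "a", "ana", "anana", "banana", "na", "nana" — exactly the ordering a suffix tree would yield, stored in a plain integer array.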

323 citations

Book
01 Mar 2003
TL;DR: A basic general introduction to the algorithms (methods) that efficiently compute patterns in strings, which is fundamental to many fields: molecular biology, cryptography, data compression, computer vision, speech recognition, computational geometry, and many others.
Abstract: A string is just a sequence of letters. But strings can be massive. Plant and animal genomes are strings billions of letters long over a simple alphabet. Internet traffic among billions of websites is a collection of strings that amount to quadrillions of computer bits every day. Such strings are regularly searched, probably millions of times a day, for patterns of all kinds - genomic codes for genes and chromosomes, indicators of terrorist activity, and many others. The search for patterns is fundamental to many fields: molecular biology, cryptography, data compression, computer vision, speech recognition, computational geometry. This book provides a basic general introduction to the algorithms (methods) that efficiently compute patterns in strings. It focuses on results that can be explained with reasonable economy and simplicity, but its 250 references also enable the reader to access current state-of-the-art methodology.
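The basic task the book studies — reporting every occurrence of a pattern in a text — can be shown with a toy sketch built on Python's own substring search, not any particular algorithm from the book:

```python
def find_all(text, pattern):
    """Return the start positions of every (possibly overlapping)
    occurrence of pattern in text, using repeated str.find calls."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)  # resume just past the last start
    return positions

print(find_all("abracadabra", "abra"))  # [0, 7]
```

The efficient algorithms the book covers (e.g. linear-time pattern matching) achieve this without re-scanning the text from scratch at each candidate position.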

245 citations

Journal ArticleDOI
TL;DR: A simple FAS (feedback arc set) algorithm is presented that guarantees a good (though not optimal) performance bound, executes in O(m) time, and achieves the same asymptotic performance bound as the Berger-Shor algorithm.
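The TL;DR gives no algorithmic detail, but the flavor of a linear-time feedback-arc-set heuristic can be sketched with the classic ordering argument (illustrative only; this is not necessarily the paper's method): for any linear order of the vertices, removing either all forward edges or all backward edges leaves an acyclic graph, so the smaller of the two sets is a feedback arc set of size at most m/2.

```python
def small_feedback_arc_set(n, edges):
    """Ordering-based feedback arc set heuristic (illustrative sketch).

    Fix any linear order of the n vertices; every edge is then either
    "forward" or "backward" with respect to that order. Deleting all
    edges of one direction leaves the rest pointing the same way, which
    is acyclic, so the smaller direction class is a valid FAS.
    """
    order = list(range(n))                      # any order gives the m/2 bound
    rank = {v: i for i, v in enumerate(order)}
    forward = [(u, v) for u, v in edges if rank[u] < rank[v]]
    backward = [(u, v) for u, v in edges if rank[u] > rank[v]]
    return min(forward, backward, key=len)
```

On the 3-cycle (0→1, 1→2, 2→0) this returns the single backward edge (2, 0), whose removal breaks the cycle. Better heuristics (such as the one the paper analyzes) choose the vertex order carefully rather than arbitrarily.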

211 citations

Journal ArticleDOI
TL;DR: This paper introduces an array γ = γ[1..n], called the cover array, in which each element γ[i], 1 ≤ i ≤ n, is the length of the longest proper cover of x[1..i], or zero if no such cover exists.
Abstract: Let x denote a given nonempty string of length n = |x| . A proper substring u of x is a proper cover of x if and only if every position of x lies within an occurrence of u within x . This paper introduces an array γ = γ[1..n] called the cover array in which each element γ[i] , 1 ≤ i ≤ n , is the length of the longest proper cover of x[1..i] or zero if no such cover exists. In fact it turns out that γ describes all the covers of every prefix of x . Several interesting properties of γ are established, and a simple algorithm is presented that computes γ on-line in Θ(n) time using Θ(n) additional space. Thus the new algorithm computes for all prefixes of x information that previous cover algorithms could compute only for x itself, and does so with no increase in space or time complexity.
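The definitions above can be made concrete with a brute-force sketch: a helper that tests whether a string u covers a string p (every position of p lies inside some occurrence of u), and a cover-array routine that tries candidate lengths from longest down. This is for illustration only — it is quadratic or worse, whereas the paper's algorithm computes γ on-line in Θ(n) time.

```python
def covers(u, p):
    """True iff every position of p lies within an occurrence of u."""
    m, last_end = len(u), 0
    i = p.find(u)
    while i != -1:
        if i > last_end:          # position last_end is not covered
            return False
        last_end = i + m
        i = p.find(u, i + 1)
    return last_end == len(p)     # the final position must be covered too

def cover_array(x):
    """Brute-force cover array: gamma[i-1] is the length of the longest
    proper cover of x[:i], or 0 if none exists. Illustrative; the paper's
    on-line algorithm achieves Theta(n) time and space."""
    gamma = []
    for i in range(1, len(x) + 1):
        p = x[:i]
        best = 0
        for m in range(i - 1, 0, -1):   # longest candidate first
            if covers(p[:m], p):
                best = m
                break
        gamma.append(best)
    return gamma

print(cover_array("abababa"))  # [0, 0, 0, 2, 3, 4, 5]
```

For x = "abababa", the prefix "abab" has longest proper cover "ab" (occurrences at positions 1 and 3 cover every position), so γ[4] = 2; note how γ on each prefix encodes the covers of all earlier prefixes, as the paper establishes.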

95 citations

Journal ArticleDOI
TL;DR: This paper provides a characterization of all the squares in F, hence in every prefix Fn; this characterization naturally gives rise to an algorithm which specifies all the squares of Fn in an appropriate encoding.

88 citations


Cited by
Journal ArticleDOI


08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one: an odd beast, an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Journal Article
TL;DR: In this survey I have collected everything I could find on graph labeling, including techniques that have appeared in journals that are not widely available.
Abstract: A graph labeling is an assignment of integers to the vertices or edges, or both, subject to certain conditions. Graph labelings were first introduced in the late 1960s. In the intervening years dozens of graph labeling techniques have been studied in over 1000 papers. Finding out what has been done for any particular kind of labeling and keeping up with new discoveries is difficult because of the sheer number of papers and because many of the papers have appeared in journals that are not widely available. In this survey I have collected everything I could find on graph labeling. For the convenience of the reader the survey includes a detailed table of contents and index.
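As a concrete instance of the conditions such labelings must satisfy, one classic type covered by surveys of this kind is the graceful labeling: vertices receive distinct labels from {0, ..., m} (m = number of edges) such that the induced edge labels |f(u) - f(v)| are exactly {1, ..., m}. A small checker (illustrative; the survey treats many other labeling variants):

```python
def is_graceful(vertex_labels, edges):
    """Check whether vertex_labels (a dict vertex -> int) is a graceful
    labeling of the graph given by edges: distinct vertex labels drawn
    from {0..m}, with edge labels |f(u)-f(v)| forming exactly {1..m}."""
    m = len(edges)
    labels = set(vertex_labels.values())
    if len(labels) != len(vertex_labels) or not labels <= set(range(m + 1)):
        return False
    edge_labels = {abs(vertex_labels[u] - vertex_labels[v]) for u, v in edges}
    return edge_labels == set(range(1, m + 1))
```

For the path a-b-c, labeling a=0, b=2, c=1 is graceful (edge labels 2 and 1), while a=0, b=1, c=2 is not (both edges get label 1).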

2,367 citations

Journal ArticleDOI
TL;DR: LAST, the open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition, and guarantees that the number of matches increases linearly, instead of quadratically, with sequence length.
Abstract: The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billion-base DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.
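The core idea — grow a seed until it is rare enough in the reference, so seed length adapts to local composition — can be sketched in a toy form. This is not LAST's implementation (which uses suffix-array machinery for efficiency); the function name, the brute-force occurrence counting, and the max_count threshold here are illustrative assumptions.

```python
def count_occurrences(text, s):
    """Count possibly-overlapping occurrences of s in text (brute force)."""
    n = 0
    i = text.find(s)
    while i != -1:
        n += 1
        i = text.find(s, i + 1)
    return n

def adaptive_seed(query, pos, reference, max_count=2):
    """Toy adaptive-seed sketch: starting at query[pos], extend the seed
    one letter at a time until it occurs at most max_count times in the
    reference. In repetitive regions seeds grow long; in rare regions
    they stay short, which caps the number of candidate matches."""
    end = pos + 1
    while end <= len(query):
        seed = query[pos:end]
        if count_occurrences(reference, seed) <= max_count:
            return seed
        end += 1
    return query[pos:]   # never became rare; return the whole remainder
```

Against the reference "aaaaab", the seed at position 0 of "aab" must grow to the full "aab" before it is rare, whereas against "abcdefg" the single letter "b" already suffices — exactly the fixed-length-versus-adaptive contrast the abstract describes.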

1,097 citations

Journal ArticleDOI
TL;DR: This paper provides a qualitative comparison and evaluation of the current state-of-the-art in clone detection techniques and tools, together with a taxonomy of editing scenarios that produce different clone types and an evaluation of current clone detectors against that taxonomy.

989 citations