Showing papers by "William F. Smyth" published in 2021


Journal ArticleDOI
TL;DR: V-order is a total order on strings that determines an instance of Unique Maximal Factorization Families (UMFFs), a generalization of Lyndon words; the paper describes a methodology for efficiently computing the FM-Index in V-order.

3 citations


Journal ArticleDOI
TL;DR: A new model for the representation of strings, regular or indeterminate, is proposed, and a linear-time algorithm is described to determine whether or not a string x = x[1..n] is regular and, if so, to replace it by a lexicographically least (lex-least) string y whose entries are all single letters.

1 citation


Proceedings ArticleDOI
18 May 2021
TL;DR: In this paper, the authors provide an overview of three central methodological areas: pattern matching; repetitions (of both adjacent and non-adjacent repeating substrings); and string covering and compression.
Abstract: Data analytics may conveniently be divided into four stages: preparation, preprocessing, analysis, and post-processing. Especially in the second and third of these, where the data is cleaned, filtered and analyzed, string processing algorithms are fundamental. Applicable string methodology especially includes pattern matching (dozens of competing algorithms) and algorithms that compute repetitions and other forms of regularity. These are supported by powerful data structures (suffix array, prefix table, Burrows-Wheeler Transform, Lyndon array, and many others), developed and refined over the last 50 years. In this paper we provide an overview of three central methodological areas:
• pattern matching;
• repetitions (of both adjacent and non-adjacent repeating substrings);
• string covering and compression.
Each of these methodologies deals with both exact and approximate matches in the data provided. We outline several current applications to data analytics, in particular bioinformatics, information security and image analysis, all of them therefore positioned for future extension as string methodologies continue their rapid development.

1 citation