Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1,903 publications have appeared within this topic, receiving 62,352 citations. The topic is also known as fuzzy string-searching algorithm and fuzzy string-matching algorithm.

Papers

Proceedings Article•DOI•

MoTeX: A word-based HPC tool for MoTif eXtraction

Solon P. Pissis¹, Alexandros Stamatakis¹, Pavlos Pavlidis²•Institutions (2)

Heidelberg Institute for Theoretical Studies¹, Foundation for Research & Technology – Hellas²

22 Sep 2013

TL;DR: MoTeX, the first high-performance computing (HPC) tool for MoTif eXtraction from large-scale datasets, is introduced and it is shown that it matches or outperforms competing tools in terms of runtime efficiency.

Abstract: Motivation: Identifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motifs may correspond to functional elements in DNA, RNA, or protein molecules. Motifs may also correspond to whole loci whose sequences are highly similar because of recent duplication (e.g., transposable elements or recently duplicated genes). A DNA motif is a nucleic acid sequence that has a specific biological function, for instance encoding the DNA binding sites for a regulatory protein (transcription factor). Results: In this article, we introduce MoTeX, the first high-performance computing (HPC) tool for MoTif eXtraction from large-scale datasets. It uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. MoTeX comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. We show that MoTeX produces similar and partially identical results to current state-of-the-art tools with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms competing tools in terms of runtime efficiency. The MPI-based version of MoTeX requires only one hour to process all human genes on 1056 processors, while current sequential programmes require more than two months for this task. Availability: http://www.exelixis-lab.org/motex (open-source code)
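To make the fixed-length approximate string matching problem mentioned in this abstract concrete, here is a minimal sketch. It assumes Hamming distance (mismatch count) as the error measure, which the abstract does not specify, and it is a naive window scan rather than the algorithms MoTeX uses. The function name and parameters are hypothetical.

```python
def fixed_length_matches(text: str, pattern: str, max_errors: int) -> list[int]:
    """Return start positions of windows of len(pattern) in text that
    differ from pattern in at most max_errors positions.

    Assumption: errors are counted as mismatches (Hamming distance).
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for offset in range(m):
            if text[start + offset] != pattern[offset]:
                mismatches += 1
                if mismatches > max_errors:
                    break  # this window already exceeds the error budget
        if mismatches <= max_errors:
            hits.append(start)
    return hits


print(fixed_length_matches("ACGTACGA", "ACGT", 1))  # [0, 4]
```

The scan costs time proportional to the text length times the pattern length in the worst case, so it is only meant to illustrate the problem statement, not large-scale use.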

12 citations

Journal Article•DOI•

siEDM: An Efficient String Index and Search Algorithm for Edit Distance with Moves

Yoshimasa Takabatake¹, Kenta Nakashima¹, Tetsuji Kuboyama², Yasuo Tabei, Hiroshi Sakamoto¹•Institutions (2)

Kyushu Institute of Technology¹, Gakushuin University²

15 Apr 2016-Algorithms

TL;DR: The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM.

Abstract: Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind the edit sensitive parsing (ESP), an efficient algorithm enabling approximately computing EDM with guarantees of upper and lower bounds for the exact EDM. siEDM efficiently prunes the space for searching query strings by the proposed method, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM’s efficiency.

12 citations

Journal Article•DOI•

Swiftly Computing Center Strings

Franziska Hufsky¹ ², Leon Kuchenbecker³, Katharina Jahn³, Jens Stoye³, Sebastian Böcker²•Institutions (3)

Max Planck Society¹, University of Jena², Bielefeld University³

19 Apr 2011-BMC Bioinformatics

TL;DR: This paper introduces data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions, and describes a novel iterative search strategy that is efficient in practice, where some of the reduction techniques can be applied.

Abstract: The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP-complete. In this paper, we focus on exact methods for the problem that are also swift in application. We first introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions. We describe how to use this information to speed up two previously published search tree algorithms. Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for two different data sets from a biological application. We find that the running time for computing the optimal center string is dominated by the subroutine calls for d = d_opt - 1 and d = d_opt. Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions. We find that this speeds up computations considerably.

12 citations

Journal Article•DOI•

Exploring pianist performance styles with evolutionary string matching

Søren Tjagvad Madsen¹, Gerhard Widmer²•Institutions (2)

Austrian Research Institute for Artificial Intelligence¹, Johannes Kepler University of Linz²

01 Aug 2006-International Journal on Artificial Intelligence Tools

TL;DR: A way of measuring each pianist's habit of playing similar phrases in similar ways is presented and a ranking of the performers based on that is proposed.

Abstract: We propose novel machine learning methods for exploring the domain of music performance praxis. Based on simple measurements of timing and intensity in 12 recordings of a Schubert piano piece, short performance sequences are fed into a SOM algorithm in order to calculate 'performance archetypes'. The archetypes are labeled with letters and approximate string matching done by an evolutionary algorithm is applied to find similarities in the performances represented by these letters. We present a way of measuring each pianist's habit of playing similar phrases in similar ways and propose a ranking of the performers based on that. Finally, an experiment revealing common expression patterns is briefly described.

12 citations

Journal Article•DOI•

A Memory-Efficient Deterministic Finite Automaton-Based Bit-Split String Matching Scheme Using Pattern Uniqueness in Deep Packet Inspection

HyunJin Kim¹, Kang-Il Choi², Sang-Il Choi¹•Institutions (2)

Dankook University¹, Electronics and Telecommunications Research Institute²

04 May 2015-PLOS ONE

TL;DR: The experimental results show that the proposed string matching scheme can reduce the storage cost significantly compared to the previous bit-split string matching methods.

Abstract: This paper proposes a memory-efficient bit-split string matching scheme for deep packet inspection (DPI). When the number of target patterns becomes large, the memory requirements of the string matching engine become a critical issue. The proposed string matching scheme reduces the memory requirements using the uniqueness of the target patterns in the deterministic finite automaton (DFA)-based bit-split string matching. The pattern grouping extracts a set of unique patterns from the target patterns. In the set of unique patterns, a pattern is not the suffix of any other pattern. Therefore, in the DFA constructed with the set of unique patterns, only one pattern can be matched in an output state. In the bit-split string matching, multiple finite-state machine (FSM) tiles with several input bit groups are adopted in order to reduce the number of stored state transitions. However, the memory requirements for storing the matching vectors can be large because each bit in the matching vector is used to identify whether its own pattern is matched or not. In our research, the proposed pattern grouping is applied to the multiple FSM tiles in the bit-split string matching. For the set of unique patterns, the memory-based bit-split string matching engine stores only the pattern match index for each state to indicate the match with its own unique pattern. Therefore, the memory requirements are significantly decreased by not storing the matching vectors in the string matchers for the set of unique patterns. The experimental results show that the proposed string matching scheme can reduce the storage cost significantly compared to the previous bit-split string matching methods.
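The uniqueness property described in this abstract (no pattern in the set is a suffix of another) can be illustrated with a small sketch. The abstract does not describe the paper's actual grouping procedure, so the greedy longest-first split below is an assumption made for illustration, and the function name is hypothetical.

```python
def split_unique_patterns(patterns: list[str]) -> tuple[list[str], list[str]]:
    """Split patterns into a suffix-free set and the remaining patterns.

    Assumption: a greedy longest-first pass, not the paper's procedure.
    """
    # Longest first, so a pattern can only be a suffix of one seen earlier.
    distinct = sorted(set(patterns), key=lambda p: (-len(p), p))
    unique: list[str] = []
    rest: list[str] = []
    for pattern in distinct:
        if any(kept.endswith(pattern) for kept in unique):
            rest.append(pattern)  # would share an output state with a longer pattern
        else:
            unique.append(pattern)
    return unique, rest


unique, rest = split_unique_patterns(["she", "he", "hers", "his", "e"])
print(unique)  # ['hers', 'his', 'she']
print(rest)    # ['he', 'e']
```

Because no pattern in the first list ends with another, a match ending at any position identifies at most one of them, which is the property that lets an output state store a single pattern index instead of a full matching vector.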

12 citations

Performance

Metrics: 1,944 papers; 64,998 citations

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	33
2020	30
2019	48
2018	39
