Topic

Approximate string matching

About: Approximate string matching is a research topic. Over its lifetime, 1903 publications have been published within this topic, receiving 62352 citations. The topic is also known as: fuzzy string-searching algorithm & fuzzy string-matching algorithm.

Papers

Probabilistic correlation-based similarity measure on text records

Shaoxu Song, Han Zhu, Lei Chen

01 Jan 2014

TL;DR: In this article, a probabilistic correlation-based similarity measure was proposed for unstructured text record similarity evaluation, which enriches the information of records by considering correlations of tokens.

Abstract: Large-scale unstructured text records are stored in text attributes in databases and information systems, such as scientific citation records or news highlights. Approximate string matching techniques for full text retrieval, e.g., edit distance and cosine similarity, can be adopted for unstructured text record similarity evaluation. However, these techniques do not show the best performance when applied directly, owing to the difference between unstructured text records and full text. In particular, the information in short text records is limited, and variations in format, such as abbreviations and missing data, greatly affect the record similarity evaluation. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply matching tokens between two records, our similarity evaluation enriches the information of records by considering correlations of tokens. The probabilistic correlation between tokens is defined as the probability of them appearing together in the same records. Then we compute weights of tokens and discover correlations of records based on the probabilistic correlations of tokens. The extensive experimental results demonstrate the effectiveness of our proposed approach.
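
The token correlation described in this abstract can be illustrated with a minimal Python sketch. It assumes the probability is estimated as the fraction of records that contain both tokens; the function name, the whitespace tokenizer, and the sample records are hypothetical, and the paper's actual estimator and weighting scheme may differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probabilities(records):
    """Estimate P(a, b) as the fraction of records containing both tokens.

    Assumption: tokens are lowercase whitespace-separated words, and each
    token is counted at most once per record.
    """
    pair_counts = Counter()
    for record in records:
        tokens = sorted(set(record.lower().split()))
        # Each unordered token pair is stored as a sorted tuple.
        pair_counts.update(combinations(tokens, 2))
    total = len(records)
    return {pair: count / total for pair, count in pair_counts.items()}


# Hypothetical records: an abbreviation co-occurs with its expansion.
records = [
    "very large data bases",
    "vldb conference",
    "very large data bases vldb",
    "data mining conference",
]
probs = cooccurrence_probabilities(records)
print(probs[("bases", "data")])  # fraction of records containing both tokens
```

Pairs that never share a record are simply absent from the result, so a lookup with a default of 0.0 is a natural way to query it.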

22 citations

Proceedings Article

Approximate String Matching in Musical Sequences

Maxime Crochemore, Costas S. Iliopoulos, Thierry Lecroq, Yoan J. Pinzón

01 Jan 2001

TL;DR: This paper considers δ-approximate and (δ, γ)-approximate string matching, two notions of approximate matching that arise naturally in computer-assisted music analysis, and presents fast, efficient and practical algorithms for both.

Abstract: Here we consider computational problems on δ-approximate and (δ, γ)-approximate string matching. These are two new notions of approximate matching that arise naturally in applications of computer-assisted music analysis. We present fast, efficient and practical algorithms for these two notions of approximate string matching.
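
The abstract does not define the two notions, so the sketch below rests on an assumption: the usual definition over an integer alphabet (for example, pitch numbers), where a pattern δ-matches a text window if every aligned pair of symbols differs by at most δ, and (δ, γ)-matches if, in addition, the differences sum to at most γ. This is a naive quadratic scan for illustration only, not the fast algorithms the paper presents; the function name and sample data are hypothetical.

```python
def delta_gamma_matches(text, pattern, delta, gamma=None):
    """Return start positions where pattern approximately matches text.

    Assumed definition: every aligned symbol pair differs by at most
    `delta`; if `gamma` is given, the differences must also sum to at
    most `gamma`. Runs in O(len(text) * len(pattern)) time.
    """
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        diffs = [abs(p - t) for p, t in zip(pattern, text[i:i + m])]
        within_delta = all(d <= delta for d in diffs)
        within_gamma = gamma is None or sum(diffs) <= gamma
        if within_delta and within_gamma:
            hits.append(i)
    return hits


# Hypothetical melody and motif given as MIDI-style pitch numbers.
melody = [60, 62, 64, 65, 67, 65, 64, 62]
motif = [61, 64, 66]
print(delta_gamma_matches(melody, motif, delta=2))           # δ-approximate
print(delta_gamma_matches(melody, motif, delta=2, gamma=3))  # (δ, γ)-approximate
```

The γ bound filters out windows where every note is slightly off, which a per-symbol δ bound alone would accept.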

22 citations

Proceedings Article

The Approximate String Matching on the Hierarchical Memory Machine, with Performance Evaluation

Duhu Man, Koji Nakano, Yasuaki Ito (Hiroshima University)

26 Sep 2013

TL;DR: This paper presents an optimal parallel algorithm for approximate string matching on the HMM, implements it on a CUDA-enabled GPU, and shows that the GPU implementation attains a speedup factor of 66.1 over the single CPU implementation.

Abstract: The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. Approximate string matching (ASM) for two strings X and Y of lengths m and n is the task of finding a substring of Y most similar to X. The main contribution of this paper is to show an optimal parallel algorithm for approximate string matching on the HMM and to implement it on a CUDA-enabled GPU. Our algorithm runs in O(n/w + mn/dw + nL/p + mnl/p) on the HMM with d streaming processors, memory bandwidth w, global memory access latency L, and shared memory access latency l. Further, we implement our algorithm on a GeForce GTX 580 GPU and evaluate the performance. The experimental results show that the ASM of two strings of 1024 and 4M (= 2^22) characters can be computed in 419.6 ms, while the sequential algorithm can compute it in 27720 ms. Thus, our implementation on the GPU attains a speedup factor of 66.1 over the single CPU implementation.
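
For orientation, the ASM task stated in this abstract can be sketched sequentially in Python. The sketch assumes "most similar" means minimum unit-cost edit distance and uses the textbook O(mn) dynamic program in which a match may start anywhere in Y; it is not the paper's parallel HMM algorithm, and the function name is hypothetical.

```python
def best_substring_distance(x, y):
    """Return (distance, end) for the substring of y closest to x.

    Assumption: similarity is unit-cost edit distance. `end` is the
    exclusive end index in y of the first best-matching substring.
    """
    # Row 0 is all zeros: the empty prefix of x matches anywhere in y for free.
    prev = [0] * (len(y) + 1)
    for i, xc in enumerate(x, start=1):
        curr = [i] + [0] * len(y)  # matching x[:i] against nothing costs i
        for j, yc in enumerate(y, start=1):
            curr[j] = min(
                prev[j] + 1,               # delete xc
                curr[j - 1] + 1,           # insert yc
                prev[j - 1] + (xc != yc),  # match or substitute
            )
        prev = curr
    best_end = min(range(len(y) + 1), key=prev.__getitem__)
    return prev[best_end], best_end


print(best_substring_distance("abc", "xxabcxx"))
print(best_substring_distance("abd", "xxabcxx"))
```

Only two rows of the table are kept at a time, so memory stays linear in the length of Y even though the running time is proportional to mn.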

22 citations

Patent

System and method for performing longest common prefix strings searches

Christopher Harris, Hal Lonas

09 Apr 2010

TL;DR: This patent presents a method and system for compressing and searching a plurality of strings, based on the idea of prefix-preserving compressed string search (PPSS).

Abstract: A method and system for compressing and searching a plurality of strings. The method includes inputting a plurality of strings into a compression engine. The method also includes converting each of the plurality of strings into a new, prefix-preserving compressed string, using the compression engine. For every string P that is a strict prefix of a string S, P's resulting compressed string is a strict prefix of S's resulting compressed string.
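
The prefix-preserving property stated in the last sentence can be checked mechanically. The Python sketch below pairs such a check with a toy encoder that assigns each symbol a fixed-width bit code; the encoder is a hypothetical stand-in chosen only because it trivially satisfies the property, and it is not the patent's compression engine.

```python
def make_encoder(alphabet):
    """Build a toy encoder mapping each symbol to a fixed-width bit string.

    Hypothetical stand-in for a compression engine: every symbol gets a
    non-empty code, so extending a string always extends its encoding.
    """
    symbols = sorted(set(alphabet))
    width = max(1, (len(symbols) - 1).bit_length())
    codes = {ch: format(i, f"0{width}b") for i, ch in enumerate(symbols)}
    return lambda s: "".join(codes[ch] for ch in s)


def is_prefix_preserving(encode, strings):
    """Check the property on every strict-prefix pair in `strings`."""
    for p in strings:
        for s in strings:
            if len(p) < len(s) and s.startswith(p):
                cp, cs = encode(p), encode(s)
                if not (len(cp) < len(cs) and cs.startswith(cp)):
                    return False
    return True


encode = make_encoder("acgt")
samples = ["a", "ac", "acg", "t"]
print(encode("ac"))
print(is_prefix_preserving(encode, samples))
```

With such an encoding, a prefix query on the original strings can be answered by a prefix query on the compressed strings, which is the point of the property.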

22 citations

Dissertation

Approximate string matching for high-throughput sequencing

Enrico Siragusa

01 Jan 2015

TL;DR: This thesis presents novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures, and provides all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de.

Abstract: Over the past years, high-throughput sequencing (HTS) has become an invaluable method of investigation in molecular and medical biology. HTS technologies allow to sequence cheaply and rapidly an individual’s DNA sample under the form of billions of short DNA reads. The ability to assess the content of a DNA sample at base-level resolution opens the way to a myriad of applications, including individual genotyping and assessment of large structural variations, measurement of gene expression levels and characterization of epigenetic features. Nonetheless, the quantity and quality of data produced by HTS instruments call for computationally efficient and accurate analysis methods. In this thesis, I present novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures. Read mapping is a fundamental step of any HTS data analysis pipeline in resequencing projects, where DNA reads are reassembled by aligning them back to a previously known reference genome. The ingenuity of approximate string matching methods is crucial to design efficient and accurate read mapping tools. In the first part of this thesis, I cover practical indexing and filtering methods for exact and approximate string matching. I present state of the art algorithms and data structures, give their pseudocode and discuss their implementation. Furthermore, I provide all implementations within SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de/. Subsequently, I experimentally evaluate all implemented methods, with the aim of guiding the engineering of new sequence alignment software. To the best of my knowledge, this is the first study providing a comprehensive exposition, implementation and evaluation of such methods. In the second part of this thesis, I turn to the engineering and evaluation of read mapping tools. First, I present a novel method to find all mapping locations per read within a user-defined error rate; this method is published in the peer-reviewed journal Nucleic Acids Research and packaged in an open source tool nicknamed Masai. Afterwards, I generalize this method to quickly report all co-optimal or suboptimal mapping locations per read within a user-defined error rate; this method, packaged in a tool called Yara, provides a more practical, yet sound solution to the read mapping problem. Extensive evaluations, both on simulated and real datasets, show that Yara has better speed and accuracy than de-facto standard read mapping tools.
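
To give a flavor of what a filtering method for approximate matching can look like, here is a minimal Python sketch of the classic pigeonhole seed filter: a read allowed k errors is cut into k + 1 pieces, and any occurrence with at most k errors must contain at least one piece exactly. This is an illustrative assumption about the general technique, not the method implemented in Masai or Yara; all names and the sample sequences are hypothetical, and every candidate window would still need verification by an edit-distance computation.

```python
def max_errors(read, error_rate):
    """Convert an error rate into an error budget (floor of rate * length)."""
    return int(len(read) * error_rate)


def seeds(read, k):
    """Split read into k + 1 contiguous pieces, returned as (offset, piece)."""
    n = k + 1
    bounds = [len(read) * i // n for i in range(n + 1)]
    return [(bounds[i], read[bounds[i]:bounds[i + 1]]) for i in range(n)]


def candidate_windows(reference, read, error_rate):
    """Return reference windows (start, end) that may contain the read.

    Each exact seed hit yields one window, padded by k on both sides to
    leave room for insertions and deletions.
    """
    k = max_errors(read, error_rate)
    windows = set()
    for offset, piece in seeds(read, k):
        start = reference.find(piece)
        while start != -1:
            lo = max(0, start - offset - k)
            hi = min(len(reference), start - offset + len(read) + k)
            windows.add((lo, hi))
            start = reference.find(piece, start + 1)
    return sorted(windows)


# Hypothetical data: the read is reference[6:16] with one substitution.
reference = "GATTACAGATTTCCGGAACT"
read = "AGATTACCGG"
print(candidate_windows(reference, read, error_rate=0.1))
```

A real read mapper would replace the linear `str.find` scan with an index over the reference, which is where the indexing methods mentioned in the abstract come in.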

22 citations

Performance metrics

Papers: 1,942

Citations: 64,998

No. of papers in the topic in previous years
Year	Papers
2023	8
2022	30
2021	32
2020	30
2019	48
2018	39
