
Showing papers on "Approximate string matching published in 2012"


Proceedings ArticleDOI
23 Sep 2012
TL;DR: This paper proposes an approach to automatically split identifiers into their composing words and expand abbreviations, running in linear time with respect to the size of the dictionary by taking advantage of an approximate string matching technique.
Abstract: Information Retrieval (IR) techniques are being exploited by an increasing number of tools supporting Software Maintenance activities. Indeed, the lexical information embedded in the source code can be valuable for tasks such as concept location, clustering or recovery of traceability links. The application of such IR-based techniques relies on the consistency of the lexicon available in the different artifacts, and their effectiveness can worsen if programmers introduce abbreviations (e.g., rect) and/or do not strictly follow naming conventions such as Camel Case (e.g., UTFtoASCII). In this paper we propose an approach to automatically split identifiers into their composing words and expand abbreviations. The solution is based on a graph model and performs in linear time with respect to the size of the dictionary, taking advantage of an approximate string matching technique. The proposed technique exploits a number of different dictionaries, referring to increasingly broader contexts, in order to achieve a disambiguation strategy based on the knowledge gathered from the most appropriate domain. The approach has been compared to other splitting and expansion techniques, using freely available oracles for the identifiers extracted from 24 C/C++ and Java open source systems. Results show an improvement in both splitting and expanding performance, in addition to a strong enhancement in the computational efficiency.

72 citations
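
The paper's graph-based splitting and expansion algorithm is not reproduced above; as a rough illustration of the underlying idea, the sketch below combines camel-case splitting with dictionary-driven expansion using approximate matching. The mini-dictionary, the prefix/closest-match heuristics and the cutoff are illustrative assumptions, not the authors' method; note that same-case concatenations such as "UTFtoASCII" are exactly the cases a purely regex-based splitter misses and the paper's dictionary-based strategy targets.

```python
import re
from difflib import get_close_matches

# Hypothetical mini-dictionary; the paper uses several dictionaries of
# increasingly broad scope (function, file, application, English).
DICTIONARY = ["rectangle", "pointer", "counter", "convert", "ascii", "value"]

def split_identifier(identifier):
    """Split on underscores, digits and camel-case boundaries (e.g. 'drawRect')."""
    words = []
    for part in re.split(r"[_\d]+", identifier):
        # Insert breaks between lower->UPPER and between an acronym and the following word.
        part = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part)
        part = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", part)
        words.extend(w.lower() for w in part.split() if w)
    return words

def expand(word, dictionary=DICTIONARY):
    """Expand a (possibly abbreviated) word to the closest dictionary entry."""
    if word in dictionary:
        return word
    prefix_hits = [d for d in dictionary if d.startswith(word)]   # rect -> rectangle
    if prefix_hits:
        return min(prefix_hits, key=len)
    close = get_close_matches(word, dictionary, n=1, cutoff=0.6)  # cntr -> counter
    return close[0] if close else word

if __name__ == "__main__":
    for ident in ("drawRect", "cntr_value"):
        print(ident, "->", [expand(w) for w in split_identifier(ident)])
```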


Journal ArticleDOI
01 Feb 2012
TL;DR: A novel possibilistic fuzzy matching strategy with invariant properties is proposed, which can provide a robust and effective matching scheme for two sets of iris feature points and is comparable to those of the typical systems.
Abstract: In this paper, we propose a novel possibilistic fuzzy matching strategy with invariant properties, which can provide a robust and effective matching scheme for two sets of iris feature points. In addition, the nonlinear normalization model is adopted to provide more accurate position before matching. Moreover, an effective iris segmentation method is proposed to refine the detected inner and outer boundaries to smooth curves. For feature extraction, the Gabor filters are adopted to detect the local feature points from the segmented iris image in the Cartesian coordinate system and to generate a rotation-invariant descriptor for each detected point. After that, the proposed matching algorithm is used to compute a similarity score for two sets of feature points from a pair of iris images. The experimental results show that the performance of our system is better than those of the systems based on the local features and is comparable to those of the typical systems.

62 citations


Journal ArticleDOI
01 Jan 2012
TL;DR: This work presents an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.
Abstract: Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t)=q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u,v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u≤q≤v. This definition encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem in sub-linear expected time using a wavelet tree of s, which can be computed in time O(n) in a preprocessing phase. We then discuss a Scrabble-like variation of the problem, in which a weight function on the letters of s is given and one has to find all occurrences in s of a substring t with maximum weight having Parikh vector p(t)≤v. For the case of a binary alphabet, we present an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.

55 citations
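
To make the definitions above concrete, the exact version of jumbled matching (find a substring t with p(t)=q) admits a simple sliding-window scan over a fixed alphabet; this sketch is background only and is not the paper's wavelet-tree or approximate algorithm.

```python
from collections import Counter

def parikh(s):
    """Parikh vector of s: the multiplicity of each character."""
    return Counter(s)

def jumbled_matches(s, q):
    """Yield start positions of substrings t of s with p(t) == q (exact version).

    Fixed-length sliding window of size sum(q.values()); O(1) update per shift
    over a fixed alphabet."""
    m = sum(q.values())
    if m == 0 or m > len(s):
        return
    window = Counter(s[:m])
    if window == q:
        yield 0
    for i in range(1, len(s) - m + 1):
        out_ch, in_ch = s[i - 1], s[i + m - 1]
        window[out_ch] -= 1
        if window[out_ch] == 0:
            del window[out_ch]
        window[in_ch] += 1
        if window == q:
            yield i

if __name__ == "__main__":
    # occurrences of any permutation of "abc" in "abaccab"
    print(list(jumbled_matches("abaccab", parikh("abc"))))  # [1, 4]
```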


Book ChapterDOI
05 Sep 2012
TL;DR: The techniques reduce pattern matching and the generalized Hamming distance problem to a novel linear algebra formulation that allows for generic solutions based on any additively homomorphic encryption, and are believed to be of independent interest.
Abstract: In this paper we consider the problem of secure pattern matching that allows single character wildcards and substring matching in the malicious (stand-alone) setting. Our protocol, called 5PM, is executed between two parties: Server, holding a text of length n, and Client, holding a pattern of length m to be matched against the text, where our notion of matching is more general and includes non-binary alphabets, non-binary Hamming distance and non-binary substring matching. 5PM is the first protocol with communication complexity sub-linear in circuit size to compute non-binary substring matching in the malicious model (general MPC has communication complexity which is at least linear in the circuit size). 5PM is also the first sublinear protocol to compute non-binary Hamming distance in the malicious model. Additionally, in the honest-but-curious (semi-honest) model, 5PM is asymptotically more efficient than the best known scheme when amortized for applications that require single character wildcards or substring pattern matching. 5PM in the malicious model requires O((m+n)k^2) bandwidth and O(m+n) encryptions, where m is the pattern length and n is the text length. Further, 5PM can hide pattern size with no asymptotic additional costs in either computation or bandwidth. Finally, 5PM requires only 2 rounds of communication in the honest-but-curious model and 8 rounds in the malicious model. Our techniques reduce pattern matching and the generalized Hamming distance problem to a novel linear algebra formulation that allows for generic solutions based on any additively homomorphic encryption. We believe our efficient algebraic techniques are of independent interest.

52 citations


Journal ArticleDOI
TL;DR: A new algorithm is presented achieving time O(n log k + m + α) and space O(m+A), where A is the sum of the lower bounds of the lengths of the gaps in P and α is the total number of occurrences of the strings in P within T.

47 citations


Journal ArticleDOI
TL;DR: This paper proposes an offline, data-driven approach that mines query logs for instances where content creators and web users apply a variety of strings to refer to the same webpages, and generates an expanded set of equivalent strings (entity synonyms) for each entity.
Abstract: Nowadays, there are many queries issued to search engines aimed at finding values from structured data (e.g., movie showtime of a specific location). In such scenarios, there is often a mismatch between the values of structured data (how content creators describe entities) and the web queries (how different users try to retrieve them). Therefore, recognizing the alternative ways people use to reference an entity is crucial for structured web search. In this paper, we study the problem of automatic generation of entity synonyms over structured data toward closing the gap between users and structured data. We propose an offline, data-driven approach that mines query logs for instances where content creators and web users apply a variety of strings to refer to the same webpages. This way, given a set of strings that reference entities, we generate an expanded set of equivalent strings (entity synonyms) for each entity. Our framework consists of three modules: candidate generation, candidate selection, and noise cleaning. We further study the cause of the problem through the identification of different entity synonym classes. The proposed method is verified with experiments on real-life data sets showing that we can significantly increase the coverage of structured web queries with good precision.

43 citations


Journal ArticleDOI
TL;DR: It is found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches.
Abstract: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.

41 citations
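
A minimal sketch of the classical CGR map over the unit square for DNA illustrates the 2^-L suffix property quoted above; the corner assignment below is a common convention assumed here, not necessarily the one used in the paper.

```python
# Classical Chaos Game Representation over the unit square for DNA.
# The corner assignment is a common convention (assumption), not taken from the paper.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr(seq, start=(0.5, 0.5)):
    """CGR coordinate of seq: repeatedly move halfway toward the next symbol's corner."""
    x, y = start
    for ch in seq:
        cx, cy = CORNERS[ch]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y

if __name__ == "__main__":
    # Sequences sharing the suffix "GATTACA" (length 7) land within 2**-7 of each other,
    # because each shared step halves whatever coordinate difference remained.
    p = cgr("CCGT" + "GATTACA")
    q = cgr("AAAA" + "GATTACA")
    print(p, q, max(abs(p[0] - q[0]), abs(p[1] - q[1])) <= 2 ** -7)  # True
```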


Proceedings ArticleDOI
12 Aug 2012
TL;DR: This paper proposes a general pattern matching strategy that consists of a pre-processing step and a pattern matching step that significantly reduces the number of events that need to be processed by A as well as the number of calls to A.
Abstract: In event pattern matching a sequence of input events is matched against a complex query pattern that specifies constraints on extent, order, values, and quantification of matching events. In this paper we propose a general pattern matching strategy that consists of a pre-processing step and a pattern matching step. Instead of eagerly matching incoming events, the pre-processing step buffers events in a match window to apply different pruning techniques (filtering, partitioning, and testing for necessary match conditions). In the second step, an event pattern matching algorithm, A, is called only for match windows that satisfy the necessary match conditions. This two-phase strategy with a lazy call of the matching algorithm significantly reduces the number of events that need to be processed by A as well as the number of calls to A. This is important since pattern matching algorithms tend to be expensive in terms of runtime and memory complexity, whereas the pre-processing can be done very efficiently. We conduct extensive experiments using real-world data with pattern matching algorithms for, respectively, automata and join trees. The experimental results confirm the effectiveness of our strategy for both types of pattern matching algorithms.

41 citations


Journal ArticleDOI
TL;DR: A new approach for the computation of median string based on string embedding is proposed, which applies three different inverse transformations to go from the vector domain back to the string domain in order to obtain a final approximation of the median string.

31 citations


Patent
07 Aug 2012
TL;DR: In this article, a method is proposed to generate at least one string based on the input string, where the input string is not a substring of the generated string, responsive to a determination that the generated string was previously generated based on the input string.
Abstract: A method includes receiving an input string from a virtual keyboard, generating at least one string based on the input string, where the input string is not a substring of the generated string, responsive to a determination that the generated string was previously generated based on the input string, selecting a candidate character associated with the input string and with the generated string, and displaying the generated string at a location on the virtual keyboard that is associated with the selected candidate character.

31 citations


Patent
29 Jun 2012
TL;DR: In this article, the authors describe methods and systems for managing an aggregation database using fuzzy matching rules that describe filters to determine how to match a media content record received from an external source to a stored record in the aggregation database.
Abstract: Methods and systems are described herein for managing an aggregation database. Matching rules that describe filters may be defined to determine how to match a media content record received from an external source to a stored record in the aggregation database. Fuzzy matching may be used to match attribute fields of the received record and stored records. Based on the results of the fuzzy matching, the received primary media content record may be linked to a stored record in the aggregation database.

Patent
28 Aug 2012
TL;DR: In this article, a string analysis tool for calculating a similarity metric between a source string and a plurality of target strings is presented, which is based on a minimum similarity metric threshold.
Abstract: A string analysis tool for calculating a similarity metric between a source string and a plurality of target strings. The string analysis tool may include optimizations that may reduce the number of calculations to be carried out when calculating the similarity metric for large volumes of data. In this regard, the string analysis tool may represent strings as features. As such, analysis may be performed relative to features (e.g., of either the source string or plurality of target strings) such that features from the strings may be eliminated from consideration when identifying target strings for which a similarity metric is to be calculated. The elimination of features may be based on a minimum similarity metric threshold, wherein features that are incapable of contributing to a similarity metric above the minimum similarity metric threshold are eliminated from consideration.

Book ChapterDOI
21 Oct 2012
TL;DR: It is shown there is a linear number of maximal-exponent repeats in an overlap-free string and the algorithm can locate all of them in linear time.
Abstract: The exponent of a string is the quotient of the string's length over the string's smallest period. The exponent and the period of a string can be computed in time proportional to the string's length. We design an algorithm to compute the maximal exponent of factors of an overlap-free string. Our algorithm runs in linear time on a fixed-size alphabet, while a naive solution of the question would run in cubic time. The solution for non-overlap-free strings derives from algorithms to compute all maximal repetitions, also called runs, occurring in the string. We show there is a linear number of maximal-exponent repeats in an overlap-free string. The algorithm can locate all of them in linear time.
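
Since the exponent is the length divided by the smallest period, and the smallest period equals the length minus the longest proper border, it can be computed directly from the KMP failure function; a short sketch (background only, not the paper's maximal-exponent algorithm):

```python
def smallest_period(s):
    """Smallest period of s = len(s) - length of the longest proper border (KMP failure function)."""
    n = len(s)
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]
        if s[i] == s[k]:
            k += 1
        border[i] = k
    return n - border[-1] if n else 0

def exponent(s):
    """Exponent of s: |s| divided by its smallest period."""
    return len(s) / smallest_period(s)

if __name__ == "__main__":
    print(exponent("abaababaab"))  # period 5 ("abaab"), exponent 2.0
    print(exponent("abcab"))       # period 3, exponent 5/3
```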

Proceedings ArticleDOI
Koji Nakano
05 Dec 2012
TL;DR: This paper shows efficient implementations of approximate string matching on the memory machine models DMM and UMM for strings X and Y with length m and n, respectively.
Abstract: The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory access and the global memory access of GPUs. The approximate string matching for two strings X and Y is a task to find a substring of Y most similar to X. The main contribution of this paper is to show efficient implementations of approximate string matching on the memory machine models. Our best implementation for strings X and Y with length m and n (m ≤ n), respectively, runs in O(mn/w + ml) time units using n threads both on the DMM and the UMM with width w and latency l.

Patent
31 Aug 2012
TL;DR: A method that includes receiving an input string, ranking, by the processor, a predicted string associated with the input string and displaying the ranked predicted string is presented in this paper. But the ranking depends on whether the input text is a substring of the predicted string and at least on one of a typing speed and a typing confidence.
Abstract: A method that includes receiving an input string, ranking, by the processor, a predicted string associated with the input string, wherein the ranking depends on whether the input string is a substring of the predicted string and on at least one of a typing speed and a typing confidence, and displaying the ranked predicted string.

Book ChapterDOI
03 Jul 2012
TL;DR: A new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document, is studied.
Abstract: We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

Proceedings Article
18 Nov 2012
TL;DR: This paper proposes and implements a recommender prototype which collects the natural language textual information available in the summary and description fields of previously resolved bug reports and classifies that information in a number of separate inverted lists with respect to the resolver of each issue.
Abstract: In this paper, we propose a novel approach for assisting human bug triagers in large open source software projects by semi-automating the bug assignment process. Our approach employs a simple and efficient n-gram-based algorithm for approximate string matching on the character level. We propose and implement a recommender prototype which collects the natural language textual information available in the summary and description fields of the previously resolved bug reports and classifies that information in a number of separate inverted lists with respect to the resolver of each issue. These inverted lists are considered as vocabulary-based expertise and interest models of the developers. Given a new bug report, the recommender creates all possible n-grams of the strings, evaluates their similarities to the available expertise models concerning a number of well-known string similarity measures, namely Cosine, Dice, Jaccard and Overlap coefficients. Finally, the top three developers are recommended as proper candidates for resolving this new issue. Experimental results on 5200 bug reports of the Eclipse JDT project show a weighted average precision value of 90.1% and a weighted average recall value of 45.5%. Keywords: software deployment and maintenance; semi-automated bug triage; approximate string retrieval; open source software.
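
A compact sketch of character-level n-gram profiles and the four similarity coefficients named above (Cosine, Dice, Jaccard, Overlap); the tokenisation, the choice n=3 and the absence of weighting are illustrative assumptions, not the prototype's exact configuration.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character-level n-gram profile (n=3 here; the prototype's exact n is an assumption)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarities(a, b):
    """Cosine, Dice, Jaccard and Overlap coefficients between two n-gram profiles."""
    A, B = ngrams(a), ngrams(b)
    dot = sum(A[g] * B[g] for g in A)
    norm = math.sqrt(sum(v * v for v in A.values())) * math.sqrt(sum(v * v for v in B.values()))
    sa, sb = set(A), set(B)
    inter = len(sa & sb)
    return {
        "cosine": dot / norm if norm else 0.0,
        "dice": 2 * inter / (len(sa) + len(sb)) if sa or sb else 0.0,
        "jaccard": inter / len(sa | sb) if sa | sb else 0.0,
        "overlap": inter / min(len(sa), len(sb)) if sa and sb else 0.0,
    }

if __name__ == "__main__":
    report = "NullPointerException when refactoring a Java project"
    developer_profile = "refactoring crashes with NullPointerException in JDT"
    print(similarities(report, developer_profile))
```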

Proceedings ArticleDOI
01 Sep 2012
TL;DR: A bit-parallel string matching spam filtering system based on the improved Baeza-Yates and Navarro approximate string matching algorithm that has a low computational cost, is easy to implement, and has the potential to catch misspelled keywords.
Abstract: Spam has evolved in terms of contents, methods, delivery networks and volume. Reports indicate that up to 90% of the World Wide Web email traffic is spam [1]. The contents cover a wider range and are deviating from the conventional pharmaceuticals and adult content into more formal marketing campaigns. This illegal advertising is evolving into an underground market for bot masters who rent or sell spam agents. Progressively, spam campaigns engage new methods to ensure efficient mass delivery and dodge conventional spam detectors. They employ a very complicated and vast infrastructure of Botnets and Fast Flux Networks to deliver as many emails as possible. The main concerns for the spam detection process are detection and misclassification accuracies, and those remain a challenge because of the evolving techniques employed by spammers. In this paper we propose a bit-parallel string matching spam filtering system based on the improved Baeza-Yates and Navarro approximate string matching algorithm. This method has a low computational cost, is easy to implement, and has the potential to catch misspelled keywords. The proposed approach achieves 97.2% overall accuracy with a simple Naive Bayes classifier.
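
The Baeza-Yates–Navarro algorithm simulates an NFA for edit distance with bit-parallelism; the sketch below shows the simpler Shift-And recurrence restricted to k mismatches (Hamming distance), which conveys the bit-parallel idea behind such filters but is not the paper's filtering system.

```python
def bitparallel_k_mismatches(text, pattern, k):
    """End positions where pattern occurs in text with at most k mismatches.

    Shift-And style bit-parallelism: R[d] keeps, as set bits, the pattern
    prefixes currently matched with at most d mismatches. This is a
    simplification of the Wu-Manber / Baeza-Yates-Navarro scheme
    (Hamming distance only, no insertions/deletions)."""
    m = len(pattern)
    if m == 0 or m > 64:              # keep masks within one machine word for the sketch
        raise ValueError("pattern length must be between 1 and 64")
    B = {}
    for i, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    R = [0] * (k + 1)
    hits = []
    for pos, ch in enumerate(text):
        mask = B.get(ch, 0)
        prev = R[0]
        R[0] = ((R[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            cur = R[d]
            # either extend an exact match, or spend one mismatch on this character
            R[d] = (((cur << 1) | 1) & mask) | ((prev << 1) | 1)
            prev = cur
        if R[k] & accept:
            hits.append(pos)          # pattern ends here with <= k mismatches
    return hits

if __name__ == "__main__":
    print(bitparallel_k_mismatches("free viagra v1agra vigara", "viagra", 1))  # [10, 17]
```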

Book ChapterDOI
03 Jul 2012
TL;DR: The CSP and CSSP via rank distance are shown to be NP-hard; a polynomial time k-approximation algorithm is presented for the CSP, along with a parameterized algorithm for the case where the alphabet is binary and each string has the same number of 0's and 1's.
Abstract: Given a set S of k strings of maximum length n, the goal of the closest substring problem (CSSP) is to find the smallest integer d (and a corresponding string t of length l≤n) such that each string s∈S has a substring of length l of "distance" at most d to t. The closest string problem (CSP) is a special case of CSSP where l=n. CSP and CSSP arise in many applications in bioinformatics and are extensively studied in the context of Hamming and edit distance. In this paper we consider a recently introduced distance measure, namely the rank distance. First, we show that the CSP and CSSP via rank distance are NP-hard. Then, we present a polynomial time k-approximation algorithm for the CSP problem. Finally, we give a parametrized algorithm for the CSP (the parameter is the number of input strings) if the alphabet is binary and each string has the same number of 0's and 1's.

Journal ArticleDOI
TL;DR: A generalization of the classical Rabin-Karp string matching algorithm is presented to solve the k-mismatch problem, with average complexity O(n+m) (where n and m are the text and pattern lengths, respectively); it is in general faster and more accurate than other available tools such as SOAP2, BWA, and BOWTIE.
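
For background, the classical Rabin-Karp scheme that the paper generalizes rolls a polynomial hash over the text and verifies candidates only on hash hits; a minimal exact-matching sketch (the paper's k-mismatch generalization is not reproduced):

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Classic Rabin-Karp exact matching: roll a polynomial hash over the text
    and verify candidate windows character by character on hash equality."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # weight of the outgoing character
    ph = th = 0
    for i in range(m):
        ph = (ph * base + ord(pattern[i])) % mod
        th = (th * base + ord(text[i])) % mod
    hits = []
    for s in range(n - m + 1):
        if th == ph and text[s:s + m] == pattern:
            hits.append(s)
        if s + m < n:                         # roll the window one character right
            th = ((th - ord(text[s]) * high) * base + ord(text[s + m])) % mod
    return hits

if __name__ == "__main__":
    print(rabin_karp("GATTACAGATTACA", "TTACA"))  # [2, 9]
```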

01 Jan 2012
TL;DR: A comparison of Aho-Corasick, Commentz-Walter, Bit-Parallel (Shift-Or), Rabin-Karp, Wu-Manber and other multiple pattern string matching algorithms is presented on different parameters.
Abstract: The use of string matching algorithms in software applications like virus scanners (anti-virus) or intrusion detection systems is stressed for improving data security over the internet. String-matching techniques are used for sequence analysis, gene finding, evolutionary biology studies and analysis of protein expression. Other fields, such as Music Technology, Computational Linguistics, Artificial Intelligence and Artificial Vision, have been using string matching algorithms as an integral part of their theoretical and practical tools. Various string matching problems have appeared as a result of such continuous, exhaustive use, and have in turn been promptly solved by computer scientists. Many practical, real-world problems can be addressed by multiple pattern string matching algorithms. String matching algorithms like Aho-Corasick, Commentz-Walter, Bit-Parallel, Rabin-Karp, Wu-Manber etc. are the focus of this paper. The Aho-Corasick algorithm is based on finite state machines (automata). The Commentz-Walter algorithm is based on the idea of Knuth-Morris-Pratt and finite state machines. Bit-parallel algorithms like Shift-Or make use of wide machine words (CPU registers) to parallelize the work. Rabin-Karp uses hashing to find any one of a set of pattern strings in a text. Wu-Manber looks at the text in blocks instead of character by character, combining ideas from Aho-Corasick and Boyer-Moore. Each algorithm has certain advantages and disadvantages. This paper presents a comparative analysis of various multiple pattern string matching algorithms. A comparison of Aho-Corasick, Commentz-Walter, Bit-Parallel (Shift-Or), Rabin-Karp, Wu-Manber and other string matching algorithms is presented on different parameters.

Journal ArticleDOI
TL;DR: This research proposes a hybrid exact string matching algorithm by combining the good properties of the Quick Search and the Skip Search algorithms to demonstrate and devise a better method to solve the string matching problem with higher speed and lower cost.
Abstract: The string matching problem is a cornerstone of many computer science fields because of the fundamental role it plays in various computer applications. Thus, several string matching algorithms have been produced and applied in most operating systems, information retrieval, editors, internet searching engines, firewall interception and searching nucleotide or amino acid sequence patterns in genome and protein sequence databases. Several important factors are considered during the matching process, such as the number of character comparisons, the number of attempts and the consumed time. This research proposes a hybrid exact string matching algorithm that combines the good properties of the Quick Search and the Skip Search algorithms to devise a better method for solving the string matching problem with higher speed and lower cost. The hybrid algorithm was tested using different types of standard data. The hybrid algorithm provides efficient results and reliability compared with the original algorithms in terms of the number of character comparisons and the number of attempts when applied with different pattern lengths. Additionally, the hybrid algorithm achieved better performance, providing lower time complexity for the worst and best cases compared with other hybrid algorithms.
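
For reference, the Quick Search component (Sunday's bad-character rule keyed on the text character just past the current window) can be sketched as follows; the Skip Search part and the paper's specific hybridization are not shown.

```python
def quick_search(text, pattern):
    """Sunday's Quick Search: shift by the bad-character rule applied to the
    text character immediately to the right of the current window."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # shift[c] = distance from the rightmost occurrence of c in the pattern to its end
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    hits, s = [], 0
    while s <= n - m:
        if text[s:s + m] == pattern:
            hits.append(s)
        if s + m >= n:
            break
        s += shift.get(text[s + m], m + 1)   # character absent from pattern: jump past it
    return hits

if __name__ == "__main__":
    print(quick_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # [5]
```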

Proceedings ArticleDOI
27 Mar 2012
TL;DR: A fast text retrieval system to index and browse degraded historical documents, designed in a two level, coarse-to-fine approach, to increase the speed of the retrieval process.
Abstract: In this paper, we present a fast text retrieval system to index and browse degraded historical documents. The indexing and retrieval strategy is designed in a two level, coarse-to-fine approach, to increase the speed of the retrieval process. During the indexing step, the text parts in the images are encoded into sequences of primitives, obtained from two different codebooks: a coarse one corresponding to connected components and a fine one corresponding to glyph primitives. A glyph consists of a single character or a part of a character according to the shape complexity. During the querying step, the coarse and the fine signature are generated from the query image using both codebooks. Then, a bi-level approximate string matching algorithm is applied to find similar words, using the coarse approach first, and then the fine approach if necessary, by exploiting predetermined hypothetical locations. An experimental evaluation on datasets of real life document images, gathered from historical books of different scripts, demonstrated the speed improvement and good accuracy in the presence of degradation.

Book ChapterDOI
21 Oct 2012
TL;DR: An O(nm) algorithm is proposed for finding all the matches of a pattern P[1..m] in a text T[1..n], together with an approximate variant of function matching where two equal-length strings X and Y match if there exists a function that maps X to a string X′ such that X′ and Y are δγ-similar.
Abstract: This paper defines a new string matching problem by combining two paradigms: function matching and δγ-matching. The result is an approximate variant of function matching where two equal-length strings X and Y match if there exists a function that maps X to a string X′ such that X′ and Y are δγ-similar. We propose an O(nm) algorithm for finding all the matches of a pattern P[1..m] in a text T[1..n].
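
δγ-matching declares two equal-length integer strings similar when every position differs by at most δ and the differences sum to at most γ; a naive check and scan (illustration only, not the paper's O(nm) function-matching algorithm):

```python
def delta_gamma_match(x, y, delta, gamma):
    """True iff |x[i]-y[i]| <= delta for every i and the sum of |x[i]-y[i]| is <= gamma."""
    if len(x) != len(y):
        return False
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return max(diffs, default=0) <= delta and sum(diffs) <= gamma

def delta_gamma_occurrences(text, pattern, delta, gamma):
    """Naive scan: report start positions where the window delta-gamma-matches the pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if delta_gamma_match(text[i:i + m], pattern, delta, gamma)]

if __name__ == "__main__":
    text = [60, 62, 65, 63, 61, 70, 59]
    pattern = [61, 64, 62]
    print(delta_gamma_occurrences(text, pattern, delta=2, gamma=4))  # [1]
```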

01 Jan 2012
TL;DR: An efficient GPGPU implementation of an algorithm for approximate string matching with regular expression operators, originally implemented on an FPGA, is proposed and experimental results showed that the GPU implementation is more than 18 times as fast as the CPU one when the pattern length is greater than 3200, while the FPGA one could not handle such a long pattern.
Abstract: In this paper, we propose an efficient GPGPU implementation of an algorithm for approximate string matching with regular expression operators, originally implemented on an FPGA, and compare the GPGPU, FPGA and CPU implementations by experiments. Approximate string matching with regular expression operators is used in various applications, such as full text database search and DNA sequence analysis. To efficiently handle a long text in the matching, a hardware algorithm for FPGA implementation has been proposed. However, due to the limitation of FPGAs’ capacity, it cannot handle long patterns. In contrast, our proposed GPGPU implementation is able to handle long patterns efficiently, utilizing the scalability of GPGPU programming. Experimental results showed that the GPU implementation is more than 18 times as fast as the CPU one when the pattern length is greater than 3200, while the FPGA one could not handle such a long pattern.

Journal ArticleDOI
TL;DR: A measure of string complexity, called I-complexity, is presented; it is computable in linear time and space and counts the number of different substrings in a given string.

Patent
27 Feb 2012
TL;DR: In this article, natural language processing (NLP) approaches were used to map two strings and compute a similarity factor representing a measure of similarity between two strings based on a plurality of parameters, including a Levenshtein edit distance parameter.
Abstract: Natural language processing (NLP) approaches may be utilized to map two strings. The strings may come from sources utilizing different naming conventions. One example may be a data aggregator that collects used car transaction information. Another example may be a comprehensive database listing all possible manufacturer-defined vehicle options. A NLP system may operate to determine whether a source string is present in a target string and outputting a match containing the source string and the target string if the source string is present in the target string or computing a similarity factor if the source string is not present in the target string. The similarity factor representing a measure of similarity between two strings may be computed based on a plurality of parameters, including a Levenshtein edit distance parameter. The computed similarity can be used to find pricing information, including trade-in, sale, and list prices, across disparate naming conventions.
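
The similarity factor is described only at a high level; one plausible ingredient is the Levenshtein edit distance with a length-normalized score, sketched below. The normalization and the helper names are assumptions for illustration, not the patent's formula.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def similarity_factor(source, target):
    """Hypothetical length-normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not source and not target:
        return 1.0
    return 1.0 - levenshtein(source, target) / max(len(source), len(target))

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))                                  # 3
    print(round(similarity_factor("Leather Pkg", "Leather Package"), 3))
```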

Book ChapterDOI
21 Oct 2012
TL;DR: The minority lemma is proved, which exploits surprising properties of the closest string problem and enables constructing the closest string in a sequential fashion; together with additional ideas it gives an O(l^2) time algorithm for computing a closest string of 5 binary strings.
Abstract: The Closest String Problem is defined as follows. Let S be a set of k strings {s1,…,sk}, each of length l; find a string ŝ such that the maximum Hamming distance of ŝ from each of the strings is minimized. We denote this distance with d. The string ŝ is called a consensus string. In this paper we present two main algorithms, the Configuration algorithm with O(k^2 l^k) running time for this problem, and the Minority algorithm. The problem was introduced by Lanctot, Li, Ma, Wang and Zhang [13]. They showed that the problem is NP-hard and provided an IP approximation algorithm. Since then the closest string problem has been studied extensively. This research can be roughly divided into three categories: approximate, exact and practical solutions. This paper falls under the exact solutions category. Despite the great effort to obtain efficient algorithms for this problem, an algorithm with the natural running time of O(l^k) was not known. In this paper we close this gap. Our result means that algorithms solving the closest string problem in times O(l^2), O(l^3), O(l^4) and O(l^5) exist for the cases of k=2,3,4 and 5, respectively. It is known that, in fact, the cases of k=2,3, and 4 can be solved in linear time. No efficient algorithm is currently known for the case of k=5. We prove the minority lemma, which exploits surprising properties of the closest string problem and enables constructing the closest string in a sequential fashion. This lemma, with some additional ideas, gives an O(l^2) time algorithm for computing a closest string of 5 binary strings.
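
As a point of reference (not the Configuration or Minority algorithms above), a per-column majority vote gives a quick baseline: it minimizes the total Hamming distance, so it only heuristically bounds the max-distance objective of the Closest String Problem.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def majority_vote_baseline(strings):
    """Per-column majority consensus. This minimizes the *total* Hamming distance,
    so for the Closest String objective (minimize the *maximum* distance) it is
    only a heuristic baseline, not the paper's exact algorithm."""
    l = len(strings[0])
    consensus = "".join(Counter(s[i] for s in strings).most_common(1)[0][0] for i in range(l))
    return consensus, max(hamming(consensus, s) for s in strings)

if __name__ == "__main__":
    S = ["00110", "01100", "00100", "10100", "00101"]
    print(majority_vote_baseline(S))  # ('00100', 1)
```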

Proceedings ArticleDOI
16 Dec 2012
TL;DR: A character recognition mechanism based on a syntactic PR approach is presented that uses the trie data structure for efficient recognition and considers approximate rather than exact string matching to make the approach robust in the presence of noise.
Abstract: This paper shows a character recognition mechanism based on a syntactic PR approach that uses the trie data structure for efficient recognition. It uses approximate matching of the string for classification. During the preprocessing, an input character image is transformed into a skeletonized image and discrete curves are found using a 3 x 3 pixel region. A trie, which we call a sequence trie, is used for a look-up approach at a lower level to encode a discrete curve pattern of pixels. The sequence of such discrete curves from the input pattern is looked up in the sequence trie. The encoding of several such sequence numbers for the thinned character constructs a pattern string. Approximate string matching is used to compare the encoded pattern string from a template character with the pattern string obtained from the input character. We consider the approximate matching of the string instead of the exact matching to make the approach robust in the presence of noise. Another trie data structure (called a pattern trie) is used for the efficient storage and retrieval for approximate matching of the string. We make use of the trie since it takes O(m) in the worst case, where m is the length of the longest string in the trie. For the approximate string matching we use look ahead with a branch and bound scheme in the trie. Here we apply our method on 43 Telugu characters from the basic Telugu characters for demonstration. The proposed approach has recognised all the test characters given here correctly; however, more extensive testing on realistic data is required.
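
Approximate lookup in a trie with a branch-and-bound edit-distance cutoff is a standard technique that the pattern-trie step above builds on; a self-contained sketch (the paper's sequence/pattern-trie encoding of curves is not reproduced):

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None                 # set at nodes that terminate a stored string

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def fuzzy_search(root, query, max_dist):
    """Return (word, distance) pairs within edit distance max_dist of query.

    Each trie edge extends one row of the Levenshtein DP table; a branch is
    pruned (bound) as soon as the whole row exceeds max_dist."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,                           # insertion
                           prev_row[j] + 1,                          # deletion
                           prev_row[j - 1] + (query[j - 1] != ch)))  # substitution / match
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        if min(row) <= max_dist:                                     # bound: otherwise prune subtree
            for nxt_ch, nxt in node.children.items():
                walk(nxt, nxt_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results

if __name__ == "__main__":
    root = TrieNode()
    for pattern in ("0121", "0122", "2101", "0221"):   # hypothetical encoded pattern strings
        insert(root, pattern)
    print(fuzzy_search(root, "0120", max_dist=1))       # [('0121', 1), ('0122', 1)]
```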

Proceedings ArticleDOI
23 Mar 2012
TL;DR: The KMP algorithm, the Rabin-Karp algorithm and their combination are presented and compared, by a number of tests at diverse data scales, to validate the efficiency of these three algorithms.
Abstract: String matching is a special kind of pattern recognition problem, which finds all occurrences of a given pattern string in a given text string. The technology of two-dimensional string matching is applied broadly in many information processing domains. A good two-dimensional string matching algorithm can effectively enhance the searching speed. In this paper, the KMP algorithm, the Rabin-Karp algorithm and their combination are presented and compared, by a number of tests at diverse data scales, to validate the efficiency of these three algorithms.
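
For completeness, a short sketch of the standard prefix-function formulation of KMP discussed in this comparison (not the paper's specific implementation):

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: precompute the prefix (failure) function of the pattern,
    then scan the text once; O(n + m) overall."""
    m = len(pattern)
    if m == 0:
        return []
    # prefix function: length of the longest proper border of pattern[:i+1]
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits, q = [], 0
    for i, ch in enumerate(text):
        while q and ch != pattern[q]:
            q = fail[q - 1]
        if ch == pattern[q]:
            q += 1
        if q == m:
            hits.append(i - m + 1)
            q = fail[q - 1]
    return hits

if __name__ == "__main__":
    print(kmp_search("ababcabababc", "ababc"))  # [0, 7]
```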