
Showing papers on "Approximate string matching published in 2012"


Proceedings ArticleDOI
23 Sep 2012
TL;DR: This paper proposes an approach to automatically split identifiers into their composing words and expand abbreviations, running in linear time with respect to the size of the dictionary by taking advantage of an approximate string matching technique.
Abstract: Information Retrieval (IR) techniques are being exploited by an increasing number of tools supporting Software Maintenance activities. Indeed, the lexical information embedded in the source code can be valuable for tasks such as concept location, clustering or recovery of traceability links. The application of such IR-based techniques relies on the consistency of the lexicon available in the different artifacts, and their effectiveness can worsen if programmers introduce abbreviations (e.g., rect) and/or do not strictly follow naming conventions such as Camel Case (e.g., UTFtoASCII). In this paper we propose an approach to automatically split identifiers into their composing words and expand abbreviations. The solution is based on a graph model and performs in linear time with respect to the size of the dictionary, taking advantage of an approximate string matching technique. The proposed technique exploits a number of different dictionaries, referring to increasingly broader contexts, in order to achieve a disambiguation strategy based on the knowledge gathered from the most appropriate domain. The approach has been compared to other splitting and expansion techniques, using freely available oracles for the identifiers extracted from 24 C/C++ and Java open source systems. Results show an improvement in both splitting and expanding performance, in addition to a strong enhancement in the computational efficiency.

72 citations
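
The paper's graph-based splitting and expansion algorithm is not reproduced above; as a rough illustration of the underlying idea, the sketch below combines camel-case splitting with dictionary-driven expansion using approximate matching. The mini-dictionary, the prefix/closest-match heuristics and the cutoff are illustrative assumptions, not the authors' method; note that same-case concatenations such as "UTFtoASCII" are exactly the cases a purely regex-based splitter misses and the paper's dictionary-based strategy targets.

```python
import re
from difflib import get_close_matches

# Hypothetical mini-dictionary; the paper uses several dictionaries of
# increasingly broad scope (function, file, application, English).
DICTIONARY = ["rectangle", "pointer", "counter", "convert", "ascii", "value"]

def split_identifier(identifier):
    """Split on underscores, digits and camel-case boundaries (e.g. 'drawRect')."""
    words = []
    for part in re.split(r"[_\d]+", identifier):
        # Insert breaks between lower->UPPER and between an acronym and the following word.
        part = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part)
        part = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", part)
        words.extend(w.lower() for w in part.split() if w)
    return words

def expand(word, dictionary=DICTIONARY):
    """Expand a (possibly abbreviated) word to the closest dictionary entry."""
    if word in dictionary:
        return word
    prefix_hits = [d for d in dictionary if d.startswith(word)]   # rect -> rectangle
    if prefix_hits:
        return min(prefix_hits, key=len)
    close = get_close_matches(word, dictionary, n=1, cutoff=0.6)  # cntr -> counter
    return close[0] if close else word

if __name__ == "__main__":
    for ident in ("drawRect", "cntr_value"):
        print(ident, "->", [expand(w) for w in split_identifier(ident)])
```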


Journal ArticleDOI
01 Feb 2012
TL;DR: A novel possibilistic fuzzy matching strategy with invariant properties is proposed, which can provide a robust and effective matching scheme for two sets of iris feature points and is comparable to those of the typical systems.
Abstract: In this paper, we propose a novel possibilistic fuzzy matching strategy with invariant properties, which can provide a robust and effective matching scheme for two sets of iris feature points. In addition, the nonlinear normalization model is adopted to provide more accurate position before matching. Moreover, an effective iris segmentation method is proposed to refine the detected inner and outer boundaries to smooth curves. For feature extraction, the Gabor filters are adopted to detect the local feature points from the segmented iris image in the Cartesian coordinate system and to generate a rotation-invariant descriptor for each detected point. After that, the proposed matching algorithm is used to compute a similarity score for two sets of feature points from a pair of iris images. The experimental results show that the performance of our system is better than those of the systems based on the local features and is comparable to those of the typical systems.

62 citations


Journal ArticleDOI
01 Jan 2012
TL;DR: This work presents an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.
Abstract: Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of a Parikh vector q in the text s requires finding a substring t of s with p(t)=q. This can be viewed as the task of finding a jumbled (permuted) version of a query pattern, hence the term Jumbled Pattern Matching. We present several algorithms for the approximate version of the problem: Given a string s and two Parikh vectors u,v (the query bounds), find all maximal occurrences in s of some Parikh vector q such that u≤q≤v. This definition encompasses several natural versions of approximate Parikh vector search. We present an algorithm solving this problem in sub-linear expected time using a wavelet tree of s, which can be computed in time O(n) in a preprocessing phase. We then discuss a Scrabble-like variation of the problem, in which a weight function on the letters of s is given and one has to find all occurrences in s of a substring t with maximum weight having Parikh vector p(t)≤v. For the case of a binary alphabet, we present an algorithm which solves the decision version of the Approximate Jumbled Pattern Matching problem in constant time, by indexing the string in subquadratic time.

55 citations
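
To make the definitions above concrete, the exact version of jumbled matching (find a substring t with p(t)=q) admits a simple sliding-window scan over a fixed alphabet; this sketch is background only and is not the paper's wavelet-tree or approximate algorithm.

```python
from collections import Counter

def parikh(s):
    """Parikh vector of s: the multiplicity of each character."""
    return Counter(s)

def jumbled_matches(s, q):
    """Yield start positions of substrings t of s with p(t) == q (exact version).

    Fixed-length sliding window of size sum(q.values()); O(1) update per shift
    over a fixed alphabet."""
    m = sum(q.values())
    if m == 0 or m > len(s):
        return
    window = Counter(s[:m])
    if window == q:
        yield 0
    for i in range(1, len(s) - m + 1):
        out_ch, in_ch = s[i - 1], s[i + m - 1]
        window[out_ch] -= 1
        if window[out_ch] == 0:
            del window[out_ch]
        window[in_ch] += 1
        if window == q:
            yield i

if __name__ == "__main__":
    # occurrences of any permutation of "abc" in "abaccab"
    print(list(jumbled_matches("abaccab", parikh("abc"))))  # [1, 4]
```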


Book ChapterDOI
05 Sep 2012
TL;DR: The techniques reduce pattern matching and the generalized Hamming distance problem to a novel linear algebra formulation that allows for generic solutions based on any additively homomorphic encryption, and are believed to be of independent interest.
Abstract: In this paper we consider the problem of secure pattern matching that allows single character wildcards and substring matching in the malicious (stand-alone) setting. Our protocol, called 5PM, is executed between two parties: Server, holding a text of length n, and Client, holding a pattern of length m to be matched against the text, where our notion of matching is more general and includes non-binary alphabets, non-binary Hamming distance and non-binary substring matching. 5PM is the first protocol with communication complexity sub-linear in circuit size to compute non-binary substring matching in the malicious model (general MPC has communication complexity which is at least linear in the circuit size). 5PM is also the first sublinear protocol to compute non-binary Hamming distance in the malicious model. Additionally, in the honest-but-curious (semi-honest) model, 5PM is asymptotically more efficient than the best known scheme when amortized for applications that require single character wildcards or substring pattern matching. 5PM in the malicious model requires O((m+n)k^2) bandwidth and O(m+n) encryptions, where m is the pattern length and n is the text length. Further, 5PM can hide pattern size with no asymptotic additional costs in either computation or bandwidth. Finally, 5PM requires only 2 rounds of communication in the honest-but-curious model and 8 rounds in the malicious model. Our techniques reduce pattern matching and the generalized Hamming distance problem to a novel linear algebra formulation that allows for generic solutions based on any additively homomorphic encryption. We believe our efficient algebraic techniques are of independent interest.

52 citations


Journal ArticleDOI
TL;DR: A new algorithm is presented achieving time O(n log k + m + α) and space O(m+A), where A is the sum of the lower bounds of the lengths of the gaps in P and α is the total number of occurrences of the strings in P within T.

47 citations


Journal ArticleDOI
TL;DR: This paper proposes an offline, data-driven approach that mines query logs for instances where content creators and web users apply a variety of strings to refer to the same webpages, and generates an expanded set of equivalent strings (entity synonyms) for each entity.
Abstract: Nowadays, there are many queries issued to search engines aimed at finding values from structured data (e.g., movie showtime of a specific location). In such scenarios, there is often a mismatch between the values of structured data (how content creators describe entities) and the web queries (how different users try to retrieve them). Therefore, recognizing the alternative ways people use to reference an entity is crucial for structured web search. In this paper, we study the problem of automatic generation of entity synonyms over structured data toward closing the gap between users and structured data. We propose an offline, data-driven approach that mines query logs for instances where content creators and web users apply a variety of strings to refer to the same webpages. This way, given a set of strings that reference entities, we generate an expanded set of equivalent strings (entity synonyms) for each entity. Our framework consists of three modules: candidate generation, candidate selection, and noise cleaning. We further study the cause of the problem through the identification of different entity synonym classes. The proposed method is verified with experiments on real-life data sets showing that we can significantly increase the coverage of structured web queries with good precision.

43 citations


Journal ArticleDOI
TL;DR: It is found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches.
Abstract: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.

41 citations
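
A minimal sketch of the classical CGR map over the unit square for DNA illustrates the 2^-L suffix property quoted above; the corner assignment below is a common convention assumed here, not necessarily the one used in the paper.

```python
# Classical Chaos Game Representation over the unit square for DNA.
# The corner assignment is a common convention (assumption), not taken from the paper.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr(seq, start=(0.5, 0.5)):
    """CGR coordinate of seq: repeatedly move halfway toward the next symbol's corner."""
    x, y = start
    for ch in seq:
        cx, cy = CORNERS[ch]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y

if __name__ == "__main__":
    # Sequences sharing the suffix "GATTACA" (length 7) land within 2**-7 of each other,
    # because each shared step halves whatever coordinate difference remained.
    p = cgr("CCGT" + "GATTACA")
    q = cgr("AAAA" + "GATTACA")
    print(p, q, max(abs(p[0] - q[0]), abs(p[1] - q[1])) <= 2 ** -7)  # True
```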


Proceedings ArticleDOI
12 Aug 2012
TL;DR: This paper proposes a general pattern matching strategy that consists of a pre-processing step and a pattern matching step that significantly reduces the number of events that need to be processed by A as well as the number of calls to A.
Abstract: In event pattern matching a sequence of input events is matched against a complex query pattern that specifies constraints on extent, order, values, and quantification of matching events. In this paper we propose a general pattern matching strategy that consists of a pre-processing step and a pattern matching step. Instead of eagerly matching incoming events, the pre-processing step buffers events in a match window to apply different pruning techniques (filtering, partitioning, and testing for necessary match conditions). In the second step, an event pattern matching algorithm, A, is called only for match windows that satisfy the necessary match conditions. This two-phase strategy with a lazy call of the matching algorithm significantly reduces the number of events that need to be processed by A as well as the number of calls to A. This is important since pattern matching algorithms tend to be expensive in terms of runtime and memory complexity, whereas the pre-processing can be done very efficiently. We conduct extensive experiments using real-world data with pattern matching algorithms for, respectively, automata and join trees. The experimental results confirm the effectiveness of our strategy for both types of pattern matching algorithms.

41 citations


Journal ArticleDOI
TL;DR: A new approach for the computation of median string based on string embedding is proposed, which applies three different inverse transformations to go from the vector domain back to the string domain in order to obtain a final approximation of the median string.

31 citations


Patent
07 Aug 2012
TL;DR: In this article, a method is proposed to generate at least one string based on the input string, where the input string is not a substring of the generated string, responsive to a determination that the generated string was previously generated based on the input string.
Abstract: A method includes receiving an input string from a virtual keyboard, generating at least one string based on the input string, where the input string is not a substring of the generated string, responsive to a determination that the generated string was previously generated based on the input string, selecting a candidate character associated with the input string and with the generated string, and displaying the generated string at a location on the virtual keyboard that is associated with the selected candidate character.

31 citations


Patent
29 Jun 2012
TL;DR: In this article, the authors describe methods and systems for managing an aggregation database using fuzzy matching rules that describe filters to determine how to match a media content record received from an external source to a stored record in the aggregation database.
Abstract: Methods and systems are described herein for managing an aggregation database. Matching rules that describe filters may be defined to determine how to match a media content record received from an external source to a stored record in the aggregation database. Fuzzy matching may be used to match attribute fields of the received record and stored records. Based on the results of the fuzzy matching, the received primary media content record may be linked to a stored record in the aggregation database.

Patent
28 Aug 2012
TL;DR: In this article, a string analysis tool for calculating a similarity metric between a source string and a plurality of target strings is presented, which is based on a minimum similarity metric threshold.
Abstract: A string analysis tool for calculating a similarity metric between a source string and a plurality of target strings. The string analysis tool may include optimizations that may reduce the number of calculations to be carried out when calculating the similarity metric for large volumes of data. In this regard, the string analysis tool may represent strings as features. As such, analysis may be performed relative to features (e.g., of either the source string or plurality of target strings) such that features from the strings may be eliminated from consideration when identifying target strings for which a similarity metric is to be calculated. The elimination of features may be based on a minimum similarity metric threshold, wherein features that are incapable of contributing to a similarity metric above the minimum similarity metric threshold are eliminated from consideration.

Book ChapterDOI
21 Oct 2012
TL;DR: It is shown there is a linear number of maximal-exponent repeats in an overlap-free string and the algorithm can locate all of them in linear time.
Abstract: The exponent of a string is the quotient of the string's length over the string's smallest period. The exponent and the period of a string can be computed in time proportional to the string's length. We design an algorithm to compute the maximal exponent of factors of an overlap-free string. Our algorithm runs in linear time on a fixed-size alphabet, while a naive solution of the question would run in cubic time. The solution for non-overlap-free strings derives from algorithms to compute all maximal repetitions, also called runs, occurring in the string. We show there is a linear number of maximal-exponent repeats in an overlap-free string. The algorithm can locate all of them in linear time.
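
Since the exponent is the length divided by the smallest period, and the smallest period equals the length minus the longest proper border, it can be computed directly from the KMP failure function; a short sketch (background only, not the paper's maximal-exponent algorithm):

```python
def smallest_period(s):
    """Smallest period of s = len(s) - length of the longest proper border (KMP failure function)."""
    n = len(s)
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and s[i] != s[k]:
            k = border[k - 1]
        if s[i] == s[k]:
            k += 1
        border[i] = k
    return n - border[-1] if n else 0

def exponent(s):
    """Exponent of s: |s| divided by its smallest period."""
    return len(s) / smallest_period(s)

if __name__ == "__main__":
    print(exponent("abaababaab"))  # period 5 ("abaab"), exponent 2.0
    print(exponent("abcab"))       # period 3, exponent 5/3
```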

Proceedings ArticleDOI
Koji Nakano
05 Dec 2012
TL;DR: This paper shows efficient implementations of approximate string matching on the memory machine models DMM and UMM for strings X and Y with length m and n, respectively.
Abstract: The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory access and the global memory access of GPUs. The approximate string matching for two strings X and Y is a task to find a substring of Y most similar to X. The main contribution of this paper is to show efficient implementations of approximate string matching on the memory machine models. Our best implementation for strings X and Y with length m and n (m ≤ n), respectively, runs in O(mn/w + ml) time units using n threads both on the DMM and the UMM with width w and latency l.

Patent
31 Aug 2012
TL;DR: A method that includes receiving an input string, ranking, by the processor, a predicted string associated with the input string and displaying the ranked predicted string is presented in this paper. But the ranking depends on whether the input text is a substring of the predicted string and at least on one of a typing speed and a typing confidence.
Abstract: A method that includes receiving an input string, ranking, by the processor, a predicted string associated with the input string, wherein the ranking depends on whether the input string is a substring of the predicted string and on at least one of a typing speed and a typing confidence, and displaying the ranked predicted string.

Book ChapterDOI
03 Jul 2012
TL;DR: A new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document, is studied.
Abstract: We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.

Proceedings Article
18 Nov 2012
TL;DR: This paper proposes and implements a recommender prototype which collects the natural language textual information available in the summary and description fields of previously resolved bug reports and classifies that information in a number of separate inverted lists with respect to the resolver of each issue.
Abstract: In this paper, we propose a novel approach for assisting human bug triagers in large open source software projects by semi-automating the bug assignment process. Our approach employs a simple and efficient n-gram-based algorithm for approximate string matching on the character level. We propose and implement a recommender prototype which collects the natural language textual information available in the summary and description fields of the previously resolved bug reports and classifies that information in a number of separate inverted lists with respect to the resolver of each issue. These inverted lists are considered as vocabulary-based expertise and interest models of the developers. Given a new bug report, the recommender creates all possible n-grams of the strings, evaluates their similarities to the available expertise models concerning a number of well-known string similarity measures, namely Cosine, Dice, Jaccard and Overlap coefficients. Finally, the top three developers are recommended as proper candidates for resolving this new issue. Experimental results on 5200 bug reports of the Eclipse JDT project show a weighted average precision value of 90.1% and a weighted average recall value of 45.5%. Keywords: software deployment and maintenance; semi-automated bug triage; approximate string retrieval; open source software.
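
A compact sketch of character-level n-gram profiles and the four similarity coefficients named above (Cosine, Dice, Jaccard, Overlap); the tokenisation, the choice n=3 and the absence of weighting are illustrative assumptions, not the prototype's exact configuration.

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character-level n-gram profile (n=3 here; the prototype's exact n is an assumption)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarities(a, b):
    """Cosine, Dice, Jaccard and Overlap coefficients between two n-gram profiles."""
    A, B = ngrams(a), ngrams(b)
    dot = sum(A[g] * B[g] for g in A)
    norm = math.sqrt(sum(v * v for v in A.values())) * math.sqrt(sum(v * v for v in B.values()))
    sa, sb = set(A), set(B)
    inter = len(sa & sb)
    return {
        "cosine": dot / norm if norm else 0.0,
        "dice": 2 * inter / (len(sa) + len(sb)) if sa or sb else 0.0,
        "jaccard": inter / len(sa | sb) if sa | sb else 0.0,
        "overlap": inter / min(len(sa), len(sb)) if sa and sb else 0.0,
    }

if __name__ == "__main__":
    report = "NullPointerException when refactoring a Java project"
    developer_profile = "refactoring crashes with NullPointerException in JDT"
    print(similarities(report, developer_profile))
```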

Proceedings ArticleDOI
01 Sep 2012
TL;DR: A bit-parallel string matching spam filtering system based on the improved Baeza-Yates and Navarro approximate string matching algorithm that has a low computational cost, is easy to implement, and has the potential to catch misspelled keywords.
Abstract: Spam has evolved in terms of contents, methods, delivery networks and volume. Reports indicate that up to 90% of the World Wide Web email traffic is spam [1]. The contents cover a wider range and are deviating from the conventional pharmaceuticals and adult content into more formal marketing campaigns. This illegal advertising is evolving into an underground market for bot masters who rent or sell spam agents. Progressively, spam campaigns engage new methods to ensure efficient mass delivery and dodge conventional spam detectors. They employ a very complicated and vast infrastructure of Botnets and Fast Flux Networks to deliver as many emails as possible. The main concerns for the spam detection process are detection and misclassification accuracies, and those remain a challenge because of the evolving techniques employed by spammers. In this paper we propose a bit-parallel string matching spam filtering system based on the improved Baeza-Yates and Navarro approximate string matching algorithm. This method has a low computational cost, is easy to implement, and has the potential to catch misspelled keywords. The proposed approach achieves 97.2% overall accuracy with a simple Naive Bayes classifier.
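
The Baeza-Yates–Navarro algorithm simulates an NFA for edit distance with bit-parallelism; the sketch below shows the simpler Shift-And recurrence restricted to k mismatches (Hamming distance), which conveys the bit-parallel idea behind such filters but is not the paper's filtering system.

```python
def bitparallel_k_mismatches(text, pattern, k):
    """End positions where pattern occurs in text with at most k mismatches.

    Shift-And style bit-parallelism: R[d] keeps, as set bits, the pattern
    prefixes currently matched with at most d mismatches. This is a
    simplification of the Wu-Manber / Baeza-Yates-Navarro scheme
    (Hamming distance only, no insertions/deletions)."""
    m = len(pattern)
    if m == 0 or m > 64:              # keep masks within one machine word for the sketch
        raise ValueError("pattern length must be between 1 and 64")
    B = {}
    for i, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)
    R = [0] * (k + 1)
    hits = []
    for pos, ch in enumerate(text):
        mask = B.get(ch, 0)
        prev = R[0]
        R[0] = ((R[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            cur = R[d]
            # either extend an exact match, or spend one mismatch on this character
            R[d] = (((cur << 1) | 1) & mask) | ((prev << 1) | 1)
            prev = cur
        if R[k] & accept:
            hits.append(pos)          # pattern ends here with <= k mismatches
    return hits

if __name__ == "__main__":
    print(bitparallel_k_mismatches("free viagra v1agra vigara", "viagra", 1))  # [10, 17]
```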

Book ChapterDOI
03 Jul 2012
TL;DR: The CSP and CSSP via rank distance are shown to be NP-hard; a polynomial time k-approximation algorithm is presented for the CSP, along with a parameterized algorithm for the case where the alphabet is binary and each string has the same number of 0's and 1's.
Abstract: Given a set S of k strings of maximum length n, the goal of the closest substring problem (CSSP) is to find the smallest integer d (and a corresponding string t of length l≤n) such that each string s∈S has a substring of length l of "distance" at most d to t. The closest string problem (CSP) is a special case of CSSP where l=n. CSP and CSSP arise in many applications in bioinformatics and are extensively studied in the context of Hamming and edit distance. In this paper we consider a recently introduced distance measure, namely the rank distance. First, we show that the CSP and CSSP via rank distance are NP-hard. Then, we present a polynomial time k-approximation algorithm for the CSP problem. Finally, we give a parametrized algorithm for the CSP (the parameter is the number of input strings) if the alphabet is binary and each string has the same number of 0's and 1's.

Journal ArticleDOI
TL;DR: A generalization of the classical Rabin-Karp string matching algorithm is presented to solve the k-mismatch problem, with average complexity O(n+m) (where n and m are the text and pattern lengths, respectively); it is in general faster and more accurate than other available tools such as SOAP2, BWA, and BOWTIE.
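
For background, the classical Rabin-Karp scheme that the paper generalizes rolls a polynomial hash over the text and verifies candidates only on hash hits; a minimal exact-matching sketch (the paper's k-mismatch generalization is not reproduced):

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Classic Rabin-Karp exact matching: roll a polynomial hash over the text
    and verify candidate windows character by character on hash equality."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)             # weight of the outgoing character
    ph = th = 0
    for i in range(m):
        ph = (ph * base + ord(pattern[i])) % mod
        th = (th * base + ord(text[i])) % mod
    hits = []
    for s in range(n - m + 1):
        if th == ph and text[s:s + m] == pattern:
            hits.append(s)
        if s + m < n:                         # roll the window one character right
            th = ((th - ord(text[s]) * high) * base + ord(text[s + m])) % mod
    return hits

if __name__ == "__main__":
    print(rabin_karp("GATTACAGATTACA", "TTACA"))  # [2, 9]
```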

01 Jan 2012
TL;DR: A comparison of Aho-Corasick, Commentz-Walter, Bit-Parallel (Shift-Or), Rabin-Karp, Wu-Manber and other multiple pattern string matching algorithms is presented on different parameters.
Abstract: The use of string matching algorithms in software applications like virus scanners (anti-virus) or intrusion detection systems is stressed for improving data security over the internet. String-matching techniques are used for sequence analysis, gene finding, evolutionary biology studies and analysis of protein expression. Other fields, such as Music Technology, Computational Linguistics, Artificial Intelligence and Artificial Vision, have been using string matching algorithms as an integral part of their theoretical and practical tools. Various string matching problems have appeared as a result of such continuous, exhaustive use, and have in turn been promptly solved by computer scientists. Many practical, real-world problems can be addressed by multiple pattern string matching algorithms. String matching algorithms like Aho-Corasick, Commentz-Walter, Bit-Parallel, Rabin-Karp, Wu-Manber etc. are the focus of this paper. The Aho-Corasick algorithm is based on finite state machines (automata). The Commentz-Walter algorithm is based on the idea of Knuth-Morris-Pratt and finite state machines. Bit-parallel algorithms like Shift-Or make use of wide machine words (CPU registers) to parallelize the work. Rabin-Karp uses hashing to find any one of a set of pattern strings in a text. Wu-Manber looks at the text in blocks instead of character by character, combining ideas from Aho-Corasick and Boyer-Moore. Each algorithm has certain advantages and disadvantages. This paper presents a comparative analysis of various multiple pattern string matching algorithms. A comparison of Aho-Corasick, Commentz-Walter, Bit-Parallel (Shift-Or), Rabin-Karp, Wu-Manber and other string matching algorithms is presented on different parameters.

Journal ArticleDOI
TL;DR: This research proposes a hybrid exact string matching algorithm by combining the good properties of the Quick Search and the Skip Search algorithms to demonstrate and devise a better method to solve the string matching problem with higher speed and lower cost.
Abstract: The string matching problem is a cornerstone of many computer science fields because of the fundamental role it plays in various computer applications. Thus, several string matching algorithms have been produced and applied in most operating systems, information retrieval, editors, internet searching engines, firewall interception and searching nucleotide or amino acid sequence patterns in genome and protein sequence databases. Several important factors are considered during the matching process, such as the number of character comparisons, the number of attempts and the consumed time. This research proposes a hybrid exact string matching algorithm that combines the good properties of the Quick Search and the Skip Search algorithms to devise a better method for solving the string matching problem with higher speed and lower cost. The hybrid algorithm was tested using different types of standard data. The hybrid algorithm provides efficient results and reliability compared with the original algorithms in terms of the number of character comparisons and the number of attempts when applied with different pattern lengths. Additionally, the hybrid algorithm achieved better performance, providing lower time complexity for the worst and best cases compared with other hybrid algorithms.
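
For reference, the Quick Search component (Sunday's bad-character rule keyed on the text character just past the current window) can be sketched as follows; the Skip Search part and the paper's specific hybridization are not shown.

```python
def quick_search(text, pattern):
    """Sunday's Quick Search: shift by the bad-character rule applied to the
    text character immediately to the right of the current window."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # shift[c] = distance from the rightmost occurrence of c in the pattern to its end
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    hits, s = [], 0
    while s <= n - m:
        if text[s:s + m] == pattern:
            hits.append(s)
        if s + m >= n:
            break
        s += shift.get(text[s + m], m + 1)   # character absent from pattern: jump past it
    return hits

if __name__ == "__main__":
    print(quick_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # [5]
```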

Proceedings ArticleDOI
27 Mar 2012
TL;DR: A fast text retrieval system to index and browse degraded historical documents, designed in a two level, coarse-to-fine approach, to increase the speed of the retrieval process.
Abstract: In this paper, we present a fast text retrieval system to index and browse degraded historical documents. The indexing and retrieval strategy is designed in a two level, coarse-to-fine approach, to increase the speed of the retrieval process. During the indexing step, the text parts in the images are encoded into sequences of primitives, obtained from two different codebooks: a coarse one corresponding to connected components and a fine one corresponding to glyph primitives. A glyph consists of a single character or a part of a character according to the shape complexity. During the querying step, the coarse and the fine signature are generated from the query image using both codebooks. Then, a bi-level approximate string matching algorithm is applied to find similar words, using the coarse approach first, and then the fine approach if necessary, by exploiting predetermined hypothetical locations. An experimental evaluation on datasets of real life document images, gathered from historical books of different scripts, demonstrated the speed improvement and good accuracy in the presence of degradation.

Book ChapterDOI
21 Oct 2012
TL;DR: An O(nm) algorithm is proposed for finding all the matches of a pattern P[1..m] in a text T[1..n], together with an approximate variant of function matching where two equal-length strings X and Y match if there exists a function that maps X to a string X′ such that X′ and Y are δγ-similar.
Abstract: This paper defines a new string matching problem by combining two paradigms: function matching and δγ-matching. The result is an approximate variant of function matching where two equal-length strings X and Y match if there exists a function that maps X to a string X′ such that X′ and Y are δγ-similar. We propose an O(nm) algorithm for finding all the matches of a pattern P[1..m] in a text T[1..n].
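
δγ-matching declares two equal-length integer strings similar when every position differs by at most δ and the differences sum to at most γ; a naive check and scan (illustration only, not the paper's O(nm) function-matching algorithm):

```python
def delta_gamma_match(x, y, delta, gamma):
    """True iff |x[i]-y[i]| <= delta for every i and the sum of |x[i]-y[i]| is <= gamma."""
    if len(x) != len(y):
        return False
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return max(diffs, default=0) <= delta and sum(diffs) <= gamma

def delta_gamma_occurrences(text, pattern, delta, gamma):
    """Naive scan: report start positions where the window delta-gamma-matches the pattern."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if delta_gamma_match(text[i:i + m], pattern, delta, gamma)]

if __name__ == "__main__":
    text = [60, 62, 65, 63, 61, 70, 59]
    pattern = [61, 64, 62]
    print(delta_gamma_occurrences(text, pattern, delta=2, gamma=4))  # [1]
```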

01 Jan 2012
TL;DR: An efficient GPGPU implementation of an algorithm for approximate string matching with regular expression operators, originally implemented on an FPGA, is proposed and experimental results showed that the GPU implementation is more than 18 times as fast as the CPU one when the pattern length is greater than 3200, while the FPGA one could not handle such a long pattern.
Abstract: In this paper, we propose an efficient GPGPU implementation of an algorithm for approximate string matching with regular expression operators, originally implemented on an FPGA, and compare the GPGPU, FPGA and CPU implementations by experiments. Approximate string matching with regular expression operators is used in various applications, such as full text database search and DNA sequence analysis. To efficiently handle a long text in the matching, a hardware algorithm for FPGA implementation has been proposed. However, due to the limitation of FPGAs’ capacity, it cannot handle long patterns. In contrast, our proposed GPGPU implementation is able to handle long patterns efficiently, utilizing the scalability of GPGPU programming. Experimental results showed that the GPU implementation is more than 18 times as fast as the CPU one when the pattern length is greater than 3200, while the FPGA one could not handle such a long pattern.

Journal ArticleDOI
TL;DR: A measure of string complexity, called I-complexity, is presented; it is computable in linear time and space and counts the number of different substrings in a given string.

Patent
27 Feb 2012
TL;DR: In this article, natural language processing (NLP) approaches were used to map two strings and compute a similarity factor representing a measure of similarity between two strings based on a plurality of parameters, including a Levenshtein edit distance parameter.
Abstract: Natural language processing (NLP) approaches may be utilized to map two strings. The strings may come from sources utilizing different naming conventions. One example may be a data aggregator that collects used car transaction information. Another example may be a comprehensive database listing all possible manufacturer-defined vehicle options. A NLP system may operate to determine whether a source string is present in a target string and outputting a match containing the source string and the target string if the source string is present in the target string or computing a similarity factor if the source string is not present in the target string. The similarity factor representing a measure of similarity between two strings may be computed based on a plurality of parameters, including a Levenshtein edit distance parameter. The computed similarity can be used to find pricing information, including trade-in, sale, and list prices, across disparate naming conventions.
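
The similarity factor is described only at a high level; one plausible ingredient is the Levenshtein edit distance with a length-normalized score, sketched below. The normalization and the helper names are assumptions for illustration, not the patent's formula.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]

def similarity_factor(source, target):
    """Hypothetical length-normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not source and not target:
        return 1.0
    return 1.0 - levenshtein(source, target) / max(len(source), len(target))

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))                                  # 3
    print(round(similarity_factor("Leather Pkg", "Leather Package"), 3))
```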

Book ChapterDOI
21 Oct 2012
TL;DR: The minority lemma is proved, which exploits surprising properties of the closest string problem and enables constructing the closest string in a sequential fashion; together with additional ideas it gives an O(l^2) time algorithm for computing a closest string of 5 binary strings.
Abstract: The Closest String Problem is defined as follows. Let S be a set of k strings {s1,…,sk}, each of length l; find a string ŝ such that the maximum Hamming distance of ŝ from each of the strings is minimized. We denote this distance with d. The string ŝ is called a consensus string. In this paper we present two main algorithms, the Configuration algorithm with O(k^2 l^k) running time for this problem, and the Minority algorithm. The problem was introduced by Lanctot, Li, Ma, Wang and Zhang [13]. They showed that the problem is NP-hard and provided an IP approximation algorithm. Since then the closest string problem has been studied extensively. This research can be roughly divided into three categories: approximate, exact and practical solutions. This paper falls under the exact solutions category. Despite the great effort to obtain efficient algorithms for this problem, an algorithm with the natural running time of O(l^k) was not known. In this paper we close this gap. Our result means that algorithms solving the closest string problem in times O(l^2), O(l^3), O(l^4) and O(l^5) exist for the cases of k=2,3,4 and 5, respectively. It is known that, in fact, the cases of k=2,3, and 4 can be solved in linear time. No efficient algorithm is currently known for the case of k=5. We prove the minority lemma, which exploits surprising properties of the closest string problem and enables constructing the closest string in a sequential fashion. This lemma, with some additional ideas, gives an O(l^2) time algorithm for computing a closest string of 5 binary strings.
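
As a point of reference (not the Configuration or Minority algorithms above), a per-column majority vote gives a quick baseline: it minimizes the total Hamming distance, so it only heuristically bounds the max-distance objective of the Closest String Problem.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def majority_vote_baseline(strings):
    """Per-column majority consensus. This minimizes the *total* Hamming distance,
    so for the Closest String objective (minimize the *maximum* distance) it is
    only a heuristic baseline, not the paper's exact algorithm."""
    l = len(strings[0])
    consensus = "".join(Counter(s[i] for s in strings).most_common(1)[0][0] for i in range(l))
    return consensus, max(hamming(consensus, s) for s in strings)

if __name__ == "__main__":
    S = ["00110", "01100", "00100", "10100", "00101"]
    print(majority_vote_baseline(S))  # ('00100', 1)
```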

Proceedings ArticleDOI
16 Dec 2012
TL;DR: A character recognition mechanism based on a syntactic PR approach is presented that uses the trie data structure for efficient recognition and considers approximate rather than exact string matching to make the approach robust in the presence of noise.
Abstract: This paper shows a character recognition mechanism based on a syntactic PR approach that uses the trie data structure for efficient recognition. It uses approximate matching of the string for classification. During the preprocessing, an input character image is transformed into a skeletonized image and discrete curves are found using a 3 x 3 pixel region. A trie, which we call a sequence trie, is used for a look-up approach at a lower level to encode a discrete curve pattern of pixels. The sequence of such discrete curves from the input pattern is looked up in the sequence trie. The encoding of several such sequence numbers for the thinned character constructs a pattern string. Approximate string matching is used to compare the encoded pattern string from a template character with the pattern string obtained from the input character. We consider the approximate matching of the string instead of the exact matching to make the approach robust in the presence of noise. Another trie data structure (called a pattern trie) is used for the efficient storage and retrieval for approximate matching of the string. We make use of the trie since it takes O(m) in the worst case, where m is the length of the longest string in the trie. For the approximate string matching we use look ahead with a branch and bound scheme in the trie. Here we apply our method on 43 Telugu characters from the basic Telugu characters for demonstration. The proposed approach has recognised all the test characters given here correctly; however, more extensive testing on realistic data is required.
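
Approximate lookup in a trie with a branch-and-bound edit-distance cutoff is a standard technique that the pattern-trie step above builds on; a self-contained sketch (the paper's sequence/pattern-trie encoding of curves is not reproduced):

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None                 # set at nodes that terminate a stored string

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def fuzzy_search(root, query, max_dist):
    """Return (word, distance) pairs within edit distance max_dist of query.

    Each trie edge extends one row of the Levenshtein DP table; a branch is
    pruned (bound) as soon as the whole row exceeds max_dist."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,                           # insertion
                           prev_row[j] + 1,                          # deletion
                           prev_row[j - 1] + (query[j - 1] != ch)))  # substitution / match
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        if min(row) <= max_dist:                                     # bound: otherwise prune subtree
            for nxt_ch, nxt in node.children.items():
                walk(nxt, nxt_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results

if __name__ == "__main__":
    root = TrieNode()
    for pattern in ("0121", "0122", "2101", "0221"):   # hypothetical encoded pattern strings
        insert(root, pattern)
    print(fuzzy_search(root, "0120", max_dist=1))       # [('0121', 1), ('0122', 1)]
```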

Proceedings ArticleDOI
23 Mar 2012
TL;DR: The KMP algorithm, the Rabin-Karp algorithm and their combination are presented and compared, by a number of tests at diverse data scales, to validate the efficiency of these three algorithms.
Abstract: String matching is a special kind of pattern recognition problem, which finds all occurrences of a given pattern string in a given text string. The technology of two-dimensional string matching is applied broadly in many information processing domains. A good two-dimensional string matching algorithm can effectively enhance the searching speed. In this paper, the KMP algorithm, the Rabin-Karp algorithm and their combination are presented and compared, by a number of tests at diverse data scales, to validate the efficiency of these three algorithms.
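
For completeness, a short sketch of the standard prefix-function formulation of KMP discussed in this comparison (not the paper's specific implementation):

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: precompute the prefix (failure) function of the pattern,
    then scan the text once; O(n + m) overall."""
    m = len(pattern)
    if m == 0:
        return []
    # prefix function: length of the longest proper border of pattern[:i+1]
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    hits, q = [], 0
    for i, ch in enumerate(text):
        while q and ch != pattern[q]:
            q = fail[q - 1]
        if ch == pattern[q]:
            q += 1
        if q == m:
            hits.append(i - m + 1)
            q = fail[q - 1]
    return hits

if __name__ == "__main__":
    print(kmp_search("ababcabababc", "ababc"))  # [0, 7]
```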