
Showing papers on "Approximate string matching" published in 2009


Book ChapterDOI
30 Aug 2009
TL;DR: This paper proposes to apply a Collaborative Work approach that leverages former explorations of the cube to recommend OLAP queries, and adapts Approximate String Matching, a technique popular in Information Retrieval, to match the current analysis with the former explorations and help suggest a query to the user.
Abstract: Interactive analysis of a datacube, in which a user navigates the cube by launching a sequence of queries, is often tedious, since the user may have no idea what the forthcoming query should be in the current analysis. To better support this process, we propose in this paper to apply a Collaborative Work approach that leverages former explorations of the cube to recommend OLAP queries. The system we have developed adapts Approximate String Matching, a technique popular in Information Retrieval, to match the current analysis with former explorations and help suggest a query to the user. Our approach has been implemented with the open source Mondrian OLAP server to recommend MDX queries, and we have carried out preliminary experiments that show its efficiency at generating effective query recommendations.
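
As a rough, hypothetical illustration of the idea (not the authors' Mondrian/MDX implementation): model each former exploration as a sequence of query identifiers, find the past session prefix most similar to the current session, and suggest the query that followed it. Here difflib's similarity ratio stands in for the paper's approximate-string-matching machinery, and recommend_next is our invented helper name.

    import difflib

    def recommend_next(current, former_sessions):
        """Suggest the query that followed the most similar session prefix."""
        best_score, suggestion = -1.0, None
        for session in former_sessions:
            for cut in range(1, len(session)):
                score = difflib.SequenceMatcher(None, current,
                                                session[:cut]).ratio()
                if score > best_score:
                    best_score, suggestion = score, session[cut]
        return suggestion

    # Former explorations and the current analysis as sequences of query ids.
    history = [["q1", "q2", "q3", "q4"], ["q1", "q5", "q6"]]
    print(recommend_next(["q1", "q2"], history))  # -> 'q3'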

74 citations


Book
24 Nov 2009
TL;DR: The Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance), often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
Abstract: In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. A generalization of the Levenshtein distance (the Damerau-Levenshtein distance) also allows the transposition of two adjacent characters as an operation. Some Translation Environment Tools, such as translation memory leveraging applications, use the Levenshtein algorithm to measure the edit distance between two fuzzy-matching content segments. The metric is named after Vladimir Levenshtein, who considered this distance in 1965. It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
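
The definition above translates directly into the textbook dynamic-programming recurrence; a minimal sketch (the helper name levenshtein is ours):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to transform a into b."""
        prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
        for i, ca in enumerate(a, 1):
            curr = [i]                  # distance from a[:i] to ""
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,           # deletion
                                curr[-1] + 1,          # insertion
                                prev[j - 1] + cost))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # -> 3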

66 citations


Journal ArticleDOI
TL;DR: It is proved that a restricted version of the closest string problem has the same parameterized complexity as the closest substring, answering an open question in the literature.
Abstract: The closest string problem and the closest substring problem are both natural theoretical computer science problems and find important applications in computational biology. Given $n$ input strings, the closest string (substring) problem finds a new string within distance $d$ of (a substring of) each input string such that $d$ is minimized. Both problems are NP-complete. In this paper we propose new algorithms for these two problems. For the closest string problem, we developed an exact algorithm with time complexity $O(n|\Sigma|^{O(d)})$, where $\Sigma$ is the alphabet. This improves the previously best known result of $O(nd^{O(d)})$ and yields a polynomial-time algorithm when $d=O(\log n)$. Using this algorithm, a polynomial-time approximation scheme (PTAS) for the closest string problem is also given, with time complexity $O(n^{O(\epsilon^{-2})})$, improving the previously best known $O(n^{O(\epsilon^{-2}\log\frac{1}{\epsilon})})$ PTAS. A new algorithm for the closest substring problem is also proposed. Finally, we prove that a restricted version of the closest substring problem has the same parameterized complexity as the closest substring problem, answering an open question in the literature.
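
To make the objective concrete, here is a brute-force sketch; it enumerates every candidate string over the alphabet, so it is exponential and only usable on toy instances, in contrast to the paper's $O(n|\Sigma|^{O(d)})$ algorithm:

    from itertools import product

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def closest_string(strings, alphabet):
        """Exhaustive search for a string minimizing the maximum Hamming
        distance to the inputs; cost is |alphabet| ** length."""
        length = len(strings[0])
        best, best_d = None, length + 1
        for cand in product(alphabet, repeat=length):
            cand = "".join(cand)
            d = max(hamming(cand, s) for s in strings)
            if d < best_d:
                best, best_d = cand, d
        return best, best_d

    print(closest_string(["acgt", "acga", "tcga"], "acgt"))  # -> ('acga', 1)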

57 citations


Journal ArticleDOI
TL;DR: This paper focuses on indexed approximate string matching (ASM), which is of great interest, say, in bioinformatics, and studies ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays, which are competitive and provide useful space-time tradeoffs compared to classical indexes.
Abstract: A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest in compressed indexes, there has not been much progress on functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in bioinformatics. We study ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays. Most compressed self-indexes belong to one of these classes. We start by adapting the classical method of partitioning into exact search to self-indexes, and optimize it over a representative of either class of self-index. Then, we show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to a Lempel-Ziv index. Finally, we improve hierarchical verification, a successful technique for sequential searching, so as to extend the matches of pattern pieces to the left or right. Most compressed suffix trees/arrays support the required bidirectionality, thus enabling the implementation of the improved technique. In turn, the improved verification largely reduces the accesses to the text, which are expensive in self-indexes. We show experimentally that our algorithms are competitive and provide useful space-time tradeoffs compared to classical indexes.
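
The "partitioning into exact search" method the authors adapt rests on a pigeonhole argument: if P occurs with at most k errors, then splitting P into k+1 pieces leaves at least one piece that occurs exactly. A hedged, index-free sketch of that filter (the paper runs it on top of compressed self-indexes rather than plain text scanning):

    def approx_occurrences(pattern, text, k):
        """Pigeonhole filter: exact hits of the k+1 pieces select candidate
        windows, which are verified by semi-global edit distance."""
        def min_edit_in(p, window):
            # min edit distance between p and any substring of window
            prev = [0] * (len(window) + 1)          # free start in window
            for i, ca in enumerate(p, 1):
                curr = [i]
                for j, cb in enumerate(window, 1):
                    curr.append(min(prev[j] + 1, curr[-1] + 1,
                                    prev[j - 1] + (ca != cb)))
                prev = curr
            return min(prev)                        # free end in window

        m, step = len(pattern), len(pattern) // (k + 1)
        pieces = [(i * step, pattern[i * step:(i + 1) * step if i < k else m])
                  for i in range(k + 1)]
        windows = set()
        for offset, piece in pieces:
            pos = text.find(piece)
            while pos != -1:
                lo = max(0, pos - offset - k)       # candidate window start
                hi = min(len(text), pos - offset + m + k)
                if min_edit_in(pattern, text[lo:hi]) <= k:
                    windows.add(lo)
                pos = text.find(piece, pos + 1)
        return sorted(windows)

    # windows at 0 ("anneal", one edit away) and 13 ("annual", exact)
    print(approx_occurrences("annual", "annealing_and_annuals", 1))  # -> [0, 13]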

53 citations


Proceedings ArticleDOI
02 Nov 2009
TL;DR: Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.
Abstract: Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and on static databases. However, many organisations are increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include online law enforcement and national security databases, public health surveillance and emergency response systems, financial verification systems, online retail stores, eGovernment services, and digital libraries. A novel inverted index based approach for real-time entity resolution is presented in this paper. At build time, similarities between attribute values are computed and stored to support the fast matching of records at query time. The presented approach differs from other approaches to approximate query matching in that it allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be incorporated. Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach. The interested reader is referred to the longer version of this paper [5].
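
A hedged sketch of the build-time/query-time split described above, with a first-two-letters blocking function and q-gram Jaccard similarity standing in for the arbitrary, possibly domain-specific functions the approach allows; all names here are our own:

    def qgrams(s, q=2):
        s = f"#{s}#"  # pad so word boundaries form grams too
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    class RealTimeERIndex:
        """Build time: block records and pre-compute their gram sets.
        Query time: compare only against records sharing a block key."""
        def __init__(self, block=lambda v: v[:2].lower()):
            self.block = block
            self.index = {}   # block key -> list of record ids
            self.grams = {}   # record id -> pre-computed q-gram set

        def add(self, rid, value):
            self.index.setdefault(self.block(value), []).append(rid)
            self.grams[rid] = qgrams(value)

        def query(self, value, threshold=0.5):
            g = qgrams(value)
            cands = self.index.get(self.block(value), [])
            scored = [(jaccard(g, self.grams[r]), r) for r in cands]
            return sorted([s for s in scored if s[0] >= threshold],
                          reverse=True)

    idx = RealTimeERIndex()
    for rid, name in enumerate(["christen", "christensen", "smith"]):
        idx.add(rid, name)
    print(idx.query("christian"))  # -> [(0.58..., 0), (0.5, 1)]; 'smith' skipped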

48 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper presents a framework that advocates lazy update propagation with the following key feature: Efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers.
Abstract: Approximate string matching is a problem that has received a lot of attention recently. Existing work on information retrieval has concentrated on a variety of similarity measures (TF/IDF, BM25, HMM, etc.) specifically tailored for document retrieval purposes. As new applications that depend on retrieving short strings become popular (e.g., local search engines like YellowPages.com, Yahoo! Local, and Google Maps), new indexing methods are needed, tailored for short strings. For that purpose, a number of indexing techniques and related algorithms have been proposed based on length-normalized similarity measures. A common denominator of indexes for length-normalized measures is that maintaining the underlying structures in the presence of incremental updates is inefficient, mainly due to data-dependent, precomputed weights associated with each distinct token and string. Incorporating updates is usually accomplished by rebuilding the indexes at regular time intervals. In this paper we present a framework that advocates lazy update propagation with the following key feature: efficient, incremental updates that immediately reflect the new data in the indexes in a way that gives strict guarantees on the quality of subsequent query answers. More specifically, our techniques guarantee against false negatives and limit the number of false positives produced. We implement a fully working prototype and illustrate that the proposed ideas work very well in practice on real datasets.
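
The maintenance problem stems from those data-dependent weights: a typical length-normalized measure scores token overlap with idf-style weights, and every insertion shifts document frequencies and therefore, in principle, every precomputed weight. A hedged sketch of such a measure (the weighting and normalization below are illustrative, not the paper's exact formulas):

    import math

    def idf_weights(collection):
        """Data-dependent token weights: rarer tokens count for more.
        Adding one string changes df counts and hence all weights --
        which is why naive index maintenance is expensive."""
        n, df = len(collection), {}
        for s in collection:
            for tok in set(s.split()):
                df[tok] = df.get(tok, 0) + 1
        return {tok: math.log(1 + n / d) for tok, d in df.items()}

    def normalized_sim(a, b, w):
        """Weighted token overlap, normalized by the strings' weight mass."""
        ta, tb = set(a.split()), set(b.split())
        shared = sum(w.get(t, 0.0) for t in ta & tb)
        norm = math.sqrt(sum(w.get(t, 0.0) for t in ta) *
                         sum(w.get(t, 0.0) for t in tb))
        return shared / norm if norm else 0.0

    data = ["main st", "main street", "elm street"]
    w = idf_weights(data)
    print(round(normalized_sim("main st", "main street", w), 3))  # -> 0.446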

45 citations


Proceedings ArticleDOI
26 Feb 2009
TL;DR: The classical Four-Russians technique can be incorporated into the SLP edit-distance scheme, giving a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.
Abstract: We present a unified framework for accelerating edit-distance computation between two compressible strings using straight-line programs. For two strings of total length $N$ having straight-line program representations of total size $n$, we provide an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{1.34}N^{1.34})$-time algorithm for arbitrary scoring functions. This improves on a recent algorithm of Tiskin that runs in $O(nN^{1.5})$ time and works only for rational scoring functions. Also, in the last part of the paper, we show how the classical Four-Russians technique can be incorporated into our SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.

44 citations


Journal ArticleDOI
TL;DR: This work develops a novel and unorthodox filtering technique based on transforming the problem into multiple matching of carefully chosen pattern subsequences, which leads to very simple algorithms that are optimal on average.

40 citations


Proceedings ArticleDOI
04 Jan 2009
TL;DR: This work considers the classic problem of pattern matching with few mismatches in the presence of promiscuously matching wildcard symbols and develops a new framework in which to tackle approximate pattern matching problems.
Abstract: We consider the classic problem of pattern matching with few mismatches in the presence of promiscuously matching wildcard symbols. Given a text t of length n and a pattern p of length m with optional wildcard symbols and a bound k, our algorithm finds all the alignments for which the pattern matches the text with Hamming distance at most k and also returns the location and identity of each mismatch. The algorithm we present is deterministic and runs in O(kn) time, matching the best known randomised time complexity to within logarithmic factors. The solutions we develop borrow from the tool set of algebraic coding theory and provide a new framework in which to tackle approximate pattern matching problems.
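
As a hedged reference point for the problem statement (the paper's O(kn) algorithm itself relies on algebraic coding-theory machinery), the brute-force O(nm) solution, with '?' as the wildcard, is:

    def k_mismatch_with_wildcards(text, pattern, k, wild="?"):
        """Naive baseline: report (alignment, mismatch list) for every
        alignment with Hamming distance <= k, wildcards matching freely."""
        n, m = len(text), len(pattern)
        out = []
        for i in range(n - m + 1):
            mism = [(i + j, text[i + j], pattern[j])
                    for j in range(m)
                    if pattern[j] != wild and text[i + j] != wild
                    and text[i + j] != pattern[j]]
            if len(mism) <= k:
                out.append((i, mism))
        return out

    # -> [(0, [(2, 'c', 'd')]), (3, [])]
    print(k_mismatch_with_wildcards("abcabd", "a?d", 1))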

35 citations


Patent
10 Jun 2009
TL;DR: Indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds, which are determined at little cost in comparison to updating the indexes themselves.
Abstract: In embodiments of the disclosed technology, indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds which are determined with little cost in comparison to the updates of the indexes themselves. With the present technology, a batch of daily updates can be processed in a matter of minutes, rather than a few hours for rebuilding an index, and a query may be answered with assurances that the results are accurate or within a threshold of accuracy.

34 citations


Book ChapterDOI
05 Dec 2009
TL;DR: It turns out that the core technique actually solves a more general incremental string comparison problem that allows the insertion, deletion, and substitution of multiple symbols.
Abstract: We study the problem of finding all maximal approximate gapped palindromes in a string. More specifically, given a string S of length n, a parameter q ≥ 0 and a threshold k > 0, the problem is to identify all substrings in S of the form uvw such that (1) the Levenshtein distance between u and w^r is at most k, where w^r is the reverse of w, and (2) v is a string of length q. The best previous work requires O(k^2 n) time. In this paper, we propose an O(kn)-time algorithm for this problem by utilizing an incremental string comparison technique. It turns out that the core technique actually solves a more general incremental string comparison problem that allows the insertion, deletion, and substitution of multiple symbols.
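
A hedged brute-force checker for this definition (far slower than the paper's O(kn) algorithm, and reporting all rather than only maximal palindromes) makes the u-v-w structure concrete:

    def gapped_palindromes(s, q, k):
        """Report (start, |u|, |w|) for substrings uvw of s with |v| == q
        and Levenshtein(u, reverse(w)) <= k. Brute force over all splits."""
        def edit(a, b):
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1, curr[-1] + 1,
                                    prev[j - 1] + (ca != cb)))
                prev = curr
            return prev[-1]

        n, out = len(s), []
        for i in range(n):                            # start of u
            for lu in range(1, n - i - q + 1):        # length of u
                for lw in range(1, n - i - lu - q + 1):  # length of w
                    u = s[i:i + lu]
                    w = s[i + lu + q:i + lu + q + lw]
                    if edit(u, w[::-1]) <= k:
                        out.append((i, lu, lw))
        return out

    # -> [(0, 2, 2), (1, 1, 1)]: 'ab|x|ba' and 'b|x|b'
    print(gapped_palindromes("abxba", q=1, k=0))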

Journal ArticleDOI
TL;DR: This paper considers a class of pattern matching problems where the content is assumed to be correct, while the locations may have shifted/changed, and formally defines a broad class of problems, capturing situations in which the pattern is obtained from the text by a sequence of rearrangements.

Patent
25 Mar 2009
TL;DR: In this article, a computer implemented method and system that progressively relaxes search terms provided by a user is described, where the accepted search terms are modified uniquely based on the predefined types to structure first alternative queries and the second alternative queries are compared with the stored data to find approximate matches.
Abstract: Disclosed herein is a computer implemented method and system that progressively relaxes search terms provided by a user. Data of predefined types is stored in a database. The data is obtained by uniquely modifying data previously stored in the database, based on the predefined types. Search terms of predefined types are accepted from the user. The search terms are compared with the stored data to find exact matches, if length of the search terms exceeds a predefined value. On not finding exact matches, the accepted search terms are modified uniquely based on the predefined types to structure first alternative queries. The first alternative queries are compared with the stored data to find exact matches. On not finding exact matches, the first alternative queries are modified based on the predefined types to structure second alternative queries. The second alternative queries are compared with the stored data to find approximate matches.
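
A hedged sketch of the progressive flow described in the claims; the concrete normalization and relaxation steps below are invented for illustration, and difflib stands in for the approximate-matching step:

    import difflib

    def progressive_search(term, records, min_len=3):
        """Exact match first; then normalized exact; then approximate."""
        def normalize(s):                 # first relaxation (illustrative)
            return "".join(ch for ch in s.lower() if ch.isalnum())

        if len(term) < min_len:           # too short to match reliably
            return []
        hits = [r for r in records if r == term]                     # exact
        if not hits:                      # first alternative queries
            hits = [r for r in records if normalize(r) == normalize(term)]
        if not hits:                      # second alternative: approximate
            by_norm = {normalize(r): r for r in records}
            close = difflib.get_close_matches(normalize(term),
                                              list(by_norm), cutoff=0.8)
            hits = [by_norm[c] for c in close]
        return hits

    db = ["O'Brien", "Smith", "Smyth"]
    print(progressive_search("obrien", db))  # -> ["O'Brien"] (normalized)
    print(progressive_search("smithe", db))  # -> ['Smith'] (approximate)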

Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper develops a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques, and develops efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists.
Abstract: An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
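
The gram-based inverted lists at the core of these algorithms can be sketched as follows: a minimal version with set semantics and none of the paper's pruning, skipping, or early-termination machinery:

    def grams(s, q=2):
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    def build_gram_index(strings, q=2):
        """Inverted lists: q-gram -> ids of strings containing it."""
        inv = {}
        for sid, s in enumerate(strings):
            for g in set(grams(s, q)):
                inv.setdefault(g, []).append(sid)
        return inv

    def candidates(query, strings, inv, k, q=2):
        """Count filter: a string within edit distance k of the query must
        share at least len(grams(query)) - k*q of its q-grams (assumes the
        query is long enough that this bound stays positive)."""
        need = len(grams(query, q)) - k * q
        counts = {}
        for g in set(grams(query, q)):
            for sid in inv.get(g, []):
                counts[sid] = counts.get(sid, 0) + 1
        return [strings[sid] for sid, c in counts.items() if c >= need]

    data = ["seattle", "seatle", "boston"]
    inv = build_gram_index(data)
    print(candidates("seattle", data, inv, k=1))  # -> ['seattle', 'seatle']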

Journal ArticleDOI
TL;DR: This work presents a time-space trade-off for approximate string matching and regular expression matching on texts compressed with the Ziv-Lempel adaptive dictionary compression schemes, improving the previously known complexities for both problems and significantly improving the space bounds.
Abstract: We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds, which in practical applications are likely to be a bottleneck.

Journal Article
TL;DR: A fast algorithm for finding approximate matches of a string in a finite-state automaton, given some metric of similarity, which can be adapted to use a variety of metrics for determining the distance between two words.
Abstract: We present a fast algorithm for finding approximate matches of a string in a finite-state automaton, given some metric of similarity. The algorithm can be adapted to use a variety of metrics for determining the distance between two words.

Book ChapterDOI
18 Jun 2009
TL;DR: This paper proposes the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance 1 (report all strings of the dictionary at edit distance at most 1 from the query string) in time linear in the length of the query string.
Abstract: In the approximate dictionary search problem we have to construct a data structure on a set of strings so that we can answer queries of the kind: find all strings of the set that are similar (according to some string distance) to a given string. In this paper we propose the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance 1 (report all strings of the dictionary that are at edit distance at most 1 from the query string) in time linear in the length of the query string. Based on our new dictionary, we propose a full-text index for approximate queries with edit distance 1 (report all positions of all substrings of the text that are at edit distance at most 1 from the query string) answering a query in time linear in the length of the query string and using space $O(n(\lg(n)\lg\lg(n))^2)$ in the worst case on a text of length n. Our index is the first to answer queries in time linear in the length of the query string while using space $O(n \cdot \mathrm{poly}(\log n))$ in the worst case and for any alphabet size.
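
One classical way to get fast edit-distance-1 dictionary lookups, though far from the paper's space-optimal structure, is to index single-character deletions:

    def deletions(word):
        """The word itself plus every string with one character deleted."""
        return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

    def build(dictionary):
        idx = {}
        for w in dictionary:
            for d in deletions(w):
                idx.setdefault(d, set()).add(w)
        return idx

    def within_one(a, b):
        """Exact check: edit distance between a and b is at most 1."""
        if a == b:
            return True
        if abs(len(a) - len(b)) > 1:
            return False
        for i in range(min(len(a), len(b))):
            if a[i] != b[i]:   # one substitution, insertion, or deletion
                return (a[i + 1:] == b[i + 1:] or a[i + 1:] == b[i:]
                        or a[i:] == b[i + 1:])
        return True

    def query(idx, s):
        """Candidate pairs share a deletion variant; the exact check removes
        false positives such as 'ab' vs 'ba'."""
        cands = set()
        for d in deletions(s):
            cands |= idx.get(d, set())
        return {w for w in cands if within_one(w, s)}

    idx = build(["cat", "cart", "dog"])
    print(sorted(query(idx, "car")))  # -> ['cart', 'cat']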

Journal ArticleDOI
TL;DR: An Aho-Corasick-based parallel string matching scheme is proposed that outperforms existing bit-split string matching in evaluations on Snort rules.
Abstract: As the variety of hazardous packet payload contents increases, intrusion detection systems (IDS) must be able to detect numerous patterns in real time. For this reason, this paper proposes a parallel string matching scheme based on the Aho-Corasick algorithm. In order to balance memory usage between the homogeneous finite-state machine (FSM) tiles of each string matcher, an optimal set of bit-position groups is determined. Target patterns are sorted by binary-reflected Gray code (BRGC), which reduces bit transitions in patterns mapped onto a string matcher. In evaluations on Snort rules, the proposed string matching outperforms existing bit-split string matching.
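
For reference, a hedged software version of the Aho-Corasick automaton underlying the architecture (the paper's contribution, bit-splitting the FSM across tiles and BRGC pattern ordering, sits on top of this):

    from collections import deque

    def build_ac(patterns):
        """Aho-Corasick: trie of patterns plus BFS-computed failure links."""
        goto, fail, out = [{}], [0], [set()]
        for p in patterns:                        # build the trie
            s = 0
            for ch in p:
                if ch not in goto[s]:
                    goto.append({})
                    fail.append(0)
                    out.append(set())
                    goto[s][ch] = len(goto) - 1
                s = goto[s][ch]
            out[s].add(p)
        q = deque(goto[0].values())               # depth-1 states keep fail=0
        while q:
            s = q.popleft()
            for ch, t in goto[s].items():
                q.append(t)
                f = fail[s]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(ch, 0)
                out[t] |= out[fail[t]]            # inherit suffix matches
        return goto, fail, out

    def search(text, goto, fail, out):
        """One pass over the text reports all pattern occurrences."""
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            for p in out[s]:
                hits.append((i - len(p) + 1, p))
        return hits

    g, f, o = build_ac(["he", "she", "his", "hers"])
    print(search("ushers", g, f, o))  # 'she'@1, 'he'@2, 'hers'@2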

Patent
23 Jun 2009
TL;DR: In this paper, the authors describe techniques for error-tolerant auto-completion, where characters of an input string are displayed as they are inputted by a user, and when a character is added to the input string by the user, matching strings may be selected from among a set of candidate strings by determining which of the candidate strings have a prefix whose characters match the characters of the input text within a given edit distance of input text.
Abstract: Techniques for error-tolerant autocompletion are described. While displaying characters of an input string as they are inputted by a user, when a character is added to the input string by the user, matching strings may be selected from among a set of candidate strings by determining which of the candidate strings have a prefix whose characters match the characters of the input string within a given edit distance of the input string.
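
The core test, whether some prefix of a candidate lies within a given edit distance of what has been typed so far, is one dynamic-programming table per candidate; a hedged sketch (the patent additionally organizes candidates for efficient selection):

    def min_prefix_edit(typed, candidate):
        """Minimum edit distance between `typed` and any prefix of
        `candidate`: run the standard DP and take the minimum of the final
        row, since column j scores the prefix candidate[:j]."""
        prev = list(range(len(candidate) + 1))
        for i, ct in enumerate(typed, 1):
            curr = [i]
            for j, cc in enumerate(candidate, 1):
                curr.append(min(prev[j] + 1, curr[-1] + 1,
                                prev[j - 1] + (ct != cc)))
            prev = curr
        return min(prev)

    def autocomplete(typed, candidates, d=1):
        """Candidates having some prefix within edit distance d of `typed`."""
        return [c for c in candidates if min_prefix_edit(typed, c) <= d]

    words = ["binary", "banana", "bind", "care"]
    print(autocomplete("bnar", words))  # -> ['binary'] (prefix 'binar', 1 edit)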

Book ChapterDOI
02 Dec 2009
TL;DR: A protocol based on homomorphic encryption, combined with the novel notion of a share-hiding error-correcting secret sharing scheme, which is provably secure against passive adversaries, and has better efficiency than previous protocols for certain parameter values is presented.
Abstract: At Eurocrypt '04, Freedman, Nissim and Pinkas introduced a fuzzy private matching problem. The problem is defined as follows. Given two parties, each of them having a set of vectors where each vector has T integer components, the fuzzy private matching is to securely test if each vector of one set matches any vector of another set for at least t components, where t < T. In the conclusion of their paper, they asked whether it was possible to design a fuzzy private matching protocol without incurring a communication complexity with the factor $\binom{T}{t}$. We answer their question in the affirmative by presenting a protocol based on homomorphic encryption, combined with the novel notion of a share-hiding error-correcting secret sharing scheme, which we show how to implement with efficient decoding using interleaved Reed-Solomon codes. This scheme may be of independent interest. Our protocol is provably secure against passive adversaries, and has better efficiency than previous protocols for certain parameter values.

Proceedings ArticleDOI
02 Nov 2009
TL;DR: An incremental algorithm using signature-based inverted lists is developed to minimize duplicate list-scan operations across overlapping windows in the text; it significantly outperforms existing methods in the literature.
Abstract: We study the problem of approximate membership extraction (AME), i.e., how to efficiently extract substrings in a text document that approximately match some strings in a given dictionary. This problem is important in a variety of applications such as named entity recognition and data cleaning. We solve this problem in two steps. In the first step, for each substring in the text, we filter away the strings in the dictionary that are very different from the substring. In the second step, each candidate string is verified to decide whether the substring should be extracted. We develop an incremental algorithm using signature-based inverted lists to minimize the duplicate list-scan operations of overlapping windows in the text. Our experimental study of the proposed algorithms on real and synthetic datasets showed that our solutions significantly outperform existing methods in the literature.

Journal ArticleDOI
TL;DR: An FFT-based algorithm is presented that uses a novel prime-numbers encoding scheme, is $\log n/\log m$ times faster than the fastest extant approaches, which are based on Boolean convolutions, and speeds up solutions to approximate matching problems with character classes.

Patent
William T. Laaser
29 Jan 2009
TL;DR: In this paper, a comparison technique for efficiently comparing an input string to a set of strings is described, where the input string is represented in a tree structure as paths from a root to leaves of the tree structure, and strings in the set of string that share common substrings share nodes in the tree.
Abstract: A comparison technique for efficiently comparing an input string to a set of strings is described. This set of strings may be represented in a tree structure as paths from a root of the tree structure to leaves of the tree structure, and strings in the set of strings that share common substrings share nodes in the tree structure. During the comparison technique, labels may be assigned to a given node in the tree structure based at least in part on comparisons between a given character in the input string and a character associated with the given node. These labels may include a position of the given character in the input string, and a cumulative error between the characters in a string that are associated with a branch in the tree structure and the characters in the input string that have been processed. Based at least in part on these labels, an actual string, which corresponds to the input string, may be identified.
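
A hedged sketch of the idea: store the set of strings in a trie, so shared prefixes share nodes, and while walking it carry the running edit-distance DP row as the per-node labels, pruning a branch once its cumulative error exceeds the bound. The class and function names below are ours:

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.word = None        # set at the node ending a stored string

    def build_trie(words):
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.word = w
        return root

    def fuzzy_lookup(root, query, max_err):
        """DFS through the trie, extending one Levenshtein DP row per node;
        a branch is pruned when its whole row exceeds max_err."""
        results = []

        def walk(node, row):
            if node.word is not None and row[-1] <= max_err:
                results.append((node.word, row[-1]))
            if min(row) > max_err:   # cumulative error too large: prune
                return
            for ch, child in node.children.items():
                new_row = [row[0] + 1]
                for j in range(1, len(query) + 1):
                    new_row.append(min(row[j] + 1, new_row[-1] + 1,
                                       row[j - 1] + (query[j - 1] != ch)))
                walk(child, new_row)

        walk(root, list(range(len(query) + 1)))
        return results

    trie = build_trie(["string", "strong", "sting", "trunk"])
    print(fuzzy_lookup(trie, "strng", 1))  # string, strong, sting at dist 1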

Proceedings ArticleDOI
08 Mar 2009
TL;DR: A fuzzy-matching clustering algorithm is introduced to group subjects found in malware-generated spam emails, detecting similar patterns even when the spammer creates a variation of the original pattern.
Abstract: In this paper, a fuzzy-matching clustering algorithm is introduced to group subjects found in spam emails which are generated by malware. A modified scoring strategy is applied in dynamic programming to find subjects that are similar to each other. A recursive seed selection strategy allows the algorithm to detect similar patterns even when the spammer creates a variation of the original pattern. A sliding threshold based on string length helps to minimize false positives. The algorithm proves to be effective in detecting and grouping spam emails that use templates. It also helps spam investigators to collect and sort large amounts of malware-generated spam more efficiently without looking at the email content.

Proceedings ArticleDOI
08 Mar 2009
TL;DR: This work combines various ideas from the large body of literature on approximate string searching and spelling correction techniques to a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space.
Abstract: We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words which are spelling variants of each other. This problem naturally arises in the context of error-tolerant full-text search of the following kind: for a given query, return not only documents matching the query words exactly but also those matching their spelling variants. This is the inverse of the well-known "Did you mean: … ?" web search engine feature, where the error tolerance is on the side of the query, and not on the side of the documents. We combine various ideas from the large body of literature on approximate string searching and spelling correction techniques into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.

Journal ArticleDOI
TL;DR: A special device that can do string matching by performing n−m+1 text-to-pattern comparisons is described; it uses light and optical filters for performing computations.
Abstract: String matching is a very important problem in computer science. The problem consists in finding all the occurrences of a pattern P of length m in a text T of length n. We describe a special device which can do string matching by performing n−m+1 text-to-pattern comparisons. The proposed device uses light and optical filters for performing computations. Two physical implementations are proposed: one uses colored glass and the other uses polarizing filters. The strengths and weaknesses of each method are discussed in depth.

Proceedings ArticleDOI
01 Nov 2009
TL;DR: A hardware-efficient string matching architecture using the brute-force algorithm is proposed, and the process element that makes up the architecture is optimized by reducing the number of comparators.
Abstract: Due to the growing complexity of network environments, the need for packet payload inspection at the application layer has increased. String matching, which is critical to network intrusion detection systems, inspects packet payloads and detects malicious network attacks using a set of rules. Because string matching is a computationally intensive task, hardware-based string matching is required. In this paper, we propose a hardware-efficient string matching architecture based on the brute-force algorithm. The process element that makes up the proposed architecture is optimized by reducing the number of comparators. The performance of the proposed architecture is nearly equal to that of previous work, while the experimental results show that, for any process width, it reduces the comparator requirements compared with the previous work.

Book ChapterDOI
31 Mar 2009
TL;DR: This work presents several non-trivial applications of Matryoshka counters in string matching algorithms, improving their worst- or average-case time complexities.
Abstract: Many algorithms, e.g. in the field of string matching, are based on handling many counters, which can be performed in parallel, even on a sequential machine, using bit-parallelism. The recently presented technique of nested counters (Matryoshka counters) [1] is to handle small counters most of the time, and refer to larger counters periodically, when the small counters may get full, to prevent overflow. In this work, we present several non-trivial applications of Matryoshka counters in string matching algorithms, improving their worst- or average-case time complexities. The set of problems comprises (Δ,α)-matching, matching with k insertions, episode matching, and matching under Levenshtein distance.
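
For context, the flavor of bit-parallelism such counter techniques build on: a hedged Shift-And exact matcher in which one machine word tracks all pattern prefixes at once (nested counters pack many small counters into words in a similar spirit):

    def shift_and(text, pattern):
        """Shift-And: bit i of `state` is set iff pattern[:i+1] ends at the
        current text position; all prefixes update in O(1) word operations."""
        masks = {}
        for i, ch in enumerate(pattern):
            masks[ch] = masks.get(ch, 0) | (1 << i)
        goal, state, hits = 1 << (len(pattern) - 1), 0, []
        for pos, ch in enumerate(text):
            state = ((state << 1) | 1) & masks.get(ch, 0)
            if state & goal:
                hits.append(pos - len(pattern) + 1)
        return hits

    print(shift_and("abracadabra", "abra"))  # -> [0, 7]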

Journal ArticleDOI
TL;DR: The proposed algorithm builds a lexicon enriched with topic information in three steps: transcription of an audio stream into phone sequences with a speaker- and task-independent phone recogniser, automatic lexical acquisition based on approximate string matching, and hierarchical topic clustering of the lexical entries based on a knowledge-poor co-occurrence approach.

Proceedings ArticleDOI
08 Jul 2009
TL;DR: In trials on twenty-five instances of the closest string problem with alphabets ranging in size from 2 to 30, the algorithm that used the data-based representation of candidate strings consistently returned the best results, and its advantage increased with the sizes of the test instances' alphabets.
Abstract: Given a set of strings S of equal lengths over an alphabet σ, the closest string problem seeks a string over σ whose maximum Hamming distance to any of the given strings is as small as possible. A data-based coding of strings for evolutionary search represents candidate closest strings as sequences of indexes of the given strings. The string such a chromosome represents consists of the symbols in the corresponding positions of the indexed strings. A genetic algorithm using this coding was compared with two GAs that encoded candidate strings directly as strings over σ. In trials on twenty-five instances of the closest string problem with alphabets ranging in size from 2 to 30, the algorithm that used the data-based representation of candidate strings consistently returned the best results, and its advantage increased with the sizes of the test instances' alphabets.
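
The data-based coding is easy to state concretely; a hedged sketch of the decoding and fitness steps (the GA loop itself, selection, crossover, and mutation, is omitted):

    def decode(chromosome, strings):
        """Data-based coding: position p of the candidate takes the symbol
        at position p of the input string indexed by chromosome[p]."""
        return "".join(strings[idx][p] for p, idx in enumerate(chromosome))

    def fitness(chromosome, strings):
        """Objective to minimize: maximum Hamming distance to any input."""
        cand = decode(chromosome, strings)
        return max(sum(a != b for a, b in zip(cand, s)) for s in strings)

    strings = ["acgt", "acga", "tcga"]
    print(decode([1, 0, 0, 1], strings))   # -> 'acga'
    print(fitness([1, 0, 0, 1], strings))  # -> 1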