
Showing papers on "Approximate string matching published in 2001"


Journal ArticleDOI
TL;DR: This work surveys the current techniques to cope with the problem of string matching that allows errors, and focuses on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms.
Abstract: We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices. We conclude with some directions for future work and open problems.
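For orientation, here is a minimal sketch, not taken from the survey itself, of the classical O(mn) dynamic-programming search under edit distance that the surveyed algorithms set out to beat; function and variable names are illustrative.

```python
def approximate_search(pattern: str, text: str, k: int):
    """Report every end position in `text` where some substring matches
    `pattern` with edit distance at most k (classical O(mn) DP search)."""
    m = len(pattern)
    # col[i] = edit distance between pattern[:i] and the best-matching
    # suffix of the text read so far; col[0] stays 0 because a match may
    # start at any text position.
    col = list(range(m + 1))
    hits = []
    for j, c in enumerate(text):
        prev_diag = col[0]
        for i in range(1, m + 1):
            new = min(
                col[i] + 1,                                     # c is an extra text character
                col[i - 1] + 1,                                 # pattern[i-1] left unmatched
                prev_diag + (0 if pattern[i - 1] == c else 1),  # match / substitution
            )
            prev_diag, col[i] = col[i], new
        if col[m] <= k:
            hits.append(j)   # an approximate occurrence ends at text[j]
    return hits

# approximate_search("survey", "a short surveys of methods", 1)
# -> end positions of the approximate occurrences hiding in "surveys"
```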

2,723 citations


Proceedings Article
11 Sep 2001
TL;DR: In this article, the authors propose a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. The technique relies on matching short substrings of length q, called q-grams, and takes into account both the positions of individual matches and the total number of such matches.
Abstract: String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data, especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string joins directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on matching short substrings of length q, called q-grams, and taking into account both positions of individual matches and the total number of such matches. Our approach applies to both approximate full string matching and approximate substring matching, with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers. We demonstrate experimentally the benefits of our technique over the direct use of UDFs, using commercial database systems and real data. To study the I/O and CPU behavior of approximate string join algorithms with variations in edit distance and q-gram length, we also describe detailed experiments based on a prototype implementation.
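A sketch of the q-gram decomposition this style of filtering rests on; the sentinel characters and the (position, q-gram) layout are assumptions for illustration, and the paper's SQL machinery and filter bounds are not reproduced here.

```python
def positional_qgrams(s: str, q: int, pad_left: str = "#", pad_right: str = "$"):
    """Decompose a string into (position, q-gram) pairs.

    The string is padded with q-1 sentinel characters on each side so that
    every character participates in exactly q q-grams; the sentinels and
    the tuple layout here are illustrative assumptions, not the paper's
    exact table schema.
    """
    padded = pad_left * (q - 1) + s + pad_right * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

# positional_qgrams("smith", 3)
# -> [(0, '##s'), (1, '#sm'), (2, 'smi'), (3, 'mit'), (4, 'ith'), (5, 'th$'), (6, 'h$$')]
```

Stored in an auxiliary table, such tuples let an equi-join on the q-gram column, followed by grouping and counting, discard most non-matching pairs cheaply before an exact edit-distance verification.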

556 citations


Patent
Barry Lynn Fritchman
02 May 2001
TL;DR: In this paper, a method for matching a pattern string with a target string, where either string can contain single or multi-character wild cards, is described, which includes the steps of preprocessing the pattern string into a prefix, a suffix, and zero or more interior segments.
Abstract: The method of the present invention is useful in a computer system including at least one client. The program executes a method for matching a pattern string with a target string, where either string can contain single or multi-character wild cards. The method includes the steps of preprocessing the pattern string into a prefix segment, a suffix segment, and zero or more interior segments. The prefix segment, the suffix segment, and the interior segment(s) are then matched against the target string.
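For illustration, a minimal sketch of segment-based wildcard matching in the spirit of the description above, assuming '*' as the multi-character and '?' as the single-character wild card; it is a simplified stand-in, not the patented procedure.

```python
def wildcard_match(pattern: str, target: str) -> bool:
    """Match `pattern` against `target`, where '*' matches any run of
    characters and '?' matches exactly one character (assumed syntax)."""

    def seg_match(seg: str, s: str) -> bool:
        # Compare a '*'-free segment against an equal-length slice.
        return len(seg) == len(s) and all(p == '?' or p == c for p, c in zip(seg, s))

    segments = pattern.split('*')
    prefix, suffix = segments[0], segments[-1]
    if len(segments) == 1:                       # no '*': lengths must agree exactly
        return seg_match(pattern, target)
    if not seg_match(prefix, target[:len(prefix)]):
        return False
    if len(target) < len(prefix) + len(suffix):
        return False
    if suffix and not seg_match(suffix, target[len(target) - len(suffix):]):
        return False
    # Locate each interior segment, left to right, in the remaining window.
    pos, end = len(prefix), len(target) - len(suffix)
    for seg in segments[1:-1]:
        if not seg:
            continue
        found = next((i for i in range(pos, end - len(seg) + 1)
                      if seg_match(seg, target[i:i + len(seg)])), None)
        if found is None:
            return False
        pos = found + len(seg)
    return True

# wildcard_match("ab*c?e*f", "abXXcYeZZf") -> True
```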

217 citations



01 Jan 2001
TL;DR: It is shown that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams, provided that the arrangement of the gaps and a filter parameter called the threshold are optimized.
Abstract: A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results, the arrangement of the gaps in the q-gram and a filter parameter called the threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.
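To make the object under study concrete, a small sketch of extracting gapped q-grams for a given gap arrangement (a "shape"); the shape, names and output format are illustrative, and the threshold optimization that the paper contributes is not shown.

```python
def gapped_qgrams(s: str, shape):
    """Extract gapped q-grams of `s` for a given gap arrangement.

    `shape` lists the character offsets the q-gram samples, sorted and
    starting at 0; e.g. (0, 1, 3) takes two adjacent characters, skips one
    and takes a third.  Contiguous q-grams are the special case
    shape == (0, 1, ..., q-1).  The shape used here is only an example;
    choosing it well is exactly the optimization problem the paper studies.
    """
    span = shape[-1] + 1
    return [
        (i, "".join(s[i + off] for off in shape))
        for i in range(len(s) - span + 1)
    ]

# gapped_qgrams("approximate", (0, 1, 3))
# -> [(0, 'apr'), (1, 'ppo'), (2, 'prx'), (3, 'roi'), ...]
```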

159 citations


Journal ArticleDOI
TL;DR: Nrgrep is a new pattern-matching tool designed for efficient search of complex patterns, based on a single and uniform concept: the bit-parallel simulation of a non-deterministic suffix automaton. It can find anything from simple patterns to regular expressions, exactly or allowing errors in the matches.
Abstract: We present nrgrep (‘non-deterministic reverse grep’), a new pattern-matching tool designed for efficient search of complex patterns. Unlike previous tools of the grep family, such as agrep and Gnu grep, nrgrep is based on a single and uniform concept: the bit-parallel simulation of a non-deterministic suffix automaton. As a result, nrgrep can find from simple patterns to regular expressions, exactly or allowing errors in the matches, with an efficiency that degrades smoothly as the complexity of the searched pattern increases. Another concept that is fully integrated into nrgrep and that contributes to this smoothness is the selection of adequate subpatterns for fast scanning, which is also absent in many current tools. We show that the efficiency of nrgrep is similar to that of the fastest existing string-matching tools for the simplest patterns, and is by far unmatched for more complex patterns. Copyright © 2001 John Wiley & Sons, Ltd.
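The core concept can be seen in isolation in plain BNDM, the bit-parallel simulation of a nondeterministic suffix automaton for a single exact pattern; the sketch below is that baseline only, assuming patterns up to one machine word, and includes none of nrgrep's extensions to classes, errors or regular expressions.

```python
def bndm_search(pattern: str, text: str):
    """Exact single-pattern search with BNDM: a bit-parallel simulation of
    the nondeterministic suffix automaton of the reversed pattern."""
    m, n = len(pattern), len(text)
    assert 0 < m <= 64, "the real algorithm keeps the state in one machine word"
    mask = (1 << m) - 1
    # B[c]: bit (m-1-i) set iff pattern[i] == c, i.e. the pattern reversed.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    hits, pos = [], 0
    while pos <= n - m:
        j, last = m, m
        D = mask                           # every factor is still alive
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & (1 << (m - 1)):         # a pattern prefix ends the window suffix
                if j > 0:
                    last = j               # remember it for a safe window shift
                else:
                    hits.append(pos)       # the whole pattern matched
            D = (D << 1) & mask
        pos += last                        # skip; characters in between are never read
    return hits

# bndm_search("rose", "a rose is a rose") -> [2, 12]
```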

134 citations


Journal Article
TL;DR: This paper develops a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them by relying on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS.
Abstract: String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a challenge to implement this functionality efficiently with user-defined functions (UDFs). In this paper, we develop a technique for building approximate string processing capabilities on top of commercial databases by exploiting facilities already available in them. At the core, our technique relies on generating short substrings of length q, called q-grams, and processing them using standard methods available in the DBMS. The proposed technique enables various approximate string processing methods in a DBMS, for example approximate (sub)string selections and joins, and can even be used with a variety of possible edit distance functions. The approximate string match predicate, with a suitable edit distance threshold, can be mapped into a vanilla relational expression and optimized by conventional relational optimizers.

117 citations


Book ChapterDOI
01 Jul 2001
TL;DR: In this paper, the authors show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering for approximate string matching than contiguous substrings, and they also show that the arrangement of the gaps in the q-gram and a filter parameter called the threshold have to be optimized.
Abstract: A popular and well-studied class of filters for approximate string matching compares substrings of length q, the q-grams, in the pattern and the text to identify text areas that contain potential matches. A generalization of the method that uses gapped q-grams instead of contiguous substrings is mentioned a few times in the literature but has never been analyzed in any depth. In this paper, we report the first results of a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. To achieve these results, the arrangement of the gaps in the q-gram and a filter parameter called the threshold have to be optimized. Both of these tasks are nontrivial combinatorial optimization problems for which we present efficient solutions. We concentrate on the k mismatches problem, i.e., approximate string matching with the Hamming distance.

108 citations


Journal ArticleDOI
TL;DR: A spelling correction system designed specifically for OCR-generated text that selects candidate words through the use of information gathered from multiple knowledge sources is described, based on static and dynamic device mappings, approximate string matching, and n-gram analysis.
Abstract: In this paper, we describe a spelling correction system designed specifically for OCR-generated text that selects candidate words through the use of information gathered from multiple knowledge sources. This system for text correction is based on static and dynamic device mappings, approximate string matching, and n-gram analysis. Our statistically based, Bayesian system incorporates a learning feature that collects confusion information at the collection and document levels. An evaluation of the new system is presented as well.

97 citations


Patent
22 Jan 2001
TL;DR: In this paper, a character string of which a start point is each address of character string data in an input buffer is rearranged in a predetermined order, so that a rank list is generated.
Abstract: A character string of which a start point is each address of character string data in an input buffer is rearranged in the predetermined order, so that a rank list is generated. Next, the location of the matching candidate of a character string to be encoded is obtained on the basis of the rank list. Then, the character string to be encoded is compared with a matching candidate, thereby obtaining a matching length. Further, a code is generated using the location of the matching candidate and the matching length, and the code is output as compression data.

72 citations


Book ChapterDOI
19 Dec 2001
TL;DR: It is shown how to solve CLOSEST STRING in linear time for constant d (the exponential growth is O(d^d)), and this result is extended to the closely related problems d-MISMATCH and DISTINGUISHING STRING SELECTION.
Abstract: CLOSEST STRING is one of the core problems in the field of consensus word analysis, with particular importance for computational biology. Given k strings of the same length and a positive integer d, find a "closest string" s such that none of the given strings has Hamming distance greater than d from s. CLOSEST STRING is NP-complete. We show how to solve CLOSEST STRING in linear time for constant d (the exponential growth is O(d^d)). We extend this result to the closely related problems d-MISMATCH and DISTINGUISHING STRING SELECTION. Moreover, we discuss fixed-parameter tractability for parameter k and give an efficient linear-time algorithm for CLOSEST STRING when k = 3. Finally, the practical usefulness of our findings is substantiated by some experimental results.
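A compact recursive sketch in the spirit of the bounded branching behind such fixed-parameter results: if some input string is still too far from the current candidate, a valid center must agree with it on at least one of any d+1 mismatching positions. This is an illustrative implementation, not the paper's exact linear-time procedure.

```python
def closest_string(strings, d):
    """Find a center string within Hamming distance d of every string in
    `strings`, or return None if no such center exists.

    Bounded search tree: start from the first string and, while some input
    string is still more than d away, branch on d+1 of the mismatching
    positions.  At most d repairs are ever needed, so the tree stays small
    when d is small.
    """
    def mismatches(a, b):
        return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

    def search(center, budget):
        for s in strings:
            diff = mismatches(center, s)
            if len(diff) > d:
                if budget == 0:
                    return None
                # A valid center agrees with s on at least one of any d+1
                # mismatching positions, so these branches cover all cases.
                for p in diff[:d + 1]:
                    found = search(center[:p] + s[p] + center[p + 1:], budget - 1)
                    if found is not None:
                        return found
                return None
        return center      # every string is within distance d of `center`

    return search(strings[0], d)

# closest_string(["ACCT", "AGGT", "ACGT"], 1) -> "ACGT"
```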

Patent
Yaniv Shapira
21 Feb 2001
TL;DR: The strings to be searched for are divided into a plurality of two- and three-character substrings and stored in substring tables. A hash of each substring is calculated and stored in a hash table whose output is an index to a substring table, and a string is declared found if all the substrings making up the string have been received in correct consecutive order.
Abstract: An apparatus for and method of simultaneously searching an input character stream for the presence of multiple strings. The strings to be searched for are determined a priori, processed and stored in substring tables during a configuration phase. The strings to be searched for are divided into a plurality of two and three character substrings and stored in substring tables. A hash of each substring is calculated and stored in a hash table whose output is an index to a substring table. During searching, the content filter generates the hash of the input character stream and attempts to find a matching substring stored in the hash table. A string is declared found if all the substrings making up the string have been received in correct consecutive order.

Proceedings ArticleDOI
27 Mar 2001
TL;DR: This work presents a different approach to approximate string matching on compressed text, which reduces the problem to multipattern searching of pattern pieces plus local decompression and direct verification of candidate text areas, thus becoming the first practical solution to the problem.
Abstract: Approximate string matching on compressed text was an open problem for almost a decade. The two existing solutions are very new. Although they represent important complexity breakthroughs, in most practical cases they are not useful, in the sense that they are slower than uncompressing the text and then searching the uncompressed text. We present a different approach, which reduces the problem to multipattern searching of pattern pieces plus local decompression and direct verification of candidate text areas. We show experimentally that this solution is 10-30 times faster than previous work and up to three times faster than the trivial approach of uncompressing and searching, thus becoming the first practical solution to the problem.
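The reduction leans on a classical filtering fact: split the pattern into k+1 pieces and any occurrence with at most k differences contains at least one piece unchanged. A sketch of that candidate-generation step on plain (uncompressed) text follows; the multipattern search inside the compressed file and the verification step are omitted.

```python
def candidate_windows(pattern: str, text: str, k: int):
    """Filtering step for search with at most k differences: split the
    pattern into k+1 pieces; any approximate occurrence must contain at
    least one piece verbatim (pigeonhole), so exact hits of the pieces
    point at candidate windows.  Every window must still be checked with a
    standard approximate matcher (verification omitted here)."""
    m = len(pattern)
    assert m > k, "need at least one non-empty piece per allowed error"
    piece_len = m // (k + 1)
    pieces = [pattern[i * piece_len:(i + 1) * piece_len] for i in range(k)]
    pieces.append(pattern[k * piece_len:])      # last piece takes the remainder
    windows = set()
    for idx, piece in enumerate(pieces):
        offset = idx * piece_len                # piece position inside the pattern
        start = text.find(piece)
        while start != -1:
            lo = max(0, start - offset - k)                   # earliest possible start
            hi = min(len(text), start - offset + m + 2 * k)   # latest possible end
            windows.add((lo, hi))
            start = text.find(piece, start + 1)
    return sorted(windows)

# candidate_windows("algorithm", "a fast algoritm for matching", 1)
# -> [(6, 18)]   (a window around "algoritm", still to be verified)
```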

Journal ArticleDOI
TL;DR: A randomized algorithm in deterministic time O(N log M) for estimating the score vector of matches between a text string of length N and a pattern string of length M, i.e., the vector obtained when the pattern is slid along the text, and the number of matches is counted for each position.
Abstract: We give a randomized algorithm in deterministic time O(N log M) for estimating the score vector of matches between a text string of length N and a pattern string of length M, i.e., the vector obtained when the pattern is slid along the text, and the number of matches is counted for each position. A direct application is approximate string matching. The randomized algorithm uses convolution to find an estimator of the scores; the variance of the estimator is particularly small for scores that are close to M, i.e., for approximate occurrences of the pattern in the text. No assumption is made about the probabilistic characteristics of the input, or about the size of the alphabet. The solution extends to string matching with classes, class complements, "never match" and "always match" symbols, to the weighted case and to higher dimensions.
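For reference, the quantity being estimated is easy to compute exactly, one alphabet symbol at a time, with cross-correlation; the paper's point is a randomized estimator that avoids this per-symbol work. A sketch assuming numpy:

```python
import numpy as np

def match_score_vector(text: str, pattern: str) -> np.ndarray:
    """score[i] = number of positions j with text[i + j] == pattern[j],
    i.e. the match count for every alignment of the pattern in the text."""
    n, m = len(text), len(pattern)
    score = np.zeros(n - m + 1, dtype=np.int64)
    for symbol in set(pattern):
        t = np.fromiter((c == symbol for c in text), dtype=np.int64, count=n)
        p = np.fromiter((c == symbol for c in pattern), dtype=np.int64, count=m)
        # Cross-correlation: result[i] = sum_j t[i + j] * p[j]
        score += np.correlate(t, p, mode="valid")
    return score

# match_score_vector("abracadabra", "abra")
# -> array([4, 0, 1, 1, 1, 1, 0, 4])   # exact occurrences score m = 4
```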

Journal ArticleDOI
01 Oct 2001
TL;DR: Experimental results have shown that the proposed ideas of ACM, MM, fuzzy map integration, and fuzzy map matching are well suited to high-performing students and to difficult subject materials.
Abstract: A concept map, typically depicted as a connected graph, is composed of a collection of propositions. Each proposition forming a semantic unit consists of a small set of concept nodes interconnected to one another with relation links. Concept maps possess a number of appealing features which make them a promising tool for teaching, learning, evaluation, and curriculum planning. We extend concept maps by associating their concept nodes and relation links with attribute values which indicate the relative significance of concepts and relationships in knowledge representation. The resulting maps are called attributed concept maps (ACM). Students are assessed by matching their ACMs with those prebuilt by experts. The associated techniques are referred to as map matching techniques. The building of an expert ACM has in the past been done by only one specialist. We integrate a number of maps developed by separate experts into a single map, called the master map (MM), which will serve as a prototypical map in map matching. Both map integration and map matching are conceptualized in terms of fuzzy set discipline. Experimental results have shown that the proposed ideas of ACM, MM, fuzzy map integration, and fuzzy map matching are well suited to high-performing students and to difficult subject materials.

Journal ArticleDOI
Gad M. Landau, Michal Ziv-Ukelson
TL;DR: This paper describes an algorithm which is composed of an encoding stage and an alignment stage, and shows how to reduce the O(n?) alignment work, for each appearance of the common substring Y in a source string, to O-at the cost of O( n?) encoding work, which is executed only once.

Patent
26 Jul 2001
TL;DR: In this article, a pattern is partitioned into context and value components, and candidate matches for each of the components is identified by calculating an edit distance between that component and each potentially matching set (sub-string) of symbols within the string.
Abstract: A system and method for examining a string of symbols and identifying portions of the string which match a predetermined pattern using adaptively weighted, partitioned context edit distances. A pattern is partitioned into context and value components, and candidate matches for each of the components are identified by calculating an edit distance between that component and each potentially matching set (sub-string) of symbols within the string. One or more candidate matches having the lowest edit distances are selected as matches for the pattern. The weighting of each of the component matches may be adapted to optimize the pattern matching and, in one embodiment, the context components may be heavily weighted to obtain matches of a value for which the corresponding pattern is not well defined. In one embodiment, an edit distance matrix is evaluated for each of a prefix component, a value component and a suffix component of a pattern. The evaluation of the prefix matrix provides a basis for identifying indicators of the beginning of a value window, while the evaluation of the suffix matrix provides a basis for identifying the alignment of the end of the value window. The value within the value window can then be evaluated via the value matrix to determine a corresponding value match score.

Journal ArticleDOI
TL;DR: Different forms of approximate periodicity under a variety of distance functions are studied; polynomial-time algorithms are derived for two of the problems, while the third is shown to be NP-complete.

Book ChapterDOI
01 Jul 2001
TL;DR: A new notion of weak factor recognition is introduced as the foundation of new data structures and on-line string matching algorithms, together with a new automaton built on a string p = p1p2 ... pm that acts like an oracle on the set of factors pi ... pj.
Abstract: We introduce a new notion of weak factor recognition that is the foundation of new data structures and on-line string matching algorithms. We define a new automaton built on a string p = p1p2 ... pm that acts like an oracle on the set of factors pi ... pj. If a string is recognized by this automaton, it may be a factor of p. But, if it is rejected, it is surely not a factor. We call it factor oracle. More precisely, this automaton is acyclic, recognizes at least the factors of p, has m + 1 states and a linear number of transitions. We give a very simple sequential construction algorithm to build it. Using this automaton, we design an efficient experimental on-line string matching algorithm (we conjecture its optimality in regard to the experimental results) that is really simple to implement. We also extend the factor oracle to predict that a string could be a suffix (i.e. in the set pi ... pm) of p. We obtain the suffix oracle, which enables in some cases a tricky improvement of the previous string matching algorithm.
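A sketch of the sequential factor-oracle construction along the lines described here (state 0 initial, one state per character, plus a supply function); variable names are ours, and the derived string matching algorithm is not shown.

```python
def build_factor_oracle(p: str):
    """Build the factor oracle of p: an acyclic automaton with len(p) + 1
    states and a linear number of transitions that accepts at least every
    factor of p (weak factor recognition: rejection is conclusive,
    acceptance is not)."""
    m = len(p)
    trans = [dict() for _ in range(m + 1)]   # trans[state][char] -> state
    supply = [0] * (m + 1)                   # supply (suffix-link) function
    supply[0] = -1
    for i in range(1, m + 1):
        c = p[i - 1]
        trans[i - 1][c] = i                  # spine transition
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i                  # external transition
            k = supply[k]
        supply[i] = trans[k][c] if k > -1 else 0
    return trans

def oracle_accepts(trans, w: str) -> bool:
    """If False, w is certainly not a factor of p; if True, it may be."""
    state = 0
    for c in w:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True

oracle = build_factor_oracle("abbab")
# oracle_accepts(oracle, "bba") -> True  (a real factor)
# oracle_accepts(oracle, "abc") -> False (certainly not a factor)
```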

Patent
30 Jul 2001
TL;DR: In this article, a system and method for improving string matching in a noisy channel environment is described. The system identifies candidates within the textual file that may match the query string and analyzes the probability that the string candidate matches a user-defined string.
Abstract: Described is a system and method for improving string matching in a noisy channel environment. The invention provides a method for identifying string candidates and analyzing the probability that the string candidate matches a user-defined string. In one implementation, a find engine receives a query string, converts an image file into a textual file, and identifies each instance of the query string in the textual file. The find engine identifies candidates within the textual file that may match the query string. The find engine refers to a confusion table to help identify whether candidates that are near matches to the query string are actually matches to the query string but for a common recognition error. Candidates meeting a probability threshold are identified as matches to the query string. The invention further provides for analysis options including word heuristics, language models, and OCR confidences.

Journal ArticleDOI
TL;DR: This work shows an excellent example of a complex and theoretical analysis of algorithms used for design and for practical algorithm engineering, instead of the common practice of first designing an algorithm and then analyzing it.
Abstract: We study a recent algorithm for fast on-line approximate string matching. This is the problem of searching a pattern in a text allowing errors in the pattern or in the text. The algorithm is based on a very fast kernel which is able to search short patterns using a nondeterministic finite automaton, which is simulated using bit-parallelism. A number of techniques to extend this kernel for longer patterns are presented in that work. However, the techniques can be integrated in many ways and the optimal interplay among them is by no means obvious. The solution to this problem starts at a very low level, by obtaining basic probabilistic information about the problem which was not previously known, and ends integrating analytical results with empirical data to obtain the optimal heuristic. The conclusions obtained via analysis are experimentally confirmed. We also improve many of the techniques and obtain a combined heuristic which is faster than the original work. This work shows an excellent example of a complex and theoretical analysis of algorithms used for design and for practical algorithm engineering, instead of the common practice of first designing an algorithm and then analyzing it.

Journal ArticleDOI
TL;DR: The compressed suffix array is used, which compactly stores the suffix array at the cost of a theoretically small slowdown in access speed, and an approximate string matching algorithm is proposed which is suitable for the compressed suffix array.
Abstract: Because of the increase in the size of genome sequence databases, the importance of indexing the sequences for fast queries grows. Suffix trees and suffix arrays are used for simple queries. However, these are not suitable for complicated queries over huge amounts of sequence data because the indices are stored on disk, which has slow access speed. We propose storing the indices in memory in a compressed form. We use the compressed suffix array. It compactly stores the suffix array at the cost of a theoretically small slowdown in access speed. We experimentally show that the overhead of using the compressed suffix array is reasonable in practice. We also propose an approximate string matching algorithm which is suitable for the compressed suffix array. Furthermore, we have constructed the compressed suffix array of the whole human genome. Because its size is about 2 GB, a workstation can handle the search index for the whole data in main memory, which will accelerate the speed of solving various problems in genome informatics.
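For orientation, the uncompressed baseline the paper starts from: a plain suffix array with binary search, which answers exact queries but, at genome scale, is exactly the memory-hungry structure the compressed suffix array replaces. The construction and search below are naive illustrative versions.

```python
def build_suffix_array(s: str):
    """Naive construction (fine for illustration; genome-scale indexes use
    linear-time construction and, as in the paper, compression)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_occurrences(s: str, sa, query: str):
    """All start positions of `query` in `s`, by binary search over the
    suffix array."""
    q = len(query)

    def first_at_least(strictly_greater: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = s[sa[mid]:sa[mid] + q]
            if prefix < query or (strictly_greater and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = first_at_least(False), first_at_least(True)
    return sorted(sa[start:end])

genome_piece = "ACGTACGGACGT"
sa = build_suffix_array(genome_piece)
# find_occurrences(genome_piece, sa, "ACG") -> [0, 4, 8]
```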

Patent
Yoav Ossia
21 May 2001
TL;DR: In this paper, a computer implemented method and system for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings calculates preliminary compression results for every string relative to an initial reference string, and uses the preliminary compression result to find a better reference string without additional compression tests.
Abstract: A computer implemented method and system for selecting a string for serving as a reference string for a comparison scheme for compressing a set of strings calculates preliminary compression results for every string relative to an initial reference string, and uses the preliminary compression results to find a better reference string without additional compression tests. According to one embodiment, a histogram is calculated showing the number of occurrences of each compressed length for each string in the set plotted against the initial reference string and the better reference string has a length corresponding to an average compression length or center of gravity of the histogram.

Patent
10 Oct 2001
TL;DR: In this paper, a method and device for string matching HTTP headers is presented, which typically includes identifying a predefined string, identifying an unknown string to compare with the predefined strings, performing a bitwise exclusive OR operation on an ASCII binary representation of at least one segment of the unknown string, and identifying a case-insensitive string match based on the exclusive operation.
Abstract: A method and device for string matching HTTP headers. The method typically includes identifying a predefined string, identifying an unknown string to compare with the predefined string, performing a bitwise exclusive OR operation on an ASCII binary representation of at least one segment of the unknown string and an ASCII binary representation of at least one segment of the predefined string, and identifying a case-insensitive string match based on the exclusive OR operation. The method may further include performing a bitwise operation with a predefined flag to determine the case-insensitive segment match.
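The trick relies on the ASCII layout: corresponding upper- and lower-case letters differ only in bit 0x20, so XOR-ing two segments leaves zeros where they agree exactly and 0x20 where only the case differs. A byte-level sketch with a hypothetical helper, not the patented device:

```python
def ascii_ci_equal(a: bytes, b: bytes) -> bool:
    """Case-insensitive comparison of two ASCII segments via XOR.

    In ASCII, corresponding upper- and lower-case letters differ only in
    bit 0x20 ('A' ^ 'a' == 0x20), so a nonzero XOR is acceptable only when
    it equals 0x20 and the byte really is a letter."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        diff = x ^ y
        if diff == 0:
            continue
        if diff == 0x20 and ord("a") <= (x | 0x20) <= ord("z"):
            continue                    # same letter, different case
        return False
    return True

# ascii_ci_equal(b"Content-Length", b"content-length") -> True
# ascii_ci_equal(b"Content-Length", b"Content+Length") -> False
```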

Patent
05 Dec 2001
TL;DR: In this article, an iterative search technique is used to quickly and accurately locate information in a database, such as one storing information about digital versatile discs (DVDs), where a presumably unique search key is generated for an unidentified DVD and compared with corresponding keys in the database.
Abstract: An iterative search technique is used to quickly and accurately locate information in a database, such as one storing information about digital versatile discs (DVDs). First, a presumably unique search key is generated for an unidentified DVD and compared with corresponding keys in a database. If no match is found, progressively less specific information is used to generate a series of search keys that are similarly compared with corresponding keys in the database. If at least one possibly matching record is found, it is determined whether the best matching record can be considered a match; otherwise, less specific information is used to search for a match until predefined least specific information is used.

Patent
19 Jan 2001
TL;DR: In this paper, a method for manipulation, storage, modeling, visualization, and quantification of datasets which correspond to target strings is described, which is used to generate comparison strings corresponding to some set of points that can serve as the domain of an iterative function.
Abstract: There is described a method for manipulation, storage, modeling, visualization, and quantification of datasets, which correspond to target strings. An iterative algorithm is used to generate comparison strings corresponding to some set of points that can serve as the domain of an iterative function. Preferably, these points are located in the complex plane, such as in and/or near the Mandelbrot set or a Julia set. The comparison string is scored by evaluating a function having the comparison string and one of the plurality of target strings as inputs. The evaluation may be repeated for a number of the other target strings. The score or some other property corresponding to the comparison string is used to determine the target string's placement on a map. The points are analyzed and/or compared by examining, either visually or mathematically, their relative locations, their absolute locations within the region, and/or metrics other than location.

Proceedings Article
01 Jan 2001
TL;DR: δ-approximate and (δ, γ)-approximate matching are two new notions of approximate matching that arise naturally in applications of computer-assisted music analysis; fast, efficient and practical algorithms are presented for both notions.
Abstract: Here we consider computational problems on δ-approximate and (δ, γ)-approximate string matching. These are two new notions of approximate matching that arise naturally in applications of computer-assisted music analysis. We present fast, efficient and practical algorithms for these two notions of approximate string matching.
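In this setting strings are sequences of numbers (e.g. pitch values): a pattern δ-matches a window when every aligned pair differs by at most δ, and (δ, γ)-matches when the differences additionally sum to at most γ. A naive reference implementation of the two notions (the paper's fast algorithms are not reproduced):

```python
def delta_gamma_occurrences(text, pattern, delta, gamma=None):
    """Positions i where pattern (delta, gamma)-matches text[i : i + m].

    delta bounds every per-position difference |text[i+j] - pattern[j]|;
    gamma, if given, additionally bounds the sum of those differences.
    Both text and pattern are sequences of integers (e.g. pitch values)."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        diffs = [abs(text[i + j] - pattern[j]) for j in range(m)]
        if max(diffs) <= delta and (gamma is None or sum(diffs) <= gamma):
            hits.append(i)
    return hits

melody = [60, 62, 64, 65, 67, 69]        # a scale fragment (MIDI-style pitches)
motif  = [63, 65, 66]
# delta_gamma_occurrences(melody, motif, delta=1)           -> [1, 2]
# delta_gamma_occurrences(melody, motif, delta=1, gamma=2)  -> [2]
```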

Proceedings ArticleDOI
01 May 2001
TL;DR: A technique for two-dimensional substring indexing based on a reduction to the geometric problem of identifying common colors in two ranges containing colored points is presented and can be practically realized using a combination of string B-trees and R-trees.
Abstract: As databases have expanded in scope to storing string data (XML documents, product catalogs), it has become increasingly important to search databases based on matching substrings, often on multiple, correlated dimensions. While string B-trees are I/O optimal in one dimension, no index structure with non-trivial query bounds is known for two-dimensional substring indexing. In this paper, we present a technique for two-dimensional substring indexing based on a reduction to the geometric problem of identifying common colors in two ranges containing colored points. We develop an I/O efficient algorithm for solving the common colors problem, and use it to obtain an I/O efficient (poly-logarithmic query time) algorithm for the two-dimensional substring indexing problem. Our techniques result in a family of secondary memory index structures that trade space for time, with no loss of accuracy. We show how our technique can be practically realized using a combination of string B-trees and R-trees.

Book ChapterDOI
TL;DR: This paper presents an elegant and very easy to implement bit-vector algorithm for answering the following incremental version of the approximate string matching problem: given an appropriate encoding of a comparison between A and bB, can one compute the answer for A and B with equal efficiency?
Abstract: The approximate string matching problem is to find all locations at which a pattern of length m matches a substring of a text of length n with at most k differences. The program agrep implements a simple and practical bit-vector algorithm for this problem. In this paper we consider the following incremental version of the problem: given an appropriate encoding of a comparison between A and bB, can one compute the answer for A and B, and the answer for A and Bc with equal efficiency, where b and c are additional symbols? Here we present an elegant and very easy to implement bit-vector algorithm for answering these questions that requires only O(n⌈m/w⌉) time, where n is the length of A, m is the length of B and w is the number of bits in a machine word. We also present an O(nm⌈h/w⌉) algorithm for the fixed-length approximate string matching problem: given a text t, a pattern p and an integer h, compute the optimal alignment of all substrings of p of length h and a substring of t.
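For background, the non-incremental bit-vector recurrence behind agrep-style approximate matching, in Shift-And form with k+1 masks R_0..R_k, where bit j of R_i records that pattern[:j+1] matches a suffix of the text read so far with at most i differences. The sketch shows that baseline, not the incremental algorithm contributed here:

```python
def bitparallel_k_differences(pattern: str, text: str, k: int):
    """agrep-style search: report every position in `text` where an
    occurrence of `pattern` with at most k differences (insertions,
    deletions, substitutions) ends."""
    m = len(pattern)
    assert m <= 64, "a real implementation keeps each R_i in one machine word"
    mask = (1 << m) - 1
    accept = 1 << (m - 1)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    # R[i]: bit j set means pattern[:j+1] matches a suffix of the text read
    # so far with at most i differences; initially the first i pattern
    # characters may simply be deleted.
    R = [(1 << i) - 1 for i in range(k + 1)]
    hits = []
    for pos, c in enumerate(text):
        Bc = B.get(c, 0)
        prev = R[0]                                  # old value of R[i-1]
        R[0] = ((R[0] << 1) | 1) & Bc
        for i in range(1, k + 1):
            old = R[i]
            R[i] = ((((old << 1) | 1) & Bc)          # match current character
                    | (prev << 1) | 1                # substitution
                    | prev                           # insertion in the text
                    | (R[i - 1] << 1)) & mask        # deletion from the pattern
            prev = old
        if R[k] & accept:
            hits.append(pos)                         # occurrence ends at text[pos]
    return hits

# bitparallel_k_differences("survey", "surgery", 2)
# -> end positions of substrings of "surgery" within edit distance 2 of "survey"
```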

Patent
19 Dec 2001
TL;DR: In this article, a look-up table is used to address a chained array or list of previously matching character strings, and the array is updated if there is another matching character string found when compressing the input string.
Abstract: A system and method for compressing and decompressing data in real time begins by taking a character string from an input string (12), generating a hash value (28) of the character string (16) which is utilized in a look-up table (18) to address a chained array or list (20) of previously matching character strings. The array is updated (34) if there is another matching character string found when compressing the input string. A token generator (36) writes a code (102, 103, 105) to the output string (14) indicating whether or not there has been a match. The token generator (36) generates an indication of the length of the character string not compressed, the one or more characters not compressed, the length of a matching character string, and the number of characters processed since the last match. These values generated by the token generator are optimally represented based upon preselected criteria.