
Showing papers on "Edit distance published in 2009"


Journal ArticleDOI
TL;DR: A novel algorithm is introduced that computes an approximate, or suboptimal, edit distance in a substantially faster way, and it is empirically verified that the suboptimal distance remains sufficiently accurate for various pattern recognition applications.

654 citations


Journal ArticleDOI
01 Aug 2009
TL;DR: Three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time are introduced, and results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs.
Abstract: Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision, etc. Unfortunately, the problem of graph edit distance computation is NP-hard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms, AppFull and AppSub, are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of using our bounds in filtering and searching of graphs.
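The paper's bounding techniques are not reproduced here, but to illustrate what a cheap polynomial-time lower bound on graph edit distance can look like, the sketch below (a generic label-counting bound under unit costs, not the AppFull/AppSub machinery) counts how many node labels can never be matched between the two graphs.

```python
from collections import Counter

def ged_label_lower_bound(labels_g1, labels_g2):
    """Cheap lower bound on unit-cost graph edit distance from node labels alone.

    Every node of the larger graph that cannot be paired with an identically
    labeled node of the other graph forces at least one node substitution,
    insertion, or deletion.
    """
    c1, c2 = Counter(labels_g1), Counter(labels_g2)
    matched = sum((c1 & c2).values())          # size of the label multiset intersection
    return max(len(labels_g1), len(labels_g2)) - matched

# Example: two small graphs described only by their node labels.
print(ged_label_lower_bound(list("CCCO"), list("CCNO")))  # -> 1
```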

413 citations


Journal ArticleDOI
TL;DR: It is shown that the similarity provided by TWED is a potentially useful metric in time series retrieval applications since it could benefit from the triangular inequality property to speed up the retrieval process while tuning the parameters of the elastic measure.
Abstract: In a way similar to the string-to-string correction problem, we address discrete time series similarity in light of a time-series-to-time-series correction problem for which the similarity between two time series is measured as the minimum cost sequence of edit operations needed to transform one time series into another. To define the edit operations, we use the paradigm of a graphical editing process and end up with a dynamic programming algorithm that we call time warp edit distance (TWED). TWED is slightly different in form from dynamic time warping (DTW), longest common subsequence (LCSS), or edit distance with real penalty (ERP) algorithms. In particular, it highlights a parameter that controls a kind of stiffness of the elastic measure along the time axis. We show that the similarity provided by TWED is a potentially useful metric in time series retrieval applications since it could benefit from the triangular inequality property to speed up the retrieval process while tuning the parameters of the elastic measure. In that context, a lower bound is derived to link the matching of time series in downsampled representation spaces to the matching in the original space. The empirical quality of the TWED distance is evaluated on a simple classification task. Compared to edit distance, DTW, LCSS, and ERP, TWED has proved to be quite effective on the considered experimental task.
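For readers who want to experiment with the measure, the following sketch implements the TWED recursion as it is commonly stated, with `nu` as the stiffness parameter and `lam` as the constant deletion penalty; unit time stamps are assumed by default, and this is an illustration rather than the authors' reference implementation.

```python
def twed(a, b, ts_a=None, ts_b=None, nu=0.001, lam=1.0):
    """Time Warp Edit Distance between two 1-D series a and b (sketch).

    nu  -- stiffness: penalizes warping along the time axis
    lam -- constant penalty for delete operations
    """
    ts_a = ts_a or list(range(1, len(a) + 1))   # default: unit time stamps
    ts_b = ts_b or list(range(1, len(b) + 1))
    # Pad with a dummy 0-th sample so the recursion has a base case.
    a, b = [0.0] + list(a), [0.0] + list(b)
    ta, tb = [0.0] + list(ts_a), [0.0] + list(ts_b)

    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * m for _ in range(n)]
    D[0][0] = 0.0
    for i in range(1, n):
        for j in range(1, m):
            match = (D[i-1][j-1] + abs(a[i] - b[j]) + abs(a[i-1] - b[j-1])
                     + nu * (abs(ta[i] - tb[j]) + abs(ta[i-1] - tb[j-1])))
            del_a = D[i-1][j] + abs(a[i] - a[i-1]) + nu * (ta[i] - ta[i-1]) + lam
            del_b = D[i][j-1] + abs(b[j] - b[j-1]) + nu * (tb[j] - tb[j-1]) + lam
            D[i][j] = min(match, del_a, del_b)
    return D[n-1][m-1]

print(twed([1, 2, 3, 4], [1, 2, 2, 3, 4]))
```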

298 citations


Journal ArticleDOI
TL;DR: This article presents a worst-case O(n^3)-time algorithm for the problem when the two trees have size n, and proves the optimality of the algorithm among the family of decomposition strategy algorithms—which also includes the previous fastest algorithms—by tightening the known lower bound.
Abstract: The edit distance between two ordered rooted trees with vertex labels is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. In this article, we present a worst-case O(n^3)-time algorithm for the problem when the two trees have size n, improving the previous best O(n^3 log n)-time algorithm. Our result requires a novel adaptive strategy for deciding how a dynamic program divides into subproblems, together with a deeper understanding of the previous algorithms for the problem. We prove the optimality of our algorithm among the family of decomposition strategy algorithms—which also includes the previous fastest algorithms—by tightening the known lower bound of Ω(n^2 log^2 n) to Ω(n^3), matching our algorithm's running time. Furthermore, we obtain matching upper and lower bounds for decomposition strategy algorithms of Θ(nm^2 (1 + log(n/m))) when the two trees have sizes m and n with m ≤ n.
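The optimal decomposition-strategy algorithm is intricate; as a concrete statement of the underlying recurrence, here is a deliberately naive memoized forest edit distance with unit costs. It always decomposes on the rightmost roots, so it illustrates the operations (delete, insert, relabel) but does not achieve the bounds discussed above.

```python
from functools import lru_cache

# A tree is (label, (child, child, ...)); a forest is a tuple of trees.
def tree(label, *children):
    return (label, tuple(children))

def size(forest):
    return sum(1 + size(t[1]) for t in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests (naive, memoized)."""
    if not f1:
        return size(f2)                      # insert everything remaining
    if not f2:
        return size(f1)                      # delete everything remaining
    (l1, c1), rest1 = f1[-1], f1[:-1]        # rightmost tree of each forest
    (l2, c2), rest2 = f2[-1], f2[:-1]
    return min(
        forest_dist(rest1 + c1, f2) + 1,                 # delete the rightmost root of f1
        forest_dist(f1, rest2 + c2) + 1,                 # insert the rightmost root of f2
        forest_dist(c1, c2) + forest_dist(rest1, rest2)  # match the two roots
            + (0 if l1 == l2 else 1),                    # relabel if the labels differ
    )

t1 = tree("f", tree("a"), tree("b", tree("c")))
t2 = tree("f", tree("a"), tree("c"))
print(forest_dist((t1,), (t2,)))   # 1: delete the "b" node (its child "c" moves up)
```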

264 citations


Proceedings Article
01 Jan 2009
TL;DR: The method proposed in this paper outperforms contemporary approaches to trace clustering in process mining; the goodness of the formed clusters is evaluated using established fitness and comprehensibility metrics defined in the context of process mining.
Abstract: Process Mining refers to the extraction of process models from event logs. Real-life processes tend to be less structured and more flexible. Traditional process mining algorithms have problems dealing with such unstructured processes and generate spaghetti-like process models that are hard to comprehend. An approach to overcome this is to cluster process instances (a process instance is manifested as a trace and an event log corresponds to a multi-set of traces) such that each of the resulting clusters corresponds to a coherent set of process instances that can be adequately represented by a process model. In this paper, we propose a context-aware approach to trace clustering based on generic edit distance. It is well known that the generic edit distance framework is highly sensitive to the costs of edit operations. We define an automated approach to derive the costs of edit operations. The method proposed in this paper outperforms contemporary approaches to trace clustering in process mining. We evaluate the goodness of the formed clusters using established fitness and comprehensibility metrics defined in the context of process mining. The proposed approach is able to generate clusters such that the process models mined from the clustered traces show a high degree of fitness and comprehensibility when compared to contemporary approaches.

218 citations


Journal ArticleDOI
TL;DR: This work presents an efficient read mapping tool called RazerS, which allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance and guarantees not to lose more reads than specified.
Abstract: Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either losslessly or, at higher speed, with a user-defined loss rate. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. [RazerS is freely available at http://www.seqan.de/projects/razers.html.]

173 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper captures input typing errors via edit distance, shows that a naive approach of invoking an offline edit distance matching algorithm at each step performs poorly, and presents more efficient algorithms.
Abstract: Autocompletion is a useful feature when a user is doing a look up from a table of records. With every letter being typed, autocompletion displays strings that are present in the table containing as their prefix the search string typed so far. Just as there is a need for making the lookup operation tolerant to typing errors, we argue that autocompletion also needs to be error-tolerant. In this paper, we take a first step towards addressing this problem. We capture input typing errors via edit distance. We show that a naive approach of invoking an offline edit distance matching algorithm at each step performs poorly and present more efficient algorithms. Our empirical evaluation demonstrates the effectiveness of our algorithms.
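The paper's efficient algorithms are not reproduced here; as a baseline illustration of per-keystroke maintenance, the sketch below keeps one dynamic-programming row per dictionary string (row[j] is the edit distance from the typed query to the string's length-j prefix) and extends it with each character typed. The class name, threshold, and sample strings are illustrative.

```python
class FuzzyAutocomplete:
    """Per-keystroke error-tolerant prefix matching (baseline sketch).

    For each dictionary string s we keep row[j] = edit distance between the
    query typed so far and s[:j]; min(row) is then the cheapest way to turn
    the query into some prefix of s.  Each keystroke only appends one DP row
    instead of recomputing everything from scratch.
    """

    def __init__(self, strings, max_errors=1):
        self.max_errors = max_errors
        # Row for the empty query: turning "" into s[:j] costs j insertions.
        self.state = {s: list(range(len(s) + 1)) for s in strings}
        self.query = ""

    def type_char(self, ch):
        self.query += ch
        for s, prev in self.state.items():
            cur = [prev[0] + 1]                       # delete the new query char
            for j in range(1, len(s) + 1):
                cur.append(min(
                    prev[j] + 1,                      # delete query char
                    cur[j - 1] + 1,                   # insert s[j-1]
                    prev[j - 1] + (ch != s[j - 1]),   # match / substitute
                ))
            self.state[s] = cur
        return self.completions()

    def completions(self):
        return [s for s, row in self.state.items() if min(row) <= self.max_errors]

ac = FuzzyAutocomplete(["seattle", "seatac", "boston"])
for c in "zeat":                                      # "seat" with one typo
    print(c, ac.type_char(c))
```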

149 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper studies the problem of approximate dictionary matching with edit distance constraints and proposes an improved neighborhood generation method employing novel partitioning and prefix pruning techniques that outperforms alternative approaches by up to an order of magnitude.
Abstract: Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies using token-based similarity constraints, our problem definition enables us to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. Our problem is technically challenging as existing approaches based on q-gram filtering have poor performance due to the existence of many short entities in the dictionary. Our proposed solution is based on an improved neighborhood generation method employing novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.
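As background for the term "neighborhood generation" (the partitioning and prefix-pruning improvements of the paper are not shown), the naive variant for edit distance 1 enumerates every string reachable by one insertion, deletion, or substitution and probes the dictionary with exact lookups:

```python
import string

def neighborhood_1(word, alphabet=string.ascii_lowercase):
    """All strings within edit distance 1 of `word` (naive enumeration)."""
    out = {word}
    for i in range(len(word)):
        out.add(word[:i] + word[i+1:])                  # deletion
        for c in alphabet:
            out.add(word[:i] + c + word[i+1:])          # substitution
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])            # insertion
    return out

dictionary = {"aspirin", "ibuprofen", "warfarin"}       # toy entity dictionary
query_token = "asprin"                                  # typo found in the text
print(dictionary & neighborhood_1(query_token))         # {'aspirin'}
```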

118 citations


Patent
Hee-Jun Song, Young-Hee Park, Hyun Sik Shim, Ham Jong Gyu, Harksoo Kim, Jooho Lee, Se Hee Lee
07 Apr 2009
TL;DR: In this article, a spelling correction system and method automatically recognizes and corrects misspelled inputs in an electronic device with relatively lower computing power. In a learning process, a misspelling correction dictionary is constructed on the basis of a corpus of accepted words, and context-sensitive strings are selected from among all the strings registered in the dictionary. Context information about the context-sensitive strings is acquired.
Abstract: A spelling correction system and method automatically recognizes and corrects misspelled inputs in an electronic device with relatively lower computing power. In a learning process, a misspelling correction dictionary is constructed on the basis of a corpus of accepted words, and context-sensitive strings are selected from among all the strings registered in the dictionary. Context information about the context-sensitive strings is acquired. In an applying process, at least one target string is selected from among all the strings in a user's input sentence through the dictionary. If the target string is one of the context-sensitive strings, the target string is corrected by use of the context information.

102 citations


Patent
27 Jul 2009
TL;DR: A complete framework is disclosed for detecting unauthorized copying of videos on the Internet using the disclosed perceptual video signature, which can also be used to detect duplicate and near-duplicate videos.
Abstract: Methods and apparatus for detection and identification of duplicate or near-duplicate videos using a perceptual video signature are disclosed. The disclosed apparatus and methods (i) extract perceptual video features, (ii) identify unique and distinguishing perceptual features to generate a perceptual video signature, (iii) compute a perceptual video similarity measure based on the video edit distance, and (iv) search and detect duplicate and near-duplicate videos. A complete framework to detect unauthorized copying of videos on the Internet using the disclosed perceptual video signature is disclosed.

68 citations


Book
24 Nov 2009
TL;DR: The Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance), often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
Abstract: In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. A generalization of the Levenshtein distance (Damerau-Levenshtein distance) allows the transposition of two characters as an operation. Some Translation Environment Tools, such as translation memory leveraging applications, use the Levenshtein algorithm to measure the edit distance between two fuzzy matching content segments. The metric is named after Vladimir Levenshtein, who considered this distance in 1965. It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
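Since this entry defines the measure itself, a minimal dynamic-programming implementation may serve as a reference point; this is the textbook Wagner-Fischer formulation, not code taken from the book being described.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))            # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                  # delete ca
                cur[j - 1] + 1,               # insert cb
                prev[j - 1] + (ca != cb),     # match or substitute
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))       # 3
```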

Proceedings ArticleDOI
19 Oct 2009
TL;DR: A semi-definite program is formulated to encode the above two criteria to learn the distance metric, and it is shown that such an optimization problem can be efficiently solved with a closed-form solution.
Abstract: This paper proposes a novel semantic-aware distance metric for images by mining multimedia data on the Internet, in particular, web images and their associated tags. As is well known, a proper distance metric between images is a key ingredient in many realistic web image retrieval engines, as well as in many image understanding techniques. In this paper, we attempt to mine a novel distance metric from the web images by integrating their visual content as well as the associated user tags. Different from many existing distance metric learning algorithms, which utilize the dissimilarity or similarity information between image pixels or features at the signal level, the proposed scheme also takes the associated user-input tags into consideration. The visual content of images is also leveraged to respect the intuitive assumption that visually similar images ought to have a smaller distance. A semi-definite program is formulated to encode the above two criteria to learn the distance metric, and we show that such an optimization problem can be efficiently solved with a closed-form solution. We evaluate the proposed algorithm on two datasets. One is the benchmark Corel dataset and the other is a real-world dataset crawled from the image sharing website Flickr. By comparison with other existing distance learning algorithms, competitive results are obtained by the proposed algorithm in experiments.

Proceedings ArticleDOI
31 May 2009
TL;DR: This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^{1/3+o(1)} approximation.
Abstract: We show how to compute the edit distance between two strings of length n up to a factor of 2^{Õ(√(log n))} in n^{1+o(1)} time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^{1/3+o(1)} approximation. Previously, an approximation of 2^{O(√(log n))} was known only for embedding edit distance into l1, and it is not known whether that embedding can be computed in less than quadratic time.

Posted Content
TL;DR: In this article, a unified framework is presented for accelerating edit-distance computation between two compressible strings using straight-line programs, together with an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit distance of these two strings under any rational scoring function.
Abstract: We present a unified framework for accelerating edit-distance computation between two compressible strings using straight-line programs. For two strings of total length $N$ having straight-line program representations of total size $n$, we provide an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{1.34}N^{1.34})$ time algorithm for arbitrary scoring functions. This improves on a recent algorithm of Tiskin that runs in $O(nN^{1.5})$ time, and works only for rational scoring functions. Also, in the last part of the paper, we show how the classical four-russians technique can be incorporated into our SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.

Patent
17 Sep 2009
TL;DR: In this article, the similarity between error reports received by an error reporting service is determined by comparing frames included in a callstack of an error report to frames in other error reports to determine an edit distance between the callstacks.
Abstract: Techniques for determining similarity between error reports received by an error reporting service. An error report may be compared to other previously-received error reports to determine similarity and facilitate diagnosing and resolving an error that generated the error report. In some implementations, the similarity may be determined by comparing frames included in a callstack of an error report to frames included in callstacks in other error reports to determine an edit distance between the callstacks, which may be based on the number and type of frame differences between callstacks. Each type of change may be weighted differently when determining the edit distance. Additionally or alternatively, the comparison may be performed by comparing a type of error, process names, and/or exception codes for the errors contained in the error reports. The similarity may be expressed as a probability that two error reports were generated as a result of a same error.
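The abstract does not spell out the exact weighting scheme, so the sketch below only illustrates the idea: an edit distance over sequences of frame names in which insertions, deletions, and substitutions carry separately chosen (here hypothetical) weights.

```python
def callstack_distance(stack_a, stack_b, w_ins=1.0, w_del=1.0, w_sub=1.5):
    """Weighted edit distance between two callstacks (lists of frame names).

    The weights are illustrative; the point is that each type of frame
    difference can contribute a different amount to the distance.
    """
    n, m = len(stack_a), len(stack_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_del
    for j in range(1, m + 1):
        d[0][j] = j * w_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if stack_a[i-1] == stack_b[j-1] else w_sub
            d[i][j] = min(d[i-1][j] + w_del,
                          d[i][j-1] + w_ins,
                          d[i-1][j-1] + sub)
    return d[n][m]

crash_1 = ["ntdll!RtlRaiseException", "app!ParseConfig", "app!Main"]
crash_2 = ["ntdll!RtlRaiseException", "app!ParseConfigEx", "app!Main"]
print(callstack_distance(crash_1, crash_2))   # 1.5 (one substituted frame)
```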

Journal ArticleDOI
TL;DR: A new measure (CL) of spatial/structural landscape complexity is developed in this paper, based on the Levenshtein algorithm used in Computer Science and Bioinformatics for string comparisons, which may aid in landscape monitoring, management and planning, by identifying areas of higher structural landscape complexity.

Proceedings ArticleDOI
26 Feb 2009
TL;DR: The classical four-russians technique can be incorporated into the SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.
Abstract: We present a unified framework for accelerating edit-distance computation between two compressible strings using straight-line programs. For two strings of total length $N$ having straight-line program representations of total size $n$, we provide an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{1.34}N^{1.34})$ time algorithm for arbitrary scoring functions. This improves on a recent algorithm of Tiskin that runs in $O(nN^{1.5})$ time, and works only for rational scoring functions. Also, in the last part of the paper, we show how the classical four-russians technique can be incorporated into our SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.

Book ChapterDOI
17 Feb 2009
TL;DR: The experiments carried out with 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
Abstract: The Monge-Elkan method is a general text string comparison method based on an internal character-based similarity measure (e.g., edit distance) combined with a token-level (i.e., word-level) similarity measure. We propose a generalization of this method based on the notion of the generalized arithmetic mean instead of the simple average used in the expression to calculate the Monge-Elkan method. The experiments carried out with 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
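A compact sketch of the generalized measure follows; the exponent m generalizes the plain average (m = 1 recovers the original Monge-Elkan score), and difflib's ratio stands in for the internal character-based similarity purely as an assumed choice.

```python
from difflib import SequenceMatcher

def token_sim(a: str, b: str) -> float:
    """Internal character-level similarity; difflib's ratio is used here
    as a stand-in for an edit-distance-based similarity."""
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(tokens_a, tokens_b, m=2.0):
    """Generalized Monge-Elkan: the best match of each token of A in B is
    aggregated with a power (generalized arithmetic) mean of exponent m.
    m = 1 recovers the original Monge-Elkan average."""
    best = [max(token_sim(a, b) for b in tokens_b) for a in tokens_a]
    return (sum(s ** m for s in best) / len(best)) ** (1.0 / m)

a = "sergio gonzalez".split()
b = "serjio gonsalez perez".split()
print(monge_elkan(a, b, m=1))   # original Monge-Elkan
print(monge_elkan(a, b, m=2))   # generalized variant studied in the paper
```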

Journal ArticleDOI
TL;DR: This work proposes a certain methodology for preserving the privacy of various record linkage approaches; four pairs of privacy-preserving record linkage methods and protocols are implemented, examined and compared, and a blocking scheme is also presented as an extension to the privacy-preserving record linkage methodology.
Abstract: Privacy-preserving record linkage is a very important task, mostly because of the very sensitive nature of the personal data. The main focus in this task is to find a way to match records from among different organisation data sets or databases without revealing competitive or personal information to non-owners. Towards accomplishing this task, several methods and protocols have been proposed. In this work, we propose a certain methodology for preserving the privacy of various record linkage approaches and we implement, examine and compare four pairs of privacy-preserving record linkage methods and protocols. Two of these protocols use n-gram based similarity comparison techniques, the third protocol uses the well-known edit distance and the fourth one implements the Jaro-Winkler distance metric. All of the protocols used are enhanced by private key cryptography and hash encoding. This paper also presents a blocking scheme as an extension to the privacy-preserving record linkage methodology. Our comparison is backed up by an extended experimental evaluation that demonstrates the performance achieved by each of the proposed protocols.

Proceedings ArticleDOI
04 Aug 2009
TL;DR: This paper proposes an original method to estimate and optimize the operation costs in TED, applying the Particle Swarm Optimization algorithm, and shows the success of this method in automatic estimation, rather than manual assignment of edit costs.
Abstract: Recently, there is a growing interest in working with tree-structured data in different applications and domains such as computational biology and natural language processing. Moreover, many applications in computational linguistics require the computation of similarities over pair of syntactic or semantic trees. In this context, Tree Edit Distance (TED) has been widely used for many years. However, one of the main constraints of this method is to tune the cost of edit operations, which makes it difficult or sometimes very challenging in dealing with complex problems. In this paper, we propose an original method to estimate and optimize the operation costs in TED, applying the Particle Swarm Optimization algorithm. Our experiments on Recognizing Textual Entailment show the success of this method in automatic estimation, rather than manual assignment of edit costs.

01 Jan 2009
TL;DR: An efficient algorithm for similarity join with edit distance constraints is implemented; a new algorithm, Ed-Join, is proposed that exploits mismatch-based filtering methods and achieves a substantial reduction of the candidate sizes, and hence saves computation time.
Abstract: Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. In this project, we implement an efficient algorithm for similarity join with edit distance constraints. Current approaches mainly convert the edit distance constraint into a weaker constraint on the number of matching q-grams between a pair of strings. In our project, we exploit a novel perspective of investigating mismatching q-grams. We derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time.
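For context, the weaker q-gram constraint mentioned above is the classical count filter: one edit operation destroys at most q q-grams, so strings within edit distance tau must share a minimum number of q-grams. The sketch below shows that baseline filter, not Ed-Join's mismatch-based bounds.

```python
from collections import Counter

def qgrams(s, q=2):
    return Counter(s[i:i+q] for i in range(len(s) - q + 1))

def count_filter_passes(s, t, tau, q=2):
    """Classical q-gram count filter for edit-distance similarity joins.

    Strings within edit distance tau must share at least
        max(|s|, |t|) - q + 1 - q * tau
    q-grams (counting multiplicities); pairs failing this test cannot be
    within distance tau and are pruned without computing the edit distance.
    """
    common = sum((qgrams(s, q) & qgrams(t, q)).values())
    required = max(len(s), len(t)) - q + 1 - q * tau
    return common >= required

print(count_filter_passes("imyhealth", "imy health", tau=1))   # True: survives the filter
print(count_filter_passes("imyhealth", "microsoft", tau=1))    # False: pruned
```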

Proceedings ArticleDOI
04 Jan 2009
TL;DR: This work proposes a new approach of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like l1 and l∞, and shows that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms.
Abstract: A common approach for solving computational problems over a difficult metric space is to embed the "hard" metric into L1, which admits efficient algorithms and is thus considered an "easy" metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below a certain approximation for some metrics. We propose a new approach of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like l1 and l∞. We show that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain, for example, the first nearest neighbor data structure with O(log log d) approximation for edit distance in non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L1. Furthermore, we give constant factor approximations for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0--1 strings [Andoni and Krauthgamer, ICALP 2008].

Patent
10 Mar 2009
TL;DR: In this paper, an architecture is presented for extracting document information from documents received as search results for a query string and computing an edit distance between the extracted data string and the query string.
Abstract: Architecture for extracting document information from documents received as search results based on a query string, and computing an edit distance between the data string and the query string. The edit distance is employed in determining relevance of the document as part of result ranking by detecting near-matches of a whole query or part of the query. The edit distance evaluates how close the query string is to a given data stream that includes document information such as TAUC (title, anchor text, URL, clicks) information, etc. The architecture includes the index-time splitting of compound terms in the URL to allow the more effective discovery of query terms. Additionally, index-time filtering of anchor text is utilized to find the top N anchors of one or more of the document results. The TAUC information can be input to a neural network (e.g., 2-layer) to improve relevance metrics for ranking the search results.

Book ChapterDOI
04 Jun 2009
TL;DR: A distance for geometric graphs is proposed that is shown to be a metric and that can be computed by solving an integer linear program; experiments using a heuristic distance function are also presented.
Abstract: What does it mean for two geometric graphs to be similar? We propose a distance for geometric graphs that we show to be a metric, and that can be computed by solving an integer linear program. We also present experiments using a heuristic distance function.

Proceedings ArticleDOI
01 Jun 2009
TL;DR: This work proposes the use of Profile HMMs for word-related tasks, and test their applicability to the tasks of multiple cognate alignment and cognate set matching, and finds that they work well in general for both tasks.
Abstract: Profile hidden Markov models (Profile HMMs) are specific types of hidden Markov models used in biological sequence analysis. We propose the use of Profile HMMs for word-related tasks. We test their applicability to the tasks of multiple cognate alignment and cognate set matching, and find that they work well in general for both tasks. On the latter task, the Profile HMM method outperforms average and minimum edit distance. Given the success for these two tasks, we further discuss the potential applications of Profile HMMs to any task where consideration of a set of words is necessary.

Journal ArticleDOI
01 Aug 2009
TL;DR: A novel method, called Reference-Based String Alignment (RBSA), speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure, and significantly outperforms state-of-the-art biological sequence alignment methods.
Abstract: This paper introduces a novel method, called Reference-Based String Alignment (RBSA), that speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure. RBSA operates using the assumption that the optimal match deviates by a relatively small amount from the query, an amount that does not exceed a prespecified fraction of the query length. RBSA has an exact version that guarantees no false dismissals and can handle large queries efficiently. An approximate version of RBSA is also described, that achieves significant additional improvements over the exact version, with negligible losses in retrieval accuracy. RBSA performs filtering of candidate matches using precomputed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments of length equal to that of the reference sequences. For each of those segments, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate subsequence matches. An alphabet collapsing technique is employed to improve the pruning power of the filter step. In our experimental evaluation, RBSA significantly outperforms state-of-the-art biological sequence alignment methods, such as q-grams, BLAST, and BWT.

Book ChapterDOI
18 Jun 2009
TL;DR: This paper proposes the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance "1" (report all strings of the dictionary that are at edit distance at most "1" from the query string) in time linear in the length of the query string.
Abstract: In the approximate dictionary search problem we have to construct a data structure on a set of strings so that we can answer queries of the kind: find all strings of the set that are similar (according to some string distance) to a given string. In this paper we propose the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance "1" (report all strings of the dictionary that are at edit distance at most "1" from the query string) in time linear in the length of the query string. Based on our new dictionary we propose a full-text index for approximate queries with edit distance "1" (report all positions of all sub-strings of the text that are at edit distance at most "1" from the query string) answering a query in time linear in the length of the query string using space $O(n(\lg(n)\lg\lg(n))^2)$ in the worst case on a text of length n. Our index is the first index that answers queries in time linear in the length of the query string while using space O(n·poly(log(n))) in the worst case and for any alphabet size.

Book ChapterDOI
30 Jun 2009
TL;DR: The new version ELKI 0.2 is now extended to time series data and offers a selection of specialized distance measures; it can serve as a visualization and evaluation tool for the behavior of different distance measures on time series data.
Abstract: ELKI is a unified software framework, designed as a tool suitable for the evaluation of different algorithms on high-dimensional real-valued feature vectors. A special case of high-dimensional real-valued feature vectors is time series data, where traditional distance measures like Lp-distances can be applied. However, a broad range of specialized distance measures, e.g., dynamic time warping, as well as generalized distance measures like second-order distances, e.g., shared-nearest-neighbor distances, have also been proposed. The new version ELKI 0.2 is now extended to time series data and offers a selection of these distance measures. It can serve as a visualization and evaluation tool for the behavior of different distance measures on time series data.

Book ChapterDOI
29 Aug 2009
TL;DR: A merge-split edit distance is proposed which overcomes segmentation problems by incorporating a multi-purpose merge cost function; evaluation of the method on 19th-century historical document images exhibits extremely promising results.
Abstract: Edit distance matching has been used in the literature for word spotting, with characters taken as primitives. The recognition rate, however, is limited by the segmentation inconsistencies of characters (broken or merged) caused by noisy images or distorted characters. In this paper, we propose a merge-split edit distance which overcomes these segmentation problems by incorporating a multi-purpose merge cost function. The system is based on the extraction of words and characters in the text and then attributing each character with a set of features. Characters are matched by comparing their extracted feature sets using Dynamic Time Warping (DTW), while the words are matched by comparing the strings of characters using the proposed merge-split edit distance algorithm. Evaluation of the method on 19th-century historical document images exhibits extremely promising results.

Book ChapterDOI
29 Aug 2009
TL;DR: This paper presents a method to introduce temporal information within the bag-of-words (BoW) approach, modeled as a sequence composed of histograms of visual features, computed from each frame using the traditional BoW model.
Abstract: The recognition of events in videos is a relevant and challenging task of automatic semantic video analysis. At present one of the most successful frameworks, used for object recognition tasks, is the bag-of-words (BoW) approach. However this approach does not model the temporal information of the video stream. In this paper we present a method to introduce temporal information within the BoW approach. Events are modeled as a sequence composed of histograms of visual features, computed from each frame using the traditional BoW model. The sequences are treated as strings where each histogram is considered as a character. Event classification of these sequences of variable size, depending on the length of the video clip, are performed using SVM classifiers with a string kernel that uses the Needlemann-Wunsch edit distance. Experimental results, performed on two datasets, soccer video and TRECVID 2005, demonstrate the validity of the proposed approach.