Showing papers on "Edit distance published in 2013"

PDF

Open Access

Journal Article•DOI•

Accelerating read mapping with FastHASH.

[...]

Hongyi Xin¹, Donghyuk Lee¹, Farhad Hormozdiari², Samihan Yedkar¹, Onur Mutlu¹, Can Alkan³ - Show less +2 more•Institutions (3)

Carnegie Mellon University¹, University of California, Los Angeles², Bilkent University³

21 Jan 2013-BMC Genomics

TL;DR: A new algorithm, FastHASH, is proposed, which drastically improves the performance of the seed-and-extend type hash table based read mapping algorithms, while maintaining the high sensitivity and comprehensiveness of such methods.

...read moreread less

Abstract: With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of all medical and genetic applications of next-generation sequencing critically depends on the existence of computational techniques that can process and analyze the enormous amount of sequence data quickly and accurately. Unfortunately, the current read mapping algorithms have difficulties in coping with the massive amounts of data generated by NGS. We propose a new algorithm, FastHASH, which drastically improves the performance of the seed-and-extend type hash table based read mapping algorithms, while maintaining the high sensitivity and comprehensiveness of such methods. FastHASH is a generic algorithm compatible with all seed-and-extend class read mapping algorithms. It introduces two main techniques, namely Adjacency Filtering, and Cheap K-mer Selection. We implemented FastHASH and merged it into the codebase of the popular read mapping program, mrFAST. Depending on the edit distance cutoffs, we observed up to 19-fold speedup while still maintaining 100% sensitivity and high comprehensiveness.

...read moreread less

155 citations

Journal Article•DOI•

The Move-Split-Merge Metric for Time Series

[...]

Alexandra Stefan¹, Vassilis Athitsos¹, Gautam Das¹•Institutions (1)

University of Texas at Arlington¹

01 Jun 2013-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A novel metric for time series, called Move-Split-Merge (MSM), is proposed, which uses as building blocks three fundamental operations: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time series.

...read moreread less

Abstract: A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric uses as building blocks three fundamental operations: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time series. A Move operation changes the value of a single element, a Split operation converts a single element into two consecutive elements, and a Merge operation merges two consecutive elements into one. Each operation has an associated cost, and the MSM distance between two time series is defined to be the cost of the cheapest sequence of operations that transforms the first time series into the second one. An efficient, quadratic-time algorithm is provided for computing the MSM distance. MSM has the desirable properties of being metric, in contrast to the Dynamic Time Warping (DTW) distance, and invariant to the choice of origin, in contrast to the Edit Distance with Real Penalty (ERP) metric. At the same time, experiments with public time series data sets demonstrate that MSM is a meaningful distance measure, that oftentimes leads to lower nearest neighbor classification error rate compared to DTW and ERP.

...read moreread less

136 citations

Journal Article•DOI•

Structural entropy and metamorphic malware

[...]

Donabelle Baysa¹, Richard M. Low¹, Mark Stamp¹•Institutions (1)

San Jose State University¹

14 Apr 2013-Journal of Computer Virology and Hacking Techniques

TL;DR: Previous work on structural entropy to the metamorphic detection problem is applied and it is shown that this technique relies on an analysis of variations in the complexity of data within a file to obtain strong results in certain challenging cases.

...read moreread less

Abstract: Metamorphic malware is capable of changing its internal structure without altering its functionality. A common signature is nonexistent in highly metamorphic malware and, consequently, such malware can remain undetected under standard signature scanning. In this paper, we apply previous work on structural entropy to the metamorphic detection problem. This technique relies on an analysis of variations in the complexity of data within a file. The process consists of two stages, namely, file segmentation and sequence comparison. In the segmentation stage, we use entropy measurements and wavelet analysis to segment files. The second stage measures the similarity of file pairs by computing an edit distance between the sequences of segments obtained in the first stage. We apply this similarity measure to the metamorphic detection problem and show that we obtain strong results in certain challenging cases.

...read moreread less

128 citations

Proceedings Article•DOI•

Quantitative relaxation of concurrent data structures

[...]

Thomas A. Henzinger¹, Christoph M. Kirsch², Hannes Payer², Ali Sezgin¹, Ana Sokolova² - Show less +1 more•Institutions (2)

Institute of Science and Technology Austria¹, University of Salzburg²

23 Jan 2013

TL;DR: This work presents a systematic and formal framework for obtaining new data structures by quantitatively relaxing existing ones, and gives concurrent implementations of relaxed data structures and demonstrates that bounded relaxations provide the means for trading correctness for performance in a controlled way.

...read moreread less

Abstract: There is a trade-off between performance and correctness in implementing concurrent data structures. Better performance may be achieved at the expense of relaxing correctness, by redefining the semantics of data structures. We address such a redefinition of data structure semantics and present a systematic and formal framework for obtaining new data structures by quantitatively relaxing existing ones. We view a data structure as a sequential specification S containing all "legal" sequences over an alphabet of method calls. Relaxing the data structure corresponds to defining a distance from any sequence over the alphabet to the sequential specification: the k-relaxed sequential specification contains all sequences over the alphabet within distance k from the original specification. In contrast to other existing work, our relaxations are semantic (distance in terms of data structure states). As an instantiation of our framework, we present two simple yet generic relaxation schemes, called out-of-order and stuttering relaxation, along with several ways of computing distances. We show that the out-of-order relaxation, when further instantiated to stacks, queues, and priority queues, amounts to tolerating bounded out-of-order behavior, which cannot be captured by a purely syntactic relaxation (distance in terms of sequence manipulation, e.g. edit distance). We give concurrent implementations of relaxed data structures and demonstrate that bounded relaxations provide the means for trading correctness for performance in a controlled way. The relaxations are monotonic which further highlights the trade-off: increasing k increases the number of permitted sequences, which as we demonstrate can lead to better performance. Finally, since a relaxed stack or queue also implements a pool, we actually have new concurrent pool implementations that outperform the state-of-the-art ones.

...read moreread less

107 citations

Book Chapter•DOI•

A Novel Software Toolkit for Graph Edit Distance Computation.

[...]

Kaspar Riesen¹, Sandro Emmenegger¹, Horst Bunke²•Institutions (2)

University of Applied Sciences and Arts Northwestern Switzerland FHNW¹, University of Bern²

15 May 2013

TL;DR: The aim of the present paper is that the powerful and flexible algorithmic framework for graph edit distance computation can easily be adapted to specific problem domains via a versatile graphical user interface.

...read moreread less

Abstract: Graph edit distance is one of the most flexible mechanisms for error-tolerant graph matching. Its key advantage is that edit distance is applicable to unconstrained attributed graphs and can be tailored to a wide variety of applications by means of specific edit cost functions. The computational complexity of graph edit distance, however, is exponential in the number of nodes, which makes it feasible for small graphs only. In recent years the authors of the present paper introduced several powerful approximations for fast suboptimal graph edit distance computation. The contribution of the present work is a self standing software tool integrating these suboptimal graph matching algorithms. It is about being made publicly available. The idea of this software tool is that the powerful and flexible algorithmic framework for graph edit distance computation can easily be adapted to specific problem domains via a versatile graphical user interface. The aim of the present paper is twofold. First, it reviews the implemented approximation methods and second, it thoroughly describes the features and application of the novel graph matching software.

...read moreread less

82 citations

Proceedings Article•DOI•

[...]

Dong Deng¹, Guoliang Li¹, Jianhua Feng¹, Wen-Syan Li•Institutions (1)

Tsinghua University¹

08 Apr 2013

TL;DR: This paper proposes a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance, and develops a range-based method by grouping the pivotal entries to avoid duplicated computations.

...read moreread less

Abstract: String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.

...read moreread less

74 citations

Proceedings Article•

Automated grading of DFA constructions

[...]

Rajeev Alur¹, Loris D'Antoni¹, Sumit Gulwani², Dileep Kini³, Mahesh Viswanathan³ - Show less +1 more•Institutions (3)

University of Pennsylvania¹, Microsoft², University of Illinois at Urbana–Champaign³

03 Aug 2013

TL;DR: This paper provides a solution to automatic grading of the standard computation-theory problem that asks a student to construct a deterministic finite automaton (DFA) from the given description of its language, and provides algorithms for transforming MOSEL descriptions into DFAs and vice-versa.

...read moreread less

Abstract: One challenge in making online education more effective is to develop automatic grading software that can provide meaningful feedback. This paper provides a solution to automatic grading of the standard computation-theory problem that asks a student to construct a deterministic finite automaton (DFA) from the given description of its language. We focus on how to assign partial grades for incorrect answers. Each student's answer is compared to the correct DFA using a hybrid of three techniques devised to capture different classes of errors. First, in an attempt to catch syntactic mistakes, we compute the edit distance between the two DFA descriptions. Second, we consider the entropy of the symmetric difference of the languages of the two DFAs, and compute a score that estimates the fraction of the number of strings on which the student answer is wrong. Our third technique is aimed at capturing mistakes in reading of the problem description. For this purpose, we consider a description language MOSEL, which adds syntactic sugar to the classical Monadic Second Order Logic, and allows defining regular languages in a concise and natural way. We provide algorithms, along with optimizations, for transforming MOSEL descriptions into DFAs and vice-versa. These allow us to compute the syntactic edit distance of the incorrect answer from the correct one in terms of their logical representations. We report an experimental study that evaluates hundreds of answers submitted by (real) students by comparing grades/feedback computed by our tool with human graders. Our conclusion is that the tool is able to assign partial grades in a meaningful way, and should be preferred over the human graders for both scalability and consistency.

...read moreread less

66 citations

Book Chapter•DOI•

A Fast Matching Algorithm for Graph-Based Handwriting Recognition

[...]

Andreas Fischer¹, Ching Y. Suen¹, Volkmar Frinken², Kaspar Riesen³, Horst Bunke⁴ - Show less +1 more•Institutions (4)

Concordia University¹, Autonomous University of Barcelona², University of Applied Sciences and Arts Northwestern Switzerland FHNW³, University of Bern⁴

15 May 2013

TL;DR: This paper proposes a faster graph matching algorithm which is derived from the Hausdorff distance, and demonstrates that the proposed method achieves a speedup factor of 12.9 without significant loss in recognition accuracy.

...read moreread less

Abstract: The recognition of unconstrained handwriting images is usually based on vectorial representation and statistical classification. Despite their high representational power, graphs are rarely used in this field due to a lack of efficient graph-based recognition methods. Recently, graph similarity features have been proposed to bridge the gap between structural representation and statistical classification by means of vector space embedding. This approach has shown a high performance in terms of accuracy but had shortcomings in terms of computational speed. The time complexity of the Hungarian algorithm that is used to approximate the edit distance between two handwriting graphs is demanding for a real-world scenario. In this paper, we propose a faster graph matching algorithm which is derived from the Hausdorff distance. On the historical Parzival database it is demonstrated that the proposed method achieves a speedup factor of 12.9 without significant loss in recognition accuracy.

...read moreread less

61 citations

Proceedings Article•DOI•

[...]

Weiguo Zheng¹, Lei Zou¹, Xiang Lian², Dong Wang¹, Dongyan Zhao¹ - Show less +1 more•Institutions (2)

Peking University¹, University of Texas–Pan American²

27 Oct 2013

TL;DR: This paper derives a lower bound, branch-based bound, which can greatly reduce the search space of the graph similarity search, and proposes a tree index structure, namely b-tree, to facilitate effective pruning and efficient query processing.

...read moreread less

Abstract: Due to many real applications of graph databases, it has become increasingly important to retrieve graphs g (in graph database D) that approximately match with query graph q, rather than exact subgraph matches. In this paper, we study the problem of graph similarity search, which retrieves graphs that are similar to a given query graph under the constraint of the minimum edit distance. Specifically, we derive a lower bound, branch-based bound, which can greatly reduce the search space of the graph similarity search. We also propose a tree index structure, namely b-tree, to facilitate effective pruning and efficient query processing. Extensive experiments confirm that our proposed approach outperforms the existing approaches by orders of magnitude, in terms of both pruning power and query response time.

...read moreread less

60 citations

Proceedings Article•

Evaluating Text Segmentation using Boundary Edit Distance

[...]

Chris Fournier¹•Institutions (1)

University of Ottawa¹

01 Aug 2013

TL;DR: This work proposes a new segmentation evaluation metric, named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion-matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012).

...read moreread less

Abstract: This work proposes a new segmentation evaluation metric, named boundary similarity (B), an inter-coder agreement coefficient adaptation, and a confusion-matrix for segmentation that are all based upon an adaptation of the boundary edit distance in Fournier and Inkpen (2012). Existing segmentation metrics such as Pk, WindowDiff, and Segmentation Similarity (S) are all able to award partial credit for near misses between boundaries, but are biased towards segmentations containing few or tightly clustered boundaries. Despite S’s improvements, its normalization also produces cosmetically high values that overestimate agreement & performance, leading this work to propose a solution.

...read moreread less

55 citations

Journal Article•DOI•

A partition-based approach to structure similarity search

[...]

Xiang Zhao¹, Chuan Xiao², Xuemin Lin¹, Qing Liu³, Wenjie Zhang¹ - Show less +1 more•Institutions (3)

University of New South Wales¹, Nagoya University², Commonwealth Scientific and Industrial Research Organisation³

01 Nov 2013

TL;DR: A partition-based approach to tackle the graph similarity queries with edit distance constraints is presented, by dividing data graphs into variable-size non-overlapping partitions, and the edit distance constraint is converted to a graph containment constraint for candidate generation.

...read moreread less

Abstract: Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, pattern recognition, etc. A fundamental and critical query primitive is to efficiently search similar structures in a large collection of graphs. This paper studies the graph similarity queries with edit distance constraints. Existing solutions to the problem utilize fixed-size overlapping substructures to generate candidates, and thus become susceptible to large vertex degrees or large distance thresholds. In this paper, we present a partition-based approach to tackle the problem. By dividing data graphs into variable-size non-overlapping partitions, the edit distance constraint is converted to a graph containment constraint for candidate generation. We develop efficient query processing algorithms based on the new paradigm. A candidate pruning technique and an improved graph edit distance algorithm are also developed to further boost the performance. In addition, a cost-aware graph partitioning technique is devised to optimize the index. Extensive experiments demonstrate our approach significantly outperforms existing approaches.

...read moreread less

Journal Article•DOI•

Efficient processing of graph similarity queries with edit distance constraints

[...]

Xiang Zhao¹, Chuan Xiao², Xuemin Lin¹, Wei Wang¹, Yoshiharu Ishikawa² - Show less +1 more•Institutions (2)

University of New South Wales¹, Nagoya University²

01 Dec 2013

TL;DR: Efficient algorithms are proposed to handle three types of graph similarity queries by exploiting both matching and mismatching features as well as degree information to improve the filtering and verification on candidates.

...read moreread less

Abstract: Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources such as erroneous data entries and find similarity matches. In this paper, we study graph similarity queries with edit distance constraints. Inspired by the $$q$$ -gram idea for string similarity problems, our solution extracts paths from graphs as features for indexing. We establish a lower bound of common features to generate candidates. Efficient algorithms are proposed to handle three types of graph similarity queries by exploiting both matching and mismatching features as well as degree information to improve the filtering and verification on candidates. We demonstrate the proposed algorithms significantly outperform existing approaches with extensive experiments on real and synthetic datasets.

...read moreread less

Journal Article•DOI•

VChunkJoin: An Efficient Algorithm for Edit Similarity Joins

[...]

Wei Wang¹, Jianbin Qin¹, Chuan Xiao², Xuemin Lin¹, Heng Tao Shen³ - Show less +1 more•Institutions (3)

University of New South Wales¹, Nagoya University², University of Queensland³

01 Aug 2013-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings, and a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary are proposed.

...read moreread less

Abstract: Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.

...read moreread less

Journal Article•DOI•

RCSI: scalable similarity search in thousand(s) of genomes

[...]

Sebastian Wandelt¹, Johannes Starlinger¹, Marc Bux¹, Ulf Leser¹•Institutions (1)

Humboldt University of Berlin¹

01 Aug 2013

TL;DR: RCSI, Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer and presents a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that was never studied before at this scale.

...read moreread less

Abstract: Until recently, genomics has concentrated on comparing sequences between species. However, due to the sharply falling cost of sequencing technology, studies of populations of individuals of the same species are now feasible and promise advances in areas such as personalized medicine and treatment of genetic diseases. A core operation in such studies is read mapping, i.e., finding all parts of a set of genomes which are within edit distance k to a given query sequence (k-approximate search). To achieve sufficient speed, current algorithms solve this problem only for one to-be-searched genome and compute only approximate solutions, i.e., they miss some k- approximate occurrences.We present RCSI, Referentially Compressed Search Index, which scales to a thousand genomes and computes the exact answer. It exploits the fact that genomes of different individuals of the same species are highly similar by first compressing the to-be-searched genomes with respect to a reference genome. Given a query, RCSI then searches the reference and all genome-specific individual differences. We propose efficient data structures for representing compressed genomes and present algorithms for scalable compression and similarity search. We evaluate our algorithms on a set of 1092 human genomes, which amount to approx. 3 TB of raw data. RCSI compresses this set by a ratio of 450:1 (26:1 including the search index) and answers similarity queries on a mid-class server in 15 ms on average even for comparably large error thresholds, thereby significantly outperforming other methods. Furthermore, we present a fast and adaptive heuristic for choosing the best reference sequence for referential compression, a problem that was never studied before at this scale.

...read moreread less

Journal Article•DOI•

Efficient error-tolerant query autocompletion

[...]

Chuan Xiao¹, Jianbin Qin², Wei Wang², Yoshiharu Ishikawa¹, Koji Tsuda³, Kunihiko Sadakane - Show less +2 more•Institutions (3)

Nagoya University¹, University of New South Wales², National Institute of Advanced Industrial Science and Technology³

01 Apr 2013

TL;DR: A novel neighborhood generation-based algorithm, IncNGTrie, is proposed, which can achieve up to two orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem.

...read moreread less

Abstract: Query autocompletion is an important feature saving users many keystrokes from typing the entire query. In this paper we study the problem of query autocompletion that tolerates errors in users' input using edit distance constraints. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose edit distance from the query are within the threshold. The major inherent problem is that the number of such prefixes is huge for the first few characters of the query and is exponential in the alphabet size. This results in slow query response even if the entire query approximately matches only few prefixes.In this paper, we propose a novel neighborhood generation-based algorithm, IncNGTrie, which can achieve up to two orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem. Our proposed algorithm only maintains a small set of active nodes, thus saving both space and time to process the query. We also study efficient duplicate removal which is a core problem in fetching query answers. In addition, we propose optimization techniques to reduce our index size, as well as discussions on several extensions to our method. The efficiency of our method is demonstrated against existing methods through extensive experiments on real datasets.

...read moreread less

Proceedings Article•

LIPN-CORE: Semantic Text Similarity using n-grams, WordNet, Syntactic Analysis, ESA and Information Retrieval based Features

[...]

Davide Buscaldi, Joseph Le Roux, Jorge García Flores, Adrian Popescu

13 Jun 2013

TL;DR: This paper describes the system used by the LIPN team in the Semantic Textual Similarity task at SemEval 2013, which uses a support vector regression model, combining different text similarity measures that constitute the features.

...read moreread less

Abstract: This paper describes the system used by the LIPN team in the Semantic Textual Similarity task at SemEval 2013. It uses a support vector regression model, combining different text similarity measures that constitute the features. These measures include simple distances like Levenshtein edit distance, cosine, Named Entities overlap and more complex distances like Explicit Semantic Analysis, WordNet-based similarity, IR-based similarity, and a similarity measure based on syntactic dependencies.

...read moreread less

Book•

[...]

Nikolaus Augsten, Michael H. Bhlen

01 Nov 2013

TL;DR: This book describes the concepts and techniques to incorporate similarity into database systems, and describes prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low.

...read moreread less

Abstract: State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and prune expensive edit distance calculations. A key observation when computing similarity joins is that many of the object pairs, for which the similarity is computed, are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low. Table of Contents: Preface / Acknowledgments / Introduction / Data Types / Edit-Based Distances / Token-Based Distances / Query Processing Techniques / Filters for Token Equality Joins / Conclusion / Bibliography / Authors' Biographies / Index

...read moreread less

Journal Article•DOI•

A partition-based method for string similarity joins with edit-distance constraints

[...]

Guoliang Li¹, Dong Deng¹, Jianhua Feng¹•Institutions (1)

Tsinghua University¹

04 Jul 2013-ACM Transactions on Database Systems

TL;DR: This article Study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, and proposes a new filter, called the segment filter.

...read moreread less

Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidates. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.

...read moreread less

Journal Article•DOI•

Approximation and parameterized algorithms for common subtrees and edit distance between unordered trees

[...]

Tatsuya Akutsu¹, Daiji Fukagawa², Magnús M. Halldórsson³, Atsuhiro Takasu⁴, Keisuke Tanaka⁵ - Show less +1 more•Institutions (5)

Kyoto University¹, Doshisha University², Reykjavík University³, National Institute of Informatics⁴, Tokyo Institute of Technology⁵

01 Jan 2013-Theoretical Computer Science

TL;DR: A parameterized algorithm in terms of the number of branching nodes that solves both problems and yields polynomial algorithms for several special classes of trees and the first approximation algorithms for both problems are presented.

...read moreread less

Proceedings Article•DOI•

Automatic gazetteer enrichment with user-geocoded data

[...]

Judith Gelernter¹, Gautam Ganesh², Hamsini Krishnakumar³, Wei Zhang¹•Institutions (3)

Carnegie Mellon University¹, University of Texas at Dallas², Anna University³

05 Nov 2013

TL;DR: A fuzzy match algorithm using machine learning (SVM) that checks both for approximate spelling and approximate geocoding in order to find duplicates between the crowd-sourced tags and the gazetteer in effort to absorb those tags that are novel.

...read moreread less

Abstract: Geographical knowledge resources or gazetteers that are enriched with local information have the potential to add geographic precision to information retrieval. We have identified sources of novel local gazetteer entries in crowd-sourced OpenStreetMap and Wikimapia geotags that include geo-coordinates. We created a fuzzy match algorithm using machine learning (SVM) that checks both for approximate spelling and approximate geocoding in order to find duplicates between the crowd-sourced tags and the gazetteer in effort to absorb those tags that are novel. For each crowd-sourced tag, our algorithm generates candidate matches from the gazetteer and then ranks those candidates based on word form or geographical relations between each tag and gazetteer candidate. We compared a baseline of edit distance for candidate ranking to an SVM-trained candidate ranking model on a city level location tag match task. Experiment results show that the SVM greatly outperforms the baseline.

...read moreread less

Proceedings Article•DOI•

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

[...]

Michael Saks¹, C. Seshadhri²•Institutions (2)

Rutgers University¹, Sandia National Laboratories²

06 Jan 2013

TL;DR: An algorithm which, for any δ > 0, given streaming access to an array of length n provides a (1 + δ)-multiplicative approximation to the distance to monotonicity (n minus the length of the LIS), and uses only O((log2 n)/δ) space.

...read moreread less

Abstract: Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any δ > 0, given streaming access to an array of length n provides a (1 + δ)-multiplicative approximation to the distance to monotonicity (n minus the length of the LIS), and uses only O((log2n)/δ) space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. The improved approximation factor reflects a qualitative difference between our algorithm and previous algorithms: previous polylogarithmic space algorithms could not reliably detect increasing subsequences of length as large as n/2, while ours can detect increasing subsequences of length βn for any β > 0. More precisely, our algorithm can be used to estimate the length of the LIS to within an additive δn for any δ > 0 while previous algorithms could only achieve additive error n(1/2 -- o(1)).Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS.We also show how our technique can be applied to other problems solvable by dynamic programs. For example, we give a streaming algorithm for approximating LCS(x, y), the length of the longest common subsequence between strings x and y, each of length n. Our algorithm works in the asymmetric setting (inspired by [AKO10]), in which we have random access to y and streaming access to x, and runs in small space provided that no single symbol appears very often in y. More precisely, it gives an additive-δn approximation to LCS(x, y) (and hence also to E(x, y) = n -- LCS(x, y), the edit distance between x and y when insertions and deletions, but not substitutions, are allowed), with space complexity O(k(log2n)/δ), where k is the maximum number of times any one symbol appears in y.We also provide a deterministic 1-pass streaming algorithm that outputs a (1 + δ)-multiplicative approximation for E(x, y) (which is also an additive δn-approximation), in the asymmetric setting, and uses O(√n/δ log(n)) space. All these algorithms are obtained by carefully trading space and accuracy within a standard dynamic program.

...read moreread less

Edit transducers for spelling variation in Old Spanish

[...]

Jordi Porta¹, José-Luis Sancho¹, Javier Gómez¹•Institutions (1)

Real Academia Española¹

17 May 2013

TL;DR: A qualitative error analysis suggests several potential ways to improve the performance of the system, both in accuracy and in the trade-off between precision and recall, with respect to the baseline and the Levenshtein edit distance.

...read moreread less

Abstract: A system for the analysis of Old Spanish word forms using weighted finite-state transducers is presented. The system uses previously existing resources such as a modern lexicon, a phonological transcriber and a set of rules implementing the evolution of Spanish from the Middle Ages. The results obtained in all datasets show significant improvements, both in accuracy and in the trade-off between precision and recall, with respect to the baseline and the Levenshtein edit distance. A qualitative error analysis suggests several potential ways to improve the performance of the system.

...read moreread less

Journal Article•DOI•

MeshGit: diffing and merging meshes for polygonal modeling

[...]

Jonathan D. Denning¹, Fabio Pellacini²•Institutions (2)

Dartmouth College¹, Sapienza University of Rome²

21 Jul 2013

TL;DR: This paper presents MeshGit, a practical algorithm for diffing and merging polygonal meshes typically used in subdivision modeling workflows that translates the mesh correspondence into a set of mesh editing operations that transforms the first mesh into the second.

...read moreread less

Abstract: This paper presents MeshGit, a practical algorithm for diffing and merging polygonal meshes typically used in subdivision modeling workflows. Inspired by version control for text editing, we introduce the mesh edit distance as a measure of the dissimilarity between meshes. This distance is defined as the minimum cost of matching the vertices and faces of one mesh to those of another. We propose an iterative greedy algorithm to approximate the mesh edit distance, which scales well with model complexity, providing a practical solution to our problem. We translate the mesh correspondence into a set of mesh editing operations that transforms the first mesh into the second. The editing operations can be displayed directly to provide a meaningful visual difference between meshes. For merging, we compute the difference between two versions and their common ancestor, as sets of editing operations. We robustly detect conflicting operations, automatically apply non-conflicting edits, and allow the user to choose how to merge the conflicting edits. We evaluate MeshGit by diffing and merging a variety of meshes and find it to work well for all.

...read moreread less

Journal Article•DOI•

Efficient and effective KNN sequence search with approximate n-grams

[...]

Xiaoli Wang¹, Xiaofeng Ding², Anthony K. H. Tung¹, Zhenjie Zhang•Institutions (2)

National University of Singapore¹, Huazhong University of Science and Technology²

01 Sep 2013

TL;DR: This paper devise a pipeline framework over a two-level index for searching KNN in the sequence database using the edit distance and brings various enticing advantages over existing works, including huge reduction on false positive candidates to avoid large overheads on candidate verifications.

...read moreread less

Abstract: In this paper, we address the problem of finding k-nearest neighbors (KNN) in sequence databases using the edit distance. Unlike most existing works using short and exact n-gram matchings together with a filter-and-refine framework for KNN sequence search, our new approach allows us to use longer but approximate n-gram matchings as a basis of KNN candidates pruning. Based on this new idea, we devise a pipeline framework over a two-level index for searching KNN in the sequence database. By coupling this framework together with several efficient filtering strategies, i.e. the frequency queue and the well-known Combined Algorithm (CA), our proposal brings various enticing advantages over existing works, including 1) huge reduction on false positive candidates to avoid large overheads on candidate verifications; 2) progressive result update and early termination; and 3) good extensibility to parallel computation. We conduct extensive experiments on three real datasets to verify the superiority of the proposed framework.

...read moreread less

Proceedings Article•DOI•

Homomorphic fingerprints under misalignments: sketching edit and shift distances

[...]

Alexandr Andoni¹, Assaf Goldberger², Andrew McGregor³, Ely Porat⁴•Institutions (4)

Microsoft¹, Tel Aviv University², University of Massachusetts Amherst³, University of Michigan⁴

01 Jun 2013

TL;DR: This paper presents the first linear sketch that is robust to a small number of alignment errors and can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other.

...read moreread less

Abstract: Fingerprinting is a widely-used technique for efficiently verifying that two files are identical. More generally, linear sketching is a form of lossy compression (based on random projections) that also enables the "dissimilarity" of non-identical files to be estimated. Many sketches have been proposed for dissimilarity measures that decompose coordinate-wise such as the Hamming distance between alphanumeric strings, or the Euclidean distance between vectors. However, virtually nothing is known on sketches that would accommodate alignment errors. With such errors, Hamming or Euclidean distances are rendered useless: a small misalignment may result in a file that looks very dissimilar to the original file according such measures. In this paper, we present the first linear sketch that is robust to a small number of alignment errors. Specifically, the sketch can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other. Furthermore, the sketch is homomorphic with respect to rotations: it is possible to construct the sketch of a cyclic shift of a file given only the sketch of the original file. The relevant dissimilarity measure, known as the shift distance, arises in the context of embedding edit distance and our result addressed an open problem [Question 13 in Indyk-McGregor-Newman-Onak'11] with a rather surprising outcome. Our sketch projects a length $n$ file into D(n) ⋅ polylog n dimensions where D(n)l n is the number of divisors of n. The striking fact is that this is near-optimal, i.e., the D(n) dependence is inherent to a problem that is ostensibly about lossy compression.In contrast, we then show that any sketch for estimating the edit distance between two files, even when small, requires sketches whose size is nearly linear in n. This lower bound addresses a long-standing open problem on the low distortion embeddings of edit distance [Question 2.15 in Naor-Matousek'11, Indyk'01], for the case of linear embeddings.

...read moreread less

Proceedings Article•DOI•

Efficient top-k algorithms for approximate substring matching

[...]

Younghoon Kim¹, Kyuseok Shim¹•Institutions (1)

Seoul National University¹

22 Jun 2013

TL;DR: The efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings are proposed and the novel filtering techniques which take advantages of q-grams and invertedq-gram indexes available are utilized.

...read moreread less

Abstract: There is a wide range of applications that require to query a large database of texts to search for similar strings or substrings. Traditional approximate substring matching requests a user to specify a similarity threshold. Without top-k approximate substring matching, users have to try repeatedly different maximum distance threshold values when the proper threshold is unknown in advance.In our paper, we first propose the efficient algorithms for finding the top-k approximate substring matches with a given query string in a set of data strings. To reduce the number of expensive distance computations, the proposed algorithms utilize our novel filtering techniques which take advantages of q-grams and inverted q-gram indexes available. We conduct extensive experiments with real-life data sets. Our experimental results confirm the effectiveness and scalability of our proposed algorithms.

...read moreread less

Book Chapter•DOI•

Understanding Short Texts

[...]

Haixun Wang¹•Institutions (1)

Microsoft¹

04 Apr 2013

TL;DR: In this paper, the authors evaluate the semantic similarity between a search query and an ad and conclude that edit distance-based string similarity does not work, and statistical methods that find latent topic models from text also fall short because ads and search queries are insufficient to provide enough statistical signals.

...read moreread less

Abstract: Many applications handle short texts, and enableing machines to understand short texts is a big challenge. For example, in Ads selection, it is is difficult to evaluate the semantic similarity between a search query and an ad. Clearly, edit distance based string similarity does not work. Moreover, statistical methods that find latent topic models from text also fall short because ads and search queries are insufficient to provide enough statistical signals.

...read moreread less

Proceedings Article•DOI•

Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints

[...]

Yu Jiang¹, Dong Deng¹, Jiannan Wang¹, Guoliang Li¹, Jianhua Feng¹ - Show less +1 more•Institutions (1)

Tsinghua University¹

18 Mar 2013

TL;DR: This paper proposes parallel algorithms to support efficient similarity search and join with edit-distance constraints and adopts the partition-based framework and extends it to support parallel similaritySearch and join on multi-core processors and develops two novel pruning techniques.

...read moreread less

Abstract: The quantity of data in real-world applications is growing significantly while the data quality is still a big problem. Similarity search and similarity join are two important operations to address the poor data quality problem. Although many similarity search and join algorithms have been proposed, they did not utilize the abilities of modern hardware with multi-core processors. It calls for new parallel algorithms to enable multi-core processors to meet the high performance requirement of similarity search and join on big data. To this end, in this paper we propose parallel algorithms to support efficient similarity search and join with edit-distance constraints. We adopt the partition-based framework and extend it to support parallel similarity search and join on multi-core processors. We also develop two novel pruning techniques. We have implemented our algorithms and the experimental results on two real datasets show that our parallel algorithms achieve high performance and obtain good speedup.

...read moreread less

Journal Article•DOI•

Unified Compression-Based Acceleration of Edit-Distance Computation

[...]

Danny Hermelin¹, Gad M. Landau², Shir Landau³, Oren Weimann²•Institutions (3)

Max Planck Society¹, University of Haifa², Tel Aviv University³

01 Feb 2013-Algorithmica

TL;DR: This paper presents an algorithm running in O(nNlg(N/n) time for computing the edit-distance of these two strings under any rational scoring function, and an O( n2/3N4/3) time algorithm for arbitrary scoring functions.

...read moreread less

Abstract: The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N2) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N2) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nNlg(N/n)) time for computing the edit-distance of these two strings under any rational scoring function, and an O(n2/3N4/3) time algorithm for arbitrary scoring functions. Our new result, while providing a speed up for compressible strings, does not surpass the quadratic time bound even in the worst case scenario.

...read moreread less

Journal Article•DOI•

Handling missing values and unmatched features in a CBR system for hydro-generator design

[...]

Xiaolong Xie¹, Lin Lin¹, Shisheng Zhong¹•Institutions (1)

Harbin Institute of Technology¹

01 Jun 2013-Computer-aided Design

TL;DR: A similarity measurement is proposed by improving the edit distance which is widely used as a similarity measurement and the similarity function is defined based on the cost function.

...read moreread less

Abstract: Hydro-generator design is a complex problem and case based reasoning (CBR) can improve its efficiency, but there are missing values and unmatched features which decrease the accuracy of CBR In order to solve the problems brought by missing values and unmatched features, a similarity measurement is proposed by improving the edit distance which is widely used as a similarity measurement In the proposed CBR system, the case base is constructed based on domain ontology to improve the retrieval efficiency Then a case representation is proposed and cases are represented by a unified tree model Next, by combining the edit distance with feature weights and the semantic meanings of case nodes, the cost function is proposed to measure the semantic difference and the conditions which make it a metric are discussed Lastly, the similarity function is defined based on the cost function A case study is presented to illustrate the use of the proposed CBR system, and then the experiments are executed to evaluate its performance in dealing with missing values and unmatched features respectively The results validate that the proposed CBR system can handle missing values and unmatched features effectively

...read moreread less

Collapse