
Showing papers on "Edit distance published in 2016"


Journal ArticleDOI
TL;DR: AP-TED (All Path Tree Edit Distance) is presented, a new, memory-efficient algorithm for the tree edit distance that runs at least as fast as RTED without trading in memory efficiency, together with new single-path functions that are better in terms of runtime and memory than the previously used functions.

136 citations


Proceedings ArticleDOI
19 Jun 2016
TL;DR: This work reduces SAT on Branching Programs to fundamental problems in P like Edit Distance, LCS, and many others, and shows that if an arbitrarily large polylog factor is shaved from n^2 for Edit Distance then NEXP does not have non-uniform NC^1 circuits.
Abstract: A recent, active line of work achieves tight lower bounds for fundamental problems under the Strong Exponential Time Hypothesis (SETH). A celebrated result of Backurs and Indyk (STOC'15) proves that computing the Edit Distance of two sequences of length n in truly subquadratic O(n^(2-ε)) time, for some ε > 0, is impossible under SETH. The result was extended by follow-up works to simpler-looking problems like finding the Longest Common Subsequence (LCS). SETH is a very strong assumption, asserting that even linear-size CNF formulas cannot be analyzed for satisfiability with an exponential speedup over exhaustive search. We consider much safer assumptions, e.g. that such a speedup is impossible for SAT on more expressive representations, like subexponential-size NC circuits. Intuitively, this assumption is much more plausible: NC circuits can implement linear algebra and complex cryptographic primitives, while CNFs cannot even approximately compute an XOR of bits. Our main result is a surprising reduction from SAT on Branching Programs to fundamental problems in P like Edit Distance, LCS, and many others. Truly subquadratic algorithms for these problems therefore have far more remarkable consequences than merely faster CNF-SAT algorithms. For example, SAT on arbitrary o(n)-depth bounded fan-in circuits (and therefore also NC-Circuit-SAT) can be solved in (2-ε)^n time. An interesting feature of our work is that we get major consequences even from mildly subquadratic algorithms for Edit Distance or LCS. For example, we show that if an arbitrarily large polylog factor is shaved from n^2 for Edit Distance then NEXP does not have non-uniform NC^1 circuits.

80 citations


Proceedings ArticleDOI
19 Jun 2016
TL;DR: A randomized injective embedding of the edit distance into the Hamming distance is studied, and a randomized embedding with quadratic distortion is shown.
Abstract: The Hamming and the edit metrics are two common notions of measuring distances between pairs of strings x, y lying in the Boolean hypercube. The edit distance between x and y is defined as the minimum number of character insertions, deletions, and bit flips needed to convert x into y, whereas the Hamming distance between x and y is the number of bit flips needed to convert x into y. In this paper we study a randomized injective embedding of the edit distance into the Hamming distance with a small distortion. We show a randomized embedding with quadratic distortion. Namely, for any x, y whose edit distance equals k, the Hamming distance between the embeddings of x and y is O(k^2) with high probability. This improves over the distortion ratio of O(log n · log* n) obtained by Jowhari (2012) for small values of k. Moreover, the embedding output size is linear in the input size and the embedding can be computed using a single pass over the input. We provide several applications for this embedding. Among our results we provide a one-pass (streaming) algorithm for edit distance running in space O(s) and computing edit distance exactly up to distance s^(1/6). This algorithm is based on a kernelization for edit distance that is of independent interest.
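The random-walk idea behind such an embedding can be sketched compactly. The Python snippet below is a simplified illustration only (the expansion factor of 3, the '#' padding symbol, and the per-step coin derivation are assumptions of this sketch, not the authors' exact construction): each output position emits the current input symbol, and the input pointer advances only when a shared pseudo-random coin for that (step, symbol) pair comes up 1, so strings at small edit distance tend to disagree in few output positions.

```python
import random

def random_walk_embed(x, length_bound, seed, expansion=3):
    """Embed a string into a fixed-length string so that a small edit distance
    tends to become a small Hamming distance (simplified sketch only)."""
    out, j = [], 0
    for i in range(expansion * length_bound):
        c = x[j] if j < len(x) else '#'                # '#' pads past the end
        out.append(c)
        # shared pseudo-random coin for this (step, symbol) pair
        if j < len(x) and random.Random(f"{seed}:{i}:{c}").random() < 0.5:
            j += 1
    return ''.join(out)

def hamming(a, b):
    return sum(ca != cb for ca, cb in zip(a, b))

x, y = "the quick brown fox", "the quick brwn fox"     # edit distance 1
ex = random_walk_embed(x, 32, seed=7)
ey = random_walk_embed(y, 32, seed=7)
print(len(ex), hamming(ex, ey))        # small Hamming distance with high probability
```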

80 citations


Journal ArticleDOI
TL;DR: The results show that the idea of using structural string representations and distances on top of pretrained CNN features clearly improves the classification performance over standard approaches based on CNNs and SVMs with a linear kernel, as well as over other recognized methods from the literature.

54 citations


Journal ArticleDOI
TL;DR: This paper introduces a new distance measure based on q-grams, and shows how it can be applied effectively and computed efficiently for circular sequence comparison, and demonstrates orders-of-magnitude superiority of the approach in terms of efficiency.
Abstract: Background: Sequence comparison is a fundamental step in many important tasks in bioinformatics; from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular molecular structure is a common phenomenon in nature, a caveat of the adaptation of alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. Results: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining an accuracy very competitive to the state of the art.
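As a rough illustration of the q-gram idea (not the paper's actual measure, whose definition and algorithm are more refined), the sketch below compares two circular sequences via the L1 distance between their cyclic q-gram profiles; the profile is rotation-invariant by construction and computable in linear time. The parameter q = 3 is an arbitrary choice for the example.

```python
from collections import Counter

def circular_qgram_profile(s, q):
    """q-gram occurrence profile of a circular sequence: the sequence is read
    cyclically, so every rotation of s yields the same profile."""
    ext = s + s[:q - 1]                       # wrap around for cyclic q-grams
    return Counter(ext[i:i + q] for i in range(len(s)))

def circular_qgram_distance(a, b, q=3):
    """L1 distance between the circular q-gram profiles of a and b."""
    pa, pb = circular_qgram_profile(a, q), circular_qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in set(pa) | set(pb))

# a sequence and one of its rotations are at distance 0
print(circular_qgram_distance("ACGTACGGT", "CGGTACGTA"))   # 0
```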

50 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: To enable graph edit similarity computation on larger and distant graphs, CSI_GED is presented, a novel edge-based mapping method for computing graph edit distance through the enumeration of common sub-structure isomorphisms; it outperforms the state-of-the-art indexing-based methods by over two orders of magnitude.
Abstract: Graph similarity is a basic and essential operation in many applications. In this paper, we are interested in computing graph similarity based on edit distance. Existing graph edit distance computation methods adopt the best-first search paradigm A*. These methods are bound by available time and space; in practice, they can compute the edit distance of graphs containing at most 12 vertices. To enable graph edit similarity computation on larger and distant graphs, we present CSI_GED, a novel edge-based mapping method for computing graph edit distance through the enumeration of common sub-structure isomorphisms. CSI_GED utilizes backtracking search combined with a number of heuristics to reduce memory requirements and quickly prune away a large portion of the mapping search space. Experiments show that CSI_GED is highly efficient for computing the edit distance on small as well as large and distant graphs. Furthermore, we evaluated CSI_GED as a stand-alone graph edit similarity search query method. The experiments show that CSI_GED is effective and scalable, and outperforms the state-of-the-art indexing-based methods by over two orders of magnitude.

49 citations


Book ChapterDOI
27 Sep 2016
TL;DR: A novel procedure for measuring robustness between digitized CPS signals and Signal Temporal Logic (STL) specifications is proposed and a dynamic programming algorithm for computing the robustness degree is developed.
Abstract: In cyber-physical systems (CPS), physical behaviors are typically controlled by digital hardware. As a consequence, continuous behaviors are discretized by sampling and quantization prior to their processing. Quantifying the similarity between CPS behaviors and their specification is an important ingredient in evaluating correctness and quality of such systems. We propose a novel procedure for measuring robustness between digitized CPS signals and Signal Temporal Logic (STL) specifications. We first equip STL with quantitative semantics based on the weighted edit distance (WED), a metric that quantifies both space and time mismatches between digitized CPS behaviors. We then develop a dynamic programming algorithm for computing the robustness degree between digitized signals and STL specifications. We implemented our approach and evaluated it on an automotive case study.
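A minimal dynamic-programming sketch in this spirit is given below, under assumed costs: substitutions are charged the absolute value difference between samples (space mismatch) and insertions/deletions a fixed gap cost (time mismatch). The paper's weighted edit distance and its STL quantitative semantics are defined in more detail than this toy.

```python
def weighted_edit_distance(s, t, gap_cost=1.0):
    """Weighted edit distance between two sampled signals (illustrative only):
    substitution cost = |value difference|, insertion/deletion cost = gap_cost."""
    n, m = len(s), len(t)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + abs(s[i - 1] - t[j - 1]),  # substitution
                dp[i - 1][j] + gap_cost,                       # deletion
                dp[i][j - 1] + gap_cost,                       # insertion
            )
    return dp[n][m]

print(weighted_edit_distance([0.0, 0.5, 1.0], [0.0, 0.6, 1.0, 1.0]))  # ~1.1
```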

49 citations


Posted Content
TL;DR: In this paper, it was shown that Alice can send Bob a message of size O(K(log^2 K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise.
Abstract: We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.

43 citations


Journal ArticleDOI
TL;DR: In this paper, the authors define a combinatorial distance for Reeb graphs of orientable surfaces in terms of the cost necessary to transform one graph into another by edit operations, and show that the edit distance discriminates Reeb graphs better than any other distance for Reeb graphs of surfaces satisfying the stability property.
Abstract: Reeb graphs are structural descriptors that capture shape properties of a topological space from the perspective of a chosen function. In this work, we define a combinatorial distance for Reeb graphs of orientable surfaces in terms of the cost necessary to transform one graph into another by edit operations. The main contributions of this paper are the stability property and the optimality of this edit distance. More precisely, the stability result states that changes in the Reeb graphs, measured by the edit distance, are as small as changes in the functions, measured by the maximum norm. The optimality result states that the edit distance discriminates Reeb graphs better than any other distance for Reeb graphs of surfaces satisfying the stability property.

42 citations


Proceedings ArticleDOI
19 Dec 2016
TL;DR: This work shows that in the document exchange problem, Alice can send Bob a message of size O(K(log^2 K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise.
Abstract: We show that in the document exchange problem, where Alice holds x ∈ {0, 1}^n and Bob holds y ∈ {0, 1}^n, Alice can send Bob a message of size O(K(log^2 K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output "error" otherwise. Both the encoding and decoding can be done in time Õ(n + poly(K)). This result significantly improves on the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold x and y respectively, they can compute sketches of x and y of sizes poly(K log n) bits (the encoding), and send to the referee, who can then compute the edit distance between x and y together with all the edit operations if the edit distance is no more than K, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using poly(K log n) bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using poly(K log n) bits of space.

41 citations


Book
09 Jan 2016
TL;DR: This unique text/reference presents a thorough introduction to the field of structural pattern recognition, with a particular focus on graph edit distance (GED) and a detailed review of a diverse selection of novel methods related to GED.
Abstract: This unique text/reference presents a thorough introduction to the field of structural pattern recognition, with a particular focus on graph edit distance (GED). The book also provides a detailed review of a diverse selection of novel methods related to GED, and concludes by suggesting possible avenues for future research. Topics and features: formally introduces the concept of GED, and highlights the basic properties of this graph matching paradigm; describes a reformulation of GED to a quadratic assignment problem; illustrates how the quadratic assignment problem of GED can be reduced to a linear sum assignment problem; reviews strategies for reducing both the overestimation of the true edit distance and the matching time in the approximation framework; examines the improvement demonstrated by the described algorithmic framework with respect to the distance accuracy and the matching time; includes appendices listing the datasets employed for the experimental evaluations discussed in the book.

Journal ArticleDOI
TL;DR: This article revisits the kNN classifier on time-series data by considering ten classic distance-based vote weighting schemes in the context of Euclidean distance, as well as four commonly used elastic distance measures: DTW, Longest Common Subsequence, Edit Distance with Real Penalty and Edit Distance on Real sequence.
Abstract: Many well-known machine learning algorithms have been applied to the task of time-series classification, including decision trees, neural networks, support vector machines and others. However, it was shown that the simple 1-nearest neighbor (1NN) classifier, coupled with an elastic distance measure like Dynamic Time Warping (DTW), often produces better results than more complex classifiers on time-series data, including k-nearest neighbor (kNN) for values of k > 1. In this article, we revisit the kNN classifier on time-series data by considering ten classic distance-based vote weighting schemes in the context of Euclidean distance, as well as four commonly used elastic distance measures: DTW, Longest Common Subsequence, Edit Distance with Real Penalty and Edit Distance on Real sequence. Through experiments on the complete collection of UCR time-series datasets, we confirm the view that the 1NN classifier is very hard to beat. Overall, for all considered distance measures, we found that variants of the Dudani weighting scheme produced the best results.
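For concreteness, the sketch below pairs a textbook DTW implementation with a Dudani-style distance-weighted vote (nearest neighbour weight 1, k-th neighbour weight 0, linear interpolation in between). The tiny data set is invented for illustration; this is not the article's experimental setup.

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_dudani(query, train, labels, k=3):
    """k-NN vote with Dudani weights: 1 for the nearest neighbour, 0 for the
    k-th one, linearly interpolated in between (all 1 if the distances tie)."""
    neigh = sorted(((dtw(query, x), y) for x, y in zip(train, labels)),
                   key=lambda t: t[0])[:k]
    d1, dk = neigh[0][0], neigh[-1][0]
    votes = {}
    for dist, label in neigh:
        w = 1.0 if dk == d1 else (dk - dist) / (dk - d1)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

train = [[0, 0, 1, 2], [0, 1, 2, 3], [3, 2, 1, 0]]
labels = ["up", "up", "down"]
print(knn_dudani([0, 1, 1, 2], train, labels, k=3))   # "up"
```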

Journal ArticleDOI
TL;DR: The article describes ScanGraph, a new tool for the analysis of eye-movement data whose computational algorithms are modified to work better with sequences of different lengths, and demonstrates its functionality on a simple cartographic eye-tracking study.
Abstract: The article describes a new tool for the analysis of eye-movement data. Many different approaches to scanpath comparison exist. One of the most frequently used approaches is String Edit Distance, where the gaze trajectories are replaced by sequences of visited Areas of Interest. In the cartographic literature, the most commonly used software for scanpath comparison is eyePatterns. During the analysis of eyePatterns' functionality, we found that the tree-graph visualization of its results is not reliable. Thus, we decided to develop a new tool called ScanGraph. Its computational algorithms are modified to work better with sequences of different lengths. The output is visualized as a simple graph, and similar groups of sequences are displayed as cliques of this graph. The article describes ScanGraph's functionality using the example of a simple cartographic eye-tracking study, in which differences in the reading strategy for a simple map between cartographic experts and novices were investigated. The paper should serve researchers who would like to analyze differences between groups of participants using our tool, ScanGraph, available at www.eyetracking.upol.cz/scangraph.
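At its core, such scanpath comparison is an ordinary string edit distance over AOI labels. The sketch below shows a standard two-row Levenshtein computation and a normalization to [0, 1] chosen here purely for illustration; ScanGraph's own similarity computation may differ.

```python
def levenshtein(a, b):
    """Unit-cost string edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def scanpath_similarity(p, q):
    """Similarity in [0, 1] between two AOI sequences, e.g. "ABBCD" vs "ABCD"."""
    if not p and not q:
        return 1.0
    return 1.0 - levenshtein(p, q) / max(len(p), len(q))

print(scanpath_similarity("ABBCD", "ABCD"))   # 0.8
```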

Proceedings ArticleDOI
01 Jan 2016
TL;DR: In this paper, the authors considered the problem of interactive communication in which two remote parties perform a computation while their communication channel is (adversarially) noisy, and they obtained the first interactive coding scheme that has a constant rate and tolerates noise rates of up to 1/18 - epsilon.
Abstract: We consider the question of interactive communication, in which two remote parties perform a computation while their communication channel is (adversarially) noisy. We extend here the discussion into a more general and stronger class of noise, namely, we allow the channel to perform insertions and deletions of symbols. These types of errors may bring the parties "out of sync", so that there is no consensus regarding the current round of the protocol. In this more general noise model, we obtain the first interactive coding scheme that has a constant rate and tolerates noise rates of up to 1/18 - epsilon. To this end we develop a novel primitive we name edit distance tree code. The edit distance tree code is designed to replace the Hamming distance constraints in Schulman's tree codes (STOC 93), with a stronger edit distance requirement. However, the straightforward generalization of tree codes to edit distance does not seem to yield a primitive that suffices for communication in the presence of synchronization problems. Giving the "right" definition of edit distance tree codes is a main conceptual contribution of this work.

Journal ArticleDOI
TL;DR: An original approach to the writer authentication task is presented, based on the analysis of a single sample of a handwritten word and using the Levenshtein edit distance, computed with the Wagner-Fischer algorithm, to estimate the cost of transforming one handwritten word into another.
Abstract: The writer recognition task has received a lot of interest during the last decade due to its wide range of applications. This task includes writer identification and/or writer verification. However, previous research has assumed that a large amount of text is available to identify or authenticate the writer, which is never the case in real-life applications. In this paper, we present an original approach to the writer authentication task based on the analysis of a single sample of a handwritten word. We used the Levenshtein edit distance, computed with the Wagner-Fischer algorithm, to estimate the cost of transforming one handwritten word into another. Such methods have been successfully applied to signature authentication and voice recognition. In order to apply the method to handwritten words, we developed a segmentation module to generate the graphemes, considered as elementary components of each word. We evaluated this approach on part of the IAM database (100 writers), where half of the writers provided only three samples of the same word. The obtained results are very promising, since we accept correctly in 87% of cases when using the whole database (100 writers) and up to 92% when using 40 writers.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: This paper gives the first truly sub-cubic algorithm for the bounded-differences (min,+)-product of two n-by-n matrices; the general (min,+)-product is equivalent to the All-Pairs Shortest Paths (APSP) problem in n-vertex graphs.
Abstract: It is a major open problem whether the (min,+)-product of two n by n matrices has a truly sub-cubic time algorithm, as it is equivalent to the famous All-Pairs-Shortest-Paths problem (APSP) in n-vertex graphs. There are some restrictions of the (min,+)-product to special types of matrices that admit truly sub-cubic algorithms, each giving rise to a special case of APSP that can be solved faster. In this paper we consider a new, different and powerful restriction in which one matrix can be arbitrary, as long as the other matrix has "bounded differences" in either its columns or rows, i.e. any two consecutive entries differ by only a small amount. We obtain the first truly sub-cubic algorithm for this Bounded Differences (min,+)-product (answering an open problem of Chan and Lewenstein). Our new algorithm, combined with a strengthening of an approach of L. Valiant for solving context-free grammar parsing with matrix multiplication, yields the first truly sub-cubic algorithms for the following problems: Language Edit Distance (a major problem in the parsing community), RNA-folding (a major problem in bioinformatics) and Optimum Stack Generation (answering an open problem of Tarjan).
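For reference, the (min,+)-product under discussion is C[i][j] = min_k (A[i][k] + B[k][j]). The naive cubic routine below merely fixes that definition; the paper's contribution is a truly sub-cubic algorithm for the special case in which one matrix has bounded differences between consecutive entries.

```python
import math

def min_plus_product(A, B):
    """Naive cubic (min,+)-product: C[i][j] = min over k of A[i][k] + B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[math.inf] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]
            for j in range(p):
                if aik + B[k][j] < C[i][j]:
                    C[i][j] = aik + B[k][j]
    return C

print(min_plus_product([[0, 3], [2, 0]], [[0, 1], [4, 0]]))   # [[0, 1], [2, 0]]
```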

Journal ArticleDOI
TL;DR: This work develops a dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised, and improves the error model and language model by analyzing error types and creating an edit distance re-ranker.
Abstract: A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully-inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.

Proceedings ArticleDOI
01 Jan 2016
TL;DR: This paper presents the first subquadratic algorithms for computing similarity between a pair of point sequences in R^d, for any fixed d > 1, using dynamic time warping (DTW) and edit distance, assuming that the point sequences are drawn from certain natural families of curves.
Abstract: We present the first subquadratic algorithms for computing similarity between a pair of point sequences in R^d, for any fixed d > 1, using dynamic time warping (DTW) and edit distance, assuming that the point sequences are drawn from certain natural families of curves. In particular, our algorithms compute (1 + eps)-approximations of DTW and ED in near-linear time for point sequences drawn from k-packed or k-bounded curves, and subquadratic time for backbone sequences. Roughly speaking, a curve is k-packed if the length of its intersection with any ball of radius r is at most kr, and it is k-bounded if the sub-curve between two curve points does not go too far from the two points compared to the distance between the two points. In backbone sequences, consecutive points are spaced at approximately equal distances apart, and no two points lie very close together. Recent results suggest that a subquadratic algorithm for DTW or ED is unlikely for an arbitrary pair of point sequences even for d = 1. The commonly used dynamic programming algorithms for these distance measures reduce the problem to computing a minimum-weight path in a grid graph. Our algorithms work by constructing a small set of rectangular regions that cover the grid vertices. The weights of vertices inside each rectangle are roughly the same, and we develop efficient procedures to compute the approximate minimum-weight paths through these rectangles.

Proceedings ArticleDOI
08 May 2016
TL;DR: The main contribution of this paper is the introduction of a general edit distance suitable for comparing Reeb graphs in these settings, which promises to be useful for applications in 3D object retrieval because of its stability properties in the presence of noise.
Abstract: We consider the problem of assessing the similarity of 3D shapes using Reeb graphs from the standpoint of robustness under perturbations. For this purpose, 3D objects are viewed as spaces endowed with real-valued functions, while the similarity between the resulting Reeb graphs is addressed through a graph edit distance. The cases of smooth functions on manifolds and piecewise linear functions on polyhedra stand out as the most interesting ones. The main contribution of this paper is the introduction of a general edit distance suitable for comparing Reeb graphs in these settings. This edit distance promises to be useful for applications in 3D object retrieval because of its stability properties in the presence of noise.

Posted Content
TL;DR: This paper presents two streaming algorithms for computing edit distance when it is bounded by a parameter k: one runs in time $O(n+k^2)$ and the other in $n+O(k^3)$; the best off-line algorithm runs in $O(n+k^2)$ time, which is known to be optimal under the Strong Exponential Time Hypothesis.
Abstract: The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. In this paper we study the computational problem of computing the edit distance between a pair of strings where their distance is bounded by a parameter $k\ll n$. We present two streaming algorithms for computing edit distance: One runs in time $O(n+k^2)$ and the other $n+O(k^3)$. By writing $n+O(k^3)$ we want to emphasize that the number of operations per input symbol is a small constant. In particular, the running time does not depend on the alphabet size, and the algorithm should be easy to implement. Previously a streaming algorithm with running time $O(n+k^4)$ was given in the paper by the current authors (STOC'16). The best off-line algorithm runs in time $O(n+k^2)$ (Landau et al., 1998) which is known to be optimal under the Strong Exponential Time Hypothesis.
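The classical non-streaming starting point for such bounded-distance computations is the banded dynamic program: if the edit distance is at most k, only DP cells within k of the main diagonal matter. The sketch below is that textbook O(k · n) routine (cells outside the band are clamped to k + 1), not the streaming algorithms of the paper.

```python
def bounded_edit_distance(a, b, k):
    """Exact edit distance if it is at most k, otherwise None.
    Only the diagonal band |i - j| <= k of the DP table is filled."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    INF = k + 1                                   # stands for "greater than k"
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[m] if prev[m] <= k else None

print(bounded_edit_distance("kitten", "sitting", 3))   # 3
print(bounded_edit_distance("kitten", "sitting", 2))   # None
```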

Journal ArticleDOI
TL;DR: The statistical analysis carried out in this paper shows that the proposed WED-based method improves the system reliability by significantly increasing the accuracy of word prediction.
Abstract: A P300 speller-based brain-computer interface provides direct communication from the human brain to a computer without any muscular movement. In a conventional P300 speller, a display paradigm is used to present alphanumeric characters to users and a classification system is used to detect the target character from the acquired electroencephalographic signals. In this paper, we present an 8×8 matrix consisting of Devanagari characters, digits, and special characters as a Devanagari script (DS)-based display paradigm. The larger size of the display paradigm compared with the conventional 6×6 English row/column (RC) paradigm, the involvement of matras and ardha-aksharas, and similar-looking characters in DS increase the adjacency problem, crowding effect, fatigue, and task difficulty. This results in deteriorated performance at the classification stage. A binary differential evolution algorithm was employed for optimal channel selection and a support vector machine was used to classify target versus non-target stimuli for the data set collected from ten healthy subjects using the DS-based paradigm. In order to further improve the system reliability in terms of higher accuracy at the word prediction level, this paper proposes a novel spelling correction approach based on weighted edit distance (WED). A custom-built dictionary was incorporated and each misspelled word was replaced by a correct word of minimum WED from it. The proposed work is based on validating the hypothesis that most target-error pairs lie in the same row or column. Using the proposed spelling correction approach with optimal channel selection, an average accuracy of 99% was achieved at the word prediction level. The statistical analysis carried out in this paper shows that the proposed WED-based method improves the system reliability by significantly increasing the accuracy of word prediction. This paper also validates that the proposed method performs better than the conventional edit distance-based spelling correction approach.

Journal ArticleDOI
TL;DR: The VarMatch algorithm is presented, based on a theoretical result which allows us to partition the input into smaller subproblems without sacrificing accuracy, and is able to detect more matches than either the normalization or decomposition algorithms on tested datasets.
Abstract: Motivation: Small variant calling is an important component of many analyses, and, in many instances, it is important to determine the set of variants which appear in multiple callsets. Variant matching is complicated by variants that have multiple equivalent representations. Normalization and decomposition algorithms have been proposed, but are not robust to different representations of complex variants. Variant matching is also usually done to maximize the number of matches, as opposed to other optimization criteria. Results: We present the VarMatch algorithm for the variant matching problem. Our algorithm is based on a theoretical result which allows us to partition the input into smaller subproblems without sacrificing accuracy. VarMatch is robust to different representations of complex variants and is particularly effective in low-complexity regions or those dense in variants. VarMatch is able to detect more matches than either the normalization or decomposition algorithms on tested datasets. It also implements different optimization criteria, such as edit distance, that can improve robustness to different variant representations. Finally, the VarMatch software provides summary statistics, annotations and visualizations that are useful for understanding callers' performance. Availability and Implementation: VarMatch is freely available at: https://github.com/medvedevgroup/varmatch Contact: chensun@cse.psu.edu or pashadag@cse.psu.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Posted Content
TL;DR: The problem of transforming a set of elements into another by a sequence of elementary edit operations, namely substitutions, removals and insertions of elements, is classically expressed as a linear sum assignment problem (LSAP) between two augmented sets; this report formalizes it as an extension of the LSAP that requires only one additional element in each set.
Abstract: We consider the problem of transforming a set of elements into another by a sequence of elementary edit operations, namely substitutions, removals and insertions of elements. Each possible edit operation is penalized by a non-negative cost and the cost of a transformation is measured by summing the costs of its operations. A solution to this problem consists in defining a transformation having a minimal cost, among all possible transformations. To compute such a solution, the classical approach consists in representing removal and insertion operations by augmenting the two sets so that they get the same size. This makes it possible to express the problem as a linear sum assignment problem (LSAP), which thus finds an optimal bijection (or permutation, perfect matching) between the two augmented sets. While the LSAP is known to be efficiently solvable in polynomial time, for instance with the Hungarian algorithm, useless time and memory are spent treating the elements which have been added to the initial sets. In this report, we show that the problem can be formalized as an extension of the LSAP which considers only one additional element in each set to represent removal and insertion operations. A solution to the problem is no longer represented as a bijection between the two augmented sets. We show that the considered problem is a binary linear program (BLP) very close to the LSAP. While it can be solved by any BLP solver, we propose an adaptation of the Hungarian algorithm which improves the time and memory complexities previously obtained by the approach based on the LSAP. The importance of the improvement increases as the sizes of the two sets and the absolute difference between these sizes increase. Based on the analysis of the problem presented in this report, other classical algorithms can be adapted.
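The classical augmented formulation described above can be written down in a few lines with SciPy's LSAP solver. The sketch below builds the (n+m)×(n+m) augmented cost matrix (the classical approach, not the report's more compact one-extra-element formulation); the cost inputs in the example are invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_edit_cost(sub_cost, del_cost, ins_cost):
    """Minimum cost of transforming one set into another via the classical
    augmented LSAP: sub_cost is (n, m) substitution costs, del_cost the n
    removal costs, ins_cost the m insertion costs."""
    n, m = sub_cost.shape
    BIG = 1e9                                   # forbids meaningless pairings
    C = np.full((n + m, n + m), BIG)
    C[:n, :m] = sub_cost                        # substitutions
    C[n:, m:] = 0.0                             # dummy-to-dummy pairings
    for i in range(n):
        C[i, m + i] = del_cost[i]               # removal of element i
    for j in range(m):
        C[n + j, j] = ins_cost[j]               # insertion of element j
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum()

# sets {1, 5} and {2}: substitute 1 -> 2 (cost 1) and remove 5 (cost 1)
print(set_edit_cost(np.array([[1.0], [3.0]]),
                    np.array([1.0, 1.0]), np.array([1.0])))   # 2.0
```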

Proceedings ArticleDOI
10 Jul 2016
TL;DR: It is shown that if the reads are long enough and there are sufficiently many of them, then approximate reconstruction is possible: the authors construct a simple algorithm such that, for almost all original sequences, the output of the algorithm is a sequence whose edit distance from the original one is at most O(e) times the length of the original sequence.
Abstract: The prevalent technique for DNA sequencing consists of two main steps: shotgun sequencing, where many randomly located fragments, called reads, are extracted from the overall sequence, followed by an assembly algorithm that aims to reconstruct the original sequence. There are many different technologies that generate the reads: widely-used second-generation methods create short reads with low error rates, while emerging third-generation methods create long reads with high error rates. Both error rates and error profiles differ among methods, so reconstruction algorithms are often tailored to specific shotgun sequencing technologies. As these methods change over time, a fundamental question is whether there exist reconstruction algorithms which are robust, i.e., which perform well under a wide range of error distributions. Here we study this question of sequence assembly from corrupted reads. We make no assumption on the types of errors in the reads, but only assume a bound on their magnitude. More precisely, for each read we assume that instead of receiving the true read with no errors, we receive a corrupted read which has edit distance at most e times the length of the read from the true read. We show that if the reads are long enough and there are sufficiently many of them, then approximate reconstruction is possible: we construct a simple algorithm such that for almost all original sequences the output of the algorithm is a sequence whose edit distance from the original one is at most O(e) times the length of the original sequence.

Journal ArticleDOI
TL;DR: Harry is a small tool specifically designed for measuring the similarity of strings and implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel.
Abstract: Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka.

Posted Content
TL;DR: The presented method is compared to several baselines including `orthographic translation' with Levenshtein edit distance and outperforms them by a large margin and shows that language-independent `semantic fingerprints' are superior to multi-lingual clustering algorithms proposed in the previous work, at the same time requiring less linguistic resources.
Abstract: We present our experience in applying distributional semantics (neural word embeddings) to the problem of representing and clustering documents in a bilingual comparable corpus. Our data is a collection of Russian and Ukrainian academic texts, for which topics are their academic fields. In order to build language-independent semantic representations of these documents, we train neural distributional models on monolingual corpora and learn the optimal linear transformation of vectors from one language to another. The resulting vectors are then used to produce `semantic fingerprints' of documents, serving as input to a clustering algorithm. The presented method is compared to several baselines including `orthographic translation' with Levenshtein edit distance and outperforms them by a large margin. We also show that language-independent `semantic fingerprints' are superior to multi-lingual clustering algorithms proposed in the previous work, at the same time requiring less linguistic resources.

Book ChapterDOI
01 Jan 2016
TL;DR: Some of the most common tools used for computing the edit distance function are described, the major current results are summarized, generalizations to other combinatorial structures are outlined, and some open problems are posed.
Abstract: The edit distance is a very simple and natural metric on the space of graphs. In the edit distance problem, we fix a hereditary property of graphs and compute the asymptotically largest edit distance of a graph from the property. This quantity is very difficult to compute directly but in many cases, it can be derived as the maximum of the edit distance function. Szemerédi's regularity lemma, strongly regular graphs, constructions related to the Zarankiewicz problem – all these play a role in computing edit distance functions. The most powerful tool is derived from symmetrization, which we use to optimize quadratic programs that define the edit distance function. In this paper, we describe some of the most common tools used for computing the edit distance function, summarize the major current results, outline generalizations to other combinatorial structures, and pose some open problems.

Journal ArticleDOI
TL;DR: Several novel, exact, sequential and parallel algorithms are presented for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d errors of types substitution, insertion and deletion.
Abstract: Motif search is an important step in extracting meaningful patterns from biological data. The general problem of motif search is intractable and there is a pressing need to develop efficient, exact and approximation algorithms to solve this problem. In this paper, we present several novel, exact, sequential and parallel algorithms for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d errors of types substitution, insertion and deletion. One popular technique to solve the problem is to explore for each input string the set of all possible l-mers that belong to the d-neighborhood of any substring of the input string and output those which are common to all input strings. We introduce a novel and provably efficient neighborhood exploration technique. We show that it is enough to consider the candidates in the neighborhood which are at a distance exactly d. We compactly represent these candidate motifs using wildcard characters and efficiently explore them with very few repetitions. Our sequential algorithm uses a trie-based data structure to efficiently store and sort the candidate motifs. Our parallel algorithm in a multi-core shared memory setting uses arrays for storing and a novel modification of radix sort for sorting the candidate motifs. The algorithms for EMS are customarily evaluated on several challenging instances such as (8,1), (12,2), (16,3), (20,4), and so on. The best previously known algorithm, EMS1, is sequential and solves instances up to (16,3) in an estimated 3 days. Our sequential algorithms are more than 20 times faster on (16,3). On other hard instances such as (9,2), (11,3), (13,4), our algorithms are much faster. Our parallel algorithm achieves more than 600% scaling performance while using 16 threads. Our algorithms have pushed up the state of the art of EMS solvers and we believe that the techniques introduced in this paper are also applicable to other motif search problems such as Planted Motif Search (PMS) and Simple Motif Search (SMS).

Posted Content
TL;DR: This work investigates the problem of aligning two RDF databases, an essential problem in understanding the evolution of ontologies, and proposes approaches inspired by the classical notion of graph bisimulation to capture the natural metrics of edit distance on the data values and the graph structure.
Abstract: We investigate the problem of aligning two RDF databases, an essential problem in understanding the evolution of ontologies. Our approaches address three fundamental challenges: 1) the use of "blank" (null) names, 2) ontology changes in which different names are used to identify the same entity, and 3) small changes in the data values as well as small changes in the graph structure of the RDF database. We propose approaches inspired by the classical notion of graph bisimulation and extend them to capture the natural metrics of edit distance on the data values and the graph structure. We evaluate our methods on three evolving curated data sets. Overall, our results show that the proposed methods perform well and are scalable.

Proceedings ArticleDOI
15 Jun 2016
TL;DR: The first algorithm for fixing sequences of unbalanced parentheses that runs in linear time when the number of necessary edits is small is shown; the problem is related to the task of repairing semi-structured documents such as XML and JSON.
Abstract: We consider the problem of fixing sequences of unbalanced parentheses. A classic algorithm based on dynamic programming computes the optimum sequence of edits required to solve the problem in cubic time. We show the first algorithm that runs in linear time when the number of necessary edits is small. More precisely, our algorithm runs in O(n) + d^O(1) time, where n is the length of the sequence to be fixed and d is the minimum number of edits. The problem of fixing parentheses sequences is related to the task of repairing semi-structured documents such as XML and JSON.
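For the simplest special case, a single parenthesis type repaired with insertions only, the optimum can be counted greedily in linear time; the paper's problem allows general edit operations and relies on dynamic programming, so the sketch below is only a baseline for intuition.

```python
def min_insertions_to_balance(s):
    """Minimum number of insertions needed to balance a '('/')' string."""
    unmatched_open = 0     # '(' still waiting for a matching ')'
    unmatched_close = 0    # ')' that can never be matched
    for c in s:
        if c == '(':
            unmatched_open += 1
        elif c == ')':
            if unmatched_open:
                unmatched_open -= 1
            else:
                unmatched_close += 1
    return unmatched_open + unmatched_close

print(min_insertions_to_balance("(()))("))   # 2
```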