
Showing papers on "Edit distance published in 2015"


Journal ArticleDOI
TL;DR: The authors believe their ensemble is the first classifier to significantly outperform DTW, raising the bar for future work in this area, and demonstrate that the ensemble is more accurate than approaches not based in the time domain.
Abstract: Several alternative distance measures for comparing time series have recently been proposed and evaluated on time series classification (TSC) problems. These include variants of dynamic time warping (DTW), such as weighted and derivative DTW, and edit distance-based measures, including longest common subsequence, edit distance with real penalty, time warp with edit, and move-split-merge. These measures have the common characteristic that they operate in the time domain and compensate for potential localised misalignment through some elastic adjustment. Our aim is to experimentally test two hypotheses related to these distance measures. Firstly, we test whether there is any significant difference in accuracy for TSC problems between nearest neighbour classifiers using these distance measures. Secondly, we test whether combining these elastic distance measures through simple ensemble schemes gives significantly better accuracy. We test these hypotheses by carrying out one of the largest experimental studies ever conducted into time series classification. Our first key finding is that there is no significant difference between the elastic distance measures in terms of classification accuracy on our data sets. Our second finding, and the major contribution of this work, is to define an ensemble classifier that significantly outperforms the individual classifiers. We also demonstrate that the ensemble is more accurate than approaches not based in the time domain. Nearly all TSC papers in the data mining literature cite DTW (with warping window set through cross validation) as the benchmark for comparison. We believe that our ensemble is the first ever classifier to significantly outperform DTW and as such raises the bar for future work in this area.
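
For readers unfamiliar with elastic measures, a minimal dynamic time warping sketch (plain, unconstrained DTW with a squared local cost; not the weighted, derivative, or windowed variants evaluated in the paper) illustrates the kind of elastic adjustment these measures perform:

```python
# Minimal, unconstrained dynamic time warping with a squared local cost.
# Illustrative only: the paper evaluates weighted/derivative DTW and sets the
# warping window through cross-validation, none of which is modelled here.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

print(dtw_distance([0, 1, 2, 3], [0, 1, 1, 2, 3]))  # 0.0: the repeated value is absorbed elastically
```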

443 citations


Proceedings ArticleDOI
14 Jun 2015
TL;DR: This paper shows that, if the edit distance can be computed in time O(n^{2-δ}) for some constant δ>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} · 2^{(1-ε)N} for a constant ε>0.
Abstract: The edit distance (a.k.a. the Levenshtein distance) between two strings is defined as the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another. The problem of computing the edit distance between two strings is a classical computational task, with a well-known algorithm based on dynamic programming. Unfortunately, all known algorithms for this problem run in nearly quadratic time. In this paper we provide evidence that the near-quadratic running time bounds known for the problem of computing edit distance might be tight. Specifically, we show that, if the edit distance can be computed in time O(n^{2-δ}) for some constant δ>0, then the satisfiability of conjunctive normal form formulas with N variables and M clauses can be solved in time M^{O(1)} · 2^{(1-ε)N} for a constant ε>0. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.
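
For reference, the classical quadratic dynamic program mentioned in the abstract looks as follows; this is the textbook algorithm whose running time the paper argues is likely near-optimal, not a contribution of the paper:

```python
# Textbook O(n*m) dynamic program for the Levenshtein (edit) distance,
# kept to two rows of the table. Nothing here is specific to the paper.
def edit_distance(s, t):
    n, m = len(s), len(t)
    prev = list(range(m + 1))        # distances for the empty prefix of s
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            curr[j] = min(prev[j - 1] + (s[i - 1] != t[j - 1]),  # substitute / match
                          prev[j] + 1,                           # delete s[i-1]
                          curr[j - 1] + 1)                       # insert t[j-1]
        prev = curr
    return prev[m]

print(edit_distance("kitten", "sitting"))  # 3
```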

264 citations


Proceedings ArticleDOI
17 Oct 2015
TL;DR: In this article, it was shown that these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails.
Abstract: Classic similarity measures of strings are longest common subsequence and Levenshtein distance (i.e., the classic edit distance). A classic similarity measure of curves is dynamic time warping. These measures can be computed by simple O(n^2) dynamic programming algorithms, and despite much effort no algorithms with significantly better running time are known. We prove that, even restricted to binary strings or one-dimensional curves, respectively, these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time O(n^{2-ε}) for any ε > 0, unless the Strong Exponential Time Hypothesis fails. We generalize the result to edit distance for arbitrary fixed costs of the four operations (deletion in one of the two strings, matching, substitution), by identifying trivial cases that can be solved in constant time, and proving quadratic-time hardness on binary strings for all other cost choices. This improves and generalizes the known hardness result for Levenshtein distance [Backurs, Indyk STOC'15] by the restriction to binary strings and the generalization to arbitrary costs, and adds important problems to a recent line of research showing conditional lower bounds for a growing number of quadratic time problems. As our main technical contribution, we introduce a framework for proving quadratic-time hardness of similarity measures. To apply the framework it suffices to construct a single gadget, which encapsulates all the expressive power necessary to emulate a reduction from satisfiability. Finally, we prove quadratic-time hardness for longest palindromic subsequence and longest tandem subsequence via reductions from longest common subsequence, showing that conditional lower bounds based on the Strong Exponential Time Hypothesis also apply to string problems that are not necessarily similarity measures.
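
The longest common subsequence measure named above has an equally simple quadratic dynamic program; again a textbook sketch, not part of the paper:

```python
# Textbook O(n*m) dynamic program for the length of the longest common
# subsequence, the other classic quadratic similarity measure named above.
def lcs_length(s, t):
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1          # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 ("GTAB")
```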

195 citations


Journal ArticleDOI
TL;DR: This paper introduces and compares four of the most common measures of trajectory similarity: longest common subsequence (LCSS), Fréchet distance, dynamic time warping (DTW), and edit distance, implemented in a new open source R package.
Abstract: Storing, querying, and analyzing trajectories is becoming increasingly important, as the availability and volumes of trajectory data increases. One important class of trajectory analysis is computing trajectory similarity. This paper introduces and compares four of the most common measures of trajectory similarity: longest common subsequence (LCSS), Fréchet distance, dynamic time warping (DTW), and edit distance. These four measures have been implemented in a new open source R package, freely available on CRAN [19]. The paper highlights some of the differences between these four similarity measures, using real trajectory data, in addition to indicating some of the important emerging applications for measurement of trajectory similarity.
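
Of the four measures compared, the Fréchet distance is often the least familiar; the following is a minimal sketch of the discrete Fréchet distance on sampled trajectories in the standard dynamic-programming formulation (Eiter and Mannila), not the implementation in the R package described in the paper:

```python
# Discrete Fréchet distance between two sampled 2D trajectories, in the
# standard dynamic-programming formulation. Illustrative sketch only; the
# open source R package described in the abstract has its own implementation.
from math import hypot

def discrete_frechet(p, q):
    n, m = len(p), len(q)
    d = lambda a, b: hypot(a[0] - b[0], a[1] - b[1])
    ca = [[0.0] * m for _ in range(n)]
    ca[0][0] = d(p[0], q[0])
    for i in range(1, n):
        ca[i][0] = max(ca[i - 1][0], d(p[i], q[0]))
    for j in range(1, m):
        ca[0][j] = max(ca[0][j - 1], d(p[0], q[j]))
    for i in range(1, n):
        for j in range(1, m):
            # advance along p, along q, or along both; keep the shortest "leash"
            ca[i][j] = max(min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1]),
                           d(p[i], q[j]))
    return ca[n - 1][m - 1]

print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]))  # 1.0
```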

144 citations


Proceedings ArticleDOI
10 Jan 2015
TL;DR: This work proposes a depth-first graph edit distance algorithm which requires less memory and searching time, and empirically demonstrates that this approach is better than the A* graph edit distance computation in terms of speed, accuracy and classification rate.
Abstract: Graph edit distance is an error-tolerant matching technique that has emerged as a powerful and flexible graph matching paradigm that can be used to address different tasks in pattern recognition, machine learning and data mining; it represents the minimum-cost sequence of basic edit operations to transform one graph into another by means of insertion, deletion and substitution of vertices and/or edges. A widely used method for exact graph edit distance computation is based on the A* algorithm. To overcome its high memory load while traversing the search tree for storing pending solutions to be explored, we propose a depth-first graph edit distance algorithm which requires less memory and searching time. An evaluation of all possible solutions is performed without explicitly enumerating them all. Candidates are discarded using an upper and lower bounds strategy. A solid experimental study is presented; experiments on a publicly available database empirically demonstrate that our approach is better than the A* graph edit distance computation in terms of speed, accuracy and classification rate.
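
As a rough illustration of the search idea only (not the paper's algorithm), a deliberately simplified depth-first branch-and-bound for exact graph edit distance with unit costs and unlabeled, undirected edges might look like the sketch below; it prunes partial node mappings against the best complete solution found so far, but omits the paper's tighter upper and lower bounds, and all names are illustrative:

```python
# Deliberately simplified depth-first branch-and-bound for exact graph edit
# distance with unit costs, labeled nodes and unlabeled undirected edges.
# A branch is pruned when its accumulated node-edit cost already reaches the
# best complete solution found so far; the paper's algorithm uses much tighter
# upper and lower bounds, so this is only a sketch of the search scheme.
def graph_edit_distance(nodes1, edges1, nodes2, edges2):
    # nodes*: lists of labels; edges*: sets of frozenset({i, j}) index pairs
    n1, n2 = len(nodes1), len(nodes2)
    best = [float("inf")]

    def edge_cost(mapping):
        matched = 0
        for e in edges1:
            i, j = tuple(e)
            u, v = mapping[i], mapping[j]
            if u is not None and v is not None and frozenset((u, v)) in edges2:
                matched += 1
        # unmatched graph-1 edges are deleted, unmatched graph-2 edges inserted
        return (len(edges1) - matched) + (len(edges2) - matched)

    def dfs(i, mapping, used, node_cost):
        if node_cost >= best[0]:
            return                       # prune: cannot beat the incumbent
        if i == n1:
            total = node_cost + (n2 - len(used)) + edge_cost(mapping)
            best[0] = min(best[0], total)
            return
        for j in range(n2):              # substitute node i by node j
            if j not in used:
                c = 0 if nodes1[i] == nodes2[j] else 1
                dfs(i + 1, mapping + [j], used | {j}, node_cost + c)
        dfs(i + 1, mapping + [None], used, node_cost + 1)   # delete node i

    dfs(0, [], set(), 0)
    return best[0]

g1 = (["a", "b"], {frozenset((0, 1))})
g2 = (["a", "b", "c"], {frozenset((0, 1)), frozenset((1, 2))})
print(graph_edit_distance(*g1, *g2))  # 2: insert node "c" and edge (b, c)
```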

143 citations


Proceedings ArticleDOI
12 Oct 2015
TL;DR: This paper proposes GENSETS, a genome-wide, privacy-preserving similar patient query system able to support searching large-scale, distributed genome databases across the nation, and implements a prototype of GENSETS, combining a novel genomic edit distance approximation algorithm with a new construction of private set difference size protocols.
Abstract: Edit distance has been proven to be an important and frequently-used metric in much human genomic research, with Similar Patient Query (SPQ) being a particularly promising and attractive example. However, due to widespread privacy concerns about revealing personal genomic data, the scope and scale of many novel uses of genome edit distance are substantially limited. While the problem of private genomic edit distance has been studied by the research community for over a decade [6], the state-of-the-art solution [31] is far from being applicable to real genome sequences. In this paper, we propose several private edit distance protocols that feature unprecedentedly high efficiency and precision. Our construction is a combination of a novel genomic edit distance approximation algorithm and a new construction of private set difference size protocols. With the private edit distance based secure SPQ primitive, we propose GENSETS, a genome-wide, privacy-preserving similar patient query system. It is able to support searching large-scale, distributed genome databases across the nation. We have implemented a prototype of GENSETS. The experimental results show that, with a 100 Mbps network connection, it would take GENSETS less than 200 minutes to search through 1 million breast cancer patients (distributed nation-wide in 250 hospitals, each having 4000 patients), based on edit distances between their genomes of lengths about 75 million nucleotides each.

128 citations


Journal ArticleDOI
TL;DR: A quadratic time approximation of graph edit distance based on Hausdorff matching is proposed and shows promising potential in terms of flexibility, efficiency, and accuracy.

118 citations


Journal ArticleDOI
TL;DR: RTED is shown to be optimal among all algorithms that use LRH (left-right-heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in the literature.
Abstract: We consider the classical tree edit distance between ordered labelled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity but the worst case happens frequently, or they are very efficient for some tree shapes but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms. In this article we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of our algorithm is smaller than or equal to the complexity of the best competitors for any input instance, that is, our algorithm is both efficient and worst-case optimal. This is achieved by computing a dynamic decomposition strategy that depends on the input trees. RTED is shown optimal among all algorithms that use LRH (left-right-heavy) strategies, which include RTED and the fastest tree edit distance algorithms presented in the literature. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state-of-the-art.

112 citations


Book ChapterDOI
26 Jan 2015
TL;DR: In this article, the authors proposed a method to perform the edit distance algorithm on encrypted data to obtain an encrypted result, where the genomic data owner provided only the encrypted sequence, and the public commercial cloud can perform the sequence analysis without decryption.
Abstract: These days genomic sequence analysis provides a key way of understanding the biology of an organism. However, since these sequences contain much private information, it can be very dangerous to reveal any part of them. It is desirable to protect this sensitive information when performing sequence analysis in public. As a first step in this direction, we present a method to perform the edit distance algorithm on encrypted data to obtain an encrypted result. In our approach, the genomic data owner provides only the encrypted sequence, and the public commercial cloud can perform the sequence analysis without decryption. The result can be decrypted only by the data owner or designated representative holding the decryption key.

111 citations


Proceedings ArticleDOI
13 Apr 2015
TL;DR: A robust distance function called Edit Distance with Projections (EDwP) to match trajectories under inconsistent and variable sampling rates through dynamic interpolation is formulated, and an index structure called TrajTree is designed to enable efficient trajectory retrievals using EDwP.
Abstract: Quantifying the similarity between two trajectories is a fundamental operation in analysis of spatio-temporal databases. While a number of distance functions exist, the recent shift in the dynamics of the trajectory generation procedure violates one of their core assumptions: a consistent and uniform sampling rate. In this paper, we formulate a robust distance function called Edit Distance with Projections (EDwP) to match trajectories under inconsistent and variable sampling rates through dynamic interpolation. This is achieved by deploying the idea of projections that goes beyond matching only the sampled points while aligning trajectories. To enable efficient trajectory retrievals using EDwP, we design an index structure called TrajTree. TrajTree derives its pruning power by employing the unique combination of bounding boxes with Lipschitz embedding. Extensive experiments on real trajectory databases demonstrate EDwP to be up to 5 times more accurate than the state-of-the-art distance functions. Additionally, TrajTree increases the efficiency of trajectory retrievals by up to an order of magnitude over existing techniques.

108 citations


Journal ArticleDOI
TL;DR: This paper uses homomorphic encryption for secure computation of the minor allele frequencies and the χ2 statistic in a genome-wide association study setting; the computations can be performed in an untrusted cloud without requiring the decryption key or any interaction with the data owner.
Abstract: The rapid development of genome sequencing technology allows researchers to access large genome datasets. However, outsourcing the data processing to the cloud poses high risks for personal privacy. The aim of this paper is to give a practical solution for this problem using homomorphic encryption. In our approach, all the computations can be performed in an untrusted cloud without requiring the decryption key or any interaction with the data owner, which preserves the privacy of genome data. We present evaluation algorithms for secure computation of the minor allele frequencies and χ2 statistic in a genome-wide association studies setting. We also describe how to privately compute the Hamming distance and approximate edit distance between encrypted DNA sequences. Finally, we compare performance details of using two practical homomorphic encryption schemes - the BGV scheme by Gentry, Halevi and Smart and the YASHE scheme by Bos, Lauter, Loftus and Naehrig. The approach with the YASHE scheme analyzes data from 400 people within about 2 seconds and picks a variant associated with disease from 311 spots. For another task, using the BGV scheme, it took about 65 seconds to securely compute the approximate edit distance for DNA sequences of size 5K and figure out the differences between them. The performance numbers for BGV are better than YASHE when homomorphically evaluating deep circuits (like the Hamming distance algorithm or approximate edit distance algorithm). On the other hand, it is more efficient to use the YASHE scheme for a low-degree computation, such as minor allele frequencies or χ2 test statistic in a case-control study.

Posted Content
TL;DR: In this paper, the authors show that the edit distance of two sequences of length n cannot be computed in strongly subquadratic time under the Strong Exponential Time Hypothesis (SETH).
Abstract: A recent and active line of work achieves tight lower bounds for fundamental problems under the Strong Exponential Time Hypothesis (SETH). A celebrated result of Backurs and Indyk (STOC'15) proves that the Edit Distance of two sequences of length n cannot be computed in strongly subquadratic time under SETH. The result was extended by follow-up works to simpler looking problems like finding the Longest Common Subsequence (LCS). SETH is a very strong assumption, asserting that even linear size CNF formulas cannot be analyzed for satisfiability with an exponential speedup over exhaustive search. We consider much safer assumptions, e.g. that such a speedup is impossible for SAT on much more expressive representations, like NC circuits. Intuitively, this seems much more plausible: NC circuits can implement complex cryptographic primitives, while CNFs cannot even approximately compute an XOR of bits. Our main result is a surprising reduction from SAT on Branching Programs to fundamental problems in P like Edit Distance, LCS, and many others. Truly subquadratic algorithms for these problems therefore have consequences that we consider to be far more remarkable than merely faster CNF SAT algorithms. For example, SAT on arbitrary o(n)-depth bounded fan-in circuits (and therefore also NC-Circuit-SAT) can be solved in (2-eps)^n time. A very interesting feature of our work is that we can prove major consequences even from mildly subquadratic algorithms for Edit Distance or LCS. For example, we show that if we can shave an arbitrarily large polylog factor from n^2 for Edit Distance then NEXP does not have non-uniform NC^1 circuits. A more fine-grained examination shows that even shaving a $\log^c{n}$ factor, for a specific constant $c \approx 10^3$, already implies new circuit lower bounds.

BookDOI
01 Jan 2015
TL;DR: This chapter introduces pattern recognition as a computer science discipline, outlines the major differences between statistical and structural pattern recognition, and formally introduces graph-based pattern representation, complemented by a list of applications where graphs are actually employed.
Abstract: In this chapter we first introduce pattern recognition as a computer science discipline and then outline the major differences between statistical and structural pattern recognition. In particular, we discuss the advantages and drawbacks of both approaches. Eventually, graph-based pattern representation is formally introduced and complemented by a list of applications where graphs are actually employed. The remaining parts of this chapter are then dedicated to formal introductions of diverse graph matching definitions. We particularly delve into the difference between exact and inexact graph matching. Last but not least, we give a brief survey of existing graph matching methodologies that somehow differ from the approach that is actually pursued in the present book. 1.1 Pattern Recognition. The ability of recognizing patterns has been essential for our survival and thus, evolution has led to highly sophisticated neural and cognitive systems in humans for solving pattern recognition tasks [1]. In fact, humans are faced with a great diversity of pattern recognition problems in their everyday life. Examples of pattern recognition tasks, which are in the majority of cases intuitively solved, include the recognition of a written or a spoken word, the face of a friend, an object on the table, a traffic sign on the road, and many others. These simple examples illustrate the essence of pattern recognition. In the world there exist classes of patterns which are recognized by humans according to certain knowledge learned before [2]. The terminology pattern refers to any observation in the real world (e.g., an image, an object, a symbol, or a word, to name just a few). The overall aim of pattern recognition as a computer science discipline is to develop methods that are able to (partially) imitate the human capacity of perception and intelligence. In other words, pattern recognition aims at defining algorithms that automate or (at least) support the process of recognizing patterns stemming from the real world. However, pattern recognition refers to a highly complex process which cannot be solved by means of explicitly specified algorithms in general. For instance, to date one is not able to write an analytic algorithm to recognize, say, a face in a photo [3].

Journal ArticleDOI
TL;DR: The classification experiment leads to the conclusion that when the pairwise distance matrix obtained from the training data is far from definiteness, the positive definite recursive elastic kernels generally outperform the distance-substituting kernels for several classical elastic distances.
Abstract: This paper proposes some extensions to the work on kernels dedicated to string or time series global alignment based on the aggregation of scores obtained by local alignments. The extensions that we propose allow us to construct, from classical recursive definitions of elastic distances, recursive edit distance (or time-warp) kernels that are positive definite if some sufficient conditions are satisfied. The sufficient conditions we end up with are original and weaker than those proposed in earlier works, although a recursive regularizing term is required to obtain the proof of positive definiteness as a direct consequence of Haussler's convolution theorem. Furthermore, the positive definiteness is maintained when a symmetric corridor is used to reduce the search space, and thus the algorithmic complexity, which is quadratic in the worst case. The classification experiment we conducted on three classical time-warp distances (two of which are metrics), using a support vector machine classifier, leads to the conclusion that when the pairwise distance matrix obtained from the training data is far from definiteness, the positive definite recursive elastic kernels generally outperform the distance-substituting kernels for the several classical elastic distances we have tested.
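
The flavour of such recursive alignment kernels can be illustrated with a global-alignment-style recursion that sums over all warping paths instead of minimizing (in the spirit of Cuturi's global alignment kernel); this is only an illustrative sketch, not the paper's construction, and the Gaussian local kernel and bandwidth are arbitrary choices:

```python
# Global-alignment-style time-series kernel: aggregates the scores of all
# alignments instead of keeping only the best one (as DTW does). Illustrative
# sketch in the spirit of recursive elastic kernels; not the paper's definition.
from math import exp

def alignment_kernel(x, y, sigma=1.0):
    n, m = len(x), len(y)
    k = [[0.0] * (m + 1) for _ in range(n + 1)]
    k[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = exp(-((x[i - 1] - y[j - 1]) ** 2) / (2 * sigma ** 2))
            # aggregate over the three alignment moves instead of taking a min
            k[i][j] = local * (k[i - 1][j] + k[i][j - 1] + k[i - 1][j - 1])
    return k[n][m]

print(alignment_kernel([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))
```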

Journal ArticleDOI
TL;DR: A family of Gaussian elastic matching kernels is introduced to deal with the problems of time shift and nonlinear representation. Experimental results showed that the proposed methods generally outperformed state-of-the-art methods in terms of classification accuracy.

Posted Content
TL;DR: A framework for proving quadratic-time hardness of similarity measures is introduced; to apply it, it suffices to construct a single gadget that encapsulates all the expressive power necessary to emulate a reduction from satisfiability. Conditional lower bounds based on the Strong Exponential Time Hypothesis are also shown to apply to string problems that are not necessarily similarity measures.
Abstract: Classic similarity measures of strings are longest common subsequence and Levenshtein distance (i.e., the classic edit distance). A classic similarity measure of curves is dynamic time warping. These measures can be computed by simple $O(n^2)$ dynamic programming algorithms, and despite much effort no algorithms with significantly better running time are known. We prove that, even restricted to binary strings or one-dimensional curves, respectively, these measures do not have strongly subquadratic time algorithms, i.e., no algorithms with running time $O(n^{2-\varepsilon})$ for any $\varepsilon > 0$, unless the Strong Exponential Time Hypothesis fails. We generalize the result to edit distance for arbitrary fixed costs of the four operations (deletion in one of the two strings, matching, substitution), by identifying trivial cases that can be solved in constant time, and proving quadratic-time hardness on binary strings for all other cost choices. This improves and generalizes the known hardness result for Levenshtein distance [Backurs, Indyk STOC'15] by the restriction to binary strings and the generalization to arbitrary costs, and adds important problems to a recent line of research showing conditional lower bounds for a growing number of quadratic time problems. As our main technical contribution, we introduce a framework for proving quadratic-time hardness of similarity measures. To apply the framework it suffices to construct a single gadget, which encapsulates all the expressive power necessary to emulate a reduction from satisfiability. Finally, we prove quadratic-time hardness for longest palindromic subsequence and longest tandem subsequence via reductions from longest common subsequence, showing that conditional lower bounds based on the Strong Exponential Time Hypothesis also apply to string problems that are not necessarily similarity measures.

Proceedings ArticleDOI
17 Oct 2015
TL;DR: The authors show that any improvement on Valiant's parsing algorithm, either in terms of runtime or by avoiding the inefficiencies of fast matrix multiplication, would imply a breakthrough algorithm for the k-Clique problem.
Abstract: The CFG recognition problem is: given a context-free grammar G and a string w of length n, decide if w can be obtained from G. This is the most basic parsing question and is a core computer science problem. Valiant's parser from 1975 solves the problem in O(n^ω) time, where ω < 2.373 is the matrix multiplication exponent. Dozens of parsing algorithms have been proposed over the years, yet Valiant's upper bound remains unbeaten. The best combinatorial algorithms have mildly subcubic O(n^3 / log^3 n) complexity. Lee (JACM'01) provided evidence that fast matrix multiplication is needed for CFG parsing, and that very efficient and practical algorithms might be hard or even impossible to obtain. Lee showed that any algorithm for a more general parsing problem with running time O(|G| n^{3-ε}) can be converted into a surprising subcubic algorithm for Boolean Matrix Multiplication. Unfortunately, Lee's hardness result required that the grammar size be |G| = Ω(n^6). Nothing was known for the more relevant case of constant size grammars. In this work, we prove that any improvement on Valiant's algorithm, even for constant size grammars, either in terms of runtime or by avoiding the inefficiencies of fast matrix multiplication, would imply a breakthrough algorithm for the k-Clique problem: given a graph on n nodes, decide if there are k nodes that form a clique. Besides classifying the complexity of a fundamental problem, our reduction has led us to similar lower bounds for more modern and well-studied cubic time problems for which faster algorithms are highly desirable in practice: RNA Folding, a central problem in computational biology, and Dyck Language Edit Distance, answering an open question of Saha (FOCS'14).
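
For context, the cubic-time combinatorial baseline for CFG recognition is the textbook CYK dynamic program over grammars in Chomsky normal form; a minimal sketch (with an illustrative toy grammar for balanced parentheses) follows:

```python
# Textbook CYK recognizer: O(|G| * n^3) for a grammar in Chomsky normal form.
def cyk_recognizes(word, unary_rules, binary_rules, start="S"):
    # unary_rules: {(A, terminal)} meaning A -> terminal
    # binary_rules: {(A, B, C)} meaning A -> B C
    n = len(word)
    if n == 0:
        return False
    # table[i][j] = set of nonterminals deriving word[i:j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][i] = {a for (a, t) in unary_rules if t == ch}
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            for k in range(i, j):                    # split point
                for (a, b, c) in binary_rules:
                    if b in table[i][k] and c in table[k + 1][j]:
                        table[i][j].add(a)
    return start in table[0][n - 1]

# Toy CNF grammar for non-empty balanced parentheses:
# S -> L A | L R | S S,  A -> S R,  L -> "(",  R -> ")"
unary = {("L", "("), ("R", ")")}
binary = {("S", "L", "A"), ("A", "S", "R"), ("S", "L", "R"), ("S", "S", "S")}
print(cyk_recognizes("(())()", unary, binary))  # True
print(cyk_recognizes("(()", unary, binary))     # False
```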

Journal ArticleDOI
TL;DR: An optimization method is presented to learn the values of these costs such that the Hamming distance between an oracle's node correspondence and the automatically obtained correspondence is minimized. Experimental validation shows that clustering and classification experiments drastically increase their accuracy with the automatically learned costs.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The Hamming-Ipsen-Mikhailov (HIM) distance is introduced, a novel metric to quantitatively measure the difference between two graphs sharing the same vertices, to overcome the drawbacks affecting the two components when considered separately.
Abstract: Comparing and classifying graphs represent two essential steps for network analysis, across different scientific and applicative domains. Here we deal with both operations by introducing the Hamming-Ipsen-Mikhailov (HIM) distance, a novel metric to quantitatively measure the difference between two graphs sharing the same vertices. The new measure combines the local Hamming edit distance and the global Ipsen-Mikhailov spectral distance so as to overcome the drawbacks affecting the two components when considered separately. Building the kernel function derived from the HIM distance makes it possible to move from network comparison to network classification via the Support Vector Machine (SVM) algorithm. Applications of HIM-based methods on synthetic dynamical networks as well as on trade economy and diplomacy datasets demonstrate the effectiveness of HIM as a general purpose solution. An Open Source implementation is provided by the R package nettools (already configured for High Performance Computing) and the Django-Celery web interface ReNette http://renette.fbk.eu.
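
As a toy illustration of the local component only, the normalized Hamming (edit) part of such a distance between two graphs on the same vertex set can be computed directly from their adjacency matrices; the global Ipsen-Mikhailov spectral part, which HIM combines with it, requires the Laplacian spectra and is omitted here:

```python
# Normalized Hamming (edit) component of a graph distance between two graphs
# on the same vertex set, given as adjacency matrices. Toy sketch only: the
# HIM distance additionally combines this with the Ipsen-Mikhailov spectral
# distance, which is not modelled here.
import numpy as np

def hamming_graph_distance(a1, a2):
    a1, a2 = np.asarray(a1, float), np.asarray(a2, float)
    n = a1.shape[0]
    # fraction of differing (off-diagonal) entries, in [0, 1]
    return float(np.abs(a1 - a2).sum() / (n * (n - 1)))

g1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
g2 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(hamming_graph_distance(g1, g2))  # one edge differs out of 3 pairs -> 1/3
```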

Journal ArticleDOI
TL;DR: An integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes is proposed and it is demonstrated that this method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments.
Abstract: Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.

Book ChapterDOI
Jie Zhu, Wei Jiang, An Liu, Guanfeng Liu, Lei Zhao
01 Nov 2015
TL;DR: This paper aims to detect the anomalous trajectories from the trajectory dataset and proposes a novel time-dependent popular routes based algorithm to quantitatively measure the "difference" between a trajectory and a route.
Abstract: With the rapid proliferation of GPS-equipped devices, a myriad of trajectory data representing the mobility of various moving objects in two-dimensional space have been generated. In this paper, we aim to detect anomalous trajectories in a trajectory dataset and propose a novel time-dependent popular routes based algorithm. In our algorithm, spatial and temporal abnormalities are taken into consideration simultaneously to improve the accuracy of the detection. For each group of trajectories with the same source and destination, we first design a time-dependent transfer graph, and in each time period we obtain the top-k most popular routes as reference routes. A trajectory under inspection in this time period is labelled as an outlier if it differs greatly from the selected routes in both the spatial and temporal dimensions. To quantitatively measure the "difference" between a trajectory and a route, we propose a novel time-dependent distance measure which is based on edit distance in both the spatial and temporal domains. Comparative experimental results with two well-known trajectory outlier detection methods, TRAOD and IBAT, on a real dataset demonstrate the good accuracy and efficiency of the proposed algorithm.

Journal ArticleDOI
TL;DR: It is shown that restrictedly embedded subtree detection can be achieved by calculating the restricted edit distance between a candidate subtree and a data tree, and the correctness of the FRESTM algorithm is proved and the time and space complexities of the algorithm are discussed.
Abstract: We consider a new tree mining problem that aims to discover restrictedly embedded subtree patterns from a set of rooted labeled unordered trees. We study the properties of a canonical form of unordered trees, and develop new Apriori-based techniques to generate all candidate subtrees level by level through two efficient rightmost expansion operations: 1) pairwise joining and 2) leg attachment. Next, we show that restrictedly embedded subtree detection can be achieved by calculating the restricted edit distance between a candidate subtree and a data tree. These techniques are then integrated into an efficient algorithm, named frequent restrictedly embedded subtree miner (FRESTM), to solve the tree mining problem at hand. The correctness of the FRESTM algorithm is proved and the time and space complexities of the algorithm are discussed. Experimental results on synthetic and real-world data demonstrate the effectiveness of the proposed approach.

Journal ArticleDOI
TL;DR: This paper shows how the well-known bipartite graph edit distance approximation can be substantially improved with respect to distance accuracy, and introduces six different methodologies for extending the graph matching framework.

Journal ArticleDOI
TL;DR: An efficient all-mapper, BitMapper, is developed based on a new vectorized bit-vector algorithm, which simultaneously calculates the edit distance of one read to multiple locations in a given reference genome.
Abstract: As next-generation sequencing (NGS) technologies produce hundreds of millions of reads every day, a tremendous computational challenge is to map NGS reads to a given reference genome efficiently. However, existing all-mappers, which aim at finding all mapping locations of each read, are very time consuming. The majority of existing all-mappers consist of two main parts, filtration and verification. This work significantly reduces verification time, which is the dominant part of the running time. An efficient all-mapper, BitMapper, is developed based on a new vectorized bit-vector algorithm, which simultaneously calculates the edit distance of one read to multiple locations in a given reference genome. Experimental results on both simulated and real data sets show that BitMapper is from several times to an order of magnitude faster than the current state-of-the-art all-mappers, while achieving higher sensitivity, i.e., better quality solutions. We present BitMapper, which is designed to return all mapping locations of raw reads containing indels as well as mismatches. BitMapper is implemented in C under a GPL license. Binaries are freely available at http://home.ustc.edu.cn/%7Echhy .
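
The bit-vector approach referred to here is typically a Myers-style bit-parallel recurrence; below is a minimal single-pattern sketch of the global edit distance variant (following Hyyrö's formulation), not BitMapper's vectorized multi-location algorithm:

```python
# Myers/Hyyrö-style bit-vector computation of the Levenshtein distance between
# a short pattern (e.g. a read; in C the length would be bounded by the word
# size, Python ints are unbounded) and a reference window. Single-pattern,
# single-window sketch only; BitMapper's contribution is a vectorized variant
# that scores one read against many candidate locations at once.
def bitvector_edit_distance(pattern, text):
    m = len(pattern)
    if m == 0:
        return len(text)
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    vp, vn, score = mask, 0, m          # vertical +1/-1 deltas, D[m][0] = m
    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | vn
        xh = (((eq & vp) + vp) ^ vp) | eq
        hp = (vn | ~(xh | vp)) & mask
        hn = vp & xh
        if hp & high:
            score += 1
        elif hn & high:
            score -= 1
        hp = ((hp << 1) | 1) & mask     # top-row boundary contributes +1 per column
        hn = (hn << 1) & mask
        vp = (hn | ~(xv | hp)) & mask
        vn = hp & xv
    return score

print(bitvector_edit_distance("ACGT", "AGT"))        # 1
print(bitvector_edit_distance("kitten", "sitting"))  # 3
```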

Journal ArticleDOI
26 Oct 2015
TL;DR: This work presents shape edit distance (SHED), a distance measure that quantifies the amount of effort needed to transform one shape into the other, in terms of re-arranging the parts of one shape to match the parts of the other shape, as well as possibly adding and removing parts.
Abstract: Computing similarities or distances between 3D shapes is a crucial building block for numerous tasks, including shape retrieval, exploration and classification. Current state-of-the-art distance measures mostly consider the overall appearance of the shapes and are less sensitive to fine changes in shape structure or geometry. We present shape edit distance (SHED) that measures the amount of effort needed to transform one shape into the other, in terms of re-arranging the parts of one shape to match the parts of the other shape, as well as possibly adding and removing parts. The shape edit distance takes into account both the similarity of the overall shape structure and the similarity of individual parts of the shapes. We show that SHED is favorable to state-of-the-art distance measures in a variety of applications and datasets, and is especially successful in scenarios where detecting fine details of the shapes is important, such as shape retrieval and exploration.

Book ChapterDOI
13 May 2015
TL;DR: This work proposes to extend previous heuristics, which are based on bipartite assignment algorithms, by considering both less local and more accurate patterns using subgraphs defined around each node.
Abstract: Graph edit distance is a flexible graph dissimilarity measure. Unfortunately, its computation has exponential complexity in the number of nodes of the two graphs being compared. Some heuristics based on bipartite assignment algorithms have been proposed in order to approximate the graph edit distance. However, these heuristics lack accuracy, since they are based either on small patterns that provide overly local information or on walks whose tottering induces bias in the edit distance computation. In this work, we propose to extend previous heuristics by considering both less local and more accurate patterns using subgraphs defined around each node.
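
For reference, the bipartite family of heuristics being extended assigns nodes (together with some local structure) by solving a linear sum assignment over a cost matrix; the sketch below uses unit costs and node degree as the only local information, which is far cruder than the subgraph patterns proposed in the paper, and all names are illustrative:

```python
# Bipartite (Riesen/Bunke-style) approximation of graph edit distance:
# build a (n1+n2) x (n1+n2) node assignment cost matrix and solve it with the
# Hungarian algorithm. Local structure is summarized here only by node degree.
import numpy as np
from scipy.optimize import linear_sum_assignment

def approx_ged(labels1, degrees1, labels2, degrees2):
    n1, n2 = len(labels1), len(labels2)
    INF = 1e9
    cost = np.zeros((n1 + n2, n1 + n2))
    # substitution block: label cost + a crude edge-cost estimate from degrees
    for i in range(n1):
        for j in range(n2):
            cost[i, j] = (labels1[i] != labels2[j]) + abs(degrees1[i] - degrees2[j])
    # deletion block (node plus its incident edges); off-diagonal entries forbidden
    cost[:n1, n2:] = INF
    for i in range(n1):
        cost[i, n2 + i] = 1 + degrees1[i]
    # insertion block
    cost[n1:, :n2] = INF
    for j in range(n2):
        cost[n1 + j, j] = 1 + degrees2[j]
    # dummy-to-dummy block stays 0
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

print(approx_ged(["a", "b"], [1, 1], ["a", "b", "c"], [1, 2, 1]))  # approximate distance (3.0 here)
```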

Proceedings ArticleDOI
29 Jan 2015
TL;DR: This paper reports work on evaluating students' performances by comparing how far their action strings are from the action string that corresponds to the best performance, where the proximity is quantified by the edit distance between the strings.
Abstract: Students' activities in game/scenario-based tasks (G/SBTs, hereafter) can be characterized by a sequence of time-stamped actions of different types with different attributes. For a subset of the G/SBTs where only the order of the actions is of great interest, the process data can be well characterized as a string of characters (action string, hereafter) if we encode each action name as a single character. In this paper, we report our work on evaluating students' performances by comparing how far their action strings are from the action string that corresponds to the best performance, where the proximity is quantified by the edit distance between the strings. Specifically, we choose the Levenshtein distance, which is defined as the minimum number of insertions, deletions and replacements needed to convert one character string to another. Our results show a strong correlation between the edit distances and the scores obtained from the scoring rubrics of the WELL task from NAEP TEL, implying the edit distance to the best performance sequence can be considered as a new feature variable that encodes information about students' proficiency, which sheds light on the value of data driven scoring rules for test and task development and for refining the scoring rubrics.

Journal ArticleDOI
01 Feb 2015
TL;DR: This paper proposes a unified framework to support various similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance; experimental results show the method achieves high performance and outperforms state-of-the-art studies significantly.
Abstract: Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from documents. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings from documents that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard similarity) or character-based dissimilarity (e.g., edit distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming effort, the hardware requirements, and the manpower. In this paper, we propose a unified framework to support various similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance. Since many real-world applications have high-performance requirements for approximate entity extraction on data streams (e.g., Twitter), we focus on devising efficient algorithms to achieve high performance. We find that many substrings in documents have overlaps, and we can utilize the shared computation across the overlaps to avoid unnecessary redundant computation. To this end, we propose efficient filtering algorithms and develop effective pruning techniques. Experimental results show our method achieves high performance and outperforms state-of-the-art studies significantly.
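
As a point of comparison, a brute-force baseline for the edit-distance flavour of approximate entity extraction scans candidate substrings and checks each dictionary entry within a threshold; this is exactly the kind of redundant computation that the paper's filtering and shared-computation techniques are designed to avoid (function names and the threshold are illustrative):

```python
# Naive approximate dictionary-based entity extraction under edit distance:
# for every substring whose length is within tau of an entity's length, check
# whether its edit distance to that entity is at most tau. This is the
# redundant baseline that efficient filtering/pruning techniques avoid.
def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j - 1] + (cs != ct), prev[j] + 1, curr[-1] + 1))
        prev = curr
    return prev[-1]

def extract(document, dictionary, tau=1):
    hits = []
    for entity in dictionary:
        lo, hi = len(entity) - tau, len(entity) + tau
        for start in range(len(document)):
            for length in range(max(lo, 1), hi + 1):
                sub = document[start:start + length]
                if len(sub) == length and edit_distance(sub, entity) <= tau:
                    hits.append((start, sub, entity))
    return hits

print(extract("we met jon smith yesterday", ["john smith"], tau=1))
# [(7, 'jon smith', 'john smith')]
```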

Journal ArticleDOI
TL;DR: A method is proposed to perform active and interactive graph matching, in which an active module queries one of the nodes of a graph and an oracle returns the node of the other graph it should be mapped to.

Journal ArticleDOI
TL;DR: This paper shows how one can preprocess T in O(z log n) time and space such that later, given a pattern P[1..m] and an edit distance k, one can perform approximate pattern matching in O(…).