Topic

Edit distance

About: Edit distance is a research topic. Over its lifetime, 2,887 publications have been published within this topic, receiving 71,491 citations.


Papers
Book Chapter
09 Jul 2007
TL;DR: The paper identifies a pair of similar problems on compressed texts (equivalence checking and Hamming distance computation) that have radically different complexity.
Abstract: What kind of operations can we perform effectively (without full unpacking) with compressed texts? In this paper we consider three fundamental problems: (1) check the equality of two compressed texts, (2) check whether one compressed text is a substring of another compressed text, and (3) compute the number of different symbols (Hamming distance) between two compressed texts of the same length. We present an algorithm that solves the first problem in O(n^3) time and the second problem in O(n^2 m) time. Here n is the size of compressed representation (we consider representations by straight-line programs) of the text and m is the size of compressed representation of the pattern. Next, we prove that the third problem is actually #P-complete. Thus, we indicate a pair of similar problems (equivalence checking, Hamming distance computation) that have radically different complexity on compressed texts. Our algorithmic technique used for problems (1) and (2) helps for computing minimal periods and covers of compressed texts.

129 citations
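
To make the setting of the abstract above concrete, here is a minimal Python sketch of a straight-line program (SLP) and of the naive way to get a Hamming distance from it: unpack both texts and compare symbol by symbol. This is not the paper's algorithm. The rule names (`X0`, `Y3`, ...) and the helpers `slp_length`, `expand`, and `hamming` are hypothetical, chosen only for illustration. The point is that the text length can be found without unpacking, whereas this baseline needs the full (possibly exponentially long) expansion.

```python
def slp_length(slp, start):
    """Length of the text generated by `start`, computed without unpacking."""
    cache = {}

    def go(sym):
        if sym not in cache:
            rule = slp[sym]
            # A rule is either one terminal character or a pair of rule names.
            cache[sym] = 1 if isinstance(rule, str) else go(rule[0]) + go(rule[1])
        return cache[sym]

    return go(start)


def expand(slp, start):
    """Fully unpack the text; its length can be exponential in the SLP size."""
    rule = slp[start]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs texts of the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical example grammars: five rules each, eight characters of text.
slp_a = {"X0": "a", "X1": "b", "X2": ("X0", "X1"),
         "X3": ("X2", "X2"), "X4": ("X3", "X3")}      # "abababab"
slp_b = {"Y0": "a", "Y1": "b", "Y2": ("Y0", "Y1"), "Y2r": ("Y1", "Y0"),
         "Y3": ("Y2", "Y2r"), "Y4": ("Y3", "Y3")}     # "abbaabba"

if __name__ == "__main__":
    print(slp_length(slp_a, "X4"))                              # 8
    print(hamming(expand(slp_a, "X4"), expand(slp_b, "Y4")))    # 4
```

Each doubling rule such as `X4 = X3 X3` doubles the text length while adding one rule, which is why unpacking is the step a compressed-text algorithm tries to avoid.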

Proceedings Article
12 Oct 2015
TL;DR: This paper proposes GENSETS, a genome-wide, privacy-preserving similar patient query system able to support searching large-scale, distributed genome databases across the nation, and implements a prototype of GENSETS; the underlying private edit distance protocols combine a novel genomic edit distance approximation algorithm with a new construction of private set difference size protocols.
Abstract: Edit distance has been proven to be an important and frequently used metric in human genomic research, with Similar Patient Query (SPQ) being a particularly promising and attractive example. However, due to the widespread privacy concerns on revealing personal genomic data, the scope and scale of many novel uses of genome edit distance are substantially limited. While the problem of private genomic edit distance has been studied by the research community for over a decade [6], the state-of-the-art solution [31] is far from applicable to real genome sequences. In this paper, we propose several private edit distance protocols that feature unprecedentedly high efficiency and precision. Our construction is a combination of a novel genomic edit distance approximation algorithm and a new construction of private set difference size protocols. With the private edit distance based secure SPQ primitive, we propose GENSETS, a genome-wide, privacy-preserving similar patient query system. It is able to support searching large-scale, distributed genome databases across the nation. We have implemented a prototype of GENSETS. The experimental results show that, with a 100 Mbps network connection, it would take GENSETS less than 200 minutes to search through 1 million breast cancer patients (distributed nation-wide in 250 hospitals, each having 4000 patients), based on edit distances between their genomes of lengths about 75 million nucleotides each.

128 citations

Patent
09 Jul 1999
TL;DR: A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and/or symbols, or sequences thereof, in a text T, together with combined edit distance metrics for approximate matching between the text T and a query Q.
Abstract: A search system for information retrieval includes a data structure in the form of a non-evenly spaced sparse suffix tree for storing suffixes of words and/or symbols, or sequences thereof, in a text T, a metric M including combined edit distance metrics for an approximate degree of matching respectively between words and/or symbols, or between sequences thereof, in the text T and a query Q, the latter distance metric including weighting cost functions for edit operations which transform a sequence S of the text into a sequence P of the query Q, and search algorithms for determining the degree of matching respectively between words and/or symbols, or between sequences thereof, in respectively the text T and the query Q, such that information R is retrieved with a specified degree of matching with the query Q. Optionally, the search system also includes algorithms for determining exact matching such that information R may be retrieved with an exact degree of matching with the query Q.

128 citations
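
The idea of retrieving text that matches a query up to a weighted edit cost can be illustrated with a classic dynamic-programming scan (often attributed to Sellers). The sketch below is an assumption-laden stand-in, not the patented system: it uses no suffix tree at all, and the function name `approx_find` and the parameters `sub_cost`, `indel_cost`, and `max_cost` are hypothetical.

```python
def approx_find(text, pattern, max_cost, sub_cost=1.0, indel_cost=1.0):
    """Return (end, cost) pairs: some substring of `text` ending just before
    index `end` can be edited into `pattern` at total cost <= max_cost."""
    m = len(pattern)
    # col[i] = cheapest alignment of pattern[:i] with a substring of the text
    # that ends at the current text position.
    col = [i * indel_cost for i in range(m + 1)]
    hits = []
    for end, ch in enumerate(text, start=1):
        prev_diag = col[0]   # col[i-1] as it was in the previous column
        col[0] = 0.0         # a match is allowed to start anywhere in the text
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + (0.0 if pattern[i - 1] == ch else sub_cost),
                old + indel_cost,         # text character left unmatched
                col[i - 1] + indel_cost,  # pattern character left unmatched
            )
            prev_diag = old
        if col[m] <= max_cost:
            hits.append((end, col[m]))
    return hits


if __name__ == "__main__":
    print(approx_find("the quick brown fox", "quick", max_cost=0))  # [(9, 0.0)]
    print(approx_find("the quick brown fox", "qick", max_cost=1))
```

The scan costs time proportional to `len(text) * len(pattern)`; an index such as the sparse suffix tree described in the abstract is one way to avoid scanning the whole text for every query.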

Journal Article
TL;DR: Previous work on structural entropy is applied to the metamorphic detection problem; the technique, which relies on an analysis of variations in the complexity of data within a file, is shown to obtain strong results in certain challenging cases.
Abstract: Metamorphic malware is capable of changing its internal structure without altering its functionality. A common signature is nonexistent in highly metamorphic malware and, consequently, such malware can remain undetected under standard signature scanning. In this paper, we apply previous work on structural entropy to the metamorphic detection problem. This technique relies on an analysis of variations in the complexity of data within a file. The process consists of two stages, namely, file segmentation and sequence comparison. In the segmentation stage, we use entropy measurements and wavelet analysis to segment files. The second stage measures the similarity of file pairs by computing an edit distance between the sequences of segments obtained in the first stage. We apply this similarity measure to the metamorphic detection problem and show that we obtain strong results in certain challenging cases.

128 citations
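
The two-stage process in the abstract above (segment a file by entropy, then compare segment sequences by edit distance) can be sketched in a heavily simplified form. The assumptions are significant: fixed-size windows replace the paper's wavelet-based segmentation, entropy values are quantized into a handful of levels, and the names `window_entropy`, `quantize`, and `similarity` are hypothetical.

```python
import math
from collections import Counter


def window_entropy(data, window=256):
    """Shannon entropy (bits per byte) of each fixed-size window of `data`."""
    result = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        n = len(chunk)
        counts = Counter(chunk)
        result.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return result


def quantize(entropies, levels=8):
    """Map entropies in [0, 8] bits to integer levels 0..levels-1."""
    return [min(levels - 1, int(h / 8.0 * levels)) for h in entropies]


def levenshtein(a, b):
    """Unit-cost edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(file_a, file_b):
    """1.0 for identical entropy profiles, lower as the profiles diverge."""
    seq_a = quantize(window_entropy(file_a))
    seq_b = quantize(window_entropy(file_b))
    longest = max(len(seq_a), len(seq_b)) or 1
    return 1.0 - levenshtein(seq_a, seq_b) / longest


if __name__ == "__main__":
    low = bytes(256)           # one repeated byte: entropy 0
    high = bytes(range(256))   # every byte value once: entropy 8
    print(similarity(low + high, low + high))  # 1.0
    print(similarity(low + high, high + low))  # 0.0
```

Because the comparison works on entropy levels rather than on the bytes themselves, two files with different instructions but a similar layout of low- and high-entropy regions can still score as similar, which is the intuition the abstract describes.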

Journal Article
TL;DR: A novel approach for finding similar trajectories is described, in which trajectories are segmented according to movement parameters (MPs) such as speed, acceleration, or direction, and a modified version of edit distance called normalized weighted edit distance (NWED) is introduced as a similarity measure.
Abstract: This article describes a novel approach for finding similar trajectories, using trajectory segmentation based on movement parameters (MPs) such as speed, acceleration, or direction. First, a segmentation technique is applied to decompose trajectories into a set of segments with homogeneous characteristics with respect to a particular MP. Each segment is assigned to a movement parameter class (MPC), representing the behavior of the MP. Accordingly, the segmentation procedure transforms a trajectory to a sequence of class labels, that is, a symbolic representation. A modified version of edit distance called normalized weighted edit distance (NWED) is introduced as a similarity measure between different sequences. As an application, we demonstrate how the method can be employed to cluster trajectories. The performance of the approach is assessed in two case studies using real movement datasets from two different application domains, namely, North Atlantic Hurricane trajectories and GPS tracks of couriers in London. Three different experiments have been conducted that respond to different facets of the proposed techniques and that compare our NWED measure to a related method.

128 citations
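
The pipeline in the abstract above (movement parameter, class labels, weighted and normalized edit distance) can be sketched as follows. The abstract does not give the NWED formula, so everything specific here is an assumption made for illustration: the speed thresholds, the substitution cost proportional to the gap between classes, the unit insertion and deletion cost, and normalization by the longer sequence. The names `speed_classes`, `segments`, and `nwed` are hypothetical.

```python
from itertools import groupby


def speed_classes(speeds, thresholds=(5.0, 15.0)):
    """Label each speed sample with a class index 0..len(thresholds)."""
    return [sum(s >= t for t in thresholds) for s in speeds]


def segments(labels):
    """Collapse each run of equal labels into one symbol per segment."""
    return [label for label, _ in groupby(labels)]


def nwed(a, b, num_classes=3):
    """Weighted edit distance between class sequences, scaled to [0, 1].

    Assumed weights: substituting class x by y costs |x - y| / (num_classes - 1);
    inserting or deleting a segment costs 1.
    """
    if not a and not b:
        return 0.0
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [float(i)]
        for j, y in enumerate(b, start=1):
            sub = abs(x - y) / (num_classes - 1)
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, prev[j - 1] + sub))
        prev = cur
    # No operation costs more than 1, so the distance is at most the
    # length of the longer sequence.
    return prev[-1] / max(len(a), len(b))


if __name__ == "__main__":
    track = segments(speed_classes([1, 2, 8, 9, 20]))
    print(track)                     # [0, 1, 2]
    print(nwed(track, [0, 1, 2]))    # 0.0
    print(nwed(track, [0, 2, 2]))
```

Weighting substitutions by class gap means that confusing "slow" with "medium" is penalized less than confusing "slow" with "fast", which is one plausible reason to prefer a weighted distance over unit costs for ordered classes.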


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  39
2022  96
2021  111
2020  149
2019  145
2018  139