Topic

Edit distance

About: Edit distance is a research topic. Over the lifetime, 2887 publications have been published within this topic receiving 71491 citations.


Papers
Proceedings Article
17 Jun 2018
TL;DR: This paper uses the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding criteria, and algorithms motivated by them, and demonstrates improvement in terms of edit distance error.
Abstract: The problem of reconstructing a sequence when observed through multiple looks over deletion channels occurs in “de novo” DNA sequencing. The DNA could be sequenced multiple times, yielding several “looks” of it, but each time the sequencer could be noisy with (independent) deletion impairments. The main goal of this paper is to develop reconstruction algorithms for a sequence observed through the lens of a fixed number of deletion channels. We use the probabilistic model of the deletion channels to develop both symbol-wise and sequence maximum likelihood decoding criteria, and algorithms motivated by them. Numerical evaluations demonstrate improvement in terms of edit distance error over earlier algorithms.
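As a rough illustration of the setting (not the paper's decoding algorithms), the sketch below simulates several independent "looks" through a deletion channel and scores each one against the source by edit distance. The function names, the deletion probability and the example sequence are assumptions made for this sketch.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def deletion_channel(seq: str, p: float, rng: random.Random) -> str:
    """Drop each symbol independently with probability p (hypothetical model)."""
    return "".join(ch for ch in seq if rng.random() >= p)


rng = random.Random(0)                      # fixed seed so the run is repeatable
source = "ACGTTGCAAGCT"                     # made-up example sequence
looks = [deletion_channel(source, 0.2, rng) for _ in range(3)]
errors = [edit_distance(source, look) for look in looks]
```

Because every look is a subsequence of the source, its edit distance to the source equals the number of deleted symbols. A reconstruction algorithm would combine the looks into one estimate and be judged by the same distance.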

33 citations

Book Chapter
13 May 2015
TL;DR: This work extends earlier bipartite-assignment heuristics for approximating graph edit distance by considering patterns that are both less local and more accurate, using subgraphs defined around each node.
Abstract: Graph edit distance is a flexible graph dissimilarity measure. Unfortunately, computing it has exponential complexity in the number of nodes of the two graphs being compared. Some heuristics based on bipartite assignment algorithms have been proposed to approximate the graph edit distance. However, these heuristics lack accuracy, since they are based either on small patterns that provide overly local information or on walks whose tottering induces some bias in the edit distance calculation. In this work, we propose to extend previous heuristics by considering both less local and more accurate patterns using subgraphs defined around each node.
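The general idea behind bipartite-assignment heuristics can be sketched as follows. This is a simplified illustration, not the method of the chapter: the local pattern is only a node label plus its degree (the chapter uses richer subgraphs), all edit costs are assumed to be 1, graphs are assumed simple and undirected, and the assignment is solved by brute force over permutations, which only works for very small graphs. All names are hypothetical.

```python
from itertools import permutations

INF = float("inf")


def bipartite_ged_upper_bound(labels1, edges1, labels2, edges2):
    """Upper bound on graph edit distance from a node assignment.

    Nodes are indexed 0..n-1 and 0..m-1; edges are pairs of node indices.
    """
    n, m = len(labels1), len(labels2)
    deg1, deg2 = [0] * n, [0] * m
    for u, v in edges1:
        deg1[u] += 1
        deg1[v] += 1
    for u, v in edges2:
        deg2[u] += 1
        deg2[v] += 1

    size = n + m  # square cost matrix: substitutions, deletions, insertions

    def cost(i, j):
        if i < n and j < m:   # substitute node i by node j
            return (labels1[i] != labels2[j]) + abs(deg1[i] - deg2[j])
        if i < n:             # delete node i (only its own dummy column is allowed)
            return 1 + deg1[i] if j - m == i else INF
        if j < m:             # insert node j (only its own dummy row is allowed)
            return 1 + deg2[j] if i - n == j else INF
        return 0              # dummy matched to dummy

    # Brute-force assignment; a real implementation would use the
    # Hungarian algorithm or a similar polynomial-time solver.
    best = min(permutations(range(size)),
               key=lambda p: sum(cost(i, p[i]) for i in range(size)))

    # Cost of the edit path induced by the chosen node mapping.
    mapping = {i: best[i] for i in range(n) if best[i] < m}
    node_cost = sum(labels1[i] != labels2[j] for i, j in mapping.items())
    node_cost += (n - len(mapping)) + (m - len(mapping))
    target_edges = {frozenset(e) for e in edges2}
    preserved = sum(
        1 for u, v in edges1
        if u in mapping and v in mapping
        and frozenset((mapping[u], mapping[v])) in target_edges
    )
    edge_cost = (len(edges1) - preserved) + (len(target_edges) - preserved)
    return node_cost + edge_cost


path3 = (["A", "B", "C"], [(0, 1), (1, 2)])
path2 = (["A", "B"], [(0, 1)])
bound = bipartite_ged_upper_bound(*path3, *path2)
```

The assignment is optimal only for the local costs, so the returned value is the cost of one valid edit path and therefore an upper bound on the true distance. Richer local patterns aim to make the local costs a better proxy for the global one.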

33 citations

Journal Article
TL;DR: Five commonly used measures of trajectory similarity are introduced: dynamic time warping (DTW), longest common subsequence (LCSS), edit distance for real sequences (EDR), Fréchet distance and nearest neighbour distance (NND), of which only NND is routinely used by ecologists.
Abstract: Identifying and understanding patterns in movement data are amongst the principal aims of movement ecology. By quantifying the similarity of movement trajectories, inferences can be made about diverse processes, ranging from individual specialisation to the ontogeny of foraging strategies. Movement analysis is not unique to ecology, however, and methods for estimating the similarity of movement trajectories have been developed in other fields but are currently under-utilised by ecologists. Here, we introduce five commonly used measures of trajectory similarity: dynamic time warping (DTW), longest common subsequence (LCSS), edit distance for real sequences (EDR), Fréchet distance and nearest neighbour distance (NND), of which only NND is routinely used by ecologists. We investigate the performance of each of these measures by simulating movement trajectories using an Ornstein-Uhlenbeck (OU) model in which we varied the following parameters: (1) the point of attraction, (2) the strength of attraction to this point and (3) the noise or volatility added to the movement process in order to determine which measures were most responsive to such changes. In addition, we demonstrate how these measures can be applied using movement trajectories of breeding northern gannets (Morus bassanus) by performing trajectory clustering on a large ecological dataset. Simulations showed that DTW and Fréchet distance were most responsive to changes in movement parameters and were able to distinguish between all the different parameter combinations we trialled. In contrast, NND was the least sensitive measure trialled. When applied to our gannet dataset, the five similarity measures were highly correlated despite differences in their underlying calculation. Clustering of trajectories within and across individuals allowed us to easily visualise and compare patterns of space use over time across a large dataset.
Trajectory clusters reflected the bearing on which birds departed the colony and highlighted the use of well-known bathymetric features. As both the volume of movement data and the need to quantify similarity amongst animal trajectories grow, the measures described here and the bridge they provide to other fields of research will become increasingly useful in ecology.
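Two of the measures named above, EDR and DTW, are short dynamic programs. The sketch below is a minimal version for 2-D trajectories given as lists of (x, y) pairs; the matching threshold `eps`, the function names and the sample points are assumptions for illustration and are not taken from the article.

```python
from math import hypot, inf


def edr(r, s, eps):
    """Edit distance on real sequences: points match if both coordinates
    differ by at most eps; every mismatch, insert or delete costs 1."""
    n, m = len(r), len(s)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = (abs(r[i - 1][0] - s[j - 1][0]) <= eps
                     and abs(r[i - 1][1] - s[j - 1][1]) <= eps)
            dp[i][j] = min(
                dp[i - 1][j - 1] + (0 if match else 1),  # match or substitute
                dp[i - 1][j] + 1,                        # skip a point of r
                dp[i][j - 1] + 1,                        # skip a point of s
            )
    return dp[n][m]


def dtw(r, s):
    """Dynamic time warping with Euclidean point-to-point cost."""
    n, m = len(r), len(s)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = hypot(r[i - 1][0] - s[j - 1][0], r[i - 1][1] - s[j - 1][1])
            dp[i][j] = d + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]


track_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
track_b = [(0.0, 0.1), (1.0, 0.1), (5.0, 5.0), (2.0, 0.1)]  # one outlier fix
edr_ab = edr(track_a, track_b, eps=0.25)
dtw_ab = dtw(track_a, track_b)
```

EDR counts the outlier as a single edit regardless of how far away it is, whereas DTW adds its full distance to the total, which is one way the two measures can respond differently to noise.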

33 citations

Patent
02 Aug 2006
TL;DR: A document image analysis system for recognizing and processing scanned forms, with the ultimate goal of handling all tax forms, including those with handwritten material. Its OCR engine is trained with a variety of Roman text fonts and uses a back-end dictionary, customizable per field, with exact and edit distance lookup over a trie.
Abstract: A proprietary suite of underlying document image analysis capabilities, including a novel forms enhancement, segmentation and modeling component, forms recognition and optical character recognition. Future versions of the system will include form reasoning to detect and classify fields on forms with varying layouts. The product provides acquisition, modeling, recognition and processing components, and has the ability to verify recognized data on the image with a line-by-line comparison. The key enabling technologies center around the recognition and processing of the scanned forms. The system learns the positions of lines and the location of text on the pre-printed form, and associates various regions of the form with specific required fields in the electronic version. Once the form is recognized, the preprinted material is removed and individual regions are passed to an optical character recognition component. The current proprietary OCR engine is trained with a variety of Roman text fonts and has a back-end dictionary that can be customized to account for the fact that the system knows which field it is recognizing. The engine performs segmentation to obtain isolated characters and computes a structure-based feature vector. The characters are normalized and classified using a cluster-centric classifier, which responds well to variations in the symbol's contour. An efficient dictionary lookup scheme provides exact and edit distance lookup using a trie structure. An edit distance is computed and a collection of near misses can be output in a lattice to enhance the final recognition result. The current classification rate can exceed 99% with context. The ultimate goal of this system is to enable the processing of all tax forms, including forms with handwritten material.
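Edit distance lookup over a trie is commonly done by carrying one row of the Levenshtein table down each branch and pruning branches that can no longer come within the allowed distance. The sketch below shows that general technique; it is not the patented scheme, and the class, function names and word list are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.word = None     # set when a dictionary word ends at this node


def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def fuzzy_lookup(root, query, max_dist):
    """Return (word, distance) pairs with Levenshtein distance <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Extend the DP table by one row for the character on this edge.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(
                row[j - 1] + 1,                         # insertion
                prev_row[j] + 1,                        # deletion
                prev_row[j - 1] + (query[j - 1] != ch)  # substitution or match
            ))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # The row minimum never decreases further down, so prune on it.
        if min(row) <= max_dist:
            for c, child in node.children.items():
                walk(child, c, row)

    first_row = list(range(len(query) + 1))
    for c, child in root.children.items():
        walk(child, c, first_row)
    return sorted(results, key=lambda pair: (pair[1], pair[0]))


field_dictionary = ["income", "interest", "dividend", "deduction", "wages"]
trie = build_trie(field_dictionary)
near_misses = fuzzy_lookup(trie, "incone", max_dist=1)
```

With `max_dist=0` the same walk degenerates to an exact lookup, which is one reason a single trie can serve both modes. The returned list of near misses plays the role of the candidate set that a recognizer could rank using context.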

33 citations

Patent
05 Apr 1995
TL;DR: A method for comparing an electronic handwritten pattern to a stored string, in which a linear systolic array processor determines the edit distance between the string and the pattern by generating a plurality of edit distance components at each comparison and selecting the minimum.
Abstract: Apparatus and a method for comparing an electronic handwritten pattern to a stored string are provided. The string includes a group of portions, each having at least one stroke. Movement of a stylus forms the pattern, and a sequence of strokes is generated. Each stroke represents a stylus movement within a predetermined alphabet. The sequence of strokes has a plurality of portions. A linear systolic array processor determines an edit distance between the string and the pattern. The processor compares a first portion of the string to a first portion of the pattern. A plurality of edit distance components are generated based on the comparison. Each component corresponds to a different set of operations that transforms the first portion of the stored string into the first portion of the pattern. The components are calculated based on a further comparison between additional portions of the stored string and the pattern. The component which has a minimum value is selected. The comparison is performed between each respective portion of the pattern and the corresponding portion of the stored string. The total edit distance is based on the component selected during a last comparison between a last portion of the stored string and a last portion of the pattern.
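A linear array of processing elements suits this computation because every cell on one anti-diagonal of the edit distance table depends only on the two previous anti-diagonals, so those cells can be evaluated at the same time. The sketch below walks the table in that order sequentially, over plain characters rather than strokes; it illustrates the data dependency only and is not the patented apparatus. The function name is hypothetical.

```python
def edit_distance_wavefront(a, b):
    """Levenshtein distance computed anti-diagonal by anti-diagonal."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(n + m + 1):                    # anti-diagonal: i + j == k
        # Cells in this inner loop are mutually independent; in hardware,
        # each could be handled by its own processing element in one step.
        for i in range(max(0, k - m), min(n, k) + 1):
            j = k - i
            if i == 0:
                dp[i][j] = j                      # insert the first j symbols
            elif j == 0:
                dp[i][j] = i                      # delete the first i symbols
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,             # deletion (diagonal k - 1)
                    dp[i][j - 1] + 1,             # insertion (diagonal k - 1)
                    dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # diagonal k - 2
                )
    return dp[n][m]


distance = edit_distance_wavefront("kitten", "sitting")
```

Each cell takes the minimum of three candidate costs, which mirrors the selection of the minimum-valued component described in the abstract, and the final cell holds the total edit distance.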

33 citations


Network Information
Related Topics (5)
Graph (abstract data type): 69.9K papers, 1.2M citations, 86% related
Unsupervised learning: 22.7K papers, 1M citations, 81% related
Feature vector: 48.8K papers, 954.4K citations, 81% related
Cluster analysis: 146.5K papers, 2.9M citations, 81% related
Scalability: 50.9K papers, 931.6K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    39
2022    96
2021    111
2020    149
2019    145
2018    139