
Showing papers on "Edit distance published in 2017"


Journal ArticleDOI
TL;DR: Edlib is presented, an open-source C/C++ library for exact pairwise sequence alignment using edit distance and is expected to be easily adopted as a building block for future bioinformatics tools.
Abstract: Summary We present Edlib, an open-source C/C++ library for exact pairwise sequence alignment using edit distance. We compare Edlib to other libraries and show that it is the fastest while not lacking in functionality and can also easily handle very large sequences. Being easy to use, flexible, fast and low on memory usage, we expect it to be easily adopted as a building block for future bioinformatics tools. Availability and implementation Source code, installation instructions and test data are freely available for download at https://github.com/Martinsos/edlib, under the MIT licence. Edlib is implemented in C/C++ and supported on Linux, MS Windows, and Mac OS. Contact mile.sikic@fer.hr. Supplementary information Supplementary data are available at Bioinformatics online.
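To make the measure concrete, here is a minimal two-row dynamic-programming sketch of unit-cost edit (Levenshtein) distance in Python. It is only a reference implementation of the quantity Edlib computes; the library itself exposes a C/C++ API and uses a much faster bit-vector (Myers-style) algorithm internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via two-row dynamic programming.

    A reference implementation of the measure Edlib computes; the library
    itself relies on bit-parallel techniques for speed.
    """
    prev = list(range(len(b) + 1))          # distances against the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # character of a unmatched
                            curr[j - 1] + 1,             # character of b unmatched
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

assert edit_distance("ACGT", "AGT") == 1
```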

179 citations


Journal ArticleDOI
TL;DR: This work formally shows that the GED, restricted to the paths in this family, is equivalent to a quadratic assignment problem, and proposes to compute an approximate solution by adapting two algorithms: Integer Projected Fixed Point method and Graduated Non Convexity and Concavity Procedure.

89 citations


Journal ArticleDOI
TL;DR: In this paper, an exact formula for the maximum number of common supersequences shared by sequences at a certain edit distance was introduced, yielding an upper bound on the number of distinct traces necessary to guarantee exact reconstruction.
Abstract: This paper studies problems in data reconstruction, an important area with numerous applications. In particular, we examine the reconstruction of binary and nonbinary sequences from synchronization (insertion/deletion-correcting) codes. These sequences have been corrupted by a fixed number of symbol insertions (larger than the minimum edit distance of the code), yielding a number of distinct traces to be used for reconstruction. We wish to know the minimum number of traces needed for exact reconstruction. This is a general version of a problem tackled by Levenshtein for uncoded sequences. We introduce an exact formula for the maximum number of common supersequences shared by sequences at a certain edit distance, yielding an upper bound on the number of distinct traces necessary to guarantee exact reconstruction. Without specific knowledge of the code words, this upper bound is tight. We apply our results to the famous single deletion/insertion-correcting Varshamov–Tenengolts (VT) codes and show that a significant number of VT code word pairs achieve the worst case number of outputs needed for exact reconstruction. We also consider extensions to other channels, such as adversarial deletion and insertion/deletion channels and probabilistic channels.

77 citations


Journal ArticleDOI
TL;DR: This paper presents a user-centered system for signature verification, one of the first systems based on a direct comparison of the elementary neuromuscular strokes detected in the handwriting, to verify the identity of the user.
Abstract: When using tablet computers, smartphones, or digital pens, human users perform movements with a stylus or their fingers that can be analyzed by the kinematic theory of rapid human movements. In this paper, we present a user-centered system for signature verification that performs such a kinematic analysis to verify the identity of the user. It is one of the first systems that is based on a direct comparison of the elementary neuromuscular strokes which are detected in the handwriting. Taking into account the number of strokes, their similarity, and their timing, the string edit distance is employed to derive a dissimilarity measure for signature verification. On several benchmark datasets, we demonstrate that this neuromuscular analysis is complementary to a well-established verification using dynamic time warping. By combining both approaches, our verifier is able to outperform current state-of-the-art results in on-line signature verification.

54 citations


Proceedings Article
01 Jan 2017
TL;DR: This work presents a novel distributed algorithm for approximately computing the underlying clusters of DNA sequences that achieves higher accuracy and a 1000x speedup on three real datasets.
Abstract: Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA data storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.
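As a point of reference for the task (not the paper's distributed method), a naive single-machine baseline clusters reads greedily by comparing each read against cluster representatives with plain edit distance; the sketch below, with an illustrative radius parameter, shows what the distributed algorithm has to approximate at scale.

```python
def edit_distance(a, b):
    # Two-row Levenshtein DP (same idea as in the Edlib sketch above).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def greedy_cluster(reads, radius):
    """Assign each read to the first cluster whose representative lies within
    `radius` edits; otherwise open a new cluster. A quadratic-time baseline --
    the paper's contribution is doing this approximately, distributed, and
    without all these pairwise edit-distance computations."""
    reps, clusters = [], []
    for r in reads:
        for idx, rep in enumerate(reps):
            if edit_distance(r, rep) <= radius:
                clusters[idx].append(r)
                break
        else:
            reps.append(r)
            clusters.append([r])
    return clusters

print(greedy_cluster(["ACGTACGT", "ACGAACGT", "TTTTGGGG", "TTTTGGCG"], radius=2))
```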

50 citations


Proceedings ArticleDOI
04 Aug 2017
TL;DR: This paper proposes an algorithm named EmbedJoin which scales very well with string length and distance threshold, built on the recent advance of metric embeddings for edit distance, and is very different from all the previous approaches.
Abstract: We study the problem of edit similarity joins, where given a set of strings and a threshold value K, we want to output all pairs of strings whose edit distances are at most K. Edit similarity join is a fundamental problem in data cleaning/integration, bioinformatics, collaborative filtering and natural language processing, and has been identified as a primitive operator for database systems. This problem has been studied extensively in the literature. However, we have observed that all the existing algorithms fall short on long strings and large distance thresholds. In this paper we propose an algorithm named EmbedJoin which scales very well with string length and distance threshold. Our algorithm is built on the recent advance of metric embeddings for edit distance, and is very different from all of the previous approaches. We demonstrate via an extensive set of experiments that EmbedJoin significantly outperforms the previous best algorithms on long strings and large distance thresholds.
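For clarity, the join itself can be stated as a brute-force baseline: verify every pair whose lengths differ by at most K. This only defines the output EmbedJoin computes, not the embedding-based algorithm.

```python
def edit_distance(a, b):
    # Two-row Levenshtein DP.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity_join(strings, k):
    """All pairs (i, j) with edit distance at most k -- the output of an edit
    similarity join. Brute force with a length filter; strings whose lengths
    differ by more than k can never be within k edits."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if abs(len(strings[i]) - len(strings[j])) <= k and \
               edit_distance(strings[i], strings[j]) <= k:
                pairs.append((i, j))
    return pairs

print(edit_similarity_join(["karolin", "kathrin", "karolina", "xyz"], k=3))
```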

47 citations


Book ChapterDOI
16 May 2017
TL;DR: An international workshop on Graph-Based Representations in Pattern Recognition and its applications in machine learning and natural language understanding.
Abstract: International Workshop on Graph-Based Representations in Pattern Recognition. GbRPR 2017: Graph-Based Representations in Pattern Recognition pp. 242-252.

45 citations


Proceedings ArticleDOI
25 Mar 2017
TL;DR: During similarity calculation, the solving algorithms for LD and LCS are optimized at the data-structure level, reducing the space complexity of the algorithm by an order of magnitude; the experimental analysis confirms the feasibility and correctness of the results.
Abstract: String similarity has very wide applications, and the algorithm based on Levenshtein Distance is a classic approach, but it falls short in universal applicability and accuracy of results. By combining it with the Longest Common Subsequence (LCS) and Longest Common Substring (LCCS), the similarity algorithm based on Levenshtein Distance is improved; the string similarity results of the improved algorithm are more distinct, reasonable, and accurate, and the algorithm has better universal applicability. Moreover, during similarity calculation the solving algorithms for LD and LCS are optimized at the data-structure level, reducing the space complexity by an order of magnitude. The experimental results are analyzed in detail and confirm the feasibility and correctness of the approach.
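The abstract does not give the exact combination formula, so the sketch below only illustrates the idea of blending a Levenshtein-based similarity with an LCS-based one; the 50/50 weight is an assumption for illustration, not the paper's weighting.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lcs_length(a, b):
    # Longest Common Subsequence length, also with two rows.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

def combined_similarity(a, b, w=0.5):
    """Blend an LD-based similarity with an LCS-based one; w = 0.5 is
    illustrative only, not the paper's weighting."""
    m = max(len(a), len(b)) or 1
    return w * (1.0 - levenshtein(a, b) / m) + (1 - w) * (lcs_length(a, b) / m)

print(combined_similarity("levenshtein", "levenstein"))
```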

44 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: A framework that exhibits barriers for truly subquadratic and deterministic algorithms with good approximation guarantees is introduced and highlights a novel connection between deterministic approximation algorithms for natural problems in P and circuit lower bounds.
Abstract: Proving hardness of approximation is a major challenge in the field of fine-grained complexity and conditional lower bounds in P. How well can the Longest Common Subsequence (LCS) or the Edit Distance be approximated by an algorithm that runs in near-linear time? In this paper, we make progress towards answering these questions. We introduce a framework that exhibits barriers for truly subquadratic and deterministic algorithms with good approximation guarantees. Our framework highlights a novel connection between deterministic approximation algorithms for natural problems in P and circuit lower bounds. In particular, we discover a curious connection of the following form: if there exists a δ > 0 such that for all ε > 0 there is a deterministic (1+ε)-approximation algorithm for LCS on two sequences of length n over an alphabet of size n^{o(1)} that runs in O(n^{2-δ}) time, then a certain plausible hypothesis is refuted, and the class E^NP does not have non-uniform linear size Valiant Series-Parallel circuits. Thus, designing a "truly subquadratic PTAS" for LCS is as hard as resolving an old open question in complexity theory.

41 citations


Journal ArticleDOI
TL;DR: In an experimental evaluation on the IAM graph database repository, it is demonstrated that the proposed quadratic-time methods perform equally well or, quite surprisingly, in some cases even better than the cubic-time method.

40 citations


Journal ArticleDOI
TL;DR: The context of this competition, the metrics and datasets used for evaluation, and the results obtained by the eight submitted methods are presented.

Journal ArticleDOI
TL;DR: By using the grid, a route similarity ranking can be computed in real time on the Mopsi2014 route dataset, which consists of over 6,000 routes; the ranking is an extension of the most similar route search and contains an ordered list of all similar routes from the database.
Abstract: Grids are commonly used as histograms to process spatial data in order to detect frequent patterns, predict destinations, or to infer popular places. However, they have not been previously used for GPS trajectory similarity searches or retrieval in general. Instead, slower and more complicated algorithms based on individual point-pair comparison have been used. We demonstrate how a grid representation can be used to compute four different route measures: novelty, noteworthiness, similarity, and inclusion. The measures may be used in several applications such as identifying taxi fraud, automatically updating GPS navigation software, optimizing traffic, and identifying commuting patterns. We compare our proposed route similarity measure, C-SIM, to eight popular alternatives including Edit Distance on Real sequence (EDR) and Fréchet distance. The proposed measure is simple to implement and we give a fast, linear time algorithm for the task. It works well under noise, changes in sampling rate, and point shifting. We demonstrate that by using the grid, a route similarity ranking can be computed in real time on the Mopsi2014 route dataset, which consists of over 6,000 routes. This ranking is an extension of the most similar route search and contains an ordered list of all similar routes from the database. The real-time search is due to indexing the cell database and comes at the cost of spending 80% more memory space for the index. The methods are implemented inside the Mopsi route module.
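The abstract does not spell out C-SIM, so the sketch below only illustrates the grid idea it builds on: reduce each route to the set of grid cells it visits and compare cell sets, here with a Jaccard-style overlap (an assumed stand-in, not the paper's exact measure).

```python
def route_to_cells(route, cell_size):
    """Map a route (a list of (x, y) points, e.g. projected GPS fixes) to the
    set of grid cells it passes through."""
    return {(int(x // cell_size), int(y // cell_size)) for x, y in route}

def grid_route_similarity(route_a, route_b, cell_size=100.0):
    """Jaccard overlap of visited cells -- an illustrative grid-based measure
    in the spirit of the paper, not necessarily C-SIM itself."""
    ca = route_to_cells(route_a, cell_size)
    cb = route_to_cells(route_b, cell_size)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 1.0

print(grid_route_similarity([(0, 0), (150, 40), (320, 90)],
                            [(10, 5), (160, 55), (310, 80)]))
```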

Posted Content
TL;DR: Empirical evaluation of an attention-based neural machine translation model by allowing it to access an entire training set of parallel sentence pairs even after training shows that the proposed approach significantly outperforms the baseline approach and the improvement is more significant when more relevant sentence pairs were retrieved.
Abstract: In this paper, we extend an attention-based neural machine translation (NMT) model by allowing it to access an entire training set of parallel sentence pairs even after training. The proposed approach consists of two stages. In the first stage (the retrieval stage), an off-the-shelf, black-box search engine is used to retrieve a small subset of sentence pairs from a training set given a source sentence. These pairs are further filtered using a fuzzy matching score based on edit distance. In the second stage (the translation stage), a novel translation model, called translation memory enhanced NMT (TM-NMT), seamlessly uses both the source sentence and a set of retrieved sentence pairs to perform the translation. Empirical evaluation on three language pairs (En-Fr, En-De, and En-Es) shows that the proposed approach significantly outperforms the baseline approach and that the improvement is more significant when more relevant sentence pairs are retrieved.
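The fuzzy matching score is described only as being based on edit distance; a common translation-memory normalization (assumed here for illustration, not quoted from the paper) scores a retrieved source sentence by the token-level edit distance to the input, normalized by the longer sentence.

```python
def edit_distance(a, b):
    # Two-row Levenshtein DP; works on lists of tokens as well as strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fuzzy_match_score(source, candidate):
    """1.0 for identical sentences, 0.0 for completely different ones.
    A common normalization; the paper's exact scoring function may differ."""
    s, c = source.split(), candidate.split()
    return 1.0 - edit_distance(s, c) / max(len(s), len(c), 1)

# Keep only retrieved pairs whose source side is close enough to the input
# (the 0.5 cutoff is illustrative):
# kept = [(src, tgt) for src, tgt in retrieved if fuzzy_match_score(query, src) >= 0.5]
```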

Journal ArticleDOI
06 Mar 2017-Sensors
TL;DR: Three distances and their corresponding computation methods are proposed in this paper; experiments show that the SDTW algorithm exhibits about 57%, 86%, and 31% better accuracy than the longest common subsequence algorithm (LCSS), the edit distance on real sequence algorithm (EDR), and DTW, respectively, and that its sensitivity to noisy data is lower than that of those algorithms.
Abstract: With the rapid spread of built-in GPS handheld smart devices, the trajectory data from GPS sensors has grown explosively. Trajectory data has spatio-temporal characteristics and rich information. Using trajectory data processing techniques can mine the patterns of human activities and the moving patterns of vehicles in intelligent transportation systems. A trajectory similarity measure is one of the most important issues in trajectory data mining (clustering, classification, frequent pattern mining, etc.). Unfortunately, the main similarity measure algorithms for trajectory data have been found to be inaccurate, highly sensitive to sampling methods, and of low robustness to noisy data. To solve the above problems, three distances and their corresponding computation methods are proposed in this paper. The point-segment distance decreases the sensitivity to the point sampling methods. The prediction distance optimizes the temporal distance with the features of trajectory data. The segment-segment distance introduces the trajectory shape factor into the similarity measurement to improve the accuracy. The three kinds of distance are integrated with the traditional dynamic time warping (DTW) algorithm to propose a new segment-based dynamic time warping algorithm (SDTW). The experimental results show that the SDTW algorithm exhibits about 57%, 86%, and 31% better accuracy than the longest common subsequence algorithm (LCSS), the edit distance on real sequence algorithm (EDR), and DTW, respectively, and that its sensitivity to noisy data is lower than that of those algorithms.
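Of the three proposed distances, the point-segment distance is the easiest to make concrete; a standard point-to-segment computation (my reading of the idea, not necessarily the paper's exact definition) is sketched below.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b: project p onto
    the segment's supporting line, clamp the projection to the segment, and
    measure the remaining distance."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:                 # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

print(point_segment_distance((0, 1), (-1, 0), (1, 0)))   # 1.0
```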

Journal ArticleDOI
10 Oct 2017-PLOS ONE
TL;DR: An efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs), in which all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing the shared memory.
Abstract: Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results on real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over a sequential algorithm on a CPU and a previous parallel approximate string matching algorithm on GPUs, respectively.
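A plain CPU reference for matching with k differences is the classic semi-global dynamic program below (each column depends only on the previous one, which is what the GPU version parallelizes); the warp-shuffle data sharing itself is CUDA-specific and not shown here.

```python
def k_differences_matches(pattern, text, k):
    """End positions in `text` where `pattern` occurs with at most k edits
    (semi-global alignment: an occurrence may start anywhere in the text).
    Plain column-by-column DP; the paper's contribution is an efficient GPU
    evaluation of this table using warp-shuffle data exchange."""
    m = len(pattern)
    col = list(range(m + 1))      # D[i][0] = i; the top row D[0][j] stays 0
    ends = []
    for j, tc in enumerate(text, 1):
        new = [0]                 # an occurrence may start at position j
        for i, pc in enumerate(pattern, 1):
            new.append(min(col[i] + 1,                 # text character unmatched
                           new[i - 1] + 1,             # pattern character unmatched
                           col[i - 1] + (pc != tc)))   # substitution or match
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends

print(k_differences_matches("ACGT", "TTACGATTACTT", k=1))
```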

Posted ContentDOI
TL;DR: MAGNET is proposed, a new filtering strategy that maintains high accuracy across different edit distance thresholds and data sets and significantly improves the accuracy of pre-alignment filtering by one to two orders of magnitude.
Abstract: In the era of high throughput DNA sequencing (HTS) technologies, calculating the edit distance (i.e., the minimum number of substitutions, insertions, and deletions between a pair of sequences) for billions of genomic sequences is the computational bottleneck in today's read mappers. The shifted Hamming distance (SHD) algorithm proposes a fast filtering strategy that can rapidly filter out invalid mappings that have more edits than allowed. However, SHD shows high inaccuracy in its filtering by admitting invalid mappings to be marked as correct ones. This wastes execution time and imposes a large computational burden. In this work, we comprehensively investigate four sources that lead to the filtering inaccuracy. We propose MAGNET, a new filtering strategy that maintains high accuracy across different edit distance thresholds and data sets. It significantly improves the accuracy of pre-alignment filtering by one to two orders of magnitude. The MATLAB implementations of MAGNET and SHD are open source and available at: this https URL.

Journal ArticleDOI
TL;DR: This paper proposes two different approximation methods to securely compute the edit distance among genomic sequences and uses shingling, private set intersection methods, the banded alignment algorithm, and garbled circuits to implement these methods.
Abstract: Edit distance is a well established metric to quantify how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. It is utilized in the domain of human genomic sequence similarity as it captures the requirements and leads to a better diagnosis of diseases. However, in addition to the computational complexity due to the large genomic sequence length, the privacy of these sequences is highly important. As these genomic sequences are unique and can identify an individual, they cannot be shared in plaintext. In this paper, we propose two different approximation methods to securely compute the edit distance among genomic sequences. We use shingling, private set intersection methods, the banded alignment algorithm, and garbled circuits to implement these methods. We experimentally evaluate these methods and discuss both advantages and limitations. Experimental results show that our first approximation method is fast and achieves similar accuracy compared to existing techniques. However, for longer genomic sequences, both the existing techniques and our proposed first method are unable to achieve a good accuracy. On the other hand, our second approximation method is able to achieve higher accuracy on such datasets. However, the second method is relatively slower than the first proposed method. The proposed algorithms are generally accurate, time-efficient and can be applied individually and jointly as they have complementary properties (runtime vs. accuracy) on different types of datasets.
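Stripping away the cryptographic machinery, the first approximation rests on comparing k-mer (shingle) sets; in the protocol the overlap is obtained through private set intersection so neither party sees the other's sequence. The non-private core of that idea looks roughly like this (all constants illustrative).

```python
def shingles(seq, k=4):
    """Set of k-mers (shingles) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def approx_dissimilarity(a, b, k=4):
    """Shingle-overlap proxy for edit distance: the smaller the k-mer overlap,
    the more edits are likely to separate the sequences. In the paper the
    intersection size is computed under a private set intersection protocol;
    this sketch shows only the underlying non-private scoring."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 1.0
    return 1.0 - len(sa & sb) / len(sa | sb)   # 0 = near-identical, 1 = very different

print(approx_dissimilarity("ACGTACGTAAGG", "ACGTACGTAAGC"))
```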

Posted ContentDOI
08 Nov 2017-bioRxiv
TL;DR: An algorithm is introduced to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V| + m|E|) time and O(|V|) space.
Abstract: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. Given the wide range of applications of this basic problem, it is surprising that algorithms with optimal runtime are, to the best of our knowledge, yet unknown. In particular, aligning sequences to cyclic graphs currently represents a challenge both in theory and practice. Here, we introduce an algorithm to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V| + m|E|) time and O(|V|) space. The corresponding alignment can be obtained in the same runtime using O(√m |V|) space. The time complexity depends only on the length of the sequence and the size of the graph. In particular, it does not depend on the cyclicity of the graph, or any other topological features.
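To make the recurrence concrete, here is the underlying dynamic program restricted to acyclic graphs processed in topological order (one character label per node); the paper's contribution is achieving the same O(|V| + m|E|) bound for arbitrary, possibly cyclic graphs, which this simplified sketch does not handle.

```python
def seq_to_dag_edit_distance(seq, labels, preds_of, topo_order):
    """Minimum edit distance between `seq` and any path in a node-labeled DAG.

    labels:     dict node -> single character label
    preds_of:   dict node -> list of predecessor nodes
    topo_order: nodes listed so that every predecessor appears first

    D[v][j] = min edit distance between seq[:j] and some path ending at v.
    """
    m = len(seq)
    INF = float("inf")
    D = {v: [INF] * (m + 1) for v in topo_order}
    for v in topo_order:
        for j in range(m + 1):
            if j == 0:
                best = 1                                   # path [v] vs the empty prefix
            else:
                # Path starting at v: delete seq[:j-1], align v's label to seq[j-1].
                best = (j - 1) + (labels[v] != seq[j - 1])
                best = min(best, D[v][j - 1] + 1)          # seq[j-1] left unmatched
            for u in preds_of.get(v, []):
                best = min(best, D[u][j] + 1)              # node v left unmatched
                if j > 0:
                    best = min(best, D[u][j - 1] + (labels[v] != seq[j - 1]))
            D[v][j] = best
    return min(D[v][m] for v in topo_order)

labels = {"n1": "A", "n2": "C", "n3": "G"}
preds = {"n2": ["n1"], "n3": ["n2"]}
print(seq_to_dag_edit_distance("ACG", labels, preds, ["n1", "n2", "n3"]))   # 0
```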

Journal ArticleDOI
TL;DR: A novel approach to compute the global delay in subquadratic time using a fast Fourier transform (FFT) is developed and it is demonstrated how to validate the consistency of pairwise matchings by computing matchings between more than two trajectories.
Abstract: The analysis of interaction between movement trajectories is of interest for various domains when movement of multiple objects is concerned. Interaction often includes a delayed response, making it difficult to detect interaction with current methods that compare movement at specific time intervals. We propose analyses and visualizations, on a local and global scale, of delayed movement responses, where an action is followed by a reaction over time, on trajectories recorded simultaneously. We developed a novel approach to compute the global delay in subquadratic time using a fast Fourier transform (FFT). Central to our local analysis of delays is the computation of a matching between the trajectories in a so-called delay space. It encodes the similarities between all pairs of points of the trajectories. In the visualization, the edges of the matching are bundled into patches, such that shape and color of a patch help to encode changes in an interaction pattern. To evaluate our approach experimentally, we have implemented it as a prototype visual analytics tool and have applied the tool on three bidimensional data sets. For this we used various measures to compute the delay space, including the directional distance, a new similarity measure, which captures more complex interactions by combining directional and spatial characteristics. We compare matchings of various methods computing similarity between trajectories. We also compare various procedures to compute the matching in the delay space, specifically the Fréchet distance, dynamic time warping (DTW), and edit distance (ED). Finally, we demonstrate how to validate the consistency of pairwise matchings by computing matchings between more than two trajectories.
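The subquadratic global-delay computation uses the standard FFT trick: the delay between two series shows up as the peak of their cross-correlation, which the FFT evaluates in O(n log n). A minimal one-dimensional sketch with NumPy (the paper works on trajectories and a richer delay space) is shown below.

```python
import numpy as np

def global_delay(x, y):
    """Number of samples by which series y trails series x, estimated from the
    peak of the FFT-based cross-correlation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x) + len(y) - 1
    # Cross-correlation via the convolution theorem: IFFT(FFT(x) * conj(FFT(y))).
    corr = np.fft.irfft(np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n)), n)
    k = int(np.argmax(corr))            # circular index of the correlation peak
    lag = k if k < len(x) else k - n    # map the circular index to a signed shift
    return -lag                         # positive result: y is a delayed copy of x

t = np.linspace(0, 6 * np.pi, 400)
x = np.sin(t)
print(global_delay(x, np.roll(x, 7)))   # 7
```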

Journal ArticleDOI
Minghe Yu1, Jin Wang1, Guoliang Li1, Yong Zhang1, Dong Deng1, Jianhua Feng1 
01 Apr 2017
TL;DR: This work recursively partitions strings into disjoint segments and builds a hierarchical segment tree index, develops effective pruning techniques to further improve performance, and extends the techniques to support the disk-based setting.
Abstract: String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient for either the former or the latter; most of them cannot support both variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. Then, we utilize the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings that are likely to be similar to the query, utilize these strings to estimate an upper bound, which is used to prune dissimilar strings, and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve the performance. To support large data sets, we extend our techniques to support the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on the two problems and outperforms state-of-the-art algorithms by 5-10 times.
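The segment idea rests on a pigeonhole argument: if two strings are within edit distance tau and one of them is split into tau+1 disjoint segments, at most tau segments can be touched by edits, so at least one segment occurs verbatim in the other string. A minimal filter built on that observation (only the filtering principle, not the HS-Tree index) is sketched below.

```python
def split_segments(s, parts):
    """Split s into `parts` disjoint, contiguous segments of near-equal length."""
    q, r = divmod(len(s), parts)
    segs, start = [], 0
    for i in range(parts):
        end = start + q + (1 if i < r else 0)
        segs.append(s[start:end])
        start = end
    return segs

def may_be_within(query, candidate, tau):
    """Pigeonhole filter: if edit_distance(query, candidate) <= tau, at least
    one of tau+1 disjoint segments of `query` appears verbatim in `candidate`.
    Passing the filter is necessary but not sufficient, so survivors still
    need an exact edit-distance verification."""
    return any(seg and seg in candidate for seg in split_segments(query, tau + 1))

print(may_be_within("similarity", "simularity", tau=1))   # True (edit distance is 1)
```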

Journal ArticleDOI
TL;DR: An anomaly detection model based on encoder-decoder framework with recurrent neural network (RNN) is proposed that is able to successfully capture anomalies with a precision higher than 95%.

Proceedings ArticleDOI
01 Nov 2017
TL;DR: This paper presents the LSDE string representation and its application to handwritten word spotting and shows how such a representation produces a more semantically interpretable retrieval from the user's perspective than other state of the art ones such as PHOC and DCToW.
Abstract: In this paper we present the LSDE string representation and its application to handwritten word spotting. LSDE is a novel embedding approach for representing strings that learns a space in which distances between projected points are correlated with the Levenshtein edit distance between the original strings. We show how such a representation produces a more semantically interpretable retrieval from the user's perspective than other state-of-the-art ones such as PHOC and DCToW. We also conduct a preliminary handwritten word spotting experiment on the George Washington dataset.

Posted Content
TL;DR: This paper gives the first truly sub-cubic, i.e. O(n^{3-ε})-time, algorithm for the bounded-difference (min,+)-product, answering an open problem of Chan and Lewenstein, and uses it to obtain the first truly sub-cubic algorithms for Language Edit Distance, RNA-folding, and Optimum Stack Generation.
Abstract: It is a major open problem whether the $(\min,+)$-product of two $n\times n$ matrices has a truly sub-cubic (i.e. $O(n^{3-\epsilon})$ for $\epsilon>0$) time algorithm, in particular since it is equivalent to the famous All-Pairs-Shortest-Paths problem (APSP) in $n$-vertex graphs. Some restrictions of the $(\min,+)$-product to special types of matrices are known to admit truly sub-cubic algorithms, each giving rise to a special case of APSP that can be solved faster. In this paper we consider a new, different and powerful restriction in which all matrix entries are integers and one matrix can be arbitrary, as long as the other matrix has "bounded differences" in either its columns or rows, i.e. any two consecutive entries differ by only a small amount. We obtain the first truly sub-cubic algorithm for this bounded-difference $(\min,+)$-product (answering an open problem of Chan and Lewenstein). Our new algorithm, combined with a strengthening of an approach of L.~Valiant for solving context-free grammar parsing with matrix multiplication, yields the first truly sub-cubic algorithms for the following problems: Language Edit Distance (a major problem in the parsing community), RNA-folding (a major problem in bioinformatics) and Optimum Stack Generation (answering an open problem of Tarjan).
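For reference, the (min,+)-product itself is the cubic-time operation below; the paper's result is that when one input matrix has bounded differences between consecutive entries, this product (and, through it, Language Edit Distance, RNA-folding, and Optimum Stack Generation) admits a truly sub-cubic algorithm.

```python
def min_plus_product(A, B):
    """Naive (min,+)-product of two n x n matrices:
    C[i][k] = min over j of A[i][j] + B[j][k].  Cubic-time baseline."""
    n = len(A)
    return [[min(A[i][j] + B[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]

A = [[0, 3], [2, 0]]
B = [[0, 1], [5, 0]]
print(min_plus_product(A, B))   # [[0, 1], [2, 0]]
```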

Journal ArticleDOI
TL;DR: The prospects of the method to help enforcing moderation rules of obscenity expressions or as a preprocessing mechanism for sequence cleaning and/or feature extraction in more sophisticated text categorization techniques are discussed.
Abstract: Obscenity (the use of rude words or offensive expressions) has spread from informal verbal conversations to digital media, becoming increasingly common on user-generated comments found in Web forums, newspaper user boards, social networks, blogs, and media-sharing sites. The basic obscenity-blocking mechanism is based on verbatim comparisons against a blacklist of banned vocabulary; however, creative users circumvent these filters by obfuscating obscenity with symbol substitutions or bogus segmentations that still visually preserve the original semantics, such as writing shit as dhi;t or s.h.i.t or even worse mixing them as d.hm.i.t. The number of potential obfuscated variants is combinatorial, yielding the verbatim filter impractical. Here we describe a method intended to obstruct this anomaly inspired by sequence alignment algorithms used in genomics, coupled with a tailor-made edit penalty function. The method only requires to set up the vocabulary of plain obscenities; no further training is needed. Its complexity on screening a single obscenity is linear, both in runtime and memory, on the length of the user-generated text. We validated the method on three different experiments. The first one involves a new dataset that is also introduced in this article; it consists of a set of manually annotated real-life comments in Spanish, gathered from the news user boards of an online newspaper, containing this type of obfuscation. The second one is a publicly available dataset of comments in Portuguese from a sports Web site. In these experiments, at the obscenity level, we observed recall rates greater than 90%, whereas precision rates varied between 75% and 95%, depending on their sequence length (shorter lengths yielded a higher number of false alarms). On the other hand, at the comment level, we report recall of 86%, precision of 91%, and specificity of 98%. The last experiment revealed that the method is more effective in matching this type of obfuscation compared to the classical Levenshtein edit distance. We conclude discussing the prospects of the method to help enforcing moderation rules of obscenity expressions or as a preprocessing mechanism for sequence cleaning and/or feature extraction in more sophisticated text categorization techniques.

Journal ArticleDOI
06 Mar 2017-PLOS ONE
TL;DR: It is concluded that DTW-based distances provide a useful metric for the automated identification of fungi based on HRM curves of the ITS region and that this provides the foundation for a robust and automatable method applicable to the clinical setting.
Abstract: Fungal infections are a global problem imposing considerable disease burden. One of the unmet needs in addressing these infections is rapid, sensitive diagnostics. A promising molecular diagnostic approach is high-resolution melt analysis (HRM). However, there has been little effort in leveraging HRM data for automated, objective identification of fungal species. The purpose of these studies was to assess the utility of distance methods developed for comparison of time series data to classify HRM curves as a means of fungal species identification. Dynamic time warping (DTW), first introduced in the context of speech recognition to identify temporal distortion of similar sounds, is an elastic distance measure that has been successfully applied to a wide range of time series data. Comparison of HRM curves of the rDNA internal transcribed spacer (ITS) region from 51 strains of 18 fungal species using DTW distances allowed accurate classification and clustering of all 51 strains. The utility of DTW distances for species identification was demonstrated by matching HRM curves from 243 previously identified clinical isolates against a database of curves from standard reference strains. The results revealed a number of prior misclassifications, discriminated species that are not resolved by routine phenotypic tests, and accurately identified all 243 test strains. In addition to DTW, several other distance functions, Edit Distance on Real sequence (EDR) and Shape-based Distance (SBD), showed promise. It is concluded that DTW-based distances provide a useful metric for the automated identification of fungi based on HRM curves of the ITS region and that this provides the foundation for a robust and automatable method applicable to the clinical setting.
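For concreteness, the DTW distance used to compare melt curves is the classic elastic measure below; absolute difference as the local cost is a common default, and the study's pipeline adds curve preprocessing and database matching around it.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two numeric series."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # stretch a
                                 D[i][j - 1],       # stretch b
                                 D[i - 1][j - 1])   # advance both
    return D[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4]))   # 0.0 -- warping absorbs the repeat
```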

Proceedings ArticleDOI
01 Jun 2017
TL;DR: This work presents an efficient algorithm for MU codes with linear encoding and decoding complexity, shows an efficient construction of these codes with nearly optimal redundancy, and draws connections to the problem of comma-free and prefix-synchronized codes.
Abstract: Mutually Uncorrelated (MU) codes are a class of codes in which no proper prefix of one codeword is a suffix of another codeword. These codes were originally studied for synchronization purposes and recently, Yazdi et al. showed their applicability to enable random access in DNA storage. In this work we follow the research of Yazdi et al. and study MU codes along with their extensions to correct errors and balanced codes. We first review a well known construction of MU codes and study the asymptotic behavior of its cardinality. Then, we present an efficient algorithm for MU codes with linear encoding and decoding complexity. Next, we extend these results for (d_h, d_m)-MU codes that impose a minimum Hamming distance of d_h between different codewords and d_m between prefixes and suffixes. Particularly we show an efficient construction of these codes with nearly optimal redundancy and draw connections to the problem of comma-free and prefix synchronized codes. Lastly, we provide similar results for the edit distance and balanced MU codes.
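The defining MU property is straightforward to check directly; a minimal predicate for a candidate codebook is sketched below (the paper's constructions, cardinality analysis, and error-correcting extensions go well beyond this check).

```python
def is_mutually_uncorrelated(code):
    """True iff no proper, non-empty prefix of any codeword equals a suffix of
    any codeword (including the codeword itself) -- the MU property used for
    synchronization and DNA random access."""
    for x in code:
        for y in code:
            for k in range(1, len(x)):        # proper, non-empty prefixes of x
                if x[:k] == y[-k:]:
                    return False
    return True

print(is_mutually_uncorrelated(["ACT", "AGT"]))   # True
print(is_mutually_uncorrelated(["ACT", "TGA"]))   # False: prefix "T" of "TGA" is a suffix of "ACT"
```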

Journal ArticleDOI
Hao Liu1, Qingjie Zhao1, Hao Wang1, Peng Lv1, Yanming Chen1 
TL;DR: An image-based algorithm using improved Edit distance for near-duplicate video retrieval and localization and a detect-and-refine-strategy-based dynamic programming algorithm is proposed to generate the path matrix, which can be used to aggregate scores for video similarity measure and localize the similar parts.
Abstract: The rapid development of social network in recent years has spurred enormous growth of near-duplicate videos. The existence of huge volumes of near-duplicates shows a rising demand on effective near-duplicate video retrieval technique in copyright violation and search result reranking. In this paper, we propose an image-based algorithm using improved Edit distance for near-duplicate video retrieval and localization. By regarding video sequences as strings, Edit distance is used and extended to retrieve and localize near-duplicate videos. Firstly, bag-of-words (BOW) model is utilized to measure the frame similarities, which is robust to spatial transformations. Then, non-near-duplicate videos are filtered out by computing the proposed relative Edit distance similarity (REDS). Next, a detect-and-refine-strategy-based dynamic programming algorithm is proposed to generate the path matrix, which can be used to aggregate scores for video similarity measure and localize the similar parts. Experiments on CC_WEB_VIDEO and TREC CBCD 2011 datasets demonstrated the effectiveness and robustness of the proposed method in retrieval and localization tasks.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: In this paper, the authors present deterministic algorithms for computing DTW and GED between two sequences of n points in R in O(n^2 log log log n / log log n) time.
Abstract: Dynamic Time Warping (DTW) and Geometric Edit Distance (GED) are basic similarity measures between curves or general temporal sequences (e.g., time series) that are represented as sequences of points in some metric space (X, dist). The DTW and GED measures are massively used in various fields of computer science and computational biology; consequently, the tasks of computing these measures are among the core problems in P. Despite extensive efforts to find more efficient algorithms, the best-known algorithms for computing the DTW or GED between two sequences of points in X = R^d are long-standing dynamic programming algorithms that require quadratic runtime, even for the one-dimensional case d = 1, which is perhaps one of the most used in practice. In this paper, we break the nearly 50-year-old quadratic time bound for computing DTW or GED between two sequences of n points in R, by presenting deterministic algorithms that run in O(n^2 log log log n / log log n) time. Our algorithms can be extended to work also for higher dimensional spaces R^d, for any constant d, when the underlying distance-metric dist is polyhedral (e.g., L_1, L_infty).

Posted Content
TL;DR: The fastest known algorithm for tree edit distance runs in cubic $O(n^3)$ time and is based on a similar dynamic programming solution as string edit distance.
Abstract: The edit distance between two rooted ordered trees with $n$ nodes labeled from an alphabet~$\Sigma$ is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. Tree edit distance is a well known generalization of string edit distance. The fastest known algorithm for tree edit distance runs in cubic $O(n^3)$ time and is based on a similar dynamic programming solution as string edit distance. In this paper we show that a truly subcubic $O(n^{3-\varepsilon})$ time algorithm for tree edit distance is unlikely: For $|\Sigma| = \Omega(n)$, a truly subcubic algorithm for tree edit distance implies a truly subcubic algorithm for the all pairs shortest paths problem. For $|\Sigma| = O(1)$, a truly subcubic algorithm for tree edit distance implies an $O(n^{k-\varepsilon})$ algorithm for finding a maximum weight $k$-clique. Thus, while in terms of upper bounds string edit distance and tree edit distance are highly related, in terms of lower bounds string edit distance exhibits the hardness of the strong exponential time hypothesis [Backurs, Indyk STOC'15] whereas tree edit distance exhibits the hardness of all pairs shortest paths. Our result provides a matching conditional lower bound for one of the last remaining classic dynamic programming problems.

Journal ArticleDOI
TL;DR: A set of feature vectors (FVs) which are based on shape geometry (SG) decoding of the input character which are represented as the string of shape operators are proposed and evaluated using the characters extracted from printed aged multilingual Indian documents having English, Devanagari, and Marathi scripts.
Abstract: Multilingual character recognition from the images of aged Indian documents is challenging because of the complex character grapheme of the Indian language scripts. Feature extraction plays the most important role in recognition of such images. In this paper, we have proposed a set of feature vectors (FVs) which are based on shape geometry (SG) decoding of the input character. The first FV is based on SG decoding of the input character using triangular area (TA) calculation. The second FV, namely, SG using perpendicular distance is extracted by dividing the input image into individual components and the shape of the individual component is decoded into shape symbols by comparing the normalized perpendicular distances of the individual pixels of the component onto the line joining the end points of the component. Apart from the proposed FVs, we have used crossing count features. These FVs are represented as the string of shape operators; hence, we have used minimum edit distance classifier to recog...