
Showing papers on "Edit distance published in 2017"


Journal ArticleDOI
TL;DR: Edlib is presented, an open-source C/C++ library for exact pairwise sequence alignment using edit distance and is expected to be easily adopted as a building block for future bioinformatics tools.
Abstract: Summary We present Edlib, an open-source C/C++ library for exact pairwise sequence alignment using edit distance. We compare Edlib to other libraries and show that it is the fastest while not lacking in functionality and can also easily handle very large sequences. Being easy to use, flexible, fast and low on memory usage, we expect it to be easily adopted as a building block for future bioinformatics tools. Availability and implementation Source code, installation instructions and test data are freely available for download at https://github.com/Martinsos/edlib, under the MIT licence. Edlib is implemented in C/C++ and supported on Linux, MS Windows, and Mac OS. Contact mile.sikic@fer.hr. Supplementary information Supplementary data are available at Bioinformatics online.
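To make the measure concrete, here is a minimal two-row dynamic-programming sketch of unit-cost edit (Levenshtein) distance in Python. It is only a reference implementation of the quantity Edlib computes; the library itself exposes a C/C++ API and uses a much faster bit-vector (Myers-style) algorithm internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via two-row dynamic programming.

    A reference implementation of the measure Edlib computes; the library
    itself relies on bit-parallel techniques for speed.
    """
    prev = list(range(len(b) + 1))          # distances against the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # character of a unmatched
                            curr[j - 1] + 1,             # character of b unmatched
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

assert edit_distance("ACGT", "AGT") == 1
```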

179 citations


Journal ArticleDOI
TL;DR: This work formally shows that the GED, restricted to the paths in this family, is equivalent to a quadratic assignment problem, and proposes to compute an approximate solution by adapting two algorithms: Integer Projected Fixed Point method and Graduated Non Convexity and Concavity Procedure.

89 citations


Journal ArticleDOI
TL;DR: In this paper, an exact formula for the maximum number of common supersequences shared by sequences at a certain edit distance was introduced, yielding an upper bound on the number of distinct traces necessary to guarantee exact reconstruction.
Abstract: This paper studies problems in data reconstruction, an important area with numerous applications. In particular, we examine the reconstruction of binary and nonbinary sequences from synchronization (insertion/deletion-correcting) codes. These sequences have been corrupted by a fixed number of symbol insertions (larger than the minimum edit distance of the code), yielding a number of distinct traces to be used for reconstruction. We wish to know the minimum number of traces needed for exact reconstruction. This is a general version of a problem tackled by Levenshtein for uncoded sequences. We introduce an exact formula for the maximum number of common supersequences shared by sequences at a certain edit distance, yielding an upper bound on the number of distinct traces necessary to guarantee exact reconstruction. Without specific knowledge of the code words, this upper bound is tight. We apply our results to the famous single deletion/insertion-correcting Varshamov–Tenengolts (VT) codes and show that a significant number of VT code word pairs achieve the worst case number of outputs needed for exact reconstruction. We also consider extensions to other channels, such as adversarial deletion and insertion/deletion channels and probabilistic channels.

77 citations


Journal ArticleDOI
TL;DR: This paper presents a user-centered system for signature verification, one of the first systems based on a direct comparison of the elementary neuromuscular strokes detected in the handwriting, to verify the identity of the user.
Abstract: When using tablet computers, smartphones, or digital pens, human users perform movements with a stylus or their fingers that can be analyzed by the kinematic theory of rapid human movements. In this paper, we present a user-centered system for signature verification that performs such a kinematic analysis to verify the identity of the user. It is one of the first systems that is based on a direct comparison of the elementary neuromuscular strokes which are detected in the handwriting. Taking into account the number of strokes, their similarity, and their timing, the string edit distance is employed to derive a dissimilarity measure for signature verification. On several benchmark datasets, we demonstrate that this neuromuscular analysis is complementary to a well-established verification using dynamic time warping. By combining both approaches, our verifier is able to outperform current state-of-the-art results in on-line signature verification.

54 citations


Proceedings Article
01 Jan 2017
TL;DR: This work presents a novel distributed algorithm for approximately computing the underlying clusters of DNA sequences that achieves higher accuracy and a 1000x speedup on three real datasets.
Abstract: Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. Datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA data storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.
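As a point of reference for the task (not the paper's distributed method), a naive single-machine baseline clusters reads greedily by comparing each read against cluster representatives with plain edit distance; the sketch below, with an illustrative radius parameter, shows what the distributed algorithm has to approximate at scale.

```python
def edit_distance(a, b):
    # Two-row Levenshtein DP (same idea as in the Edlib sketch above).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def greedy_cluster(reads, radius):
    """Assign each read to the first cluster whose representative lies within
    `radius` edits; otherwise open a new cluster. A quadratic-time baseline --
    the paper's contribution is doing this approximately, distributed, and
    without all these pairwise edit-distance computations."""
    reps, clusters = [], []
    for r in reads:
        for idx, rep in enumerate(reps):
            if edit_distance(r, rep) <= radius:
                clusters[idx].append(r)
                break
        else:
            reps.append(r)
            clusters.append([r])
    return clusters

print(greedy_cluster(["ACGTACGT", "ACGAACGT", "TTTTGGGG", "TTTTGGCG"], radius=2))
```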

50 citations


Proceedings ArticleDOI
04 Aug 2017
TL;DR: This paper proposes an algorithm named EmbedJoin which scales very well with string length and distance threshold, built on the recent advance of metric embeddings for edit distance, and is very different from all the previous approaches.
Abstract: We study the problem of edit similarity joins, where given a set of strings and a threshold value K, we want to output all pairs of strings whose edit distances are at most K. Edit similarity join is a fundamental problem in data cleaning/integration, bioinformatics, collaborative filtering and natural language processing, and has been identified as a primitive operator for database systems. This problem has been studied extensively in the literature. However, we have observed that all the existing algorithms fall short on long strings and large distance thresholds. In this paper we propose an algorithm named EmbedJoin which scales very well with string length and distance threshold. Our algorithm is built on the recent advance of metric embeddings for edit distance, and is very different from all of the previous approaches. We demonstrate via an extensive set of experiments that EmbedJoin significantly outperforms the previous best algorithms on long strings and large distance thresholds.
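For clarity, the join itself can be stated as a brute-force baseline: verify every pair whose lengths differ by at most K. This only defines the output EmbedJoin computes, not the embedding-based algorithm.

```python
def edit_distance(a, b):
    # Two-row Levenshtein DP.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity_join(strings, k):
    """All pairs (i, j) with edit distance at most k -- the output of an edit
    similarity join. Brute force with a length filter; strings whose lengths
    differ by more than k can never be within k edits."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if abs(len(strings[i]) - len(strings[j])) <= k and \
               edit_distance(strings[i], strings[j]) <= k:
                pairs.append((i, j))
    return pairs

print(edit_similarity_join(["karolin", "kathrin", "karolina", "xyz"], k=3))
```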

47 citations


Book ChapterDOI
16 May 2017
TL;DR: An international workshop on Graph-Based Representations in Pattern Recognition and its applications in machine learning and natural language understanding.
Abstract: International Workshop on Graph-Based Representations in Pattern Recognition. GbRPR 2017: Graph-Based Representations in Pattern Recognition pp. 242-252.

45 citations


Proceedings ArticleDOI
25 Mar 2017
TL;DR: During similarity calculation, the solving algorithms for LD and LCS are optimized at the data-structure level, reducing the space complexity of the algorithm by an order of magnitude; the experimental analysis confirms the feasibility and correctness of the results.
Abstract: String similarity has very wide applications, and the algorithm based on Levenshtein Distance is a classic approach, but it falls short in universal applicability and accuracy of results. By combining it with the Longest Common Subsequence (LCS) and Longest Common Substring (LCCS), the similarity algorithm based on Levenshtein Distance is improved; the string similarity results of the improved algorithm are more distinct, reasonable, and accurate, and the algorithm has better universal applicability. Moreover, during similarity calculation the solving algorithms for LD and LCS are optimized at the data-structure level, reducing the space complexity by an order of magnitude. The experimental results are analyzed in detail and confirm the feasibility and correctness of the approach.
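The abstract does not give the exact combination formula, so the sketch below only illustrates the idea of blending a Levenshtein-based similarity with an LCS-based one; the 50/50 weight is an assumption for illustration, not the paper's weighting.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lcs_length(a, b):
    # Longest Common Subsequence length, also with two rows.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

def combined_similarity(a, b, w=0.5):
    """Blend an LD-based similarity with an LCS-based one; w = 0.5 is
    illustrative only, not the paper's weighting."""
    m = max(len(a), len(b)) or 1
    return w * (1.0 - levenshtein(a, b) / m) + (1 - w) * (lcs_length(a, b) / m)

print(combined_similarity("levenshtein", "levenstein"))
```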

44 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: A framework that exhibits barriers for truly subquadratic and deterministic algorithms with good approximation guarantees is introduced and highlights a novel connection between deterministic approximation algorithms for natural problems in P and circuit lower bounds.
Abstract: Proving hardness of approximation is a major challenge in the field of fine-grained complexity and conditional lower bounds in P. How well can the Longest Common Subsequence (LCS) or the Edit Distance be approximated by an algorithm that runs in near-linear time? In this paper, we make progress towards answering these questions. We introduce a framework that exhibits barriers for truly subquadratic and deterministic algorithms with good approximation guarantees. Our framework highlights a novel connection between deterministic approximation algorithms for natural problems in P and circuit lower bounds. In particular, we discover a curious connection of the following form: if there exists a δ > 0 such that for all ε > 0 there is a deterministic (1+ε)-approximation algorithm for LCS on two sequences of length n over an alphabet of size n^{o(1)} that runs in O(n^{2-δ}) time, then a certain plausible hypothesis is refuted, and the class E^NP does not have non-uniform linear size Valiant Series-Parallel circuits. Thus, designing a "truly subquadratic PTAS" for LCS is as hard as resolving an old open question in complexity theory.

41 citations


Journal ArticleDOI
TL;DR: In an experimental evaluation on the IAM graph database repository, it is demonstrated that the proposed quadratic-time methods perform equally well or, quite surprisingly, in some cases even better than the cubic-time method.

40 citations


Journal ArticleDOI
TL;DR: The context of this competition, the metrics and datasets used for evaluation, and the results obtained by the eight submitted methods are presented.

Journal ArticleDOI
TL;DR: By using the grid, a route similarity ranking can be computed in real time on the Mopsi2014 route dataset, which consists of over 6,000 routes; the ranking is an extension of the most similar route search and contains an ordered list of all similar routes from the database.
Abstract: Grids are commonly used as histograms to process spatial data in order to detect frequent patterns, predict destinations, or to infer popular places. However, they have not been previously used for GPS trajectory similarity searches or retrieval in general. Instead, slower and more complicated algorithms based on individual point-pair comparison have been used. We demonstrate how a grid representation can be used to compute four different route measures: novelty, noteworthiness, similarity, and inclusion. The measures may be used in several applications such as identifying taxi fraud, automatically updating GPS navigation software, optimizing traffic, and identifying commuting patterns. We compare our proposed route similarity measure, C-SIM, to eight popular alternatives including Edit Distance on Real sequence (EDR) and Fréchet distance. The proposed measure is simple to implement and we give a fast, linear time algorithm for the task. It works well under noise, changes in sampling rate, and point shifting. We demonstrate that by using the grid, a route similarity ranking can be computed in real time on the Mopsi2014 route dataset, which consists of over 6,000 routes. This ranking is an extension of the most similar route search and contains an ordered list of all similar routes from the database. The real-time search is due to indexing the cell database and comes at the cost of spending 80% more memory space for the index. The methods are implemented inside the Mopsi route module.
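The abstract does not spell out C-SIM, so the sketch below only illustrates the grid idea it builds on: reduce each route to the set of grid cells it visits and compare cell sets, here with a Jaccard-style overlap (an assumed stand-in, not the paper's exact measure).

```python
def route_to_cells(route, cell_size):
    """Map a route (a list of (x, y) points, e.g. projected GPS fixes) to the
    set of grid cells it passes through."""
    return {(int(x // cell_size), int(y // cell_size)) for x, y in route}

def grid_route_similarity(route_a, route_b, cell_size=100.0):
    """Jaccard overlap of visited cells -- an illustrative grid-based measure
    in the spirit of the paper, not necessarily C-SIM itself."""
    ca = route_to_cells(route_a, cell_size)
    cb = route_to_cells(route_b, cell_size)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 1.0

print(grid_route_similarity([(0, 0), (150, 40), (320, 90)],
                            [(10, 5), (160, 55), (310, 80)]))
```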

Posted Content
TL;DR: Empirical evaluation of an attention-based neural machine translation model by allowing it to access an entire training set of parallel sentence pairs even after training shows that the proposed approach significantly outperforms the baseline approach and the improvement is more significant when more relevant sentence pairs were retrieved.
Abstract: In this paper, we extend an attention-based neural machine translation (NMT) model by allowing it to access an entire training set of parallel sentence pairs even after training. The proposed approach consists of two stages. In the first stage (the retrieval stage), an off-the-shelf, black-box search engine is used to retrieve a small subset of sentence pairs from a training set given a source sentence. These pairs are further filtered using a fuzzy matching score based on edit distance. In the second stage (the translation stage), a novel translation model, called translation memory enhanced NMT (TM-NMT), seamlessly uses both the source sentence and a set of retrieved sentence pairs to perform the translation. Empirical evaluation on three language pairs (En-Fr, En-De, and En-Es) shows that the proposed approach significantly outperforms the baseline approach and that the improvement is more significant when more relevant sentence pairs are retrieved.
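The fuzzy matching score is described only as being based on edit distance; a common translation-memory normalization (assumed here for illustration, not quoted from the paper) scores a retrieved source sentence by the token-level edit distance to the input, normalized by the longer sentence.

```python
def edit_distance(a, b):
    # Two-row Levenshtein DP; works on lists of tokens as well as strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fuzzy_match_score(source, candidate):
    """1.0 for identical sentences, 0.0 for completely different ones.
    A common normalization; the paper's exact scoring function may differ."""
    s, c = source.split(), candidate.split()
    return 1.0 - edit_distance(s, c) / max(len(s), len(c), 1)

# Keep only retrieved pairs whose source side is close enough to the input
# (the 0.5 cutoff is illustrative):
# kept = [(src, tgt) for src, tgt in retrieved if fuzzy_match_score(query, src) >= 0.5]
```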

Journal ArticleDOI
06 Mar 2017-Sensors
TL;DR: Three distances and their corresponding computation methods are proposed in this paper; experiments show that the SDTW algorithm exhibits about 57%, 86%, and 31% better accuracy than the longest common subsequence algorithm (LCSS), the edit distance on real sequence algorithm (EDR), and DTW, respectively, and that its sensitivity to noisy data is lower than that of those algorithms.
Abstract: With the rapid spread of built-in GPS handheld smart devices, the trajectory data from GPS sensors has grown explosively. Trajectory data has spatio-temporal characteristics and rich information. Using trajectory data processing techniques can mine the patterns of human activities and the moving patterns of vehicles in intelligent transportation systems. A trajectory similarity measure is one of the most important issues in trajectory data mining (clustering, classification, frequent pattern mining, etc.). Unfortunately, the main similarity measure algorithms for trajectory data have been found to be inaccurate, highly sensitive to sampling methods, and of low robustness to noisy data. To solve the above problems, three distances and their corresponding computation methods are proposed in this paper. The point-segment distance decreases the sensitivity to the point sampling methods. The prediction distance optimizes the temporal distance with the features of trajectory data. The segment-segment distance introduces the trajectory shape factor into the similarity measurement to improve the accuracy. The three kinds of distance are integrated with the traditional dynamic time warping (DTW) algorithm to propose a new segment-based dynamic time warping algorithm (SDTW). The experimental results show that the SDTW algorithm exhibits about 57%, 86%, and 31% better accuracy than the longest common subsequence algorithm (LCSS), the edit distance on real sequence algorithm (EDR), and DTW, respectively, and that its sensitivity to noisy data is lower than that of those algorithms.
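Of the three proposed distances, the point-segment distance is the easiest to make concrete; a standard point-to-segment computation (my reading of the idea, not necessarily the paper's exact definition) is sketched below.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b: project p onto
    the segment's supporting line, clamp the projection to the segment, and
    measure the remaining distance."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:                 # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

print(point_segment_distance((0, 1), (-1, 0), (1, 0)))   # 1.0
```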

Journal ArticleDOI
10 Oct 2017-PLOS ONE
TL;DR: An efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs), in which all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing the shared memory.
Abstract: Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPU warp share data using the warp-shuffle operation instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experimental results on real DNA packages revealed that the proposed algorithm and its implementation achieved speedups of up to 122.64 and 1.53 times over a sequential algorithm on a CPU and a previous parallel approximate string matching algorithm on GPUs, respectively.
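A plain CPU reference for matching with k differences is the classic semi-global dynamic program below (each column depends only on the previous one, which is what the GPU version parallelizes); the warp-shuffle data sharing itself is CUDA-specific and not shown here.

```python
def k_differences_matches(pattern, text, k):
    """End positions in `text` where `pattern` occurs with at most k edits
    (semi-global alignment: an occurrence may start anywhere in the text).
    Plain column-by-column DP; the paper's contribution is an efficient GPU
    evaluation of this table using warp-shuffle data exchange."""
    m = len(pattern)
    col = list(range(m + 1))      # D[i][0] = i; the top row D[0][j] stays 0
    ends = []
    for j, tc in enumerate(text, 1):
        new = [0]                 # an occurrence may start at position j
        for i, pc in enumerate(pattern, 1):
            new.append(min(col[i] + 1,                 # text character unmatched
                           new[i - 1] + 1,             # pattern character unmatched
                           col[i - 1] + (pc != tc)))   # substitution or match
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends

print(k_differences_matches("ACGT", "TTACGATTACTT", k=1))
```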

Posted ContentDOI
TL;DR: MAGNET is proposed, a new filtering strategy that maintains high accuracy across different edit distance thresholds and data sets and significantly improves the accuracy of pre-alignment filtering by one to two orders of magnitude.
Abstract: In the era of high throughput DNA sequencing (HTS) technologies, calculating the edit distance (i.e., the minimum number of substitutions, insertions, and deletions between a pair of sequences) for billions of genomic sequences is the computational bottleneck in today's read mappers. The shifted Hamming distance (SHD) algorithm proposes a fast filtering strategy that can rapidly filter out invalid mappings that have more edits than allowed. However, SHD shows high inaccuracy in its filtering by admitting invalid mappings to be marked as correct ones. This wastes execution time and imposes a large computational burden. In this work, we comprehensively investigate four sources that lead to the filtering inaccuracy. We propose MAGNET, a new filtering strategy that maintains high accuracy across different edit distance thresholds and data sets. It significantly improves the accuracy of pre-alignment filtering by one to two orders of magnitude. The MATLAB implementations of MAGNET and SHD are open source and available at: this https URL.

Journal ArticleDOI
TL;DR: This paper proposes two different approximation methods to securely compute the edit distance among genomic sequences and uses shingling, private set intersection methods, the banded alignment algorithm, and garbled circuits to implement these methods.
Abstract: Edit distance is a well established metric to quantify how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. It is utilized in the domain of human genomic sequence similarity as it captures the requirements and leads to a better diagnosis of diseases. However, in addition to the computational complexity due to the large genomic sequence length, the privacy of these sequences is highly important. As these genomic sequences are unique and can identify an individual, they cannot be shared in plaintext. In this paper, we propose two different approximation methods to securely compute the edit distance among genomic sequences. We use shingling, private set intersection methods, the banded alignment algorithm, and garbled circuits to implement these methods. We experimentally evaluate these methods and discuss both advantages and limitations. Experimental results show that our first approximation method is fast and achieves similar accuracy compared to existing techniques. However, for longer genomic sequences, both the existing techniques and our proposed first method are unable to achieve a good accuracy. On the other hand, our second approximation method is able to achieve higher accuracy on such datasets. However, the second method is relatively slower than the first proposed method. The proposed algorithms are generally accurate, time-efficient and can be applied individually and jointly as they have complementary properties (runtime vs. accuracy) on different types of datasets.
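Stripping away the cryptographic machinery, the first approximation rests on comparing k-mer (shingle) sets; in the protocol the overlap is obtained through private set intersection so neither party sees the other's sequence. The non-private core of that idea looks roughly like this (all constants illustrative).

```python
def shingles(seq, k=4):
    """Set of k-mers (shingles) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def approx_dissimilarity(a, b, k=4):
    """Shingle-overlap proxy for edit distance: the smaller the k-mer overlap,
    the more edits are likely to separate the sequences. In the paper the
    intersection size is computed under a private set intersection protocol;
    this sketch shows only the underlying non-private scoring."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 1.0
    return 1.0 - len(sa & sb) / len(sa | sb)   # 0 = near-identical, 1 = very different

print(approx_dissimilarity("ACGTACGTAAGG", "ACGTACGTAAGC"))
```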

Posted ContentDOI
08 Nov 2017-bioRxiv
TL;DR: An algorithm is introduced to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V| + m|E|) time and O(|V|) space.
Abstract: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction, and variant calling with respect to a variation graph. Given the wide range of applications of this basic problem, it is surprising that algorithms with optimal runtime are, to the best of our knowledge, yet unknown. In particular, aligning sequences to cyclic graphs currently represents a challenge both in theory and practice. Here, we introduce an algorithm to compute the minimum edit distance of a sequence of length m to any path in a node-labeled directed graph (V, E) in O(|V| + m|E|) time and O(|V|) space. The corresponding alignment can be obtained in the same runtime using O(√m |V|) space. The time complexity depends only on the length of the sequence and the size of the graph. In particular, it does not depend on the cyclicity of the graph, or any other topological features.
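To make the recurrence concrete, here is the underlying dynamic program restricted to acyclic graphs processed in topological order (one character label per node); the paper's contribution is achieving the same O(|V| + m|E|) bound for arbitrary, possibly cyclic graphs, which this simplified sketch does not handle.

```python
def seq_to_dag_edit_distance(seq, labels, preds_of, topo_order):
    """Minimum edit distance between `seq` and any path in a node-labeled DAG.

    labels:     dict node -> single character label
    preds_of:   dict node -> list of predecessor nodes
    topo_order: nodes listed so that every predecessor appears first

    D[v][j] = min edit distance between seq[:j] and some path ending at v.
    """
    m = len(seq)
    INF = float("inf")
    D = {v: [INF] * (m + 1) for v in topo_order}
    for v in topo_order:
        for j in range(m + 1):
            if j == 0:
                best = 1                                   # path [v] vs the empty prefix
            else:
                # Path starting at v: delete seq[:j-1], align v's label to seq[j-1].
                best = (j - 1) + (labels[v] != seq[j - 1])
                best = min(best, D[v][j - 1] + 1)          # seq[j-1] left unmatched
            for u in preds_of.get(v, []):
                best = min(best, D[u][j] + 1)              # node v left unmatched
                if j > 0:
                    best = min(best, D[u][j - 1] + (labels[v] != seq[j - 1]))
            D[v][j] = best
    return min(D[v][m] for v in topo_order)

labels = {"n1": "A", "n2": "C", "n3": "G"}
preds = {"n2": ["n1"], "n3": ["n2"]}
print(seq_to_dag_edit_distance("ACG", labels, preds, ["n1", "n2", "n3"]))   # 0
```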

Journal ArticleDOI
TL;DR: A novel approach to compute the global delay in subquadratic time using a fast Fourier transform (FFT) is developed and it is demonstrated how to validate the consistency of pairwise matchings by computing matchings between more than two trajectories.
Abstract: The analysis of interaction between movement trajectories is of interest for various domains when movement of multiple objects is concerned. Interaction often includes a delayed response, making it difficult to detect interaction with current methods that compare movement at specific time intervals. We propose analyses and visualizations, on a local and global scale, of delayed movement responses, where an action is followed by a reaction over time, on trajectories recorded simultaneously. We developed a novel approach to compute the global delay in subquadratic time using a fast Fourier transform (FFT). Central to our local analysis of delays is the computation of a matching between the trajectories in a so-called delay space. It encodes the similarities between all pairs of points of the trajectories. In the visualization, the edges of the matching are bundled into patches, such that shape and color of a patch help to encode changes in an interaction pattern. To evaluate our approach experimentally, we have implemented it as a prototype visual analytics tool and have applied the tool on three bidimensional data sets. For this we used various measures to compute the delay space, including the directional distance, a new similarity measure, which captures more complex interactions by combining directional and spatial characteristics. We compare matchings of various methods computing similarity between trajectories. We also compare various procedures to compute the matching in the delay space, specifically the Fréchet distance, dynamic time warping (DTW), and edit distance (ED). Finally, we demonstrate how to validate the consistency of pairwise matchings by computing matchings between more than two trajectories.
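The subquadratic global-delay computation uses the standard FFT trick: the delay between two series shows up as the peak of their cross-correlation, which the FFT evaluates in O(n log n). A minimal one-dimensional sketch with NumPy (the paper works on trajectories and a richer delay space) is shown below.

```python
import numpy as np

def global_delay(x, y):
    """Number of samples by which series y trails series x, estimated from the
    peak of the FFT-based cross-correlation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x) + len(y) - 1
    # Cross-correlation via the convolution theorem: IFFT(FFT(x) * conj(FFT(y))).
    corr = np.fft.irfft(np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n)), n)
    k = int(np.argmax(corr))            # circular index of the correlation peak
    lag = k if k < len(x) else k - n    # map the circular index to a signed shift
    return -lag                         # positive result: y is a delayed copy of x

t = np.linspace(0, 6 * np.pi, 400)
x = np.sin(t)
print(global_delay(x, np.roll(x, 7)))   # 7
```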

Journal ArticleDOI
Minghe Yu1, Jin Wang1, Guoliang Li1, Yong Zhang1, Dong Deng1, Jianhua Feng1 
01 Apr 2017
TL;DR: This work recursively partitions strings into disjoint segments and builds a hierarchical segment tree index, develops effective pruning techniques to further improve performance, and extends the techniques to support the disk-based setting.
Abstract: String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient for either the former or the latter; most of them cannot support both variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. Then, we utilize the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings that are likely to be similar to the query, utilize these strings to estimate an upper bound, which is used to prune dissimilar strings, and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve the performance. To support large data sets, we extend our techniques to support the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on the two problems and outperforms state-of-the-art algorithms by 5-10 times.
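The segment idea rests on a pigeonhole argument: if two strings are within edit distance tau and one of them is split into tau+1 disjoint segments, at most tau segments can be touched by edits, so at least one segment occurs verbatim in the other string. A minimal filter built on that observation (only the filtering principle, not the HS-Tree index) is sketched below.

```python
def split_segments(s, parts):
    """Split s into `parts` disjoint, contiguous segments of near-equal length."""
    q, r = divmod(len(s), parts)
    segs, start = [], 0
    for i in range(parts):
        end = start + q + (1 if i < r else 0)
        segs.append(s[start:end])
        start = end
    return segs

def may_be_within(query, candidate, tau):
    """Pigeonhole filter: if edit_distance(query, candidate) <= tau, at least
    one of tau+1 disjoint segments of `query` appears verbatim in `candidate`.
    Passing the filter is necessary but not sufficient, so survivors still
    need an exact edit-distance verification."""
    return any(seg and seg in candidate for seg in split_segments(query, tau + 1))

print(may_be_within("similarity", "simularity", tau=1))   # True (edit distance is 1)
```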

Journal ArticleDOI
TL;DR: An anomaly detection model based on encoder-decoder framework with recurrent neural network (RNN) is proposed that is able to successfully capture anomalies with a precision higher than 95%.

Proceedings ArticleDOI
01 Nov 2017
TL;DR: This paper presents the LSDE string representation and its application to handwritten word spotting and shows how such a representation produces a more semantically interpretable retrieval from the user's perspective than other state of the art ones such as PHOC and DCToW.
Abstract: In this paper we present the LSDE string representation and its application to handwritten word spotting. LSDE is a novel embedding approach for representing strings that learns a space in which distances between projected points are correlated with the Levenshtein edit distance between the original strings. We show how such a representation produces a more semantically interpretable retrieval from the user's perspective than other state-of-the-art ones such as PHOC and DCToW. We also conduct a preliminary handwritten word spotting experiment on the George Washington dataset.

Posted Content
TL;DR: This paper gives the first truly sub-cubic, i.e. O(n^{3-ε})-time, algorithm for the bounded-difference (min,+)-product, answering an open problem of Chan and Lewenstein, and uses it to obtain the first truly sub-cubic algorithms for Language Edit Distance, RNA-folding, and Optimum Stack Generation.
Abstract: It is a major open problem whether the $(\min,+)$-product of two $n\times n$ matrices has a truly sub-cubic (i.e. $O(n^{3-\epsilon})$ for $\epsilon>0$) time algorithm, in particular since it is equivalent to the famous All-Pairs-Shortest-Paths problem (APSP) in $n$-vertex graphs. Some restrictions of the $(\min,+)$-product to special types of matrices are known to admit truly sub-cubic algorithms, each giving rise to a special case of APSP that can be solved faster. In this paper we consider a new, different and powerful restriction in which all matrix entries are integers and one matrix can be arbitrary, as long as the other matrix has "bounded differences" in either its columns or rows, i.e. any two consecutive entries differ by only a small amount. We obtain the first truly sub-cubic algorithm for this bounded-difference $(\min,+)$-product (answering an open problem of Chan and Lewenstein). Our new algorithm, combined with a strengthening of an approach of L.~Valiant for solving context-free grammar parsing with matrix multiplication, yields the first truly sub-cubic algorithms for the following problems: Language Edit Distance (a major problem in the parsing community), RNA-folding (a major problem in bioinformatics) and Optimum Stack Generation (answering an open problem of Tarjan).
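For reference, the (min,+)-product itself is the cubic-time operation below; the paper's result is that when one input matrix has bounded differences between consecutive entries, this product (and, through it, Language Edit Distance, RNA-folding, and Optimum Stack Generation) admits a truly sub-cubic algorithm.

```python
def min_plus_product(A, B):
    """Naive (min,+)-product of two n x n matrices:
    C[i][k] = min over j of A[i][j] + B[j][k].  Cubic-time baseline."""
    n = len(A)
    return [[min(A[i][j] + B[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]

A = [[0, 3], [2, 0]]
B = [[0, 1], [5, 0]]
print(min_plus_product(A, B))   # [[0, 1], [2, 0]]
```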

Journal ArticleDOI
TL;DR: The prospects of the method to help enforcing moderation rules of obscenity expressions or as a preprocessing mechanism for sequence cleaning and/or feature extraction in more sophisticated text categorization techniques are discussed.
Abstract: Obscenity (the use of rude words or offensive expressions) has spread from informal verbal conversations to digital media, becoming increasingly common on user-generated comments found in Web forums, newspaper user boards, social networks, blogs, and media-sharing sites. The basic obscenity-blocking mechanism is based on verbatim comparisons against a blacklist of banned vocabulary; however, creative users circumvent these filters by obfuscating obscenity with symbol substitutions or bogus segmentations that still visually preserve the original semantics, such as writing shit as dhi;t or s.h.i.t or even worse mixing them as d.hm.i.t. The number of potential obfuscated variants is combinatorial, yielding the verbatim filter impractical. Here we describe a method intended to obstruct this anomaly inspired by sequence alignment algorithms used in genomics, coupled with a tailor-made edit penalty function. The method only requires to set up the vocabulary of plain obscenities; no further training is needed. Its complexity on screening a single obscenity is linear, both in runtime and memory, on the length of the user-generated text. We validated the method on three different experiments. The first one involves a new dataset that is also introduced in this article; it consists of a set of manually annotated real-life comments in Spanish, gathered from the news user boards of an online newspaper, containing this type of obfuscation. The second one is a publicly available dataset of comments in Portuguese from a sports Web site. In these experiments, at the obscenity level, we observed recall rates greater than 90%, whereas precision rates varied between 75% and 95%, depending on their sequence length (shorter lengths yielded a higher number of false alarms). On the other hand, at the comment level, we report recall of 86%, precision of 91%, and specificity of 98%. The last experiment revealed that the method is more effective in matching this type of obfuscation compared to the classical Levenshtein edit distance. We conclude discussing the prospects of the method to help enforcing moderation rules of obscenity expressions or as a preprocessing mechanism for sequence cleaning and/or feature extraction in more sophisticated text categorization techniques.

Journal ArticleDOI
06 Mar 2017-PLOS ONE
TL;DR: It is concluded that DTW-based distances provide a useful metric for the automated identification of fungi based on HRM curves of the ITS region and that this provides the foundation for a robust and automatable method applicable to the clinical setting.
Abstract: Fungal infections are a global problem imposing considerable disease burden. One of the unmet needs in addressing these infections is rapid, sensitive diagnostics. A promising molecular diagnostic approach is high-resolution melt analysis (HRM). However, there has been little effort in leveraging HRM data for automated, objective identification of fungal species. The purpose of these studies was to assess the utility of distance methods developed for comparison of time series data to classify HRM curves as a means of fungal species identification. Dynamic time warping (DTW), first introduced in the context of speech recognition to identify temporal distortion of similar sounds, is an elastic distance measure that has been successfully applied to a wide range of time series data. Comparison of HRM curves of the rDNA internal transcribed spacer (ITS) region from 51 strains of 18 fungal species using DTW distances allowed accurate classification and clustering of all 51 strains. The utility of DTW distances for species identification was demonstrated by matching HRM curves from 243 previously identified clinical isolates against a database of curves from standard reference strains. The results revealed a number of prior misclassifications, discriminated species that are not resolved by routine phenotypic tests, and accurately identified all 243 test strains. In addition to DTW, several other distance functions, Edit Distance on Real sequence (EDR) and Shape-based Distance (SBD), showed promise. It is concluded that DTW-based distances provide a useful metric for the automated identification of fungi based on HRM curves of the ITS region and that this provides the foundation for a robust and automatable method applicable to the clinical setting.
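For concreteness, the DTW distance used to compare melt curves is the classic elastic measure below; absolute difference as the local cost is a common default, and the study's pipeline adds curve preprocessing and database matching around it.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two numeric series."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # stretch a
                                 D[i][j - 1],       # stretch b
                                 D[i - 1][j - 1])   # advance both
    return D[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4]))   # 0.0 -- warping absorbs the repeat
```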

Proceedings ArticleDOI
01 Jun 2017
TL;DR: This work presents an efficient algorithm for MU codes with linear encoding and decoding complexity, shows an efficient construction of these codes with nearly optimal redundancy, and draws connections to the problem of comma-free and prefix-synchronized codes.
Abstract: Mutually Uncorrelated (MU) codes are a class of codes in which no proper prefix of one codeword is a suffix of another codeword. These codes were originally studied for synchronization purposes and recently, Yazdi et al. showed their applicability to enable random access in DNA storage. In this work we follow the research of Yazdi et al. and study MU codes along with their extensions to correct errors and balanced codes. We first review a well known construction of MU codes and study the asymptotic behavior of its cardinality. Then, we present an efficient algorithm for MU codes with linear encoding and decoding complexity. Next, we extend these results for (d_h, d_m)-MU codes that impose a minimum Hamming distance of d_h between different codewords and d_m between prefixes and suffixes. Particularly we show an efficient construction of these codes with nearly optimal redundancy and draw connections to the problem of comma-free and prefix synchronized codes. Lastly, we provide similar results for the edit distance and balanced MU codes.
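The defining MU property is straightforward to check directly; a minimal predicate for a candidate codebook is sketched below (the paper's constructions, cardinality analysis, and error-correcting extensions go well beyond this check).

```python
def is_mutually_uncorrelated(code):
    """True iff no proper, non-empty prefix of any codeword equals a suffix of
    any codeword (including the codeword itself) -- the MU property used for
    synchronization and DNA random access."""
    for x in code:
        for y in code:
            for k in range(1, len(x)):        # proper, non-empty prefixes of x
                if x[:k] == y[-k:]:
                    return False
    return True

print(is_mutually_uncorrelated(["ACT", "AGT"]))   # True
print(is_mutually_uncorrelated(["ACT", "TGA"]))   # False: prefix "T" of "TGA" is a suffix of "ACT"
```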

Journal ArticleDOI
Hao Liu1, Qingjie Zhao1, Hao Wang1, Peng Lv1, Yanming Chen1 
TL;DR: An image-based algorithm using improved Edit distance for near-duplicate video retrieval and localization and a detect-and-refine-strategy-based dynamic programming algorithm is proposed to generate the path matrix, which can be used to aggregate scores for video similarity measure and localize the similar parts.
Abstract: The rapid development of social network in recent years has spurred enormous growth of near-duplicate videos. The existence of huge volumes of near-duplicates shows a rising demand on effective near-duplicate video retrieval technique in copyright violation and search result reranking. In this paper, we propose an image-based algorithm using improved Edit distance for near-duplicate video retrieval and localization. By regarding video sequences as strings, Edit distance is used and extended to retrieve and localize near-duplicate videos. Firstly, bag-of-words (BOW) model is utilized to measure the frame similarities, which is robust to spatial transformations. Then, non-near-duplicate videos are filtered out by computing the proposed relative Edit distance similarity (REDS). Next, a detect-and-refine-strategy-based dynamic programming algorithm is proposed to generate the path matrix, which can be used to aggregate scores for video similarity measure and localize the similar parts. Experiments on CC_WEB_VIDEO and TREC CBCD 2011 datasets demonstrated the effectiveness and robustness of the proposed method in retrieval and localization tasks.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: In this paper, the authors present deterministic algorithms for computing DTW and GED between two sequences of n points in R in O(n^2 log log log n / log log n) time.
Abstract: Dynamic Time Warping (DTW) and Geometric Edit Distance (GED) are basic similarity measures between curves or general temporal sequences (e.g., time series) that are represented as sequences of points in some metric space (X, dist). The DTW and GED measures are massively used in various fields of computer science and computational biology; consequently, the tasks of computing these measures are among the core problems in P. Despite extensive efforts to find more efficient algorithms, the best-known algorithms for computing the DTW or GED between two sequences of points in X = R^d are long-standing dynamic programming algorithms that require quadratic runtime, even for the one-dimensional case d = 1, which is perhaps one of the most used in practice. In this paper, we break the nearly 50-year-old quadratic time bound for computing DTW or GED between two sequences of n points in R, by presenting deterministic algorithms that run in O(n^2 log log log n / log log n) time. Our algorithms can be extended to work also for higher dimensional spaces R^d, for any constant d, when the underlying distance-metric dist is polyhedral (e.g., L_1, L_infty).

Posted Content
TL;DR: The fastest known algorithm for tree edit distance runs in cubic $O(n^3)$ time and is based on a similar dynamic programming solution as string edit distance.
Abstract: The edit distance between two rooted ordered trees with $n$ nodes labeled from an alphabet~$\Sigma$ is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. Tree edit distance is a well known generalization of string edit distance. The fastest known algorithm for tree edit distance runs in cubic $O(n^3)$ time and is based on a similar dynamic programming solution as string edit distance. In this paper we show that a truly subcubic $O(n^{3-\varepsilon})$ time algorithm for tree edit distance is unlikely: For $|\Sigma| = \Omega(n)$, a truly subcubic algorithm for tree edit distance implies a truly subcubic algorithm for the all pairs shortest paths problem. For $|\Sigma| = O(1)$, a truly subcubic algorithm for tree edit distance implies an $O(n^{k-\varepsilon})$ algorithm for finding a maximum weight $k$-clique. Thus, while in terms of upper bounds string edit distance and tree edit distance are highly related, in terms of lower bounds string edit distance exhibits the hardness of the strong exponential time hypothesis [Backurs, Indyk STOC'15] whereas tree edit distance exhibits the hardness of all pairs shortest paths. Our result provides a matching conditional lower bound for one of the last remaining classic dynamic programming problems.

Journal ArticleDOI
TL;DR: A set of feature vectors (FVs) which are based on shape geometry (SG) decoding of the input character which are represented as the string of shape operators are proposed and evaluated using the characters extracted from printed aged multilingual Indian documents having English, Devanagari, and Marathi scripts.
Abstract: Multilingual character recognition from the images of aged Indian documents is challenging because of the complex character grapheme of the Indian language scripts. Feature extraction plays the most important role in recognition of such images. In this paper, we have proposed a set of feature vectors (FVs) which are based on shape geometry (SG) decoding of the input character. The first FV is based on SG decoding of the input character using triangular area (TA) calculation. The second FV, namely, SG using perpendicular distance is extracted by dividing the input image into individual components and the shape of the individual component is decoded into shape symbols by comparing the normalized perpendicular distances of the individual pixels of the component onto the line joining the end points of the component. Apart from the proposed FVs, we have used crossing count features. These FVs are represented as the string of shape operators; hence, we have used minimum edit distance classifier to recog...