
Showing papers on "Edit distance published in 2022"


Book ChapterDOI
TL;DR: This article shows that both graph modification problems are NP-hard, resolving a conjecture by Natanzon, Shamir, and Sharan (2001), and gives a subexponential-time parameterized algorithm solving this problem.

31 citations


Proceedings ArticleDOI
07 Aug 2022
TL;DR: In this article, the authors propose to automatically extract semantically-labeled edit directions from StyleGAN, finding and naming meaningful edit operations in a fully unsupervised setup, without additional human guidance.
Abstract: The success of StyleGAN has enabled unprecedented semantic editing capabilities, on both synthesized and real images. However, such editing operations are either trained with semantic supervision or annotated manually by users. In another development, the CLIP architecture has been trained with internet-scale loose image and text pairings, and has been shown to be useful in several zero-shot learning settings. In this work, we investigate how to effectively link the pretrained latent spaces of StyleGAN and CLIP, which in turn allows us to automatically extract semantically-labeled edit directions from StyleGAN, finding and naming meaningful edit operations, in a fully unsupervised setup, without additional human guidance. Technically, we propose two novel building blocks; one for discovering interesting CLIP directions and one for semantically labeling arbitrary directions in CLIP latent space. The setup does not assume any pre-determined labels and hence we do not require any additional supervised text/attributes to build the editing framework. We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible, revealing interesting and non-trivial edit directions.

6 citations


Proceedings ArticleDOI
16 Feb 2022
TL;DR: This work builds upon the algorithm of Andoni, Krauthgamer and Onak, which approximates the edit distance within a polylogarithmic factor in almost-linear time O(n^{1+ε}), and shows how to effectively prune their computation tree to obtain a sublinear-time algorithm in the given time bound.
Abstract: We revisit the task of computing the edit distance in sublinear time. In the (k,K)-gap edit distance problem we are given oracle access to two strings of length n and the task is to distinguish whether their edit distance is at most k or at least K. It has been established by Goldenberg, Krauthgamer and Saha (FOCS ’19), with improvements by Kociumaka and Saha (FOCS ’20), that the (k,k^2)-gap problem can be solved in time O(n/k + poly(k)). One of the most natural questions in this line of research is whether the (k,k^2)-gap is best-possible for the running time O(n/k + poly(k)). In this work we answer this question by significantly improving the gap. Specifically, we show that in time O(n/k + poly(k)) we can even solve the (k,k^{1+o(1)})-gap problem. This is the first algorithm that breaks the (k,k^2)-gap in this running time. Our algorithm is almost optimal in the following sense: In the low distance regime (k ≤ n^{0.19}) our running time becomes O(n/k), which matches a known n/k^{1+o(1)} lower bound for the (k,k^{1+o(1)})-gap problem up to lower order factors. Our result also reveals a surprising similarity of Hamming distance and edit distance in the low distance regime: For both, the (k,k^{1+o(1)})-gap problem has time complexity n/k^{1±o(1)} for small k. In contrast to previous work, which employed a subsampled variant of the Landau-Vishkin algorithm, we instead build upon the algorithm of Andoni, Krauthgamer and Onak (FOCS ’10), which approximates the edit distance within a polylogarithmic factor in almost-linear time O(n^{1+ε}). We first simplify their approach and then show how to effectively prune their computation tree in order to obtain a sublinear-time algorithm in the given time bound. Towards that, we use a variety of structural insights on the (local and global) patterns that can emerge during this process and design appropriate property testers to effectively detect these patterns.
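
For contrast with the sublinear-time result above, the bounded-distance setting itself is easy to state in code: a textbook banded dynamic program (in the spirit of Ukkonen's cutoff, not the paper's algorithm) decides whether the edit distance is at most k in O(nk) time. A rough sketch:

```python
def bounded_edit_distance(a: str, b: str, k: int):
    """Return the edit distance of a and b if it is <= k, else None.

    Textbook banded dynamic program: only DP cells within k of the main
    diagonal can hold a value <= k, so the table has O(n * k) relevant
    cells.  This is a simple baseline, not the paper's sublinear-time
    gap algorithm.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    INF = k + 1
    prev = [j if j <= k else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= k:
            cur[0] = i
        for j in range(max(1, i - k), min(m, i + k) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            best = min(prev[j - 1] + cost,   # substitution / match
                       prev[j] + 1,          # delete a[i-1]
                       cur[j - 1] + 1)       # insert b[j-1]
            cur[j] = best if best <= k else INF
        prev = cur
    return prev[m] if prev[m] <= k else None

print(bounded_edit_distance("kitten", "sitting", 3))  # 3
print(bounded_edit_distance("kitten", "sitting", 2))  # None
```

Running it with threshold k answers any (k, K)-gap instance exactly, which is precisely the Θ(nk)-time baseline that the sublinear-time gap algorithms above are designed to beat.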

6 citations


Journal ArticleDOI
TL;DR: In this article, a functorial pipeline for persistent homology is proposed, where the input is a filtered simplicial complex indexed by any finite metric lattice, and the output is a persistence diagram defined as the Möbius inversion of its birth-death function.
Abstract: We build a functorial pipeline for persistent homology. The input to this pipeline is a filtered simplicial complex indexed by any finite metric lattice, and the output is a persistence diagram defined as the Möbius inversion of its birth-death function. We adapt the Reeb graph edit distance to each of our categories and prove that both functors in our pipeline are 1-Lipschitz, making our pipeline stable. Our constructions generalize the classical persistence diagram, and in this setting, the bottleneck distance is strongly equivalent to the edit distance.

5 citations


Journal ArticleDOI
TL;DR: In this article, the authors make a surprising claim: in any realistic data mining application, the approximate FastDTW algorithm is much slower than the exact DTW algorithm; this fact has clear implications for the community that uses FastDTW, since switching to exact DTW allows it to address much larger datasets, get exact results, and do so in less time.
Abstract: Many time series data mining problems can be solved with repeated use of distance measure. Examples of such tasks include similarity search, clustering, classification, anomaly detection and segmentation. For over two decades it has been known that the Dynamic Time Warping (DTW) distance measure is the best measure to use for most tasks, in most domains. Because the classic DTW algorithm has quadratic time complexity, many ideas have been introduced to reduce its amortized time, or to quickly approximate it. One of the most cited approximate approaches is FastDTW. The FastDTW algorithm has well over a thousand citations and has been explicitly used in several hundred research efforts. In this work, we make a surprising claim. In any realistic data mining application, the approximate FastDTW is much slower than the exact DTW. This fact clearly has implications for the community that uses this algorithm: allowing it to address much larger datasets, get exact results, and do so in less time.
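
For reference, the exact DTW computation the authors advocate is the classic quadratic dynamic program. The sketch below assumes absolute difference as the pointwise cost and is a plain textbook version, not the authors' optimized implementation.

```python
def dtw_distance(x, y):
    """Exact Dynamic Time Warping distance between two numeric sequences.

    Classic O(len(x) * len(y)) dynamic program with |x[i] - y[j]| as the
    local cost.
    """
    n, m = len(x), len(y)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # expand x
                                  dp[i][j - 1],      # expand y
                                  dp[i - 1][j - 1])  # step both
    return dp[n][m]

# Identical series have distance 0; a stuttered copy warps onto it for free.
print(dtw_distance([1, 2, 3, 4], [1, 2, 3, 4]))      # 0.0
print(dtw_distance([1, 1, 2, 3, 4], [1, 2, 3, 4]))   # 0.0
```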

5 citations


Journal ArticleDOI
TL;DR: This paper proposes branch mappings, a novel approach to the construction of edit mappings for merge trees; the associated algorithm is faster than the only other branch-decomposition-independent method in the literature by more than a linear factor, and results on synthetic and real-world examples demonstrate its practicality and utility.
Abstract: Edit distances between merge trees of scalar fields have many applications in scientific visualization, such as ensemble analysis, feature tracking or symmetry detection. In this paper, we propose branch mappings, a novel approach to the construction of edit mappings for merge trees. Classic edit mappings match nodes or edges of two trees onto each other, and therefore have to either rely on branch decompositions of both trees or have to use auxiliary node properties to determine a matching. In contrast, branch mappings employ branch properties instead of node similarity information, and are independent of predetermined branch decompositions. Especially for topological features, which are typically based on branch properties, this allows a more intuitive distance measure which is also less susceptible to instabilities from small-scale perturbations. For trees with 𝒪(n) nodes, we describe an 𝒪(n^4) algorithm for computing optimal branch mappings, which is faster than the only other branch decomposition-independent method in the literature by more than a linear factor. Furthermore, we compare the results of our method on synthetic and real-world examples to demonstrate its practicality and utility.

5 citations


Journal ArticleDOI
TL;DR: In this article, the authors present an algorithm to solve the colinear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors.
Abstract: Colinear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the colinear chaining problem with anchor overlaps and gap costs in Õ(n) time, where n denotes the count of anchors. The degree of the polylogarithmic factor depends on the type of anchors used (e.g., fixed-length anchors) and the type of precedence an optimal anchor chain is required to satisfy. We also establish the first theoretical connection between colinear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal "anchored" edit distance equals the optimal colinear chaining cost. The anchored edit distance for two sequences and a set of anchors is only a slight generalization of the standard edit distance. It adds an additional cost of one to an alignment of two matching symbols that are not supported by any anchor. Finally, we demonstrate experimentally that optimal colinear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient >0.9 with edit distance for closely as well as distantly related sequences.
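
To make the chaining problem concrete, the sketch below shows a quadratic-time baseline that chains non-overlapping anchors under a simple coverage score. The paper's contribution is handling anchor overlaps and gap costs in near-linear time, which this toy version does not attempt.

```python
def chain_anchors(anchors):
    """Maximum-coverage colinear chain of non-overlapping anchors.

    Each anchor is (x, y, length): it matches query[x:x+length] to
    target[y:y+length].  Anchor j may precede anchor i only if it ends
    before i starts in both sequences.  Simple O(n^2) dynamic program;
    the paper's algorithms additionally handle anchor overlaps and gap
    costs and run in near-linear time in the number of anchors.
    """
    anchors = sorted(anchors)                      # by x, then y
    best = [a[2] for a in anchors]                 # best chain ending at anchor i
    for i, (xi, yi, li) in enumerate(anchors):
        for j in range(i):
            xj, yj, lj = anchors[j]
            if xj + lj <= xi and yj + lj <= yi:    # strictly colinear, no overlap
                best[i] = max(best[i], best[j] + li)
    return max(best) if best else 0

# The first two anchors chain together; the third conflicts in the target.
print(chain_anchors([(0, 0, 3), (5, 4, 2), (6, 1, 4)]))  # 5
```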

4 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce the notion of string set universe diameter of a genome graph and use it to model the distance between heterogeneous string sets and show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%.
Abstract: Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover's Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/. Supplementary data are available at Bioinformatics online.

4 citations


Book ChapterDOI
01 Jan 2022
TL;DR: In this paper, the threshold Dyck edit distance problem is considered, where the input is a sequence of parentheses $S$ and a positive integer $k$, and the goal is to compute the Dyck edit distance of $S$ only if the distance is at most $k$, and otherwise report that the distance is larger than $k$.
Abstract: A Dyck sequence is a sequence of opening and closing parentheses (of various types) that is balanced. The Dyck edit distance of a given sequence of parentheses $S$ is the smallest number of edit operations (insertions, deletions, and substitutions) needed to transform $S$ into a Dyck sequence. We consider the threshold Dyck edit distance problem, where the input is a sequence of parentheses $S$ and a positive integer $k$, and the goal is to compute the Dyck edit distance of $S$ only if the distance is at most $k$, and otherwise report that the distance is larger than $k$. Backurs and Onak [PODS'16] showed that the threshold Dyck edit distance problem can be solved in $O(n+k^{16})$ time. In this work, we design new algorithms for the threshold Dyck edit distance problem which costs $O(n+k^{4.544184})$ time with high probability or $O(n+k^{4.853059})$ deterministically. Our algorithms combine several new structural properties of the Dyck edit distance problem, a refined algorithm for fast $(\min,+)$ matrix product, and a careful modification of ideas used in Valiant's parsing algorithm.

3 citations


Proceedings ArticleDOI
01 Mar 2022
TL;DR: Hephaestus, a novel method that attempts to improve the accuracy of automated bug repair by learning to apply edit operations, is implemented; the results evidence that learning edit operations does not offer an advantage over the standard approach of translating directly from buggy code to fixed code.
Abstract: There has been much work done in the area of automated program repair, specifically through using machine learning methods to correct buggy code. Whereas some degree of success has been attained by those efforts, there is still considerable room for growth with regard to the accuracy of results produced by such tools. In that vein, we implement Hephaestus, a novel method to improve the accuracy of automated bug repair through learning to apply edit operations. Hephaestus leverages neural machine translation and attempts to produce the edit operations needed to correct a given buggy code segment to a fixed version. We examine the effects of using various forms of edit operations in the completion of this task. Our study found that all models which learned from edit operations were not as effective at repairing bugs as models which learned from fixed code segments directly. This evidences that learning edit operations does not offer an advantage over the standard approach of translating directly from buggy code to fixed code. We conduct an analysis of this lowered efficiency and explore why the complexity of the edit operations-based models may be suboptimal. Interestingly, even though our Hephaestus model exhibited lower translation accuracy than the baseline, Hephaestus was able to perform successful bug repair. This success, albeit small, leaves the door open for other researchers to innovate unique solutions in the realm of automatic bug repair.
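
The paper's models are neural, but the notion of "edit operations" between a buggy and a fixed snippet can be illustrated with Python's standard difflib; this is only a stand-in for whatever operation encoding Hephaestus actually uses.

```python
import difflib

def edit_operations(buggy: str, fixed: str):
    """Token-level edit operations turning `buggy` into `fixed`.

    Uses difflib.SequenceMatcher opcodes as an illustration of the kind
    of edit-operation sequences a repair model could be trained to emit.
    """
    buggy_toks, fixed_toks = buggy.split(), fixed.split()
    ops = []
    matcher = difflib.SequenceMatcher(a=buggy_toks, b=fixed_toks)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, buggy_toks[i1:i2], fixed_toks[j1:j2]))
    return ops

buggy = "if ( x = 0 ) return x ;"
fixed = "if ( x == 0 ) return x ;"
print(edit_operations(buggy, fixed))   # [('replace', ['='], ['=='])]
```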

3 citations


Journal ArticleDOI
TL;DR: The notion of string set universe diameter of a genome graph is introduced and it is shown that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%.
Abstract: Motivation Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. Results We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated TCR sequences and Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/ Contact carlk@cs.cmu.edu

Journal ArticleDOI
TL;DR: In this article, a code for correcting short tandem duplication and edit errors is proposed, where an edit error may be a substitution, deletion, or insertion; the asymptotic cost of protecting against an additional edit is only 0.003 bits/symbol.
Abstract: Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting two types of errors, namely, short tandem duplication and edit errors, where an edit error may be a substitution, deletion, or insertion. We focus on tandem repeats of length at most 3 and design codes for correcting an arbitrary number of duplication errors and one edit error. Because an edited symbol can be duplicated many times (as part of substrings of various lengths), a single edit can affect an unbounded substring of the retrieved word. However, we show that with appropriate preprocessing, the effect may be limited to a substring of finite length, thus making efficient error-correction possible. We construct a code for correcting the aforementioned errors and provide lower bounds for its rate. Compared to optimal codes correcting only duplication errors, numerical results show that the asymptotic cost of protecting against an additional edit is only 0.003 bits/symbol when the alphabet has size 4, an important case corresponding to data storage in DNA.
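
The error model can be illustrated with a few lines of code: short tandem duplications (length at most 3) plus a single edit. This only simulates the channel under one possible ordering of errors; it is not the code construction or the preprocessing from the paper.

```python
import random

def tandem_duplicate(s: str, max_len: int = 3) -> str:
    """Apply one short tandem duplication: copy a substring of length
    <= max_len and insert the copy immediately after the original."""
    k = random.randint(1, min(max_len, len(s)))
    i = random.randrange(0, len(s) - k + 1)
    return s[:i + k] + s[i:i + k] + s[i + k:]

def one_edit(s: str, alphabet: str = "ACGT") -> str:
    """Apply a single edit: substitution, deletion, or insertion."""
    i = random.randrange(0, len(s))
    op = random.choice(["sub", "del", "ins"])
    if op == "sub":
        return s[:i] + random.choice(alphabet) + s[i + 1:]
    if op == "del":
        return s[:i] + s[i + 1:]
    return s[:i] + random.choice(alphabet) + s[i:]

# One possible ordering of the channel's errors (duplications, then one edit);
# in the paper the edited symbol may itself be duplicated further.
word = "ACGTACGT"
noisy = word
for _ in range(3):
    noisy = tandem_duplicate(noisy)
noisy = one_edit(noisy)
print(word, "->", noisy)
```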

Proceedings ArticleDOI
01 Feb 2022
TL;DR: This paper gives an $O(n^{2.9546})$-time algorithm for the unweighted tree edit distance problem, breaking the cubic barrier; $O(n^{3})$ is the best possible running time for any algorithm using the decomposition strategy, which underlies almost all previously known algorithms.
Abstract: The (unweighted) tree edit distance problem for $n$ node trees asks to compute a measure of dissimilarity between two rooted trees with node labels. The current best algorithm from more than a decade ago runs in $O(n^{3})$ time [Demaine, Mozes, Rossman, and Weimann, ICALP 2007]. The same paper also showed that $O(n^{3})$ is the best possible running time for any algorithm using the so-called decomposition strategy, which underlies almost all the known algorithms for this problem. These algorithms would also work for the weighted tree edit distance problem, which cannot be solved in truly sub-cubic time under the APSP conjecture [Bringmann, Gawrychowski, Mozes, and Weimann, SODA 2018]. In this paper, we break the cubic barrier by showing an $O(n^{2.9546})$ time algorithm for the unweighted tree edit distance problem. We consider an equivalent maximization problem and use a dynamic programming scheme involving matrices with many special properties. By using a decomposition scheme as well as several combinatorial techniques, we reduce tree edit distance to the max-plus product of bounded-difference matrices, which can be solved in truly sub-cubic time [Bringmann, Grandoni, Saha, and Vassilevska Williams, FOCS 2016].

Journal ArticleDOI
TL;DR: eWFA-GPU as mentioned in this paper is a GPU-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA), which exploits the similarities between the input sequences to accelerate the alignment process while requiring less memory than other algorithms.
Abstract: Sequence alignment remains a fundamental problem with practical applications ranging from pattern recognition to computational biology. Traditional algorithms based on dynamic programming are hard to parallelize, require significant amounts of memory, and fail to scale for large inputs. This work presents eWFA-GPU, a GPU (graphics processing unit)-accelerated tool to compute the exact edit-distance sequence alignment based on the wavefront alignment algorithm (WFA). This approach exploits the similarities between the input sequences to accelerate the alignment process while requiring less memory than other algorithms. Our implementation takes full advantage of the massive parallel capabilities of modern GPUs to accelerate the alignment process. In addition, we propose a succinct representation of the alignment data that successfully reduces the overall amount of memory required, allowing the exploitation of the fast shared memory of a GPU. Our results show that our GPU implementation outperforms by 3-9× the baseline edit-distance WFA implementation running on a 20 core machine. As a result, eWFA-GPU is up to 265 times faster than state-of-the-art CPU implementation, and up to 56 times faster than state-of-the-art GPU implementations.
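
The core idea that WFA-style methods exploit can be sketched on the CPU: for each score, track the furthest-reaching position on every diagonal and extend runs of matches greedily. The sketch below computes the exact unit-cost edit distance this way; the GPU parallelization and the succinct alignment representation described in the paper are not shown.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Exact unit-cost edit distance via diagonal transitions.

    For each score s we keep, on every diagonal k = i - j, the furthest
    position i in `a` reachable with at most s edits, then extend exact
    matches.  Time is O((|a| + |b|) * distance), which is what makes
    wavefront-style methods fast on similar sequences.
    """
    n, m = len(a), len(b)

    def extend(i: int, k: int) -> int:
        j = i - k
        while i < n and j < m and a[i] == b[j]:
            i, j = i + 1, j + 1
        return i

    fr = {0: extend(0, 0)}          # wavefront for score s = 0
    s = 0
    while fr.get(n - m, -1) < n:    # stop once diagonal n-m reaches the end
        s += 1
        nxt = {}
        for k in range(-s, s + 1):
            cands = []
            if k in fr:
                cands.append(fr[k])                  # keep previous progress
                if fr[k] + 1 <= n and fr[k] + 1 - k <= m:
                    cands.append(fr[k] + 1)          # substitution
            if k - 1 in fr and fr[k - 1] + 1 <= n and fr[k - 1] + 1 - k <= m:
                cands.append(fr[k - 1] + 1)          # deletion from a
            if k + 1 in fr and fr[k + 1] <= n and fr[k + 1] - k <= m:
                cands.append(fr[k + 1])              # insertion into a
            if cands:
                nxt[k] = extend(max(cands), k)
        fr = nxt
    return s

print(wavefront_edit_distance("GATTACA", "GCATGCU"))  # 4
```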

Journal ArticleDOI
TL;DR: In this paper, the authors improve the performance of secure computation of string metrics such as edit distance without sacrificing security, generality, composability, or accuracy, exploring a new design methodology that reduces the asymptotic cost by a factor of O(log n) (where n denotes the input string length).
Abstract: Secure string-comparison by some non-linear metrics such as edit-distance and its variations is an important building block of many applications including patient genome matching and text-based intrusion detection. Despite the significance of these string metrics, computing them in a provably secure manner is very expensive. In this paper, we improve the performance of secure computation of these string metrics without sacrificing security, generality, composability, and accuracy. We explore a new design methodology that allows us to reduce the asymptotic cost by a factor of O(log n) (where n denotes the input string length). In our experiments, we observe up to an order-of-magnitude savings in time and bandwidth compared to the best prior results. We extended our semi-honest protocols to work in the malicious model, yielding by far the most efficient actively-secure protocols for computing these string metrics.

Journal ArticleDOI
01 Jan 2022
TL;DR: In this paper, the effectiveness of evolutionary programming (EP) as a general approach for finding side effect machines (SEMs) for edit metric decoding is examined, and the results are analyzed to find potential trends and relationships among the parameters, with the most consistent trend being that longer codes generally show a propensity for larger machines.
Abstract: A number of applications use DNA as a storage mechanism. Because processes in these applications may cause errors in the data, the information must be encoded as one of a chosen set of words that are well separated from one another - a DNA error-correcting code. Typically, the types of errors that may occur include insertions, deletions and substitutions of symbols, making the edit metric the most suitable choice to measure the distance between strings. Decoding, the process of recovering the original word when errors occur, is complicated by biological restrictions combined with a high cost to calculate edit distance. Side effect machines (SEMs), an extension of finite state machines, can provide efficient decoding algorithms for such codes. Several codes of varying lengths are used to study the effectiveness of evolutionary programming (EP) as a general approach for finding SEMs for edit metric decoding. Two classification methods (direct and fuzzy classification) are compared, and different EP settings are examined to observe how decoding accuracy is affected. Regardless of code length, the best results are found using fuzzy classification. The best accuracy is seen for codes of length 10, for which a maximum accuracy of up to 99.4% is achieved for distance 1, while distances 2 and 3 achieve up to 97.1% and 85.9%, respectively. Additionally, the SEMs are examined for potential bloat by comparing the number of reachable states against the total number of states. Bloat is seen more in larger machines than in smaller machines. Furthermore, the results are analysed to find potential trends and relationships among the parameters, with the most consistent trend being that, when allowed, the longer codes generally show a propensity for larger machines.

Book ChapterDOI
01 Jan 2022
TL;DR: In this paper, the authors propose an algorithm for generating correction candidates with different edit distances and evaluate its performance on the VNOnDB database used in the Vietnamese online handwritten text recognition competition (VOHTR 2018).
Abstract: Candidate word generation by character edit operations is an important method that has been employed in many OCR error correction approaches. In this paper, we study how character edit distances impact the performance of OCR error correction. We propose an algorithm for generating correction candidates with different edit distances. Correction candidates for both non-word and real-word errors are considered. The candidates are scored and ranked based on linguistic features and edit probability. The experiments are conducted on the VNOnDB database used in the Vietnamese online handwritten text recognition competition (VOHTR 2018). We evaluate the error correction performance on different edit distances in terms of two error metrics, character error rate (CER) and word error rate (WER). It is shown that edit distances of 1 and 2 obtain better correction results than higher edit distances.
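
Candidate generation at a given edit distance can be illustrated with the classic enumeration of all strings within one edit of a token (Norvig-style). This is a generic sketch of the building block, not the paper's candidate-generation algorithm, which additionally scores and ranks candidates by linguistic features and edit probability.

```python
import string

def candidates_within_one_edit(word: str, alphabet: str = string.ascii_lowercase):
    """All strings at edit distance <= 1 from `word` (deletions,
    substitutions, insertions).  Applying the function twice yields
    distance-2 candidates; real systems then filter the set against a
    lexicon before scoring."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in alphabet}
    inserts = {L + c + R for L, R in splits for c in alphabet}
    return {word} | deletes | substitutes | inserts

print("spell" in candidates_within_one_edit("spel"))   # True
```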


Journal ArticleDOI
TL;DR: In this article, a bidirectional frameshift allowance with end-user determined accommodation caps combined with weighted error discrimination is introduced to improve the computational speed of Levenshtein distance.
Abstract: Third-generation sequencing offers some advantages over next-generation sequencing predecessors, but with the caveat of harboring a much higher error rate. Clustering related sequences is an essential task in modern biology. To accurately cluster sequences rich in errors, error type and frequency need to be accounted for. Levenshtein distance is a well-established mathematical algorithm for measuring the edit distance between words and can specifically weight insertions, deletions and substitutions. However, there are drawbacks to using Levenshtein distance in a biological context, and hence it has rarely been used for this purpose. We present novel modifications to the Levenshtein distance algorithm to optimize it for clustering error-rich biological sequencing data. We successfully introduced a bidirectional frameshift allowance with end-user determined accommodation caps combined with weighted error discrimination. Furthermore, our modifications dramatically improved the computational speed of Levenshtein distance. For simulated ONT MinION and PacBio Sequel datasets, the average clustering sensitivity for 3GOLD was 41.45% (S.D. 10.39) higher than Sequence-Levenshtein distance, 52.14% (S.D. 9.43) higher than Levenshtein distance, 55.93% (S.D. 8.67) higher than Starcode, 42.68% (S.D. 8.09) higher than CD-HIT-EST and 61.49% (S.D. 7.81) higher than DNACLUST. For biological ONT MinION data, 3GOLD clustering sensitivity was 27.99% higher than Sequence-Levenshtein distance, 52.76% higher than Levenshtein distance, 56.39% higher than Starcode, 48% higher than CD-HIT-EST and 70.4% higher than DNACLUST. Our modifications to Levenshtein distance have improved its speed and accuracy compared to the classic Levenshtein distance, Sequence-Levenshtein distance and other commonly used clustering approaches on simulated and biological third-generation sequenced datasets. Our clustering approach is appropriate for datasets of unknown cluster centroids, such as those generated with unique molecular identifiers, as well as known centroids such as barcoded datasets. A strength of our approach is high accuracy in resolving small clusters and mitigating the number of singletons.
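
The "weighted error discrimination" idea can be illustrated in isolation with a Levenshtein distance whose insertion, deletion, and substitution costs are tunable; the 3GOLD-specific modifications (bidirectional frameshift allowance and accommodation caps) are not reproduced in this sketch.

```python
def weighted_levenshtein(a: str, b: str,
                         ins_cost: float = 1.0,
                         del_cost: float = 1.0,
                         sub_cost: float = 1.0) -> float:
    """Levenshtein distance with separate insertion/deletion/substitution
    weights.  With third-generation reads, one might weight error types
    differently; the specific weighting scheme of 3GOLD is not shown here."""
    n, m = len(a), len(b)
    prev = [j * ins_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * del_cost] + [0.0] * m
        for j in range(1, m + 1):
            sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub_cost)
            cur[j] = min(sub, prev[j] + del_cost, cur[j - 1] + ins_cost)
        prev = cur
    return prev[m]

# Penalizing substitutions twice as much as indels.
print(weighted_levenshtein("ACGTT", "ACGA", ins_cost=1, del_cost=1, sub_cost=2))  # 3.0
```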

Book ChapterDOI
01 Jan 2022
TL;DR: Ganesh et al. as mentioned in this paper showed that combining compression and approximation can help to reduce the running time of string distance measures, for example yielding an Õ(N^{k/2} n^{k/2})-time FPTAS for the median edit distance of k ≥ 3 strings.
How Compression and Approximation Affect Efficiency in String Distance Measures, by Arun Ganesh, Tomasz Kociumaka, Andrea Lincoln, and Barna Saha. Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 2867-2919. DOI: https://doi.org/10.1137/1.9781611977073.112
Abstract: Real-world data often comes in compressed form. Analyzing compressed data directly (without first decompressing it) can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and try to quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compression schemes such as the Lempel–Ziv family, dictionary methods, and others. For two strings of total length N and total compressed size n, it is known that the edit distance and a longest common subsequence (LCS) can be computed exactly in time Õ(nN), as opposed to O(N^2) for the uncompressed setting. Many real-world applications need to align multiple sequences simultaneously, and the fastest known exact algorithms for median edit distance and LCS of k strings run in O(N^k) time, whereas the one for center edit distance has a time complexity of O(N^{2k}). This naturally raises the question whether compression can help to reduce the running time significantly for k ≥ 3, perhaps to O(N^{k/2} n^{k/2}) or, more optimistically, to O(N n^{k-1}). Unfortunately, we show new lower bounds that rule out any improvement beyond Ω(N^{k-1} n) time for any of these problems assuming the Strong Exponential Time Hypothesis (SETH), where again N and n represent the total length and the total compressed size, respectively. This answers an open question of Abboud, Backurs, Bringmann, and Künnemann (FOCS'17). In the presence of such negative results, we ask if allowing approximation can help, and we show that approximation and compression together can be surprisingly effective for both multiple and two strings. We develop an Õ(N^{k/2} n^{k/2})-time FPTAS for the median edit distance of k sequences, leading to a saving of nearly half the dimensions for highly compressible sequences. In comparison, no O(N^{k-Ω(1)})-time PTAS is known for the median edit distance problem in the uncompressed setting. We also obtain an improvement for the center edit distance problem. For two strings, we get an FPTAS for both edit distance and LCS whose running time is o(N) whenever n ≪ N^{1/4}. In contrast, for uncompressed strings, there is not even a subquadratic algorithm for LCS that has less than a polynomial gap in the approximation factor. Building on the insight from our approximation algorithms, we also obtain several new and improved results for many fundamental distance measures including the edit, Hamming, and shift distances.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a novel method to compute approximate distance distributions with error bound guarantees, which outperforms the sampling-based solution (without error guarantees) by up to three orders of magnitude.
Abstract: In this work we study the distance distribution computation problem. It has been widely used in many real-world applications, e.g., human genome clustering, cosmological model analysis, and parameter tuning. The straightforward solution for the exact distance distribution computation problem is unacceptably slow due to (i) massive data size, and (ii) expensive distance computation. In this paper, we propose a novel method to compute approximate distance distributions with error bound guarantees. Furthermore, our method is generic to different distance measures. We conduct extensive experimental studies on three widely used distance measures with real-world datasets. The experimental results demonstrate that our proposed method outperforms the sampling-based solution (without error guarantees) by up to three orders of magnitude.

Proceedings ArticleDOI
17 Sep 2022
TL;DR: The authors propose a spell checker for typed words based on the Modified Minimum Edit Distance Algorithm (MEDA) and the Syllable Error Detection Algorithm (SEDA), which identifies the component of the word and the position of the letter that has an error.
Abstract: Automatic spelling correction for a language is critical since the current world is almost entirely dependent on digital devices that employ electronic keyboards. Correct spelling adds to textual document accessibility and readability. Many NLP applications, such as web search engines, text summarization, sentiment analysis, and so on, rely on automatic spelling correction. A few efforts on automatic spelling correction in Bantu languages have been completed; however, the numbers are insufficient. We proposed a spell checker for typed words based on the Modified Minimum Edit Distance Algorithm (MEDA) and the Syllable Error Detection Algorithm (SEDA). In this study, we adjusted the minimum edit distance algorithm by including a frequency score for letters and ordered operations. The SEDA identifies the component of the word and the position of the letter which has an error. For this research, the Setswana language was utilized for testing, and other languages related to Setswana can use this spell checker. Setswana is a Bantu language spoken mostly in Botswana, South Africa, and Namibia, and its automatic spelling correction is still in its early stages. Setswana is Botswana's national language and is mostly utilized in schools and government offices. The accuracy was assessed on 2,500 Setswana words. The SEDA discovered incorrect Setswana words with 99% accuracy. When evaluating MEDA, the edit distance algorithm was utilized as the baseline, and it generated an accuracy of 52%. In comparison, the edit distance algorithm with ordered operations provided 64% accuracy, and MEDA produced 92% accuracy. The model failed on closely related terms.

Proceedings ArticleDOI
01 Oct 2022
TL;DR: Mao (FOCS'21) recently broke the cubic barrier for tree edit distance using fast matrix multiplication; for the bounded-distance setting, this paper gives an Õ(n + poly(k))-time algorithm, answering an open question of Akmal and Jin (ICALP'21), whose O(nk^2)-time algorithm was the previous state of the art.
Abstract: Computing the edit distance of two strings is one of the most basic problems in computer science and combinatorial optimization. Tree edit distance is a natural generalization of edit distance in which the task is to compute a measure of dissimilarity between two (unweighted) rooted trees with node labels. Perhaps the most notable recent application of tree edit distance is in NoSQL big databases, such as MongoDB, where each row of the database is a JSON document represented as a labeled rooted tree and finding dissimilarity between two rows is a basic operation. Until recently, the fastest algorithm for tree edit distance ran in cubic time (Demaine, Mozes, Rossman, Weimann; TALG’10); however, Mao (FOCS’21) broke the cubic barrier for the tree edit distance problem using fast matrix multiplication. Given a parameter k as an upper bound on the distance, an $\mathcal{O}(n+k^{2})$-time algorithm for edit distance has been known since the 1980s due to works of Myers (Algorithmica’86) and Landau and Vishkin (JCSS’88). The existence of an $\tilde{\mathcal{O}}(n+poly(k))$-time algorithm for tree edit distance has been posed as an open question, e.g., by Akmal and Jin (ICALP’21), who give a state-of-the-art $O(nk^{2})$-time algorithm. In this paper, we answer this question positively.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed three parallel branch-and-bound (B&B) approaches based on shared memory to exploit multi-core CPU processors, reducing the computation time required to explore the whole search space by performing an implicit enumeration of it.

Journal ArticleDOI
TL;DR: In this article, the authors derive lower bounds for efficient filtering from restricted assignment problems where the cost function is a tree metric, embedding the costs of optimal assignments isometrically into ℓ1 space and rendering efficient indexing possible.
Abstract: The graph edit distance is an intuitive measure to quantify the dissimilarity of graphs, but its computation is NP-hard and challenging in practice. We introduce methods for answering nearest neighbor and range queries regarding this distance efficiently for large databases with up to millions of graphs. We build on the filter-verification paradigm, where lower and upper bounds are used to reduce the number of exact computations of the graph edit distance. Highly effective bounds for this involve solving a linear assignment problem for each graph in the database, which is prohibitive in massive datasets. Index-based approaches typically provide only weak bounds, leading to high computational costs during verification. In this work, we derive novel lower bounds for efficient filtering from restricted assignment problems, where the cost function is a tree metric. This special case allows embedding the costs of optimal assignments isometrically into ℓ1 space, rendering efficient indexing possible. We propose several lower bounds of the graph edit distance obtained from tree metrics reflecting the edit costs, which are combined for effective filtering. Our method termed EmbAssi can be integrated into existing filter-verification pipelines as a fast and effective pre-filtering step. Empirically we show that for many real-world graphs our lower bounds are already close to the exact graph edit distance, while our index construction and search scales to very large databases.
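
A much weaker bound than EmbAssi's illustrates the same embed-and-filter idea: under unit edit costs, map each graph to its node-label histogram; the ℓ1 distance between histograms, halved and rounded up, lower-bounds the graph edit distance and can serve as a cheap, indexable pre-filter. This is only a toy stand-in for the paper's assignment-based bounds.

```python
from collections import Counter

def l1_histogram_bound(labels_g, labels_h) -> int:
    """Simple lower bound on the unit-cost graph edit distance.

    Every node relabeling changes the node-label histogram by at most 2
    (one count down, another up), every node insertion or deletion by 1,
    and edge edits not at all, so GED >= ceil(L1 / 2).  Far weaker than
    the EmbAssi bounds, but histograms embed into l1 space and can be
    indexed and compared very cheaply.
    """
    hg, hh = Counter(labels_g), Counter(labels_h)
    l1 = sum(abs(hg[x] - hh[x]) for x in set(hg) | set(hh))
    return -(-l1 // 2)   # ceiling division

# Filter step: skip the exact GED computation whenever the bound already
# exceeds the query radius.
print(l1_histogram_bound("CCCO", "CCN"))   # 2
```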


Journal ArticleDOI
TL;DR: In this article, a segmented-edit error-correcting code with a re-synchronization function is proposed, where the decoding complexity is linear in the codeword length.
Abstract: As a powerful tool for storing digital information in chemically synthesized molecules, DNA-based data storage has undergone continuous development and received increasingly more attention. Efficiently recovering information from large-scale DNA strands that suffer from insertions, deletions, and substitution errors (collectively referred to as edit errors) is one of the major bottlenecks in DNA-based storage systems. To cope with this challenge, in this paper, we provide a segmented-edit error-correcting code with a re-synchronization function, termed the DNA-LM code. Compared with the previous segmented-error-correcting codes, it has a systematic structure and does not require the endpoint of the received segment as pre-requisite information for decoding. In the case that the number of edit errors exceeds the edit error-correcting capability of a segment, it can easily regain synchronization to ensure that the subsequent decoding continues. Both encoding and decoding complexity is linear in the codeword length. The redundancy of each segment is $\lceil \log k\rceil +6$ quaternary symbols, where $k$ is the length of the message segment. We further generalize the decoding algorithm to deal with duplicated DNA strands, while still maintaining linear time complexity in the codeword length and the number of duplications. Simulations under a stochastic edit error model show that, at the low raw error rate of “next-gen” sequencing, our code can enable error-free decoding by concatenating with the (255,223) RS code.

Journal ArticleDOI
Phillip Law
TL;DR: In this paper, the authors show that the dynamic time warping (DTW) distance between a pair of time series can be represented by the longest increasing subsequence (LIS) length of an integer sequence, connecting DTW to the well-studied family of measures that includes the longest common subsequence (LCS) length and the edit distance.
Abstract: The similarity between a pair of time series, i.e., sequences of indexed values in time order, is often estimated by the dynamic time warping (DTW) distance, instead of any in the well-studied family of measures including the longest common subsequence (LCS) length and the edit distance. Although it may seem as if the DTW and the LCS(-like) measures are essentially different, we reveal that the DTW distance can be represented by the longest increasing subsequence (LIS) length of a sequence of integers, which is the LCS length between the integer sequence and itself sorted. For a given pair of time series of length n such that the dissimilarity between any elements is an integer between zero and c, we propose an integer sequence that represents any substring-substring DTW distance as its band-substring LIS length. The length of the produced integer sequence is O(cn^2), which can be translated to O(n^2) for constant dissimilarity functions. To demonstrate that techniques developed under the LCS(-like) measures are directly applicable to analysis of time series via our reduction of DTW to LIS, we present time-efficient algorithms for DTW-related problems utilizing the semi-local sequence comparison technique developed for LCS-related problems.
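
The DTW-to-LIS reduction itself is the paper's contribution and is not reproduced here; the sketch below only shows the standard O(n log n) longest-increasing-subsequence routine (patience sorting with binary search, strictly increasing variant), i.e., the primitive that the constructed integer sequence would be fed to.

```python
import bisect

def lis_length(seq):
    """Length of a longest strictly increasing subsequence in O(n log n).

    tails[t] is the smallest possible last element of an increasing
    subsequence of length t + 1; each new element either extends the
    longest subsequence found so far or tightens one of the tails.
    """
    tails = []
    for x in seq:
        pos = bisect.bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))  # 4  (e.g. 1, 4, 5, 9)
```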

Book ChapterDOI
TL;DR: In this paper, the authors attempt to determine the degrees of association between automatic MT metrics and error classes for translation from English into inflectional Slovak, using a corpus which consists of English journalistic texts taken from the British online newspaper The Guardian and their human and machine translations.
Abstract: Machine translation (MT) evaluation plays an important role in the translation industry. The main issue in evaluating MT quality is an unclear definition of translation quality. Several methods and techniques for measuring MT quality have been designed. Our study aims at interconnecting manual error classification with automatic metrics of MT evaluation. We attempt to determine the degrees of association between automatic MT metrics and error classes from English into inflectional Slovak. We created a corpus which consists of English journalistic texts, taken from the British online newspaper The Guardian, and their human and machine translations. The MT outputs, produced by Google Translate, were manually annotated by three professionals using a categorical framework for error analysis and evaluated using reference proximity through the metrics of automated MT evaluation. The results showed that not all examined automatic metrics based on n-grams or edit distance should be implemented into a model for determining MT quality. When determining the quality of machine translation with respect to syntactic-semantic correlativeness, it is sufficient to consider only the Recall, BLEU-4 or F-measure, ROUGE-L and NIST (based on n-grams) and the metric CharacTER, which is based on edit distance. Keywords: Machine translation, Automatic metrics, Error classification
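
As a small illustration of the edit-distance family of metrics compared in the study, the sketch below computes word error rate: token-level edit distance normalized by reference length. This is a generic example, not the exact WER or CharacTER implementations the authors used.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Token-level edit distance divided by reference length.

    A generic illustration of edit-distance-based MT metrics (WER here;
    CharacTER works analogously on characters)."""
    hyp, ref = hypothesis.split(), reference.split()
    n, m = len(hyp), len(ref)
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[m] / max(m, 1)

print(word_error_rate("the cat sat on mat", "the cat sat on the mat"))  # ~0.167
```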

Journal ArticleDOI
TL;DR: This work presents scalable parallel algorithms to support efficient similarity search under edit distance and addresses the problem of uneven workload across different processing units, which is mainly caused by the significant variance in the size of the sequences.
Abstract: Edit distance is the most widely used method to quantify similarity between two strings. We investigate the problem of similarity search under edit distance. Given a collection of sequences, the goal of similarity search under edit distance is to find sequences in the collection that are similar to a given query sequence, where the similarity score is computed using edit distance. The canonical method of computing edit distance between two strings uses a dynamic programming-based approach that runs in quadratic time and space, which may not provide results in a reasonable amount of time for large sequences. This advocates for parallel algorithms to reduce the time taken by edit distance computation. To this end, we present scalable parallel algorithms to support efficient similarity search under edit distance. The efficiency and scalability of the proposed algorithms is demonstrated through an extensive set of experiments on real datasets. Moreover, to address the problem of uneven workload across different processing units, which is mainly caused by the significant variance in the size of the sequences, different data distribution schemes are discussed and empirically analyzed. Experimental results have shown that the speedup achieved by the hybrid approach over inter-task and intra-task parallelism is 18 and 13, respectively.
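
A minimal sketch of the inter-task parallelism described above: each worker in a process pool computes the edit distance of the query against whole sequences. The intra-task and hybrid schemes and the data distribution strategies from the paper are not reproduced; the function and dataset names below are illustrative.

```python
from functools import partial
from multiprocessing import Pool

def edit_distance(a: str, b: str) -> int:
    """Standard quadratic dynamic-programming edit distance."""
    n, m = len(a), len(b)
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[m]

def similar_sequences(query, collection, threshold, workers=4):
    """Inter-task parallel similarity search: each worker scores whole
    sequences against the query; sequences within `threshold` are kept.
    Uneven sequence lengths make tasks uneven, which is exactly the
    load-balancing problem the paper's hybrid scheme addresses."""
    with Pool(workers) as pool:
        dists = pool.map(partial(edit_distance, query), collection)
    return [s for s, d in zip(collection, dists) if d <= threshold]

if __name__ == "__main__":
    db = ["ACGTACGT", "ACGTTCGT", "TTTTTTTT", "ACGAACGT"]
    print(similar_sequences("ACGTACGT", db, threshold=1))
    # ['ACGTACGT', 'ACGTTCGT', 'ACGAACGT']
```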