
Showing papers on "Edit distance" published in 2012


Journal ArticleDOI
TL;DR: This paper develops a methodology to detect domain fluxing, as used by the Conficker botnet, with minimal false positives, and applies it to packet traces collected at a Tier-1 ISP.
Abstract: Recent botnets such as Conficker, Kraken, and Torpig have used DNS-based "domain fluxing" for command-and-control, where each bot queries for the existence of a series of domain names and the owner has to register only one such domain name. In this paper, we develop a methodology to detect such "domain fluxes" in DNS traffic by looking for patterns inherent to domain names that are generated algorithmically, in contrast to those generated by humans. In particular, we look at the distribution of alphanumeric characters as well as bigrams in all domains that are mapped to the same set of IP addresses. We present and compare the performance of several distance metrics, including K-L distance, edit distance, and the Jaccard measure. We train on a "good" dataset of domains obtained via a crawl of domains mapped to the entire IPv4 address space, and model "bad" datasets based on behaviors seen so far and expected. We also apply our methodology to packet traces collected at a Tier-1 ISP and show that we can automatically detect domain fluxing as used by the Conficker botnet with minimal false positives, in addition to discovering a new botnet within the ISP trace. We also analyze a campus DNS trace to detect another unknown botnet exhibiting a more advanced domain-name generation technique.
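As a rough illustration of the character-distribution idea (a minimal sketch, not the paper's exact methodology or data), the snippet below compares the smoothed unigram character distribution of a group of domains against a reference distribution using K-L divergence; the domain lists, alphabet, and smoothing choice are placeholder assumptions.

```python
from collections import Counter
from math import log

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def char_distribution(domains):
    """Add-one smoothed unigram distribution over alphanumeric characters in a domain group."""
    counts = Counter(c for d in domains for c in d.lower() if c in ALPHABET)
    total = sum(counts.values()) + len(ALPHABET)
    return {c: (counts[c] + 1) / total for c in ALPHABET}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) between two character distributions."""
    return sum(p[c] * log(p[c] / q[c]) for c in p)

# Hypothetical domain groups mapped to the same IP addresses (placeholders, not real data).
benign = ["example.com", "openlibrary.org", "weatherreport.net"]
suspect = ["xqz8k2lf.com", "p0v9qrt1.net", "mm3zk7aa.org"]

good = char_distribution(benign)
test = char_distribution(suspect)
print(f"KL(test || good) = {kl_divergence(test, good):.4f}")
```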

227 citations


Posted Content
TL;DR: This article proposes discriminative string-edit CRFs, a conditional random field model for edit sequences between strings, trained on both positive and negative instances of string pairs.
Abstract: The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because, as conditionally trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

133 citations


Journal ArticleDOI
TL;DR: A novel approach for finding similar trajectories is presented, using trajectory segmentation based on movement parameters (MPs) such as speed, acceleration, or direction; a modified version of edit distance called normalized weighted edit distance (NWED) is introduced as the similarity measure.
Abstract: This article describes a novel approach for finding similar trajectories, using trajectory segmentation based on movement parameters (MPs) such as speed, acceleration, or direction. First, a segmentation technique is applied to decompose trajectories into a set of segments with homogeneous characteristics with respect to a particular MP. Each segment is assigned to a movement parameter class (MPC), representing the behavior of the MP. Accordingly, the segmentation procedure transforms a trajectory into a sequence of class labels, that is, a symbolic representation. A modified version of edit distance called normalized weighted edit distance (NWED) is introduced as a similarity measure between different sequences. As an application, we demonstrate how the method can be employed to cluster trajectories. The performance of the approach is assessed in two case studies using real movement datasets from two different application domains, namely, North Atlantic hurricane trajectories and GPS tracks of couriers in London. Three different experiments have been conducted that respond to different facets of the proposed techniques and that compare our NWED measure to a related method.
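NWED itself involves weights and normalization not spelled out in the abstract; as a baseline illustration of comparing symbolic trajectory representations, the sketch below computes a plain (unweighted) edit distance between two sequences of hypothetical movement parameter class labels.

```python
def label_edit_distance(seq_a, seq_b):
    """Plain Levenshtein distance between two sequences of class labels."""
    n = len(seq_b)
    prev = list(range(n + 1))
    for i, a in enumerate(seq_a, 1):
        curr = [i]
        for j, b in enumerate(seq_b, 1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[n]

# Hypothetical MPC label sequences for two trajectories (e.g., speed classes).
t1 = ["slow", "slow", "fast", "fast", "stop"]
t2 = ["slow", "fast", "fast", "stop", "stop"]
print(label_edit_distance(t1, t2))  # -> 2
```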

128 citations


Proceedings ArticleDOI
01 Apr 2012
TL;DR: It is found that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered.
Abstract: Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so that a given output pair is produced by only one task; for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to the reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.
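To make the "each output pair comes from exactly one reducer" constraint concrete, here is a minimal single-round sketch in the spirit of pigeonhole-based splitting (a generic strategy, not necessarily any of the paper's algorithms verbatim): each fixed-length string is split into d+1 segments, strings are routed to reducers keyed by (segment index, segment value), and a reducer emits a qualifying pair only if its segment index is the first one on which the two strings agree. The input strings and threshold are placeholders.

```python
from collections import defaultdict
from itertools import combinations

def segments(s, parts):
    """Split s into `parts` contiguous pieces of (almost) equal length: (index, lo, hi)."""
    k, r = divmod(len(s), parts)
    out, pos = [], 0
    for i in range(parts):
        size = k + (1 if i < r else 0)
        out.append((i, pos, pos + size))
        pos += size
    return out

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def similarity_join(strings, d):
    """All pairs within Hamming distance d; each pair is emitted by exactly one 'reducer'."""
    parts = d + 1
    buckets = defaultdict(list)                              # map phase
    for s in strings:
        for i, lo, hi in segments(s, parts):
            buckets[(i, s[lo:hi])].append(s)
    results = []                                             # reduce phase
    for (i, _), group in buckets.items():
        for a, b in combinations(group, 2):
            if hamming(a, b) > d:
                continue
            # Emit only at the first agreeing segment, so no other reducer repeats the pair.
            first = min(j for j, lo, hi in segments(a, parts) if a[lo:hi] == b[lo:hi])
            if first == i:
                results.append(tuple(sorted((a, b))))
    return results

print(similarity_join(["acde", "abcd", "abce", "abzd"], d=1))
# -> [('abcd', 'abce'), ('abcd', 'abzd')] (order may vary)
```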

110 citations


Journal ArticleDOI
TL;DR: Two new graph kernels applied to regression and classification problems are presented, one based on the notion of edit distance and the other on subtree enumeration.

91 citations


Journal ArticleDOI
TL;DR: This paper focuses on an index structure for similarity search over a set of large sparse graphs, proposes an efficient indexing mechanism by introducing the Q-Gram idea, and develops a series of techniques for inverted index construction and online query processing.
Abstract: The graph structure is a very important means to model schemaless data with complicated structures, such as protein-protein interaction networks, chemical compounds, knowledge query inferring systems, and road networks. This paper focuses on the index structure for similarity search on a set of large sparse graphs and proposes an efficient indexing mechanism by introducing the Q-Gram idea. By decomposing graphs into small grams (organized by κ-Adjacent Tree patterns) and pairing up those κ-Adjacent Tree patterns, a lower bound on their edit distance can be calculated for candidate filtering. Furthermore, we have developed a series of techniques for inverted index construction and online query processing. By building the candidate set for the query graph before the exact edit distance calculation, the number of graphs that need to proceed to exact matching can be greatly reduced. Extensive experiments on real and synthetic data sets have been conducted to show the effectiveness and efficiency of the proposed indexing mechanism.

83 citations


Book ChapterDOI
10 Sep 2012
TL;DR: The problem of secure outsourcing of sequence comparisons by a client to remote servers is treated: given two strings λ and μ of respective lengths n and m, find a minimum-cost sequence of insertions, deletions, and substitutions that transforms λ into μ.
Abstract: We treat the problem of secure outsourcing of sequence comparisons by a client to remote servers, which, given two strings λ and μ of respective lengths n and m, consists of finding a minimum-cost sequence of insertions, deletions, and substitutions (also called an edit script) that transforms λ into μ. In our setting a client owns λ and μ and outsources the computation to two servers without revealing to them information about either the input strings or the output sequence. Our solution is non-interactive for the client (who only sends information about the inputs and receives the output) and the client's work is linear in its input/output. The servers' performance is O(σmn) computation (which is optimal) and communication, where σ is the alphabet size, and the solution is designed to work when the servers have only O(σ(m + n)) memory. By utilizing garbled circuit evaluation in a novel way, we completely avoid public-key cryptography, which makes our solution particularly efficient.

77 citations


Proceedings ArticleDOI
01 Apr 2012
TL;DR: This paper studies the graph similarity join problem, which returns pairs of graphs whose edit distances are no larger than a threshold, and, inspired by the q-gram idea for the string similarity problem, extracts paths from graphs as features for indexing.
Abstract: Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources, such as erroneous data entry, and find similarity matches. In this paper, we study the graph similarity join problem that returns pairs of graphs such that their edit distances are no larger than a threshold. Inspired by the q-gram idea for the string similarity problem, our solution extracts paths from graphs as features for indexing. We establish a lower bound on the number of common features required to generate candidates. An efficient algorithm is proposed to exploit both matching and mismatching features to improve the filtering and verification of candidates. We demonstrate that the proposed algorithm significantly outperforms existing approaches in extensive experiments on publicly available datasets.

73 citations


Journal ArticleDOI
01 Jul 2012
TL;DR: This paper introduces a new pairwise distance measure, based on matching, for phylogenetic trees; proves that it induces a metric on the space of trees; shows how to compute it in low polynomial time; verifies through statistical testing that it is robust; and notes that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures.
Abstract: Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small changes: reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds distance.

73 citations


Journal ArticleDOI
TL;DR: Hobbes, a new gram-based program for aligning short reads that supports Hamming and edit distance, is presented and compared with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast, and RazerS.
Abstract: Recent advances in sequencing technology have enabled the rapid generation of billions of bases at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge due to the sheer amount of reads, and because many reads map to the reference sequence approximately but not exactly. We introduce Hobbes, a new gram-based program for aligning short reads, supporting Hamming and edit distance. Hobbes implements two novel techniques, which yield substantial performance improvements: an optimized gram-selection procedure for reads, and a cache-efficient filter for pruning candidate mappings. We systematically tested the performance of Hobbes on both real and simulated data with read lengths varying from 35 to 100 bp, and compared its performance with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast and RazerS. Hobbes is faster than all other read mapping programs we have tested while maintaining high mapping quality. Hobbes is about five times faster than Bowtie and about 2-10 times faster than BWA, depending on read length and error rate, when asked to find all mapping locations of a read in the human genome within a given Hamming or edit distance, respectively. Hobbes supports the SAM output format and is publicly available at http://hobbes.ics.uci.edu.
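The gram-based filtering alluded to here builds on a classic count bound (a generic q-gram filter, not Hobbes's exact procedure): a single edit operation destroys at most q overlapping q-grams, so two strings within edit distance k must share at least max(|s|, |t|) - q + 1 - k*q of them. A minimal sketch with made-up sequences:

```python
from collections import Counter

def qgrams(s, q=3):
    """Multiset of overlapping q-grams of s (no boundary padding, for simplicity)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def passes_count_filter(s, t, k, q=3):
    """Within edit distance k, s and t must share at least
    max(|s|, |t|) - q + 1 - k*q q-grams (multiset intersection)."""
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= max(len(s), len(t)) - q + 1 - k * q

# Hypothetical read and reference windows (placeholders, not real genomic data).
read = "ACGTTGCAACGT"
print(passes_count_filter(read, "ACGTTGCTACGT", k=1))   # True: candidate survives
print(passes_count_filter(read, "TTTTTTTTTTTT", k=1))   # False: pruned without dynamic programming
```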

70 citations


Journal ArticleDOI
01 Aug 2012
TL;DR: This paper designs efficient trie-join algorithms and pruning techniques to achieve high performance and shows that these algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
Abstract: A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they are expensive to support dynamic updates of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on data sets with short strings.
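To illustrate why a trie helps (this is the generic trie walk with shared dynamic-programming rows and threshold pruning, not the authors' trie-join algorithm): strings that share a prefix share the corresponding rows of the edit-distance matrix, so one traversal can answer a threshold query against an entire indexed collection. The dictionary strings, query, and threshold below are placeholders.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None              # set at the node where an indexed string ends

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, query, tau):
    """All indexed strings within edit distance tau of query, in one trie traversal."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # DP row for the current trie character, reusing the parent's row.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if node.word is not None and row[-1] <= tau:
            results.append((node.word, row[-1]))
        if min(row) <= tau:           # subtrie pruning: descendants cannot beat this row
            for c, child in node.children.items():
                walk(child, c, row)

    for c, child in root.children.items():
        walk(child, c, first_row)
    return results

trie = build_trie(["cat", "cart", "card", "dog"])
print(search(trie, "cast", tau=1))    # -> [('cat', 1), ('cart', 1)]
```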

Journal ArticleDOI
TL;DR: This work models the edit distance as a function in a labeling space and characterizes the distance value through the labeling space, finding some interesting properties that are useful for a better understanding of the edit distance.
Abstract: We model the edit distance as a function in a labeling space. A labeling space is a Euclidean space whose coordinates are the edit costs. Through this model, we define a class of cost: a region in the labeling space in which all the edit costs have the same optimal labeling. Moreover, we characterize the distance value through the labeling space. This new point of view of the edit distance gives us the opportunity to define some interesting properties that are useful for a better understanding of the edit distance. Finally, we show the usefulness of these properties through some applications.

Journal ArticleDOI
TL;DR: This is the first subpolynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art $n^{1/3+o(1)}$ approximation.
Abstract: We show how to compute the edit distance between two strings of length $n$ up to a factor of $2^{\tilde{O}(\sqrt{\log n})}$ in $n^{1+o(1)}$ time. This is the first subpolynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art $n^{1/3+o(1)}$ approximation. Previously, approximation of $2^{\tilde{O}(\sqrt{\log n})}$ was known only for embedding edit distance into $\ell_1$, and it is not known if that embedding can be computed in less than quadratic time.

Dissertation
19 Dec 2012
TL;DR: This thesis presents a method for reconstructing virtual models of plants from laser scans of real-world vegetation by combining a contraction phase and a local point tracking algorithm, and presents a quantitative evaluation procedure to compare the authors' reconstructions against expert-assessed structures of real plants.
Abstract: In the last decade, very realistic renderings of plant architectures have been produced in computer graphics applications. However, in the context of biology and agronomy, acquisition of accurate models of real plants is still a tedious task and a major bottleneck for the construction of quantitative models of plant development. Recently, 3D laser scanners made it possible to acquire 3D images in which each pixel has an associated depth corresponding to the distance between the scanner and the pinpointed surface of the object. Standard geometrical reconstructions fail on plant structures as they usually contain a complex set of discontinuous or branching surfaces distributed in space with varying orientations. In this thesis, we present a method for reconstructing virtual models of plants from laser scans of real-world vegetation. Measuring plants with laser scanners produces data with different levels of precision. Point sets are usually dense on the surface of the main branches, but only sparsely cover thin branches. The core of our method is to iteratively create the skeletal structure of the plant according to the local density of the point set. This is achieved with a method that locally adapts to the levels of precision of the data by combining a contraction phase and a local point tracking algorithm. In addition, we present a quantitative evaluation procedure to compare our reconstructions against expert-assessed structures of real plants. For this, we first explore the use of an edit distance between tree graphs. Alternatively, we formalize the comparison as an assignment problem to find the best matching between the two structures and quantify their differences.

Journal ArticleDOI
TL;DR: A context-based two-layer system is utilized to automatically correct misspelled words in large datasets, and performance improves as the size of the training set increases.
Abstract: This paper presents a stochastic approach to misspelling correction of Arabic text. In this approach, a context-based two-layer system is utilized to automatically correct misspelled words in large datasets. The first layer produces a list in which possible alternatives for each misspelled word are ranked using the Damerau-Levenshtein edit distance. The same layer also considers merged and split words resulting from deletion and insertion of the space character. The right alternative for each misspelled word is stochastically selected based on the maximum marginal probability via A* lattice search and m-gram probability estimation. A large dataset was utilized to build and test the system. The testing results show that as we increase the size of the training set, the performance improves, reaching an F1 score of 97.9% for detection and 92.3% for correction.
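For reference, the Damerau-Levenshtein distance used for ranking extends the Levenshtein operations with transposition of adjacent characters. Below is a minimal sketch of the common optimal-string-alignment variant (the merged/split-word handling and the probabilistic selection described above are not shown, and the example words are placeholders):

```python
def damerau_levenshtein(a, b):
    """Optimal-string-alignment variant of the Damerau-Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[m][n]

print(damerau_levenshtein("receive", "recieve"))  # -> 1 (one adjacent transposition)
```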

Journal ArticleDOI
TL;DR: This work proposes a new weight function, based on a probability model, to match the observed outcomes of a dual LPR setup, together with new editing constraints, defined as a function of the string lengths, that avoid compensating for reversal errors.
Abstract: License-plate recognition (LPR) technology has been widely applied in many different transportation applications such as enforcement, vehicle monitoring, and access control. Recently, there has been effort to exploit an LPR database for vehicle tracking using popular template matching procedures. Existing template matching procedures assume that the true reference string is always available. However, under a two-point LPR survey, a vehicle could have its plate misread at both locations, generating a pair of misread strings (or templates) with no reference for matching. To compensate for the LPR misreading problem, we propose a new weight function based on a probability model to match the observed outcomes of a dual LPR setup. Also, considering that reversal errors are never made by LPR machines, new editing constraints as a function of the string lengths are proposed to avoid compensation for reversal errors. These editing constraints are incorporated into the constrained edit distance formulation to improve the performance of the matching procedure. Finally, considering that previous template matching procedures do not take advantage of passage time information available in LPR databases, we present an online tracking procedure that considers the properties of the probability distribution of vehicle journey times in order to increase the probability of correct matches. Experimental results show that our proposed procedure can improve the accuracy of LPR systems and achieve up to 97% positive matches with no false matches. Further research is needed to extend the ideas proposed herein to plate matching with multiple (i.e., more than two) LPR units.

Journal ArticleDOI
TL;DR: This work presents a word spotting method for scanned documents in order to find the word images that are similar to a query word, without assuming a correct segmentation of the words into characters.

Proceedings Article
01 May 2012
TL;DR: This work creates an adequate, open-source and large-coverage word list for Arabic containing 9,000,000 fully inflected surface words and creates a character-based tri-gram language model to approximate knowledge about permissible character clusters in Arabic, creating a novel method for detecting spelling errors.
Abstract: Arabic is a language known for its rich and complex morphology. Although many research projects have focused on the problem of Arabic morphological analysis using different techniques and approaches, very few have addressed the issue of generation of fully inflected words for the purpose of text authoring. Available open-source spell checking resources for Arabic are too small and inadequate. Ayaspell, for example, the official resource used with OpenOffice applications, contains only 300,000 fully inflected words. We try to bridge this critical gap by creating an adequate, open-source and large-coverage word list for Arabic containing 9,000,000 fully inflected surface words. Furthermore, from a large list of valid forms and invalid forms we create a character-based tri-gram language model to approximate knowledge about permissible character clusters in Arabic, creating a novel method for detecting spelling errors. Testing of this language model gives a precision of 98.2% at a recall of 100%. We take our research a step further by creating a context-independent spelling correction tool using a finite-state automaton that measures the edit distance between input words and candidate corrections, the Noisy Channel Model, and knowledge-based rules. Our system performs significantly better than Hunspell in choosing the best solution, but it is still below the MS Spell Checker.
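As a rough illustration of the permissible-character-cluster idea (a crude set-membership stand-in for the paper's probabilistic character tri-gram model, shown with toy English words rather than Arabic): collect the character trigrams observed in a list of valid forms and flag any word containing a trigram that was never observed.

```python
from collections import Counter

def char_trigrams(word):
    padded = f"^^{word}$$"            # boundary markers
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(valid_words):
    """Count the character trigrams observed in a list of valid forms."""
    model = Counter()
    for w in valid_words:
        model.update(char_trigrams(w))
    return model

def looks_misspelled(word, model):
    """Flag a word if any of its character trigrams was never observed in valid forms."""
    return any(model[g] == 0 for g in char_trigrams(word))

# Toy English stand-in for the large Arabic word list (placeholder data).
model = train(["cat", "cart", "care", "card", "car"])
print(looks_misspelled("card", model))   # False: all trigrams attested
print(looks_misspelled("cqrd", model))   # True: '^cq' (among others) never observed
```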

Journal ArticleDOI
TL;DR: This paper proposes an approach to edit similarity learning based on loss minimization, called GESL, driven by the notion of (ϵ,γ,τ)-goodness, a theory that bridges the gap between the properties of a similarity function and its performance in classification.
Abstract: Similarity functions are a fundamental component of many learning algorithms. When dealing with string or tree-structured data, measures based on the edit distance are widely used, and there exist a few methods for learning them from data. However, these methods offer no theoretical guarantee as to the generalization ability and discriminative power of the learned similarities. In this paper, we propose an approach to edit similarity learning based on loss minimization, called GESL. It is driven by the notion of (ϵ,γ,τ)-goodness, a theory that bridges the gap between the properties of a similarity function and its performance in classification. Using the notion of uniform stability, we derive generalization guarantees that hold for a large class of loss functions. We also provide experimental results on two real-world datasets which show that edit similarities learned with GESL induce more accurate and sparser classifiers than other (standard or learned) edit similarities.

Journal ArticleDOI
TL;DR: This work exploits a classical embedding of the edit distance into the Hamming distance to enable some flexibility on the tolerated edit distance when looking for close keywords while preserving the confidentiality of the queries.
Abstract: Our work is focused on fuzzy keyword search over encrypted data in Cloud Computing. We adapt results on private identification schemes by Bringer et al. to this new context. We exploit a classical embedding of the edit distance into the Hamming distance. Our approach enables some flexibility in the tolerated edit distance when looking for close keywords while preserving the confidentiality of the queries. Our proposal is proved secure in a security model that takes privacy into account.

Journal ArticleDOI
19 Dec 2012
TL;DR: It is proved that computing the edit distance is equivalent to finding the optimal cycle decomposition of the corresponding adjacency graph, and an approximation algorithm with an approximation ratio of 1.5 + ε is given.
Abstract: Computing the edit distance between two genomes under certain operations is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be easily computed for genomes without duplicate genes. In this paper, we study the edit distance for genomes with duplicate genes under a model that includes DCJ operations, insertions and deletions. We prove that computing the edit distance is equivalent to finding the optimal cycle decomposition of the corresponding adjacency graph, and give an approximation algorithm with an approximation ratio of 1.5 + ε.

Book ChapterDOI
10 Sep 2012
TL;DR: A new conceptual framework is suggested, and new questions, which might be of independent interest, are raised regarding the embeddability of edit distance into the Hamming cube.
Abstract: In this paper we present two communication protocols on computing edit distance. In our first result, we give a one-way protocol for the following Document Exchange problem. Namely, given x ∈ Σ^n to Alice and y ∈ Σ^n to Bob and an integer k to both, Alice sends a message to Bob so that he learns x or truthfully reports that the edit distance between x and y is greater than k. For this problem, we give a randomized protocol in which Alice transmits at most $\tilde{O}(k\log^2 n)$ bits and each party's time complexity is $\tilde{O}(n\log n +k^2\log^2n)$. Our second result is a simultaneous protocol for edit distance over permutations. Here Alice and Bob both send a message to a third party (the referee) who does not have access to the input strings. Given the messages, the referee decides if the edit distance between x and y is at most k or not. For this problem we give a protocol in which Alice and Bob run an O(n log n)-time algorithm and transmit at most $\tilde{O}(k\log^2 n)$ bits. The running time of the referee is bounded by $\tilde{O}(k^2\log^2n)$. To our knowledge, this result is the first upper bound for this problem. Our results are obtained through mapping strings to the Hamming cube. For this, we use the Locally Consistent Parsing method of [5,6] in combination with Karp-Rabin fingerprints. In addition to yielding non-trivial bounds for the edit distance problem, this paper suggests a new conceptual framework and raises new questions regarding the embeddability of edit distance into the Hamming cube which might be of independent interest.

Proceedings Article
01 Dec 2012
TL;DR: This work semi-automatically develops a dictionary of 9.3 million fully inflected Arabic words using a morphological transducer and a large corpus and improves the error model and language model.
Abstract: A spelling error detection and correction application is based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We semi-automatically develop a dictionary of 9.3 million fully inflected Arabic words using a morphological transducer and a large corpus. We improve the error model by analysing error types and creating an edit distance based re-ranker. We also improve the language model by analysing the level of noise in different sources of data and selecting the optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2010, OpenOffice Ayaspell and Google Docs.

Book ChapterDOI
09 Jul 2012
TL;DR: The key observation that the empirical entropy of a string does not change much after a small change to the string, as well as the simple yet efficient method for maintaining an array of variable-length blocks under length modifications, may be useful for many other applications as well.
Abstract: We present a new data structure called the Compressed Random Access Memory (CRAM) that can store a dynamic string T of characters, e.g., representing the memory of a computer, in compressed form while achieving asymptotically almost-optimal bounds (in terms of empirical entropy) on the compression ratio. It allows short substrings of T to be decompressed and retrieved efficiently and, significantly, characters at arbitrary positions of T to be modified quickly during execution without decompressing the entire string. This can be regarded as a new type of data compression that can update a compressed file directly. Moreover, at the cost of slightly increasing the time spent per operation, the CRAM can be extended to also support insertions and deletions. Our key observation that the empirical entropy of a string does not change much after a small change to the string, as well as our simple yet efficient method for maintaining an array of variable-length blocks under length modifications, may be useful for many other applications as well.

Proceedings Article
23 Apr 2012
TL;DR: A principled protocol for evaluating parsing results across frameworks based on function trees, tree generalization and edit distance metrics is presented, which extends a previously proposed framework for cross-theory evaluation and allows us to compare a wider class of parsers.
Abstract: A serious bottleneck of comparative parser evaluation is the fact that different parsers subscribe to different formal frameworks and theoretical assumptions. Converting outputs from one framework to another is less than optimal as it easily introduces noise into the process. Here we present a principled protocol for evaluating parsing results across frameworks based on function trees, tree generalization and edit distance metrics. This extends a previously proposed framework for cross-theory evaluation and allows us to compare a wider class of parsers. We demonstrate the usefulness and language independence of our procedure by evaluating constituency and dependency parsers on English and Swedish.

Proceedings ArticleDOI
04 Jun 2012
TL;DR: An original clone detection technique that accurately approximates the Levenshtein distance is presented; it uses groups of tokens extracted from source code, called windowed-tokens, from which frequency vectors are constructed and compared with the Manhattan distance in a metric tree.
Abstract: This paper presents an original clone detection technique which is an accurate approximation of the Levenshtein distance. It uses groups of tokens extracted from source code called windowed-tokens. From these, frequency vectors are constructed and compared with the Manhattan distance in a metric tree. The goal of this new technique is to provide a very high precision clone detection technique while keeping a high recall. Precision and recall are measured with respect to the Levenshtein distance. The testbench is a large-scale open-source software system. The collected results show the technique to be fast, simple, and accurate. Finally, this article presents further research opportunities.
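A rough sketch of the frequency-vector comparison (the exact windowed-token construction and the metric-tree index are not reproduced here): slide a window over a token stream, count token occurrences per window, and compare windows with the Manhattan (L1) distance. The token streams and window size are placeholders.

```python
from collections import Counter

def windowed_token_vectors(tokens, window=5):
    """Frequency vector (token -> count) for each sliding window over a token stream."""
    return [Counter(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]

def manhattan(u, v):
    """L1 distance between two sparse frequency vectors."""
    return sum(abs(u.get(k, 0) - v.get(k, 0)) for k in set(u) | set(v))

# Hypothetical token streams from two code fragments (placeholders).
frag_a = ["for", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++"]
frag_b = ["for", "j", "=", "0", ";", "j", "<", "m", ";", "j", "++"]

va = windowed_token_vectors(frag_a)[0]
vb = windowed_token_vectors(frag_b)[0]
print(manhattan(va, vb))  # -> 2: a small distance suggests a near-clone
```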

Proceedings ArticleDOI
01 Apr 2012
TL;DR: This paper proposes a trie-based method to address dictionary-based approximate entity extraction with edit-distance constraints that achieves much higher performance compared with state-of-the-art studies.
Abstract: Dictionary-based entity extraction has attracted much attention from the database community recently; it locates substrings of a document that match predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. First, they need to tune many parameters to achieve high performance. Second, they are inefficient for large edit-distance thresholds. We propose a trie-based method to address these problems. We first partition each entity into a set of segments, and then use a trie structure to index the segments. To extract similar entities, we search for segments in the document, and extend the matching segments in both entities and the document to find similar pairs. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. Experimental results show that our method achieves much higher performance compared with state-of-the-art studies.
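The segment-based filtering rests on a pigeonhole argument: if an entity is partitioned into τ+1 disjoint segments, any document substring within edit distance τ of the entity must contain at least one of those segments verbatim. Below is a minimal sketch of that filter with a naive verification pass (the even partition scheme, example dictionary, and document are placeholder assumptions, not the paper's optimized algorithm):

```python
def partition(entity, tau):
    """Split an entity into tau+1 contiguous segments of (almost) equal length."""
    parts = tau + 1
    k, r = divmod(len(entity), parts)
    segs, pos = [], 0
    for i in range(parts):
        size = k + (1 if i < r else 0)
        segs.append(entity[pos:pos + size])
        pos += size
    return segs

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def extract(document, entities, tau):
    """Approximate occurrences of dictionary entities in the document (naive verification)."""
    matches = []
    for e in entities:
        # Pigeonhole filter: if no segment occurs verbatim, no substring can be within tau.
        if not any(seg in document for seg in partition(e, tau)):
            continue
        for length in range(len(e) - tau, len(e) + tau + 1):
            for start in range(len(document) - length + 1):
                sub = document[start:start + length]
                if edit_distance(e, sub) <= tau:
                    matches.append((e, sub, start))
    return matches

doc = "met dr jon smith at the clinic"           # hypothetical document text
print(extract(doc, ["john smith", "mary jones"], tau=1))
# -> [('john smith', 'jon smith', 7)]
```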

Journal ArticleDOI
TL;DR: The smoothed edit distance is effectively reduced to a simpler variant of (worst-case) edit distance, namely edit distance on permutations (a.k.a. Ulam's metric), building on algorithms developed for the Ulam metric.
Abstract: We initiate the study of the smoothed complexity of sequence alignment by proposing a semi-random model of edit distance between two input strings, generated as follows: first, an adversary chooses two binary strings of length d and a longest common subsequence A of them; then, every character is perturbed independently with probability p, except that A is perturbed in exactly the same way inside the two strings. We design two efficient algorithms that compute the edit distance on smoothed instances up to a constant-factor approximation. The first algorithm runs in near-linear time, namely d^{1+ε} for any fixed ε > 0. The second one runs in time sublinear in d, assuming the edit distance is not too small. These approximation and runtime guarantees are significantly better than the bounds known for worst-case inputs. Our technical contribution is twofold. First, we rely on finding matches between substrings in the two strings, where two substrings are considered a match if their edit distance is relatively small, a prevailing technique in commonly used heuristics, such as PatternHunter of Ma et al. [2002]. Second, we effectively reduce the smoothed edit distance to a simpler variant of (worst-case) edit distance, namely, edit distance on permutations (a.k.a. Ulam's metric). We are thus able to build on algorithms developed for the Ulam metric, whose much better algorithmic guarantees usually do not carry over to general edit distance.

Proceedings Article
07 Jun 2012
TL;DR: This paper describes Stanford University's submission to the Shared Evaluation Task of WMT 2012: the proposed metric (SPEDE) computes probabilistic edit distance as a prediction of translation quality, using a novel pushdown automaton extension of the pFSM model.
Abstract: This paper describes Stanford University's submission to the Shared Evaluation Task of WMT 2012. Our proposed metric (SPEDE) computes probabilistic edit distance as predictions of translation quality. We learn weighted edit distance in a probabilistic finite state machine (pFSM) model, where state transitions correspond to edit operations. While standard edit distance models cannot capture long-distance word swapping or cross alignments, we rectify these shortcomings using a novel pushdown automaton extension of the pFSM model. Our models are trained in a regression framework, and can easily incorporate a rich set of linguistic features. Evaluated on two different prediction tasks across a diverse set of datasets, our methods achieve state-of-the-art correlation with human judgments.

Journal ArticleDOI
04 Jun 2012-PLOS ONE
TL;DR: This paper aims to present a new genetic approach that uses rank distance for solving two known NP-hard problems, and to compare rank distance with other distance measures for strings.
Abstract: This paper aims to present a new genetic approach that uses rank distance for solving two known NP-hard problems, and to compare rank distance with other distance measures for strings. The two NP-hard problems we are trying to solve are closest string and closest substring. For each problem we build a genetic algorithm and we describe the genetic operations involved. Both genetic algorithms use a fitness function based on rank distance. We compare our algorithms with other genetic algorithms that use different distance measures, such as Hamming distance or Levenshtein distance, on real DNA sequences. Our experiments show that the genetic algorithms based on rank distance have the best results.