
Showing papers on "Edit distance published in 2014"


Posted Content
TL;DR: In this article, it was shown that if the edit distance (the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another) could be computed in time $O(n^{2-\delta})$ for some constant $\delta > 0$, then the Strong Exponential Time Hypothesis would be violated.
Abstract: The edit distance (a.k.a. the Levenshtein distance) between two strings is defined as the minimum number of insertions, deletions or substitutions of symbols needed to transform one string into another. The problem of computing the edit distance between two strings is a classical computational task, with a well-known algorithm based on dynamic programming. Unfortunately, all known algorithms for this problem run in nearly quadratic time. In this paper we provide evidence that the near-quadratic running time bounds known for the problem of computing edit distance might be tight. Specifically, we show that, if the edit distance can be computed in time $O(n^{2-\delta})$ for some constant $\delta>0$, then the satisfiability of conjunctive normal form formulas with $N$ variables and $M$ clauses can be solved in time $M^{O(1)} 2^{(1-\epsilon)N}$ for a constant $\epsilon>0$. The latter result would violate the Strong Exponential Time Hypothesis, which postulates that such algorithms do not exist.

192 citations
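
For reference, a minimal sketch (not taken from the paper) of the classical dynamic program the abstract refers to; it runs in O(nm) time, illustrating the near-quadratic behavior that the hardness result suggests cannot be substantially improved:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the textbook O(len(a) * len(b)) dynamic program."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # distances between "" and prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # distance between a[:i] and ""
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3
```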


Journal ArticleDOI
TL;DR: A new algorithm to compute the Graph Edit Distance in a sub-optimal way is presented, and it is demonstrated that the distance value is exactly the same as the one obtained by the algorithm called Bipartite, but with a reduced run time.

129 citations


Patent
30 Sep 2014
TL;DR: In this paper, an exemplar-based natural language processing (NLP) system is presented, and a semantic edit distance between the first text phrase and the second text phrase in a semantic space can be determined based on one or more of the insertion cost, the deletion cost, and the substitution cost.
Abstract: Systems and processes for exemplar-based natural language processing are provided. In one example process, a first text phrase can be received. It can be determined whether editing the first text phrase to match a second text phrase requires one or more of inserting, deleting, and substituting a word of the first text phrase. In response to determining that editing the first text phrase to match the second text phrase requires one or more of inserting, deleting, and substituting a word of the first text phrase, one or more of an insertion cost, a deletion cost, and a substitution cost can be determined. A semantic edit distance between the first text phrase and the second text phrase in a semantic space can be determined based on one or more of the insertion cost, the deletion cost, and the substitution cost.

112 citations
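
As an illustration of the general idea (not the patent's actual implementation), a word-level weighted edit distance in which the insertion, deletion and substitution cost functions are placeholders that a semantic model (e.g. embedding distances) could supply:

```python
def semantic_edit_distance(x, y, ins_cost, del_cost, sub_cost):
    """Word-level weighted edit distance; the cost callables stand in for
    semantic-space costs (e.g. derived from word embeddings)."""
    m, n = len(x), len(y)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + del_cost(x[i - 1]),                 # delete word of x
                           dp[i][j - 1] + ins_cost(y[j - 1]),                 # insert word of y
                           dp[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))   # substitute
    return dp[m][n]

# Toy costs: unit insertion/deletion, substitution cost 0 for identical words, else 1.
d = semantic_edit_distance("turn on the light".split(), "switch on the lamp".split(),
                           ins_cost=lambda w: 1.0, del_cost=lambda w: 1.0,
                           sub_cost=lambda a, b: 0.0 if a == b else 1.0)
print(d)  # 2.0
```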


Journal ArticleDOI
TL;DR: The Spatio-temporal Edit Distance measure is developed, an extended algorithm to determine the similarity between user trajectories based on call detail records (CDRs); it performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.
Abstract: The rapid development of information and communication technologies (ICTs) has provided rich data sources for analyzing, modeling, and interpreting human mobility patterns. This paper contributes to this research area by developing the Spatio-temporal Edit Distance measure, an extended algorithm to determine the similarity between user trajectories based on call detail records (CDRs). We improve the traditional Edit Distance algorithm by incorporating both spatial and temporal information into the cost functions. The extended algorithm can preserve both space and time information from string-formatted CDR data. The novel method is applied to a large data set from Northeast China in order to test its effectiveness. Three types of analyses are presented for scenarios with and without the effect of time: (1) Edit Distance with spatial information; (2) Edit Distance with time as a factor in the cost function; and (3) Edit Distance with time as a constraint in partitioning trajectories. The outcomes of this research contribute to both methodological and empirical perspectives. The extended algorithm performs well for measuring low-resolution tracking information in CDRs, as well as facilitating the interpretation of user mobility patterns in the age of instant access.

84 citations


Book ChapterDOI
01 May 2014
TL;DR: This work defines a notion of a distance between merge trees, which is compared to other ways used to characterize topological similarity (bottleneck distance for persistence diagrams) and numerical difference (the L∞-norm of the difference between functions).
Abstract: Merge trees represent the topology of scalar functions. To assess the topological similarity of functions, one can compare their merge trees. To do so, one needs a notion of a distance between merge trees, which we define. We provide examples of using our merge tree distance and compare this new measure to other ways used to characterize topological similarity (bottleneck distance for persistence diagrams) and numerical difference (the L∞-norm of the difference between functions).

77 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: This work proposes a novel pivotal prefix filter which significantly reduces the number of signatures and develops a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query.
Abstract: We study the string similarity search problem with edit-distance constraints, which, given a set of data strings and a query string, finds the similar strings to the query. Existing algorithms use a signature-based framework. They first generate signatures for each string and then prune the dissimilar strings which have no common signatures to the query. However existing methods involve large numbers of signatures and many signatures are unnecessary. Reducing the number of signatures not only increases the pruning power but also decreases the filtering cost. To address this problem, we propose a novel pivotal prefix filter which significantly reduces the number of signatures. We prove the pivotal filter achieves larger pruning power and less filtering cost than state-of-the-art filters. We develop a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query. We propose an alignment filter that considers the alignments between signatures to prune large numbers of dissimilar pairs with consecutive errors to the query. Experimental results on three real datasets show that our method achieves high performance and outperforms the state-of-the-art methods by an order of magnitude.

65 citations
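
A toy sketch of the filter-and-verify pattern the abstract builds on, using only the simple length filter rather than the paper's pivotal prefix signatures; the names and thresholds are illustrative:

```python
from functools import lru_cache

def edit_distance(a: str, b: str) -> int:
    # Compact memoised Levenshtein distance, sufficient for verifying short candidates.
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

def similarity_search(query: str, strings: list, tau: int):
    """Filter-and-verify: strings whose length differs from the query by more than
    tau cannot be within edit distance tau and are pruned; survivors are verified
    exactly. Real systems replace the length filter with far stronger signature
    filters such as the pivotal prefix filter."""
    for s in strings:
        if abs(len(s) - len(query)) > tau:
            continue                      # pruned without verification
        if edit_distance(query, s) <= tau:
            yield s

print(list(similarity_search("surgery", ["surgeon", "surgery", "srgery", "allergy"], tau=2)))
# ['surgeon', 'surgery', 'srgery']
```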


Journal ArticleDOI
TL;DR: This paper investigates four representative time-series distance/similarity measures based on dynamic programming, namely Dynamic Time Warping, Longest Common Subsequence, Edit distance with Real Penalty (ERP) and Edit Distance on Real sequence (EDR), and the effects of global constraints on them when applied via the Sakoe-Chiba band.
Abstract: A time series consists of a series of values or events obtained over repeated measurements in time. Analysis of time series represents an important tool in many application areas, such as stock-market analysis, process and quality control, observation of natural phenomena, and medical diagnosis. A vital component in many types of time-series analyses is the choice of an appropriate distance/similarity measure. Numerous measures have been proposed to date, with the most successful ones based on dynamic programming. Since these measures are of quadratic time complexity, however, global constraints are often employed to limit the search space in the matrix during the dynamic programming procedure, in order to speed up computation. Furthermore, it has been reported that such constrained measures can also achieve better accuracy. In this paper, we investigate four representative time-series distance/similarity measures based on dynamic programming, namely Dynamic Time Warping (DTW), Longest Common Subsequence (LCS), Edit distance with Real Penalty (ERP) and Edit Distance on Real sequence (EDR), and the effects of global constraints on them when applied via the Sakoe-Chiba band. To better understand the influence of global constraints and provide deeper insight into their advantages and limitations, we explore the change of the 1-nearest neighbor graph with respect to the change of the constraint size. Also, we examine how these changes reflect on the classes of the nearest neighbors of time series, and evaluate the performance of the 1-nearest neighbor classifier with respect to different distance measures and constraints. Since we determine that constraints introduce qualitative differences in all considered measures, and that different measures are affected by constraints in various ways, we expect our results to aid researchers and practitioners in selecting and tuning appropriate time-series similarity measures for their respective tasks.

49 citations
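
A minimal sketch (not the paper's code) of one of the constrained measures discussed: DTW restricted to a Sakoe-Chiba band, where only cells with |i − j| ≤ band are computed:

```python
import math

def dtw_sakoe_chiba(x, y, band):
    """Dynamic Time Warping restricted to a Sakoe-Chiba band of half-width `band`:
    only cells with |i - j| <= band are evaluated, which limits the search space of
    the quadratic dynamic program. Requires band >= |len(x) - len(y)| to stay finite."""
    n, m = len(x), len(y)
    INF = math.inf
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

print(dtw_sakoe_chiba([1, 2, 3, 4, 3], [1, 1, 2, 3, 4], band=2))  # 1.0
```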


Proceedings ArticleDOI
01 Aug 2014
TL;DR: Team UWM’s system for Task 7 of SemEval 2014, which does disorder mention extraction and normalization from clinical text, is described; it ranked third in Task A with 0.755 strict F-measure and second in Task B with 0.66 strict accuracy.
Abstract: This paper describes Team UWM’s system for Task 7 of SemEval 2014 that does disorder mention extraction and normalization from clinical text. For the disorder mention extraction (Task A), the system was trained using Conditional Random Fields with features based on words, their POS tags and semantic types, as well as features based on MetaMap matches. For the disorder mention normalization (Task B), variations of disorder mentions were considered whenever exact matches were not found in the training data or in the UMLS. Suitable types of variations for disorder mentions were automatically learned using a new method based on edit distance patterns. Among nineteen participating teams, UWM ranked third in Task A with 0.755 strict F-measure and second in Task B with 0.66 strict accuracy.

47 citations


Journal ArticleDOI
Tony Rees1
23 Sep 2014-PLOS ONE
TL;DR: Taxamatch is described, an improved name matching solution for this information domain that employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms n-grams and standard edit distance.
Abstract: Misspellings of organism scientific names create barriers to optimal storage and organization of biological data, reconciliation of data stored under different spelling variants of the same name, and appropriate responses from user queries to taxonomic data systems. This study presents an analysis of the nature of the problem from first principles, reviews some available algorithmic approaches, and describes Taxamatch, an improved name matching solution for this information domain. Taxamatch employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms n-grams (as bigrams and trigrams) and standard edit distance. Although entirely phonetic methods are faster than Taxamatch, they are inferior in the area of recall since many real-world errors are non-phonetic in nature. Excellent performance of Taxamatch (as recall, precision and execution time) is demonstrated against a reference database of over 465,000 genus names and 1.6 million species names, as well as against a range of error types as present at both genus and species levels in three sets of sample data for species and four for genera alone. An ancillary authority matching component is included which can be used both for misspelled names and for otherwise matching names where the associated cited authorities are not identical.

42 citations
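
For orientation, a sketch of the restricted (optimal string alignment) variant of Damerau-Levenshtein distance, which adds adjacent transpositions to the usual operations; Taxamatch's custom modification layers further domain-specific rules and heuristic filters on top, which are not reproduced here:

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal-string-alignment variant of Damerau-Levenshtein distance:
    insertions, deletions, substitutions, plus adjacent transpositions."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # adjacent transposition
    return dp[m][n]

print(osa_distance("Teucrium", "Teucirum"))  # 1 (a single adjacent transposition)
```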


Proceedings ArticleDOI
12 Jul 2014
TL;DR: An evolutionary algorithm and software tool, GEDEVO-M, are proposed and developed that are able to align multiple PPI networks using topological information only, providing great potential for computing the 'core interactome' of different species.
Abstract: Motivation: We address the problem of multiple protein-protein interaction (PPI) network alignment. Given a set of such networks for different species we might ask how much the network topology is conserved throughout evolution. Solving this problem will help to derive a subset of interactions that is conserved over multiple species thus forming a 'core interactome'. Methods: We model the problem as Topological Multiple one-to-one Network Alignment (TMNA), where we aim to minimize the total Graph Edit Distance (GED) between pairs of the input networks. Here, the GED between two graphs is the number of deleted and inserted edges that are required to make one graph isomorphic to another. By minimizing the GED we indirectly maximize the number of edges that are aligned in multiple networks simultaneously. However, computing an optimal GED value is computationally intractable. We thus propose an evolutionary algorithm and developed a software tool, GEDEVO-M, which is able to align multiple PPI networks using topological information only. We demonstrate the power of our approach by computing a maximal common subnetwork for a set of bacterial and eukaryotic PPI networks. GEDEVO-M thus provides great potential for computing the 'core interactome' of different species. Availability: http://gedevo.mpi-inf.mpg.de/multiple-network-alignment/.

40 citations
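
To make the objective concrete, a small sketch computing the edge-only graph edit distance induced by a fixed one-to-one node mapping (the quantity GEDEVO-M minimizes, here evaluated for a given mapping rather than searched for with the paper's evolutionary algorithm); the example graphs and mapping are made up:

```python
def edge_ged_under_mapping(edges1, edges2, mapping):
    """Number of edge deletions and insertions needed to make graph 1 match graph 2
    under a fixed one-to-one node mapping. Assumes simple undirected graphs
    without self-loops."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    mapped = {frozenset({mapping[u], mapping[v]})
              for u, v in e1 if u in mapping and v in mapping}
    preserved = len(mapped & e2)
    return (len(e1) - preserved) + (len(e2) - preserved)

# Toy example: a triangle vs a path, with a node mapping given up front.
print(edge_ged_under_mapping([("a", "b"), ("b", "c"), ("a", "c")],
                             [(1, 2), (2, 3)],
                             {"a": 1, "b": 2, "c": 3}))  # 1: the edge a-c must be deleted
```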


Proceedings ArticleDOI
01 Jun 2014
TL;DR: This work shows how to construct and train a probabilistic finite-state transducer that computes stochastic contextual edit distance and model typos found in social media text to illustrate the improvement from conditioning on context.
Abstract: String similarity is most often measured by weighted or unweighted edit distance d(x, y). Ristad and Yianilos (1998) defined stochastic edit distance—a probability distribution p(y | x) whose parameters can be trained from data. We generalize this so that the probability of choosing each edit operation can depend on contextual features. We show how to construct and train a probabilistic finite-state transducer that computes our stochastic contextual edit distance. To illustrate the improvement from conditioning on context, we model typos found in social media text.

Proceedings ArticleDOI
18 Oct 2014
TL;DR: This paper provides the very first near-linear time algorithm to tightly approximate the DYCK(s) language edit distance problem for any arbitrary s and shows that the framework for efficiently approximating edit distance to DYCK(s) can be utilized for many other languages.
Abstract: Given a string σ over alphabet Σ and a grammar G defined over the same alphabet, what is the minimum number of repairs (insertions, deletions and substitutions) required to map σ into a valid member of G? The seminal work of Aho and Peterson in 1972 initiated the study of this language edit distance problem, providing a dynamic programming algorithm for context free languages that runs in $O(|G|^2 n^3)$ time, where n is the string length and |G| is the grammar size. While later improvements reduced the running time to $O(|G| n^3)$, the cubic time complexity in the input length has remained a major bottleneck for applying these algorithms to their multitude of applications. In this paper, we study the language edit distance problem for a fundamental context free language, DYCK(s), representing the language of well-balanced parentheses of s different types, which has been pivotal in the development of formal language theory. We provide the very first near-linear time algorithm to tightly approximate the DYCK(s) language edit distance problem for any arbitrary s. DYCK(s) language edit distance significantly generalizes the well-studied string edit distance problem, and appears in most applications of language edit distance, ranging from data quality in databases and generating automated error-correcting parsers in compiler optimization to structure prediction problems in biological sequences. Its nondeterministic counterpart is known as the hardest context free language. Our main result is an algorithm for edit distance computation to DYCK(s) for any positive integer s that runs in $O(n^{1+\epsilon} \mathrm{polylog}(n))$ time and achieves an approximation factor of $O(\frac{1}{\epsilon}\beta(n)\log|OPT|)$, for any $\epsilon > 0$. Here OPT is the optimal edit distance to DYCK(s) and $\beta(n)$ is the best approximation factor known for the simpler problem of string edit distance running in analogous time. If we allow $O(n^{1+\epsilon} + |OPT|^2 n^{\epsilon})$ time, then the approximation factor can be reduced to $O(\frac{1}{\epsilon}\log|OPT|)$. Since the best known near-linear time algorithm for the string edit distance problem has $\beta(n) = \mathrm{polylog}(n)$, under the near-linear time computation model both the DYCK(s) language and string edit distance problems have polylog(n) approximation factors. This comes as a surprise since the former is a significant generalization of the latter and their exact computations via dynamic programming show a stark difference in time complexity. Rather less surprisingly, we show that the framework for efficiently approximating edit distance to DYCK(s) can be utilized for many other languages. We illustrate this by considering various memory checking languages (studied extensively under distributed verification) such as stack, queue, PQ and DEQUE, which comprise valid transcripts of stacks, queues, priority queues and double-ended queues respectively. Therefore, any language that can be recognized by these data structures can also be repaired efficiently by our algorithm.
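
As a tiny illustration of what "edit distance to a language" means (far simpler than the paper's near-linear approximation for general DYCK(s)), the special case of DYCK(1) with insertions and deletions only can be solved by counting unmatched parentheses; allowing substitutions, as the general problem does, can only lower this value:

```python
def dyck1_insert_delete_repairs(s: str) -> int:
    """Minimum number of single-parenthesis insertions/deletions needed to turn s
    into a balanced string over one parenthesis type: every ')' with no open '('
    to its left and every '(' still open at the end requires one repair."""
    open_count = 0   # currently unmatched '('
    repairs = 0      # unmatched ')' seen so far
    for ch in s:
        if ch == '(':
            open_count += 1
        elif ch == ')':
            if open_count:
                open_count -= 1
            else:
                repairs += 1
    return repairs + open_count

print(dyck1_insert_delete_repairs("(()))("))  # 2
```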

Journal ArticleDOI
TL;DR: Simulation experiments show that, with regard to classification accuracy, TWED performs very well over all measures, while SSD is the best linear measure; SSD has the lowest run-times, and the fastest nonlinear measure is DTW.

Journal ArticleDOI
TL;DR: It is proved that the optimal partitioning of the data set is an NP-hard problem, and therefore a heuristic approach for selecting the reference strings greedily is proposed and an optimal partition assignment strategy is presented to minimize the expected number of strings that need to be verified during the query evaluation.
Abstract: Edit distance is widely used for measuring the similarity between two strings. As a primitive operation, edit distance based string similarity search is to find strings in a collection that are similar to a given query string using edit distance. Existing approaches for answering such string similarity queries follow the filter-and-verify framework by using various indexes. Typically, most approaches assume that indexes and data sets are maintained in main memory. To overcome this limitation, in this paper, we propose B+-tree based approaches to answer edit distance based string similarity queries, and hence, our approaches can be easily integrated into existing RDBMSs. In general, we answer string similarity search using pruning techniques employed in the metric space in that edit distance is a metric. First, we split the string collection into partitions according to a set of reference strings. Then, we index strings in all partitions using a single B+-tree based on the distances of these strings to their corresponding reference strings. Finally, we propose two approaches to efficiently answer range and KNN queries, respectively, based on the B+-tree. We prove that the optimal partitioning of the data set is an NP-hard problem, and therefore propose a heuristic approach for selecting the reference strings greedily and present an optimal partition assignment strategy to minimize the expected number of strings that need to be verified during the query evaluation. Through extensive experiments over a variety of real data sets, we demonstrate that our B+-tree based approaches provide superior performance over state-of-the-art techniques on both range and KNN queries in most cases.
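
A sketch of the metric-space pruning rule that underlies this approach (the B+-tree organisation itself is omitted); the reference string, data and threshold are illustrative:

```python
from functools import lru_cache

def edit_distance(a: str, b: str) -> int:
    # Compact memoised Levenshtein distance.
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

def build_index(strings, reference):
    """Precompute each string's distance to a reference string (the paper stores
    such distances, per partition, in a single B+-tree)."""
    return {s: edit_distance(s, reference) for s in strings}

def range_query(query, index, reference, tau):
    """Triangle inequality in the edit-distance metric: if |d(q,r) - d(s,r)| > tau
    then d(q,s) > tau, so s is pruned without computing d(q,s)."""
    dqr = edit_distance(query, reference)
    return [s for s, dsr in index.items()
            if abs(dqr - dsr) <= tau and edit_distance(query, s) <= tau]

idx = build_index(["apple", "ample", "maple", "orange"], reference="apple")
print(range_query("appel", idx, "apple", tau=2))
# ['apple']; 'orange' (distance 5 to the reference) is pruned by the bound alone
```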

01 Jan 2014
TL;DR: This chapter details center-based clustering methods, namely methods based on finding a “best” set of center points and then assigning data points to their nearest center, and describes popular heuristics for k-means and k-median clustering, which are two of the most widely used clustering objectives.
Abstract: In the first part of this chapter we detail center based clustering methods, namely methods based on finding a “best” set of center points and then assigning data points to their nearest center. In particular, we focus on k-means and k-median clustering which are two of the most widely used clustering objectives. We describe popular heuristics for these methods and theoretical guarantees associated with them. We also describe how to design worst case approximately optimal algorithms for these problems. In the second part of the chapter we describe recent work on how to improve on these worst case algorithms even further by using insights from the nature of real world clustering problems and data sets. Finally, we also summarize theoretical work on clustering data generated from mixture models such as a mixture of Gaussians. 1. Approximation algorithms for k-means and k-median. One of the most popular approaches to clustering is to define an objective function over the data points and find a partitioning which achieves the optimal solution, or an approximately optimal solution to the given objective function. Common objective functions include center based objective functions such as k-median and k-means where one selects k center points and the clustering is obtained by assigning each data point to its closest center point. Here closeness is measured in terms of a pairwise distance function d(), which the clustering algorithm has access to, encoding how dissimilar two data points are. For instance, the data could be points in Euclidean space with d() measuring Euclidean distance, or it could be strings with d() representing an edit distance, or some other dissimilarity score. For mathematical convenience it is also assumed that the distance function d() is a metric. In k-median clustering the objective is to find center points $c_1, c_2, \dots, c_k$, and a partitioning of the data so as to minimize $\Phi_{k\text{-median}} = \sum_x \min_i d(x, c_i)$. This objective is historically very useful and well studied for facility location problems [16, 43]. Similarly the objective in k-means is to minimize $\Phi_{k\text{-means}} = \sum_x \min_i d(x, c_i)^2$. Optimizing this objective is closely related to fitting the maximum likelihood mixture model for a given dataset. For a given set of centers, the optimal clustering for that set is obtained by assigning each data point to its closest center point. This is known as the Voronoi partitioning of the data. Unfortunately, exactly optimizing the k-median and the k-means objectives is a notoriously hard problem. Intuitively this is expected since the objective function is a non-convex function of the variables involved. This apparent hardness can also be formally justified by appealing to the
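
The two objectives, written as a runnable sketch with a generic pairwise distance d (which could be Euclidean distance or an edit distance, as the chapter notes); the toy data are illustrative:

```python
def k_median_cost(points, centers, d):
    """Phi_k-median: each point contributes its distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) for x in points)

def k_means_cost(points, centers, d):
    """Phi_k-means: each point contributes its squared distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) ** 2 for x in points)

# Toy 1-D example with d = absolute difference and k = 2 centers.
pts = [0.0, 1.0, 9.0, 10.0]
print(k_median_cost(pts, [0.5, 9.5], lambda a, b: abs(a - b)))  # 2.0
print(k_means_cost(pts, [0.5, 9.5], lambda a, b: abs(a - b)))   # 1.0
```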

Book ChapterDOI
02 Apr 2014
TL;DR: This paper proposes an ILP (integer linear programming) formulation to compute the DCJ distance between two genomes with duplicate genes and provides an efficient preprocessing approach to simplify the ILP formulation while preserving optimality.
Abstract: Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this paper, we propose an ILP (integer linear programming) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse and rat genomes, where once again our method outperforms MSOAR.

Journal ArticleDOI
Longyue Wang1, Derek F. Wong1, Lidia S. Chao1, Yi Lu1, Junwen Xing1 
TL;DR: An in-depth analysis of three different sentence selection techniques for statistical machine translation (SMT) is performed, and the proposed combined method consistently boosts the overall translation performance, ensuring optimal quality of a real-life SMT system.
Abstract: Data selection has shown significant improvements in effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first one is cosine tf-idf, which comes from the realm of information retrieval (IR). The second is a perplexity-based approach, which can be found in the field of language modeling. These two data selection techniques applied to SMT have already been presented in the literature. However, edit distance for this task is proposed in this paper for the first time. After investigating the individual models, a combination of all three techniques is proposed at both corpus level and model level. Comparative experiments are conducted on the Hong Kong law Chinese-English corpus and the results indicate the following: (i) the constraint degree of similarity measuring is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; but (iii) bilingual resources and combination methods are helpful to balance out-of-vocabulary (OOV) and irrelevant data; (iv) finally, our method achieves the goal of consistently boosting the overall translation performance, which can ensure optimal quality of a real-life SMT system.

Posted Content
TL;DR: A combinatorial edit distance for Reeb graphs of orientable surfaces, defined in terms of the cost necessary to transform one graph into another by edit operations, is introduced, and its stability and optimality properties are established.
Abstract: Reeb graphs are structural descriptors that capture shape properties of a topological space from the perspective of a chosen function. In this work we define a combinatorial metric for Reeb graphs of orientable surfaces in terms of the cost necessary to transform one graph into another by edit operations. The main contributions of this paper are the stability property and the optimality of this edit distance. More precisely, the stability result states that changes in the functions, measured by the maximum norm, imply not greater changes in the corresponding Reeb graphs, measured by the edit distance. The optimality result states that our edit distance discriminates Reeb graphs better than any other metric for Reeb graphs of surfaces satisfying the stability property.

Book ChapterDOI
06 Oct 2014
TL;DR: For an approximation framework that reduces the computation of GED to an instance of a linear sum assignment problem, a formal proof is given that the resulting approximation builds an upper bound of the true graph edit distance, and it is shown how a lower bound can additionally be derived; both bounds are simultaneously computed in cubic time.
Abstract: Exact computation of graph edit distance (GED) can be solved in exponential time complexity only. A previously introduced approximation framework reduces the computation of GED to an instance of a linear sum assignment problem. Major benefit of this reduction is that an optimal assignment of nodes (including local structures) can be computed in polynomial time. Given this assignment an approximate value of GED can be immediately derived. Yet, since this approach considers local — rather than the global — structural properties of the graphs only, the GED derived from the optimal assignment is suboptimal. The contribution of the present paper is twofold. First, we give a formal proof that this approximation builds an upper bound of the true graph edit distance. Second, we show how the existing approximation framework can be reformulated such that a lower bound of the edit distance can be additionally derived. Both bounds are simultaneously computed in cubic time.
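
A sketch of the bipartite-style reduction described above, using SciPy's linear sum assignment solver; the cost matrices here are toy values, and the local edge-structure costs that the framework folds into the node costs are omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assignment_based_ged(c_sub, c_del, c_ins):
    """Build the (n+m) x (n+m) cost matrix of node substitutions, deletions and
    insertions and solve the linear sum assignment problem; the cost of the
    optimal assignment gives an approximate GED value of the kind from which
    the paper derives its upper and lower bounds."""
    n, m = c_sub.shape
    LARGE = 1e9                                  # forbids off-diagonal deletion/insertion slots
    big = np.zeros((n + m, n + m))
    big[:n, :m] = c_sub                          # substitutions u_i -> v_j
    big[:n, m:] = LARGE
    np.fill_diagonal(big[:n, m:], c_del)         # deletions u_i -> epsilon
    big[n:, :m] = LARGE
    np.fill_diagonal(big[n:, :m], c_ins)         # insertions epsilon -> v_j
    # bottom-right (epsilon -> epsilon) block stays 0
    rows, cols = linear_sum_assignment(big)
    return big[rows, cols].sum()

# Toy example: two nodes vs one node, unit deletion/insertion costs.
c_sub = np.array([[0.0], [2.0]])
print(assignment_based_ged(c_sub, c_del=np.array([1.0, 1.0]), c_ins=np.array([1.0])))  # 1.0
```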

Journal ArticleDOI
TL;DR: The method hierarchically matches the nodes in a road network using the Minimum Road Edit Distance and eliminates false matching nodes using M‐estimators regardless of differences in LoDs and road‐network coordinate systems.
Abstract: This article presents an approach to hierarchical matching of nodes in heterogeneous road networks in the same urban area. Heterogeneous road networks not only exist at different levels of detail (LoD), but also have different coordinate systems, leading to difficulties in matching and integrating them. To overcome these difficulties, a pattern-based method was implemented. Based on the authors’ previous work on detecting patterns of divided highways, complex road junctions, and strokes to eliminate the LoD effect of road networks, the proposed method extracts the local networks around each node in a road network and uses them as the matching units for the nodes. Second, the degree of shape similarity between the matching units is measured using a Minimum Road Edit Distance based on a transformation. Finally, the proposed method hierarchically matches the nodes in a road network using the Minimum Road Edit Distance and eliminates false matching nodes using M-estimators. An experiment involving matching heterogeneous road networks with different LoDs and coordinate systems was carried out to verify the validity of the proposed method. The method achieves good and effective matching regardless of differences in LoDs and road-network coordinate systems.

Book ChapterDOI
20 Aug 2014
TL;DR: A search procedure based on a genetic algorithm is implemented in order to improve the approximation quality and a substantial gain of distance accuracy is empirically verified.
Abstract: Many flexible methods for graph dissimilarity computation are based on the concept of edit distance. A recently developed approximation framework allows one to compute graph edit distances substantially faster than traditional methods. Yet, this novel procedure considers the local edge structure only during the primary optimization process. Hence, the speed up is at the expense of an overestimation of the true graph edit distances in general. The present paper introduces an extension of this approximation framework. Regarding the node assignment from the original approximation as a starting point, we implement a search procedure based on a genetic algorithm in order to improve the approximation quality. In an experimental evaluation on three real world data sets a substantial gain of distance accuracy is empirically verified.

Journal ArticleDOI
TL;DR: Efficient sequential and parallel algorithms for record linkage are reported which handle any number of datasets, are compared with TPA (FCED), and outperform previous algorithms.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: This work introduces an unsupervised text normalization approach that utilizes not only lexical, but also contextual and grammatical features of social text, and achieves state-of-the-art F-score performance on standard datasets.
Abstract: The informal nature of social media text renders it very difficult to be automatically processed by natural language processing tools. Text normalization, which corresponds to restoring the non-standard words to their canonical forms, provides a solution to this challenge. We introduce an unsupervised text normalization approach that utilizes not only lexical, but also contextual and grammatical features of social text. The contextual and grammatical features are extracted from a word association graph built by using a large unlabeled social media text corpus. The graph encodes the relative positions of the words with respect to each other, as well as their part-of-speech tags. The lexical features are obtained by using the longest common subsequence ratio and edit distance measures to encode the surface similarity among words, and the double metaphone algorithm to represent the phonetic similarity. Unlike most of the recent approaches that are based on generating normalization dictionaries, the proposed approach performs normalization by considering the context of the non-standard words in the input text. Our results show that it achieves state-of-the-art F-score performance on standard datasets. In addition, the system can be tuned to achieve very high precision without sacrificing much from recall.
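
One of the lexical features mentioned, the longest-common-subsequence ratio, as a small sketch (using the common definition |LCS(a, b)| / max(|a|, |b|); the paper may normalize differently, and the double metaphone component is not shown):

```python
def lcs_ratio(a: str, b: str) -> float:
    """Longest-common-subsequence ratio: length of the LCS divided by the length
    of the longer string, a simple surface-similarity feature for normalization."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n) if max(m, n) else 1.0

print(lcs_ratio("tmrw", "tomorrow"))  # 0.5, i.e. 4 / 8
```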

01 Jan 2014
TL;DR: In this article, a probabilistic correlation-based similarity measure was proposed for unstructured text record similarity evaluation, which enriches the information of records by considering correlations of tokens.
Abstract: Large scale unstructured text records are stored in text attributes in databases and information systems, such as scientific citation records or news highlights. Approximate string matching techniques for full text retrieval, e.g., edit distance and cosine similarity, can be adopted for unstructured text record similarity evaluation. However, these techniques do not show the best performance when applied directly, owing to the difference between unstructured text records and full text. In particular, the information is limited in text records of short length, and various information formats, such as abbreviations and missing data, greatly affect the record similarity evaluation. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply conducting the matching of tokens between two records, our similarity evaluation enriches the information of records by considering correlations of tokens. The probabilistic correlation between tokens is defined as the probability of them appearing together in the same records. Then we compute weights of tokens and discover correlations of records based on the probabilistic correlations of tokens. The extensive experimental results demonstrate the effectiveness of our proposed approach.
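
A direct reading of the correlation definition quoted above, as a toy sketch; the estimator (co-occurrence count over total records) and the example records are assumptions made for illustration and may differ from the paper's exact formulation:

```python
from itertools import combinations
from collections import Counter

def token_correlations(records):
    """Estimate the probabilistic correlation of a token pair as the fraction of
    records in which both tokens appear together."""
    n = len(records)
    pair_counts = Counter()
    for rec in records:
        tokens = sorted(set(rec.lower().split()))
        pair_counts.update(combinations(tokens, 2))
    return {pair: count / n for pair, count in pair_counts.items()}

recs = ["Proc. of the VLDB Endowment", "Proceedings of the VLDB Endowment", "VLDB Journal"]
corr = token_correlations(recs)
print(corr[("endowment", "vldb")])  # 2/3: the two tokens co-occur in two of three records
```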

Journal ArticleDOI
TL;DR: A new hybrid algorithm combining character n-gram and neural network methodologies is developed, and it is concluded that Google could be used as a pre-processed spelling correction method.
Abstract: We used the character n-gram method to predict topic changes in search engine queries. We obtained more successful estimations than previous studies, and made remarkable contributions. We compared the character n-gram method with the Levenshtein edit-distance method. We analyzed ASPELL, Google and Bing search engines as pre-processed spelling correction methods. We conclude that Google could be used as a pre-processed spelling correction method. The widespread availability of the Internet and the variety of Internet-based applications have resulted in a significant increase in the number of web pages. Determining the behaviors of search engine users has become a critical step in enhancing search engine performance. Search engine user behaviors can be determined by content-based or content-ignorant algorithms. Although many content-ignorant studies have been performed to automatically identify new topics, previous results have demonstrated that spelling errors can cause significant errors in topic shift estimates. In this study, we focused on minimizing the number of wrong estimates that were based on spelling errors. We developed a new hybrid algorithm combining character n-gram and neural network methodologies, and compared the experimental results with results from previous studies. For the FAST and Excite datasets, the proposed algorithm improved topic shift estimates by 6.987% and 2.639%, respectively. Moreover, we analyzed the performance of the character n-gram method in different aspects, including the comparison with the Levenshtein edit-distance method. The experimental results demonstrated that the character n-gram method outperformed the Levenshtein edit-distance method in terms of topic identification.
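
For intuition about the character n-gram side of the comparison, a minimal Dice-coefficient similarity over character trigrams (one common formulation; the paper's hybrid with a neural network is not reproduced):

```python
def char_ngrams(s: str, n: int = 3) -> set:
    s = f" {s.lower()} "  # pad so word boundaries also contribute n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(q1: str, q2: str, n: int = 3) -> float:
    """Dice coefficient over character n-grams: one simple way to score whether
    two consecutive queries look like reformulations of the same topic, robust
    to small spelling errors."""
    a, b = char_ngrams(q1, n), char_ngrams(q2, n)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

print(ngram_similarity("cheap flighs to rome", "cheap flights to rome"))  # high despite the typo
```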

Book ChapterDOI
06 Oct 2014
TL;DR: The original approximation framework is combined with a fast tree search procedure to improve the overall approximation quality; the assignment from the original approximation is regarded as a starting point for a subsequent beam search.
Abstract: Graph edit distance (GED) is a powerful and flexible graph dissimilarity model. Yet, exact computation of GED is an instance of a quadratic assignment problem and can thus be solved in exponential time complexity only. A previously introduced approximation framework reduces the computation of GED to an instance of a linear sum assignment problem. Major benefit of this reduction is that an optimal assignment of nodes (including local structures) can be computed in polynomial time. Given this assignment an approximate value of GED can be immediately derived. Yet, the primary optimization process of this approximation framework is able to consider local edge structures only, and thus, the observed speed up is at the expense of approximative, rather than exact, distance values. In order to improve the overall approximation quality, the present paper combines the original approximation framework with a fast tree search procedure. More precisely, we regard the assignment from the original approximation as a starting point for a subsequent beam search. In an experimental evaluation on three real world data sets a substantial gain of assignment accuracy can be observed while the run time remains remarkably low.

Proceedings ArticleDOI
23 Mar 2014
TL;DR: It is shown that when chip area is normalized between the two platforms, it is possible to get more than a 50× runtime performance improvement and over 100× reduction in energy consumption compared to an optimized and vectorized x86 implementation.
Abstract: In this paper, we demonstrate the ability of spatial architectures to significantly improve both runtime performance and energy efficiency on edit distance, a broadly used dynamic programming algorithm. Spatial architectures are an emerging class of application accelerators that consist of a network of many small and efficient processing elements that can be exploited by a large domain of applications. In this paper, we utilize the dataflow characteristics and inherent pipeline parallelism within the edit distance algorithm to develop efficient and scalable implementations on a previously proposed spatial accelerator. We evaluate our edit distance implementations using a cycle-accurate performance and physical design model of a previously proposed triggered instruction-based spatial architecture in order to compare against real performance and power measurements on an x86 processor. We show that when chip area is normalized between the two platforms, it is possible to get more than a 50× runtime performance improvement and over 100× reduction in energy consumption compared to an optimized and vectorized x86 implementation. This dramatic improvement comes from leveraging the massive parallelism available in spatial architectures and from the dramatic reduction of expensive memory accesses through conversion to relatively inexpensive local communication.
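
A sketch of the anti-diagonal (wavefront) evaluation order whose parallelism the paper exploits; this Python version runs sequentially, but all cells on one anti-diagonal are mutually independent and could be computed by parallel processing elements:

```python
def edit_distance_wavefront(a: str, b: str) -> int:
    """Edit-distance DP evaluated by anti-diagonals: every cell (i, j) with
    i + j == d depends only on cells of diagonals d-1 and d-2, so all cells of a
    diagonal can be filled in parallel on a spatial accelerator."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(m + n + 1):                       # anti-diagonal index i + j == d
        for i in range(max(0, d - n), min(m, d) + 1):
            j = d - i
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,
                               dp[i][j - 1] + 1,
                               dp[i - 1][j - 1] + cost)
    return dp[m][n]

print(edit_distance_wavefront("intention", "execution"))  # 5
```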

Proceedings ArticleDOI
01 Apr 2014
TL;DR: This paper investigated and evaluated the use of several matching algorithms, including the edit distance algorithm that is believed to be at the heart of most modern commercial translation memory systems, and showed how well various matching algorithms correlate with human judgments of helpfulness.
Abstract: Translation Memory (TM) systems are one of the most widely used translation technologies. An important part of TM systems is the matching algorithm that determines what translations get retrieved from the bank of available translations to assist the human translator. Although detailed accounts of the matching algorithms used in commercial systems can’t be found in the literature, it is widely believed that edit distance algorithms are used. This paper investigates and evaluates the use of several matching algorithms, including the edit distance algorithm that is believed to be at the heart of most modern commercial TM systems. This paper presents results showing how well various matching algorithms correlate with human judgments of helpfulness (collected via crowdsourcing with Amazon’s Mechanical Turk). A new algorithm based on weighted n-gram precision that can be adjusted for translator length preferences consistently returns translations judged to be most helpful by translators for multiple domains and language pairs.

Patent
13 Aug 2014
TL;DR: In this paper, a technique for providing grammatical and semantic sense of statistical machine translation systems to assist in reviewing tasks is presented; it provides for the construction of hypothesis generators and evaluators using sparse data, with the use of an edit distance metric for generating alignments of the sparse data.
Abstract: A technique for providing grammatical and semantic sense of statistical machine translation systems to assist in reviewing tasks provides for the construction of hypothesis generators and evaluators using sparse data, with the use of an edit distance metric for generating alignments of the sparse data.

Journal ArticleDOI
TL;DR: This paper proposes a novel probabilistic correlation-based similarity measure that enriches the information of records by considering correlations of tokens, computes weights of tokens, and discovers correlations of records based on the probabilistic correlations of tokens.