Showing papers on "Edit distance published in 2007"

PDF

Open Access

Journal Article•DOI•

A Normalized Levenshtein Distance Metric

[...]

Li Yujian¹, Liu Bo¹•Institutions (1)

01 Jun 2007-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Experiments using the AESA algorithm in handwritten digit recognition show that the new normalized edit distance between X and Y can generally provide similar results to some other normalized edit distances and may perform slightly better if the triangle inequality is violated in a particular data set.

...read moreread less

Abstract: Although a number of normalized edit distances presented so far may offer good performance in some applications, none of them can be regarded as a genuine metric between strings because they do not satisfy the triangle inequality. Given two strings X and Y over a finite alphabet, this paper defines a new normalized edit distance between X and Y as a simple function of their lengths (|X| and |Y|) and the Generalized Levenshtein Distance (GLD) between them. The new distance can be easily computed through GLD with a complexity of O(|X| \cdot |Y|) and it is a metric valued in [0, 1] under the condition that the weight function is a metric over the set of elementary edit operations with all costs of insertions/deletions having the same weight. Experiments using the AESA algorithm in handwritten digit recognition show that the new distance can generally provide similar results to some other normalized edit distances and may perform slightly better if the triangle inequality is violated in a particular data set.

...read moreread less

624 citations

Journal Article•DOI•

The string edit distance matching problem with moves

[...]

Graham Cormode¹, S. Muthukrishnan²•Institutions (2)

AT&T Labs¹, Rutgers University²

02 Feb 2007-ACM Transactions on Algorithms

TL;DR: This work relaxes the problem so that an additional operation is allowed, namely, substring moves, and approximate the string edit distance upto a factor of O(log n log*n), and results are obtained, which are the first known significantly subquadratic algorithm for a string editdistance problem in which the distance involves nontrivial alignments.

...read moreread less

Abstract: The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes, and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit distance matching problem is to compute the smallest edit distance between p and substrings of t.We relax the problem so that: (a) we allow an additional operation, namely, substring moves; and (b) we allow approximation of this string edit distance. Our result is a near-linear time deterministic algorithm to produce a factor of O(log n loga n) approximation to the string edit distance with moves. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments. Our results are obtained by embedding strings into L1 vector space using a simplified parsing technique, which we call edit-sensitive parsing (ESP).

...read moreread less

258 citations

Proceedings Article•

VGRAM: improving performance of approximate queries on string collections using variable-length grams

[...]

Chen Li¹, Bin Wang², Xiaochun Yang²•Institutions (2)

University of California, Irvine¹, Northeastern University (China)²

23 Sep 2007

TL;DR: A novel technique, called VGRAM, to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection, and shows the significant performance improvements on three existing algorithms.

...read moreread less

Abstract: Many applications need to solve the following problem of approximate string matching: from a collection of strings, how to find those similar to a given string, or the strings in another (possibly the same) collection of strings? Many algorithms are developed using fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a novel technique, called VGRAM, to improve the performance of these algorithms. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. We give a full specification of this technique, including how to select high-quality grams from the collection, how to generate variable-length grams for a string based on the preselected grams, and what is the relationship between the similarity of the gram sets of two strings and their edit distance. A primary advantage of the technique is that it can be adopted by a plethora of approximate string algorithms without the need to modify them substantially. We present our extensive experiments on real data sets to evaluate the technique, and show the significant performance improvements on three existing algorithms.

...read moreread less

198 citations

Patent•

Method and Apparatus for Automatic Detection of Spelling Errors in One or More Documents

[...]

H. Richard Gail¹, Sidney L. Hantler¹, Meir M. Laker¹, Jonathan Lenchner¹, Daniel Milch¹ - Show less +1 more•Institutions (1)

IBM¹

09 Feb 2007

TL;DR: In this paper, a spelling error is detected by determining if at least one given word in the one or more documents satisfies a predefined misspelling criteria, where the predefined criteria comprises the at least word having a frequency below a low threshold, and the word being within an edit distance of one or mote other words in the documents having frequency above a high threshold; and identifying a given word as a potentially misspelled word if the given word satisfies the predicate criteria.

...read moreread less

Abstract: Methods and apparatus are provided for automatically detecting spelling errors in one or more documents, such as documents being processed for the creation of a lexicon According to one aspect of the invention, a spelling error is detected in one or more documents by determining if at least one given word in the one or more documents satisfies a predefined misspelling criteria, wherein the predefined misspelling criteria comprises the at least one given word having a frequency below a predefined low threshold and the at least one given word being within a predefined edit distance of one or mote other words in the one or more documents having a frequency above a predefined high threshold; and identifying a given word as a potentially misspelled word if the given word satisfies the predefined misspelling criteria

...read moreread less

158 citations

Proceedings Article•

Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing

[...]

Gunes Erkan¹, Arzucan Özgür¹, Dragomir R. Radev¹•Institutions (1)

University of Michigan¹

01 Jun 2007

TL;DR: This work introduces a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned, and investigates the performances of two classes of learning algorithms and the semisupervised counterparts of these algorithms, transductive SVMs and harmonic functions.

...read moreread less

Abstract: We introduce a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned. Our approach is based on the analysis of the paths between two protein names in the dependency parse trees of the sentences. Given two dependency trees, we define two separate similarity functions (kernels) based on cosine similarity and edit distance among the paths between the protein names. Using these similarity functions, we investigate the performances of two classes of learning algorithms, Support Vector Machines and k-nearest-neighbor, and the semisupervised counterparts of these algorithms, transductive SVMs and harmonic functions, respectively. Significant improvement over the previous results in the literature is reported as well as a new benchmark dataset is introduced. Semi-supervised algorithms perform better than their supervised version by a wide margin especially when the amount of labeled data is limited.

...read moreread less

145 citations

Book Chapter•DOI•

Bipartite graph matching for computing the edit distance of graphs

[...]

Kaspar Riesen¹, Michel Neuhaus¹, Horst Bunke¹•Institutions (1)

University of Bern¹

11 Jun 2007

TL;DR: This paper proposes an approach for the efficient compuation of edit distance based on bipartite graph matching by means of Munkres' algorithm, sometimes referred to as the Hungarian algorithm, which runs in polynomial time, but provides only suboptimal edit distance results.

...read moreread less

Abstract: In the field of structural pattern recognition graphs constitute a very common and powerful way of representing patterns. In contrast to string representations, graphs allow us to describe relational information in the patterns under consideration. One of the main drawbacks of graph representations is that the computation of standard graph similarity measures is exponential in the number of involved nodes. Hence, such computations are feasible for rather small graphs only. One of the most flexible error-tolerant graph similarity measures is based on graph edit distance. In this paper we propose an approach for the efficient compuation of edit distance based on bipartite graph matching by means of Munkres' algorithm, sometimes referred to as the Hungarian algorithm. Our proposed algorithm runs in polynomial time, but provides only suboptimal edit distance results. The reason for its suboptimality is that implied edge operations are not considered during the process of finding the optimal node assignment. In experiments on semi-artificial and real data we demonstrate the speedup of our proposed method over a traditional tree search based algorithm for graph edit distance computation. Also we show that classification accuracy remains nearly unaffected.

...read moreread less

142 citations

Book Chapter•DOI•

Processing compressed texts: a tractability border

[...]

Yury Lifshits¹•Institutions (1)

Steklov Mathematical Institute¹

09 Jul 2007

TL;DR: A pair of similar problems (equivalence checking, Hamming distance computation) that have radically different complexity on compressed texts are indicated.

...read moreread less

Abstract: What kind of operations can we perform effectively (without full unpacking) with compressed texts? In this paper we consider three fundamental problems: (1) check the equality of two compressed texts, (2) check whether one compressed text is a substring of another compressed text, and (3) compute the number of different symbols (Hamming distance) between two compressed texts of the same length. We present an algorithm that solves the first problem in O(n3) time and the second problem in O(n2m) time. Here n is the size of compressed representation (we consider representations by straight-line programs) of the text and m is the size of compressed representation of the pattern. Next, we prove that the third problem is actually #P-complete. Thus, we indicate a pair of similar problems (equivalence checking, Hamming distance computation) that have radically different complexity on compressed texts. Our algorithmic technique used for problems (1) and (2) helps for computing minimal periods and covers of compressed texts.

...read moreread less

129 citations

Journal Article•DOI•

Automatic learning of cost functions for graph edit distance

[...]

Michel Neuhaus¹, Horst Bunke¹•Institutions (1)

University of Bern¹

01 Jan 2007-Information Sciences

TL;DR: A method to automatically learn cost functions from a labeled sample set of graphs using an Expectation Maximization algorithm to model graph variations by structural distortion operations and derive the desired cost functions is proposed.

...read moreread less

123 citations

Journal Article•DOI•

Indo-European languages tree by Levenshtein distance

[...]

Maurizio Serva, Filippo Petroni

22 Aug 2007-arXiv: Physics and Society

TL;DR: This work introduces a genetic distance among language pairs by considering a renormalized Levenshtein distance among words with same meaning and averaging on all words contained in a Swadesh list and finds out a tree which closely resembles the one published in Gray and Atkinson (2003), with some significant differences.

...read moreread less

Abstract: The evolution of languages closely resembles the evolution of haploid organisms. This similarity has been recently exploited \cite{GA,GJ} to construct language trees. The key point is the definition of a distance among all pairs of languages which is the analogous of a genetic distance. Many methods have been proposed to define these distances, one of this, used by glottochronology, compute distance from the percentage of shared ``cognates''. Cognates are words inferred to have a common historical origin, and subjective judgment plays a relevant role in the identification process. Here we push closer the analogy with evolutionary biology and we introduce a genetic distance among language pairs by considering a renormalized Levenshtein distance among words with same meaning and averaging on all the words contained in a Swadesh list \cite{Sw}. The subjectivity of process is consistently reduced and the reproducibility is highly facilitated. We test our method against the Indo-European group considering fifty different languages and the two hundred words of the Swadesh list for any of them. We find out a tree which closely resembles the one published in \cite{GA} with some significant differences.

...read moreread less

111 citations

Proceedings Article•DOI•

Identifying Changed Source Code Lines from Version Repositories

[...]

Gerardo Canfora¹, Luigi Cerulo¹, M. Di Penta¹•Institutions (1)

University of Sannio¹

20 May 2007

TL;DR: This paper shows how the evolution of changes at source code line level can be inferred from CVS repositories, by combining information retrieval techniques and the Levenshtein edit distance.

...read moreread less

Abstract: Observing the evolution of software systems at different levels of granularity has been a key issue for a number of studies, aiming at predicting defects or at studying certain phenomena, such as the presence of clones or of crosscutting concerns. Versioning systems such as CVS and SVN, however, only provide information about lines added or deleted by a contributor: any change is shown as a sequence of additions and deletions. This provides an erroneous estimate of the amount of code changed. This paper shows how the evolution of changes at source code line level can be inferred from CVS repositories, by combining information retrieval techniques and the Levenshtein edit distance. The application of the proposed approach to the ArgoUML case study indicates a high precision and recall.

...read moreread less

102 citations

Journal Article•DOI•

Low distortion embeddings for edit distance

[...]

Rafail Ostrovsky¹, Yuval Rabani²•Institutions (2)

University of California, Los Angeles¹, Technion – Israel Institute of Technology²

01 Oct 2007-Journal of the ACM

TL;DR: Efficient implementation of the embedding that yield solutions to various computational problems involving edit distance, including sketching, communication complexity, nearest neighbor search are shown.

...read moreread less

Abstract: We show that l0, 1rd endowed with edit distance embeds into e1 with distortion 2O(√log d log log d). We further show efficient implementation of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, nearest neighbor search. For all these problems, we improve upon previous bounds.

...read moreread less

Spelling-error tolerant, order-independent pass-phrases via the damerau-levenshtein string-edit distance metric

[...]

Gregory V. Bard¹•Institutions (1)

University of Maryland, College Park¹

30 Jan 2007

TL;DR: This paper explores methods for making pass-phrases suitable for use with password-based authentication and key-exchange (PAKE) protocols, and in particular, with schemes resilient to server-file compromise.

...read moreread less

Abstract: It is well understood that passwords must be very long and complex to have sufficient entropy for security purposes. Unfortunately, these passwords tend to be hard to memorize, and so alternatives are sought. Smart Cards, Biometrics, and Reverse Turing Tests (human-only solvable puzzles) are options, but another option is to use pass-phrases. This paper explores methods for making pass-phrases suitable for use with password-based authentication and key-exchange (PAKE) protocols, and in particular, with schemes resilient to server-file compromise. In particular, the Ω-method of Gentry, MacKenzie and Ramzan, is combined with the Bellovin-Merritt protocol to provide mutual authentication (in the random oracle model (Canetti, Goldreich & Halevi 2004, Bellare, Boldyreva & Palacio 2004, Maurer, Renner & Holenstein 2004)). Furthermore, since common password-related problems are typographical errors, and the CAPSLOCK key, we show how a dictionary can be used with the Damerau-Levenshtein string-edit distance metric to construct a case-insensitive pass-phrase system that can tolerate zero, one, or two spelling-errors per word, with no loss in security. Furthermore, we show that the system can be made to accept pass-phrases that have been arbitrarily reordered, with a security cost that can be calculated. While a pass-phrase space of 2128 is not achieved by this scheme, sizes in the range of 252 to 2112 result from various selections of parameter sizes. An attacker who has acquired the server-file must exhaust over this space, while an attacker without the server-file cannot succeed with non-negligible probability.

...read moreread less

Journal Article•DOI•

Tandem repeats over the edit distance

[...]

Dina Sokol¹, Gary Benson², Justin Tojeira¹•Institutions (2)

City University of New York¹, Boston University²

05 Jan 2007-Bioinformatics

TL;DR: An efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure is described and will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.

...read moreread less

Abstract: Motivation: A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence. Results: In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats. Availability: The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules. Contact: sokol@sci.brooklyn.cuny.edu

...read moreread less

Proceedings Article•

Speeding up Graph Edit Distance Computation with a Bipartite Heuristic

[...]

Kaspar Riesen, Stefan Fankhauser, Horst Bunke

01 Jan 2007

TL;DR: The idea is to use a fast but suboptimal bipartite graph matching algorithm as a heuristic function that estimates the future costs so that it is guaranteed to return the exact graph edit distance of two given graphs.

...read moreread less

Abstract: Graph edit distance is a dissimilarity measure for arbitrarily structured and arbitrarily labeled graphs. In contrast with other approaches, it does not suffer from any restrictions and can be applied to any type of graph, including hypergraphs [1]. Graph edit distance can be used to address various graph classification problems with different methods, for instance, k-nearest-neighbor classifier (k-NN), graph embedding classifier [2], or classification with graph kernel machines [3]. The main drawback of graph edit distance is its computational complexity which is exponential in the number of nodes of the involved graphs. Consequently, computation of graph edit distance is feasible for graphs of rather small size only. In order to overcome this restriction, a number of fast but suboptimal methods have been proposed in the literature (e.g. [4]). In the present paper we aim at speeding up the computation of exact graph edit distance. We propose to combine the standard tree search approach to graph edit distance computation with the suboptimal procedure described in [4]. The idea is to use a fast but suboptimal bipartite graph matching algorithm as a heuristic function that estimates the future costs. The overhead for computing this heuristic function is small, and easily compensated by the speed-up achieved in tree traversal. Since the heuristic function provides us with a lower bound of the future costs, it is guaranteed to return the exact graph edit distance of two given graphs.

...read moreread less

Book Chapter•DOI•

An optimal decomposition algorithm for tree edit distance

[...]

Erik D. Demaine¹, Shay Mozes¹, Benjamin Rossman¹, Oren Weimann¹•Institutions (1)

Massachusetts Institute of Technology¹

09 Jul 2007

TL;DR: The optimality of the algorithm is proved among the family of decomposition strategy algorithms--which also includes the previous fastest algorithms--by tightening the known lower bound of Ω(n2 log2 n) to O(n3), matching the algorithm's running time.

...read moreread less

Abstract: The edit distance between two ordered rooted trees with vertex labels is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. In this paper, we present a worst-case O(n3)-time algorithm for this problem, improving the previous best O(n3 log n)-time algorithm [7]. Our result requires a novel adaptive strategy for deciding how a dynamic program divides into subproblems, together with a deeper understanding of the previous algorithms for the problem. We prove the optimality of our algorithm among the family of decomposition strategy algorithms--which also includes the previous fastest algorithms--by tightening the known lower bound of Ω(n2 log2 n) [4] to O(n3), matching our algorithm's running time. Furthermore, we obtain matching upper and lower bounds of Θ(nm2(1+log n/m)) when the two trees have sizes m and n where m < n.

...read moreread less

Journal Article•DOI•

Distance measures based on the edit distance for permutation-type representations

[...]

Kenneth Sörensen¹•Institutions (1)

University of Antwerp¹

01 Feb 2007-Journal of Heuristics

TL;DR: This paper introduces several extensions to the simple edit distance, that can be used when a solution cannot be represented as a simple permutation, and develops algorithms to calculate them efficiently.

...read moreread less

Abstract: In this paper, we discuss distance measures for a number of different combinatorial optimization problems of which the solutions are best represented as permutations of items, sometimes composed of several permutation (sub)sets. The problems discussed include single-machine and multiple-machine scheduling problems, the traveling salesman problem, vehicle routing problems, and many others. Each of these problems requires a different distance measure that takes the specific properties of the representation into account. The distance measures discussed in this paper are based on a general distance measure for string comparison called the edit distance. We introduce several extensions to the simple edit distance, that can be used when a solution cannot be represented as a simple permutation, and develop algorithms to calculate them efficiently.

...read moreread less

Journal Article•DOI•

Approximate parameterized matching

[...]

Carmit Hazay¹, Moshe Lewenstein¹, Dina Sokol²•Institutions (2)

Bar-Ilan University¹, City University of New York²

01 Aug 2007-ACM Transactions on Algorithms

TL;DR: This work considers the problem for which an error threshold, k, is given, and the goal is to find all locations in for which there exists a bijection π which maps (p) into the appropriate |p mismatched mapped elements.

...read moreread less

Abstract: Two equal length strings s and s′, over alphabets Σs and Σs′, parameterize match if there exists a bijection π : Σs r Σs′ such that π (s) = s′, where π (s) is the renaming of each character of s via π. Parameterized matching is the problem of finding all parameterized matches of a pattern string p in a text t, and approximate parameterized matching is the problem of finding at each location a bijection π that maximizes the number of characters that are mapped from p to the appropriate vpv-length substring of t.Parameterized matching was introduced as a model for software duplication detection in software maintenance systems and also has applications in image processing and computational biology. For example, approximate parameterized matching models image searching with variable color maps in the presence of errors.We consider the problem for which an error threshold, k, is given, and the goal is to find all locations in t for which there exists a bijection π which maps p into the appropriate vpv-length substring of t with at most k mismatched mapped elements. Our main result is an algorithm for this problem with O(nk1.5 p mk log m) time complexity, where m = vpv and n=vtv. We also show that when vpv = vtv = m, the problem is equivalent to the maximum matching problem on graphs, yielding a O(m p k1.5) solution.
...read moreread less

Proceedings Article•

Extending q-grams to estimate selectivity of string matching with low edit distance

[...]

Hongrae Lee¹, Raymond T. Ng¹, Kyuseok Shim²•Institutions (2)
University of British Columbia¹, Seoul National University²

23 Sep 2007
TL;DR: This paper develops the formulas for selectivity estimation and provides the algorithm BasicEQ, and shows a comprehensive set of experiments using three benchmarks comparing Opt EQ with the state-of-the-art method SEPIA.
...read moreread less
Abstract: There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semi-lattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm Opt EQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing Opt EQ with the state-of-the-art method SEPIA. Our experimental results show that Opt EQ delivers more accurate selectivity estimations.
...read moreread less

Journal Article•DOI•

Computing the edit distance of a regular language

[...]

Stavros Konstantinidis¹•Institutions (1)
Saint Mary's University¹

01 Sep 2007-Information & Computation

TL;DR: The edit distance (or Levenshtein distance) between two words is the smallest number of substitutions, insertions, and deletions of symbols that can be used to transform one of the words into the other.
...read moreread less

Abstract: The edit distance (or Levenshtein distance) between two words is the smallest number of substitutions, insertions, and deletions of symbols that can be used to transform one of the words into the other In this paper, we consider the problem of computing the edit distance of a regular language (the set of words accepted by a given finite automaton) This quantity is the smallest edit distance between any pair of distinct words of the language We show that the problem is of polynomial time complexity In particular, for a given finite automaton A with n transitions, over an alphabet of r symbols, our algorithm operates in time O(n2r2q2( q+r)), where q is either the diameter of A (if A is deterministic), or the square of the number of states in A (if A is nondeterministic) Incidentally, we also obtain an upper bound on the edit distance of a regular language in terms of the automaton accepting the language
...read moreread less

Book Chapter•DOI•

Trainable sketch recognizer for graphical user interface design

[...]

Adrien Coyette¹, Sascha Schimke², Jean Vanderdonckt¹, Claus Vielhauer²•Institutions (2)
Université catholique de Louvain¹, Otto-von-Guericke University Magdeburg²

01 Sep 2007
TL;DR: A new algorithm for automatic recognition of hand drawn sketches based on the Levenshtein distance is presented, which is trainable by every user and improves the recognition performance of the techniques which were used before for widget recognition.
...read moreread less
Abstract: In this paper we present a new algorithm for automatic recognition of hand drawn sketches based on the Levenshtein distance. The purpose for drawing sketches in our application is to create graphical user interfaces in a similar manner as the well established paper sketching. The new algorithm is trainable by every user and improves the recognition performance of the techniques which were used before for widget recognition. In addition, this algorithm ay serve for recognizing other types of sketches, such as letters, figures, and commands. In this way, there is no modality disruption at sketching time.
...read moreread less

Journal Article•DOI•

Edit distance with move operations

[...]

Dana Shapira¹, James A. Storer¹•Institutions (1)
Brandeis University¹

01 Jun 2007-Journal of Discrete Algorithms

TL;DR: A polynomial time greedy algorithm for non-recursive moves which on a subclass of instances of a problem of size n achieves an approximation factor to optimal of at most O(logn).
...read moreread less

Proceedings Article•DOI•

Structuring wiki revision history

[...]

Mikalai Sabel¹•Institutions (1)
University of Trento¹

21 Oct 2007
TL;DR: The proposed revision history of a wiki page is represented as a tree of versions, with every edge in the tree given a weight, called adoption coefficient, indicating similarity between the two corresponding page versions.
...read moreread less
Abstract: Revision history of a wiki page is traditionally maintained as a linear chronological sequence. We propose to represent revision history as a tree of versions. Every edge in the tree is given a weight, called adoption coefficient, indicating similarity between the two corresponding page versions. The same coefficients are used to build the tree. In the implementation described, adoption coefficients are derived from comparing texts of the versions, similarly to computing edit distance. The tree structure reflects actual evolution of page content, revealing reverts, vandalism, and edit wars, which is demonstrated on Wikipedia examples. The tree representation is useful for both human editors and automated algorithms, including trust and reputation schemes for wiki.
...read moreread less

Journal Article•DOI•

Discovering Shape Classes using Tree Edit-Distance and Pairwise Clustering

[...]

Andrea Torsello, Antonio Robles-Kelly¹, Edwin R. Hancock²•Institutions (2)
Australian National University¹, University of York²

01 May 2007-International Journal of Computer Vision

TL;DR: It is shown how the edit distances can be used to compute a matrix of pairwise affinities using χ2 statistics, and a maximum likelihood method for clustering the graphs by iteratively updating the elements of the affinity matrix is presented.
...read moreread less

Abstract: This paper describes work aimed at the unsupervised learning of shape-classes from shock trees. We commence by considering how to compute the edit distance between weighted trees. We show how to transform the tree edit distance problem into a series of maximum weight clique problems, and show how to use relaxation labeling to find an approximate solution. This allows us to compute a set of pairwise distances between graph-structures. We show how the edit distances can be used to compute a matrix of pairwise affinities using ?2 statistics. We present a maximum likelihood method for clustering the graphs by iteratively updating the elements of the affinity matrix. This involves interleaved steps for updating the affinity matrix using an eigendecomposition method and updating the cluster membership indicators. We illustrate the new tree clustering framework on shock-graphs extracted from the silhouettes of 2D shapes.
...read moreread less

Book Chapter•DOI•

Relational Neural Gas

[...]

Barbara Hammer¹, Alexander Hasenfuss¹•Institutions (1)
Clausthal University of Technology¹

10 Sep 2007
TL;DR: Rel relational variants of neural gas are introduced, a very efficient and powerful neural clustering algorithm, which allow a clustering and mining of data given in terms of a pairwise similarity or dissimilarity matrix.
...read moreread less
Abstract: We introduce relational variants of neural gas, a very efficient and powerful neural clustering algorithm, which allow a clustering and mining of data given in terms of a pairwise similarity or dissimilarity matrix. It is assumed that this matrix stems from Euclidean distance or dot product, respectively, however, the underlying embedding of points is unknown. One can equivalently formulate batch optimization in terms of the given similarities or dissimilarities, thus providing a way to transfer batch optimization to relational data. For this procedure, convergence is guaranteed and extensions such as the integration of label information can readily be transferred to this framework.
...read moreread less

Journal Article•DOI•

Text indexing with errors

[...]

Moritz G. Maaí¹, Johannes Nowak¹•Institutions (1)
Technische Universität München¹

01 Dec 2007-Journal of Discrete Algorithms

TL;DR: This paper addresses the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors and presents a trade-off between query time and index complexity that achieves worst-case bounded index size and preprocessing time with linear look-up time.
...read moreread less

Journal Article•DOI•

Sequence-similarity kernels for SVMs to detect anomalies in system calls

[...]

Shengfeng Tian¹, Shaomin Mu¹, Chuanhuan Yin¹•Institutions (1)
Beijing Jiaotong University¹

01 Jan 2007-Neurocomputing

TL;DR: In intrusion detection systems (IDSs), short sequences of system calls executed by running programs can be used as evidence to detect anomalies and distance-based kernel and common subsequence-based kernels are proposed to utilize the sequence information in the detection.
...read moreread less

Patent•

System and Method for Searching a Multimedia Database using a Pictorial Language

[...]

Christine Podilchuk

02 Feb 2007
TL;DR: In this article, a system and method for searching multimedia databases using a pictorial language, input via an iconic interface and making use of trained ontologies, is presented, where the result images are returned to the user in order of their relevance to the query.
...read moreread less
Abstract: A system and method for searching multimedia databases using a pictorial language, input via an iconic interface and making use of trained ontologies, i.e., trained data models. An iconic graphic user interface (GUI) allows a user to specify a pictorial query that may include one or more one or more key-images and optional text input. Similarities between the query key-images and images in a multimedia database based on a pictorial edit distance are used to select the images that are the closest match to the query. The result images are returned to the user in order of their relevance to the query.
...read moreread less

Proceedings Article•DOI•

The Computational Hardness of Estimating Edit Distance [Extended Abstract]

[...]

A. Andon¹, R. Krauthgamer•Institutions (1)
Massachusetts Institute of Technology¹

21 Oct 2007
TL;DR: This work proves the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings, and provides the first setting in which the complexity of computing the edit Distance is provably larger than that of Hamming distance.
...read moreread less
Abstract: We prove the first non-trivial communication complexity lower bound for the problem of estimating the edit distance (aka Levenshtein distance) between two strings. A major feature of our result is that it provides the first setting in which the complexity of computing the edit distance is provably larger than that of Hamming distance. Our lower bound exhibits a trade-off between approximation and communication, asserting, for example, thai protocols with O(1) bits of communication can only obtain approximation a ges Omega(log d/log log d), where d is the length of the input strings. This case of O(1) communication is of particular importance, since it captures constant-size sketches as well as embaddings into spaces like L1 and squared-L2. two prevailing algorithmic approaches for dealing with edit distance. Furthermore, the bound holds not only for strings over alphabet Sigma= {0, 1}, but also for strings that are permu-tations (called the Ulam metric). Besides being applicable to a much richer class of algorithms than all previous results, our bounds are near-tight in at. least one case, namely of embedding permutations into L1. The proof uses a new technique, that relies on Fourier analysis in a rather elementary way.
...read moreread less

Book Chapter•DOI•

A quadratic programming approach to the graph edit distance problem

[...]

Michel Neuhaus¹, Horst Bunke²•Institutions (2)
Pierre-and-Marie-Curie University¹, University of Bern²

11 Jun 2007
TL;DR: Experiments demonstrate that the proposed quadratic programming approach to computing the edit distance of graphs is able to outperform the standard edit distance method in terms of recognition accuracy on two out of three data sets.
...read moreread less
Abstract: In this paper we propose a quadratic programming approach to computing the edit distance of graphs. Whereas the standard edit distance is defined with respect to a minimum-cost edit path between graphs, we introduce the notion of fuzzy edit paths between graphs and provide a quadratic programming formulation for the minimization of fuzzy edit costs. Experiments on real-world graph data demonstrate that our proposed method is able to outperform the standard edit distance method in terms of recognition accuracy on two out of three data sets.
...read moreread less

Proceedings Article•DOI•

Keyword Search using Modified Minimum Edit Distance Measure

[...]

Kartik Audhkhasi¹, Ashish Verma¹•Institutions (1)
Indian Institutes of Technology¹

15 Apr 2007
TL;DR: A variation of the MED, where the substitution penalties are automatically derived from the phone confusion matrix of the recognizer, as compared to heuristic or class based penalties used earlier, is proposed.
...read moreread less
Abstract: A popular approach for keyword search in speech files is the phone lattice search. Recently minimum edit distance (MED) has been used as a measure of similarity between strings rather than using simple string matching while searching the phone lattice for the keyword. In this paper, we propose a variation of the MED, where the substitution penalties are automatically derived from the phone confusion matrix of the recognizer, as compared to heuristic or class based penalties used earlier. The results show that the substitution penalties derived from the phone confusion matrix lead to a considerable improvement in the accuracy of the keyword search algorithm.
...read moreread less

1
2
3
4
…
5