
Showing papers on "Edit distance published in 2009"


Journal ArticleDOI
TL;DR: A novel algorithm is introduced that computes an approximate, or suboptimal, edit distance in a substantially faster way, and it is empirically verified that the suboptimal distance remains sufficiently accurate for various pattern recognition applications.

654 citations


Journal ArticleDOI
01 Aug 2009
TL;DR: Three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time are introduced, and results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs.
Abstract: Graph data have become ubiquitous and manipulating them based on similarity is essential for many applications. Graph edit distance is one of the most widely accepted measures to determine similarities between graphs and has extensive applications in the fields of pattern recognition, computer vision, etc. Unfortunately, the problem of graph edit distance computation is NP-hard in general. Accordingly, in this paper we introduce three novel methods to compute the upper and lower bounds for the edit distance between two graphs in polynomial time. Applying these methods, two algorithms, AppFull and AppSub, are introduced to perform different kinds of graph search on graph databases. Comprehensive experimental studies are conducted on both real and synthetic datasets to examine various aspects of the methods for bounding graph edit distance. Results show that these methods achieve good scalability in terms of both the number of graphs and the size of graphs. The effectiveness of these algorithms also confirms the usefulness of using our bounds in filtering and searching of graphs.
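The paper's bounding techniques are not reproduced here, but to illustrate what a cheap polynomial-time lower bound on graph edit distance can look like, the sketch below (a generic label-counting bound under unit costs, not the AppFull/AppSub machinery) counts how many node labels can never be matched between the two graphs.

```python
from collections import Counter

def ged_label_lower_bound(labels_g1, labels_g2):
    """Cheap lower bound on unit-cost graph edit distance from node labels alone.

    Every node of the larger graph that cannot be paired with an identically
    labeled node of the other graph forces at least one node substitution,
    insertion, or deletion.
    """
    c1, c2 = Counter(labels_g1), Counter(labels_g2)
    matched = sum((c1 & c2).values())          # size of the label multiset intersection
    return max(len(labels_g1), len(labels_g2)) - matched

# Example: two small graphs described only by their node labels.
print(ged_label_lower_bound(list("CCCO"), list("CCNO")))  # -> 1
```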

413 citations


Journal ArticleDOI
TL;DR: It is shown that the similarity provided by TWED is a potentially useful metric in time series retrieval applications since it could benefit from the triangular inequality property to speed up the retrieval process while tuning the parameters of the elastic measure.
Abstract: In a way similar to the string-to-string correction problem, we address discrete time series similarity in light of a time-series-to-time-series correction problem for which the similarity between two time series is measured as the minimum cost sequence of edit operations needed to transform one time series into another. To define the edit operations, we use the paradigm of a graphical editing process and end up with a dynamic programming algorithm that we call time warp edit distance (TWED). TWED is slightly different in form from dynamic time warping (DTW), longest common subsequence (LCSS), or edit distance with real penalty (ERP) algorithms. In particular, it highlights a parameter that controls a kind of stiffness of the elastic measure along the time axis. We show that the similarity provided by TWED is a potentially useful metric in time series retrieval applications since it could benefit from the triangular inequality property to speed up the retrieval process while tuning the parameters of the elastic measure. In that context, a lower bound is derived to link the matching of time series in downsampled representation spaces to the matching in the original space. The empirical quality of the TWED distance is evaluated on a simple classification task. Compared to edit distance, DTW, LCSS, and ERP, TWED has proved to be quite effective on the considered experimental task.
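For readers who want to experiment with the measure, the following sketch implements the TWED recursion as it is commonly stated, with `nu` as the stiffness parameter and `lam` as the constant deletion penalty; unit time stamps are assumed by default, and this is an illustration rather than the authors' reference implementation.

```python
def twed(a, b, ts_a=None, ts_b=None, nu=0.001, lam=1.0):
    """Time Warp Edit Distance between two 1-D series a and b (sketch).

    nu  -- stiffness: penalizes warping along the time axis
    lam -- constant penalty for delete operations
    """
    ts_a = ts_a or list(range(1, len(a) + 1))   # default: unit time stamps
    ts_b = ts_b or list(range(1, len(b) + 1))
    # Pad with a dummy 0-th sample so the recursion has a base case.
    a, b = [0.0] + list(a), [0.0] + list(b)
    ta, tb = [0.0] + list(ts_a), [0.0] + list(ts_b)

    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * m for _ in range(n)]
    D[0][0] = 0.0
    for i in range(1, n):
        for j in range(1, m):
            match = (D[i-1][j-1] + abs(a[i] - b[j]) + abs(a[i-1] - b[j-1])
                     + nu * (abs(ta[i] - tb[j]) + abs(ta[i-1] - tb[j-1])))
            del_a = D[i-1][j] + abs(a[i] - a[i-1]) + nu * (ta[i] - ta[i-1]) + lam
            del_b = D[i][j-1] + abs(b[j] - b[j-1]) + nu * (tb[j] - tb[j-1]) + lam
            D[i][j] = min(match, del_a, del_b)
    return D[n-1][m-1]

print(twed([1, 2, 3, 4], [1, 2, 2, 3, 4]))
```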

298 citations


Journal ArticleDOI
TL;DR: This article presents a worst-case O(n^3)-time algorithm for the problem when the two trees have size n, and proves the optimality of the algorithm among the family of decomposition strategy algorithms—which also includes the previous fastest algorithms—by tightening the known lower bound.
Abstract: The edit distance between two ordered rooted trees with vertex labels is the minimum cost of transforming one tree into the other by a sequence of elementary operations consisting of deleting and relabeling existing nodes, as well as inserting new nodes. In this article, we present a worst-case O(n^3)-time algorithm for the problem when the two trees have size n, improving the previous best O(n^3 log n)-time algorithm. Our result requires a novel adaptive strategy for deciding how a dynamic program divides into subproblems, together with a deeper understanding of the previous algorithms for the problem. We prove the optimality of our algorithm among the family of decomposition strategy algorithms—which also includes the previous fastest algorithms—by tightening the known lower bound of Ω(n^2 log^2 n) to Ω(n^3), matching our algorithm's running time. Furthermore, we obtain matching upper and lower bounds for decomposition strategy algorithms of Θ(nm^2 (1 + log(n/m))) when the two trees have sizes m and n with m ≤ n.
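The optimal decomposition-strategy algorithm is intricate; as a concrete statement of the underlying recurrence, here is a deliberately naive memoized forest edit distance with unit costs. It always decomposes on the rightmost roots, so it illustrates the operations (delete, insert, relabel) but does not achieve the bounds discussed above.

```python
from functools import lru_cache

# A tree is (label, (child, child, ...)); a forest is a tuple of trees.
def tree(label, *children):
    return (label, tuple(children))

def size(forest):
    return sum(1 + size(t[1]) for t in forest)

@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Unit-cost edit distance between two ordered forests (naive, memoized)."""
    if not f1:
        return size(f2)                      # insert everything remaining
    if not f2:
        return size(f1)                      # delete everything remaining
    (l1, c1), rest1 = f1[-1], f1[:-1]        # rightmost tree of each forest
    (l2, c2), rest2 = f2[-1], f2[:-1]
    return min(
        forest_dist(rest1 + c1, f2) + 1,                 # delete the rightmost root of f1
        forest_dist(f1, rest2 + c2) + 1,                 # insert the rightmost root of f2
        forest_dist(c1, c2) + forest_dist(rest1, rest2)  # match the two roots
            + (0 if l1 == l2 else 1),                    # relabel if the labels differ
    )

t1 = tree("f", tree("a"), tree("b", tree("c")))
t2 = tree("f", tree("a"), tree("c"))
print(forest_dist((t1,), (t2,)))   # 1: delete the "b" node (its child "c" moves up)
```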

264 citations


Proceedings Article
01 Jan 2009
TL;DR: The method proposed in this paper outperforms contemporary approaches to trace clustering in process mining; the goodness of the formed clusters is evaluated using established fitness and comprehensibility metrics defined in the context of process mining.
Abstract: Process Mining refers to the extraction of process models from event logs. Real-life processes tend to be less structured and more flexible. Traditional process mining algorithms have problems dealing with such unstructured processes and generate spaghetti-like process models that are hard to comprehend. An approach to overcome this is to cluster process instances (a process instance is manifested as a trace and an event log corresponds to a multi-set of traces) such that each of the resulting clusters corresponds to a coherent set of process instances that can be adequately represented by a process model. In this paper, we propose a context-aware approach to trace clustering based on generic edit distance. It is well known that the generic edit distance framework is highly sensitive to the costs of edit operations. We define an automated approach to derive the costs of edit operations. The method proposed in this paper outperforms contemporary approaches to trace clustering in process mining. We evaluate the goodness of the formed clusters using established fitness and comprehensibility metrics defined in the context of process mining. The proposed approach is able to generate clusters such that the process models mined from the clustered traces show a high degree of fitness and comprehensibility when compared to contemporary approaches.

218 citations


Journal ArticleDOI
TL;DR: This work presents an efficient read mapping tool called RazerS, which allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance and guarantees not to lose more reads than specified.
Abstract: Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either losslessly or, at higher speed, with a user-defined loss rate. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. [RazerS is freely available at http://www.seqan.de/projects/razers.html.]

173 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper captures input typing errors via edit distance, shows that a naive approach of invoking an offline edit distance matching algorithm at each step performs poorly, and presents more efficient algorithms.
Abstract: Autocompletion is a useful feature when a user is doing a look up from a table of records. With every letter being typed, autocompletion displays strings that are present in the table containing as their prefix the search string typed so far. Just as there is a need for making the lookup operation tolerant to typing errors, we argue that autocompletion also needs to be error-tolerant. In this paper, we take a first step towards addressing this problem. We capture input typing errors via edit distance. We show that a naive approach of invoking an offline edit distance matching algorithm at each step performs poorly and present more efficient algorithms. Our empirical evaluation demonstrates the effectiveness of our algorithms.
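The paper's efficient algorithms are not reproduced here; as a baseline illustration of per-keystroke maintenance, the sketch below keeps one dynamic-programming row per dictionary string (row[j] is the edit distance from the typed query to the string's length-j prefix) and extends it with each character typed. The class name, threshold, and sample strings are illustrative.

```python
class FuzzyAutocomplete:
    """Per-keystroke error-tolerant prefix matching (baseline sketch).

    For each dictionary string s we keep row[j] = edit distance between the
    query typed so far and s[:j]; min(row) is then the cheapest way to turn
    the query into some prefix of s.  Each keystroke only appends one DP row
    instead of recomputing everything from scratch.
    """

    def __init__(self, strings, max_errors=1):
        self.max_errors = max_errors
        # Row for the empty query: turning "" into s[:j] costs j insertions.
        self.state = {s: list(range(len(s) + 1)) for s in strings}
        self.query = ""

    def type_char(self, ch):
        self.query += ch
        for s, prev in self.state.items():
            cur = [prev[0] + 1]                       # delete the new query char
            for j in range(1, len(s) + 1):
                cur.append(min(
                    prev[j] + 1,                      # delete query char
                    cur[j - 1] + 1,                   # insert s[j-1]
                    prev[j - 1] + (ch != s[j - 1]),   # match / substitute
                ))
            self.state[s] = cur
        return self.completions()

    def completions(self):
        return [s for s, row in self.state.items() if min(row) <= self.max_errors]

ac = FuzzyAutocomplete(["seattle", "seatac", "boston"])
for c in "zeat":                                      # "seat" with one typo
    print(c, ac.type_char(c))
```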

149 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper studies the problem of approximate dictionary matching with edit distance constraints and proposes an improved neighborhood generation method employing novel partitioning and prefix pruning techniques that outperforms alternative approaches by up to an order of magnitude.
Abstract: Named entity recognition aims at extracting named entities from unstructured text. A recent trend of named entity recognition is finding approximate matches in the text with respect to a large dictionary of known entities, as the domain knowledge encoded in the dictionary helps to improve the extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared to existing studies using token-based similarity constraints, our problem definition enables us to capture typographical or orthographical errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. Our problem is technically challenging as existing approaches based on q-gram filtering have poor performance due to the existence of many short entities in the dictionary. Our proposed solution is based on an improved neighborhood generation method employing novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence achieves good scalability. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.
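As background for the term "neighborhood generation" (the partitioning and prefix-pruning improvements of the paper are not shown), the naive variant for edit distance 1 enumerates every string reachable by one insertion, deletion, or substitution and probes the dictionary with exact lookups:

```python
import string

def neighborhood_1(word, alphabet=string.ascii_lowercase):
    """All strings within edit distance 1 of `word` (naive enumeration)."""
    out = {word}
    for i in range(len(word)):
        out.add(word[:i] + word[i+1:])                  # deletion
        for c in alphabet:
            out.add(word[:i] + c + word[i+1:])          # substitution
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])            # insertion
    return out

dictionary = {"aspirin", "ibuprofen", "warfarin"}       # toy entity dictionary
query_token = "asprin"                                  # typo found in the text
print(dictionary & neighborhood_1(query_token))         # {'aspirin'}
```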

118 citations


Patent
Hee-Jun Song, Young-Hee Park, Hyun Sik Shim, Ham Jong Gyu, Harksoo Kim, Jooho Lee, Se Hee Lee
07 Apr 2009
TL;DR: In this article, a spelling correction system and method automatically recognizes and corrects misspelled inputs in an electronic device with relatively lower computing power. In a learning process, a misspelling correction dictionary is constructed on the basis of a corpus of accepted words, and context-sensitive strings are selected from among all the strings registered in the dictionary. Context information about the context-sensitive strings is acquired.
Abstract: A spelling correction system and method automatically recognizes and corrects misspelled inputs in an electronic device with relatively lower computing power. In a learning process, a misspelling correction dictionary is constructed on the basis of a corpus of accepted words, and context-sensitive strings are selected from among all the strings registered in the dictionary. Context information about the context-sensitive strings is acquired. In an applying process, at least one target string is selected from among all the strings in a user's input sentence through the dictionary. If the target string is one of the context-sensitive strings, the target string is corrected by use of the context information.

102 citations


Patent
27 Jul 2009
TL;DR: A complete framework is disclosed for detecting unauthorized copying of videos on the Internet using the disclosed perceptual video signature, which can also be used to detect duplicate and near-duplicate videos.
Abstract: Methods and apparatus for detection and identification of duplicate or near-duplicate videos using a perceptual video signature are disclosed. The disclosed apparatus and methods (i) extract perceptual video features, (ii) identify unique and distinguishing perceptual features to generate a perceptual video signature, (iii) compute a perceptual video similarity measure based on the video edit distance, and (iv) search and detect duplicate and near-duplicate videos. A complete framework to detect unauthorized copying of videos on the Internet using the disclosed perceptual video signature is disclosed.

68 citations


Book
24 Nov 2009
TL;DR: The Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance), often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
Abstract: In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance). The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. A generalization of the Levenshtein distance (Damerau-Levenshtein distance) allows the transposition of two characters as an operation. Some Translation Environment Tools, such as translation memory leveraging applications, use the Levenshtein algorithm to measure the edit distance between two fuzzy matching content segments. The metric is named after Vladimir Levenshtein, who considered this distance in 1965. It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
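Since this entry defines the measure itself, a minimal dynamic-programming implementation may serve as a reference point; this is the textbook Wagner-Fischer formulation, not code taken from the book being described.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))            # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                  # delete ca
                cur[j - 1] + 1,               # insert cb
                prev[j - 1] + (ca != cb),     # match or substitute
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))       # 3
```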

Proceedings ArticleDOI
19 Oct 2009
TL;DR: A semi-definite program is formulated to encode the above two criteria to learn the distance metric, and it is shown that such an optimization problem can be efficiently solved with a closed-form solution.
Abstract: This paper proposes a novel semantic-aware distance metric for images by mining multimedia data on the Internet, in particular, web images and their associated tags. As is well known, a proper distance metric between images is a key ingredient in many realistic web image retrieval engines, as well as in many image understanding techniques. In this paper, we attempt to mine a novel distance metric from the web images by integrating their visual content as well as the associated user tags. Different from many existing distance metric learning algorithms, which utilize the dissimilarity or similarity information between image pixels or features at the signal level, the proposed scheme also takes the associated user-input tags into consideration. The visual content of images is also leveraged to respect the intuitive assumption that visually similar images ought to have a smaller distance. A semi-definite program is formulated to encode the above two criteria to learn the distance metric, and we show that such an optimization problem can be efficiently solved with a closed-form solution. We evaluate the proposed algorithm on two datasets. One is the benchmark Corel dataset and the other is a real-world dataset crawled from the image sharing website Flickr. By comparison with other existing distance learning algorithms, competitive results are obtained by the proposed algorithm in experiments.

Proceedings ArticleDOI
31 May 2009
TL;DR: This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^{1/3+o(1)} approximation.
Abstract: We show how to compute the edit distance between two strings of length n up to a factor of 2^{Õ(√(log n))} in n^{1+o(1)} time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^{1/3+o(1)} approximation. Previously, an approximation of 2^{O(√(log n))} was known only for embedding edit distance into l1, and it is not known whether that embedding can be computed in less than quadratic time.

Posted Content
TL;DR: In this article, a unified framework is presented for accelerating edit-distance computation between two compressible strings using straight-line programs, together with an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit distance of these two strings under any rational scoring function.
Abstract: We present a unified framework for accelerating edit-distance computation between two compressible strings using straight-line programs. For two strings of total length $N$ having straight-line program representations of total size $n$, we provide an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{1.34}N^{1.34})$ time algorithm for arbitrary scoring functions. This improves on a recent algorithm of Tiskin that runs in $O(nN^{1.5})$ time, and works only for rational scoring functions. Also, in the last part of the paper, we show how the classical four-russians technique can be incorporated into our SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.

Patent
17 Sep 2009
TL;DR: In this article, the similarity between error reports received by an error reporting service is determined by comparing frames included in a callstack of an error report to frames in other error reports to determine an edit distance between the callstacks.
Abstract: Techniques for determining similarity between error reports received by an error reporting service. An error report may be compared to other previously-received error reports to determine similarity and facilitate diagnosing and resolving an error that generated the error report. In some implementations, the similarity may be determined by comparing frames included in a callstack of an error report to frames included in callstacks in other error reports to determine an edit distance between the callstacks, which may be based on the number and type of frame differences between callstacks. Each type of change may be weighted differently when determining the edit distance. Additionally or alternatively, the comparison may be performed by comparing a type of error, process names, and/or exception codes for the errors contained in the error reports. The similarity may be expressed as a probability that two error reports were generated as a result of a same error.
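The abstract does not spell out the exact weighting scheme, so the sketch below only illustrates the idea: an edit distance over sequences of frame names in which insertions, deletions, and substitutions carry separately chosen (here hypothetical) weights.

```python
def callstack_distance(stack_a, stack_b, w_ins=1.0, w_del=1.0, w_sub=1.5):
    """Weighted edit distance between two callstacks (lists of frame names).

    The weights are illustrative; the point is that each type of frame
    difference can contribute a different amount to the distance.
    """
    n, m = len(stack_a), len(stack_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * w_del
    for j in range(1, m + 1):
        d[0][j] = j * w_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if stack_a[i-1] == stack_b[j-1] else w_sub
            d[i][j] = min(d[i-1][j] + w_del,
                          d[i][j-1] + w_ins,
                          d[i-1][j-1] + sub)
    return d[n][m]

crash_1 = ["ntdll!RtlRaiseException", "app!ParseConfig", "app!Main"]
crash_2 = ["ntdll!RtlRaiseException", "app!ParseConfigEx", "app!Main"]
print(callstack_distance(crash_1, crash_2))   # 1.5 (one substituted frame)
```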

Journal ArticleDOI
TL;DR: A new measure (CL) of spatial/structural landscape complexity is developed in this paper, based on the Levenshtein algorithm used in Computer Science and Bioinformatics for string comparisons, which may aid in landscape monitoring, management and planning, by identifying areas of higher structural landscape complexity.

Proceedings ArticleDOI
26 Feb 2009
TL;DR: The classical four-russians technique can be incorporated into the SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.
Abstract: We present a unified framework for accelerating edit-distance computation between two compressible strings using straight-line programs. For two strings of total length $N$ having straight-line program representations of total size $n$, we provide an algorithm running in $O(n^{1.4}N^{1.2})$ time for computing the edit-distance of these two strings under any rational scoring function, and an $O(n^{1.34}N^{1.34})$ time algorithm for arbitrary scoring functions. This improves on a recent algorithm of Tiskin that runs in $O(nN^{1.5})$ time, and works only for rational scoring functions. Also, in the last part of the paper, we show how the classical four-russians technique can be incorporated into our SLP edit-distance scheme, giving us a simple $\Omega(\lg N)$ speed-up in the case of arbitrary scoring functions, for any pair of strings.

Book ChapterDOI
17 Feb 2009
TL;DR: The experiments carried out with 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
Abstract: The Monge-Elkan method is a general text string comparison method based on an internal character-based similarity measure (e.g., edit distance) combined with a token-level (i.e., word-level) similarity measure. We propose a generalization of this method based on the notion of the generalized arithmetic mean instead of the simple average used in the expression to calculate the Monge-Elkan method. The experiments carried out with 12 well-known name-matching data sets show that the proposed approach outperforms the original Monge-Elkan method when character-based measures are used to compare tokens.
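A compact sketch of the generalized measure follows; the exponent m generalizes the plain average (m = 1 recovers the original Monge-Elkan score), and difflib's ratio stands in for the internal character-based similarity purely as an assumed choice.

```python
from difflib import SequenceMatcher

def token_sim(a: str, b: str) -> float:
    """Internal character-level similarity; difflib's ratio is used here
    as a stand-in for an edit-distance-based similarity."""
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(tokens_a, tokens_b, m=2.0):
    """Generalized Monge-Elkan: the best match of each token of A in B is
    aggregated with a power (generalized arithmetic) mean of exponent m.
    m = 1 recovers the original Monge-Elkan average."""
    best = [max(token_sim(a, b) for b in tokens_b) for a in tokens_a]
    return (sum(s ** m for s in best) / len(best)) ** (1.0 / m)

a = "sergio gonzalez".split()
b = "serjio gonsalez perez".split()
print(monge_elkan(a, b, m=1))   # original Monge-Elkan
print(monge_elkan(a, b, m=2))   # generalized variant studied in the paper
```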

Journal ArticleDOI
TL;DR: This work proposes a certain methodology for preserving the privacy of various record linkage approaches; four pairs of privacy-preserving record linkage methods and protocols are implemented, examined and compared, and a blocking scheme is also presented as an extension to the privacy-preserving record linkage methodology.
Abstract: Privacy-preserving record linkage is a very important task, mostly because of the very sensitive nature of the personal data. The main focus in this task is to find a way to match records from among different organisation data sets or databases without revealing competitive or personal information to non-owners. Towards accomplishing this task, several methods and protocols have been proposed. In this work, we propose a certain methodology for preserving the privacy of various record linkage approaches and we implement, examine and compare four pairs of privacy-preserving record linkage methods and protocols. Two of these protocols use n-gram based similarity comparison techniques, the third protocol uses the well-known edit distance and the fourth one implements the Jaro-Winkler distance metric. All of the protocols used are enhanced by private key cryptography and hash encoding. This paper also presents a blocking scheme as an extension to the privacy-preserving record linkage methodology. Our comparison is backed up by an extended experimental evaluation that demonstrates the performance achieved by each of the proposed protocols.

Proceedings ArticleDOI
04 Aug 2009
TL;DR: This paper proposes an original method to estimate and optimize the operation costs in TED, applying the Particle Swarm Optimization algorithm, and shows the success of this method in automatic estimation, rather than manual assignment of edit costs.
Abstract: Recently, there is a growing interest in working with tree-structured data in different applications and domains such as computational biology and natural language processing. Moreover, many applications in computational linguistics require the computation of similarities over pair of syntactic or semantic trees. In this context, Tree Edit Distance (TED) has been widely used for many years. However, one of the main constraints of this method is to tune the cost of edit operations, which makes it difficult or sometimes very challenging in dealing with complex problems. In this paper, we propose an original method to estimate and optimize the operation costs in TED, applying the Particle Swarm Optimization algorithm. Our experiments on Recognizing Textual Entailment show the success of this method in automatic estimation, rather than manual assignment of edit costs.

01 Jan 2009
TL;DR: An efficient algorithm for similarity join with edit distance constraints is implemented; a new algorithm, Ed-Join, is proposed that exploits mismatch-based filtering methods and achieves a substantial reduction of the candidate sizes, and hence saves computation time.
Abstract: Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. In this project, we implement an efficient algorithm for similarity join with edit distance constraints. Current approaches mainly convert the edit distance constraint into a weaker constraint on the number of matching q-grams between a pair of strings. In our project, we exploit a novel perspective of investigating mismatching q-grams. We derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time.
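For context, the weaker q-gram constraint mentioned above is the classical count filter: one edit operation destroys at most q q-grams, so strings within edit distance tau must share a minimum number of q-grams. The sketch below shows that baseline filter, not Ed-Join's mismatch-based bounds.

```python
from collections import Counter

def qgrams(s, q=2):
    return Counter(s[i:i+q] for i in range(len(s) - q + 1))

def count_filter_passes(s, t, tau, q=2):
    """Classical q-gram count filter for edit-distance similarity joins.

    Strings within edit distance tau must share at least
        max(|s|, |t|) - q + 1 - q * tau
    q-grams (counting multiplicities); pairs failing this test cannot be
    within distance tau and are pruned without computing the edit distance.
    """
    common = sum((qgrams(s, q) & qgrams(t, q)).values())
    required = max(len(s), len(t)) - q + 1 - q * tau
    return common >= required

print(count_filter_passes("imyhealth", "imy health", tau=1))   # True: survives the filter
print(count_filter_passes("imyhealth", "microsoft", tau=1))    # False: pruned
```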

Proceedings ArticleDOI
04 Jan 2009
TL;DR: This work proposes a new approach of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like l1 and l∞, and shows that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms.
Abstract: A common approach for solving computational problems over a difficult metric space is to embed the "hard" metric into L1, which admits efficient algorithms and is thus considered an "easy" metric. This approach has proved successful or partially successful for important spaces such as the edit distance, but it also has inherent limitations: it is provably impossible to go below a certain approximation for some metrics. We propose a new approach of embedding the difficult space into richer host spaces, namely iterated products of standard spaces like l1 and l∞. We show that this class is rich since it contains useful metric spaces with only a constant distortion, and, at the same time, it is tractable and admits efficient algorithms. Using this approach, we obtain, for example, the first nearest neighbor data structure with O(log log d) approximation for edit distance in non-repetitive strings (the Ulam metric). This approximation is exponentially better than the lower bound for embedding into L1. Furthermore, we give constant factor approximations for two other computational problems. Along the way, we answer positively a question posed in [Ajtai, Jayram, Kumar, and Sivakumar, STOC 2002]. One of our algorithms has already found applications for smoothed edit distance over 0--1 strings [Andoni and Krauthgamer, ICALP 2008].

Patent
10 Mar 2009
TL;DR: In this paper, an architecture is presented for extracting document information from documents received as search results for a query string and computing an edit distance between the extracted data string and the query string.
Abstract: Architecture for extracting document information from documents received as search results based on a query string, and computing an edit distance between the data string and the query string. The edit distance is employed in determining relevance of the document as part of result ranking by detecting near-matches of a whole query or part of the query. The edit distance evaluates how close the query string is to a given data stream that includes document information such as TAUC (title, anchor text, URL, clicks) information, etc. The architecture includes the index-time splitting of compound terms in the URL to allow the more effective discovery of query terms. Additionally, index-time filtering of anchor text is utilized to find the top N anchors of one or more of the document results. The TAUC information can be input to a neural network (e.g., 2-layer) to improve relevance metrics for ranking the search results.

Book ChapterDOI
04 Jun 2009
TL;DR: A distance for geometric graphs is proposed that is shown to be a metric and that can be computed by solving an integer linear program; experiments using a heuristic distance function are also presented.
Abstract: What does it mean for two geometric graphs to be similar? We propose a distance for geometric graphs that we show to be a metric, and that can be computed by solving an integer linear program. We also present experiments using a heuristic distance function.

Proceedings ArticleDOI
01 Jun 2009
TL;DR: This work proposes the use of Profile HMMs for word-related tasks, and test their applicability to the tasks of multiple cognate alignment and cognate set matching, and finds that they work well in general for both tasks.
Abstract: Profile hidden Markov models (Profile HMMs) are specific types of hidden Markov models used in biological sequence analysis. We propose the use of Profile HMMs for word-related tasks. We test their applicability to the tasks of multiple cognate alignment and cognate set matching, and find that they work well in general for both tasks. On the latter task, the Profile HMM method outperforms average and minimum edit distance. Given the success for these two tasks, we further discuss the potential applications of Profile HMMs to any task where consideration of a set of words is necessary.

Journal ArticleDOI
01 Aug 2009
TL;DR: A novel method, called Reference-Based String Alignment (RBSA), speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure, and significantly outperforms state-of-the-art biological sequence alignment methods.
Abstract: This paper introduces a novel method, called Reference-Based String Alignment (RBSA), that speeds up retrieval of optimal subsequence matches in large databases of sequences under the edit distance and the Smith-Waterman similarity measure. RBSA operates using the assumption that the optimal match deviates by a relatively small amount from the query, an amount that does not exceed a prespecified fraction of the query length. RBSA has an exact version that guarantees no false dismissals and can handle large queries efficiently. An approximate version of RBSA is also described, that achieves significant additional improvements over the exact version, with negligible losses in retrieval accuracy. RBSA performs filtering of candidate matches using precomputed alignment scores between the database sequence and a set of fixed-length reference sequences. At query time, the query sequence is partitioned into segments of length equal to that of the reference sequences. For each of those segments, the alignment scores between the segment and the reference sequences are used to efficiently identify a relatively small number of candidate subsequence matches. An alphabet collapsing technique is employed to improve the pruning power of the filter step. In our experimental evaluation, RBSA significantly outperforms state-of-the-art biological sequence alignment methods, such as q-grams, BLAST, and BWT.

Book ChapterDOI
18 Jun 2009
TL;DR: This paper proposes the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance "1" (report all strings of the dictionary that are at edit distance at most "1" from the query string) in time linear in the length of the query string.
Abstract: In the approximate dictionary search problem we have to construct a data structure on a set of strings so that we can answer queries of the kind: find all strings of the set that are similar (according to some string distance) to a given string. In this paper we propose the first data structure for approximate dictionary search that occupies optimal space (up to a constant factor) and is able to answer an approximate query for edit distance "1" (report all strings of the dictionary that are at edit distance at most "1" from the query string) in time linear in the length of the query string. Based on our new dictionary we propose a full-text index for approximate queries with edit distance "1" (report all positions of all sub-strings of the text that are at edit distance at most "1" from the query string) answering a query in time linear in the length of the query string using space $O(n(\lg(n)\lg\lg(n))^2)$ in the worst case on a text of length n. Our index is the first index that answers queries in time linear in the length of the query string while using space O(n·poly(log(n))) in the worst case and for any alphabet size.

Book ChapterDOI
30 Jun 2009
TL;DR: The new version ELKI 0.2 is now extended to time series data and offers a selection of specialized distance measures; it can serve as a visualization and evaluation tool for the behavior of different distance measures on time series data.
Abstract: ELKI is a unified software framework, designed as a tool suitable for the evaluation of different algorithms on high-dimensional real-valued feature vectors. A special case of high-dimensional real-valued feature vectors is time series data, where traditional distance measures like Lp-distances can be applied. However, a broad range of specialized distance measures, e.g., dynamic time warping, as well as generalized distance measures like second-order distances, e.g., shared-nearest-neighbor distances, have also been proposed. The new version ELKI 0.2 is now extended to time series data and offers a selection of these distance measures. It can serve as a visualization and evaluation tool for the behavior of different distance measures on time series data.

Book ChapterDOI
29 Aug 2009
TL;DR: A merge-split edit distance is proposed which overcomes segmentation problems by incorporating a multi-purpose merge cost function; evaluation of the method on 19th-century historical document images exhibits extremely promising results.
Abstract: Edit distance matching has been used in the literature for word spotting, with characters taken as primitives. The recognition rate, however, is limited by the segmentation inconsistencies of characters (broken or merged) caused by noisy images or distorted characters. In this paper, we propose a merge-split edit distance which overcomes these segmentation problems by incorporating a multi-purpose merge cost function. The system is based on the extraction of words and characters in the text and then attributing each character with a set of features. Characters are matched by comparing their extracted feature sets using Dynamic Time Warping (DTW), while the words are matched by comparing the strings of characters using the proposed merge-split edit distance algorithm. Evaluation of the method on 19th-century historical document images exhibits extremely promising results.

Book ChapterDOI
29 Aug 2009
TL;DR: This paper presents a method to introduce temporal information within the bag-of-words (BoW) approach, modeled as a sequence composed of histograms of visual features, computed from each frame using the traditional BoW model.
Abstract: The recognition of events in videos is a relevant and challenging task of automatic semantic video analysis. At present one of the most successful frameworks, used for object recognition tasks, is the bag-of-words (BoW) approach. However this approach does not model the temporal information of the video stream. In this paper we present a method to introduce temporal information within the BoW approach. Events are modeled as a sequence composed of histograms of visual features, computed from each frame using the traditional BoW model. The sequences are treated as strings where each histogram is considered as a character. Event classification of these sequences of variable size, depending on the length of the video clip, are performed using SVM classifiers with a string kernel that uses the Needlemann-Wunsch edit distance. Experimental results, performed on two datasets, soccer video and TRECVID 2005, demonstrate the validity of the proposed approach.