Showing papers on "Edit distance" published in 2011


Journal ArticleDOI
01 Dec 2011
TL;DR: The paper introduces the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature, and proves that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity.
Abstract: We consider the classical tree edit distance between ordered labeled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity, but the worst case happens frequently, or they are very efficient for some tree shapes, but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms. In this paper we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of RTED is smaller than or equal to the complexity of the best competitors for any input instance, i.e., RTED is both efficient and worst-case optimal. We introduce the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature. We prove that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state of the art.

176 citations
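
The abstract above defines the tree edit distance but, naturally, omits the recurrence behind it. The Python sketch below is a minimal memoized implementation of the classical rightmost-root forest decomposition with unit costs; it only makes the definition concrete and is not RTED, which chooses decomposition paths adaptively to obtain its complexity guarantees. The tree representation and the example trees are illustrative assumptions.

from functools import lru_cache

# A tree is a (label, children) pair, where children is a tuple of trees;
# a forest is a tuple of trees. Trees are ordered and labeled.

def tree_edit_distance(t1, t2):
    """Exact ordered tree edit distance with unit costs (illustrative only)."""

    def size(forest):
        return sum(1 + size(children) for _, children in forest)

    @lru_cache(maxsize=None)
    def dist(f1, f2):
        if not f1 and not f2:
            return 0
        if not f1:
            return size(f2)                        # insert every remaining node
        if not f2:
            return size(f1)                        # delete every remaining node
        (l1, c1), rest1 = f1[-1], f1[:-1]
        (l2, c2), rest2 = f2[-1], f2[:-1]
        return min(
            dist(rest1 + c1, f2) + 1,              # delete the rightmost root of f1
            dist(f1, rest2 + c2) + 1,              # insert the rightmost root of f2
            dist(rest1, rest2) + dist(c1, c2)      # match the two rightmost roots
            + (l1 != l2),                          # plus a relabel if labels differ
        )

    return dist((t1,), (t2,))

# f(a, b(c)) vs. f(a, c): deleting node "b" lets "c" move up, so the distance is 1.
t1 = ("f", (("a", ()), ("b", (("c", ()),))))
t2 = ("f", (("a", ()), ("c", ())))
print(tree_edit_distance(t1, t2))                  # 1

Memoizing on whole forests keeps the sketch short, but it is far less efficient than the decomposition strategies the paper analyzes.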


Journal ArticleDOI
01 Nov 2011
TL;DR: This paper studies string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, using a partition-based method called Pass-Join.
Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.

172 citations
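
The partition step of Pass-Join rests on a pigeonhole argument: if a string r is split into t + 1 segments and ed(r, s) <= t, the t edit operations can touch at most t segments, so at least one segment of r appears verbatim in s. A rough Python sketch of that filter follows; it ignores Pass-Join's inverted indices, position restrictions, and substring-selection techniques, and the example strings are made up.

def partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    segments, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments

def may_match(r, s, tau):
    """Pigeonhole filter: if ed(r, s) <= tau, then at most tau of the tau + 1
    segments of r can be touched by edits, so at least one segment of r must
    occur verbatim somewhere in s. (Vacuous if tau + 1 > len(r).)"""
    return any(segment in s for segment in partition(r, tau))

print(may_match("youtube", "yuotube", 1))    # True  -> still needs verification
print(may_match("youtube", "facebook", 1))   # False -> safely pruned

Pairs that survive the filter still have to be verified with an exact edit-distance computation, which is where the paper's pruning techniques come in.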


Posted Content
TL;DR: Wang et al. propose a partition-based method called Pass-Join, which partitions a string into a set of segments and creates inverted indices for the segments; then, for each string, it selects some of its substrings and uses the selected substrings to find candidate pairs via the inverted indices.
Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.

171 citations


Journal ArticleDOI
TL;DR: A method for Minimum Bayes Risk decoding in speech recognition that has similar functionality to the widely used Consensus method, but has a clearer theoretical basis and appears to give better results both for MBR decoding and for system combination.

167 citations


Proceedings ArticleDOI
12 Jun 2011
TL;DR: Under this new measure, it is proved that subgraph similarity search is NP-hard while graph similarity matching is polynomial, and an information propagation model is found that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available.
Abstract: Search in complex social and information networks is becoming important in a variety of applications. At the core of these applications lies a common and critical problem: given a labeled network and a query graph, how to efficiently search for the query graph in the target network. The presence of noise and the incomplete knowledge about the structure and content of the target network make it unrealistic to find an exact match. Rather, it is more appealing to find the top-k approximate matches. In this paper, we propose a neighborhood-based similarity measure that avoids costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP-hard, while graph similarity matching is polynomial. By studying the principles behind this measure, we find an information propagation model that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. The proposed method, called Ness (Neighborhood Based Similarity Search), is appropriate for graphs with low automorphism and high noise, which are common in many social and information networks. Ness is not only efficient, but also robust against structural noise and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible cost.

149 citations


Proceedings ArticleDOI
Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin
12 Jun 2011
TL;DR: This paper shows that the minimum signature size lower bound is t + 1, proposes asymmetric signature schemes that achieve this lower bound, and develops efficient query processing algorithms based on the new scheme.
Abstract: Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing methods answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities. In this paper, we show that the minimum signature size lower bound is t + 1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experimental results clearly demonstrate the efficiency of our methods.

106 citations
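
Verification of candidate pairs against a threshold t is commonly done with a banded dynamic program that fills only the cells within t of the main diagonal. The Python sketch below illustrates that generic idea; it is not the dynamic programming-based pruning proposed in the paper, and the example strings are arbitrary.

def edit_distance_at_most(a, b, t):
    """Return ed(a, b) if it is <= t, otherwise None.
    Only cells within t of the main diagonal are filled, so the work is
    O(t * len(a)) instead of O(len(a) * len(b))."""
    if abs(len(a) - len(b)) > t:
        return None
    big = t + 1                                    # any value > t means "over threshold"
    prev = [j if j <= t else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [big] * (len(b) + 1)
        if i <= t:
            curr[0] = i
        for j in range(max(1, i - t), min(len(b), i + t) + 1):
            curr[j] = min(prev[j] + 1,                           # delete a[i-1]
                          curr[j - 1] + 1,                       # insert b[j-1]
                          prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute or match
        prev = curr
    return prev[-1] if prev[-1] <= t else None

print(edit_distance_at_most("surgery", "survey", 2))   # 2
print(edit_distance_at_most("surgery", "survey", 1))   # None

A production implementation would also abandon the computation as soon as every cell in the current band exceeds t.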


Book ChapterDOI
09 Jul 2011
TL;DR: By mapping messages into a large context, the authors can compute the distances between them, and then classify them, which yields more accurate classification of a set of Twitter messages than alternative techniques using string edit distance and latent semantic analysis.
Abstract: By mapping messages into a large context, we can compute the distances between them, and then classify them. We test this conjecture on Twitter messages: Messages are mapped onto their most similar Wikipedia pages, and the distances between pages are used as a proxy for the distances between messages. This technique yields more accurate classification of a set of Twitter messages than alternative techniques using string edit distance and latent semantic analysis.

97 citations


Journal ArticleDOI
TL;DR: A taxonomy is introduced that classifies indexing methods for approximate dictionary searching into direct methods and sequence-based filtering methods, with a focus on infrequently updated dictionaries that are used primarily for retrieval.
Abstract: The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k. Benchmark results are presented for the practically important cases of k=1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article is concluded with a discussion of benchmark results and directions for future research.

87 citations


Proceedings Article
01 Sep 2011
TL;DR: This work automatically annotates the English version of a multi-parallel corpus and projects the annotations into all the other language versions, and uses a phrase-based statistical machine translation system as well as a lookup of known names from a multilingual name database for the translation of English entities.
Abstract: As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck problem: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results across languages should be comparable. We solve the problem by automatically annotating the English version of a multi-parallel corpus and by projecting the annotations into all the other language versions. For the translation of English entities, we use a phrase-based statistical machine translation system as well as a lookup of known names from a multilingual name database. For the projection, we incrementally apply different methods: perfect string matching, perfect consonant signature matching and edit distance similarity. The resulting annotated parallel corpus will be made available for reuse.

75 citations


Proceedings Article
27 Jul 2011
TL;DR: This work presents a novel approach to automatic collocation error correction in learner English, based on paraphrases extracted from parallel corpora; the key assumption is that collocation errors are often caused by semantic similarity in the writer's first language (L1).
Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1-language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.

74 citations


Posted Content
TL;DR: The Levenshtein distance method is improved by grouping similar-looking letters and reducing the weighted difference among members of the same group; the results show a marked improvement over the traditional Levenshtein distance technique.
Abstract: Dictionary lookup methods are popular in dealing with ambiguous letters which were not recognized by Optical Character Readers. However, a robust dictionary lookup method can be complex, as a priori probability calculation or a large dictionary size increases the overhead and the cost of searching. In this context, the Levenshtein distance is a simple metric which can be an effective string approximation tool. After observing the effectiveness of this method, we improved it by grouping some similar-looking letters and reducing the weighted difference among members of the same group. The results showed a marked improvement over the traditional Levenshtein distance technique.
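
A weighted Levenshtein distance with a reduced substitution cost inside groups of similar-looking characters can be sketched as follows in Python. The confusion groups and the 0.5 reduced weight here are illustrative guesses, not the groups or weights used in the paper.

# Illustrative confusion groups for OCR lookalikes (assumed, not from the paper).
GROUPS = [set("O0Q"), set("Il1|"), set("S5"), set("B8"), set("Z2")]

def substitution_cost(a, b, reduced=0.5):
    """Cost 0 for identical characters, a reduced cost within a confusion group,
    and the full cost of 1 otherwise."""
    if a == b:
        return 0.0
    return reduced if any(a in g and b in g for g in GROUPS) else 1.0

def weighted_levenshtein(s, t):
    """Standard Levenshtein DP, except that substitutions between
    similar-looking characters are cheaper."""
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = float(i)
    for j in range(len(t) + 1):
        d[0][j] = float(j)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                   # deletion
                          d[i][j - 1] + 1,                                   # insertion
                          d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]))
    return d[len(s)][len(t)]

print(weighted_levenshtein("B00K", "BOOK"))   # 1.0 (a plain Levenshtein distance gives 2)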

Journal ArticleDOI
TL;DR: The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers; it is based solely on determining the similarity in code and data area positions, which makes the algorithm effective against many ways of protecting executable code.
Abstract: One of the main trends in the modern anti-virus industry is the development of algorithms that help estimate the similarity of files. Since malware writers tend to use increasingly complex techniques to protect their code, such as obfuscation and polymorphism, anti-virus software vendors face problems of the increasing difficulty of file scanning, the considerable growth of anti-virus databases, and the overgrowth of file storage. For solving such problems, a static analysis of files appears to be of some interest. Its use helps determine those file characteristics that are necessary for their comparison without executing malware samples within a protected environment. The solution provided in this article is based on the assumption that different samples of the same malicious program have a similar order of code and data areas. Each such file area may be characterized not only by its length, but also by its homogeneity. In other words, the file may be characterized by the complexity of its data order. Our approach consists of using wavelet analysis for the segmentation of files into segments of different entropy levels and using the edit distance between the resulting segment sequences to determine the similarity of the files. The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers. First, this comparison does not take into account the functionality of analysed files and is based solely on determining the similarity in code and data area positions, which makes the algorithm effective against many ways of protecting executable code. On the other hand, such a comparison may result in false alarms. Therefore, our solution is useful as a preliminary test that triggers the running of additional checks. Second, the method is relatively easy to implement and does not require code disassembly or emulation. And, third, the method makes the malicious file record compact, which is significant when compiling anti-virus databases.
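
A much-simplified Python sketch of the idea: map a file to a sequence of coarse entropy classes and compare two files by the edit distance between their class sequences. Fixed-size blocks and hand-picked entropy thresholds stand in for the paper's wavelet-based segmentation, and the example data are synthetic.

import math
from collections import Counter

def entropy(block):
    """Shannon entropy of a block of bytes, in bits per byte."""
    counts = Counter(block)
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def entropy_profile(data, block_size=256, thresholds=(2.0, 4.0, 6.0)):
    """Map a file to a sequence of coarse entropy classes (0 .. len(thresholds)).
    Fixed-size blocks and fixed thresholds are assumptions made for this sketch."""
    classes = []
    for offset in range(0, len(data), block_size):
        e = entropy(data[offset:offset + block_size])
        classes.append(sum(e > t for t in thresholds))
    # Collapse runs so the sequence reflects the order of areas, not their lengths.
    return [c for i, c in enumerate(classes) if i == 0 or c != classes[i - 1]]

def sequence_edit_distance(p, q):
    """Plain edit distance between two class sequences."""
    d = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(q, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (a != b))
    return d[len(q)]

# Synthetic example: files whose class sequences are close are considered similar.
profile_a = entropy_profile(bytes(range(256)) * 4)                      # one high-entropy area
profile_b = entropy_profile(b"\x00" * 512 + bytes(range(256)) * 2)      # low- then high-entropy area
print(sequence_edit_distance(profile_a, profile_b))                     # 1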

Proceedings ArticleDOI
16 Jul 2011
TL;DR: This work generates bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects, and uses these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations.
Abstract: Speakers of many different languages use the Internet. A common activity among these users is uploading images and associating these images with words (in their own language) as captions, filenames, or surrounding text. We use these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations. Bilingual pairs of words are proposed as translations if their corresponding images have similar visual features. We generate bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects. The use of visual similarity substantially improves performance over standard approaches based on string similarity: for generated lexicons with 1000 translations, including visual information leads to an absolute improvement in accuracy of 8-12% over string edit distance alone.

Journal ArticleDOI
01 Mar 2011
TL;DR: An efficient probabilistic edit distance is proposed for providing explicit video-based LPR, and cognitive loops are introduced at critical stages of the algorithm to take advantage of the context modeling and increase overall system performance.
Abstract: License Plate Recognition (LPR) is mainly regarded as a solved problem. However, robust solutions able to face real-world scenarios still need to be proposed. Mostly, country-specific systems are designed, which can (artificially) reach high recognition rates. This option, however, strictly limits their applicability. In this paper, we propose an approach that can deal with various national plates. There are three main areas of novelty. First, the Optical Character Recognition (OCR) is managed by a hybrid strategy, combining statistical and structural algorithms. Secondly, an efficient probabilistic edit distance is proposed for providing an explicit video-based LPR. Last but not least, cognitive loops are introduced at critical stages of the algorithm. These feedback steps take advantage of the context modeling to increase the overall system performance and overcome the inextricable parameter settings of the low-level processing. The system's performance has been tested on more than 1200 static images with difficult illumination conditions and complex backgrounds, as well as on six different videos containing 525 moving vehicles. The evaluations prove our system to be very competitive among the non-country-specific approaches.

Posted Content
TL;DR: In this paper, the authors show how to approximate the edit distance between two strings of length n within a factor of 2^(Õ(sqrt(log n))) in n^(1+o(1)) time, the first sub-polynomial approximation algorithm for this problem that runs in near-linear time.
Abstract: We show how to compute the edit distance between two strings of length n up to a factor of 2^(Õ(sqrt(log n))) in n^(1+o(1)) time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^(1/3+o(1)) approximation. Previously, an approximation of 2^(Õ(sqrt(log n))) was known only for embedding edit distance into l_1, and it is not known if that embedding can be computed in less than quadratic time.

Proceedings ArticleDOI
18 Sep 2011
TL;DR: The proposed approach effectively segments the alignment problem into small subproblems, which in turn yields dramatic time savings even when there are large pieces of inserted or deleted text and the OCR accuracy is poor.
Abstract: This paper aims to evaluate the accuracy of optical character recognition (OCR) systems on real scanned books. The ground truth e-texts are obtained from the Project Gutenberg website and aligned with their corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is recursively applied to each text segment in between matching unique words until the text segments become very small. In the final stage, an edit-distance-based alignment algorithm is used to align these short chunks of text to generate the final alignment. The proposed approach effectively segments the alignment problem into small subproblems, which in turn yields dramatic time savings even when there are large pieces of inserted or deleted text and the OCR accuracy is poor. This approach is used to evaluate the OCR accuracy of real scanned books in English, French, German and Spanish.
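
The recursive anchoring step can be sketched as follows in Python: words that occur exactly once in both texts serve as anchors, the texts are split at those anchors, and the procedure recurses until the remaining chunks are small enough to hand to an edit-distance alignment. This is an illustrative simplification of the RETAS scheme (for instance, it picks a monotone anchor set greedily rather than optimally) and not the authors' code; the chunk-size cutoff is an assumption.

def align(gt_words, ocr_words, min_len=10):
    """Recursively split two token lists at words that occur exactly once in both,
    until the remaining chunks are small; those chunks would then be aligned with
    an edit-distance DP, as in the final stage described above."""
    if min(len(gt_words), len(ocr_words)) <= min_len:
        return [(gt_words, ocr_words)]           # small chunk: hand off to DP alignment

    def unique_positions(words):
        seen = {}
        for i, w in enumerate(words):
            seen[w] = -1 if w in seen else i
        return {w: i for w, i in seen.items() if i >= 0}

    u1, u2 = unique_positions(gt_words), unique_positions(ocr_words)
    # Anchor on words unique in both texts, kept monotone greedily (RETAS is smarter here).
    anchors, last_j = [], -1
    for i, j in sorted((u1[w], u2[w]) for w in u1.keys() & u2.keys()):
        if j > last_j:
            anchors.append((i, j))
            last_j = j
    if not anchors:
        return [(gt_words, ocr_words)]

    chunks, pi, pj = [], 0, 0
    for i, j in anchors:
        chunks += align(gt_words[pi:i], ocr_words[pj:j], min_len)
        pi, pj = i + 1, j + 1
    return chunks + align(gt_words[pi:], ocr_words[pj:], min_len)

Each recursive call excludes at least one anchor word, so the chunks shrink and the recursion terminates.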

Journal ArticleDOI
TL;DR: The character confusion-based prototype of Text-Induced Corpus Clean-up is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents, showing that the system is not sensitive to domain variation.
Abstract: We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting error and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (ticcl) is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed on its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk’ show that the system is not sensitive to domain variation.

Proceedings ArticleDOI
12 Jun 2011
TL;DR: This paper proposes a unified framework to support many similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance, and shows that the method achieves high performance and outperforms state-of-the-art studies.
Abstract: Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard similarity) or character-based dissimilarity (e.g., edit distance). This calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming effort, hardware requirements, and manpower. In addition, many substrings in the document overlap, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.

Journal ArticleDOI
TL;DR: Testing the Levenshtein distance for classifying languages, by subsampling three language subsets from a large database of Austronesian languages, shows poor performance and suggests the need for more linguistically nuanced methods for automated language classification tasks.
Abstract: The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of the time. Standardizing the orthography increases the performance, but only to a maximum of 65% accuracy within language subgroups. The accuracy of the Levenshtein classification decreases rapidly with phylogenetic distance, failing to discriminate homology and chance similarity across distantly related languages. This poor performance suggests the need for more linguistically nuanced methods for automated language classification tasks.
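
The measure being evaluated is, in essence, a length-normalized Levenshtein distance averaged over aligned meaning slots of two word lists. A small Python sketch of that computation follows; the toy three-word lists (roughly Malay vs. Hawaiian forms for "five", "eye", "fire") are only an illustration, not the paper's data.

def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longer word."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[len(b)] / max(len(a), len(b), 1)

def language_distance(wordlist_a, wordlist_b):
    """Average normalized Levenshtein distance over aligned meaning slots."""
    pairs = [(a, b) for a, b in zip(wordlist_a, wordlist_b) if a and b]
    return sum(normalized_levenshtein(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

# Toy word lists for the meanings "five", "eye", "fire".
print(language_distance(["lima", "mata", "api"], ["lima", "maka", "ahi"]))   # ~0.19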

Journal ArticleDOI
TL;DR: A novel technique for efficient privacy-preserving approximate record linkage that combines a secure blocking component, based on phonetic algorithms statistically enhanced to improve security, with a secure matching component in which the actual approximate matching is performed using the Levenshtein distance algorithm.
Abstract: Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where the actual approximate matching is performed using a novel private variant of the Levenshtein distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching. Category: Ubiquitous computing; Security and privacy

Proceedings ArticleDOI
07 Jun 2011
TL;DR: It is demonstrated that while consistent discovery is tractable for sparse random graphs using a small number of participants, in general there are graphs which cannot be discovered by any algorithm even with a significant number of participants and with the availability of end-to-end information along all the paths between the participants.
Abstract: We consider the task of topology discovery of sparse random graphs using end-to-end random measurements (e.g., delay) between a subset of nodes, referred to as the participants. The rest of the nodes are hidden, and do not provide any information for topology discovery. We consider topology discovery under two routing models: (a) the participants exchange messages along the shortest paths and obtain end-to-end measurements, and (b) additionally, the participants exchange messages along the second shortest path. For scenario (a), our proposed algorithm results in a sub-linear edit-distance guarantee using a sub-linear number of uniformly selected participants. For scenario (b), we obtain a much stronger result, and show that we can achieve consistent reconstruction when a sub-linear number of uniformly selected nodes participate. This implies that accurate discovery of sparse random graphs is tractable using an extremely small number of participants. We finally obtain a lower bound on the number of participants required by any algorithm to reconstruct the original random graph up to a given edit distance. We also demonstrate that while consistent discovery is tractable for sparse random graphs using a small number of participants, in general, there are graphs which cannot be discovered by any algorithm even with a significant number of participants, and with the availability of end-to-end information along all the paths between the participants.

Journal ArticleDOI
TL;DR: This work suggests an efficient approach for solving the atom mapping problem exactly--finding mappings of minimum edge edit distance based on A* search equipped with sophisticated heuristics for pruning the search space.
Abstract: The ability to trace the fate of individual atoms through the metabolic pathways is needed in many applications of systems biology and drug discovery. However, this information is not immediately available from the most common metabolome studies and needs to be separately acquired. Automatic discovery of correspondence of atoms in biochemical reactions is called the "atom mapping problem." We suggest an efficient approach for solving the atom mapping problem exactly--finding mappings of minimum edge edit distance. The algorithm is based on A* search equipped with sophisticated heuristics for pruning the search space. This approach has clear advantages over the commonly used heuristic approach of iterative maximum common subgraph (MCS) algorithm: we explicitly minimize an objective function, and we produce solutions that typically require less manual curation. The two methods are similar in computational resource demands. We compare the performance of the proposed algorithm against several alternatives on data obtained from the KEGG LIGAND and RPAIR databases: greedy search, bi-partite graph matching, and the MCS approach. Our experiments show that alternative approaches often fail in finding mappings with minimum edit distance.

Journal ArticleDOI
TL;DR: The Edit distance and the DNFP methods have the highest discrimination powers and can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2.
Abstract: DNA barcoding technology, which uses a short piece of DNA sequence to identify species, has a wide range of applications. To date, a universal DNA barcode marker for plants remains elusive. The rbcL and matK regions have been proposed as the “core barcode” for plants and the ITS2 and psbA-trnH intergenic spacer (PTIGS) regions were later added as supplemental barcodes. The use of the PTIGS region as a supplemental barcode has been limited by the lack of computational tools that can handle significant insertions and deletions in the PTIGS sequences. Here, we compared the most commonly used alignment-based and alignment-free methods and developed a web server to allow biologists to carry out PTIGS-based DNA barcoding analyses. First, we compared several alignment-based methods such as BLAST and those calculating the P distance and edit distance, the alignment-free Di-Nucleotide Frequency Profile (DNFP) method, and their combinations. We found that the DNFP and edit-distance methods increased the identification success rate to ~80%, 20% higher than the most commonly used BLAST method. Second, the combined methods showed an overall better success rate and performance. Last, we have developed a web server that allows (1) retrieving various sub-regions and the consensus sequences of PTIGS, (2) annotating novel PTIGS sequences, (3) determining species identity by PTIGS sequences using eight methods, and (4) examining identification efficiency and performance of the eight methods for various taxonomy groups. The edit distance and the DNFP methods have the highest discrimination powers. Hybrid methods can be used to achieve significant improvement in performance. These methods can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2. To our knowledge, the web server developed here is the only one that allows species determination based on PTIGS sequences. The web server can be accessed at http://psba-trnh-plantidit.dnsalias.org .
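
The alignment-free DNFP method is built on di-nucleotide frequency profiles. The abstract does not spell out the exact profile or classifier used by the web server, so the Python sketch below only illustrates the generic idea: turn a sequence into a 16-dimensional vector of dinucleotide frequencies and compare vectors by Euclidean distance (the distance function is an assumption).

from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]   # 16 dimensions

def dnfp(sequence):
    """Di-nucleotide frequency profile: relative frequency of each dinucleotide."""
    sequence = sequence.upper()
    counts = {d: 0 for d in DINUCLEOTIDES}
    total = 0
    for i in range(len(sequence) - 1):
        d = sequence[i:i + 2]
        if d in counts:                       # skip windows containing N, gaps, etc.
            counts[d] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]

def dnfp_distance(a, b):
    """Euclidean distance between two profiles; a query would be assigned to the
    species whose reference profile is nearest."""
    return sum((x - y) ** 2 for x, y in zip(dnfp(a), dnfp(b))) ** 0.5

print(dnfp_distance("ACGTACGTTAGC", "ACGTTCGTTAGC"))   # small value: similar composition

Because the profile ignores positions entirely, it is unaffected by the large insertions and deletions that make PTIGS sequences hard to align.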

Journal ArticleDOI
TL;DR: This work presents a local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate, and applies the SWIFT algorithm for lossless filtering.
Abstract: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. STELLAR is very practical and fast on very long sequences, which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar . The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de .

Proceedings ArticleDOI
24 Oct 2011
TL;DR: This paper proposes a unified method to tackle the task of data record extraction from Web pages by addressing several key issues in a uniform manner and achieves higher accuracy compared with three state-of-the-art methods.
Abstract: Although the task of data record extraction from Web pages has been studied extensively, existing methods still fail to handle many pages due to their complexity in format or layout. In this paper, we propose a unified method to tackle this task by addressing several key issues in a uniform manner. A new search structure, named the Record Segmentation Tree (RST), is designed, and several efficient search pruning strategies on the RST structure are proposed to identify the records in a given Web page. Another characteristic of our method, which is significantly different from previous works, is that it can effectively handle complicated and challenging data record regions. This is achieved by generating subtree groups dynamically from the RST structure during the search process. Furthermore, instead of using string edit distance or tree edit distance, we propose a token-based edit distance which takes each DOM node as a basic unit in the cost calculation. Extensive experiments are conducted on four data sets, including flat, nested, and intertwined records. The experimental results demonstrate that our method achieves higher accuracy compared with three state-of-the-art methods.

Book ChapterDOI
27 Jun 2011
TL;DR: This paper shows that the problem of finding an edit distance between unordered trees is MAX SNP-hard even if the height of the trees is 2 or their degree is 2, and, under a unit cost, even if the height is 3 or the degree is 2.
Abstract: Zhang and Jiang (1994) have shown that the problem of finding an edit distance between unordered trees is MAX SNP-hard. In this paper, we show that this problem is MAX SNP-hard, even if (1) the height of trees is 2, (2) the degree of trees is 2, (3) the height of trees is 3 under a unit cost, and (4) the degree of trees is 2 under a unit cost.

Journal ArticleDOI
TL;DR: This paper presents a fixed-parameter algorithm for the tree edit distance problem for unordered trees under the unit cost model that works in O(2.62^k · poly(n)) time and O(n^2) space, where the parameter k is the maximum bound of the edit distance and n is the maximum size of the input trees.

Journal ArticleDOI
TL;DR: This paper introduces a simple Sum-over-Paths (SoP) formulation of string edit distances accounting for all possible alignments between two sequences, and extends related previous work from bioinformatics to the case of graphs with cycles.

Book ChapterDOI
10 Oct 2011
TL;DR: SpSim is a new spelling similarity measure for cognate identification that is tolerant of characteristic spelling differences automatically extracted from a set of cognates known a priori.
Abstract: The most commonly used measures of string similarity, such as the Longest Common Subsequence Ratio (LCSR) and those based on Edit Distance, only take into account the number of matched and mismatched characters. However, we observe that cognates belonging to a pair of languages exhibit recurrent spelling differences such as "ph" and "f" in the English-Portuguese cognates "phase" and "fase". Those differences are attributable to the evolution of the spelling rules of each language over time, and thus they should not be penalized in the same way as arbitrary differences found in non-cognate words, if we are using word similarity as an indicator of cognaticity. This paper describes SpSim, a new spelling similarity measure for cognate identification that is tolerant of characteristic spelling differences automatically extracted from a set of cognates known a priori. Compared to LCSR and EdSim (Edit Distance-based similarity), SpSim yields an F-measure 10% higher when used for cognate identification on five different language pairs.

Proceedings ArticleDOI
27 Aug 2011
TL;DR: A variety of metrics are explored to compare the automatic pronunciation methods of three freely-available grapheme-to-phoneme packages on a large dictionary using a novel weighted phonemic substitution matrix constructed from substitution frequencies in a collection of trusted alternate pronunciations.
Abstract: As grapheme-to-phoneme methods proliferate, their careful evaluation becomes increasingly important. This paper explores a variety of metrics to compare the automatic pronunciation methods of three freely-available grapheme-to-phoneme packages on a large dictionary. Two metrics, presented here for the first time, rely upon a novel weighted phonemic substitution matrix constructed from substitution frequencies in a collection of trusted alternate pronunciations. These new metrics are sensitive to the degree of mutability among phonemes. An alignment tool uses this matrix to compare phoneme substitutions between pairs of pronunciations. Index Terms: grapheme-to-phoneme, edit distance, substitution matrix, phonetic distance measures
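
One plausible way to turn substitution frequencies from trusted alternate pronunciations into a weighted phonemic substitution matrix, and to use it in an edit-distance alignment, is sketched below in Python. The weighting scheme, the cost floor, and the toy ARPAbet-style pronunciations are assumptions made for illustration; the paper's actual matrix construction and alignment tool may differ.

from collections import Counter

def substitution_matrix(observed_pairs, floor=0.1):
    """Turn phoneme substitution counts (from trusted alternate pronunciations)
    into costs: the more often two phonemes substitute for each other, the
    cheaper the substitution, never dropping below a small floor.
    Assumes at least one non-identical pair has been observed."""
    counts = Counter(frozenset(p) for p in observed_pairs if p[0] != p[1])
    top = max(counts.values())
    return {pair: max(floor, 1.0 - c / top) for pair, c in counts.items()}

def substitution_cost(matrix, a, b):
    return 0.0 if a == b else matrix.get(frozenset((a, b)), 1.0)

def phoneme_edit_distance(matrix, pron_a, pron_b):
    """Weighted edit distance between two phoneme sequences."""
    d = [[0.0] * (len(pron_b) + 1) for _ in range(len(pron_a) + 1)]
    for i in range(len(pron_a) + 1):
        d[i][0] = float(i)
    for j in range(len(pron_b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(pron_a) + 1):
        for j in range(1, len(pron_b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + substitution_cost(matrix, pron_a[i - 1], pron_b[j - 1]))
    return d[len(pron_a)][len(pron_b)]

# Toy data: AH and IH substitute for each other often, so they count as "close".
matrix = substitution_matrix([("AH", "IH")] * 8 + [("AE", "EH")] * 2)
print(phoneme_edit_distance(matrix, ["T", "AH", "M", "EY", "T", "OW"],
                            ["T", "IH", "M", "EY", "T", "OW"]))   # 0.1 here, not 1.0

A metric built this way is sensitive to the degree of mutability among phonemes, which is the property the paper's new metrics are designed to capture.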