Showing papers on "Edit distance" published in 2011


Journal ArticleDOI
01 Dec 2011
TL;DR: The paper introduces the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature, and proves that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity.
Abstract: We consider the classical tree edit distance between ordered labeled trees, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity, but the worst case happens frequently, or they are very efficient for some tree shapes, but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms. In this paper we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of RTED is smaller than or equal to the complexity of the best competitors for any input instance, i.e., RTED is both efficient and worst-case optimal. We introduce the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature. We prove that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state of the art.

176 citations
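
The abstract above defines the tree edit distance but, naturally, omits the recurrence behind it. The Python sketch below is a minimal memoized implementation of the classical rightmost-root forest decomposition with unit costs; it only makes the definition concrete and is not RTED, which chooses decomposition paths adaptively to obtain its complexity guarantees. The tree representation and the example trees are illustrative assumptions.

from functools import lru_cache

# A tree is a (label, children) pair, where children is a tuple of trees;
# a forest is a tuple of trees. Trees are ordered and labeled.

def tree_edit_distance(t1, t2):
    """Exact ordered tree edit distance with unit costs (illustrative only)."""

    def size(forest):
        return sum(1 + size(children) for _, children in forest)

    @lru_cache(maxsize=None)
    def dist(f1, f2):
        if not f1 and not f2:
            return 0
        if not f1:
            return size(f2)                        # insert every remaining node
        if not f2:
            return size(f1)                        # delete every remaining node
        (l1, c1), rest1 = f1[-1], f1[:-1]
        (l2, c2), rest2 = f2[-1], f2[:-1]
        return min(
            dist(rest1 + c1, f2) + 1,              # delete the rightmost root of f1
            dist(f1, rest2 + c2) + 1,              # insert the rightmost root of f2
            dist(rest1, rest2) + dist(c1, c2)      # match the two rightmost roots
            + (l1 != l2),                          # plus a relabel if labels differ
        )

    return dist((t1,), (t2,))

# f(a, b(c)) vs. f(a, c): deleting node "b" lets "c" move up, so the distance is 1.
t1 = ("f", (("a", ()), ("b", (("c", ()),))))
t2 = ("f", (("a", ()), ("c", ())))
print(tree_edit_distance(t1, t2))                  # 1

Memoizing on whole forests keeps the sketch short, but it is far less efficient than the decomposition strategies the paper analyzes.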


Journal ArticleDOI
01 Nov 2011
TL;DR: This paper studies string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, using a partition-based method called Pass-Join.
Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.

172 citations
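
The partition step of Pass-Join rests on a pigeonhole argument: if a string r is split into t + 1 segments and ed(r, s) <= t, the t edit operations can touch at most t segments, so at least one segment of r appears verbatim in s. A rough Python sketch of that filter follows; it ignores Pass-Join's inverted indices, position restrictions, and substring-selection techniques, and the example strings are made up.

def partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    segments, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments

def may_match(r, s, tau):
    """Pigeonhole filter: if ed(r, s) <= tau, then at most tau of the tau + 1
    segments of r can be touched by edits, so at least one segment of r must
    occur verbatim somewhere in s. (Vacuous if tau + 1 > len(r).)"""
    return any(segment in s for segment in partition(r, tau))

print(may_match("youtube", "yuotube", 1))    # True  -> still needs verification
print(may_match("youtube", "facebook", 1))   # False -> safely pruned

Pairs that survive the filter still have to be verified with an exact edit-distance computation, which is where the paper's pruning techniques come in.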


Posted Content
TL;DR: Wang et al. propose a partition-based method called Pass-Join, which partitions a string into a set of segments and creates inverted indices for the segments; then, for each string, it selects some of its substrings and uses the selected substrings to find candidate pairs via the inverted indices.
Abstract: As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.

171 citations


Journal ArticleDOI
TL;DR: A method for Minimum Bayes Risk decoding in speech recognition that has similar functionality to the widely used Consensus method, but has a clearer theoretical basis and appears to give better results both for MBR decoding and for system combination.

167 citations


Proceedings ArticleDOI
12 Jun 2011
TL;DR: Under this new measure, it is proved that subgraph similarity search is NP-hard while graph similarity matching is polynomial, and an information propagation model is found that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available.
Abstract: Search in complex social and information networks is becoming important in a variety of applications. At the core of these applications lies a common and critical problem: given a labeled network and a query graph, how to efficiently search for the query graph in the target network. The presence of noise and the incomplete knowledge about the structure and content of the target network make it unrealistic to find an exact match. Rather, it is more appealing to find the top-k approximate matches. In this paper, we propose a neighborhood-based similarity measure that avoids costly graph isomorphism and edit distance computation. Under this new measure, we prove that subgraph similarity search is NP-hard, while graph similarity matching is polynomial. By studying the principles behind this measure, we find an information propagation model that is able to convert a large network into a set of multidimensional vectors, where sophisticated indexing and similarity search algorithms are available. The proposed method, called Ness (Neighborhood Based Similarity Search), is appropriate for graphs with low automorphism and high noise, which are common in many social and information networks. Ness is not only efficient, but also robust against structural noise and information loss. Empirical results show that it can quickly and accurately find high-quality matches in large networks, with negligible cost.

149 citations


Proceedings ArticleDOI
Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin
12 Jun 2011
TL;DR: This paper shows that the minimum signature size lower bound is t + 1, proposes asymmetric signature schemes that achieve this lower bound, and develops efficient query processing algorithms based on the new scheme.
Abstract: Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold t. Most existing methods answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query time and index space complexities. In this paper, we show that the minimum signature size lower bound is t + 1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experimental results clearly demonstrate the efficiency of our methods.

106 citations
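
Verification of candidate pairs against a threshold t is commonly done with a banded dynamic program that fills only the cells within t of the main diagonal. The Python sketch below illustrates that generic idea; it is not the dynamic programming-based pruning proposed in the paper, and the example strings are arbitrary.

def edit_distance_at_most(a, b, t):
    """Return ed(a, b) if it is <= t, otherwise None.
    Only cells within t of the main diagonal are filled, so the work is
    O(t * len(a)) instead of O(len(a) * len(b))."""
    if abs(len(a) - len(b)) > t:
        return None
    big = t + 1                                    # any value > t means "over threshold"
    prev = [j if j <= t else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [big] * (len(b) + 1)
        if i <= t:
            curr[0] = i
        for j in range(max(1, i - t), min(len(b), i + t) + 1):
            curr[j] = min(prev[j] + 1,                           # delete a[i-1]
                          curr[j - 1] + 1,                       # insert b[j-1]
                          prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitute or match
        prev = curr
    return prev[-1] if prev[-1] <= t else None

print(edit_distance_at_most("surgery", "survey", 2))   # 2
print(edit_distance_at_most("surgery", "survey", 1))   # None

A production implementation would also abandon the computation as soon as every cell in the current band exceeds t.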


Book ChapterDOI
09 Jul 2011
TL;DR: By mapping messages into a large context, the authors can compute the distances between them, and then classify them, which yields more accurate classification of a set of Twitter messages than alternative techniques using string edit distance and latent semantic analysis.
Abstract: By mapping messages into a large context, we can compute the distances between them, and then classify them. We test this conjecture on Twitter messages: Messages are mapped onto their most similar Wikipedia pages, and the distances between pages are used as a proxy for the distances between messages. This technique yields more accurate classification of a set of Twitter messages than alternative techniques using string edit distance and latent semantic analysis.

97 citations


Journal ArticleDOI
TL;DR: A taxonomy is introduced that classifies indexing methods for approximate dictionary searching into direct methods and sequence-based filtering methods, with a focus on infrequently updated dictionaries that are used primarily for retrieval.
Abstract: The primary goal of this article is to survey state-of-the-art indexing methods for approximate dictionary searching. To improve understanding of the field, we introduce a taxonomy that classifies all methods into direct methods and sequence-based filtering methods. We focus on infrequently updated dictionaries, which are used primarily for retrieval. Therefore, we consider indices that are optimized for retrieval rather than for update. The indices are assumed to be associative, that is, capable of storing and retrieving auxiliary information, such as string identifiers. All solutions are lossless and guarantee retrieval of strings within a specified edit distance k. Benchmark results are presented for the practically important cases of k=1, 2, and 3. We concentrate on natural language datasets, which include synthetic English and Russian dictionaries, as well as dictionaries of frequent words extracted from the ClueWeb09 collection. In addition, we carry out experiments with dictionaries containing DNA sequences. The article is concluded with a discussion of benchmark results and directions for future research.

87 citations


Proceedings Article
01 Sep 2011
TL;DR: This work automatically annotates the English version of a multi-parallel corpus and projects the annotations into all the other language versions, and uses a phrase-based statistical machine translation system as well as a lookup of known names from a multilingual name database for the translation of English entities.
Abstract: As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck problem: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results across languages should be comparable. We solve the problem by automatically annotating the English version of a multi-parallel corpus and by projecting the annotations into all the other language versions. For the translation of English entities, we use a phrase-based statistical machine translation system as well as a lookup of known names from a multilingual name database. For the projection, we incrementally apply different methods: perfect string matching, perfect consonant signature matching and edit distance similarity. The resulting annotated parallel corpus will be made available for reuse.

75 citations


Proceedings Article
27 Jul 2011
TL;DR: This work presents a novel approach to automatic collocation error correction in learner English, based on paraphrases extracted from parallel corpora; the key assumption is that collocation errors are often caused by semantic similarity in the writer's first language (L1).
Abstract: We present a novel approach for automatic collocation error correction in learner English which is based on paraphrases extracted from parallel corpora. Our key assumption is that collocation errors are often caused by semantic similarity in the first language (L1-language) of the writer. An analysis of a large corpus of annotated learner English confirms this assumption. We evaluate our approach on real-world learner data and show that L1-induced paraphrases outperform traditional approaches based on edit distance, homophones, and WordNet synonyms.

74 citations


Posted Content
TL;DR: The Levenshtein distance method is improved by grouping similar-looking letters and reducing the weighted difference among members of the same group; the results show a marked improvement over the traditional Levenshtein distance technique.
Abstract: Dictionary lookup methods are popular in dealing with ambiguous letters which were not recognized by Optical Character Readers. However, a robust dictionary lookup method can be complex, as a priori probability calculation or a large dictionary size increases the overhead and the cost of searching. In this context, the Levenshtein distance is a simple metric which can be an effective string approximation tool. After observing the effectiveness of this method, we improved it by grouping some similar-looking letters and reducing the weighted difference among members of the same group. The results showed a marked improvement over the traditional Levenshtein distance technique.
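
A weighted Levenshtein distance with a reduced substitution cost inside groups of similar-looking characters can be sketched as follows in Python. The confusion groups and the 0.5 reduced weight here are illustrative guesses, not the groups or weights used in the paper.

# Illustrative confusion groups for OCR lookalikes (assumed, not from the paper).
GROUPS = [set("O0Q"), set("Il1|"), set("S5"), set("B8"), set("Z2")]

def substitution_cost(a, b, reduced=0.5):
    """Cost 0 for identical characters, a reduced cost within a confusion group,
    and the full cost of 1 otherwise."""
    if a == b:
        return 0.0
    return reduced if any(a in g and b in g for g in GROUPS) else 1.0

def weighted_levenshtein(s, t):
    """Standard Levenshtein DP, except that substitutions between
    similar-looking characters are cheaper."""
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = float(i)
    for j in range(len(t) + 1):
        d[0][j] = float(j)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                   # deletion
                          d[i][j - 1] + 1,                                   # insertion
                          d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]))
    return d[len(s)][len(t)]

print(weighted_levenshtein("B00K", "BOOK"))   # 1.0 (a plain Levenshtein distance gives 2)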

Journal ArticleDOI
TL;DR: The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers; it is based solely on determining the similarity in code and data area positions, which makes the algorithm effective against many ways of protecting executable code.
Abstract: One of the main trends in the modern anti-virus industry is the development of algorithms that help estimate the similarity of files. Since malware writers tend to use increasingly complex techniques to protect their code, such as obfuscation and polymorphism, anti-virus software vendors face problems of the increasing difficulty of file scanning, the considerable growth of anti-virus databases, and the overgrowth of file storage. For solving such problems, a static analysis of files appears to be of some interest. Its use helps determine those file characteristics that are necessary for their comparison without executing malware samples within a protected environment. The solution provided in this article is based on the assumption that different samples of the same malicious program have a similar order of code and data areas. Each such file area may be characterized not only by its length, but also by its homogeneity. In other words, the file may be characterized by the complexity of its data order. Our approach consists of using wavelet analysis for the segmentation of files into segments of different entropy levels and using the edit distance between the resulting segment sequences to determine the similarity of the files. The proposed solution has a number of advantages that help detect malicious programs efficiently on personal computers. First, this comparison does not take into account the functionality of analysed files and is based solely on determining the similarity in code and data area positions, which makes the algorithm effective against many ways of protecting executable code. On the other hand, such a comparison may result in false alarms. Therefore, our solution is useful as a preliminary test that triggers the running of additional checks. Second, the method is relatively easy to implement and does not require code disassembly or emulation. And, third, the method makes the malicious file record compact, which is significant when compiling anti-virus databases.
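
A much-simplified Python sketch of the idea: map a file to a sequence of coarse entropy classes and compare two files by the edit distance between their class sequences. Fixed-size blocks and hand-picked entropy thresholds stand in for the paper's wavelet-based segmentation, and the example data are synthetic.

import math
from collections import Counter

def entropy(block):
    """Shannon entropy of a block of bytes, in bits per byte."""
    counts = Counter(block)
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def entropy_profile(data, block_size=256, thresholds=(2.0, 4.0, 6.0)):
    """Map a file to a sequence of coarse entropy classes (0 .. len(thresholds)).
    Fixed-size blocks and fixed thresholds are assumptions made for this sketch."""
    classes = []
    for offset in range(0, len(data), block_size):
        e = entropy(data[offset:offset + block_size])
        classes.append(sum(e > t for t in thresholds))
    # Collapse runs so the sequence reflects the order of areas, not their lengths.
    return [c for i, c in enumerate(classes) if i == 0 or c != classes[i - 1]]

def sequence_edit_distance(p, q):
    """Plain edit distance between two class sequences."""
    d = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(q, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (a != b))
    return d[len(q)]

# Synthetic example: files whose class sequences are close are considered similar.
profile_a = entropy_profile(bytes(range(256)) * 4)                      # one high-entropy area
profile_b = entropy_profile(b"\x00" * 512 + bytes(range(256)) * 2)      # low- then high-entropy area
print(sequence_edit_distance(profile_a, profile_b))                     # 1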

Proceedings ArticleDOI
16 Jul 2011
TL;DR: This work generates bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects, and uses these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations.
Abstract: Speakers of many different languages use the Internet. A common activity among these users is uploading images and associating these images with words (in their own language) as captions, filenames, or surrounding text. We use these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations. Bilingual pairs of words are proposed as translations if their corresponding images have similar visual features. We generate bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects. The use of visual similarity substantially improves performance over standard approaches based on string similarity: for generated lexicons with 1000 translations, including visual information leads to an absolute improvement in accuracy of 8-12% over string edit distance alone.

Journal ArticleDOI
01 Mar 2011
TL;DR: An efficient probabilistic edit distance is proposed for providing explicit video-based LPR, and cognitive loops are introduced at critical stages of the algorithm to take advantage of the context modeling and increase overall system performance.
Abstract: License Plate Recognition (LPR) is mainly regarded as a solved problem. However, robust solutions able to face real-world scenarios still need to be proposed. Mostly, country-specific systems are designed, which can (artificially) reach high recognition rates. This option, however, strictly limits their applicability. In this paper, we propose an approach that can deal with various national plates. There are three main areas of novelty. First, the Optical Character Recognition (OCR) is managed by a hybrid strategy, combining statistical and structural algorithms. Secondly, an efficient probabilistic edit distance is proposed for providing an explicit video-based LPR. Last but not least, cognitive loops are introduced at critical stages of the algorithm. These feedback steps take advantage of the context modeling to increase the overall system performance and overcome the inextricable parameter settings of the low-level processing. The system's performance has been tested on more than 1200 static images with difficult illumination conditions and complex backgrounds, as well as on six different videos containing 525 moving vehicles. The evaluations prove our system to be very competitive among the non-country-specific approaches.

Posted Content
TL;DR: In this paper, the authors show how to approximate the edit distance between two strings of length n within a factor of 2^(Õ(sqrt(log n))) in n^(1+o(1)) time, the first sub-polynomial approximation algorithm for this problem that runs in near-linear time.
Abstract: We show how to compute the edit distance between two strings of length n up to a factor of 2^(Õ(sqrt(log n))) in n^(1+o(1)) time. This is the first sub-polynomial approximation algorithm for this problem that runs in near-linear time, improving on the state-of-the-art n^(1/3+o(1)) approximation. Previously, an approximation of 2^(Õ(sqrt(log n))) was known only for embedding edit distance into l_1, and it is not known if that embedding can be computed in less than quadratic time.

Proceedings ArticleDOI
18 Sep 2011
TL;DR: The proposed approach effectively segments the alignment problem into small subproblems, which in turn yields dramatic time savings even when there are large pieces of inserted or deleted text and the OCR accuracy is poor.
Abstract: This paper aims to evaluate the accuracy of optical character recognition (OCR) systems on real scanned books. The ground truth e-texts are obtained from the Project Gutenberg website and aligned with their corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is recursively applied to each text segment in between matching unique words until the text segments become very small. In the final stage, an edit-distance-based alignment algorithm is used to align these short chunks of text to generate the final alignment. The proposed approach effectively segments the alignment problem into small subproblems, which in turn yields dramatic time savings even when there are large pieces of inserted or deleted text and the OCR accuracy is poor. This approach is used to evaluate the OCR accuracy of real scanned books in English, French, German and Spanish.
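
The recursive anchoring step can be sketched as follows in Python: words that occur exactly once in both texts serve as anchors, the texts are split at those anchors, and the procedure recurses until the remaining chunks are small enough to hand to an edit-distance alignment. This is an illustrative simplification of the RETAS scheme (for instance, it picks a monotone anchor set greedily rather than optimally) and not the authors' code; the chunk-size cutoff is an assumption.

def align(gt_words, ocr_words, min_len=10):
    """Recursively split two token lists at words that occur exactly once in both,
    until the remaining chunks are small; those chunks would then be aligned with
    an edit-distance DP, as in the final stage described above."""
    if min(len(gt_words), len(ocr_words)) <= min_len:
        return [(gt_words, ocr_words)]           # small chunk: hand off to DP alignment

    def unique_positions(words):
        seen = {}
        for i, w in enumerate(words):
            seen[w] = -1 if w in seen else i
        return {w: i for w, i in seen.items() if i >= 0}

    u1, u2 = unique_positions(gt_words), unique_positions(ocr_words)
    # Anchor on words unique in both texts, kept monotone greedily (RETAS is smarter here).
    anchors, last_j = [], -1
    for i, j in sorted((u1[w], u2[w]) for w in u1.keys() & u2.keys()):
        if j > last_j:
            anchors.append((i, j))
            last_j = j
    if not anchors:
        return [(gt_words, ocr_words)]

    chunks, pi, pj = [], 0, 0
    for i, j in anchors:
        chunks += align(gt_words[pi:i], ocr_words[pj:j], min_len)
        pi, pj = i + 1, j + 1
    return chunks + align(gt_words[pi:], ocr_words[pj:], min_len)

Each recursive call excludes at least one anchor word, so the chunks shrink and the recursion terminates.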

Journal ArticleDOI
TL;DR: The character confusion-based prototype of Text-Induced Corpus Clean-up is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents, showing that the system is not sensitive to domain variation.
Abstract: We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting error and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (ticcl) is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed on its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk’ show that the system is not sensitive to domain variation.

Proceedings ArticleDOI
12 Jun 2011
TL;DR: This paper proposes a unified framework to support many similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance, and shows that the method achieves high performance and outperforms state-of-the-art studies.
Abstract: Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard similarity) or character-based dissimilarity (e.g., edit distance). This calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming effort, hardware requirements, and manpower. In addition, many substrings in the document overlap, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as Jaccard similarity, cosine similarity, Dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.

Journal ArticleDOI
TL;DR: Testing the Levenshtein distance for classifying languages, by subsampling three language subsets from a large database of Austronesian languages, shows poor performance and suggests the need for more linguistically nuanced methods for automated language classification tasks.
Abstract: The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of the time. Standardizing the orthography increases the performance, but only to a maximum of 65% accuracy within language subgroups. The accuracy of the Levenshtein classification decreases rapidly with phylogenetic distance, failing to discriminate homology and chance similarity across distantly related languages. This poor performance suggests the need for more linguistically nuanced methods for automated language classification tasks.
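
The measure being evaluated is, in essence, a length-normalized Levenshtein distance averaged over aligned meaning slots of two word lists. A small Python sketch of that computation follows; the toy three-word lists (roughly Malay vs. Hawaiian forms for "five", "eye", "fire") are only an illustration, not the paper's data.

def normalized_levenshtein(a, b):
    """Levenshtein distance divided by the length of the longer word."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[len(b)] / max(len(a), len(b), 1)

def language_distance(wordlist_a, wordlist_b):
    """Average normalized Levenshtein distance over aligned meaning slots."""
    pairs = [(a, b) for a, b in zip(wordlist_a, wordlist_b) if a and b]
    return sum(normalized_levenshtein(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0

# Toy word lists for the meanings "five", "eye", "fire".
print(language_distance(["lima", "mata", "api"], ["lima", "maka", "ahi"]))   # ~0.19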

Journal ArticleDOI
TL;DR: A novel technique for efficient privacy-preserving approximate record linkage that combines a secure blocking component, based on phonetic algorithms statistically enhanced to improve security, with a secure matching component in which the actual approximate matching is performed using the Levenshtein distance algorithm.
Abstract: Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where the actual approximate matching is performed using a novel private variant of the Levenshtein distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching. Category: Ubiquitous computing; Security and privacy

Proceedings ArticleDOI
07 Jun 2011
TL;DR: It is demonstrated that while consistent discovery is tractable for sparse random graphs using a small number of participants, in general there are graphs which cannot be discovered by any algorithm even with a significant number of participants and with the availability of end-to-end information along all the paths between the participants.
Abstract: We consider the task of topology discovery of sparse random graphs using end-to-end random measurements (e.g., delay) between a subset of nodes, referred to as the participants. The rest of the nodes are hidden, and do not provide any information for topology discovery. We consider topology discovery under two routing models: (a) the participants exchange messages along the shortest paths and obtain end-to-end measurements, and (b) additionally, the participants exchange messages along the second shortest path. For scenario (a), our proposed algorithm results in a sub-linear edit-distance guarantee using a sub-linear number of uniformly selected participants. For scenario (b), we obtain a much stronger result, and show that we can achieve consistent reconstruction when a sub-linear number of uniformly selected nodes participate. This implies that accurate discovery of sparse random graphs is tractable using an extremely small number of participants. We finally obtain a lower bound on the number of participants required by any algorithm to reconstruct the original random graph up to a given edit distance. We also demonstrate that while consistent discovery is tractable for sparse random graphs using a small number of participants, in general, there are graphs which cannot be discovered by any algorithm even with a significant number of participants, and with the availability of end-to-end information along all the paths between the participants.

Journal ArticleDOI
TL;DR: This work suggests an efficient approach for solving the atom mapping problem exactly--finding mappings of minimum edge edit distance based on A* search equipped with sophisticated heuristics for pruning the search space.
Abstract: The ability to trace the fate of individual atoms through the metabolic pathways is needed in many applications of systems biology and drug discovery. However, this information is not immediately available from the most common metabolome studies and needs to be separately acquired. Automatic discovery of correspondence of atoms in biochemical reactions is called the "atom mapping problem." We suggest an efficient approach for solving the atom mapping problem exactly--finding mappings of minimum edge edit distance. The algorithm is based on A* search equipped with sophisticated heuristics for pruning the search space. This approach has clear advantages over the commonly used heuristic approach of iterative maximum common subgraph (MCS) algorithm: we explicitly minimize an objective function, and we produce solutions that typically require less manual curation. The two methods are similar in computational resource demands. We compare the performance of the proposed algorithm against several alternatives on data obtained from the KEGG LIGAND and RPAIR databases: greedy search, bi-partite graph matching, and the MCS approach. Our experiments show that alternative approaches often fail in finding mappings with minimum edit distance.

Journal ArticleDOI
TL;DR: The Edit distance and the DNFP methods have the highest discrimination powers and can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2.
Abstract: DNA barcoding technology, which uses a short piece of DNA sequence to identify species, has a wide range of applications. To date, a universal DNA barcode marker for plants remains elusive. The rbcL and matK regions have been proposed as the “core barcode” for plants and the ITS2 and psbA-trnH intergenic spacer (PTIGS) regions were later added as supplemental barcodes. The use of the PTIGS region as a supplemental barcode has been limited by the lack of computational tools that can handle significant insertions and deletions in the PTIGS sequences. Here, we compared the most commonly used alignment-based and alignment-free methods and developed a web server to allow biologists to carry out PTIGS-based DNA barcoding analyses. First, we compared several alignment-based methods such as BLAST and those calculating the P distance and edit distance, the alignment-free Di-Nucleotide Frequency Profile (DNFP) method, and their combinations. We found that the DNFP and edit-distance methods increased the identification success rate to ~80%, 20% higher than the most commonly used BLAST method. Second, the combined methods showed an overall better success rate and performance. Last, we have developed a web server that allows (1) retrieving various sub-regions and the consensus sequences of PTIGS, (2) annotating novel PTIGS sequences, (3) determining species identity by PTIGS sequences using eight methods, and (4) examining identification efficiency and performance of the eight methods for various taxonomy groups. The edit distance and the DNFP methods have the highest discrimination powers. Hybrid methods can be used to achieve significant improvement in performance. These methods can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2. To our knowledge, the web server developed here is the only one that allows species determination based on PTIGS sequences. The web server can be accessed at http://psba-trnh-plantidit.dnsalias.org .
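
The alignment-free DNFP method is built on di-nucleotide frequency profiles. The abstract does not spell out the exact profile or classifier used by the web server, so the Python sketch below only illustrates the generic idea: turn a sequence into a 16-dimensional vector of dinucleotide frequencies and compare vectors by Euclidean distance (the distance function is an assumption).

from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]   # 16 dimensions

def dnfp(sequence):
    """Di-nucleotide frequency profile: relative frequency of each dinucleotide."""
    sequence = sequence.upper()
    counts = {d: 0 for d in DINUCLEOTIDES}
    total = 0
    for i in range(len(sequence) - 1):
        d = sequence[i:i + 2]
        if d in counts:                       # skip windows containing N, gaps, etc.
            counts[d] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]

def dnfp_distance(a, b):
    """Euclidean distance between two profiles; a query would be assigned to the
    species whose reference profile is nearest."""
    return sum((x - y) ** 2 for x, y in zip(dnfp(a), dnfp(b))) ** 0.5

print(dnfp_distance("ACGTACGTTAGC", "ACGTTCGTTAGC"))   # small value: similar composition

Because the profile ignores positions entirely, it is unaffected by the large insertions and deletions that make PTIGS sequences hard to align.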

Journal ArticleDOI
TL;DR: This work presents a local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate, and applies the SWIFT algorithm for lossless filtering.
Abstract: Large-scale comparison of genomic sequences requires reliable tools for the search of local alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. We present here the local pairwise aligner STELLAR that has full sensitivity for ε-alignments, i.e. guarantees to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local alignments. STELLAR is very practical and fast on very long sequences, which makes it a suitable new tool for finding local alignments between genomic sequences under the edit distance model. Binaries are freely available for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar . The source code is freely distributed with the SeqAn C++ library version 1.3 and later at http://www.seqan.de .

Proceedings ArticleDOI
24 Oct 2011
TL;DR: This paper proposes a unified method to tackle the task of data record extraction from Web pages by addressing several key issues in a uniform manner and achieves higher accuracy compared with three state-of-the-art methods.
Abstract: Although the task of data record extraction from Web pages has been studied extensively, existing methods still fail to handle many pages due to their complexity in format or layout. In this paper, we propose a unified method to tackle this task by addressing several key issues in a uniform manner. A new search structure, named the Record Segmentation Tree (RST), is designed, and several efficient search pruning strategies on the RST structure are proposed to identify the records in a given Web page. Another characteristic of our method, which is significantly different from previous works, is that it can effectively handle complicated and challenging data record regions. This is achieved by generating subtree groups dynamically from the RST structure during the search process. Furthermore, instead of using string edit distance or tree edit distance, we propose a token-based edit distance which takes each DOM node as a basic unit in the cost calculation. Extensive experiments are conducted on four data sets, including flat, nested, and intertwined records. The experimental results demonstrate that our method achieves higher accuracy compared with three state-of-the-art methods.

Book ChapterDOI
27 Jun 2011
TL;DR: This paper shows that the problem of finding an edit distance between unordered trees is MAX SNP-hard even if the height of the trees is 2 or their degree is 2, and, under a unit cost, even if the height is 3 or the degree is 2.
Abstract: Zhang and Jiang (1994) have shown that the problem of finding an edit distance between unordered trees is MAX SNP-hard. In this paper, we show that this problem is MAX SNP-hard, even if (1) the height of trees is 2, (2) the degree of trees is 2, (3) the height of trees is 3 under a unit cost, and (4) the degree of trees is 2 under a unit cost.

Journal ArticleDOI
TL;DR: This paper presents a fixed-parameter algorithm for the tree edit distance problem for unordered trees under the unit cost model that works in O(2.62^k · poly(n)) time and O(n^2) space, where the parameter k is the maximum bound of the edit distance and n is the maximum size of the input trees.

Journal ArticleDOI
TL;DR: This paper introduces a simple Sum-over-Paths (SoP) formulation of string edit distances accounting for all possible alignments between two sequences, and extends related previous work from bioinformatics to the case of graphs with cycles.

Book ChapterDOI
10 Oct 2011
TL;DR: SpSim is a new spelling similarity measure for cognate identification that is tolerant of characteristic spelling differences automatically extracted from a set of cognates known a priori.
Abstract: The most commonly used measures of string similarity, such as the Longest Common Subsequence Ratio (LCSR) and those based on Edit Distance, only take into account the number of matched and mismatched characters. However, we observe that cognates belonging to a pair of languages exhibit recurrent spelling differences such as "ph" and "f" in the English-Portuguese cognates "phase" and "fase". Those differences are attributable to the evolution of the spelling rules of each language over time, and thus they should not be penalized in the same way as arbitrary differences found in non-cognate words, if we are using word similarity as an indicator of cognaticity. This paper describes SpSim, a new spelling similarity measure for cognate identification that is tolerant of characteristic spelling differences automatically extracted from a set of cognates known a priori. Compared to LCSR and EdSim (Edit Distance-based similarity), SpSim yields an F-measure 10% higher when used for cognate identification on five different language pairs.

Proceedings ArticleDOI
27 Aug 2011
TL;DR: A variety of metrics are explored to compare the automatic pronunciation methods of three freely-available grapheme-to-phoneme packages on a large dictionary using a novel weighted phonemic substitution matrix constructed from substitution frequencies in a collection of trusted alternate pronunciations.
Abstract: As grapheme-to-phoneme methods proliferate, their careful evaluation becomes increasingly important. This paper explores a variety of metrics to compare the automatic pronunciation methods of three freely-available grapheme-to-phoneme packages on a large dictionary. Two metrics, presented here for the first time, rely upon a novel weighted phonemic substitution matrix constructed from substitution frequencies in a collection of trusted alternate pronunciations. These new metrics are sensitive to the degree of mutability among phonemes. An alignment tool uses this matrix to compare phoneme substitutions between pairs of pronunciations. Index Terms: grapheme-to-phoneme, edit distance, substitution matrix, phonetic distance measures
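
One plausible way to turn substitution frequencies from trusted alternate pronunciations into a weighted phonemic substitution matrix, and to use it in an edit-distance alignment, is sketched below in Python. The weighting scheme, the cost floor, and the toy ARPAbet-style pronunciations are assumptions made for illustration; the paper's actual matrix construction and alignment tool may differ.

from collections import Counter

def substitution_matrix(observed_pairs, floor=0.1):
    """Turn phoneme substitution counts (from trusted alternate pronunciations)
    into costs: the more often two phonemes substitute for each other, the
    cheaper the substitution, never dropping below a small floor.
    Assumes at least one non-identical pair has been observed."""
    counts = Counter(frozenset(p) for p in observed_pairs if p[0] != p[1])
    top = max(counts.values())
    return {pair: max(floor, 1.0 - c / top) for pair, c in counts.items()}

def substitution_cost(matrix, a, b):
    return 0.0 if a == b else matrix.get(frozenset((a, b)), 1.0)

def phoneme_edit_distance(matrix, pron_a, pron_b):
    """Weighted edit distance between two phoneme sequences."""
    d = [[0.0] * (len(pron_b) + 1) for _ in range(len(pron_a) + 1)]
    for i in range(len(pron_a) + 1):
        d[i][0] = float(i)
    for j in range(len(pron_b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(pron_a) + 1):
        for j in range(1, len(pron_b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + substitution_cost(matrix, pron_a[i - 1], pron_b[j - 1]))
    return d[len(pron_a)][len(pron_b)]

# Toy data: AH and IH substitute for each other often, so they count as "close".
matrix = substitution_matrix([("AH", "IH")] * 8 + [("AE", "EH")] * 2)
print(phoneme_edit_distance(matrix, ["T", "AH", "M", "EY", "T", "OW"],
                            ["T", "IH", "M", "EY", "T", "OW"]))   # 0.1 here, not 1.0

A metric built this way is sensitive to the degree of mutability among phonemes, which is the property the paper's new metrics are designed to capture.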