Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity and is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they refer to the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
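As a concrete illustration of the character-based, String-Based measures described above, here is a minimal Python sketch (illustrative only, not code from the survey) of the Levenshtein edit distance; a normalized lexical similarity can be obtained by dividing the distance by the length of the longer string.

# Levenshtein edit distance: the number of single-character insertions,
# deletions and substitutions needed to turn string a into string b,
# computed row by row with dynamic programming.
def levenshtein(a: str, b: str) -> int:
    previous = list(range(len(b) + 1))            # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,           # delete ca
                               current[j - 1] + 1,        # insert cb
                               previous[j - 1] + cost))   # substitute ca -> cb
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3: the words are lexically close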


Citations
Proceedings ArticleDOI
12 May 2019
TL;DR: This paper describes a self-correction approach based on sequence-to-sequence Neural Machine Translation (NMT), applied to rectify errors in the results of any information extraction approach such as Optical Character Recognition (OCR).
Abstract: In recent years, with the increasing usage of digital media and advancements in deep learning architectures, most paper-based documents have been converted into digital versions. These advancements have helped state-of-the-art information extraction and digital mailroom technologies become progressively more efficient. Although many efficient post-Information Extraction (IE) error rectification methods have been introduced in the recent past to improve the quality of digitized documents, they are still imperfect and demand improvements in the area of context-based error correction, specifically when dealing with documents involving sensitive information such as invoices. This paper describes a self-correction approach based on sequence-to-sequence Neural Machine Translation (NMT), applied to rectify errors in the results of any information extraction approach such as Optical Character Recognition (OCR). We accomplished this approach by exploiting the concepts of sequence learning with the help of feedback provided during each cycle of training. Finally, we compared state-of-the-art post-OCR error correction methods with our feedback learning approach; our empirical results outperform these state-of-the-art methods.

6 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...We measured the quality of output using Levenshtein distance[4]....


Journal ArticleDOI
TL;DR: These results could benefit recommendation systems, since the selection of interests considered for enrichment depends on the reliability of the profiles in which they are stored, and the quality of the created interest structure would need to evolve in order to improve the profile reliability result.
Abstract: Purpose - Generally, the user requires customized information reflecting his/her current needs and interests that are stored in his/her profile. There are many sources which may provide beneficial information to enrich the user's interests, such as his/her social network, for recommendation purposes. The proposed approach rests basically on predicting the reliability of users' profiles, which may contain conflicting interests. The paper aims to discuss this issue. Design/methodology/approach - This approach handles conflicts by detecting the reliability of the neighbors' profiles of a user. The authors consider that these profiles are dependent on one another as they may contain interests that are enriched from non-reliable profiles. The dependency relationship is determined between profiles, each of which contains interests that are structured based on the k-means algorithm. This structure takes into consideration not only the evolutionary aspect of interests but also their semantic relationships. Findings - The proposed approach was validated in a social-learning context, as evaluations were conducted on learners who are members of the Moodle e-learning system and the Delicious social network. The quality of the created interest structure is assessed, and then the result of the profile reliability is evaluated. The obtained results are satisfactory. These results could promote recommendation systems, as the selection of interests considered for enrichment depends on the reliability of the profiles where they are stored. Research limitations/implications - Some specific limitations are recorded, as the quality of the created interest structure needs to evolve in order to improve the profile reliability result. In addition, as Delicious is used as the main data source for the learner's interest enrichment, it would be necessary to obtain interests from other sources, such as e-recruitment systems. Originality/value - This research is among the pioneer papers to combine the semantic as well as the hierarchical structure of interests and conflict resolution based on a profile reliability approach.

6 citations

Book ChapterDOI
15 Apr 2020
TL;DR: A Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) algorithm is used to generate a large set of novel, malicious mutants that are diverse with respect to their behavioural and structural similarity to the original malware.
Abstract: In the field of metamorphic malware detection, training a detection model with malware samples that reflect potential mutants of the malware is crucial in developing a model resistant to future attacks. In this paper, we use a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) algorithm to generate a large set of novel, malicious mutants that are diverse with respect to their behavioural and structural similarity to the original malware. Using two classes of malware as a test-bed, we show that the MAP-Elites algorithm produces a large and diverse set of mutants that evade between 64% and 72% of the 63 detection engines tested. When compared to results obtained using repeated runs of an Evolutionary Algorithm that converges to a single solution, the MAP-Elites approach is shown to produce a significantly more diverse range of solutions, while providing equal or improved results in terms of evasiveness, depending on the dataset in question. In addition, the archive produced by MAP-Elites sheds insight into the properties of a sample that lead to it being undetectable by a suite of existing detection engines.

6 citations


Additional excerpts

  • ...The text based similarity measures the cosine similarity, fuzzy string match [4] and Levenshtein distance [10] between the original malware and the mutant....

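The MAP-Elites procedure used in the study above follows a simple generic loop. The sketch below is illustrative only and is not the authors' malware-specific system: evaluate() is a placeholder, its two descriptor values merely stand in for behavioural and structural similarity to the original malware, and fitness stands in for evasiveness. The archive keeps the fittest solution found in each cell of the discretized descriptor space, which is what yields a diverse set of elites.

import random

def evaluate(solution):
    # Placeholder fitness and behaviour descriptors; a real system would run
    # the mutant and measure evasiveness plus behavioural/structural similarity.
    fitness = sum(solution) / len(solution)
    descriptors = (solution[0], solution[-1])     # two values in [0, 1]
    return fitness, descriptors

def to_cell(descriptors, bins=10):
    # Discretize the descriptor space into an archive cell index.
    return tuple(min(int(d * bins), bins - 1) for d in descriptors)

def map_elites(dimensions=8, iterations=5000, bins=10):
    archive = {}                                  # cell -> (fitness, solution)
    for _ in range(iterations):
        if archive and random.random() < 0.9:
            _, parent = random.choice(list(archive.values()))
            child = [min(1.0, max(0.0, x + random.gauss(0, 0.1))) for x in parent]
        else:
            child = [random.random() for _ in range(dimensions)]
        fitness, descriptors = evaluate(child)
        cell = to_cell(descriptors, bins)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)      # keep the elite for this cell
    return archive

print(len(map_elites()), "cells filled")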

Journal ArticleDOI
TL;DR: In this article, the authors proposed a brain-wide DOT framework that integrates a cap-based whole-head optode placement system with multiple computational approaches, i.e., finite-element modeling, inverse source reconstruction, data-driven pattern recognition, and statistical correlation tomography, to reconstruct RSNs in dual contrasts of oxygenated (HbO) and deoxygenated hemoglobins.
Abstract: Objective. Diffuse optical tomography (DOT) has the potential in reconstructing resting state networks (RSNs) in human brains with high spatio-temporal resolutions and multiple contrasts. While several RSNs have been reported and successfully reconstructed using DOT, its full potential in recovering a collective set of distributed brain-wide networks with the number of RSNs close to those reported using functional magnetic resonance imaging (fMRI) has not been demonstrated. Approach. The present study developed a novel brain-wide DOT (BW-DOT) framework that integrates a cap-based whole-head optode placement system with multiple computational approaches, i.e. finite-element modeling, inverse source reconstruction, data-driven pattern recognition, and statistical correlation tomography, to reconstruct RSNs in dual contrasts of oxygenated (HbO) and deoxygenated hemoglobins (HbR). Main results. Our results from the proposed framework revealed a comprehensive set of RSNs and their subnetworks, which collectively cover almost the entire neocortical surface of the human brain, both at the group level and individual participants. The spatial patterns of these DOT RSNs suggest statistically significant similarities to fMRI RSN templates. Our results also reported the networks involving the medial prefrontal cortex and precuneus that had been missed in previous DOT studies. Furthermore, RSNs obtained from HbO and HbR suggest similarity in terms of both the number of RSN types reconstructed and their corresponding spatial patterns, while HbR RSNs show statistically more similarity to fMRI RSN templates and HbO RSNs indicate more bilateral patterns over two hemispheres. In addition, the BW-DOT framework allowed consistent reconstructions of RSNs across individuals and across recording sessions, indicating its high robustness and reproducibility, respectively. Significance. Our present results suggest the feasibility of using the BW-DOT, as a neuroimaging tool, in simultaneously mapping multiple RSNs and its potential values in studying RSNs, particularly in patient populations under diverse conditions and needs, due to its advantages in accessibility over fMRI.

6 citations

Journal ArticleDOI
TL;DR: In this article, the authors developed a journal recommender system, which compares the content similarities between a manuscript and the existing journal articles in two subject corpora (covering the social sciences and medicine).
Abstract: The purpose of this paper is to develop a journal recommender system, which compares the content similarities between a manuscript and the existing journal articles in two subject corpora (covering the social sciences and medicine). The study examines the appropriateness of three text similarity measures and the impact of numerous aspects of corpus documents on system performance.,Implemented three similarity measures one at a time on a journal recommender system with two separate journal corpora. Two distinct samples of test abstracts were classified and evaluated based on the normalized discounted cumulative gain.,The BM25 similarity measure outperforms both the cosine and unigram language similarity measures overall. The unigram language measure shows the lowest performance. The performance results are significantly different between each pair of similarity measures, while the BM25 and cosine similarity measures are moderately correlated. The cosine similarity achieves better performance for subjects with higher density of technical vocabulary and shorter corpus documents. Moreover, increasing the number of corpus journals in the domain of social sciences achieved better performance for cosine similarity and BM25.,This is the first work related to comparing the suitability of a number of string-based similarity measures with distinct corpora for journal recommender systems.

6 citations
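For reference, the best-performing measure in the study above, BM25, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: k1 = 1.5 and b = 0.75 are conventional defaults, tokenization is naive whitespace splitting, and the toy corpus is invented.

import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    docs = [d.lower().split() for d in documents]         # naive tokenization
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

corpus = ["text similarity for journal recommendation",
          "diffuse optical tomography of the brain",
          "string and corpus based similarity measures"]
print(bm25_scores("text similarity measures", corpus))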

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The WordNet lexical database (nouns, modifiers and verbs), its design and implementation, and applications of WordNet such as building semantic concordances are presented.
Abstract: Part 1, The lexical database: nouns in WordNet, George A. Miller; modifiers in WordNet, Katherine J. Miller; a semantic network of English verbs, Christiane Fellbaum; design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst; representing verb alternations in WordNet, Karen T. Kohl et al.; the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3, Applications of WordNet: building semantic concordances, Shari Landes et al.; performance and confidence in a semantic annotation task, Christiane Fellbaum et al.; WordNet and class-based probabilities, Philip Resnik; combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow; using WordNet for text retrieval, Ellen M. Voorhees; lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge; temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman; COLOR-X: using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet; knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I. Moldovan; appendix: obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

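The definition quoted above translates directly into code. A minimal sketch over word sets (illustrative only, not code from the survey):

def dice_coefficient(s1: str, s2: str) -> float:
    # Twice the number of common terms divided by the total number of terms.
    terms1, terms2 = set(s1.lower().split()), set(s2.lower().split())
    if not terms1 and not terms2:
        return 1.0
    return 2 * len(terms1 & terms2) / (len(terms1) + len(terms2))

print(dice_coefficient("text similarity survey", "a survey of text similarity"))  # 0.75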

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

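The reference above is the Smith-Waterman local alignment algorithm, which the quoted excerpt recommends for dissimilar sequences containing similar regions. A minimal Python sketch of its scoring step (the match/mismatch/gap values are arbitrary illustrative choices, not taken from the survey):

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    # Fill the local-alignment score matrix; scores never drop below zero,
    # so the best-scoring region can start anywhere inside either string.
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best   # score of the most similar local region

print(smith_waterman("computational", "notation"))   # the shared region "tation" scores highly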

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

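The excerpts above summarize the core LSA idea: words that occur in similar pieces of text receive similar vectors once the term-document matrix is reduced to a small number of latent dimensions. A minimal numpy sketch with an invented toy corpus (illustrative only, not the experiment from the paper):

import numpy as np

docs = ["human computer interaction",
        "user interface for computer systems",
        "graph of trees",
        "trees and graph minors"]
vocab = sorted({w for d in docs for w in d.split()})
# Term-document count matrix (rows: terms, columns: documents).
X = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD keeps k latent dimensions; terms with similar co-occurrence
# patterns end up close together even if they never appear in the same document.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

print(cosine(term_vectors[vocab.index("computer")],
             term_vectors[vocab.index("interface")]))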