Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013 - International Journal of Computer Applications (Foundation of Computer Science (FCS)) - Vol. 68, Iss. 13, pp. 13-18
TL;DR: This survey discusses existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarity; samples of combinations between these measures are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION

Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or if one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms.

String-Based measures operate on string sequences and character composition. A string metric is a metric that measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly.

This paper is organized as follows: Section two presents String-Based algorithms, partitioning them into character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
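The character-based String-Based family the survey covers can be illustrated concretely. Below is a minimal sketch of the Levenshtein (edit distance) measure; normalizing the distance into a [0, 1] similarity score is a common convention, not a definition from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def lexical_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(lexical_similarity("colour", "color"))  # ~0.83: lexically similar
```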


Citations
Proceedings ArticleDOI
01 Oct 2020
TL;DR: A model for finding influenced citations in scientific documents using a semantic approach rather than simple keyword matching; the proposed technique discovers the relevance between a scientific document and the list of reference papers in that document semantically.
Abstract: Citation analysis is an essential part of research: it gauges the influence of a work and gives credit to researchers. Citations are used by researchers for many purposes, but mainly to show that the cited references influenced the author's work, and nowadays most references are influential in one way or another. The proposed model finds influenced citations in scientific documents using a semantic approach rather than simple keyword matching. Every scientific document contains references or a bibliography at the end. The goal of the proposed technique is to discover the relevance between a scientific document and the list of reference papers in that document in a semantic way. The work is developed using a proposed semantic similarity measure, called Modified Word Mover's Distance (MWMD), applied to a deep learning model. Finally, the proposed work efficiently finds the percentage of influenced citations in scientific documents.

1 citation
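The paper's MWMD measure is its own contribution, but the standard Word Mover's Distance it modifies can be sketched with gensim's `wmdistance` (recent gensim versions need the POT package for this call); the pretrained embedding name and the example texts below are illustrative assumptions.

```python
# Sketch of standard Word Mover's Distance, the baseline that MWMD modifies.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained embeddings

citing = "this work builds on the cited alignment method".split()
cited = "we propose an algorithm for aligning sequences".split()
unrelated = "the restaurant serves excellent pasta on weekends".split()

# WMD is the minimum cumulative embedding distance needed to "move"
# one text's words onto the other's; lower means semantically closer.
print(vectors.wmdistance(citing, cited))      # expected: smaller
print(vectors.wmdistance(citing, unrelated))  # expected: larger
```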

Proceedings ArticleDOI
03 Mar 2016
TL;DR: The proposed solution is tested on standard datasets such as Enron and LingSpam, along with personal E-mail messages (PEM), to empirically prove its strength.
Abstract: The internet E-mail infrastructure has become a popular and widely used means of communication for personal, business and academic purposes because it is fast, cheap and very efficient. People use E-mail for their day-to-day work, yet they receive many unwanted E-mails from unknown senders; these unwanted E-mails are called spam E-mails. In this paper, a technological solution for blocking spam E-mail is discussed. It consists of a combination of origin-based filters (OBF) and content-based filters (CBF) and is adaptive in nature. The CBF component contains two classifiers: a machine-learning-based classifier (MLC) and a semantic-similarity, edge-based classifier (SSC). The solution is tested on standard datasets such as Enron and LingSpam, along with personal E-mail messages (PEM). The results empirically prove the strength of this solution.

1 citation

Proceedings Article
Jin Zeng1, Jidong Ge1, Yemao Zhou1, Yi Feng1, Chuanyi Li1, Zhongjin Li1, Bin Luo1 
01 Jan 2017
TL;DR: Text similarity measures are summarized and gradually extended to Latent Semantic Analysis; the experiment shows that the statutes predicted by LSA are more accurate than those predicted by TF-IDF alone.
Abstract: The traditional approach to measuring text similarity is to build document vectors with the TF-IDF algorithm and then compute the cosine similarity between them. However, this purely statistical method ignores the latent semantics of articles and words; it operates only on the words themselves. With Latent Semantic Analysis, a semantic space is added on top of the TF-IDF computation: through Singular Value Decomposition, each word and document is given a position in that semantic space. This allows semantic analysis, document clustering, and the discovery of relationships between semantic classes and document classes to be carried out at the same time. Here, we summarize text similarity measures and gradually extend them to Latent Semantic Analysis. The experiment shows that the statutes predicted by LSA are more accurate than those predicted by TF-IDF alone.

1 citation


Cites background from "A Survey of Text Similarity Approaches"

  • ...Related Work Finding the similarity between words and words is the basis of finding the similarity between sentences and sentences ([2])....

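The pipeline this abstract describes (TF-IDF document vectors compared by cosine similarity, then LSA added via Singular Value Decomposition) maps directly onto scikit-learn; a minimal sketch with made-up documents, where TruncatedSVD plays the role of the SVD step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the court applied the statute on contract disputes",
    "a judge resolved the disagreement over the agreement",
    "fresh vegetables are sold at the morning market",
]

# Plain TF-IDF: similarity is driven by exact term overlap only.
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))

# LSA: SVD projects terms and documents into a shared semantic space,
# so related texts can match without sharing surface words.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(cosine_similarity(lsa[0:1], lsa[1:2]))
```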

Journal ArticleDOI
TL;DR: In this paper, latent semantic analysis (LSA) is used to find patterns in a relatively small sample of notable works archived by Project Gutenberg; it is shown that an LSA-equipped AI can distinguish quite sharply between fiction and non-fiction works, and can detect some differences between political philosophy and history.
Abstract: Can an Artificial Intelligence make distinctions among major works of politics, philosophy, and fiction without human assistance? In this paper, latent semantic analysis (LSA) is used to find patterns in a relatively small sample of notable works archived by Project Gutenberg. It is shown that an LSA-equipped AI can distinguish quite sharply between fiction and non-fiction works, and can detect some differences between political philosophy and history, and between conventional fiction and fantasy/science fiction. It is conjectured that this capability is a step in the direction of “M-comprehension” (or “machine comprehension”) by AIs.

1 citation


Cites background from "A Survey of Text Similarity Approaches"

  • ...Others have surveyed this literature (Foltz 1998; Gomaa and Fahmy 2013; Mikolov et al....

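A hedged sketch of the kind of experiment described above: project labeled texts into a low-dimensional LSA space and check whether a linear classifier separates fiction from non-fiction. The snippets and labels below are stand-ins for the paper's Project Gutenberg sample.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in snippets; the paper used full works from Project Gutenberg.
texts = [
    "the dragon soared over the enchanted kingdom at dawn",
    "she whispered a spell and the forest came alive",
    "the detective followed the stranger through foggy streets",
    "the starship drifted silently past the dying sun",
    "the treaty of 1648 reshaped the politics of europe",
    "the state exists to secure the rights of its citizens",
    "tax revenues declined sharply during the long recession",
    "the constitution divides power among three branches",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = fiction, 0 = non-fiction

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),  # the LSA step
    LogisticRegression(),
)
print(cross_val_score(model, texts, labels, cv=2).mean())
```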

Proceedings ArticleDOI
01 Jun 2015
TL;DR: The TATO system which participated in the SemEval-2015 Task 2a: “Semantic Textual Similarity (STS) for English” is described, which combines multiple similarity measures of varying complexity ranging from simple lexical and syntactic similarity measures to complex semantic similarity ones to compute semantic textual similarity.
Abstract: In this paper, we describe the TATO system, which participated in SemEval-2015 Task 2a: “Semantic Textual Similarity (STS) for English”. Our system is trained on published datasets from the previous competitions. Based on machine learning techniques, it combines multiple similarity measures of varying complexity, ranging from simple lexical and syntactic measures to complex semantic ones, to compute semantic textual similarity. Our final model consists of a simple linear combination of about 30 main features selected from the numerous features we experimented with. The results are promising, with Pearson’s coefficients on each individual dataset ranging from 0.6796 to 0.8167 and an overall weighted mean score of 0.7422, well above the task baseline system.

1 citation
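The "simple linear combination of about 30 main features" can be reproduced in miniature: fit a linear regression over a few similarity features and evaluate with Pearson's coefficient, the official STS metric. The three features and toy gold scores below are illustrative, not the TATO system's.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# Each row: similarity features for one sentence pair, e.g.
# [lexical overlap, edit-distance similarity, LSA cosine] (illustrative).
X = np.array([
    [0.9, 0.8, 0.95],
    [0.2, 0.3, 0.40],
    [0.6, 0.5, 0.70],
    [0.1, 0.2, 0.15],
    [0.7, 0.9, 0.80],
])
y = np.array([4.8, 1.5, 3.2, 0.6, 4.1])  # gold similarity scores (0-5 scale)

model = LinearRegression().fit(X, y)  # learns one weight per measure
pred = model.predict(X)
r, _ = pearsonr(y, pred)              # STS's official evaluation metric
print(f"Pearson's r = {r:.4f}")
```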

References
Journal ArticleDOI
01 Sep 2000 - Language
TL;DR: The WordNet lexical database (nouns, modifiers and a semantic network of English verbs) and its applications, such as building semantic concordances, are presented.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.

13,049 citations
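WordNet is the semantic network behind most of the Knowledge-Based measures in the survey; a minimal sketch of two common WordNet similarities via NLTK (assumes the wordnet corpus can be downloaded):

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
car = wn.synset("car.n.01")

# Path similarity: inverse of the shortest is-a path between concepts.
print(dog.path_similarity(cat))  # higher: both are nearby in the taxonomy
print(dog.path_similarity(car))  # lower: distant concepts

# Wu-Palmer similarity: based on the depth of the least common subsumer.
print(dog.wup_similarity(cat))
```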

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations
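This reference is the Needleman-Wunsch algorithm, which the survey lists among its character-based measures; a compact dynamic-programming sketch of the global alignment score, with conventional match/mismatch/gap values rather than parameters from the original paper:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best *global* alignment of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap  # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag,
                              score[i-1][j] + gap,   # gap in b
                              score[i][j-1] + gap)   # gap in a
    return score[-1][-1]

print(needleman_wunsch("GATTACA", "GCATGCU"))
```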

Journal ArticleDOI
01 Jul 1945 - Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

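The quoted definition translates directly into code; a minimal sketch over whitespace-separated tokens, treating terms as a set (a common simplification):

```python
def dice_coefficient(s1: str, s2: str) -> float:
    """2 * |common terms| / (|terms in s1| + |terms in s2|)."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

# "night" and "is" are shared: 2 * 2 / (3 + 4) ~= 0.57
print(dice_coefficient("night is dark", "the night is young"))
```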

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

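This entry is the Smith-Waterman algorithm. It differs from Needleman-Wunsch (sketched earlier) by clamping cell scores at zero and returning the matrix maximum, which is what lets it find the best local region of similarity; a minimal sketch with conventional scoring values:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best *local* alignment between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # first row/column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(0,  # restart: local scores never go negative
                              diag,
                              score[i-1][j] + gap,   # gap in b
                              score[i][j-1] + gap)   # gap in a
            best = max(best, score[i][j])
    return best

# High local score despite globally dissimilar strings:
print(smith_waterman("xxxxGATTACAyyyy", "zzGATTACAzz"))
```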

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....
