Journal Article

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey reviews existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarity. It also presents samples of combinations of these approaches.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations of these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION

Text similarity measures play an increasingly important role in text-related research and applications, in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities.

Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms.

String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly.

This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations of similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
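To make the idea of a string metric concrete, the following is a minimal sketch of one character-based measure, the Levenshtein edit distance. The choice of this metric, the function names `levenshtein` and `lexical_similarity`, and the normalization by the longer string's length are assumptions made for illustration; they are not taken from this page.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Map the distance to a similarity in [0, 1] (illustrative normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` counts two substitutions and one insertion, so the two words are lexically close even though they are unrelated in meaning, which is the gap the corpus-based and knowledge-based measures are meant to cover.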


Citations
09 Jul 2016
TL;DR: Past studies are discussed and a framework is proposed for finding the resemblance between user profiles, which can help recruiters select the best candidates for a job profile.
Abstract: Online Social Networking is increasing at a fast rate. There are lots of user profiles, and there is a great deal of resemblance between them, which can help recruiters select the best candidates for a job profile. Each similarity measure has its own applicability and is best suited to a particular type of attribute value; if these measures are combined, they can help find the best resemblance among user profiles, with results that match the actual result. In this paper, we discuss past studies and describe how our research proposes a framework for finding the resemblance.

3 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Reference [24] discussed about the various text based similarity approaches....

    [...]

Journal Article
TL;DR: In this article, the authors compared three page similarity comparison methods using similarity computing models to compute page pairwise similarity in image level, text level, and image & text level in order to maintain users' reading experience continuity across e-book revisions.
Abstract: E-book reader supports users to create digital learning footprints in many forms like highlighting sentences or taking memos. Nowadays, it also allows an instructor to update their e-books in the e-book reader. However, e-book users often face problems when trying to find learning footprints they made in a new version e-book. Thus, users’ reading experience continuity across e-book revisions is hard to be maintained and seems to become a shortcoming within the e-book system. In this paper, in order to maintain users’ reading experience continuity, we deal with the transfer of learning footprints such as a marker, memo, and bookmark across e-book revisions on an e-book reader in a coursework scenario. We first give introduction and related works to demonstrate how researchers dedicated on the problem mentioned in this paper and page similarity comparison. Then, we compare three page similarity comparison methods using similarity computing models to compute page pairwise similarity in image level, text level, and image & text level. In the analysis, for each level, we analyze the performance of transferring learning footprint across e-book revisions and also the optimal threshold for similar page determination. After that, we give the analysis results to show the performances of three methods in image level, text level, and image & text level, and then, the error analysis is presented to specify the error types that occur in the results. We then propose page image & text similarity comparison as the optimal method to automatically transfer learning footprints across e-book revisions based on the analysis results and error analysis among three compared methods. Finally, the discussion and conclusions are shown in the end of this paper.

3 citations

Journal Article
TL;DR: This paper analyzed YouTube live streaming comments in order to understand spammers’ behavior and found that the features that performed best in terms of run time and classification efficiency are the relevant score together with the time spent in live chat and the number of messages per user.
Abstract: Live streaming is becoming a popular channel for advertising and marketing. An advertising company can use this feature to broadcast and reach a large number of customers. YouTube is one of the streaming media with an extreme growth rate and a large number of viewers. Thus, it has become a primary target of spammers and attackers. Understanding the behavior of users on live chat may reduce the moderator’s time in identifying and preventing spammers from disturbing other users. In this paper, we analyzed YouTube live streaming comments in order to understand spammers’ behavior. Seven user behavior features and message characteristic features were comprehensively analyzed. According to our findings, the features that performed best in terms of run time and classification efficiency are the relevant score together with the time spent in live chat and the number of messages per user. The accuracy is as high as 66.22 percent. In addition, the most suitable technique for real-time classification is a decision tree.

3 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...Cosine similarity [18-19] is the traditional method used to measure the degree of similarity between two vectors, obtained from the cosine angle multiplication....

    [...]
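The cosine measure mentioned in the excerpt above can be sketched as follows. This is a minimal illustration, assuming each text is represented as a vector of raw term counts over lowercased, whitespace-separated tokens; the tokenization and the function name `cosine_similarity` are assumptions, not details taken from the citing paper.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-count vectors of two texts."""
    # Assumed tokenization: lowercase, split on whitespace.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat as no similarity
    return dot / (norm_a * norm_b)
```

With this representation the score depends only on which terms occur and how often, not on their order, so "the cat sat" and "sat the cat" are scored as identical.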

Proceedings Article
01 Dec 2017
TL;DR: A quantitative, data-driven machine learning approach to mitigate the problem of unpredictability of Computer Science Graduate School Admissions and the possibility of a system which may help prospective applicants evaluate their Statement of Purpose based on the system output is discussed.
Abstract: We present a quantitative, data-driven machine learning approach to mitigate the problem of unpredictability of Computer Science Graduate School Admissions. In this paper, we discuss the possibility of a system which may help prospective applicants evaluate their Statement of Purpose (SOP) based on our system output. We, then, identify feature sets which can be used to train a predictive model. We train a model over fifty manually verified SOPs for which it uses an SVM classifier and achieves the highest accuracy of 92% with 10-fold cross-validation. We also perform experiments to establish that Word Embedding based features and Document Similarity-based features outperform other identified feature combinations. We plan to deploy our application as a web service and release it as a FOSS service.

3 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Text Similarity and related measures (Choi et al., 2010; Adomavicius and Tuzhilin, 2005; Gomaa and Fahmy, 2013) have been extensively studied and used for various NLP applications viz....

    [...]

Journal Article
TL;DR: In this paper, a Support Vector Machine (SVM) with contextualized word embeddings from Bidirectional Encoder Representations from Transformers (BERT) is proposed to represent protein sequences.

3 citations

References
Journal Article
01 Sep 2000-Language
TL;DR: Presents the WordNet lexical database (nouns, modifiers and a semantic network of English verbs), the automated discovery and formalization of WordNet relations, and applications of WordNet such as building semantic concordances.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X, using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan). Appendix: obtaining and using WordNet.

13,049 citations

Journal Article
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal Article
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

    [...]
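The definition of Dice's coefficient quoted above translates directly into a few lines of code. The sketch below assumes that "terms" are lowercased, whitespace-separated tokens and that each text is treated as a set of distinct terms; both choices, and the function name `dice_coefficient`, are illustrative assumptions.

```python
def dice_coefficient(text_a: str, text_b: str) -> float:
    """Twice the number of common terms divided by the total number of terms
    in both texts, with each text treated as a set of distinct terms."""
    # Assumed tokenization: lowercase, split on whitespace.
    terms_a = set(text_a.lower().split())
    terms_b = set(text_b.lower().split())
    total = len(terms_a) + len(terms_b)
    if total == 0:
        return 1.0  # two empty texts are treated as identical
    return 2 * len(terms_a & terms_b) / total
```

For "the cat sat" and "the cat ran" there are two common terms and six terms in total, giving 2 * 2 / 6.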

Journal Article
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]
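The local-alignment idea described above, finding the most similar pair of segments inside two otherwise dissimilar sequences, can be sketched with a small dynamic program. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the linear gap penalty and the function name `smith_waterman_score` are assumptions chosen for illustration, not parameters from the referenced paper.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between any segment of a and any segment of b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score of an alignment ending at a[i-1] and b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Clamping at zero lets an alignment restart anywhere, which is
            # what makes the alignment local rather than global.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best
```

Because scores never drop below zero, unrelated flanking text does not penalize a shared region: two strings that share only the segment "HELLO" score the same as "HELLO" aligned with itself.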

Journal Article
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]
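As a hedged illustration of the LSA idea in the excerpts above, the usual formulation can be written as a truncated singular value decomposition. The notation is introduced here and is not taken from this page: $X$ is assumed to be a term-by-context count matrix, $k$ is the number of retained dimensions (the LSA abstract above mentions values around 300), and $u_i$ is the reduced vector for term $t_i$.

```latex
% Rank-k approximation of the term-by-context matrix X
X \approx X_k = U_k \Sigma_k V_k^{\top}

% Reduced vector for term t_i: the i-th row of U_k \Sigma_k
u_i = \left( U_k \Sigma_k \right)_{i,:}

% Similarity of two terms as the cosine of their reduced vectors
\operatorname{sim}(t_i, t_j) = \frac{u_i \cdot u_j}{\lVert u_i \rVert \, \lVert u_j \rVert}
```

Under this sketch, two words that rarely co-occur directly can still receive a high similarity if they appear in similar contexts, which matches the assumption stated in the excerpts that words close in meaning occur in similar pieces of text.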