Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations of these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations of these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison.
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type will be presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations of similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
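As a minimal illustration of a character-based String-Based measure of the kind the survey covers, the sketch below computes the classic Levenshtein edit distance in Python (the function name and example strings are my own, not taken from the survey):

```python
# A minimal Levenshtein edit-distance implementation: the number of
# single-character insertions, deletions and substitutions needed to
# turn one string into the other.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))     # → 3
```

The two-row dynamic-programming layout keeps memory linear in the length of the shorter string, which matters when comparing long documents character by character.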


Citations
Proceedings ArticleDOI
08 May 2020
TL;DR: This paper designs a collaborative cyber threat perception model, DI-MDPs, based on decentralized coordination with initiative information interaction among agents as its core idea, and contributes a reinforcement learning algorithm, HTI, that takes advantage of the particular structure of DI-MDPs.
Abstract: The essence of network security is the asymmetric online confrontation with partially observable cyber threats, which requires a defense capability against unexpected security incidents. Existing network intrusion detection systems mostly have a static, centralized structure and usually face problems such as high pressure on the central processing node, low fault tolerance, low damage resistance and high construction cost. In this paper, exploiting the advantage of collaborative decision-making in decentralized multiagent coordination, we design a collaborative cyber threat perception model, DI-MDPs, which is based on decentralized coordination; its core idea is initiative information interaction among agents. We then analyze the relevance and transformation conditions of the proposed model and contribute a reinforcement learning algorithm, HTI, that takes advantage of the particular structure of DI-MDPs, in which an agent updates policies by learning both from its local cognition and from the additional information obtained through interaction. Finally, we compare and verify the performance of the designed algorithm under typical scenario settings.

3 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...According to the Dice coefficient theory [15], reldes denotes the semantic correlation degree from cli to Gi(j): reldes = 1...


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This work proposes a novel method for lexical similarity measurement of Chinese sentences by means of Chinese sentence segmentation and weighted word matching, and shows that the method achieves satisfactory performance in various cases.
Abstract: Artificial intelligence chatbots are computer programs that interact with humans via auditory or textual information using natural language processing techniques, most of which work on the basis of pattern matching. Typically, a chatbot recognizes the audio from a human, translates it to text, and then matches the text against sentences stored in advance in a database by measuring similarity. So far, English chatbots have done much better than Chinese ones, mainly because processing sentences in Chinese requires more sophisticated techniques than in English. Considering the low efficiency and accuracy of existing Chinese sentence similarity measures, we propose a novel method for lexical similarity measurement of Chinese sentences by means of Chinese sentence segmentation and weighted word matching. Based on this method, a Chinese AI chatbot is developed and tested under a variety of settings. Experimental results show that our proposed method achieves satisfactory performance in various cases.

3 citations


Cites background or methods from "A Survey of Text Similarity Approac..."

  • ...Once the question asked by the user is matched to a wrong key in the database, the chatbot tends to return an irrelevant answer, degrading user experience....


  • ...Corpus-Based similarity measures determine the similarity between words according to information obtained from large corpora [2]....


  • ...Currently, most semantic similarity measures rely on Corpus-Based and Knowledge-Based algorithms....


Journal ArticleDOI
TL;DR: The authors applied simple programming to calculate the lexical overlap between translations and used the results in a preliminary discussion of the possible influence of earlier translations on later ones, comparing the results with conclusions arrived at in previous research on the interrelationship between translations.
Abstract: The present article is primarily intended as a methodological contribution to Islamic Studies providing an example of how the power of computer-aided text analysis can benefit the field. The data consists of a set of 51 lexicons created out of translations of the Qur’an into English. The analysis applies simple programming to calculate the lexical overlap between translations and uses the results in a preliminary discussion of possible influence of earlier on later translations. The results are compared with conclusions arrived at in previous research on the interrelationship between translations and are also used to identify and suggest new areas for in-depth studies.

3 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...Other methods could have been used as well (for presentations and discussions of a variety of such methods, see e.g. Gomaa and Fahmy 2013; Vijaymeena and Kavitha 2016), but the Jaccard similarity is easy to comprehend and include in code, even for a non-specialist, and hence chimes well with…...

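The Jaccard similarity this excerpt refers to is indeed short enough to include in code. A Python sketch follows; the whitespace tokenization and lowercasing are illustrative assumptions, not a description of the cited study's exact pipeline:

```python
# Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
# Whitespace tokenization and lowercasing are illustrative choices.
def jaccard(text_a: str, text_b: str) -> float:
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# 4 shared tokens out of 6 distinct tokens overall → 4/6
print(jaccard("in the name of god", "in the name of allah"))
```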

Posted Content
TL;DR: This work proposes affinity, a system that assesses the similarity between text messaging histories of users reliably and efficiently in a privacy-preserving manner and reaches an average 85.0% accuracy on a political party classification task.
Abstract: In the field of social networking services, finding similar users based on profile data is common practice. Smartphones harbor sensor and personal context data that can be used for user profiling. Yet one vast source of personal data, namely text messaging data, has hardly been studied for user profiling. We see three reasons for this: First, private text messaging data is not shared due to its intimate character. Second, the definition of an appropriate privacy-preserving similarity measure is non-trivial. Third, assessing the quality of a similarity measure on text messaging data representing a potentially infinite set of topics is non-trivial. In order to overcome these obstacles we propose affinity, a system that assesses the similarity between the text messaging histories of users reliably and efficiently in a privacy-preserving manner. Private texting data stays on user devices, and data for comparison is compared in a latent format that allows reconstruction of neither the comparison words nor any original private plain text. We evaluate our approach by calculating similarities between the Twitter histories of 60 US senators. The resulting similarity network reaches an average 85.0% accuracy on a political party classification task.

3 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...See for instance [22] for a survey on traditional approaches....


Book ChapterDOI
03 Jul 2017
TL;DR: This paper explores similarity-based models for a QA system to rank search result candidates, using the Damerau-Levenshtein distance and a cosine similarity model to obtain ranking scores between the question posted by the registered user and similar candidate questions in the repository.
Abstract: The rapid growth of the World Wide Web has extended Information Retrieval technology such that queries for information needs have become more easily accessible. One such platform is online question answering (QA). Online communities can post questions and get direct responses for their special information needs using various platforms. This creates large, unorganized repositories of valuable knowledge resources. Effective QA retrieval is required to make these repositories accessible and fulfill users' information requests quickly. The repositories might contain questions and answers similar to a user's newly asked question. This paper explores similarity-based models for the QA system to rank search result candidates. We used the Damerau-Levenshtein distance and a cosine similarity model to obtain ranking scores between the question posted by the registered user and similar candidate questions in the repository. Empirical experimental results indicate that our proposed ensemble models are very encouraging and give a significantly better similarity value to improve search ranking results.
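The cosine similarity component of such a ranking model can be sketched with simple term-frequency vectors. This is an illustrative assumption, not the paper's exact weighting scheme:

```python
import math
from collections import Counter

# Cosine similarity between term-frequency vectors of two texts.
# Tokenization and raw-count weighting are illustrative choices.
def cosine(text_a: str, text_b: str) -> float:
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# two of three terms shared with equal counts → 2/3
print(cosine("the cat sat", "the cat ran"))
```

An ensemble like the one described would combine this score with an edit-distance score (e.g. Damerau-Levenshtein) before ranking candidates.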

3 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...The widely used approaches to identifying similar text through lexical and semantic means are based on string, corpus and knowledge [10]....


References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: Presents the WordNet lexical database (nouns, modifiers, and a semantic network of English verbs) together with applications of WordNet such as building semantic concordances.
Abstract: Part 1 The lexical database: nouns in WordNet, George A. Miller modifiers in WordNet, Katherine J. Miller a semantic network of English verbs, Christiane Fellbaum design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst representing verb alterations in WordNet, Karen T. Kohl et al the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3 Applications of WordNet: building semantic concordances, Shari Landes et al performance and confidence in a semantic annotation task, Christiane Fellbaum et al WordNet and class-based probabilities, Philip Resnik combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow using WordNet for text retrieval, Ellen M. Voorhees lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I Moldovan appendix - obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

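The definition of Dice's coefficient quoted above translates directly into code. In this sketch, whitespace-separated words stand in for "terms", an illustrative choice; character n-grams are another common option:

```python
# Dice's coefficient as quoted above: twice the number of common
# terms divided by the total number of terms in both strings.
def dice(text_a: str, text_b: str) -> float:
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total

# 2 common terms, 3 + 3 terms in total → 4/6
print(dice("a b c", "b c d"))
```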

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....


Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
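The core LSA step, a truncated SVD of a term-document count matrix, can be shown on a toy example with NumPy. The miniature matrix and the choice k=2 are invented for illustration; real LSA uses large corpora and a few hundred dimensions, as the abstract notes:

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Documents 0-1 use "car"/"auto" vocabulary, documents 2-3 "ship"/"boat".
X = np.array([
    [2, 1, 0, 0],   # car
    [1, 2, 0, 0],   # auto
    [0, 0, 2, 1],   # ship
    [0, 0, 1, 2],   # boat
    [1, 1, 1, 1],   # travel
], dtype=float)

# LSA: a truncated SVD keeps only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T      # documents in latent space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents sharing vocabulary end up closer in the latent space.
print(cos(doc_vecs[0], doc_vecs[1]) > cos(doc_vecs[0], doc_vecs[2]))  # → True
```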

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....
