Journal Article

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses existing work on text similarity by partitioning it into three approaches: String-Based, Corpus-Based and Knowledge-Based similarity. Sample combinations of these approaches are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses existing work on text similarity by partitioning it into three approaches: String-Based, Corpus-Based and Knowledge-Based similarity. Furthermore, sample combinations of these approaches are presented.
General Terms: Text Mining, Natural Language Processing.
Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.
1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications, in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarity. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly.
This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations of similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
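To make the notion of a character-based string metric concrete, the following is a minimal sketch of one such measure. It uses Levenshtein edit distance, which is chosen here purely for illustration (this excerpt of the survey does not name a specific metric). The function names and the normalization into a similarity score are assumptions, not the survey's definitions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # convention chosen here: two empty strings are identical
    return 1 - levenshtein(a, b) / longest
```

A measure of this kind captures lexical similarity only: two synonyms with unrelated spellings would score low, which is why the survey treats semantic similarity separately.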


Citations
Posted Content
TL;DR: In this paper, the authors present a textual German corpus for text similarity detection, which is used to automatically assess the similarity between a pair of texts and to evaluate different similarity measures, both for whole documents and for individual sentences.
Abstract: Text similarity detection aims at measuring the degree of similarity between a pair of texts. Corpora available for text similarity detection are designed to evaluate the algorithms to assess the paraphrase level among documents. In this paper we present a textual German corpus for similarity detection. The purpose of this corpus is to automatically assess the similarity between a pair of texts and to evaluate different similarity measures, both for whole documents and for individual sentences. Therefore we have calculated several simple measures on our corpus based on a library of similarity functions.

5 citations

Journal Article
Deng Yuan
TL;DR: This paper puts forward the integrated method and key technology of innovation knowledge management based on user-generated content, which should be able to provide a systematic solution for interactive innovation knowledge management.
Abstract: The number of companies using social media to interact with users has been increasing. By analyzing the content generated by users online and mining the information contained therein, companies have applied them to product management and innovation, improving the competitiveness of enterprises. In the face of growing user-generated content, how to achieve effective information mining and transform into product knowledge has posed a challenge to the enterprise. In the background of smart phone products, this paper puts forward the integrated method and key technology of innovation knowledge management based on user-generated content, which should be able to provide a systematic solution for the interactive innovation knowledge management.

5 citations

Journal Article
TL;DR: In this paper, the authors propose a novel adaptive meta-heuristic for music plagiarism detection, which combines a text similarity-based method and a clustering-based method into an improved hybrid method.
Abstract: Plagiarism is a controversial and debated topic in different fields, especially in music, where the commercial market generates a huge amount of money. The lack of objective metrics to decide whether a song is a plagiarism makes music plagiarism detection a very complex task: often decisions have to be based on subjective argumentation. Automated music analysis methods that identify music similarities can be of help. In this work, we first propose two novel such methods: a text similarity-based method and a clustering-based method. Then, we show how to combine them to get an improved (hybrid) method. The result is a novel adaptive meta-heuristic for music plagiarism detection. To assess the effectiveness of the proposed methods, considered both individually and in the combined meta-heuristic, we performed tests on a large dataset of ascertained plagiarism and non-plagiarism cases. Results show that the meta-heuristic outperforms existing methods. Finally, we deployed the meta-heuristic into a tool, accessible as a Web application, and assessed the effectiveness, usefulness, and overall user acceptance of the tool by means of a study involving 20 people, divided into two groups, one of which had access to the tool. The study consisted in having people decide which pairs of songs, in a predefined set of pairs, should be considered plagiarisms and which not. The study shows that the group supported by our tool successfully identified all plagiarism cases, performing all tasks with no errors. The whole sample agreed about the usefulness of an automatic tool that provides a measure of similarity between two songs.

5 citations

Journal Article
TL;DR: In this article, a context-aware semantic communication architecture is proposed to achieve more efficient information interaction with less traffic, a promising communication mechanism for future intelligent devices, and six metrics are introduced to evaluate the performance of semantic communication from the aspects of effectiveness and reliability.
Abstract: Semantic communication focuses on the accurate transmission of meanings rather than data or symbols. It can achieve more efficient information interaction with less traffic, which is a promising communication mechanism for future intelligent devices. In this paper, we summarize existing research closely related to semantic communication. And then we propose a context-aware semantic communication architecture, explaining the functions of each module within it. We also introduce six metrics to evaluate the performance of semantic communication from the aspects of effectiveness and reliability. Finally, we present an example to illustrate the realization of semantic communication, which is proven effective in reducing data traffic compared to traditional communication mechanisms.

5 citations

Dissertation
01 Dec 2018
TL;DR: Research to develop a Novel Arabic Conversational Intelligent Tutoring System (CITS), called LANA, for children with ASD, which delivers topics related to the science subject by engaging with the user in Arabic language is described.
Abstract: Children with Autism Spectrum Disorder (ASD) are affected in different degrees in terms of their level of intellectual ability. Some people with Asperger syndrome or high functioning autism are very intelligent academically but they still have difficulties in social and communication skills. In recent years, many of these pupils are taught within mainstream schools. However, the process of facilitating their learning and participation remains a complex and poorly understood area of education. Although many teachers in mainstream schools are firmly committed to the principles of inclusive education, they do not feel that they have the necessary training and support to provide adequately for pupils with ASD. One solution for this problem is to use a virtual tutor to supplement the education of pupils with ASD in mainstream schools. This thesis describes research to develop a Novel Arabic Conversational Intelligent Tutoring System (CITS), called LANA, for children with ASD, which delivers topics related to the science subject by engaging with the user in Arabic language. The Visual, Auditory, and Kinaesthetic (VAK) learning style model is used in LANA to adapt to the children’s learning style by personalising the tutoring session. Development of an Arabic Conversational Agent has many challenges. Part of the challenge in building such a system is the requirement to deal with the grammatical features and the morphological nature of the Arabic language. The proposed novel architecture for LANA uses both pattern matching (PM) and a new Arabic short text similarity (STS) measure to extract facts from user’s responses to match rules in scripted conversation in a particular domain (Science). In this research, two prototypes of an Arabic CITS were developed (LANA-I) and (LANA-II). LANA-I was developed and evaluated with 24 neurotypical children to evaluate the effectiveness and robustness of the system engine. 
LANA-II was developed to enhance LANA-I by addressing spelling mistakes and word variations with prefixes and suffixes. Also in LANA-II, the TEACCH method was added to the user interface to adapt the tutorial environment to autistic students' learning, and the knowledge base was expanded by adding a new tutorial. An evaluation methodology and experiment were designed to evaluate the enhanced components of the LANA-II architecture. The results illustrated a statistically significant impact on the effectiveness of the LANA-II engine when compared to LANA-I. In addition, the results indicated a statistically significant improvement in the autistic students' learning gain when adapting to their learning styles, indicating that LANA-II can be adapted to autistic children's learning styles and enhance their learning.

5 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...It is measuring the distance between two strings to identify the similarity between them (Gomaa and Fahmy, 2013)....

    [...]

References
Journal Article
01 Sep 2000-Language
TL;DR: The WordNet lexical database (nouns, modifiers, and a semantic network of English verbs) and applications of WordNet, such as building semantic concordances, are presented.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi).
Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alterations in WordNet (Karen T. Kohl et al); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss).
Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al); performance and confidence in a semantic annotation task (Christiane Fellbaum et al); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X, using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I Moldovan).
Appendix: obtaining and using WordNet.

13,049 citations

Journal Article
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal Article
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

    [...]
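The definition of Dice's coefficient quoted above can be written as 2|X ∩ Y| / (|X| + |Y|), where X and Y are the term sets of the two strings. The sketch below is one possible reading of it. It assumes that "terms" are unique, lowercased, whitespace-separated tokens, and the function name and the handling of two empty inputs are choices made here, not taken from the survey.

```python
def dice_coefficient(a: str, b: str) -> float:
    """Twice the number of common terms divided by the total
    number of terms in both strings (terms treated as sets)."""
    terms_a = set(a.lower().split())
    terms_b = set(b.lower().split())
    total = len(terms_a) + len(terms_b)
    if total == 0:
        return 1.0  # assumed convention: two empty strings are identical
    common = terms_a & terms_b
    return 2 * len(common) / total
```

For "the cat sat" and "the cat ran" there are two common terms out of three in each string, so the coefficient is 2 * 2 / (3 + 3), or about 0.67. Counting repeated terms (a multiset variant) would be an equally defensible reading and can give different values.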

Journal Article
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]
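The idea of finding the best-matching pair of segments inside two otherwise dissimilar sequences can be sketched as a local alignment score in the style of Smith and Waterman. The scoring values below (match = 2, mismatch = -1, gap = -1, linear gap penalty) and the function name are illustrative assumptions. Neither the survey nor the cited letter is the source of these particular numbers.

```python
def local_alignment_score(s: str, t: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -1) -> int:
    """Best score of any pair of segments, one from s and one from t.
    Cell values are clamped at 0, so a poor prefix never penalizes
    a good region that starts later."""
    cols = len(t) + 1
    prev = [0] * cols  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(0,                   # start a new local alignment
                          diag,                # align s[i-1] with t[j-1]
                          prev[j] + gap,       # gap in t
                          curr[j - 1] + gap)   # gap in s
            best = max(best, curr[j])
        prev = curr
    return best
```

The clamp at zero is what distinguishes this from a global alignment: with these assumed scores, "xxABCDyy" and "zzABCDww" score 8 on the shared "ABCD" region even though their flanking characters differ.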

Journal Article
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]
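The assumption quoted above, that words close in meaning occur in similar pieces of text, can be illustrated with raw term-document counts. The sketch below is not LSA itself: it omits the dimensionality-reduction step and only compares words by the cosine of their occurrence vectors. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter


def term_document_vectors(docs):
    """Map each term to its vector of counts, one entry per document."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return {w: [c[w] for c in counts] for w in vocab}


def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, invented for this illustration.
docs = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the striker scored a goal",
    "the keeper saved the goal",
]
vectors = term_document_vectors(docs)
same_context = cosine(vectors["patient"], vectors["treated"])
different_context = cosine(vectors["patient"], vectors["goal"])
```

In this raw space "doctor" and "physician" have a cosine of zero because they never appear in the same document, even though they share neighbours. Relating such words indirectly is the kind of effect the dimensionality reduction in LSA is generally credited with, and this simplified sketch does not reproduce it.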