Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013 - International Journal of Computer Applications (Foundation of Computer Science (FCS)) - Vol. 68, Iss. 13, pp. 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations of these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations of these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications, in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms, partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
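To make the String-Based family concrete, the following is a minimal Python sketch of Levenshtein edit distance, a classic character-based measure of the kind the survey covers; it is an illustrative implementation, not code from the paper.

```python
# Levenshtein edit distance: the number of single-character insertions,
# deletions and substitutions needed to turn one string into another.
def levenshtein(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute cs with ct
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Two words with a small edit distance are lexically similar even though, as the introduction notes, they need not be semantically related.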


Citations
Journal ArticleDOI
TL;DR: State-of-the-art techniques (Levenshtein distance, cosine similarity, Hamming distance, ASCII-based hashing and Rabin–Karp rolling hashing) have been investigated on source code strings, and it has been observed that Rabin–Karp hashing performs better than the other techniques in terms of running time, accuracy and types of clones detected.
Abstract: Detecting similarity between two source code bases, or inside one code base, has many applications in plagiarism detection and in identifying reused code that is a candidate for refactoring. In this paper, state-of-the-art techniques (Levenshtein distance, cosine similarity, Hamming distance, ASCII-based hashing and Rabin–Karp rolling hashing) are investigated on source code strings, extending previously published research work. Experimentation shows that Rabin–Karp hashing performs better than the other techniques in terms of running time, accuracy and the types of clones detected. All techniques face the issue that similarity-search time grows linearly with database size, whereas Rabin–Karp hashing handles this efficiently. Moreover, the Rabin–Karp rolling hash method reported the fewest false positives and can also match multiple patterns at a time.
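The paper does not reproduce its implementation here, but the core Rabin–Karp idea is easy to sketch: hash the pattern once, then roll a window hash across the text so each shift costs O(1), verifying candidate matches character by character. The base and modulus below are illustrative assumptions.

```python
# Rabin-Karp sketch: find all occurrences of `pattern` in `text`
# with a rolling polynomial hash; verify on hash collision.
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_003):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):  # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:  # roll: drop text[i], append text[i + m]
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

Because the window hash is updated rather than recomputed, searching stays linear in the text length, which is the property the paper credits for Rabin–Karp's favorable running time.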

6 citations

Proceedings ArticleDOI
06 Dec 2019
TL;DR: A hybrid text similarity model (L-THM) integrating LDA and TF-IDF is proposed; it represents text information better than either single model and obtains a good F-measure in clustering, effectively improving text similarity calculation.
Abstract: The traditional TF-IDF-based text similarity calculation model uses statistical methods to map text into a keyword vector space and converts the similarity of texts into the distance between text vectors. Such methods suffer from high dimensionality, sparse data, and an inability to exploit the semantic information contained in the text itself, so the results do not reflect the actual similarity of the texts. Text similarity models based on topic models move away from the traditional keyword vector space and can fully utilize the semantic information contained in the text itself, but they ignore the differing weights with which words contribute to a text's semantic representation; valuable information is lost in the process of converting text into the topic feature space. In view of these problems, this paper proposes a hybrid text similarity model (L-THM) integrating LDA and TF-IDF for calculating text similarity. The model uses both the semantic information contained in the text itself and the keyword information characterizing the text to comprehensively analyze and calculate the similarity between texts. The experimental results show that the hybrid model represents text information better than either single model and obtains a good F-measure in clustering, effectively improving text similarity calculation.
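The authors do not publish code with this abstract; the sketch below shows one plausible shape of such a hybrid in Python with scikit-learn, where the toy corpus and the mixing weight `alpha` are assumptions for illustration rather than the L-THM model itself.

```python
# Sketch of a TF-IDF + LDA hybrid similarity in the spirit of L-THM.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "a cat plays with a mouse",
        "stock markets fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(docs)       # keyword view
counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=2,  # semantic (topic) view
                                   random_state=0).fit_transform(counts)

alpha = 0.5  # assumed weight balancing the keyword and topic views
sim = alpha * cosine_similarity(tfidf) + (1 - alpha) * cosine_similarity(topics)
print(sim.round(2))  # pairwise document similarities
```

Interpolating the two views lets the keyword space compensate for information lost in the topic space, which is the motivation the abstract gives for combining them.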

6 citations

Journal ArticleDOI
TL;DR: A novel hybrid approach combining corpus-based and knowledge-based methods is proposed to compute semantic relatedness among words, improving the efficiency of query processing and enabling more proficient understanding of textual data.

6 citations

Proceedings ArticleDOI
28 May 2016
TL;DR: This paper presents an interactive system to measure the lexical similarity between a new incoming project and a set of completed projects existing in the repository, and thereby identify components to be reused.
Abstract: Recently, the use of text similarity has increased rapidly, and it is now involved in different areas such as document clustering, information retrieval, short answer grading, text summarization, machine learning and natural language processing. Lexical-based similarity and semantic-based similarity are the two main categories of text similarity. Reusability of software components increases productivity and quality. In this paper, we propose that there is a linkage between text similarity and software reusability. In an organization, whenever a new incoming project is received, a similarity test can be run to identify similar projects and therefore components to be reused, such as design, code and test cases, instead of building the software from scratch. We present an interactive system to measure the lexical similarity between a new incoming project and a set of completed projects existing in the repository, and thereby identify components to be reused.
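As a rough illustration of the proposed similarity test, comparing a new project description against repository projects can be as simple as term-set overlap; the Jaccard measure, toy data and code below are illustrative assumptions, not the authors' system.

```python
# Rank repository projects by lexical similarity to an incoming project
# description using Jaccard overlap of word sets (illustrative only).
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

repository = {
    "proj-billing": "web service for invoice generation and billing reports",
    "proj-chat": "real time chat application with message history",
}
incoming = "billing web service generating invoices and reports"

for name, desc in sorted(repository.items(),
                         key=lambda kv: jaccard(incoming, kv[1]), reverse=True):
    print(f"{name}: {jaccard(incoming, desc):.2f}")
```

A high-scoring project would then be inspected for reusable design, code and test cases, as the abstract suggests.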

6 citations


Cites background or methods from "A Survey of Text Similarity Approac..."

  • ...There are a lot of algorithms which are used in lexical similarity such as Longest Common SubString (LCS), Damerau–Levenshtein, Jaro, Jaro–Winkler, and others [6,8].... (An LCS sketch follows this list.)

  • ...Semantic similarity is extensively used in query answering systems to help the user to find what he or she means regardless of the sequence of characters written [6,7,8]....

  • ...Document similarity measurement is an important technique for categorizing and clustering documents [5,6,7]....
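A minimal sketch of Longest Common SubString (LCS), the first measure named in the quotes above; this dynamic-programming formulation is illustrative, not taken from the cited papers.

```python
# Longest Common SubString: the longest contiguous run of characters
# shared by two strings, found with a rolling DP row.
def longest_common_substring(s: str, t: str) -> str:
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the shared run
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return s[best_end - best_len:best_end]

print(repr(longest_common_substring("text similarity", "string similarity")))
# ' similarity'
```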

DissertationDOI
01 Jan 2017
TL;DR: A new way to train a virtual assistant with unsupervised learning is presented: AVRA, a deep learning image processing and recommender system that can collaborate with the computer user to accomplish various tasks.
Abstract: A new way to train a virtual assistant with unsupervised learning is presented in this thesis. Rather than integrating with a particular set of programs and interfaces, this new approach involves shallow integration between the virtual assistant and computer through machine vision. In effect the assistant interprets the computer screen in order to produce helpful recommendations to assist the computer user. In developing this new approach, called AVRA, the following methods are described: an unsupervised learning algorithm which enables the system to watch and learn from user behavior, a method for fast filtering of the text displayed on the computer screen, a deep learning classifier used to recognize key onscreen text in the presence of OCR translation errors, and a recommendation filtering algorithm to triage the many possible action recommendations. AVRA is compared to a similar commercial state-of-the-art system, to highlight how this work adds to the state of the art. AVRA is a deep learning image processing and recommender system that can collaborate with the computer user to accomplish various tasks. This document presents a comprehensive overview of the development and possible applications of this novel virtual assistant technology. It detects onscreen tasks based upon the context it perceives by analyzing successive computer screen images with neural networks. AVRA is a recommender system, as it assists the user by producing action recommendations regarding onscreen tasks. In order to simplify the interaction between the user and AVRA, the system was designed to only produce action recommendations that can be accepted with a single mouse click. These action recommendations are produced without integration into each individual application executing on the computer. Furthermore, the action recommendations are personalized to the user’s interests utilizing a history of the user’s interaction.

6 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Lexical similarity involves comparing sequences of characters to measure similarity, while semantic similarity involves studying a corpus to determine how similarly two words are used [69]....


References
Journal ArticleDOI
01 Sep 2000 - Language
TL;DR: This book describes the WordNet lexical database (nouns, modifiers and verbs), its design and implementation, methods for extending it, and applications of WordNet such as building semantic concordances and text retrieval.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.
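For a quick sense of how Knowledge-Based similarity over WordNet looks in practice, here is a small sketch using NLTK's WordNet interface; it assumes nltk is installed and the WordNet data can be downloaded, and the word choices are illustrative.

```python
# Knowledge-Based similarity over the WordNet semantic network via NLTK.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog", pos=wn.NOUN)[0]
cat = wn.synsets("cat", pos=wn.NOUN)[0]
car = wn.synsets("car", pos=wn.NOUN)[0]

# Path similarity: based on the shortest is-a path between concepts.
print(dog.path_similarity(cat))  # relatively high: nearby in the hierarchy
print(dog.path_similarity(car))  # lower: distant concepts
# Wu-Palmer similarity: based on the depth of the least common subsumer.
print(dog.wup_similarity(cat))
```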

13,049 citations

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.
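This entry describes global alignment of two sequences in the style of Needleman and Wunsch; the following is a minimal scoring-only sketch, with match, mismatch and gap scores chosen as illustrative assumptions.

```python
# Global-alignment score (Needleman-Wunsch style): every character of
# both sequences is aligned or gapped; returns the best total score.
def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("GATTACA", "GCATGCU"))  # 0 with these scores
```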

11,844 citations

Journal ArticleDOI
01 Jul 1945 - Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11].... (A worked sketch follows below.)
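A worked sketch of that definition, treating whitespace-separated words as the terms (an assumption; the coefficient applies equally to character n-grams):

```python
# Dice's coefficient: 2 * |common terms| / (|terms in a| + |terms in b|).
def dice(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

print(dice("night is dark", "the night is young"))  # 2*2 / (3+4) = 0.57...
```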

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8].... (A local-alignment sketch follows below.)
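The quoted property, finding a high-scoring similar region inside otherwise dissimilar sequences, is what local alignment in the style of Smith and Waterman provides: cell scores are floored at zero, so an alignment can restart anywhere. A minimal scoring sketch with assumed parameters:

```python
# Local-alignment score (Smith-Waterman style): the best score over all
# pairs of segments; the max(0, ...) lets alignments restart freely.
def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

print(local_alignment_score("xxxABCDyyy", "zzABCDzz"))  # 8: "ABCD" aligns locally
```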

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
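A toy sketch of the LSA pipeline described here: build a TF-IDF term-document matrix, reduce it with truncated SVD to a few latent dimensions, and compare documents in the reduced space. The corpus is illustrative, and n_components=2 stands in for the roughly 300 dimensions the abstract mentions.

```python
# LSA sketch: latent dimensions induced from term-document co-occurrence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the doctor treated the patient",
        "the physician cured the sick patient",
        "the orchestra played a symphony"]

X = TfidfVectorizer().fit_transform(docs)  # term-document space
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(cosine_similarity(lsa).round(2))  # the two medical docs should be closest
```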

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....