A Survey of Text Similarity Approaches
Wael Hassan Gomaa, Aly A. Fahmy
TLDR
This survey discusses existing work on text similarity by partitioning it into three approaches: String-Based, Corpus-Based and Knowledge-Based similarities. Samples of combinations of these similarities are also presented.
Abstract:
Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses existing work on text similarity by partitioning it into three approaches: String-Based, Corpus-Based and Knowledge-Based similarities. Furthermore, samples of combinations of these similarities are presented.
General Terms
Text Mining, Natural Language Processing.
Keywords
Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.
1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications, in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison.
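As an illustration of a string metric operating on character composition, the following is a minimal Python sketch of Levenshtein (edit) distance, a well-known character-based measure chosen here only as an example. The function names `levenshtein` and `string_similarity`, and the normalization by the longer string's length, are assumptions made for this sketch and are not taken from the survey.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Map the distance to a similarity in [0, 1] (illustrative normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

For instance, `levenshtein("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion), so a lexically close pair yields a small distance and a similarity near 1.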
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms, partitioned into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms, respectively. Samples of combinations of similarity algorithms are introduced in section five, and section six presents the conclusion of the survey.
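To make the term-based side of the String-Based category concrete, here is a small Python sketch that compares texts by their tokens rather than their characters. Jaccard and cosine similarity are used as familiar examples; the whitespace tokenizer and the function names are assumptions made for illustration, not details from the survey.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def jaccard_similarity(a: str, b: str) -> float:
    """Shared distinct terms divided by all distinct terms."""
    terms_a, terms_b = set(tokenize(a)), set(tokenize(b))
    if not terms_a and not terms_b:
        return 1.0  # two empty texts are treated as identical
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no terms with anything
    return dot / (norm_a * norm_b)
```

With the inputs `"the cat sat"` and `"the cat ran"`, two of the four distinct terms are shared, so the Jaccard similarity is 0.5. Note that neither function captures semantic similarity: `"car"` and `"automobile"` score 0, which is the gap Corpus-Based and Knowledge-Based measures are meant to address.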
Citations
Journal ArticleDOI
Similarity measures in automated essay scoring systems: A ten-year review
Construction of UMLS Metathesaurus with Knowledge-Infused Deep Learning.
TL;DR: A knowledge-infused learning approach provides good performance, indicating promising potential for emulating the current building process of constructing the Metathesaurus and investigation of the applicability, maintenance, and scalability of these models.
Proceedings ArticleDOI
Code Comment Assessment Development for Basic Programming Subject using Online Judge
TL;DR: This research aims to implement a code comment assessment module for an online judge so that it can check the comments in the source code uploaded by students.
Journal ArticleDOI
Improving text relatedness by incorporating phrase relatedness with word relatedness
TL;DR: This work adopts 2 existing word relatedness measures, based on Google n-gram and Global Vectors for Word Representation, and incorporates them differently with an existing Google n-gram-based phrase relatedness method to compute text relatedness.
Journal ArticleDOI
Packet Length Covert Channels Crashed
TL;DR: This paper investigates one type of packet length covert channel, which exploits the variation of network packets' lengths to convey secret messages. This type of covert channel does not need any rules to be distributed prior to the initiation of a covert session, as other packet length covert channels do.
References
Journal ArticleDOI
WordNet: An Electronic Lexical Database
TL;DR: Topics presented include the lexical database (nouns in WordNet), a semantic network of English verbs, and applications of WordNet such as building semantic concordances.
Journal ArticleDOI
A general method applicable to the search for similarities in the amino acid sequence of two proteins
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.
Journal ArticleDOI
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
Journal ArticleDOI
A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.