Open Access Journal Article DOI

A Survey of Text Similarity Approaches

Wael Hassan Gomaa, +1 more
18 Apr 2013
Vol. 68, Iss. 13, pp. 13-18
TLDR
This survey discusses existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarity, and presents samples of combinations between these similarities.
Abstract
Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarity. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they have the same meaning, are opposite of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms, partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
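The survey's String-Based section distinguishes character-based from term-based measures. The minimal Python sketch below illustrates one of each: Levenshtein edit distance over characters and Jaccard overlap over token sets. It is an illustration written for this page, not code from the survey, and the sample strings are made up.

```python
# Illustrative sketch (not from the survey): one character-based and one
# term-based string similarity measure, the two families of String-Based
# algorithms the survey describes.

def levenshtein(a: str, b: str) -> int:
    """Character-based: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Term-based: overlap of the two token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3 edits
    print(jaccard("text similarity survey", "a survey of text similarity"))
```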



Citations
Patent

Automated efficient translation context delivery

TL;DR: In this patent, source strings are compared against a dictionary of reference strings in the source language and the selected reference strings are presented; each source string has one or more similar or related strings displayable in association with it.
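A minimal sketch of the lookup idea this patent summary describes: rank reference strings from a made-up translation-memory dictionary by string similarity to a source string. The use of Python's difflib here is an assumption for illustration, not the patented method.

```python
# Toy lookup: find reference strings similar to a source string.
# The dictionary contents and cutoff are invented for illustration.
import difflib

reference_strings = ["Save changes", "Save as...", "Discard changes", "Open file"]

def related_references(source: str, n: int = 3, cutoff: float = 0.5):
    # difflib's ratio is one of many string metrics that could be used here
    return difflib.get_close_matches(source, reference_strings, n=n, cutoff=cutoff)

print(related_references("Save change"))  # e.g. ['Save changes', ...]
```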
Book Chapter DOI

Predicting retweet class using deep learning

Abstract: The growing base of users who interact and report their individual opinions on social media platforms is generating a large volume of nonlinear text data. This presents an enormous number of possibilities for experts to experiment, build a framework, and draw insights about user behavior. If a piece of information is factual, essential, and helpful, then the right set of words and social networks should be utilized to cascade the information to all. Through the current research, we propose a novel deep learning framework to predict information popularity on Twitter, measured through the retweet feature of the tool and algorithmically created features. We perform this research with the hypothesis that retweeting behavior can be an outcome of a writer's practice of semantics and grasp of the language. When we read any sentence, the understanding of the sentence does not start from scratch, but instead builds upon knowledge accumulated while reading and interpreting the phrases used in the text. This rule of semantics can be used to create word features. The extracted features can be utilized to train a deep learning model such as long short-term memory, which has firmly proven its importance in learning hidden trends in data. The long short-term memory framework has the capability of storing previous learnings and using them when needed. As an outcome of the proposed experimentation, we use expletive words along with word-embedding features to present a generalizable deep neural network framework that classifies tweets with a high potential for being retweeted and tweets with a low possibility of being retweeted.
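A minimal sketch, not the authors' code, of the kind of model this abstract describes: word-embedding features fed to a long short-term memory network that outputs a retweet probability. The PyTorch layers, vocabulary size, dimensions and toy batch are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class RetweetClassifier(nn.Module):
    """Toy LSTM classifier: token ids -> embeddings -> LSTM -> retweet probability."""
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                # (batch, seq_len)
        emb = self.embed(token_ids)              # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)             # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h_n[-1]))  # probability of being retweeted

model = RetweetClassifier()
dummy_batch = torch.randint(1, 10_000, (4, 20))  # 4 made-up tweets, 20 tokens each
print(model(dummy_batch).shape)                  # torch.Size([4, 1])
```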
Journal Article DOI

A Hybrid Model for Paraphrase Detection Combines pros of Text Similarity with Deep Learning

TL;DR: This paper proposes a hybrid model that combines the text similarity approach with the deep learning approach in order to improve paraphrase detection, with results verified on the Microsoft Research Paraphrase Corpus dataset.
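A toy sketch of the hybrid idea in the TL;DR above: hand-crafted text-similarity scores for a sentence pair are fed to a small neural classifier. The features, classifier choice and tiny training set are made-up stand-ins, not the paper's model.

```python
# Toy hybrid: similarity features for a sentence pair + a small neural classifier.
from sklearn.neural_network import MLPClassifier

def pair_features(s1: str, s2: str):
    a, b = set(s1.lower().split()), set(s2.lower().split())
    jaccard = len(a & b) / len(a | b)
    len_ratio = min(len(a), len(b)) / max(len(a), len(b))
    return [jaccard, len_ratio]

# Tiny invented training set: 1 = paraphrase, 0 = not a paraphrase.
pairs = [("he bought a car", "he purchased a car", 1),
         ("he bought a car", "she plays the piano", 0)]
X = [pair_features(s1, s2) for s1, s2, _ in pairs]
y = [label for _, _, label in pairs]

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000).fit(X, y)
print(clf.predict([pair_features("he got a new car", "he bought a car")]))
```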
Journal Article DOI

A Hamming distance and fuzzy logic-based algorithm for P2P content distribution in enterprise networks

TL;DR: A peer selection algorithm in BitTorrent to distribute content in enterprise networks is proposed, with a new special role called the Internal Swarm Coordinator (ISC), which allows peers in an Internal Swarm to cooperate in a coordinated manner.
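For reference, a small sketch of the Hamming distance the algorithm above builds on: the number of positions at which two equal-length identifiers differ. The peer IDs and selection rule are invented; the paper's full fuzzy-logic algorithm is not reproduced here.

```python
# Hamming distance between equal-length identifiers, used to pick the closest peer.
def hamming(a: str, b: str) -> int:
    assert len(a) == len(b), "Hamming distance needs equal-length strings"
    return sum(x != y for x, y in zip(a, b))

peers = {"peer-a": "10110100", "peer-b": "10010101", "peer-c": "01101011"}  # made up
target = "10110101"
closest = min(peers, key=lambda p: hamming(peers[p], target))
print(closest, hamming(peers[closest], target))
```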
Proceedings Article DOI

Question Similarity Modeling with Bidirectional Long Short-Term Memory Neural Network

TL;DR: This work proposes building a question similarity model with Bidirectional Long Short-Term Memory (BLSTM) neural networks, which can also be used in other fields such as sentence similarity computation, paraphrase detection, question answering and so on.
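A minimal sketch, not the authors' model, of the idea in this TL;DR: encode each question with a bidirectional LSTM and score similarity with cosine similarity between the encodings. The layer sizes and random token ids are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLSTMEncoder(nn.Module):
    """Toy question encoder: embeddings -> bidirectional LSTM -> mean-pooled vector."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.blstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                             bidirectional=True)

    def forward(self, token_ids):               # (batch, seq_len)
        out, _ = self.blstm(self.embed(token_ids))
        return out.mean(dim=1)                  # (batch, 2 * hidden_dim)

encoder = BLSTMEncoder()
q1 = torch.randint(0, 5000, (1, 12))  # two made-up tokenized questions
q2 = torch.randint(0, 5000, (1, 12))
score = F.cosine_similarity(encoder(q1), encoder(q2))
print(score.item())
```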
References
Journal Article DOI

WordNet: an electronic lexical database

Christiane Fellbaum
01 Sep 2000
TL;DR: The lexical database (nouns in WordNet), a semantic network of English verbs, and applications of WordNet such as building semantic concordances are presented.
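A short sketch of a knowledge-based similarity measure over WordNet, the resource described above, using NLTK's WordNet interface (this assumes nltk is installed and the 'wordnet' corpus data has been downloaded).

```python
# Knowledge-based word similarity via WordNet path similarity (NLTK assumed installed).
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
car = wn.synset("car.n.01")

print(dog.path_similarity(cat))  # higher: dog and cat are close in the noun taxonomy
print(dog.path_similarity(car))  # lower: dog and car are far apart
```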
Journal Article DOI

A general method applicable to the search for similarities in the amino acid sequence of two proteins

TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.
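A compact sketch of Needleman-Wunsch global alignment scoring, the method this reference introduced, with simple match/mismatch/gap values standing in for the original amino-acid scoring scheme.

```python
# Global sequence alignment score (Needleman-Wunsch), simplified scoring.
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap                    # gaps along the first column
    for j in range(1, cols):
        score[0][j] = j * gap                    # gaps along the first row
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    return score[-1][-1]                         # best global alignment score

print(needleman_wunsch("GATTACA", "GCATGCU"))
```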
Journal Article DOI

Identification of common molecular subsequences.

TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
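The corresponding local-alignment sketch (Smith-Waterman): cells are filled from their neighbors as in the global version above, but scores never drop below zero and the best-scoring pair of segments can end anywhere in the matrix. Scoring values are again illustrative.

```python
# Local sequence alignment score (Smith-Waterman), simplified scoring.
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(0, diag, score[i-1][j] + gap, score[i][j-1] + gap)
            best = max(best, score[i][j])
    return best                                  # score of the most similar segments

print(smith_waterman("GGTTGACTA", "TGTTACGG"))
```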
Journal Article DOI

A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.

TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
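A minimal LSA sketch in the spirit of the paper above: build a term-document matrix for a tiny made-up corpus, reduce it with truncated SVD, and compare documents in the resulting latent space. The scikit-learn components and the toy corpus are illustrative assumptions.

```python
# Latent semantic analysis in miniature: TF-IDF term-document matrix + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the doctor examined the patient",
        "the physician treated the patient in hospital",
        "the striker scored a goal",
        "the team won the football match"]

tfidf = TfidfVectorizer().fit_transform(docs)               # term-document matrix
latent = TruncatedSVD(n_components=2).fit_transform(tfidf)  # 2-dimensional LSA space
print(cosine_similarity(latent[:1], latent[1:]))            # doc 0 vs the other docs
```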