Open Access Journal Article

A Survey of Text Similarity Approaches

Wael Hassan Gomaa, +1 more
18 Apr 2013, Vol. 68, Iss. 13, pp. 13-18
TLDR
This survey reviews existing work on text similarity by partitioning it into three approaches: String-Based, Corpus-Based and Knowledge-Based similarity. Sample combinations of these approaches are also presented.
Abstract
Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-Based, Corpus-Based and Knowledge-Based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms
Text Mining, Natural Language Processing.

Keywords
Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications, in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities.

Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms.

String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison.
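To make the notion of a string metric concrete, the following is a minimal sketch of one well-known character-based measure, the Levenshtein (edit) distance. The introduction does not name a specific metric at this point, so the choice of metric is an assumption for illustration. The function names `levenshtein` and `lexical_similarity` are hypothetical, and the normalization into a [0, 1] similarity score is just one common convention.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Illustrative normalization (an assumption, not taken from the survey):
    1.0 means identical strings, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion), so the pair is lexically close even though the strings are not equal.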
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly.

This paper is organized as follows: Section two presents String-Based algorithms, partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
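The term-based type mentioned in the outline above compares texts as collections of terms instead of character sequences. As a hedged illustration, the sketch below computes cosine similarity over raw term counts. Cosine similarity is a standard term-based measure, but the tokenizer, the absence of any term weighting, and the function names `term_counts` and `cosine_similarity` are assumptions made for this example.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens
    (a deliberately simple tokenizer, assumed for illustration)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the term-count vectors of a and b."""
    va, vb = term_counts(a), term_counts(b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no terms with anything
    return dot / (norm_a * norm_b)
```

Texts with no shared terms score 0.0 under this measure regardless of their meaning, which is the limitation that motivates the semantic (Corpus-Based and Knowledge-Based) measures.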

