Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013 · International Journal of Computer Applications (Foundation of Computer Science (FCS)) · Vol. 68, Iss. 13, pp. 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or if one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
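To make the string-based family concrete, here is a minimal Python sketch of one classic character-based measure, Levenshtein edit distance; the function name and test strings are illustrative and not taken from the survey.

    # Levenshtein edit distance: the number of single-character insertions,
    # deletions and substitutions needed to turn s into t.
    def levenshtein(s: str, t: str) -> int:
        previous = list(range(len(t) + 1))  # distances for the empty prefix of s
        for i, cs in enumerate(s, start=1):
            current = [i]
            for j, ct in enumerate(t, start=1):
                current.append(min(
                    previous[j] + 1,               # deletion
                    current[j - 1] + 1,            # insertion
                    previous[j - 1] + (cs != ct),  # substitution
                ))
            previous = current
        return previous[-1]

    print(levenshtein("kitten", "sitting"))  # 3

A term-based measure such as cosine similarity would instead compare the two strings as vectors of term weights rather than character by character.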


Citations

Journal ArticleDOI
Justin Farrell
TL;DR: In this article, an application of network science reveals the institutional and corporate structure of the climate change counter-movement in the United States, while computational text analysis shows its influence in the news media and within political circles.
Abstract: An application of network science reveals the institutional and corporate structure of the climate change counter-movement in the United States, while computational text analysis shows its influence in the news media and within political circles.

144 citations

Proceedings ArticleDOI
26 Apr 2016
TL;DR: This research implements Term Frequency - Inverse Document Frequency (TF-IDF) weighting and Cosine Similarity to measure the degree of similarity between terms in documents, ranking documents by how closely they match an expert's document.
Abstract: Technological development in education has made the learning process, file sharing, assignment and assessment easier. Automated Essay Scoring (AES) is a system for determining a score automatically from a source text document, facilitating correction and scoring through applications that run on the computer. The AES process helps lecturers score efficiently and effectively, and it can reduce the problem of subjective scoring. However, the implementation of AES depends on many factors, such as the language and the mechanism of the scoring process, especially for essay scoring. Various methods have been proposed for weighting the terms of a document and for comparing answer documents against an expert's document. In this research, we implemented Term Frequency - Inverse Document Frequency (TF-IDF) weighting and Cosine Similarity to measure the degree of similarity between terms in a document. Tests were carried out on a number of Indonesian text documents that had gone through a pre-processing stage for data extraction. The process results in a ranking of documents by how closely they match the expert's document.
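As a rough illustration of the pipeline this abstract describes, here is a minimal Python sketch using scikit-learn; the answer strings are invented examples, not data from the paper.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    expert_answer = "photosynthesis converts light energy into chemical energy"
    student_answers = [
        "plants use light to make chemical energy via photosynthesis",
        "the mitochondria is the powerhouse of the cell",
    ]

    # Fit on all documents so the expert and student answers share one vocabulary.
    matrix = TfidfVectorizer().fit_transform([expert_answer] + student_answers)

    # Cosine similarity of each student answer against the expert's document;
    # sorting by this score yields the closeness-of-match ranking described above.
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    for answer, score in sorted(zip(student_answers, scores), key=lambda p: -p[1]):
        print(f"{score:.3f}  {answer}")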

137 citations


Cites background or methods from "A Survey of Text Similarity Approac..."

  • ...Cosine Similarity is a measure of similarity between two vectors obtained from the cosine angle multiplication value of two vectors being compared [3]....

    [...]

  • ...Some approach to determine similarity level applied such as cosine similarity [3]....

    [...]
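For reference, the cosine similarity invoked in the quotes above is the standard vector-space definition:

    \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert \,\lVert\mathbf{b}\rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}

so two documents are maximally similar (cosine 1) when their term-weight vectors point in the same direction, regardless of document length.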

Proceedings ArticleDOI
01 Aug 2014
TL;DR: This paper extracts seven types of features, including text difference measures proposed for the entailment judgement subtask as well as common text similarity measures used in both subtasks, and solves the two subtasks by treating them as a regression task and a classification task respectively.
Abstract: This paper presents our approach to the semantic relatedness and textual entailment subtasks organized as task 1 in SemEval 2014. Specifically, we address two questions: (1) Can we solve these two subtasks together? (2) Are features proposed for the textual entailment task still effective for the semantic relatedness task? To address them, we extracted seven types of features, including text difference measures proposed in the entailment judgement subtask, as well as common text similarity measures used in both subtasks. Then we exploited the same feature set to solve both subtasks by considering them as a regression and a classification task respectively, and performed a study of the influence of the different features. We achieved the first and the second rank for the relatedness and entailment tasks respectively.
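The shared-feature idea can be sketched in a few lines of Python; the feature values below are random placeholders standing in for the paper's seven feature types, and the model choices (Ridge, LogisticRegression) are illustrative assumptions, not the paper's actual learners.

    import numpy as np
    from sklearn.linear_model import Ridge, LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((100, 7))               # one column per feature type
    relatedness = rng.random(100) * 4 + 1  # graded scores in [1, 5]
    entailment = rng.integers(0, 3, 100)   # entailment / neutral / contradiction

    # The same feature matrix feeds both models: a regressor for the graded
    # relatedness subtask and a classifier for the entailment subtask.
    Ridge().fit(X, relatedness)
    LogisticRegression(max_iter=1000).fit(X, entailment)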

116 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ..., path, lch, wup, jcn (Gomaa and Fahmy, 2013)) were used to calculate the similarity between two words....

    [...]

  • ...method (Bos and Markert, 2005) where automatic reasoning tools are used to check the logical representations derived from sentences and (2) machine learning method (Zhao et al., 2013; Gomaa and Fahmy, 2013) where a supervised model is built...

    [...]

  • ...Existing work on STS can be divided into 4 categories according to the similarity measures used (Gomaa and Fahmy, 2013): (1) string-based method (Bär et al....

    [...]
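The WordNet measures named in the quotes above (path, lch, wup, jcn) are all available in NLTK; this sketch assumes the WordNet and information-content corpora have been downloaded via nltk.download('wordnet') and nltk.download('wordnet_ic').

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    print(dog.path_similarity(cat))           # shortest-path measure
    print(dog.lch_similarity(cat))            # Leacock & Chodorow
    print(dog.wup_similarity(cat))            # Wu & Palmer
    print(dog.jcn_similarity(cat, brown_ic))  # Jiang & Conrath (needs corpus IC)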

Journal ArticleDOI
TL;DR: In this paper, a generalization of one-hot encoding, similarity encoding, is proposed to build feature vectors from similarities across categories, targeting "dirty" non-curated data with high-cardinality categorical variables.
Abstract: For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in predictive performance in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinalities, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
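A minimal sketch of the idea, using Jaccard overlap of character 3-grams as one possible instantiation of the string similarity; the category values and prototypes below are invented examples.

    def ngrams(s, n=3):
        s = f" {s.lower()} "
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def ngram_similarity(a, b):
        # Jaccard overlap of character 3-grams between two category labels.
        ga, gb = ngrams(a), ngrams(b)
        return len(ga & gb) / len(ga | gb)

    prototypes = ["police officer", "police sergeant", "fire fighter"]
    dirty_value = "Police Officer II"

    # Instead of a one-hot vector, the value is encoded by its similarity to
    # each prototype category, so near-duplicate spellings get close vectors.
    print([round(ngram_similarity(dirty_value, p), 3) for p in prototypes])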

111 citations

References
Proceedings Article
06 Jan 2007
TL;DR: This work proposes Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia that results in substantial improvements in correlation of computed relatedness scores with human judgments.
Abstract: Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
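The ESA comparison step can be sketched as follows; the tiny term-concept weights below are invented stand-ins for the TF-IDF scores of terms in Wikipedia articles.

    import numpy as np

    concepts = ["Computer", "Finance", "Music"]
    term_concept = {  # invented per-concept weights for each term
        "bank":   np.array([0.0, 0.9, 0.1]),
        "memory": np.array([0.8, 0.1, 0.2]),
        "chip":   np.array([0.9, 0.0, 0.0]),
    }

    def esa_vector(text):
        # Represent a text as the sum of the concept vectors of its known terms.
        vec = np.zeros(len(concepts))
        for token in text.lower().split():
            vec = vec + term_concept.get(token, 0)
        return vec

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(esa_vector("memory chip"), esa_vector("bank memory")))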

2,285 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...CL-ESA exploits a document-aligned multilingual reference collection such as Wikipedia to represent a document as a languageindependent concept vector....

    [...]

  • ...Nine algorithms were explained; HAL, LSA, GLSA, ESA, CL-ESA, PMI-IR, SCO-PMI, NGD and DISCO....

    [...]

  • ...Explicit Semantic Analysis (ESA) [17] is a measure used to compute the semantic relatedness between two arbitrary texts....

    [...]

  • ...The cross-language explicit semantic analysis (CLESA) [18] is a multilingual generalization of ESA....

    [...]

Proceedings Article
20 Aug 1995
TL;DR: This paper presents a new measure of semantic similarity in an IS-A taxonomy, based on the notion of information content, which performs encouragingly well and is significantly better than the traditional edge counting approach.
Abstract: This paper presents a new measure of semantic similarity in an IS-A taxonomy, based on the notion of information content. Experimental evaluation suggests that the measure performs encouragingly well (a correlation of r = 0.79 with a benchmark set of human similarity judgments, with an upper bound of r = 0.90 for human subjects performing the same task), and significantly better than the traditional edge counting approach (r = 0.66).
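In this formulation, the similarity of two concepts is the information content of their most informative common subsumer in the IS-A taxonomy:

    \mathrm{sim}_{\mathrm{res}}(c_1, c_2) = \max_{c \,\in\, S(c_1, c_2)} \bigl[ -\log p(c) \bigr]

where S(c_1, c_2) is the set of concepts that subsume both c_1 and c_2, and p(c) is the probability of encountering an instance of concept c, estimated from corpus frequencies.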

2,253 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...There are six measures of semantic similarity; three of them are based on information content: Resnik (res) [29], Lin (lin) [25] and Jiang & Conrath (jcn) [30]....

    [...]

Book
27 Oct 1997
TL;DR: This comprehensive, reader-friendly text covers the latest decision support theories and practices used by managers and organizations and is recommended for managers interested in Decision Support Systems, Computerized Decision Making, and Management Support Systems.
Abstract: From the Publisher: Widely hailed for its contemporary, cutting-edge perspective, this comprehensive, reader-friendly text covers the latest decision support theories and practices used by managers and organizations. Current examples and cases are drawn from actual organizations and firms. Decision Making, Systems, Modeling, and Support. Data Warehousing, Access, Analysis, Mining, and Visualization. Modeling and Analysis. Decision Support System Development. Collaborative Computing Technologies: Group Support Systems. Enterprise Decision Support Systems. Knowledge Management. Artificial Intelligence and Expert Systems. Knowledge Acquisition and Validation. Knowledge Representation. Inference Techniques. Intelligent Systems Development. Neural Computing Applications, and Advanced Artificial Intelligent Systems and Applications. Intelligent Software Agents and Creativity. Implementing and Integrating Management Support Systems. Organizational and Societal Impacts of Management Support Systems. For managers interested in Decision Support Systems, Computerized Decision Making, and Management Support Systems.

2,148 citations

Journal ArticleDOI
TL;DR: A new theory of similarity between words and phrases based on information distance and Kolmogorov complexity is presented, which is applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts.
Abstract: Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society" is "database," and the equivalent of "use" is "a way to search the database". We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87 percent with the expert-crafted WordNet categories.
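The Google similarity distance described here is standardly written as

    \mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}

where f(x) is the number of pages containing x, f(x, y) the number of pages containing both x and y, and N the total number of pages indexed by the search engine.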

1,784 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...If both terms always occur together, their NGD is zero, or equivalent to the coefficient between x squared and y squared....

    [...]

  • ...Nine algorithms were explained; HAL, LSA, GLSA, ESA, CL-ESA, PMI-IR, SCO-PMI, NGD and DISCO....

    [...]

  • ...Normalized Google Distance (NGD) [22] is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords....

    [...]

Journal ArticleDOI
TL;DR: A procedure that processes a corpus of text and produces, for each word, a numeric vector containing information about its meaning; these vectors provide the basis for a representational model of semantic memory, the hyperspace analogue to language (HAL).
Abstract: A procedure is presented that processes a corpus of text and produces, for each word, a numeric vector containing information about its meaning. This procedure is applied to a large corpus of natural language text taken from Usenet, and the resulting vectors are examined to determine what information is contained within them. These vectors provide the coordinates in a high-dimensional space in which word relationships can be analyzed. Analyses of both vector similarity and multidimensional scaling demonstrate that there is significant semantic information carried in the vectors. A comparison of vector similarity with human reaction times in a single-word priming experiment is presented. These vectors provide the basis for a representational model of semantic memory, the hyperspace analogue to language (HAL).
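A HAL-style matrix can be sketched by sliding a window over the corpus and accumulating distance-weighted co-occurrence counts, so that each word's row becomes its meaning vector; the window size and toy corpus here are invented for illustration.

    from collections import defaultdict

    corpus = "the quick brown fox jumps over the lazy dog".split()
    window = 4
    cooc = defaultdict(lambda: defaultdict(float))

    for i, word in enumerate(corpus):
        for d in range(1, window + 1):
            if i + d < len(corpus):
                # Closer neighbours receive larger weights (window - d + 1).
                cooc[word][corpus[i + d]] += window - d + 1

    print(dict(cooc["the"]))  # distance-weighted right-context counts for "the"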

1,717 citations