Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity through partitioning them into three approaches; String-based, Corpus-based and Knowledge-based similarities, and samples of combination between these similarities are presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
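To make the character-based family concrete before the survey's own treatment in section two, here is a minimal sketch of Levenshtein edit distance, one of the classic String-Based measures; the implementation is illustrative rather than taken from the paper:

```python
# Minimal sketch of a character-based string measure: Levenshtein edit
# distance, computed with a rolling dynamic-programming row.
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```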


Citations
Proceedings ArticleDOI
17 Mar 2021
TL;DR: The authors classify approaches to measuring the resemblance of sentences into three groups based on the methods implemented: word-to-word based, structure-based, and vector-based methods.
Abstract: This study is intended to analyze the methods used to test resemblance of sentences. For many Natural Language Processing applications such as text grouping, information recovery, brief reaction reviewing, machine learning, passage summary and text categorization, measuring resemblance between sentences is a vital activity. In this paper, we classify the approaches to measuring the resemblance of sentences based on the methods implemented into three groups. The most frequently used methods for finding sentence resemblance are word-to-word based, structure-based, and vector-based. Centered on a particular viewpoint, each approach tests the interaction between short texts. Furthermore, to provide a full view of this problem, datasets that are often used as benchmarks for testing techniques in this field are added. Better outcomes are obtained through methods that incorporate more than one viewpoint. In addition, since the resemblance of sentences rests on the correspondence of their meanings, measuring the semantic resemblance between two concepts, words or sentences needs further research.
Posted Content
TL;DR: A weighted similarity metric is proposed as a measure of matching and found to be more reliable than Content-Content or Title-Title similarities alone; the automation of replying to questions has brought the turnaround response time (TART) down from a minimum of 21 mins to a minimum of 0.3 secs.
Abstract: e-Yantra Robotics Competition (eYRC) is a unique Robotics Competition hosted by IIT Bombay that is actually an Embedded Systems and Robotics MOOC. Registrations have been growing exponentially each year, from 4500 in 2012 to over 34000 in 2019. In this 5-month long competition students learn complex skills under severe time pressure and have access to a discussion forum to post doubts about the learning material. Responding to questions in real-time is a challenge for project staff. Here, we illustrate the advantage of Deep Learning for real-time question answering in the eYRC discussion forum. We illustrate the advantage of Transformer based contextual embedding mechanisms such as Bidirectional Encoder Representation From Transformer (BERT) over word embedding mechanisms such as Word2Vec. We propose a weighted similarity metric as a measure of matching and find it more reliable than Content-Content or Title-Title similarities alone. The automation of replying to questions has brought the turnaround response time (TART) down from a minimum of 21 mins to a minimum of 0.3 secs.
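The weighted similarity metric is described only at a high level in the abstract; as a hedged sketch of the general idea, the snippet below blends a title-title and a content-content cosine similarity of embedding vectors with a single weight. The weight value, function names and the choice of embeddings are illustrative assumptions, not the paper's actual parameters:

```python
# Hedged sketch of a weighted similarity metric: blend title-title and
# content-content cosine similarities. The weight w = 0.5 is a placeholder.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_similarity(q_title, q_body, d_title, d_body, w: float = 0.5):
    """Inputs are fixed-size embedding vectors (e.g., from BERT or
    Word2Vec); w trades off title against content similarity."""
    return w * cosine(q_title, d_title) + (1 - w) * cosine(q_body, d_body)
```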

Cites methods from "A Survey of Text Similarity Approac..."

  • ...Throughout our approach, we strictly follow a term-based matching technique instead of character-based matching, as mentioned in the survey report [4]....

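The distinction drawn in the quote can be made concrete with a small sketch: term-based matching compares word tokens, while character-based matching compares character n-grams. Both are scored with Jaccard overlap here purely for illustration:

```python
# Contrast of matching granularities: word tokens vs. character trigrams,
# both scored with Jaccard overlap.
def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y) if x | y else 1.0

def term_sim(a: str, b: str) -> float:
    return jaccard(set(a.lower().split()), set(b.lower().split()))

def char_sim(a: str, b: str, n: int = 3) -> float:
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    return jaccard(grams(a.lower()), grams(b.lower()))

a, b = "reset my robot kit", "resetting the robot kits"
print(term_sim(a, b), char_sim(a, b))  # term overlap is low, char overlap higher
```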

Proceedings ArticleDOI
Angela Mathew, Sangeetha Jamal
01 Oct 2019
TL;DR: A method to improve the existing methods to classify documents in to categories based on supervised machine learning technique by converting the unstructured text data into a numerical vector form for representational learning.
Abstract: This paper proposes a method to improve existing techniques for classifying documents into categories based on supervised machine learning. It involves converting unstructured text data into a numerical vector form for representation learning. The key issue in text analysis and learning is to find an effective representational model. For large data sets, the classical Bag of Words model outperforms existing document representation learning methods in simplicity of processing and accuracy. An improvement to this method using a fuzzy mapping model based on semantic similarity among words beats the existing classification methods. The method gives a comparatively simple representation and a higher-accuracy model. Results on real-world data sets show improved accuracy when the proposed method is used in supervised learning for classification.
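As a reference point for the representation this abstract builds on, here is a minimal Bag of Words sketch; the proposed fuzzy semantic mapping itself is not reproduced, and the toy documents are illustrative only:

```python
# Minimal Bag of Words sketch: each document becomes a vector of term
# counts over a shared vocabulary. A fuzzy semantic mapping, as proposed
# in the paper, would additionally spread each count over semantically
# related vocabulary entries (not shown here).
from collections import Counter

docs = ["the cat sat on the mat", "a dog sat on a log"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for v in vectors:
    print(v)
```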
Book ChapterDOI
01 Jan 2021
TL;DR: In this paper, a character-based template generator for log lines is proposed, which combines comparison-based methods and heuristics to generate robust templates for any type of computer log data, which can be applied in security information and event management (SIEM) solutions.
Abstract: Log line clusters usually lack meaningful descriptions that are required to understand the information provided by log lines within a cluster. Template generators allow to produce such descriptions in form of patterns that match all log lines within a cluster and therefore describe the common features, e.g., substrings, of the lines. Current approaches only allow the generation of token-based (e.g., space-separated words) templates, which are often inaccurate for log lines, because they usually do not account for existing string similarities in, for instance fully qualified system names or domain names. Consequently, novel character-based template generators are required that provide robust templates for any type of computer log data, which can be applied in security information and event management (SIEM) solutions, for continuous auditing, quality inspection and control. In this chapter, we propose a novel approach for computing character-based templates, which combines comparison-based methods and heuristics. To achieve this goal, we solve the problem of efficiently calculating a multi-line alignment for a group of log lines and compute an accurate approximation of the optimal character-based template.
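To make the token-based versus character-based contrast concrete, the following hedged sketch derives a character-based template from a pair of log lines using standard pairwise alignment (Python's difflib); the chapter's actual multi-line alignment and heuristics are more sophisticated, and the example log lines are invented:

```python
# Hedged sketch of a character-based template: keep the characters two
# log lines share (via pairwise alignment) and replace everything else
# with a wildcard '*'. Illustration only; folding many lines this way
# treats earlier wildcards as literal characters.
from difflib import SequenceMatcher
from functools import reduce

def pair_template(a: str, b: str) -> str:
    out, prev_end = [], 0
    for block in SequenceMatcher(None, a, b).get_matching_blocks():
        if block.a > prev_end:
            out.append("*")  # unmatched stretch becomes a wildcard
        out.append(a[block.a:block.a + block.size])
        prev_end = block.a + block.size
    return "".join(out)

lines = [
    "login failed for user alice from 10.0.0.1",
    "login failed for user bob from 10.0.0.25",
]
print(reduce(pair_template, lines))
# e.g. 'login failed for user * from 10.0.0.*'
```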
Posted Content
TL;DR: In this article, the authors presented a graph data structure, which they denote as a meta-graph, that combines underlying users' relational event information, as well as semantic and topical modeling.
Abstract: As recent events have demonstrated, disinformation spread through social networks can have dire political, economic and social consequences. Detecting disinformation must inevitably rely on the structure of the network, on users' particularities and on event occurrence patterns. We present a graph data structure, which we denote as a meta-graph, that combines underlying users' relational event information, as well as semantic and topical modeling. We detail the construction of an example meta-graph using Twitter data covering the 2016 US election campaign and then compare the detection of disinformation at cascade level, using well-known graph neural network algorithms, to the same algorithms applied on the meta-graph nodes. The comparison shows a consistent 3%-4% improvement in accuracy when using the meta-graph, over all considered algorithms, compared to basic cascade classification, and a further 1% increase when topic modeling and sentiment analysis are considered. We carry out the same experiment on two other datasets, HealthRelease and HealthStory, part of the FakeHealth dataset repository, with consistent results. Finally, we discuss further advantages of our approach, such as the ability to augment the graph structure using external data sources, the ease with which multiple meta-graphs can be combined as well as a comparison of our method to other graph-based disinformation detection frameworks.
References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The WordNet lexical database is presented, covering nouns, modifiers and a semantic network of English verbs, together with applications of WordNet such as building semantic concordances.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X, using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.

13,049 citations
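WordNet underlies many of the Knowledge-Based measures the survey covers; a minimal hedged sketch of querying its semantic network through NLTK's WordNet interface (assuming the wordnet corpus has been downloaded via nltk.download) looks like this:

```python
# Minimal sketch of a knowledge-based measure over WordNet's noun
# hierarchy, via NLTK. Requires: pip install nltk, then running
# nltk.download("wordnet") once before use.
from nltk.corpus import wordnet as wn

dog, cat, car = (wn.synset(s) for s in ("dog.n.01", "cat.n.01", "car.n.01"))

# path_similarity scores inversely with the shortest path between
# concepts in the hierarchy; nearby concepts score higher.
print(dog.path_similarity(cat))  # relatively high: both are carnivores
print(dog.path_similarity(car))  # lower: the concepts are far apart
```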

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed; from these comparisons it is possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

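A minimal sketch of the quoted definition, on whitespace tokens (the tokenization choice is an assumption for illustration):

```python
# Dice's coefficient: twice the number of shared terms over the total
# number of terms in both strings.
def dice(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(ta & tb) / (len(ta) + len(tb))

print(dice("night is dark", "the night is young"))  # 2*2 / (3+4) ≈ 0.571
```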

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

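As a hedged illustration of the local alignment idea in the quote, here is a compact Smith-Waterman scoring sketch; the match, mismatch and gap scores are arbitrary illustrative values:

```python
# Smith-Waterman local alignment score: find the best-scoring pair of
# segments rather than a global alignment. Cells are floored at 0 so a
# local alignment can start anywhere.
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = rows[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            rows[i][j] = max(0, diag, rows[i-1][j] + gap, rows[i][j-1] + gap)
            best = max(best, rows[i][j])
    return best  # score of the most similar pair of segments

print(smith_waterman("GGTTGACTA", "TGTTACGG"))
```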

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations
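A minimal LSA sketch along the lines of this abstract, using scikit-learn on a toy corpus; the corpus and the choice of 2 latent dimensions are illustrative stand-ins for a large text collection and the ~300 dimensions the abstract mentions:

```python
# Minimal LSA sketch: build a term-document matrix and reduce it with
# truncated SVD, so that documents using words from similar contexts
# end up close in the latent space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the doctor examined the patient",
    "the physician examined the patient",
    "the pilot flew the plane",
]
X = CountVectorizer().fit_transform(docs)        # term-document counts
Z = TruncatedSVD(n_components=2).fit_transform(X)
print(Z)  # rows 0 and 1 should land closer together than row 2
```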


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

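The PMI-IR reference above rests on pointwise mutual information; a hedged sketch of that core computation follows, with invented toy counts standing in for web or corpus statistics:

```python
# Pointwise mutual information: score two words by how much more often
# they co-occur than independence would predict. In PMI-IR the counts
# come from web search hit counts; here they are toy values.
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """log2( p(x,y) / (p(x) * p(y)) ) from raw co-occurrence counts."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Toy numbers: 'car' and 'automobile' co-occur far above chance,
# giving a strongly positive PMI (~8 bits here).
print(pmi(count_xy=50, count_x=1000, count_y=200, total=1_000_000))
```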