scispace - formally typeset
Search or ask a question
Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity through partitioning them into three approaches; String-based, Corpus-based and Knowledge-based similarities, and samples of combination between these similarities are presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity through partitioning them into three approaches; String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combination between these similarities are presented. General Terms Text Mining, Natural Language Processing. Keywords BasedText Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity. NeedlemanWunsch 1. INTRODUCTION Text similarity measures play an increasingly important role in text related research and applications in tasks Nsuch as information retrieval, text classification, document clustering, topic detection, topic tracking, questions generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they have the same thing, are opposite of each other, used in the same way, used in the same context and one is a type of another. DistanceLexical similarity is introduced in this survey though different String-Based algorithms, Semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular for each type will be presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types character-based and term-based measures. Sections three and four introduce Corpus-Based and knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five and finally section six presents conclusion of the survey.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper propose a methodology with tool support, named as Zen-ReqConfig, to tackle the problem of automatically developing structured and configuration-ready PL requirements repositories.
Abstract: A system product line (PL) often has a large number of reusable and configurable requirements, which in practice are organized hierarchically based on the architecture of the PL. However, the current literature lacks approaches that can help practitioners to systematically and automatically develop structured and configuration-ready PL requirements repositories. In the context of product line engineering and model-based engineering, automatic requirements structuring can benefit from models. Such a structured PL requirements repository can greatly facilitate the development of product-specific requirements repository, the product configuration at the requirements level, and the smooth transition to downstream product configuration phases (e.g., at the architecture design phase). In this paper, we propose a methodology with tool support, named as Zen-ReqConfig, to tackle the above challenge. Zen-ReqConfig is built on existing model-based technologies, natural language processing, and similarity measure techniques. It automatically devises a hierarchical structure for a PL requirements repository, automatically identifies variabilities in textual requirements, and facilitates the configuration of products at the requirements level, based on two types of variability modeling techniques [i.e., cardinality-based feature modeling (CBFM) and a UML-based variability modeling methodology (named as SimPL)]. We evaluated Zen-ReqConfig with five case studies. Results show that Zen-ReqConfig can achieve a better performance based on the character-based similarity measure Jaro than the term-based similarity measure Jaccard. With Jaro, Zen-ReqConfig can allocate textual requirements with high precision and recall, both over 95% on average and identify variabilities in textual requirements with high precision (over 97% on average) and recall (over 94% on average). Zen-ReqConfig achieved very good time performance: with less than a second for generating a hierarchical structure and less than 2 s on average for allocating a requirement. When comparing SimPL and CBFM, no practically significant difference was observed, and they both performed well when integrated with Zen-ReqConfig.

13 citations

Journal ArticleDOI
TL;DR: This paper presents MERGILO, a method for reconciling knowledge extracted from multiple natural language sources, and for delivering it as a knowledge graph, and shows promising results.
Abstract: This paper presents MERGILO, a method for reconciling knowledge extracted from multiple natural language sources, and for delivering it as a knowledge graph. The underlying problem is relevant in many application scenarios requiring the creation and dynamic evolution of a knowledge base, e.g. automatic news summarization, human-robot dialoguing, etc. After providing a formal definition of the problem, we propose our holistic approach to handle natural language input - typically independent texts as in news from different sources - and we output a knowledge graph representing their reconciled knowledge. MERGILO is evaluated on its ability to identify corresponding entities and events across documents against a manually annotated corpus of news, showing promising results.

13 citations

Posted Content
TL;DR: Analysis of the relation between stance and sentiment reveals that sentiment and stance are not highly aligned, and hence the simple sentiment polarity cannot be used solely to denote a stance toward a given topic.
Abstract: Stance detection is the task of inferring viewpoint towards a given topic or entity either being supportive or opposing. One may express a viewpoint towards a topic by using positive or negative language. This paper examines how the stance is being expressed in social media according to the sentiment polarity. There has been a noticeable misconception of the similarity between the stance and sentiment when it comes to viewpoint discovery, where negative sentiment is assumed to mean against stance, and positive sentiment means in-favour stance. To analyze the relation between stance and sentiment, we construct a new dataset with four topics and examine how people express their viewpoint with regards these topics. We validate our results by carrying a further analysis of the popular stance benchmark SemEval stance dataset. Our analyses reveal that sentiment and stance are not highly aligned, and hence the simple sentiment polarity cannot be used solely to denote a stance toward a given topic.

12 citations

Journal ArticleDOI
TL;DR: The new approach and the delivered insights aim at improving the overall health and safety culture of the project partner practices, as they can be applied to caution and, thus, prevent future incidents.

12 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...A number of text similarity approaches have been proposed in the literature [20] that can be used to compare text descriptions of events....

    [...]

Proceedings ArticleDOI
10 Nov 2017
TL;DR: This work introduces a new text-similarity measure, which employs named-entities’ information extracted from the texts and the n-gram graphs’ model for representing documents, embedded in a text clustering algorithm (k-Means).
Abstract: Text comparison is an interesting though hard task, with many applications in Natural Language Processing. This work introduces a new text-similarity measure, which employs named-entities’ information extracted from the texts and the n-gram graphs’ model for representing documents. Using OpenCalais as a named-entity recognition service and the JINSECT toolkit for constructing and managing n-gram graphs, the text similarity measure is embedded in a text clustering algorithm (k-Means). The evaluation of the produced clusters with various clustering validity metrics shows that the extraction of named entities at a first step can be profitable for the time-performance of similarity measures that are based on the n-gram graph representation without affecting the overall performance of the NLP task.

12 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Gomaa and Fahmy (2013) provide a survey on text similarity measures, dividing them into three groups: i) stringbased measures that operate on character sequences, ii) corpus-based measures that take into account information that comes from corpora, and iii) knowledge-based measures that use…...

    [...]

References
More filters
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The lexical database: nouns in WordNet, Katherine J. Miller a semantic network of English verbs, and applications of WordNet: building semantic concordances are presented.
Abstract: Part 1 The lexical database: nouns in WordNet, George A. Miller modifiers in WordNet, Katherine J. Miller a semantic network of English verbs, Christiane Fellbaum design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst representing verb alterations in WordNet, Karen T. Kohl et al the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3 Applications of WordNet: building semantic concordances, Shari Landes et al performance and confidence in a semantic annotation task, Christiane Fellbaum et al WordNet and class-based probabilities, Philip Resnik combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow using WordNet for text retrieval, Ellen M. Voorhees lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I Moldovan appendix - obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

    [...]

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other Pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMIIR versus LSA on TOEFL....

    [...]