Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposite of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
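As a concrete illustration of the two families contrasted above, the sketch below (added for this summary, not code from the survey) computes one character-based measure (Levenshtein edit distance) and one term-based measure (Jaccard overlap of word sets) in Python:

def levenshtein(a: str, b: str) -> int:
    # Character-based: minimum number of single-character edits
    # (insertions, deletions, substitutions) turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    # Term-based: overlap of the word sets of the two strings.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

print(levenshtein("kitten", "sitting"))            # 3 edits
print(jaccard("the cat sat", "the cat sat down"))  # 0.75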


Citations
Journal ArticleDOI
TL;DR: UPM, an unsupervised algorithm for matching products by their titles that is independent of any external sources, is introduced; its evaluation demonstrates its superiority over state-of-the-art clustering approaches and string similarity metrics in terms of both efficiency and effectiveness.
Abstract: The continuous growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities on the Web, the volume and the diversity of the product-related information increase quickly. These factors make it difficult for the users to identify and compare the features of their desired products. Recent studies proved that the standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice-versa. Other studies employ external data sources to enrich the titles; these solutions are rather impractical, since the process of fetching external data is inefficient. In this paper we introduce UPM, an unsupervised algorithm for matching products by their titles that is independent of any external sources. UPM consists of three stages. During the first stage, the algorithm analyzes the titles and extracts combinations of words out of them. These combinations are evaluated in stage 2 according to several criteria, and the most appropriate of them are selected to form the initial clusters. The third phase is a post-processing verification stage that refines the initial clusters by correcting the erroneous matches. This stage is designed to operate in combination with all clustering approaches, especially when the data possess properties that prevent the co-existence of two data points within the same cluster. The experimental evaluation of UPM with multiple datasets demonstrates its superiority against the state-of-the-art clustering approaches and string similarity metrics, in terms of both efficiency and effectiveness.

5 citations

Proceedings ArticleDOI
13 Feb 2022
TL;DR: This study investigates how supporting serendipitous discovery and analysis of online product reviews can encourage readers to explore reviews more comprehensively prior to making purchase decisions and proposes two interventions — Exploration Metrics that can help readers understand and track their exploration patterns through visual indicators and a Bias Mitigation Model that intends to maximize knowledge discovery.
Abstract: In this study, we investigate how supporting serendipitous discovery and analysis of online product reviews can encourage readers to explore reviews more comprehensively prior to making purchase decisions. We propose two interventions — Exploration Metrics that can help readers understand and track their exploration patterns through visual indicators and a Bias Mitigation Model that intends to maximize knowledge discovery by suggesting sentiment and semantically diverse reviews. We designed, developed, and evaluated a text analytics system called Serendyze, where we integrated these interventions. We asked 100 crowd workers to use Serendyze to make purchase decisions based on product reviews. Our evaluation suggests that exploration metrics enabled readers to efficiently cover more reviews in a balanced way, and suggestions from the bias mitigation model influenced readers to make confident data-driven decisions. We discuss the role of user agency and trust in text-level analysis systems and their applicability in domains beyond review exploration.

5 citations

Book ChapterDOI
01 Jan 2019
TL;DR: This work proposes a method which uses data from Wikipedia and WordNet's Brown corpus to calculate semantic relatedness using a modified form of the Normalized Google Distance (NGD), and finds that the proposed method calculates relatedness that significantly correlates with human intuition.
Abstract: Many applications in natural language processing require semantic relatedness between words to be quantified. Existing WordNet-based approaches fail in the case of non-dictionary words, jargon, or some proper nouns. The meaning of terms evolves over the years, which is not always reflected in WordNet. However, WordNet cannot be ignored, as it considers the semantics of the language along with its contextual meaning. Hence, we propose a method which uses data from Wikipedia and WordNet's Brown corpus to calculate semantic relatedness using a modified form of the Normalized Google Distance (NGD). The modified NGD incorporates word senses derived from WordNet and occurrence counts over the data from Wikipedia. Through experiments performed on a set of selected word pairs, we found that the proposed method calculates relatedness that significantly correlates with human intuition.
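For reference, the standard Normalized Google Distance that the chapter modifies (the abstract does not spell out the WordNet-aware modification itself) is usually written as

NGD(x, y) = \frac{\max(\log f(x), \log f(y)) - \log f(x, y)}{\log N - \min(\log f(x), \log f(y))}

where f(x) and f(y) count the documents (here, Wikipedia pages) containing x and y, f(x, y) counts those containing both, and N is the total number of indexed documents; smaller distances indicate stronger relatedness.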

5 citations

Dissertation
15 Dec 2014
TL;DR: This thesis models data linkage in the Web of Data as reasoning over incomplete data, iteratively building SPARQL queries to import data from relevant external Linked Data sources, and introduces a Bayesian trust mechanism for semantic P2P networks, evaluated with TrustMe, a P2P semantic bookmark-sharing system.
Abstract: In this thesis, we study several approaches intended to help users find useful and reliable information in the Web of Data, using Semantic Web technologies. We address two research topics: data linkage in Linked Data and trust in semantic P2P networks. We model the linkage problem in the Web of Data as a problem of reasoning over incomplete data, which must be enriched by querying the Linked Data cloud in a precise and relevant way. We designed and implemented a new algorithm which, starting from a linkage query (of the type ...) and a rule base uniformly modelling various domain knowledge (schema constraints, inclusion or exclusion axioms of an ontology, expert rules, mappings), iteratively builds SPARQL queries to import from relevant external Linked Data sources the data needed to answer the linkage query. The experiments we carried out on real data demonstrated the feasibility of this approach and its practical usefulness for data linkage and homonym resolution. In addition, we propose an adaptation of this approach to take into account possibly uncertain data and knowledge, resulting in the inference of 'sameAs' and 'differentFrom' links associated with probability weights. In this adaptation we model uncertainty as probability values. Our experiments showed that our approach scales to knowledge bases of several million RDF facts and produces reliable probabilistic weights. Concerning trust, we introduce a trust mechanism to guide the query-answering process in semantic P2P networks. The different peers in semantic P2P networks organize their information using distinct ontologies and depend on alignments between ontologies to translate their queries. The notion of trust in such a context is subjective; it estimates the probability that a peer will provide satisfactory answers for specific queries in future interactions. The proposed mechanism for computing trust values combines the information provided by the alignments with information coming from past interactions between peers. The computed trust values are refined progressively at each query/answer cycle using Bayesian inference. To evaluate our mechanism, we built a P2P semantic bookmark-sharing system (TrustMe) in which various quantitative and qualitative parameters can be varied. The experimental results show the convergence of the trust values; they also highlight the gain in the quality of peers' answers, measured by precision and recall, when the query-answering process is guided by our trust mechanism.
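The abstract does not give the exact update rule; one common way to realize "trust values refined at each query/answer cycle using Bayesian inference" is a Beta-Bernoulli update, sketched below as an illustrative assumption rather than the thesis's actual model:

class BetaTrust:
    # Illustrative Beta-Bernoulli trust estimate for a single peer.
    # Trust is the expected probability that the peer returns a satisfactory
    # answer, updated after every query/answer cycle.
    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        # Beta(1, 1) prior: uniform trust before any interaction.
        self.alpha = alpha
        self.beta = beta

    def update(self, satisfactory: bool) -> None:
        # Bayesian update: successes raise alpha, failures raise beta.
        if satisfactory:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def trust(self) -> float:
        # Posterior mean of the Bernoulli success probability.
        return self.alpha / (self.alpha + self.beta)

peer = BetaTrust()
for outcome in [True, True, False, True]:
    peer.update(outcome)
print(round(peer.trust, 3))  # 0.667 after three satisfactory answers and one failure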

5 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...– String similarity tools [33]: Many linking rules disambiguate entities if they share exactly the same values for certain properties (e....

    [...]

Book ChapterDOI
12 May 2015
TL;DR: This work describes how the schema matching problem can be modelled and simulated with agents, where each agent learns, reasons and acts to find the best match among the attributes of the other schema.
Abstract: Schema matching and mapping are important tasks for many applications, such as data integration, data warehousing and e-commerce. Many algorithms and approaches have been proposed to deal with the problem of automatic schema matching and mapping. In this work, we describe how the schema matching problem can be modelled and simulated with agents, where each agent learns, reasons and acts to find the best match among the attributes of the other schema. Many differences exist between our approach and the existing practice in schema matching. First and foremost, our approach is based on the Agent-Based Modeling and Simulation (ABMS) paradigm, whereas, as far as we know, none of the current methods uses it. Second, the agent's decision-making and reasoning process leverages probabilistic (Bayesian) models for matching prediction and action selection (planning). The results we have obtained so far are very encouraging and reinforce our belief that many intrinsic properties of our model, such as simulation, stochasticity and emergence, contribute efficiently to increasing the matching quality and thus decreasing the matching uncertainty.

5 citations

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The WordNet lexical database (nouns, modifiers, and a semantic network of English verbs) and applications of WordNet, such as building semantic concordances, are presented.
Abstract: Part 1, The lexical database: nouns in WordNet, George A. Miller; modifiers in WordNet, Katherine J. Miller; a semantic network of English verbs, Christiane Fellbaum; design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst; representing verb alternations in WordNet, Karen T. Kohl et al; the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3, Applications of WordNet: building semantic concordances, Shari Landes et al; performance and confidence in a semantic annotation task, Christiane Fellbaum et al; WordNet and class-based probabilities, Philip Resnik; combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow; using WordNet for text retrieval, Ellen M. Voorhees; lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge; temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman; COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet; knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I. Moldovan; appendix - obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed; with it, it is possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

    [...]
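That definition maps directly onto a few lines of code; the sketch below (using word sets, a tokenization assumed here since the excerpt does not fix one) computes it:

def dice_coefficient(a: str, b: str) -> float:
    # Twice the number of common terms divided by the total number
    # of terms in both strings, as in the excerpt above.
    terms_a, terms_b = set(a.lower().split()), set(b.lower().split())
    if not terms_a and not terms_b:
        return 1.0
    return 2 * len(terms_a & terms_b) / (len(terms_a) + len(terms_b))

print(dice_coefficient("night is dark", "the night is young"))  # 2*2 / (3+4) = 0.571...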

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]
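The excerpt refers to local alignment; the sketch below shows Smith-Waterman-style local alignment scoring with simple unit match/mismatch/gap weights (an illustrative assumption; the paper allows arbitrary similarity weights and deletion penalties):

def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    # Smith-Waterman-style scoring: unlike global (Needleman-Wunsch) alignment,
    # cells are clamped at zero, so the best-scoring similar region is found
    # even inside otherwise dissimilar sequences.
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

print(local_alignment_score("xxxACGTxxx", "yyACGTyy"))  # 8: the shared ACGT region, 4 matches * 2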

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
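As a concrete illustration of the corpus-based idea, the sketch below approximates LSA by factoring a term-document count matrix with a truncated SVD and comparing documents in the reduced space (scikit-learn is assumed here purely for illustration; it is not something the cited work uses):

# Minimal LSA sketch: term-document counts -> truncated SVD -> cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the doctor examined the patient",
    "the physician treated the patient",
    "the striker scored a late goal",
    "the team won the football match",
]

counts = CountVectorizer().fit_transform(docs)        # term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0)    # keep a few latent dimensions
doc_vectors = lsa.fit_transform(counts)

# Pairwise similarities of the documents in the latent space; related topics
# tend to land closer together than unrelated ones.
print(cosine_similarity(doc_vectors).round(2))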

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]