Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION: Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity and is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section 2 presents String-Based algorithms, partitioning them into two types, character-based and term-based measures. Sections 3 and 4 introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in Section 5, and finally Section 6 presents the conclusion of the survey.
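The character-based String-Based measures the survey catalogs can be made concrete with a small example. Below is a minimal sketch of Levenshtein edit distance, one of the classic character-based measures (illustrative Python, not code from the survey; the function name and test strings are invented):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Lexically similar words ("kitten"/"sitting") get a small distance regardless of meaning, which is exactly the lexical notion of similarity described above.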


Citations
Posted Content
01 Feb 2019
TL;DR: This work proposes a new modeling and learning framework for detecting linkage between crime events using spatio-temporal-textual data, which are highly prevalent in the form of police reports, and captures the notion of modus operandi by introducing a multivariate marked point process and handling the complex text jointly with the time and location.
Abstract: Crimes emerge out of complex interactions of behaviors and situations; thus there are complex linkages between crime incidents. Solving the puzzle of crime linkage is a highly challenging task because we often only have limited information from indirect observations such as records, text descriptions, and associated times and locations. We propose a new modeling and learning framework for detecting linkage between crime events using spatio-temporal-textual data, which are highly prevalent in the form of police reports. We capture the notion of modus operandi (M.O.) by introducing a multivariate marked point process and handling the complex text jointly with the time and location. The model is able to discover the latent space that links the crime series. Model fitting is achieved by a computationally efficient Expectation-Maximization (EM) algorithm. In addition, we explicitly reduce the bias in the text documents in our algorithm. Our numerical results using real data from the Atlanta Police show that our method has competitive performance relative to the state-of-the-art. Our results, including variable selection, are highly interpretable and may bring insights into M.O. extraction.

5 citations


Cites background or methods from "A Survey of Text Similarity Approac..."

  • ...By looking into the distribution of the high TF-IDF value (Gomaa and Fahmy, 2013) keywords in each of the crime series labeled by the police, shown in Figure 2, the co-occurring keywords in each crime series are surprisingly interesting and give a vivid picture of each crime series....

    [...]

  • ...A commonly used measure for semantic similarity, TF-IDF weighting followed by cosine distance, is good enough for most applications (Gomaa and Fahmy, 2013); a minimal version of that pipeline is sketched after this excerpt....

    [...]
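A minimal sketch of the TF-IDF-plus-cosine pipeline mentioned in the excerpt, using scikit-learn (assumed installed; the example documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example documents; any list of strings works the same way.
docs = [
    "the suspect fled in a red car",
    "a red vehicle was seen leaving the scene",
    "the weather was sunny all week",
]

# TF-IDF weighting followed by pairwise cosine similarity.
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf).round(2))  # docs 0 and 1 share "red" and should score highest
```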

Book ChapterDOI
01 Sep 2014
TL;DR: This paper presents a novel approach for efficiently performing phonetic similarity search over large data sources, using a data structure called PhoneticMap to encode language-specific phonetic information.
Abstract: Analysis of unstructured data may be inefficient in the presence of spelling errors. Existing approaches use string similarity methods to search for valid words within a text, with a supporting dictionary. However, they are not rich enough to encode phonetic information to assist the search. In this paper, we present a novel approach for efficiently performing phonetic similarity search over large data sources, which uses a data structure called PhoneticMap to encode language-specific phonetic information. We validate our approach through an experiment over a data set using a Portuguese variant of a well-known repository, to automatically correct words with spelling errors.
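The paper's PhoneticMap structure is language-specific and not reproduced here. As a generic illustration of how phonetic information can be encoded for matching, the classic (English-oriented) Soundex code is sketched below; this is a standard textbook algorithm, not the paper's method:

```python
def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits,
    so words that sound alike map to the same 4-character code."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not break a run of identical codes
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163: same phonetic code
```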

5 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...[9] presents a survey of the existing works on text similarity, partitioning them into three approaches....

    [...]

Proceedings ArticleDOI
TL;DR: Wang et al. propose a new framework, SurfCon, that leverages two important types of information in privacy-aware clinical data, i.e., the surface form information and the global context information, for synonym discovery.
Abstract: Unstructured clinical texts contain rich health-related information. To better utilize the knowledge buried in clinical texts, discovering synonyms for a medical query term has become an important task. Recent automatic synonym discovery methods leveraging raw text information have been developed. However, to preserve patient privacy and security, it is usually quite difficult to get access to large-scale raw clinical texts. In this paper, we study a new setting named synonym discovery on privacy-aware clinical data (i.e., medical terms extracted from the clinical texts and their aggregated co-occurrence counts, without raw clinical texts). To solve the problem, we propose a new framework SurfCon that leverages two important types of information in the privacy-aware clinical data, i.e., the surface form information, and the global context information for synonym discovery. In particular, the surface form module enables us to detect synonyms that look similar while the global context module plays a complementary role to discover synonyms that are semantically similar but in different surface forms, and both allow us to deal with the OOV query issue (i.e., when the query is not found in the given data). We conduct extensive experiments and case studies on publicly available privacy-aware clinical data, and show that SurfCon can outperform strong baseline methods by large margins under various settings.

5 citations

Journal Article
TL;DR: The proposed system finds the similarity between two Arabic texts by using hybrid similarity measure techniques: a semantic similarity measure, the cosine similarity measure, and N-grams (using the Dice similarity measure).
Abstract: Calculating similarities between texts written in one language or in multiple languages is still one of the most important challenges facing natural language processing. This work presents many approaches used for text similarity. The proposed system finds the similarity between two Arabic texts by using hybrid similarity measure techniques: a semantic similarity measure, the cosine similarity measure, and N-grams (using the Dice similarity measure). In the proposed system we design an Arabic SemanticNet that stores the keywords for a specific field (computer science); with this network we can find the semantic similarity between words according to specific equations. Cosine and N-gram similarity measures are used to find similar character sequences. The proposed system was implemented in Visual Basic 2012 and, after testing, proved worthy for finding the similarity between two Arabic texts (from the viewpoint of accuracy and search time).

5 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...If the words are used in the same way, mean the same thing, are opposites of each other, are used in the same context, or one of them is a type of the other, then the words are similar semantically [1]....

    [...]

  • ...topic detection, document clustering, question generation, topic tracking, essay scoring, question answering, machine translation, short answer scoring, and others [1]....

    [...]

01 Jan 2016
TL;DR: The author revealed that research opportunities for knowledge management in decentralized and globalized manufacturing systems are limited, but that the potential for new ideas and approaches to address these challenges is considerable.
Abstract: (Thesis front matter and table of contents; page numbers omitted.)
CHAPTER 1: INTRODUCTION AND MOTIVATION. 1.1 Decentralized and Globalized Manufacturing Systems; 1.2 Need for Knowledge Management in Globalized, Decentralized Design and Manufacturing Systems; 1.3 A Method to Link Product Design and Assembly Process Design.
CHAPTER 2: FRAME OF REFERENCE. 2.1 Product Evolution Process (PEP); 2.2 Information available at different PEP phases; 2.3 Product-Process coupling; 2.4 Computational methods to analyze and compare assembly solid models; 2.5 Computational methods to analyze and compare assembly work instructions; 2.6 Summary of research opportunities.
CHAPTER 3: COMPUTATIONAL SOLID MODEL SIMILARITY FOR ASSEMBLY PROCESS INFORMATION RETRIEVAL. 3.1 Use of tessellation areas to determine solid model similarity; 3.2 Comparison of existing solid model similarity methods to tessellation area frequency distribution solid model similarity; 3.3 D1 method to compute solid model similarity and visualizing differences in solid models.
CHAPTER 4: COMPUTATIONAL ANALYSIS OF ASSEMBLY WORK INSTRUCTIONS.

5 citations

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: Presents the WordNet lexical database (nouns, modifiers, and a semantic network of English verbs) and applications of WordNet such as building semantic concordances.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.
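WordNet is the semantic network behind most Knowledge-Based measures in the surveyed literature. A minimal example with NLTK's WordNet interface (assumes nltk is installed and its wordnet corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
car = wn.synset("car.n.01")

# Path similarity lies in (0, 1]: shorter paths between concepts in
# the hypernym hierarchy mean higher similarity.
print(dog.path_similarity(cat))  # higher than the dog/car score below
print(dog.path_similarity(car))
```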

13,049 citations

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed; it makes it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]; this definition is sketched in code after this excerpt....

    [...]
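The quoted definition translates directly into code. A minimal token-level sketch (treating each string as a set of whitespace-separated terms, which is one common simplification; other variants use character bigrams):

```python
def dice_coefficient(s1: str, s2: str) -> float:
    """Twice the number of common terms divided by the total number
    of terms in both strings, per the definition quoted above."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    common = len(set(t1) & set(t2))
    return 2 * common / (len(t1) + len(t2))

print(dice_coefficient("night is dark", "the night is young"))  # 2*2/(3+4) ≈ 0.57
```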

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]; a local-alignment sketch follows this excerpt....

    [...]
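A compact sketch of that local-alignment idea (the Smith-Waterman dynamic program) over plain characters, with invented unit scores:

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Best local-alignment score between any segment of a and any
    segment of b; clamping at 0 restarts the alignment, which is
    what makes the method local rather than global."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            score = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(0,
                              table[i - 1][j - 1] + score,  # match/mismatch
                              table[i - 1][j] + gap,        # gap in b
                              table[i][j - 1] + gap)        # gap in a
            best = max(best, table[i][j])
    return best

# Dissimilar strings sharing the motif "ACGT" score on the motif alone.
print(smith_waterman("xxxACGTxxx", "yyACGTyy"))  # 8 = 4 matches * 2
```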

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
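The core computation in LSA is a truncated SVD of a term-document co-occurrence matrix. A minimal numpy sketch (the toy matrix and k=2 are invented for illustration; the abstract above notes that around 300 dimensions works well in practice):

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

# Truncated SVD: keep only the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
terms = U[:, :k] * s[:k]  # term vectors in the k-dimensional latent space

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Terms become comparable even when they never co-occur directly.
print(cos(terms[0], terms[1]))
```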

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]