Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented. General Terms: Text Mining, Natural Language Processing. Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity. 1. INTRODUCTION Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarity. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms, partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
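To make the string-metric idea above concrete, here is a minimal sketch of a character-based measure (Levenshtein edit distance) turned into a similarity score; the [0, 1] normalisation and the example word pairs are illustrative choices, not definitions taken from the survey.

```python
# Minimal sketch of a character-based String-Based measure (Levenshtein edit distance).
# The normalisation into a [0, 1] similarity score is an illustrative choice.

def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def lexical_similarity(a: str, b: str) -> float:
    """Turn the edit distance into a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(lexical_similarity("colour", "color"))    # high: lexically similar
print(lexical_similarity("car", "automobile"))  # low: similar only in meaning
```

The second pair shows why lexical measures alone are not enough: "car" and "automobile" are semantically close but share almost no character sequence, which is where the Corpus-Based and Knowledge-Based measures come in.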


Citations
Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper proposes a cube design and construction approach based on the fusion of cubes that each contain part of the user's need, addressing user queries whose expressed need is dispersed over more than one cube.
Abstract: Business Intelligence systems provide an effective solution for multidimensional online computing and analysis of large volumes of data. In the decision-making process, the analyzed data are typically stored in a set of often heterogeneous cubes. Most of the time, the structure of these cubes is unknown to decision makers. Our goal is to enable decision makers to express their need via a query in natural language. In this paper, we deal with the problem of data cube design and construction according to a user query whose expressed need is dispersed over more than one cube. We propose a cube design and construction approach based on the fusion of cubes containing a part of the user's need. We validate our approach via a tool, called “Design-Cubes-Query”, that implements it, and we show its use through a case study.

4 citations

Book ChapterDOI
17 Nov 2017
TL;DR: A new similarity measure based on concept name analysis is proposed to address the weakness of existing similarity measures for primitive concepts, and is evaluated against other approaches and human expert results on different types of ontology concepts.
Abstract: Measuring the semantic similarity between biomedical terms or concepts is a crucial task in biomedical information extraction and knowledge discovery. Most of the existing similarity approaches measure the degree of similarity based on the path length between concept nodes as well as the depth of the ontology tree or hierarchy. These measures do not work well for “primitive concepts”, which are only partially defined and have few relations in the ontology structure; namely, they cannot reproduce human expert judgements of the similarity among primitive concepts. In this paper, two existing ontology-based measures are introduced and analyzed in order to determine their limitations with respect to the considered knowledge base. A new similarity measure based on concept name analysis is then proposed to address the weakness of the existing similarity measures for primitive concepts. Using SNOMED CT as the input ontology, the accuracy of our proposal is evaluated and compared against other approaches using human expert results on different types of ontology concepts. Based on the correlation between the results of the evaluated measures and the human expert ratings, this paper analyzes the strengths and weaknesses of each similarity measure for all ontology concepts.

4 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...We define the headword noun phrase structure denoted by simHeadword based on the Jaccard similarity [15] (the number of shared terms over the number of all unique terms)....

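As a rough illustration of the path-length idea this citing paper builds on (and of why sparsely related "primitive" concepts score poorly), here is a toy sketch; the hand-built hierarchy and the 1 / (1 + path length) scoring are assumptions made for illustration only, not the measure proposed in the paper.

```python
# Toy path-length similarity over a hand-built is-a hierarchy, illustrating why
# sparsely connected ("primitive") concepts receive low scores. The hierarchy and
# the 1 / (1 + shortest_path) scoring are illustrative assumptions.
from collections import deque

IS_A = {  # child -> parent edges of a tiny made-up ontology
    "myocardial_infarction": "heart_disease",
    "angina": "heart_disease",
    "heart_disease": "cardiovascular_disease",
    "cardiovascular_disease": "disease",
    "rare_condition_x": "disease",   # "primitive": only a single relation
}

def _neighbours(node):
    nbrs = set()
    if node in IS_A:
        nbrs.add(IS_A[node])
    nbrs.update(c for c, p in IS_A.items() if p == node)
    return nbrs

def shortest_path(a, b):
    """Breadth-first search over the undirected is-a graph."""
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for n in _neighbours(node):
            if n not in seen:
                seen.add(n)
                frontier.append((n, dist + 1))
    return float("inf")

def path_similarity(a, b):
    return 1.0 / (1.0 + shortest_path(a, b))

print(path_similarity("myocardial_infarction", "angina"))  # siblings: 1/3
print(path_similarity("rare_condition_x", "angina"))        # primitive concept: 1/5
```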

Journal Article
TL;DR: An improvement to the existing Test Case Prioritization (TCP) technique for more effective regression testing is proposed, using a weight-hybrid string distance and prioritization with particle swarm optimization (PSO); results show that the weight-hybrid string distance is capable of improving APFD values.
Abstract: Regression testing is concerned with testing the modified version of software. However, re-testing all test cases requires significant cost and time. To reduce that cost and time, a higher average percentage of fault detection (APFD) and faster execution to kill fault mutants are required. Therefore, to achieve these two requirements, an improvement to the existing Test Case Prioritization (TCP) technique for more effective regression testing is offered: a weight-hybrid string distance technique with prioritization using particle swarm optimization (PSO). The distance between test cases, the weight of each test case, and the hybridization of both values into the weight-hybrid string distance are calculated. The experiment was evaluated on the Siemens dataset. The results show that the weight-hybrid string distance is capable of improving APFD values: the APFD value for hybrid TFIDF-JC is 97.37%, the highest improvement at 4.74% over non-hybrid JC. Meanwhile, for the percentage of test cases needed to kill 100% of fault mutants, hybrid TFIDF-M yields the lowest value, 22.88%, a 76% improvement over its non-hybrid string distance.

4 citations


Cites background or methods from "A Survey of Text Similarity Approac..."

  • ...String distances were measured based on string or term arrangements and also their character sequences [11]....


  • ...The String Distance for TCP: String distances were measured based on string or term arrangements and also their character sequences [11]....


  • ...There are two types of string distances: character-based and term-based [11]....


  • ...Jaccard, on the other hand, is calculated based on the number of mutual terms compared to the number of all unique terms in both strings [11]....

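The Jaccard measure quoted in the contexts above can be sketched in a few lines; whitespace tokenisation and lower-casing are simplifying assumptions.

```python
# Jaccard similarity as quoted above: shared terms over all unique terms.
# Whitespace tokenisation and lower-casing are simplifying assumptions.

def jaccard(s1: str, s2: str) -> float:
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return len(t1 & t2) / len(t1 | t2)

print(jaccard("open the file", "open the folder"))  # 2 shared / 4 unique = 0.5
```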

01 Jan 2018
TL;DR: Overall this thesis finds that novel computational methods can be used to detect knowledge transfer, but that further advancements in terms of technical tools and methods are needed to improve their performance and feasibility.
Abstract: Universities face increasing demands for active dissemination of their research results and are expected to contribute to knowledge development in their socioeconomic environment. Universities are expected to be key drivers promoting economic development and innovation. Consequently, knowledge dissemination, as a crucial aspect of industrial development and innovation, is politically highly desired and has become a focus area for public funding of university research. Scholars, policy makers and practitioners have picked up on this increasing demand for understanding university contributions and investigate collaboration and knowledge transfer between universities and industry. However, some elements in the interaction between universities and industry that contribute to its effectiveness still remain largely unknown. Questions remain especially regarding the knowledge transfer channels and measurements of successful knowledge dissemination. The overarching aim of this PhD project is to identify novel potential measures for university-industry knowledge transfer through specifically chosen and adapted computational methods, thereby contributing to the understanding of university research knowledge transfer. First, publication data from a single technical university's publication database were analysed regarding their distributions and ratios in different dimensions, such as publication types, research fields, etc. Additionally, the coverage of long-standing established publication databases was taken into consideration. The results showed that the traditional databases have skewed coverage, and that novel or less traditional outcomes of research (output that is not a journal article or a book chapter) often may be significantly underrepresented. This shows that additional data can significantly increase the insights into certain aspects of university research. In the second part of the PhD project, a novel approach for detecting knowledge transfer was developed and used to trace content from university research in companies. Text mining applications were used to detect content from academic publication abstracts on company websites. The findings show that the detection of common content between universities and industry via text mining applications is possible and beneficial. In the final part of the PhD project the methods are applied to investigate the impact of Open Access publications on knowledge transfer. Using the text mining methods, I examine the differences between subscription-based and Open Access publications, assuming that the accessibility of a written item implies a different performance in terms of knowledge transfer. Here the results show that for this specific measure Open Access publishing makes a difference in terms of university-industry knowledge transfer. Given the contemporary positive assumptions regarding Open Access publications, the differences appear less pronounced than expected. Overall, this thesis finds that novel computational methods can be used to detect knowledge transfer, but that further advancements in terms of technical tools and methods are needed to improve their performance and feasibility.

4 citations

Posted Content
TL;DR: This work considers both the single change-point alternative and the changed-interval alternative, and derives analytic formulas to control the Type I error for the new methods, making them fast to apply to large datasets.
Abstract: In the regime of change-point detection, a nonparametric framework based on scan statistics utilizing graphs representing similarities among observations is gaining attention due to its flexibility and good performance for high-dimensional and non-Euclidean data sequences, which are ubiquitous in this big data era. However, this graph-based framework encounters problems when there are repeated observations in the sequence, which often happens for discrete data, such as network data. In this work, we extend the graph-based framework to solve this problem by averaging or taking the union of all possible "optimal" graphs resulting from repeated observations. We consider both the single change-point alternative and the changed-interval alternative, and derive analytic formulas to control the Type I error for the new methods, making them fast to apply to large datasets. The extended methods are illustrated on an application in detecting changes in a sequence of dynamic networks over time.

4 citations

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: This reference presents the WordNet lexical database (nouns, modifiers and a semantic network of English verbs), its design and implementation, and applications of WordNet such as building semantic concordances.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.

13,049 citations
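Since the survey's Knowledge-Based measures draw on the WordNet semantic network described above, a short usage sketch may help; it assumes NLTK is installed with its WordNet corpus downloaded, and the chosen synsets are arbitrary examples.

```python
# Knowledge-Based similarity over the WordNet semantic network via NLTK.
# Assumes NLTK is installed and the WordNet corpus has been downloaded:
#   import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
car = wn.synset("car.n.01")

# Path similarity: based on the shortest path between synsets in the is-a taxonomy.
print(dog.path_similarity(cat))  # relatively high
print(dog.path_similarity(car))  # lower

# Wu-Palmer similarity: uses the depths of the synsets and of their least common subsumer.
print(dog.wup_similarity(cat))
```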

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations
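A minimal sketch of the global alignment scoring this reference introduced (the Needleman-Wunsch dynamic program); the +1/-1/-1 scoring scheme is an illustrative assumption, as the original method allows arbitrary substitution and gap penalties.

```python
# Simplified sketch of Needleman-Wunsch global alignment scoring on two sequences.
# The match/mismatch/gap scores (+1 / -1 / -1) are illustrative assumptions.

def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Return the optimal global alignment score (dynamic programming)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # small toy example
```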

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

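The definition quoted above translates directly into code; this sketch assumes whitespace tokenisation of the strings into term sets.

```python
# Dice's coefficient as defined above: twice the number of common terms
# divided by the total number of terms in both strings.
# Whitespace tokenisation into term sets is a simplifying assumption.

def dice(s1: str, s2: str) -> float:
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

print(dice("night is dark", "the night is young"))  # 2*2 / (3+4) ~ 0.57
```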

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

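In the spirit of the local alignment this reference describes (the Smith-Waterman extension of Needleman-Wunsch), the following sketch clamps cell scores at zero so that the best-scoring segment pair is found even in otherwise dissimilar sequences; the +2/-1/-1 scoring is an illustrative assumption.

```python
# Sketch of local alignment scoring in the Smith-Waterman style: like global
# alignment, but cell scores are clamped at zero so the best-scoring *segment pair*
# is found. The +2 / -1 / -1 scoring scheme is an illustrative assumption.

def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Return the best local alignment score between any segment of a and of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

# Dissimilar strings that share a similar region, as described above.
print(local_alignment_score("xxxxSIMILARITYyyyy", "aaaaSIMILARITYbbbb"))
```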

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

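Finally, the core LSA step described in the reference above (a term-document matrix reduced by a truncated SVD, after which terms are compared in the latent space) can be sketched in a few lines of NumPy; the tiny corpus, raw counts instead of weighted counts, and k = 2 dimensions are illustrative assumptions, since the reference suggests a few hundred dimensions for real corpora.

```python
# Sketch of the core LSA step: build a term-document co-occurrence matrix,
# reduce it with a truncated SVD, and compare terms in the reduced space.
# The tiny corpus and k = 2 dimensions are illustrative assumptions.
import numpy as np

docs = [
    "human machine interface for computer applications",
    "survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
    "graph of paths in trees",
    "intersection graph of paths in trees",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[index[w], j] += 1

# Truncated SVD: keep the k largest singular values.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
term_vecs = U[:, :k] * S[:k]  # terms in the latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

print(cosine(term_vecs[index["user"]], term_vecs[index["response"]]))  # co-occurring terms
print(cosine(term_vecs[index["user"]], term_vecs[index["trees"]]))     # unrelated terms
```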