Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented. General Terms: Text Mining, Natural Language Processing. Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity. 1. INTRODUCTION Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison.
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type will be presented briefly. This paper is organized as follows: section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
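As an illustration of the character-based end of the String-Based family described above, a minimal sketch of the Levenshtein edit distance might look as follows; the function and example strings are our own, not taken from the survey:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete ca
                            curr[j - 1] + 1,          # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

A smaller distance means the strings are more similar lexically; many term-based measures are built on top of such character-level metrics.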


Citations
Proceedings ArticleDOI
01 Nov 2016
TL;DR: The architecture for applying Malik Bennabi's ruler to the Intellectual Property of Islamic Finance and Banking will be implemented in order to measure all journal articles' content and group them into two categories, namely Useful and Harmful.
Abstract: This paper presents our work in designing a system architecture for applying Malik Bennabi's ruler to Intellectual Property (IP) of Islamic Finance and Banking. We chose the journal article as one form of IP for Islamic Finance and Banking. There are several components in the architecture, starting from IP retrieval, structure extraction, structure classification, structure summarization, and knowledge-based similarity measuring. The architecture will be implemented in order to measure all journal articles' content and group them into two categories, namely Useful and Harmful.

2 citations

Journal ArticleDOI
15 Apr 2021
TL;DR: This work uses the lens of text analytics tools based on machine learning techniques to investigate a number of questions of interest to scholars of this and related traditions of the Great Perfection.
Abstract: Over the past decade, through a mixture of optical character recognition and manual input, there is now a growing corpus of Tibetan literature available as e-texts in Unicode format. With the creation of such a corpus, the techniques of text analytics that have been applied in the analysis of English and other modern languages may now be applied to Tibetan. In this work, we narrow our focus to examine a modest portion of that literature, the Mind-section portion of the literature of the Tibetan tradition of the Great Perfection. Here, we will use the lens of text analytics tools based on machine learning techniques to investigate a number of questions of interest to scholars of this and related traditions of the Great Perfection. It has been necessary for us to participate in all portions of this process: corpora identification and text edition selection, rendering the text as e-texts in Unicode using both Optical Character Recognition and manual entry, data cleaning and transformation, implementation of software for text analysis, and interpretation of results. For this reason, we hope this study can serve as a model for other low-resource languages that are just beginning to approach the problem of providing text analytics for their language.

2 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...The use of n-gram frequencies as input features to authorship attribution models has been proposed by References [11, 50] and as a simple method capable of capturing both lexical content and local context....


  • ...Inverse document frequency, introduced by Reference [57] as term specificity, is a method of weighting term frequency values by the rarity of terms across all documents in the corpus....


  • ...The origins of this influence are easy to understand when one imagines the challenges of presenting a coherent curriculum of philosophical study to students over a period of 15 or more years of systematic religious training (cf. Reference [10] for an example curriculum)....


  • ...In Reference [19], a distinction is made between string-based, corpus-based, and knowledge-based similarity metrics....


  • ...Reference [50] describes a tradeoff in setting the n-gram order, n, which denotes the length of n-grams to examine....

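The inverse document frequency weighting mentioned in the excerpts above can be sketched in a few lines; the toy corpus below is hypothetical:

```python
import math

def idf(term: str, docs: list) -> float:
    """Inverse document frequency: log of corpus size over the number of
    documents containing the term. Rare terms get higher weights."""
    df = sum(term in doc for doc in docs)   # document frequency
    return math.log(len(docs) / df) if df else 0.0

docs = [{"the", "cat", "sat"}, {"the", "dog", "ran"}, {"a", "cat", "ran"}]
print(idf("the", docs))  # common term → low weight (log 3/2 ≈ 0.405)
print(idf("dog", docs))  # rare term → higher weight (log 3 ≈ 1.099)
```

Multiplying term frequencies by such weights (tf-idf) downplays terms that appear everywhere and so carry little specificity.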

Journal ArticleDOI
TL;DR: Two learnable string distance metrics for two kinds of ER problems are explored by employing principal component analysis and the largest margin nearest neighbor algorithm for training, showing that these approaches can improve entity resolution accuracy over traditional techniques.
Abstract: Entity resolution (ER) is to find database records that refer to the same real-world entity. A key component of ER is choosing a proper distance (similarity) function for each database field to quantify the similarity of records. Most existing ER approaches focus on how to define a proper matching rule based on generic or hand-crafted distance metrics. In this paper, we explore two learnable string distance metrics for two kinds of ER problems by employing principal component analysis and the largest margin nearest neighbor algorithm for training. Experimental results on real data sets show that our approaches can improve entity resolution accuracy over traditional techniques.

2 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...We choose the six most common distance metrics for entity resolution, including three character-based metrics, which are Q-Gram, Jaro and Levenshtein, and three token-based metrics, which are Overlap Coefficient, Cosine and Jaccard....


  • ...We use the Overlap Coefficient [3] to compute the similarity of record pairs, and the similarity result is as follows: the similarities of matching pairs m1 and m2 are 0.9 and 1 respectively, and the similarities of nonmatching pairs n1-n4 are 0.35, 0.33, 0.54 and 0.54, respectively....


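The Overlap Coefficient used in the excerpt above, together with Jaccard from the same token-based family, can be sketched as follows; the example record strings are hypothetical:

```python
def overlap_coefficient(a: set, b: set) -> float:
    """Size of the intersection over the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))

def jaccard(a: set, b: set) -> float:
    """Size of the intersection over the size of the union."""
    return len(a & b) / len(a | b)

s1 = set("john smith".split())
s2 = set("john q smith".split())
print(overlap_coefficient(s1, s2))  # → 1.0 (s1 is fully contained in s2)
print(jaccard(s1, s2))              # → 0.666... (2 common / 3 total tokens)
```

The Overlap Coefficient rewards containment, which is why it can score a short record highly against a longer variant of the same entity where Jaccard would not.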

Proceedings ArticleDOI
01 Oct 2019
TL;DR: These two algorithms are applied in an automatic system to assess Japanese language exams, with close results between them: the average accuracy of the Winnowing algorithm is only 1.06% lower than that of LSA, which reached 87.78%. Both algorithms should be suitable for grading short essay answers in Japanese.
Abstract: In this paper, further research on an e-learning application for short essay grading is presented. The system was developed for the needs of the Japanese Language study program, whose short essay examinations require time and focus to grade. Human capacity is limited, so the objectivity of cognitive assessment can decrease as time elapses. Latent Semantic Analysis (LSA) and the Winnowing algorithm are the two methods used in developing the automatic short essay answer grading system called SIMPLE-O by the Department of Electrical Engineering, Universitas Indonesia. These two algorithms were chosen for their ability to perform semantic analysis without needing knowledge of the characteristics of the language. LSA uses Singular Value Decomposition (SVD) as its main method, whereas the Winnowing algorithm is based on fingerprinting. Both algorithms were applied in an automatic system to assess Japanese language exams, with close results: the average accuracy of the Winnowing algorithm is only 1.06% lower than that of LSA, which reached 87.78%. Both algorithms should be suitable for grading short essay answers in Japanese.
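The fingerprinting idea behind the Winnowing algorithm mentioned above can be sketched as follows; this is a simplified illustration (the parameters k and w and the use of Python's built-in hash are our choices, not the SIMPLE-O implementation):

```python
def winnow(text: str, k: int = 5, w: int = 4) -> set:
    """Winnowing fingerprints: hash every k-gram of the text, then keep
    the minimum hash from each sliding window of w consecutive hashes."""
    hashes = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two fingerprint sets."""
    fa, fb = winnow(a), winnow(b)
    return len(fa & fb) / len(fa | fb)

t = "the quick brown fox jumps over the lazy dog"
print(similarity(t, t))  # identical texts share all fingerprints → 1.0
```

Because only a small set of selected hashes is compared rather than the full texts, fingerprint matching stays cheap even for long answers.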

2 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Other than PEG, C-rater, E-rater, and Latent Semantic Analysis (LSA) are some other applications developed for automatic essay assessment [2][3]....


Journal ArticleDOI
03 Sep 2020
TL;DR: This paper provides an overview of textual similarity in the literature; many approaches for measuring textual similarity in Arabic text are reviewed and compared.
Abstract: Survey research is appropriate and necessary to address certain types of research questions. This paper aims to provide a general overview of textual similarity in the literature. Measuring textual similarity plays an increasingly important role in related topics like text classification, recovery of specific information from data, clustering, topic retrieval, subject tracking, question answering, essay grading, summarization, and the nowadays trending Conversational Agents (CA), programs that interact with humans through natural language conversation. Finding the similarity between terms is the essential portion of textual similarity, which is then used as a major phase for sentence-level, paragraph-level, and script-level similarities. In particular, we are concerned with textual similarity in Arabic. Applying Natural Language Processing (NLP) tasks to the Arabic language is very challenging indeed, as it has many characteristics that pose difficulties. Many approaches for measuring textual similarity in Arabic text are reviewed and compared in this paper.

2 citations

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The lexical database (nouns, modifiers and verbs in WordNet), the design and implementation of the WordNet database and searching software, and applications of WordNet such as building semantic concordances are presented.
Abstract: Part 1 The lexical database: nouns in WordNet, George A. Miller modifiers in WordNet, Katherine J. Miller a semantic network of English verbs, Christiane Fellbaum design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst representing verb alternations in WordNet, Karen T. Kohl et al the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3 Applications of WordNet: building semantic concordances, Shari Landes et al performance and confidence in a semantic annotation task, Christiane Fellbaum et al WordNet and class-based probabilities, Philip Resnik combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow using WordNet for text retrieval, Ellen M. Voorhees lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I. Moldovan appendix - obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

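Dice's coefficient as defined in the excerpt above can be sketched directly; the example sets are hypothetical:

```python
def dice(a: set, b: set) -> float:
    """Twice the number of common terms divided by the total number of
    terms in both sets."""
    return 2 * len(a & b) / (len(a) + len(b))

# On word sets the two strings share nothing; on character sets they overlap.
print(dice({"night"}, {"nacht"}))        # → 0.0 (no common terms)
print(dice(set("night"), set("nacht")))  # → 0.6 (common characters: n, h, t)
```

The same formula applies whether the "terms" are words, characters, or n-grams; only the tokenization changes.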

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

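The local-alignment behavior described in the excerpt above (the Smith-Waterman extension of Needleman-Wunsch) can be sketched as a score-only dynamic program; the scoring parameters below are illustrative, not from the original paper:

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Best local alignment score between a and b. Cells are floored at 0,
    so only the most similar regions of otherwise dissimilar sequences
    contribute to the score."""
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = rows[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            rows[i][j] = max(0, diag, rows[i - 1][j] + gap, rows[i][j - 1] + gap)
            best = max(best, rows[i][j])
    return best

print(smith_waterman("xxABCDyy", "zzABCDww"))  # → 8: shared region "ABCD"
```

The reset-to-zero step is what makes the algorithm useful for dissimilar sequences containing similar motifs, as the excerpt notes.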

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

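The LSA mechanics referred to above (a term-document co-occurrence matrix reduced by truncated SVD) can be sketched on a toy example; the terms, counts, and the choice of k = 2 are all hypothetical, whereas real LSA uses large corpora and around 300 dimensions:

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents).
terms = ["car", "auto", "engine", "flower", "petal"]
X = np.array([
    [2, 1, 0],   # car
    [1, 2, 0],   # auto
    [1, 1, 0],   # engine
    [0, 0, 2],   # flower
    [0, 0, 1],   # petal
], dtype=float)

# Truncated SVD keeps only the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # term representations in the latent space

def cos(u, v):
    """Cosine similarity between two latent term vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "car" and "auto" occur in similar documents, so their latent vectors
# align; "flower" lives in a different region of the space.
print(cos(term_vecs[0], term_vecs[1]) > cos(term_vecs[0], term_vecs[3]))  # → True
```

This is the sense in which LSA assumes that words close in meaning occur in similar pieces of text: the SVD compresses shared co-occurrence patterns into common latent dimensions.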