Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013 · International Journal of Computer Applications (Foundation of Computer Science (FCS)) · Vol. 68, Iss. 13, pp. 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION

Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms.

String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly.

This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.


Citations
Book ChapterDOI
09 Dec 2020
TL;DR: In this paper, the authors proposed an approach that uses machine learning models with seven character-based similarity measures to classify texts based on similarity and found that the trained Neural Networks model gives the best mean accuracy (96%) in detecting similarity between two text bodies.
Abstract: Text similarity detection is one of the significant research problems in the Natural Language Processing field. In this paper, we propose an approach that uses machine learning models with seven character-based similarity measures to classify texts based on similarity. For this purpose, we use the character-based similarity measures (Longest Common Substring, Longest Common Subsequence, the Ratcliff/Obershelp algorithm, Jaro, Jaro-Winkler, Levenshtein, and Damerau-Levenshtein distances) as inputs to supervised machine learning algorithms. For the similarity detection task, news articles are collected from Azerbaijani news websites, and 9600 text pairs are created and manually labeled as similar or non-similar. These text pairs are processed by the similarity measures to feed machine learning algorithms: Support Vector Machine, Random Forest and Multi-layer Perceptron Neural Network. We performed a 10-fold cross-validation process on the dataset and found that the trained Neural Network model gives the best mean accuracy (96%) in detecting similarity between two text bodies. We demonstrated that our proposed method outperforms the results obtained from any individual character-based similarity measure.
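The pipeline described above is easy to picture in code. The following is a minimal sketch, not the authors' implementation: it computes two of the seven character-based measures (Levenshtein, hand-rolled here, and Ratcliff/Obershelp via Python's standard difflib) as features for a scikit-learn Random Forest; the toy text pairs are invented stand-ins for the labeled Azerbaijani news dataset.

from difflib import SequenceMatcher          # Ratcliff/Obershelp similarity
from sklearn.ensemble import RandomForestClassifier

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def features(a: str, b: str) -> list:
    """Two of the paper's seven measures; the others plug in the same way."""
    lev_sim = 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
    ro_sim = SequenceMatcher(None, a, b).ratio()
    return [lev_sim, ro_sim]

# Invented toy pairs standing in for the 9600 labeled news pairs.
pairs = [("breaking news on oil prices", "oil prices: breaking news", 1),
         ("local football results", "parliament passes new budget", 0)]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(clf.predict([features("oil price news", "news about oil prices")]))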

2 citations

Proceedings ArticleDOI
01 Dec 2019
TL;DR: A simulator intended to accelerate experimentation with workflows for extracting Darwin Core (DC) terms from images of specimens; it adds human-in-the-loop capabilities for iterative IE and research on optimal methods.
Abstract: In the last decade, institutions from around the world have implemented initiatives for digitizing biological collections (biocollections) and sharing their information online. The transcription of the metadata from photographs of specimens' labels is performed through human-centered approaches (e.g., crowdsourcing) because fully automated Information Extraction (IE) methods still generate a significant number of errors. The integration of human and machine tasks has been proposed to accelerate IE from the billions of specimens waiting to be digitized. Nevertheless, in order to conduct research and try new techniques, IE practitioners need to prepare sets of images, design crowdsourcing experiments, recruit volunteers, process the transcriptions, generate ground truth values, program automated methods, etc. These research resources and processes require time and effort to be developed and architected into a functional system. In this paper, we present a simulator intended to accelerate the ability to experiment with workflows for extracting Darwin Core (DC) terms from images of specimens. The so-called HuMaIN Simulator includes the engine, the human-machine IE workflows for three DC terms, the code of the automated IE methods, crowdsourced and ground truth transcriptions of the DC terms of three biocollections, and several experiments that exemplify its potential use. The simulator adds human-in-the-loop capabilities for iterative IE and research on optimal methods. Its practical design permits the quick definition, customization, and implementation of experimental IE scenarios.

2 citations


Cites methods from "A Survey of Text Similarity Approaches"

  • ...Quality is computed as the Damerau-Levenshtein [39] similarity of the extracted Event-date to the ground truth values (see the sketch after this excerpt)....

    [...]
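The quality measure mentioned in the excerpt above reduces to a normalized Damerau-Levenshtein similarity. A minimal sketch of the restricted (optimal string alignment) variant follows; the function names and example dates are ours, not taken from the cited paper.

def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def dl_similarity(extracted: str, truth: str) -> float:
    """Normalize the distance into a [0, 1] quality score."""
    longest = max(len(extracted), len(truth), 1)
    return 1.0 - damerau_levenshtein(extracted, truth) / longest

print(dl_similarity("1987-05-12", "1987-05-21"))  # one transposition -> 0.9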

Journal ArticleDOI
TL;DR: A proposal to automatically generate the dialogue rules from a dialogue corpus through the use of evolving algorithms and adapt the rules according to the detected user intention, which is an efficient way for adapting a set of dialogue rules considering user utterance clusters.
Abstract: Conversational systems have become an element of everyday life for billions of users who use speech-based interfaces to services, engage with personal digital assistants on smartphones, social media chatbots, or smart speakers. One of the most complex tasks in the development of these systems is to design the dialogue model: the logic that, given a user input, selects the next answer. The dialogue model must also consider mechanisms to adapt the response of the system and the interaction style according to different groups and user profiles. Rule-based systems are difficult to adapt to phenomena that were not taken into consideration at design time. However, many of the systems that are commercially available are based on rules, and so are the most widespread tools for the development of chatbots and speech interfaces. In this article, we present a proposal to: (a) automatically generate the dialogue rules from a dialogue corpus through the use of evolving algorithms, and (b) adapt the rules according to the detected user intention. We have evaluated our proposal with several conversational systems from different application domains and found that our approach provides an efficient way to adapt a set of dialogue rules based on user utterance clusters.
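The evolutionary rule generation itself cannot be reconstructed from the abstract, but the user-utterance clustering it relies on is straightforward to illustrate. The sketch below, with invented utterances, groups inputs by TF-IDF similarity using k-means so that each cluster id could index a set of dialogue rules; it illustrates the clustering idea only, not the authors' method.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = [                       # invented examples
    "book a table for two tonight",
    "reserve a table tomorrow at eight",
    "what time do you close",
    "are you open on sundays",
]
X = TfidfVectorizer().fit_transform(utterances)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for text, intent in zip(utterances, km.labels_):
    print(intent, text)  # each cluster id could select a set of dialogue rules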

2 citations


Cites methods from "A Survey of Text Similarity Approaches"

  • ...Different techniques have been proposed for this task (Gomaa & Fahmy, 2013)....

    [...]


Journal ArticleDOI
23 Oct 2017
TL;DR: In this research, the relationship between subjects was calculated based on the proximity of the primary contents of the subjects and the value was determined by calculating TF-IDF (Term Frequency Inverse Document Frequency) from each term.
Abstract: In the education world, recognizing the relationship between one subject and another is imperative. By recognizing the relationships between courses, sustainability mapping between subjects can be performed easily. Moreover, it also becomes possible to detect and reduce duplicated content across several subjects. These conveniences will benefit lecturers, students and departments: they ease the analysis and discussion between lecturers of subjects in the same domain, students can conveniently choose a group of subjects they are interested in, and departments can easily create a specialization group based on the similarity of the subjects and combine courses possessing high similarity. In this research, given a good database, the relationship between subjects was calculated based on the proximity of the primary contents of the subjects. The feature used was the term feature, whose value was determined by calculating TF-IDF (Term Frequency-Inverse Document Frequency) for each term. The proximity between subjects was then computed with the cosine similarity method. Finally, testing was done using precision, recall and accuracy. The results show that the precision and accuracy values are 90.91% and the recall value is 100%.
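The described pipeline (TF-IDF term weights plus cosine similarity between subjects) can be sketched in a few lines with scikit-learn. The course names and descriptions below are invented placeholders, not the paper's data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subjects = {                         # invented course descriptions
    "Databases": "relational model sql normalization transactions indexing",
    "Data Mining": "clustering classification association rules preprocessing sql",
    "Algorithms": "sorting graphs dynamic programming complexity analysis",
}
X = TfidfVectorizer().fit_transform(subjects.values())
sim = cosine_similarity(X)           # pairwise subject-to-subject similarity
names = list(subjects)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} ~ {names[j]}: {sim[i, j]:.2f}")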

2 citations

References
Journal ArticleDOI
01 Sep 2000 · Language
TL;DR: An edited volume describing the WordNet lexical database (nouns, modifiers and verbs), its design and implementation, methods for extending and formalizing it, and applications such as building semantic concordances, word sense identification and text retrieval.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.
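This reference is the Needleman-Wunsch global alignment algorithm that gives the survey one of its keywords. A minimal sketch follows, using a simplified unit scoring scheme rather than the paper's amino-acid weights.

def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Global alignment score over the *whole* of both sequences."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 under this unit scheme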

11,844 citations

Journal ArticleDOI
01 Jul 1945 · Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice's coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11] (see the sketch after this excerpt)....

    [...]
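The definition quoted above translates directly into code. A minimal token-based sketch (the function name is ours):

def dice_coefficient(s1: str, s2: str) -> float:
    """Twice the number of shared terms over the total number of terms."""
    t1, t2 = set(s1.split()), set(s2.split())
    total = len(t1) + len(t2)
    return 2 * len(t1 & t2) / total if total else 0.0

print(dice_coefficient("night nurse", "night shift nurse"))  # 2*2/5 = 0.8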

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
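This reference is the Smith-Waterman local alignment algorithm. It differs from the global Needleman-Wunsch sketch above in two ways: cell scores are floored at zero and the answer is the maximum over all cells, which is what lets it find similar segments inside otherwise dissimilar sequences. A minimal sketch with a simplified scoring scheme:

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Best *local* alignment score: cells floor at zero and the answer
    is the maximum cell, i.e. the most similar pair of segments."""
    n, m = len(a), len(b)
    best = 0
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best

# Dissimilar strings that share one region: the shared segment sets the score.
print(smith_waterman("xxxMOTIFyyy", "aaaMOTIFbbb"))  # 2 * len("MOTIF") = 10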

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
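The mechanics described above (a term-document co-occurrence matrix reduced to a small number of latent dimensions) can be sketched with scikit-learn's truncated SVD. The corpus below is a toy stand-in, and 2 components replace the roughly 300 the abstract mentions.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [   # toy corpus; real LSA uses a large corpus and ~300 dimensions
    "the patient received a dose of medicine",
    "the doctor prescribed medicine to the patient",
    "the spacecraft entered orbit around mars",
]
X = CountVectorizer().fit_transform(docs)          # term-document counts
Z = TruncatedSVD(n_components=2).fit_transform(X)  # latent semantic space
print(cosine_similarity(Z))  # docs 0 and 1 should land far closer than either to 2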

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]