Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses existing work on text similarity by partitioning it into three approaches: String-Based, Corpus-Based and Knowledge-Based similarities. Samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-Based, Corpus-Based and Knowledge-Based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION

Text similarity measures play an increasingly important role in text-related research and applications, in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities.

Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms.

String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly.

This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.


Citations
Proceedings ArticleDOI
23 Feb 2019
TL;DR: The model facilitates an inclusive conceptualisation of attacks via the speech interface, and serves as a basis for critical analysis of the currently available defence measures.
Abstract: This paper presents a high-level model of attacks via a speech interface, and of defences against such attacks. Specifically, the paper provides a summary of different types of attacks, and of the defences available to counter them, within the framework of the OODA loop model. The model facilitates an inclusive conceptualisation of attacks via the speech interface, and serves as a basis for critical analysis of the currently available defence measures.

4 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Whilst a number of both phonetic and semantic distance measures have been developed (Pucher et al., 2007)(Gomaa and Fahmy, 2013), none of these are fully reliable in terms of their ability to separate sounds and meanings which are perceived as different by human listeners....

    [...]

Proceedings ArticleDOI
01 Jan 2016
TL;DR: An automated test assembly algorithm based on the Bee algorithm is proposed to minimize redundant questions in a test form, using a new technique called Min-SumDistance (MSD).
Abstract: An ideal test form should contain questions with different levels of difficulty and no redundant questions. This paper proposes an automated test assembly algorithm, based on the Bee algorithm, to minimize redundant questions in a test form. The neighborhood search in the Bee algorithm is improved by using a new technique called Min-SumDistance (MSD). The MSD is the distance of the considered question compared to the others in the test form. The sum of the question-pair distances indicates the redundant questions in the test form. Question content is represented in two forms, as unigrams with TF and with TF-IDF scores. The experiments use 200 questions from the Information Technology Professional Examination (ITPE). To evaluate the performance of the MSD method, we count the number of enemy pairs in the test form and compare it to the random method. The experimental results show that our proposed algorithm yields the significant numbers of redundant questions.

4 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Gomaa and Fahmy mentioned that there are many techniques proposed to measure the text similarity....

    [...]

  • ...Gomaa WH, Fahmy AA....

    [...]

  • ...Gomaa and Fahmy (20) survey the text similarity approach which separates into a string-based approach, corpus-based approach, and knowledge-based approach....

    [...]

Journal ArticleDOI
TL;DR: Wang et al. conducted an empirical analysis of click farming on the Taobao platform in China and extracted several new features from three sources (the main goods, the online shop itself, and online reviews), based on the formation mechanism of click farming.

4 citations

Journal ArticleDOI
TL;DR: Experimental results prove that the semantic role labeling and named entity recognition approaches can be used for keyword selection in an automatic multiple choice question generation system.
Abstract: In this research, an automatic multiple choice question generation system for evaluating semantic role labels and named entities is proposed. The selection of the informative sentence and the keyword to be asked about are based on the semantic labels and named entities that exist in the question sentence. The research introduces a novel method for the distractor selection process. Distractors are chosen based on a string similarity measure between sentences in the data set. Eight algorithms of string similarity measures are used in this research. The system is tested using a set of sentences extracted from the data set for question answering. Experimental results prove that the semantic role labeling and named entity recognition approaches can be used for keyword selection. String similarity measures have been used in generating the distractors in the process of automatic multiple choice questions generation. Combining the similarity measures of some algorithms led to enhancing the results.

4 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...A survey about these algorithms and text similarity approaches exists in [20]....

    [...]

Posted Content
TL;DR: In this article, a text analysis platform focused on the pharmaceutical domain is presented, which applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles, and thoroughly integrates the results obtained through a proposed methodology.
Abstract: The challenge of recognizing named entities in a given text has been a very dynamic field in recent years. This is due to the advances in neural network architectures, increase of computing power and the availability of diverse labeled datasets, which deliver pre-trained, highly accurate models. These tasks are generally focused on tagging common entities, but domain-specific use-cases require tagging custom entities which are not part of the pre-trained models. This can be solved by either fine-tuning the pre-trained models, or by training custom models. The main challenge lies in obtaining reliable labeled training and test datasets, and manual labeling would be a highly tedious task. In this paper we present PharmKE, a text analysis platform focused on the pharmaceutical domain, which applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles. It performs text classification using state-of-the-art transfer learning models, and thoroughly integrates the results obtained through a proposed methodology. The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks, centered on the pharmaceutical domain. The obtained results are compared to the fine-tuned BERT and BioBERT models trained on the same dataset. Additionally, the PharmKE platform integrates the results obtained from named entity recognition tasks to resolve co-references of entities and analyze the semantic relations in every sentence, thus setting up a baseline for additional text analysis tasks, such as question answering and fact extraction. The recognized entities are also used to expand the knowledge graph generated by DBpedia Spotlight for a given pharmaceutical text.

4 citations

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The WordNet lexical database (nouns, modifiers, and a semantic network of English verbs) and applications of WordNet, such as building semantic concordances, are presented.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan). Appendix: obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

    [...]
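
The definition of Dice's coefficient quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: terms are taken to be lowercase whitespace-separated tokens, each string is treated as a set of terms (so repeated terms count once), and the function name `dice_coefficient` is hypothetical rather than taken from the survey.

```python
def dice_coefficient(a: str, b: str) -> float:
    """Dice's coefficient: 2 * |common terms| / (|terms in a| + |terms in b|).

    Assumption: terms are lowercase whitespace-separated tokens,
    and each string is treated as a set of terms.
    """
    terms_a = set(a.lower().split())
    terms_b = set(b.lower().split())
    if not terms_a and not terms_b:
        # Convention chosen for this sketch: two empty strings are identical.
        return 1.0
    common = len(terms_a & terms_b)
    return 2.0 * common / (len(terms_a) + len(terms_b))
```

For example, "the cat sat" and "the cat ran" share two of their three terms each, so the coefficient is 2 * 2 / (3 + 3), roughly 0.67. Counting repeated terms (a multiset) or using character bigrams instead of words would be equally reasonable readings of "terms" and would give different values.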

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]
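
The local-alignment idea described above (commonly known as the Smith-Waterman algorithm) can be sketched as a small dynamic program. This is a minimal sketch, not the survey's or the original letter's formulation: it assumes a linear gap penalty, the scoring values `match`, `mismatch` and `gap` are illustrative defaults, and only the best local score is returned rather than the aligned segments.

```python
def smith_waterman(s: str, t: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between s and t.

    Assumptions: linear gap penalty and illustrative scoring values.
    """
    rows, cols = len(s) + 1, len(t) + 1
    # H[i][j] is the best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in t
            left = H[i][j - 1] + gap    # gap in s
            # Clamping at 0 lets an alignment restart anywhere, which is
            # what makes the alignment local rather than global.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

With these default scores, two otherwise unrelated strings that share the region "abc" (for instance "xxabcxx" and "yyabcyy") score 6, the value of three matches, because the mismatching flanks never pull the score below zero.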

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]
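
The LSA excerpts above can be illustrated with a toy example: build a term-by-document count matrix, reduce it with a truncated singular value decomposition, and compare terms by cosine similarity in the reduced space. This is a rough sketch under several assumptions: it uses NumPy, raw counts with no term weighting, a made-up four-term corpus, and k = 2 dimensions (the abstract above mentions around 300 for real corpora). The helper names `lsa_term_vectors` and `cosine` are hypothetical.

```python
import numpy as np

def lsa_term_vectors(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Project terms into a k-dimensional latent space via truncated SVD."""
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    # Each row is a term vector: left singular vectors scaled by singular values.
    return U[:, :k] * s[:k]

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Made-up corpus: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "banana"]
counts = np.array([
    [1, 0, 0],   # car:        document 0 only
    [0, 1, 0],   # automobile: document 1 only
    [1, 1, 0],   # engine:     documents 0 and 1
    [0, 0, 2],   # banana:     document 2 only
], dtype=float)

vectors = lsa_term_vectors(counts, k=2)
raw_similarity = cosine(counts[0], counts[1])        # car vs automobile, raw counts
latent_similarity = cosine(vectors[0], vectors[1])   # car vs automobile, latent space
```

In this toy corpus "car" and "automobile" never appear in the same document, so their raw count vectors are orthogonal, yet both co-occur with "engine". After the reduction their latent vectors point in the same direction, while "banana" stays orthogonal to both, which is the kind of indirect association LSA is meant to capture.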