Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented. General Terms: Text Mining, Natural Language Processing. Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity. 1. INTRODUCTION Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or if one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
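As a concrete illustration of the character-based String-Based measures surveyed here, the following is a minimal sketch of Levenshtein edit distance, one classic string metric of this family (the function and test strings are illustrative, not taken from the survey):

    # Levenshtein edit distance: a classic character-based string metric
    # of the String-Based family the survey discusses.
    def levenshtein(a: str, b: str) -> int:
        """Single-character insertions, deletions and substitutions
        needed to turn string a into string b."""
        prev = list(range(len(b) + 1))            # distances "" -> b[:j]
        for i, ca in enumerate(a, start=1):
            curr = [i]                            # distance a[:i] -> ""
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3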


Citations
Journal ArticleDOI
TL;DR: The goal of Sub-Sync is to address the lack of synchronism that occurs in subtitles produced during the broadcast of live TV programs or other programs that have some improvised parts, and to resolve the problematic interleaving of synchronized and unsynchronized parts of mixed-type programs.
Abstract: Individuals with sensory impairment (hearing or visual) encounter serious communication barriers within society and the world around them. These barriers hinder the communication process and make access to information an obstacle they must overcome on a daily basis. In this context, one of the most common complaints made by Television (TV) users with sensory impairment is the lack of synchronism between audio and subtitles in some types of programs. In addition, synchronization remains one of the most significant factors in audience perception of quality in live-originated TV subtitles for the deaf and hard of hearing. This paper introduces the Sub-Sync framework intended for use in automatic synchronization of audio-visual contents and subtitles, taking advantage of current well-known techniques for symbol sequence alignment. In this particular case, the symbol sequences are the subtitles produced by the broadcaster subtitling system and the word flow generated by an automatic speech recognition procedure. The goal of Sub-Sync is to address the lack of synchronism that occurs in subtitles produced during the broadcast of live TV programs or other programs that have some improvised parts. Furthermore, it also aims to resolve the problematic interleaving of synchronized and unsynchronized parts of mixed-type programs. In addition, the framework is able to synchronize the subtitles even when they do not correspond literally to the original audio and/or the audio cannot be completely transcribed by an automatic process. Sub-Sync has been successfully tested in different live broadcasts, including mixed programs, in which the synchronized parts (recorded, scripted) are interspersed with desynchronized (improvised) ones.

11 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...For this reason, Needleman-Wunsch [11] [12]–[14] was applied in order to avoid delays in the computational time....

    [...]
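The Needleman-Wunsch algorithm cited in the snippet above computes a global alignment by dynamic programming; below is a minimal sketch (the match/mismatch/gap scores are illustrative assumptions, not the settings Sub-Sync actually uses):

    # Minimal Needleman-Wunsch global alignment score. The scoring values
    # are illustrative assumptions, not Sub-Sync's actual configuration.
    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
        n, m = len(a), len(b)
        # score[i][j] = best score aligning a[:i] with b[:j]
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
        return score[n][m]

    # Works on any sequences, e.g. subtitle tokens vs. ASR word tokens:
    print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with these scores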

Proceedings Article
01 Mar 2016
TL;DR: APSyn, as discussed by the authors, evaluates the extent of the intersection among the most associated contexts of two target words, weighting such intersection according to the rank of the shared contexts in the dependency ranked lists.
Abstract: In this paper, we claim that Vector Cosine, which is generally considered one of the most efficient unsupervised measures for identifying word similarity in Vector Space Models, can be outperformed by a completely unsupervised measure that evaluates the extent of the intersection among the most associated contexts of two target words, weighting such intersection according to the rank of the shared contexts in the dependency ranked lists. This claim comes from the hypothesis that similar words do not simply occur in similar contexts, but share a larger portion of their most relevant contexts compared to other related words. To prove it, we describe and evaluate APSyn, a variant of Average Precision that, independently of the adopted parameters, outperforms the Vector Cosine and co-occurrence on the ESL and TOEFL test sets. In the best setting, APSyn reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy on the TOEFL dataset, thereby beating the non-English US college applicants (whose average, as reported in the literature, is 64.50%) and several state-of-the-art approaches.
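A minimal sketch of the rank-weighted intersection APSyn computes, assuming each word's contexts arrive already sorted by an association measure (that ranking step, done over dependency contexts in the paper, is assumed here):

    # APSyn sketch: each context shared by the two words contributes the
    # inverse of its average rank in the two dependency-ranked lists.
    def apsyn(contexts_a, contexts_b, top_n=1000):
        # rank 1 = most associated context
        rank_a = {c: r for r, c in enumerate(contexts_a[:top_n], start=1)}
        rank_b = {c: r for r, c in enumerate(contexts_b[:top_n], start=1)}
        shared = rank_a.keys() & rank_b.keys()
        return sum(1.0 / ((rank_a[c] + rank_b[c]) / 2.0) for c in shared)

    # Toy usage: shared, highly ranked contexts dominate the score.
    print(apsyn(["drink", "cup", "hot"], ["drink", "cup", "milk"]))  # 1.5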

11 citations

Proceedings ArticleDOI
01 Sep 2018
TL;DR: The conclusion of this research is that dynamic gram similarity can be implemented in an educational-game agent and can answer English grammar questions automatically.
Abstract: Today, almost everyone likes games. A game can make people have fun while playing it. An exciting game must have many features and menus, especially an artificial intelligence (AI) component called an enemy or agent. An agent in a game can be built using any method that makes it smart and intelligent. In an educational game, a smart agent must be able to choose and decide on the right answer to a question. In this research, we build an agent to answer English questions using dynamic gram similarity. This method is based on the n-gram similarity method but can also check neighboring words in a sentence. The results of this research indicate that dynamic gram similarity reaches 66% accuracy, answering more accurately than unigram, bigram and trigram similarity. The conclusion of this research is that dynamic gram similarity can be implemented in an educational-game agent to answer English grammar questions automatically.
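The paper does not spell out the dynamic gram algorithm itself; the sketch below implements only the plain word n-gram overlap baseline it compares against (unigram, bigram, trigram), with an illustrative similarity normalization:

    # Plain word n-gram overlap, the unigram/bigram/trigram baseline the
    # paper compares against; the "dynamic gram" variant (which also
    # checks neighboring words) is not specified in enough detail here.
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def ngram_similarity(s1, s2, n=2):
        a, b = ngrams(s1.lower().split(), n), ngrams(s2.lower().split(), n)
        if not a or not b:
            return 0.0
        return len(a & b) / max(len(a), len(b))

    print(ngram_similarity("she goes to school", "she goes to the school"))  # 0.5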

10 citations

Posted Content
17 Sep 2017
Abstract: Author(s): Padhi, Saswat; Jain, Prateek; Perelman, Daniel; Polozov, Oleksandr; Gulwani, Sumit; Millstein, Todd D.

10 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...OpenRefine [3] does not learn syntactic profiles, but it allows clustering of strings using character-based similarity measures [17]....

    [...]

  • ...We observed that character-based measures [17] show poor AUC, and are not indicative of syntactic similarity....

    [...]

  • ...However, existing similarity measures [17] do not capture the desired syntactic dissimilarity over strings....

    [...]

Journal ArticleDOI
TL;DR: The proposed model achieves comparable performance to that of other state‐of‐the‐art models, and the proposed cyclic self‐learning strategy can be further extended to other domains.

10 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...Among them, corpus-based similarity calculations can be divided into three methods based on different models: bag-of-words (BOW) models, neural networks and search engines (Gomaa & Fahmy, 2013; Pradhan et al., 2015)....

    [...]
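A minimal sketch of the bag-of-words branch named in the quoted passage: represent each text as a term-count vector and compare by cosine (whitespace tokenization is a simplifying assumption):

    # Bag-of-words cosine similarity, the BOW branch of the corpus-based
    # methods mentioned above; tokenization is a naive whitespace split.
    from collections import Counter
    from math import sqrt

    def cosine_bow(t1: str, t2: str) -> float:
        a, b = Counter(t1.lower().split()), Counter(t2.lower().split())
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    print(cosine_bow("the cat sat on the mat", "the cat lay on the mat"))  # 0.875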

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The lexical database (nouns in WordNet), a semantic network of English verbs, and applications of WordNet such as building semantic concordances are presented.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.
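A minimal sketch of a Knowledge-Based measure over this resource, via NLTK's WordNet interface (requires the wordnet corpus download; maximizing over sense pairs is one common convention, assumed here):

    # Knowledge-based similarity over WordNet using NLTK.
    # Setup: pip install nltk, then nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    def max_path_similarity(word1: str, word2: str) -> float:
        """Best path similarity over all sense pairs (1.0 = same concept)."""
        scores = [s1.path_similarity(s2)
                  for s1 in wn.synsets(word1)
                  for s2 in wn.synsets(word2)]
        scores = [s for s in scores if s is not None]  # no path for some POS pairs
        return max(scores, default=0.0)

    print(max_path_similarity("car", "automobile"))  # 1.0, shared synset
    print(max_path_similarity("car", "banana"))      # much lower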

13,049 citations

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

    [...]
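The quoted definition translates directly into a computation over term sets; a minimal sketch, assuming whitespace tokenization:

    # Dice's coefficient exactly as quoted: twice the number of common
    # terms divided by the total number of terms in both strings.
    def dice_coefficient(s1: str, s2: str) -> float:
        a, b = set(s1.lower().split()), set(s2.lower().split())
        if not a and not b:
            return 0.0
        return 2 * len(a & b) / (len(a) + len(b))

    print(dice_coefficient("night is dark", "the night is young"))  # 4/7 ~ 0.57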

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]
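The extension described in this entry is Smith-Waterman local alignment; a minimal sketch of its scoring recurrence follows (the zero floor is what makes the alignment local; the score values are illustrative):

    # Smith-Waterman local alignment score: like Needleman-Wunsch but
    # floored at zero, so only the best-matching local region counts.
    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        n, m = len(a), len(b)
        score = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
                best = max(best, score[i][j])
        return best

    print(smith_waterman("TACGGGCC", "TAGCCCTA"))  # scores the best shared region only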

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
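A minimal LSA sketch with NumPy: factor a small term-document count matrix with SVD, keep the top k dimensions, and compare terms by cosine in the reduced space (the toy corpus and k=2 are illustrative; the abstract reports around 300 dimensions working well):

    # LSA sketch: truncated SVD of a term-document matrix, then cosine
    # similarity between terms in the latent space. Toy data; k=2 here,
    # whereas the paper reports ~300 dimensions as a good choice.
    import numpy as np

    docs = ["human machine interface", "user interface system",
            "machine learning system", "tree graph minors"]
    vocab = sorted({w for d in docs for w in d.split()})
    X = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    term_vecs = U[:, :k] * s[:k]  # each row: one term in the latent space

    def lsa_sim(w1, w2):
        a, b = term_vecs[vocab.index(w1)], term_vecs[vocab.index(w2)]
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(lsa_sim("machine", "system"))  # related through shared latent dimensions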

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]