Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses existing work on text similarity by partitioning it into three approaches, String-based, Corpus-based and Knowledge-based similarity, and presents samples of combinations between these similarities.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.
General Terms: Text Mining, Natural Language Processing.
Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.
1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison.
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type will be presented briefly. This paper is organized as follows: section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
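As a concrete illustration of the character-based string metrics the survey covers, the sketch below implements Levenshtein edit distance, one of the classic measures (a minimal version; the function name and unit edit costs are our choices, not the survey's):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn string a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

A small distance means the strings are lexically similar; dividing by the longer string's length gives a normalized similarity score.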


Citations
16 Dec 2019
TL;DR: Automatic short-answer grading systems can be considered as an alternative in the student exam assessment process, as mentioned in this paper.
Abstract: Automatic short-answer grading systems can be considered as an alternative in the student exam assessment process. Unlike the multiple-choice model, short-answer grading is more difficult to compute, because it requires natural language processing techniques. Several computational methods have been used and developed by previous researchers. One of the most basic techniques is lexical-based grading, which scores answers according to the similarity of their character composition. This study uses the Cosine Similarity and Jaccard Similarity methods, which can measure the similarity between student answers and teacher answers based on their constituent words. Text pre-processing techniques are also applied to compare the results of each method. The data used in this study consist of Indonesian-language questions and answers. The results show that Cosine Similarity combined with text pre-processing achieved the highest Pearson correlation (0.62). Keywords: automatic grading, cosine similarity, jaccard similarity, short answer, natural language processing
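The two measures this study uses, Jaccard and Cosine similarity over the words of the two answers, can be sketched as follows (a minimal illustration assuming whitespace tokenization; the example sentences are invented):

```python
from collections import Counter
from math import sqrt

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on token sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cosine(a: str, b: str) -> float:
    """Cosine similarity on term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)   # vb[t] is 0 for missing terms
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb)

student = "the mitochondria is the powerhouse of the cell"
teacher = "mitochondria are the powerhouse of the cell"
print(round(jaccard(student, teacher), 2))  # → 0.71
print(round(cosine(student, teacher), 2))   # → 0.89
```

Cosine works on term frequencies while Jaccard only checks set membership, which is one reason the two can rank answer pairs differently.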

5 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...In their research, Gomaa [15] classified lexical word similarity as String-Based Algorithms, as shown in Figure 2....


  • ...[15] W. H.Gomaa and A. A. Fahmy, “A Survey of Text Similarity Approaches,” Int....


  • ...Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization [15]....


Proceedings Article
01 Jan 2020
TL;DR: This paper presents a case study of migrating a privacy-safe information extraction system in production for Gmail from a traditional rule-based architecture to a machine-learned Software 2.0 architecture.
Abstract: This paper presents a case study of migrating a privacy-safe information extraction system in production for Gmail from a traditional rule-based architecture to a machine-learned Software 2.0 architecture. The key idea is to use the extractions from the existing rule-based system as training data to learn models that in turn replace all the machinery for the rule-based system. The resulting system a) delivers better precision and recall, b) is significantly smaller in terms of lines of code, c) is easier to maintain and improve, and d) allowed us to leverage machine learning advances to build a cross-language extraction system even though our original training data was only in English. We describe challenges encountered during this migration around generation and management of training data and evaluation of models, and report on many traditional “Software 1.0” components we built to address them.

5 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...Approximate matching of complex fields is a well-studied problem with many sophisticated solutions [10]....


Posted Content
TL;DR: This paper used an attention-based recurrent neural network model that optimizes the sentence similarity across English, Spanish, and Arabic for cross-lingual Semantic Textual Similarity (STS) task.
Abstract: This paper describes a neural-network model which performed competitively (top 6) at the SemEval 2017 cross-lingual Semantic Textual Similarity (STS) task. Our system employs an attention-based recurrent neural network model that optimizes the sentence similarity. In this paper, we describe our participation in the multilingual STS task which measures similarity across English, Spanish, and Arabic.

5 citations

Journal ArticleDOI
TL;DR: In this paper, a review-to-aspect mapping method was proposed to explore reviewers' opinions from the massive and sparse online reviews, and the analytical and experimental results with real data demonstrate that online customers can be sectioned into groups in accordance with their reviewing behaviors and that people within the same group may have similar reviewing motivations and concerns for an online shopping experience.
Abstract: Web 2.0 technologies have attracted an increasing number of people with various backgrounds to become active online writers and viewers. As a result, exploring reviewers’ opinions from a huge number of online reviews has become more important and simultaneously more difficult than ever before. In this paper, we first present a methodological framework to study the “purchasing-reviewing” behavior dynamics of online customers. Then, we propose a review-to-aspect mapping method to explore reviewers’ opinions from the massive and sparse online reviews. The analytical and experimental results with real data demonstrate that online customers can be sectioned into groups in accordance with their reviewing behaviors and that people within the same group may have similar reviewing motivations and concerns for an online shopping experience.

5 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...where dist(·) may be any method that can measure the similarity of w_i (or the initial semantic word s_ij of w_i) and s_k (Equation 4), for example the string-based, corpus-based (Gomaa and Fahmy 2013), or cluster-based (Aggarwal and Zhai 2012) similarity method....


01 Jan 2017
TL;DR: This article proposes a graph-based method, specifically β-compact clustering, for discovering the groups of documents written by the same author. It is based on the analysis of the similarity between documents: documents belong to the same group as long as the similarity between them exceeds the threshold β.
Abstract: Identifying the authorship of an anonymous or doubtful document constitutes a cornerstone of automatic forensic applications. Moreover, it is a challenging task for both humans and computers. Clustering documents according to the linguistic style of the authors who wrote them has been little studied by the research community. To address this problem, the PAN Evaluation Framework became the first effort to promote the development of author clustering. This article proposes a graph-based method, specifically β-compact clustering, for discovering the groups of documents written by the same author. The β-compact algorithm is based on the analysis of the similarity between documents: documents belong to the same group as long as the similarity between them exceeds the threshold β and is the maximum similarity with respect to other documents. In our proposal we evaluated different linguistic features and similarity measures presented in previous work on the authorship analysis task. The training dataset was used to determine the best value of the β parameter for each language. The results of the experiments were encouraging.

5 citations


Cites methods from "A Survey of Text Similarity Approac..."

  • ...We used the Dice, Jaccard and Cosine functions [6], using only binary features, that is, we did not compute the frequency of each of the features, only their appearance in the document....


References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: This work presents the WordNet lexical database, covering nouns, modifiers and a semantic network of English verbs, together with applications of WordNet such as building semantic concordances.
Abstract: Part 1 The lexical database: nouns in WordNet, George A. Miller modifiers in WordNet, Katherine J. Miller a semantic network of English verbs, Christiane Fellbaum design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst representing verb alterations in WordNet, Karen T. Kohl et al the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3 Applications of WordNet: building semantic concordances, Shari Landes et al performance and confidence in a semantic annotation task, Christiane Fellbaum et al WordNet and class-based probabilities, Philip Resnik combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow using WordNet for text retrieval, Ellen M. Voorhees lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I Moldovan appendix - obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.
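The method referenced here is the Needleman-Wunsch global alignment algorithm, whose score can be sketched as a dynamic program (a minimal illustration with assumed unit match/mismatch/gap scores, not the original 1970 formulation):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: every prefix pair gets the best score
    over match/mismatch on the last symbols or a gap in either sequence."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # align a's prefix against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # align b's prefix against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag,
                              score[i-1][j] + gap,   # gap in b
                              score[i][j-1] + gap)   # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # → 0
```

The same recurrence works on characters of words instead of amino acids, which is how the survey applies it as a character-based string measure.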

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

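The definition of Dice's coefficient quoted above translates directly to code (a minimal sketch assuming whitespace-tokenized term sets; the example strings are invented):

```python
def dice(a: str, b: str) -> float:
    """Dice's coefficient: twice the number of shared terms divided
    by the total number of terms in the two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(sa & sb) / (len(sa) + len(sb))

# shared terms: "night", "is"  →  2*2 / (3+4) ≈ 0.57
print(round(dice("night is dark", "the night is young"), 2))
```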

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

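The local-alignment behaviour described in this excerpt (the Smith-Waterman extension of Needleman-Wunsch) can be sketched as follows: the dynamic program is the same as the global one except that cell scores are floored at zero, so an alignment can begin and end anywhere. The scoring values and example sequences below are our assumptions:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any segment of a and any
    segment of b; flooring at 0 lets alignments restart anywhere."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            s = max(0,
                    prev[j - 1] + (match if ca == cb else mismatch),
                    prev[j] + gap,       # gap in b
                    curr[j - 1] + gap)   # gap in a
            curr.append(s)
            best = max(best, s)
        prev = curr
    return best

# the shared motif "GGCT" dominates despite dissimilar surroundings:
# four matches × 2 = 8
print(smith_waterman("AAAGGCTAAA", "TTTGGCTTTT"))  # → 8
```

This is exactly the "regions of similarity within a larger context" case the excerpt describes: the dissimilar flanks contribute nothing because their cells are clamped to zero.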

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
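The core LSA step, a truncated SVD of a term-document co-occurrence matrix, can be sketched with a toy corpus (the matrix and term list below are invented, and k=2 is an arbitrary choice; real LSA uses hundreds of dimensions over large corpora):

```python
import numpy as np

# toy term-document matrix: rows = terms, columns = documents
terms = ["ship", "boat", "ocean", "cpu", "chip"]
X = np.array([
    [1, 1, 0, 0],  # ship  : docs 0, 1
    [0, 0, 1, 0],  # boat  : doc 2
    [1, 1, 1, 0],  # ocean : docs 0, 1, 2
    [0, 0, 0, 1],  # cpu   : doc 3
    [0, 0, 0, 1],  # chip  : doc 3
], dtype=float)

# truncated SVD: keep k latent dimensions; each term's vector is a
# row of U_k scaled by the singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "ship" and "boat" never co-occur in any document, yet LSA places
# them close because both co-occur with "ocean"
print(cos(term_vecs[0], term_vecs[1]))  # close to 1.0
print(cos(term_vecs[0], term_vecs[3]))  # ship vs cpu: close to 0.0
```

This indirect induction from co-occurrence is exactly the mechanism the abstract credits for LSA's vocabulary learning.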

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....
