Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity through partitioning them into three approaches; String-based, Corpus-based and Knowledge-based similarities, and samples of combination between these similarities are presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison.
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type will be presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
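As a concrete illustration of a character-based string metric (an illustration, not code from the survey itself), the Levenshtein edit distance can be sketched in a few lines:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                  # distance from a-prefix of length i to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

Lower distance means the two strings are lexically closer; similarity measures are often derived by normalizing such a distance by the longer string's length.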


Citations
Proceedings ArticleDOI
28 Aug 2018
TL;DR: The experiments show very good clustering results and significant speedups in the clustering process, which indicates the suitability of both the record linkage formulation and the JedAI toolkit for improving the scalability for large-scale document clustering tasks.
Abstract: This work examines document clustering as a record linkage problem, focusing on named-entities and frequent terms, using several vector and graph-based document representation methods and k-means clustering with different similarity measures. The JedAI Record Linkage toolkit is employed for most of the record linkage pipeline tasks (i.e. preprocessing, scalable feature representation, blocking and clustering) and the OpenCalais platform for entity extraction. The resulting clusters are evaluated with multiple clustering quality metrics. The experiments show very good clustering results and significant speedups in the clustering process, which indicates the suitability of both the record linkage formulation and the JedAI toolkit for improving the scalability for large-scale document clustering tasks.

3 citations

Journal ArticleDOI
TL;DR: A web application that checks multilingual text, with a special focus on Arabic, for duplicate content on the World Wide Web; the results were encouraging and will open doors for new and innovative techniques for researchers in this field.
Abstract: Using someone else's work or ideas without attribution is plagiarism, whether you meant to do it or not. Unintended plagiarism of a snippet of text can have serious consequences and be a serious form of ethical misconduct. The current system is a web application that enables you to check multilingual text, with a special focus on Arabic, for duplicate content on the World Wide Web. In this system, you can simply input or paste your text through the online interface; for each sentence in the text, it queries three popular search engines (Google, Bing, and Yandex) and tries to find the top three results on the first page of each where duplicate content already exists. The system obtains data from the three search engines' custom search APIs. It then applies a text similarity technique between the suspicious sentence and the retrieved text snippet for all nine results; the reported result is the one with the highest similarity rate. The results were encouraging and will open doors for new and innovative techniques for researchers in this field.

3 citations

Book ChapterDOI
18 Sep 2017
TL;DR: This work acquires data from a large dataset of pages liked by users of a Facebook app and uses it to associate with high accuracy a given cinema-related page on Facebook to the corresponding record on IMDb, which includes plenty of metadata in addition to genres.
Abstract: In Facebook, the set of pages liked by some users represents an important knowledge about their real life tastes. However, the process of classification, which is already hard when dealing with dozens of classes and genres, is made even more difficult by the very coarse information of Facebook pages. Our work originates from a large dataset of pages liked by users of a Facebook app. To overcome the limitations of multilabel automatic classification of free-form user-generated pages, we acquire data also from IMDb, a large public database about movies. We use it to associate with high accuracy a given cinema-related page on Facebook to the corresponding record on IMDb, which includes plenty of metadata in addition to genres. To this aim, we compare different approaches. The obtained results demonstrate that the highest accuracy is obtained by the combined use of different methods and metrics.

3 citations


Cites background from "A Survey of Text Similarity Approac..."

  • ...In [10], the authors discuss the existing works on text similarity through partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities....


Proceedings ArticleDOI
01 Nov 2017
TL;DR: The analysis conducted found that among the most common similarity measurements, those based on the Jaro-Winkler algorithm significantly outperformed the other algorithms.
Abstract: Similarity measurement is a significant process to determine the degree of similarity between two records. This paper presents a comparative analysis of important similarity measurements which are utilised for the detection of duplicated records in databases. The work evaluates their strengths based on the efficiency of prevailing algorithms, the time required to process and identify duplications as well as performance accuracy. The analysis conducted found that among the most common similarity measurements, those based on the Jaro-Winkler algorithm significantly outperformed the other algorithms. This paper presents an enhanced strategy based on the Jaro-Winkler algorithm to improve the detection of similarity among database records. The ability to provide solutions to this problem will greatly enhance the quality of data used in decision-making.
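The Jaro-Winkler measure evaluated above can be sketched as follows. This is the standard textbook formulation (matches within a sliding window, transposition counting, and a prefix bonus with scale p = 0.1 capped at four characters), not the enhanced strategy proposed in that paper:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: combines match count and transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                    # find matches within window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                   # count transpositions
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted by the common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # → 0.9611 (classic example)
```

The prefix bonus is what makes Jaro-Winkler well suited to record linkage over names and short fields, where errors rarely occur at the beginning of a string.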

3 citations


Cites background or methods from "A Survey of Text Similarity Approac..."

  • ...Every non-matching data character is a transposition [9], [16]....


  • ...There are several variations regarding string metrics; the most frequently used are described below [9][10]....


  • ...The algorithm uses a prefix scale that assigns a greater positive ranking to a set of strings that are matched from the start of the common prefix length, up to a maximum of four characters [9], [16], [19]....


Journal ArticleDOI
TL;DR: Zhang et al. collected product review data from shopping websites, social media, product communities, and other online platforms to identify product competitors with the help of word-frequency co-occurrence technology.
Abstract: Sellers readily obtain consumer product evaluations from online reviews in order to identify competitive products in detail and predict sales. Firstly, we collect product review data from shopping websites, social media, product communities, and other online platforms to identify product competitors with the help of word-frequency co-occurrence technology. We take mobile phones as an example to mine and analyze product competition information. Then, we calculate the product review quantity, review emotion value, product-network heat, and price statistics and establish the regression model of online product review forecasts. In addition, a neural-network model is established to examine whether the relationships among factors are linear. On the basis of analyzing and discussing the impact of competitors' product sales, product price, the emotional value of reviews, and product-network popularity, we construct the sales forecast model. Finally, to verify the validity of the factor analysis affecting sales and the rationality of the established model, actual sales data are used to further analyze and verify the model, showing that the model is reasonable and effective.

3 citations

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The WordNet lexical database (nouns, modifiers, and a semantic network of English verbs) and applications of WordNet, such as building semantic concordances, are presented.
Abstract: Part 1 The lexical database: nouns in WordNet, George A. Miller modifiers in WordNet, Katherine J. Miller a semantic network of English verbs, Christiane Fellbaum design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst representing verb alterations in WordNet, Karen T. Kohl et al the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3 Applications of WordNet: building semantic concordances, Shari Landes et al performance and confidence in a semantic annotation task, Christiane Fellbaum et al WordNet and class-based probabilities, Philip Resnik combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow using WordNet for text retrieval, Ellen M. Voorhees lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman COLOR-X - using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I Moldovan appendix - obtaining and using WordNet.

13,049 citations

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

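The quoted definition of Dice's coefficient translates directly into code. A minimal sketch over whitespace-separated terms (treating "common terms" as a set intersection, which is the usual reading):

```python
def dice_coefficient(s1: str, s2: str) -> float:
    """Dice's coefficient: twice the number of common terms
    divided by the total number of terms in both strings."""
    t1, t2 = set(s1.split()), set(s2.split())
    if not t1 and not t2:
        return 1.0
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

print(dice_coefficient("night is dark", "the night is young"))  # → 0.5714285714285714
```

The same formula applies at the character level (e.g. over bigram sets) for a character-based variant.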

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....


Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....


  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....


  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....


  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....


  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

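The LSA pipeline referenced above (build a term-document matrix, reduce its dimensionality with a truncated SVD, then compare items in the latent space) can be sketched with a toy corpus. The vocabulary and counts here are invented purely for illustration:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
terms = ["car", "auto", "banana"]
docs = np.array([
    [2.0, 1.0, 0.0],   # "car"    occurs in docs 0-1
    [1.0, 2.0, 0.0],   # "auto"   occurs in docs 0-1
    [0.0, 0.0, 3.0],   # "banana" occurs only in doc 2
])

# Truncated SVD: keep k latent dimensions (LSA typically uses ~300).
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # term coordinates in the latent space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(term_vecs[0], term_vecs[1]))  # "car" vs "auto": high (near 1)
print(cosine(term_vecs[0], term_vecs[2]))  # "car" vs "banana": low (near 0)
```

Because "car" and "auto" share document contexts, the truncated space pulls them together even though they never co-occur as identical strings; this is the sense in which LSA captures that words close in meaning occur in similar pieces of text.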