Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 68, Iss: 13, pp 13-18
TL;DR: This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities; samples of combinations between these similarities are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity by partitioning them into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented. General Terms: Text Mining, Natural Language Processing. Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity. 1. INTRODUCTION Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way or in the same context, or one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms, and semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.
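As a concrete illustration of the string-based (lexical) family the abstract mentions, below is a minimal sketch of Levenshtein edit distance, one of the character-based measures the survey covers; the function name and example strings are purely illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein("similarity", "similarly"))  # 2
```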


Citations
Proceedings ArticleDOI
01 Nov 2019
TL;DR: In this paper, the authors propose a large-scale B2B recommender framework to address the requirements of large-scale retailers, e.g., the huge number of items and customers leading to data sparsity, or the high level of accuracy required in recommending safety items.
Abstract: The aim of a recommender system is to suggest relevant items in order to improve the purchasing experience and minimise information overload. Despite extensive research in the area of B2C recommender systems, business-to-business (B2B) distributors cannot directly benefit from the results, mainly because the data from these large-scale retailers is not publicly available to researchers and their problems are not widely known to the outside world. These companies have complex structures for their items and customers, e.g. the huge number of items and customers leading to data sparsity, or the high level of accuracy required in recommending safety items. Furthermore, one of the key requirements for such businesses is bulk recommendations to be able to meet their market demands. A unique hybrid approach to recommendation with an emphasis on knowledge components is needed for such businesses. It is critical to have a careful analysis of item-category-specific features for any recommendation as well as the customer context. In this paper, we propose a large-scale B2B recommender framework to address the above requirements.

1 citation

Proceedings ArticleDOI
17 Jun 2022
TL;DR: The paper presents several approaches and forms of visualization used to display information about possible redundancy and duplication of content, especially in the domain of educational content engineering.
Abstract: Nowadays, a lot of content is being moved to digital form, which allows better processing and analysis. However, in some cases, redundancy or duplication of content can be a problem. In this paper, redundancy and duplication of short texts are detected based on ontology-related approaches, and their visualization is presented. The paper presents several approaches and forms of visualization used to display information about possible redundancy and duplication of content, especially in the domain of educational content engineering.

1 citation

Proceedings ArticleDOI
29 Apr 2022
TL;DR: In this paper, the authors simulate human imprecision in conversational virtual agents' temporal statements and conduct a user study to evaluate the effects of time precision on perceived anthropomorphism and usefulness.
Abstract: Research on intelligent virtual agents (IVAs) often concerns the implementation of human-like behavior by integrating artificial intelligence algorithms. Thus far, few studies focused on mimicry of cognitive imperfections inherent to humans in IVAs. Neglecting to implement such imperfect behavior in IVAs might result in less believable or engaging human-agent interactions. In this paper, we simulate human imprecision in conversational IVAs’ temporal statements. We conducted a survey to identify temporal statement patterns, transferred them to a conversational IVA, and conducted a user study evaluating the effects of time precision on perceived anthropomorphism and usefulness. Statistical analyses reveal significant interaction between time precision and agents’ use of memory aids, indicating that (i) imprecise agents are perceived as more human-like than precise agents when responding immediately, and (ii) unnaturally high levels of temporal precision can be compensated for by memory aid use. Further findings underscore the value of continued inquiry into cultural variations.

1 citation

Book ChapterDOI
10 Dec 2019
TL;DR: An unsupervised short text tagging algorithm is proposed that generates latent topics, or clusters of semantically similar words, from a corpus of short texts and labels these short texts with stable predominant topics; evaluation shows the method is competitive with industry short text topic modeling algorithms.
Abstract: From online reviews and product descriptions to tweets and chats, many modern applications revolve around understanding both the semantic structure and the topics of short texts. Due to their significant reliance on word co-occurrence, traditional topic modeling algorithms such as LDA perform poorly on sparse short texts. In this paper, we propose an unsupervised short text tagging algorithm that generates latent topics, or clusters of semantically similar words, from a corpus of short texts, and labels these short texts with stable predominant topics. The algorithm defines a weighted undirected network, namely the one-mode projection of the bipartite network between words and users. Nodes represent all unique words from the corpus of short texts, edges represent the mutual presence of a pair of words in a short text, and edge weights count the number of short texts in which the pair appears. We generate the latent topics using nested stochastic block models (NSBM), dividing the network of words into communities of similar words. The algorithm is versatile: it automatically detects the appropriate number of topics. Many applications stem from the proposed algorithm, such as using the short text topic representations as the basis of a short text similarity metric. We validate the results using inter-semantic similarity and normalized mutual information, which show the method is competitive with industry short text topic modeling algorithms.

1 citation
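The network construction described in the abstract above can be sketched in a few lines. This is only an illustration of the word co-occurrence graph step, using networkx and a toy corpus as assumptions; the nested stochastic block model inference itself (typically run with a dedicated library such as graph-tool) is omitted.

```python
from itertools import combinations
import networkx as nx

# Toy corpus of short texts (illustrative only).
short_texts = [
    "great phone battery life",
    "battery life could be better",
    "great camera great phone",
]

G = nx.Graph()
for text in short_texts:
    words = set(text.split())                 # unique words in this short text
    for w1, w2 in combinations(sorted(words), 2):
        if G.has_edge(w1, w2):
            G[w1][w2]["weight"] += 1          # count of short texts containing both words
        else:
            G.add_edge(w1, w2, weight=1)

# Each edge weight counts the short texts in which the word pair co-occurs.
print(G.edges(data=True))
```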

Dissertation
31 May 2018
TL;DR: This work used the Moral Foundations Theory and Doc2Vec, a Natural Language Processing technique, to compute the quantified moral loadings of user-generated textual contents in social networks, and indicated that these moral features are tightly bound with users’ behavior in social networks.
Abstract: Moral inclinations expressed in user-generated content such as online reviews or tweets can provide useful insights to understand users’ behavior and activities in social networks, for example, to predict users’ rating behavior, perform customer feedback mining, and study users’ tendency to spread abusive content on these social platforms. In this work, we want to answer two important research questions. First, whether the moral attributes of social network data can provide additional useful information about users’ behavior, and how to utilize this information to enhance our understanding. To answer this question, we used the Moral Foundations Theory and Doc2Vec, a Natural Language Processing technique, to compute the quantified moral loadings of user-generated textual contents in social networks. We used conditional relative frequency and the correlations between the moral foundations as two measures to study the moral breakdown of the social network data, utilizing a dataset of Yelp reviews and a dataset of tweets on abusive user-generated content. Our findings indicated that these moral features are tightly bound with users’ behavior in social networks. The second question we want to answer is whether we can use the quantified moral loadings as new boosting features to improve the differentiation, classification, and prediction of social network activities. To test our hypothesis, we adopted our new moral features in a multi-class classification approach to distinguish hateful and offensive tweets in a labeled dataset, and compared with a baseline approach that only uses conventional text mining features such as tf-idf features, Part of Speech (PoS) tags, etc. Our findings demonstrated that the moral features improved the performance of the baseline approach in terms of precision, recall, and F-measure.

1 citation
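The dissertation's exact pipeline is not reproduced here, so the following is only a rough sketch of one plausible way to derive a moral loading with gensim's Doc2Vec (gensim 4.x API assumed): train on toy texts, average word vectors for hypothetical seed words of a single moral foundation, and score each text by cosine similarity. The corpus and seed words are made up for illustration.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy user-generated texts (illustrative only).
texts = [
    "the staff treated us fairly and with respect",
    "this place cheats its customers out of money",
    "great burgers and friendly service",
]
tagged = [TaggedDocument(words=t.split(), tags=[str(i)]) for i, t in enumerate(texts)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Hypothetical seed words for a "fairness" foundation (illustrative only).
fairness_seeds = ["fairly", "respect", "cheats"]
seed_vecs = [model.wv[w] for w in fairness_seeds if w in model.wv]
foundation_vec = np.mean(seed_vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Moral loading of each text = similarity of its document vector to the foundation vector.
for i, t in enumerate(texts):
    print(t, "->", round(cosine(model.dv[str(i)], foundation_vec), 3))
```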


Cites background or methods from "A Survey of Text Similarity Approac..."

  • ...• Overlap coefficient is very similar to Dice’s coefficient; however, if one document is a subset of the other document, we will consider the similarity as a full match [41]....

    [...]

  • ...In addition, there are character-based similarity approaches such as Longest Common SubString (LCS) algorithm which considers the similarity between two strings as the length of contiguous chain of characters that are common in both strings, or N-grams where the similarity is defined as the count of the common N-grams in two strings over the maximal number of the N-grams in two strings [9, 41]....

    [...]

  • ...• Dice’s coefficient is computed as twice the number of common terms in two documents over the total number of terms in both documents [30, 41]....

    [...]

  • ...• Jaccard similarity is defined as the number of common terms divided by the number of the unique terms in the documents [41, 55]...

    [...]

  • ...• Matching coefficient is a vector-based scheme where we count the number of similar terms in the documents where both document vectors are non-zero [41]....

    [...]
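The term-based measures quoted in these excerpts (Dice, Jaccard, overlap and matching coefficients) reduce to simple set arithmetic over document terms; a minimal sketch, assuming whitespace tokenization:

```python
def terms(text: str) -> set:
    return set(text.lower().split())

def dice(a: set, b: set) -> float:
    # Twice the number of common terms over the total number of terms in both documents.
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    # Common terms over the number of unique terms across both documents.
    return len(a & b) / len(a | b)

def overlap(a: set, b: set) -> float:
    # Equals 1.0 (full match) whenever one term set is a subset of the other.
    return len(a & b) / min(len(a), len(b))

def matching(a: set, b: set) -> int:
    # Count of terms for which both document vectors are non-zero.
    return len(a & b)

d1 = terms("the cat sat on the mat")
d2 = terms("the cat sat")
print(dice(d1, d2), jaccard(d1, d2), overlap(d1, d2), matching(d1, d2))
# 0.75 0.6 1.0 3
```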

References
Journal ArticleDOI
01 Sep 2000-Language
TL;DR: The WordNet lexical database (nouns, modifiers and a semantic network of English verbs) and applications of WordNet, such as building semantic concordances, are presented.
Abstract: Part 1, The lexical database: nouns in WordNet, George A. Miller; modifiers in WordNet, Katherine J. Miller; a semantic network of English verbs, Christiane Fellbaum; design and implementation of the WordNet lexical database and searching software, Randee I. Tengi. Part 2: automated discovery of WordNet relations, Marti A. Hearst; representing verb alterations in WordNet, Karen T. Kohl et al.; the formalization of WordNet by methods of relational concept analysis, Uta E. Priss. Part 3, Applications of WordNet: building semantic concordances, Shari Landes et al.; performance and confidence in a semantic annotation task, Christiane Fellbaum et al.; WordNet and class-based probabilities, Philip Resnik; combining local context and WordNet similarity for word sense identification, Claudia Leacock and Martin Chodorow; using WordNet for text retrieval, Ellen M. Voorhees; lexical chains as representations of context for the detection and correction of malapropisms, Graeme Hirst and David St-Onge; temporal indexing through lexical chaining, Reem Al-Halimi and Rick Kazman; COLOR-X, using knowledge from WordNet for conceptual modelling, J.F.M. Burg and R.P. van de Riet; knowledge processing on an extended WordNet, Sanda M. Harabagiu and Dan I. Moldovan; appendix: obtaining and using WordNet.

13,049 citations
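WordNet is the semantic network underlying most of the knowledge-based measures the survey covers. A minimal sketch of path-based similarity lookups with NLTK follows; it assumes the wordnet corpus has been downloaded (nltk.download('wordnet')) and simply takes the first sense of each word rather than doing real word-sense disambiguation.

```python
from nltk.corpus import wordnet as wn

# First-sense synsets for two nouns (a simplification; proper word-sense
# disambiguation would choose the sense from context).
dog = wn.synsets("dog", pos=wn.NOUN)[0]
cat = wn.synsets("cat", pos=wn.NOUN)[0]

# Path similarity: based on the shortest is-a path between the two synsets.
print(dog.path_similarity(cat))
# Wu-Palmer similarity: based on the depths of the synsets and of their common ancestor.
print(dog.wup_similarity(cat))
```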

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations
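This is the Needleman & Wunsch global alignment method listed among the survey's character-based measures. A compact sketch with unit match/mismatch/gap scores follows; the scoring scheme is an assumption (the original formulation works over a substitution matrix for amino acids).

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Score of the best *global* alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with these scores
```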

Journal ArticleDOI
01 Jul 1945-Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11]....

    [...]

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8]....

    [...]
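The excerpt above refers to local alignment: Smith-Waterman differs from Needleman-Wunsch by clamping cell scores at zero and returning the matrix maximum, so an alignment may start and end anywhere. A minimal sketch under the same unit scoring assumption as the global-alignment example earlier:

```python
def smith_waterman(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Score of the best *local* alignment (most similar pair of segments)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative scores are clamped to zero: an alignment can start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best

print(smith_waterman("completely different GATTACA tail", "GATTACA"))  # 7
```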

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]
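As the excerpts above note, LSA derives term and document vectors by applying a truncated SVD to a term-document matrix, so that words occurring in similar contexts end up close in the latent space. A minimal sketch with scikit-learn follows, using a toy corpus and only 2 latent dimensions instead of the roughly 300 the paper suggests.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the doctor examined the patient in the hospital",
    "the physician treated the patient at the clinic",
    "the stock market fell sharply on monday",
]

# Term-document matrix (TF-IDF weighted here), then rank-k SVD.
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)

# The two medical documents should be closer to each other than to the finance one.
print(cosine_similarity(doc_vectors))
```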