Journal ArticleDOI

A Survey of Text Similarity Approaches

18 Apr 2013 - International Journal of Computer Applications (Foundation of Computer Science (FCS)) - Vol. 68, Iss. 13, pp. 13-18
TL;DR: This survey discusses the existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarity; samples of combinations of these similarity measures are also presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarity. Furthermore, samples of combinations of these similarity measures are presented.

General Terms: Text Mining, Natural Language Processing.

Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity and serves as a primary stage for sentence, paragraph and document similarity. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric measures the similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly. This paper is organized as follows: section two presents String-Based algorithms, partitioned into character-based and term-based measures; sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively; samples of combinations of similarity algorithms are introduced in section five; and section six presents the conclusion of the survey.
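A concrete illustration of the string-metric idea above (a minimal sketch of my own, not code from the survey): the Levenshtein edit distance, a character-based measure, counts the insertions, deletions and substitutions needed to turn one string into the other, and can be normalized into a similarity in [0, 1].

# Character-based string metric: Levenshtein edit distance.
# Illustrative sketch only; the survey describes the measure, not this code.
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the current prefix of a
    # and b[:j]; the table is filled row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def lexical_similarity(a: str, b: str) -> float:
    # Normalize the distance into a similarity score in [0, 1].
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(lexical_similarity("kitten", "sitting"))  # ~0.571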


Citations
Journal ArticleDOI
TL;DR: An autonomous methodology is presented for automatically classifying a set of text data (customer sentences) as referring to known or unknown customer need (CN) statements, and for verifying and updating the customer need database, with an average of 90% precision and 60% recall.

8 citations

Journal ArticleDOI
TL;DR: This work presents an approach for aligning online profiles and job advertisements according to knowledge and skills, using measures of lexical, syntactic and taxonomic similarity, and uses natural language processing to assess the match between the competencies described by applicants and by employers, even if they use different terminology.
Abstract: Describing the competencies required by a profession is essential for aligning online profiles of job seekers and job advertisements. Comparing the competencies described within each context has typically not been done, which has generated a complete disconnect in language between them. This work presents an approach for the alignment of online profiles and job advertisements, according to knowledge and skills, using measures of lexical, syntactic and taxonomic similarity. In addition, we use a ranking that allows the alignment of the profiles to the topics of a thesaurus that defines competencies. The results are promising, because combining the similarity measures with alignment to thesauri of competencies makes the generation of professional competence descriptions robust. This combination makes it possible to deal with the common problems of synonymy, homonymy, hypernymy/hyponymy and meronymy of terms in Spanish. This research uses natural language processing to offer a novel approach for assessing the match between the competencies described by applicants and by employers, even if they use different terminology. The resulting approach, while developed in Spanish for computer science jobs, can be extended to other languages and domains, such as recruitment, where it will contribute to the creation of better tools that give job seekers feedback on how best to align their competencies with job opportunities.

8 citations


Cites methods from "A Survey of Text Similarity Approaches"

  • ...relationships between knowledge topics (Malzahn et al. 2013), and to classify documents using the Dice’s coefficient as a measure of similarity (Gomaa and Fahmy 2013); others to find the best contractors to perform the business process tasks, comparing candidates’ skills and knowledge with...

    [...]

Proceedings ArticleDOI
TL;DR: In this paper, a text classifier with domain-specific features is proposed to find the exact tutorial fragments explaining APIs; it can find up to 90% of the fragments explaining an API and improves on the state-of-the-art model by up to 30% in terms of F-measure.
Abstract: Developers prefer to utilize third-party libraries when implementing some functionalities, and Application Programming Interfaces (APIs) are frequently used by them. Facing an unfamiliar API, developers tend to consult tutorials as learning resources. Unfortunately, the segments explaining a specific API are scattered across tutorials. Hence, finding the relevant segments remains a challenging issue. In this study, we propose a more accurate model to find the exact tutorial fragments explaining APIs. This new model consists of a text classifier with domain-specific features. More specifically, we discover two important indicators to complement traditional text-based features, namely co-occurring APIs and knowledge-based API extensions. In addition, we incorporate Word2Vec, a semantic similarity metric, to enhance the new model. Extensive experiments over two publicly available tutorial datasets show that our new model can find up to 90% of the fragments explaining APIs and improve the state-of-the-art model by up to 30% in terms of F-measure.

8 citations

Journal Article
TL;DR: It can be concluded that the semantic role labeling and named entity recognition approaches could be used as a good keyword selection mechanism, and that combining the similarity measures of different algorithms leads to the generation of good distractors.
Abstract: This research introduces an automatic multiple choice question generation system to evaluate the understanding of semantic role labels and named entities in a text. The system selects the informative sentence and the keyword to be asked about based on the semantic labels and named entities that exist in the sentence; the distractors are chosen based on a similarity measure between sentences in the data set. The system is tested using a set of sentences extracted from the TREC 2007 dataset for question answering. From the experimental results, it can be concluded that the semantic role labeling and named entity recognition approaches can serve as a good keyword selection mechanism. The second conclusion is that string similarity measures proved to be a very good approach that can be used in generating the distractors for automatic multiple choice questions. Also, combining the similarity measures of different algorithms leads to the generation of good distractors.

8 citations


Cites background from "A Survey of Text Similarity Approaches"

  • ...Finding the similarity between words is a fundamental step in finding the similarity between sentences and documents [10]....

    [...]

  • ...A survey about these algorithms and text similarity approaches exists in [10]....

    [...]

Proceedings ArticleDOI
12 Oct 2020
TL;DR: A unified adversarial detection framework for detecting adaptive audio adversarial examples, combining noise padding with sound reverberation, is proposed; it consistently outperforms state-of-the-art audio defense methods, even against adaptive and robust attacks.
Abstract: Adversarial attacks have been widely recognized as a security vulnerability of deep neural networks, especially in deep automatic speech recognition (ASR) systems. Advanced detection methods against adversarial attacks mainly focus on pre-processing the input audio to alleviate the threat of adversarial noise. Although these methods can detect some simple adversarial attacks, they fail to handle robust, complex attacks, especially when the attacker knows the detection details. In this paper, we propose a unified adversarial detection framework for detecting adaptive audio adversarial examples, which combines noise padding with sound reverberation. Specifically, a well-designed adaptive artificial utterance generator is proposed to balance the design complexity, such that the artificial utterances (speech with reverberation) are efficiently determined to reduce the false positive and false negative rates of the detection results. Moreover, to destroy the continuity of the adversarial noise, we develop a novel multi-noise padding strategy, which implants Gaussian noise in the silent fragments of the input speech identified by a voice activity detector. Furthermore, our proposed method can effectively tackle robust adaptive attacks in an adaptive learning manner. Importantly, the conceived system is easily embedded into any ASR model without requiring additional retraining or modification. The experimental results show that our method consistently outperforms state-of-the-art audio defense methods, even for adaptive and robust attacks.

8 citations

References
Journal ArticleDOI
01 Sep 2000 - Language
TL;DR: The WordNet lexical database is presented: nouns, modifiers and a semantic network of English verbs, together with applications of WordNet such as building semantic concordances.
Abstract: Part 1, The lexical database: nouns in WordNet (George A. Miller); modifiers in WordNet (Katherine J. Miller); a semantic network of English verbs (Christiane Fellbaum); design and implementation of the WordNet lexical database and searching software (Randee I. Tengi). Part 2: automated discovery of WordNet relations (Marti A. Hearst); representing verb alternations in WordNet (Karen T. Kohl et al.); the formalization of WordNet by methods of relational concept analysis (Uta E. Priss). Part 3, Applications of WordNet: building semantic concordances (Shari Landes et al.); performance and confidence in a semantic annotation task (Christiane Fellbaum et al.); WordNet and class-based probabilities (Philip Resnik); combining local context and WordNet similarity for word sense identification (Claudia Leacock and Martin Chodorow); using WordNet for text retrieval (Ellen M. Voorhees); lexical chains as representations of context for the detection and correction of malapropisms (Graeme Hirst and David St-Onge); temporal indexing through lexical chaining (Reem Al-Halimi and Rick Kazman); COLOR-X: using knowledge from WordNet for conceptual modelling (J.F.M. Burg and R.P. van de Riet); knowledge processing on an extended WordNet (Sanda M. Harabagiu and Dan I. Moldovan); appendix: obtaining and using WordNet.

13,049 citations
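WordNet is the semantic network behind the Knowledge-Based measures the survey covers. A hedged sketch of the idea using NLTK's WordNet interface (this assumes nltk is installed and the 'wordnet' corpus downloaded; taking the maximum over sense pairs is my own illustrative choice, and the survey's individual measures score sense pairs differently):

# Knowledge-based word similarity over WordNet via NLTK.
# Setup assumption: pip install nltk; then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def wordnet_similarity(word1: str, word2: str) -> float:
    # Score every pair of senses and keep the best path-based score.
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            score = s1.path_similarity(s2)
            if score is not None and score > best:
                best = score
    return best

print(wordnet_similarity("car", "automobile"))  # 1.0 (shared synset)
print(wordnet_similarity("car", "tree"))        # much lower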

Journal ArticleDOI
TL;DR: A computer-adaptable method for finding similarities in the amino acid sequences of two proteins has been developed, making it possible to determine whether significant homology exists between the proteins and to trace their possible evolutionary development.

11,844 citations
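This entry is the Needleman & Wunsch (1970) global alignment algorithm, also named among the survey's keywords. A minimal sketch of the dynamic-programming scoring (my own illustration with an arbitrary match/mismatch/gap scheme, not the paper's code):

# Needleman-Wunsch global alignment score via dynamic programming.
# The scoring scheme (match=+1, mismatch=-1, gap=-1) is illustrative only.
def needleman_wunsch(a: str, b: str,
                     match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of a aligned against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of b aligned against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0 with this scheme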

Journal ArticleDOI
01 Jul 1945 - Ecology

10,500 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...Dice’s coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11] (see the sketch following this excerpt)....

    [...]
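To make the quoted definition concrete, here is a minimal sketch of Dice's coefficient assuming whitespace tokenization (my own illustration, not code from the survey):

# Dice's coefficient: twice the number of common terms divided by the
# total number of terms in both strings (token-set version).
def dice_coefficient(s1: str, s2: str) -> float:
    terms1, terms2 = set(s1.split()), set(s2.split())
    if not terms1 and not terms2:
        return 1.0
    common = terms1 & terms2
    return 2.0 * len(common) / (len(terms1) + len(terms2))

# 2 common terms out of 3 + 4 total: 2*2/7 ~ 0.571
print(dice_coefficient("night is dark", "the night is young"))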

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"A Survey of Text Similarity Approac..." refers background in this paper

  • ...It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8] (see the sketch following this excerpt)....

    [...]
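The excerpt refers to the Smith-Waterman local alignment algorithm [8]. A minimal sketch (my own, with an illustrative scoring scheme): the key difference from global alignment is that cell scores are floored at zero, so an alignment can start and end anywhere within the two sequences.

# Smith-Waterman local alignment score: best-scoring pair of segments,
# one from each sequence. Scoring (match=+2, mismatch=-1, gap=-1) is
# illustrative only.
def smith_waterman(a: str, b: str,
                   match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets the alignment restart anywhere,
            # which is what makes the alignment local.
            score[i][j] = max(0, diag,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best

# Dissimilar sequences sharing the motif "GGTT":
print(smith_waterman("AAAGGTTCCC", "TTTGGTTAAA"))  # 8 (4 matches * 2)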

Journal ArticleDOI
TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.
Abstract: How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.

6,014 citations


"A Survey of Text Similarity Approac..." refers methods in this paper

  • ...The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction....

    [...]

  • ...LSA assumes that words that are close in meaning will occur in similar pieces of text....

    [...]

  • ...Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity....

    [...]

  • ...Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors....

    [...]

  • ...Mining the web for synonyms: PMI-IR versus LSA on TOEFL....

    [...]
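To make the LSA idea in these excerpts concrete, here is a minimal toy sketch of my own (not from the survey): build a term-document count matrix, reduce it with a truncated SVD, and compare terms by cosine similarity in the latent space, so that words occurring in similar pieces of text end up close together.

# Latent Semantic Analysis in miniature. The corpus and the number of
# latent dimensions k are toy choices for illustration.
import numpy as np

docs = [
    "human machine interface for computer applications",
    "user interface management system",
    "the graph of paths in trees",
    "graph minors and trees survey",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-document matrix A: A[i, j] = count of term i in document j.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# Truncated SVD keeps k latent dimensions of the co-occurrence structure.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * S[:k]  # one row per term in the latent space

def term_similarity(w1: str, w2: str) -> float:
    # Cosine similarity between latent term vectors.
    v1, v2 = term_vecs[index[w1]], term_vecs[index[w2]]
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0

print(term_similarity("graph", "trees"))      # high: co-occur in similar docs
print(term_similarity("graph", "interface"))  # low: disjoint contexts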