
Showing papers on "Semantic similarity" published in 2011


Journal ArticleDOI
18 Jul 2011-PLOS ONE
TL;DR: REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures.
Abstract: Outcomes of high-throughput biological experiments are typically interpreted by statistical testing for enriched gene functional categories defined by the Gene Ontology (GO). The resulting lists of GO terms may be large and highly redundant, and thus difficult to interpret. REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures. Furthermore, REVIGO visualizes this non-redundant GO term set in multiple ways to assist in interpretation: multidimensional scaling and graph-based visualizations accurately render the subdivisions and the semantic relationships in the data, while treemaps and tag clouds are also offered as alternative views. REVIGO is freely available at http://revigo.irb.hr/.
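As an illustration of the general idea (a minimal sketch, not REVIGO's exact algorithm): keep the most significant term of each group of semantically similar terms and drop the redundant ones. The p-values and similarity scores below are made up.

```python
# Greedy redundancy reduction over a GO-term list: walk terms from most to
# least significant and keep a term only if it is not too similar to any
# already-kept term. The similarity values stand in for a real semantic
# similarity measure (e.g. Resnik or Lin over the GO graph).

def reduce_terms(terms, pvalues, sim, threshold=0.9):
    kept = []
    for term in sorted(terms, key=lambda t: pvalues[t]):  # most significant first
        if all(sim(term, k) < threshold for k in kept):
            kept.append(term)
    return kept

# Toy data: two of the three terms are near-synonymous.
toy_sim = {
    frozenset(("GO:0006915", "GO:0012501")): 0.95,  # apoptotic process vs. programmed cell death
    frozenset(("GO:0006915", "GO:0008219")): 0.80,
    frozenset(("GO:0012501", "GO:0008219")): 0.80,
}
sim = lambda a, b: 1.0 if a == b else toy_sim.get(frozenset((a, b)), 0.0)
pvals = {"GO:0006915": 1e-8, "GO:0012501": 1e-6, "GO:0008219": 1e-3}
print(reduce_terms(list(pvals), pvals, sim))  # ['GO:0006915', 'GO:0008219']
```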

4,919 citations


Proceedings ArticleDOI
28 Mar 2011
TL;DR: This paper proposes a new semantic relatedness model, Temporal Semantic Analysis (TSA), which captures temporal information by representing word semantics as a vector of concepts, each modeled as a time series over a corpus of temporally-ordered documents.
Abstract: Computing the degree of semantic relatedness of words is a key functionality of many language applications such as search, clustering, and disambiguation. Previous approaches to computing semantic relatedness mostly used static language resources, while essentially ignoring their temporal aspects. We believe that a considerable amount of relatedness information can also be found in studying patterns of word usage over time. Consider, for instance, a newspaper archive spanning many years. Two words such as "war" and "peace" might rarely co-occur in the same articles, yet their patterns of use over time might be similar. In this paper, we propose a new semantic relatedness model, Temporal Semantic Analysis (TSA), which captures this temporal information. The previous state of the art method, Explicit Semantic Analysis (ESA), represented word semantics as a vector of concepts. TSA uses a more refined representation, where each concept is no longer scalar, but is instead represented as time series over a corpus of temporally-ordered documents. To the best of our knowledge, this is the first attempt to incorporate temporal evidence into models of semantic relatedness. Empirical evaluation shows that TSA provides consistent improvements over the state of the art ESA results on multiple benchmarks.
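To make the temporal intuition concrete, here is a minimal sketch (not the full TSA model, which represents words over concept time series rather than raw counts): describe each word by its usage counts per time bin in an archive and correlate the resulting series. The counts are hypothetical.

```python
# Compare two words by the similarity of their usage patterns over time:
# words that rarely co-occur can still have highly correlated temporal
# profiles (e.g. "war" and "peace" in a news archive).

from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical yearly article counts mentioning each word.
usage = {
    "war":    [120, 340, 310, 90, 60, 400, 380],
    "peace":  [100, 300, 280, 80, 70, 350, 330],
    "cheese": [55, 50, 52, 54, 49, 51, 53],
}
print(pearson(usage["war"], usage["peace"]))   # close to 1: similar temporal pattern
print(pearson(usage["war"], usage["cheese"]))  # low: unrelated pattern
```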

482 citations


Proceedings Article
23 Jun 2011
TL;DR: A novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space, which not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions and is thus more efficient.
Abstract: Traditional text similarity measures consider each term similar only to itself and do not model semantic relatedness of terms. We propose a novel discriminative training method that projects the raw term vectors into a common, low-dimensional vector space. Our approach operates by finding the optimal matrix to minimize the loss of the pre-selected similarity function (e.g., cosine) of the projected vectors, and is able to efficiently handle a large number of training examples in the high-dimensional space. Evaluated on two very different tasks, cross-lingual document retrieval and ad relevance measure, our method not only outperforms existing state-of-the-art approaches, but also achieves high accuracy at low dimensions and is thus more efficient.

298 citations


Journal ArticleDOI
TL;DR: This analysis leads to the conclusion that SSC is more constructive and has higher locality than SAC, NSM and SC; these are believed to be the main reasons for the improved performance of SSC.
Abstract: We investigate the effects of semantically-based crossover operators in genetic programming, applied to real-valued symbolic regression problems. We propose two new relations derived from the semantic distance between subtrees, known as semantic equivalence and semantic similarity. These relations are used to guide variants of the crossover operator, resulting in two new crossover operators: semantics aware crossover (SAC) and semantic similarity-based crossover (SSC). SAC, which was introduced and studied previously, is included here for comparison and analysis. SSC extends SAC by more closely controlling the semantic distance between subtrees to which crossover may be applied. The new operators were tested on several real-valued symbolic regression problems and compared with standard crossover (SC), context aware crossover (CAC), Soft Brood Selection (SBS), and No Same Mate (NSM) selection. The experimental results on the problems examined show that, with computational effort measured by the number of function node evaluations, only SSC and SBS were significantly better than SC, and that SSC was often better than SBS. Further experiments were also conducted to analyse the sensitivity of performance to the parameter settings of SSC. This analysis leads to the conclusion that SSC is more constructive and has higher locality than SAC, NSM and SC; we believe these are the main reasons for the improved performance of SSC.
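As a concrete illustration of the core test behind SSC (a minimal sketch, not the full GP system): a crossover swap is accepted only if the sampled semantic distance between the exchanged subtrees lies between a lower and an upper bound. Subtrees are modeled here simply as Python callables of one variable; the sample points and bounds are illustrative choices.

```python
# Sampled semantic distance between two subtrees: mean absolute difference
# of their outputs on a set of random input points. SSC rejects swaps whose
# distance is too small (semantically equivalent subtrees) or too large
# (semantically unrelated subtrees).

import random

def semantic_distance(subtree_a, subtree_b, points):
    return sum(abs(subtree_a(x) - subtree_b(x)) for x in points) / len(points)

def ssc_accepts(subtree_a, subtree_b, points, lower=1e-4, upper=0.4):
    return lower < semantic_distance(subtree_a, subtree_b, points) < upper

random.seed(0)
points = [random.uniform(-1, 1) for _ in range(20)]
print(ssc_accepts(lambda x: x * x, lambda x: x * x + 1e-9, points))  # False: equivalent
print(ssc_accepts(lambda x: x * x, lambda x: x * x + 0.1, points))   # True: similar but not equal
print(ssc_accepts(lambda x: x * x, lambda x: 100 * x, points))       # False: too dissimilar
```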

259 citations


Journal ArticleDOI
TL;DR: This paper analyzes ontology-based approaches for IC computation and proposes several improvements aimed at better capturing the semantic evidence modelled in the ontology for the particular concept.
Abstract: The information content (IC) of a concept provides an estimation of its degree of generality/concreteness, a dimension which enables a better understanding of a concept's semantics. As a result, IC has been successfully applied to the automatic assessment of the semantic similarity between concepts. In the past, IC has been estimated as the probability of appearance of concepts in corpora. However, the applicability and scalability of this method are hampered by corpora dependency and data sparseness. More recently, some authors proposed IC-based measures using taxonomical features extracted from an ontology for a particular concept, obtaining promising results. In this paper, we analyse these ontology-based approaches for IC computation and propose several improvements aimed at better capturing the semantic evidence modelled in the ontology for the particular concept. Our approach has been evaluated and compared with related works (both corpus-based and ontology-based ones) when applied to the task of semantic similarity estimation. Results obtained for a widely used benchmark show that our method enables similarity estimations which are better correlated with human judgements than those of related works.
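For a concrete reference point, one widely cited intrinsic IC formulation (due to Seco, Veale and Hayes; not necessarily the exact variant proposed in the paper above) estimates IC purely from the number of hyponyms a concept has in the taxonomy:

```python
# Intrinsic information content: concepts with many hyponyms are general and
# uninformative (IC near 0), leaf concepts are maximally specific (IC = 1).
# No corpus counts are needed, which avoids the sparseness problem mentioned
# above. The constant below is the approximate size of the WordNet 3.0 noun
# taxonomy, used only as an example.

from math import log

def intrinsic_ic(num_hyponyms, total_concepts):
    return 1.0 - log(num_hyponyms + 1) / log(total_concepts)

TOTAL_WORDNET_NOUNS = 82115
print(intrinsic_ic(0, TOTAL_WORDNET_NOUNS))      # 1.0 for a leaf concept
print(intrinsic_ic(600, TOTAL_WORDNET_NOUNS))    # mid-range for a moderately general concept
print(intrinsic_ic(82114, TOTAL_WORDNET_NOUNS))  # 0.0 for the root
```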

256 citations


Journal ArticleDOI
TL;DR: A new measure based on the exploitation of the taxonomical structure of a biomedical ontology is proposed; using SNOMED CT as the input ontology, the evaluation shows that it outperforms most of the previous measures while avoiding, at the same time, some of their limitations.

239 citations


Book ChapterDOI
29 May 2011
TL;DR: It is shown that the adoption of Semantic Web standards can provide added value for lexicon models by supporting a rich axiomatization of linguistic categories that can be used to constrain the usage of the model and to perform consistency checks.
Abstract: There are a large number of ontologies currently available on the Semantic Web. However, in order to exploit them within natural language processing applications, more linguistic information than can be represented in current Semantic Web standards is required. Further, there are a large number of lexical resources available representing a wealth of linguistic information, but this data exists in various formats and is difficult to link to ontologies and other resources. We present a model we call lemon (Lexicon Model for Ontologies) that supports the sharing of terminological and lexicon resources on the Semantic Web as well as their linking to the existing semantic representations provided by ontologies. We demonstrate that lemon can succinctly represent existing lexical resources and that, in combination with standard NLP tools, we can easily generate new lexica for domain ontologies according to the lemon model. We demonstrate that by combining generated and existing lexica we can collaboratively develop rich lexical descriptions of ontology entities. We also show that the adoption of Semantic Web standards can provide added value for lexicon models by supporting a rich axiomatization of linguistic categories that can be used to constrain the usage of the model and to perform consistency checks.

229 citations


Proceedings Article
19 Jun 2011
TL;DR: This work combines several graph alignment features with lexical semantic similarity measures using machine learning techniques and shows that the student answers can be more accurately graded than if the semantic measures were used in isolation.
Abstract: In this work we address the task of computer-assisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.

219 citations


Journal ArticleDOI
TL;DR: This work proposes an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words, and proposes a novel pattern extraction algorithm and a pattern clustering algorithm that significantly improves the accuracy in a community mining task.
Abstract: Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words. Specifically, we define various word co-occurrence measures using page counts and integrate those with lexical patterns extracted from text snippets. To identify the numerous semantic relations that exist between two given words, we propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal combination of page counts-based co-occurrence measures and lexical pattern clusters is learned using support vector machines. The proposed method outperforms various baselines and previously proposed web-based semantic similarity measures on three benchmark data sets showing a high correlation with human ratings. Moreover, the proposed method significantly improves the accuracy in a community mining task.
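A minimal sketch of page-count-based co-occurrence scores of the kind the paper combines with lexical patterns (the counts, the index size N and the cutoff below are illustrative, not the paper's exact settings):

```python
# Two simple co-occurrence scores computed from search-engine page counts:
# a Jaccard-style overlap and pointwise mutual information. h_p denotes the
# page count for query P; h_pq the count for the conjunctive query "P AND Q".

from math import log2

N = 1e10      # assumed number of indexed pages
CUTOFF = 5    # treat very small conjunctive counts as unreliable

def web_jaccard(h_p, h_q, h_pq):
    return 0.0 if h_pq <= CUTOFF else h_pq / (h_p + h_q - h_pq)

def web_pmi(h_p, h_q, h_pq):
    return 0.0 if h_pq <= CUTOFF else log2((h_pq / N) / ((h_p / N) * (h_q / N)))

# Hypothetical counts for "car", "automobile" and their conjunction.
h_car, h_auto, h_both = 2.1e8, 9.0e6, 4.5e6
print(web_jaccard(h_car, h_auto, h_both))
print(web_pmi(h_car, h_auto, h_both))
```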

218 citations


31 Jul 2011
TL;DR: This paper presents a novel approach for automatic detection of semantic change of words based on distributional similarity models and shows that the method obtains good results with respect to a reference ranking produced by human raters.
Abstract: This paper presents a novel approach for automatic detection of semantic change of words based on distributional similarity models. We show that the method obtains good results with respect to a reference ranking produced by human raters. The evaluation also analyzes the performance of frequency-based methods, comparing them to the similarity method proposed.
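A minimal sketch of the distributional idea (not the paper's exact model): build a context-word count vector for the target word in two time periods and use one minus their cosine similarity as a change score. The co-occurrence counts below are made up.

```python
# Distributional detection of semantic change: a word whose typical context
# words differ strongly between an older and a newer corpus receives a high
# change score.

from math import sqrt

def cosine(u, v):
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical context counts for "gay" in corpora from two different decades.
old = {"happy": 40, "cheerful": 25, "bright": 15, "rights": 1}
new = {"rights": 30, "marriage": 22, "lesbian": 18, "happy": 5}
print(round(1 - cosine(old, new), 3))  # high score: strong distributional change
```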

195 citations


Proceedings ArticleDOI
20 Jun 2011
TL;DR: The insights gained from analysis enable building a novel distance function between images assessing whether they are from the same basic-level category, which goes beyond direct visual distance as it also exploits semantic similarity measured through ImageNet.
Abstract: Many computer vision approaches take for granted positive answers to questions such as “Are semantic categories visually separable?” and “Is visual similarity correlated to semantic similarity?”. In this paper, we study experimentally whether these assumptions hold and show parallels to questions investigated in cognitive science about the human visual system. The insights gained from our analysis enable building a novel distance function between images assessing whether they are from the same basic-level category. This function goes beyond direct visual distance as it also exploits semantic similarity measured through ImageNet. We demonstrate experimentally that it outperforms purely visual distances.

Proceedings ArticleDOI
22 Jun 2011
TL;DR: A model-theoretical approach for semantic data compression and reliable semantic communication is investigated and it is shown that Shannon's source and channel coding theorems have semantic counterparts.
Abstract: This paper studies methods of quantitatively measuring semantic information in communication. We review existing work on quantifying semantic information, then investigate a model-theoretical approach for semantic data compression and reliable semantic communication. We relate our approach to the statistical measurement of information by Shannon, and show that Shannon's source and channel coding theorems have semantic counterparts.

Journal ArticleDOI
TL;DR: It is found that an information-theoretical redefinition of well-known semantic measures and similarity coefficients, together with an intrinsic estimation of concept IC, results in noticeable improvements in their accuracy, yielding new semantic similarity measures expressed in terms of concept Information Content.

Journal Article
TL;DR: To cope with the ubiquitous problems of subjectivity and inconsistency in multi-media similarity, this work develops graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure.
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multi-media similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure.

Proceedings Article
07 Aug 2011
TL;DR: A novel method is introduced for measuring semantic relatedness using semantic profiles constructed from salient encyclopedic features, built on the notion that the meaning of a word can be characterized by the salient concepts found in its immediate context.
Abstract: This paper introduces a novel method for measuring semantic relatedness using semantic profiles constructed from salient encyclopedic features. The model is built on the notion that the meaning of a word can be characterized by the salient concepts found in its immediate context. In addition to being computationally efficient, the new model has superior performance and remarkable consistency when compared to both knowledge-based and corpus-based state-of-the-art semantic relatedness models.

Journal ArticleDOI
TL;DR: It is argued that the amount of perceptual and other semantic information that can be learned from purely distributional statistics has been underappreciated and that future focus should be on understanding the cognitive mechanisms humans use to integrate the two sources.
Abstract: Since their inception, distributional models of semantics have been criticized as inadequate cognitive theories of human semantic learning and representation. A principal challenge is that the representations derived by distributional models are purely symbolic and are not grounded in perception and action; this challenge has led many to favor feature-based models of semantic representation. We argue that the amount of perceptual and other semantic information that can be learned from purely distributional statistics has been underappreciated. We compare the representations of three feature-based and nine distributional models using a semantic clustering task. Several distributional models demonstrated semantic clustering comparable with clustering based on feature-based representations. Furthermore, when trained on child-directed speech, the same distributional models perform as well as sensorimotor-based feature representations of children's lexical semantic knowledge. These results suggest that, to a large extent, information relevant for extracting semantic categories is redundantly coded in perceptual and linguistic experience. Detailed analyses of the semantic clusters of the feature-based and distributional models also reveal that the models make use of complementary cues to semantic organization from the two data streams. Rather than conceptualizing feature-based and distributional models as competing theories, we argue that future focus should be on understanding the cognitive mechanisms humans use to integrate the two sources.

Journal ArticleDOI
01 Apr 2011
TL;DR: The results show that SyMSS outperforms state-of-the-art methods in terms of rank correlation with human intuition, thus proving the importance of syntactic information in sentence semantic similarity computation.
Abstract: Sentence and short-text semantic similarity measures are becoming an important part of many natural language processing tasks, such as text summarization and conversational agents. This paper presents SyMSS, a new method for computing short-text and sentence semantic similarity. The method is based on the notion that the meaning of a sentence is made up of not only the meanings of its individual words, but also the structural way the words are combined. Thus, SyMSS captures and combines syntactic and semantic information to compute the semantic similarity of two sentences. Semantic information is obtained from a lexical database. Syntactic information is obtained through a deep parsing process that finds the phrases in each sentence. With this information, the proposed method measures the semantic similarity between concepts that play the same syntactic role. Psychological plausibility is added to the method by using previous findings about how humans weight different syntactic roles when computing semantic similarity. The results show that SyMSS outperforms state-of-the-art methods in terms of rank correlation with human intuition, thus proving the importance of syntactic information in sentence semantic similarity computation.

Journal ArticleDOI
TL;DR: A model to relate keywords based on their semantic relationship and define similarity functions to quantify the similarity between a pair of users is developed and it is concluded that direct friends are more similar than any other user pair.
Abstract: How do two people become friends? What role does homophily play in bringing two people closer to help them forge friendship? Is the similarity between two friends different from the similarity between any two people? How does the similarity between a friend of a friend compare to similarity between direct friends? In this work, our goal is to answer these questions. We study the relationship between semantic similarity of user profile entries and the social network topology. A user profile in an on-line social network is characterized by its profile entries. The entries are termed as user keywords. We develop a model to relate keywords based on their semantic relationship and define similarity functions to quantify the similarity between a pair of users. First, we present a ‘forest model’ to categorize keywords across multiple categorization trees and define the notion of distance between keywords. Second, we use the keyword distance to define similarity functions between a pair of users. Third, we analyze a set of Facebook data according to the model to determine the effect of homophily in on-line social networks. Based on our evaluations, we conclude that direct friends are more similar than any other user pair. However, the more striking observation is that except for direct friends, similarities between users are approximately equal, irrespective of the topological distance between them.

Journal ArticleDOI
TL;DR: Semantic saliency maps of real-world scenes, based on the semantic similarity of scene objects to the currently fixated object or the search target, are generated and reveal a preference for transitions to objects that were semantically similar to the currently inspected one.

Book ChapterDOI
30 Aug 2011
TL;DR: A proper metric to quantify process similarity based on behavioral profiles is introduced; it is grounded in the Jaccard coefficient and leverages behavioral relations between pairs of process model activities.
Abstract: With the increasing influence of Business Process Management, large process model repositories emerged in enterprises and public administrations. Their effective utilization requires meaningful and efficient capabilities to search for models that go beyond text based search or folder navigation, e.g., by similarity. Existing measures for process model similarity are often not applicable for efficient similarity search, as they lack metric features. In this paper, we introduce a proper metric to quantify process similarity based on behavioral profiles. It is grounded in the Jaccard coefficient and leverages behavioral relations between pairs of process model activities. The metric is successfully evaluated towards its approximation of human similarity assessment.
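A minimal sketch of the underlying idea (a Jaccard coefficient over behavioral relations shared by two process models; not the paper's exact metric), with a behavioral profile modeled here as a set of (activity, relation, activity) triples:

```python
# Jaccard-style similarity of two process models represented by their
# behavioral relations, e.g. strict order ('->') and exclusiveness ('+').

def profile_similarity(profile_a, profile_b):
    union = profile_a | profile_b
    return len(profile_a & profile_b) / len(union) if union else 1.0

model_1 = {("register", "->", "check"), ("check", "->", "decide"),
           ("approve", "+", "reject")}
model_2 = {("register", "->", "check"), ("check", "->", "decide"),
           ("approve", "->", "archive")}
print(profile_similarity(model_1, model_2))  # 2 shared relations / 4 distinct = 0.5
```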

Journal ArticleDOI
TL;DR: This article presents a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content that performs better than the traditional edge-counting approach.
Abstract: This article presents a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content. Experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge-counting approach. The article presents algorithms that take advantage of taxonomic similarity in resolving syntactic and semantic ambiguity, along with experimental results demonstrating their effectiveness.
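The measure (the information content of the most informative common subsumer) is available, for example, through NLTK's WordNet interface; the snippet below assumes the 'wordnet' and 'wordnet_ic' NLTK data packages are installed.

```python
# Shared-information-content similarity: sim(c1, c2) = IC(lcs(c1, c2)),
# where IC is estimated from corpus frequencies (here the Brown corpus).

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
dog, cat, car = wn.synset('dog.n.01'), wn.synset('cat.n.01'), wn.synset('car.n.01')

print(dog.res_similarity(cat, brown_ic))  # high: the shared subsumer (carnivore) is specific
print(dog.res_similarity(car, brown_ic))  # low: the shared subsumer is very general
```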

Journal ArticleDOI
TL;DR: A method is proposed in which the quality of the resulting taxonomy is assessed manually and its coverage is compared automatically with ResearchCyc, one of the largest manually created ontologies, and with the lexical database WordNet, showing that the taxonomy compares favorably in quality and coverage with broad-coverage manually created resources.

Patent
01 Feb 2011
TL;DR: In this paper, the authors provide ontology mapping algorithms and concept weighting algorithms that create accurate semantic tags that can be used to improve enterprise content management, and search for better knowledge management and collaboration.
Abstract: Systems and methods are disclosed that perform automated semantic tagging. Automated semantic tagging produces semantically linked tags for a given text content. Embodiments provide ontology mapping algorithms and concept weighting algorithms that create accurate semantic tags that can be used to improve enterprise content management, and search for better knowledge management and collaboration.

Journal ArticleDOI
TL;DR: This work introduces a framework to specify the semantics of similarity, and discusses similarity-based information retrieval paradigms as well as their implementation in web-based user interfaces for geographic information retrieval to demonstrate the applicability of the framework.
Abstract: Similarity measures have a long tradition in fields such as information retrieval, artificial intelligence, and cognitive science. Within the last years, these measures have been extended and reused to measure semantic similarity; i.e., for comparing meanings rather than syntactic differences. Various measures for spatial applications have been developed, but a solid foundation for answering what they measure; how they are best applied in information retrieval; which role contextual information plays; and how similarity values or rankings should be interpreted is still missing. It is therefore difficult to decide which measure should be used for a particular application or to compare results from different similarity theories. Based on a review of existing similarity measures, we introduce a framework to specify the semantics of similarity. We discuss similarity-based information retrieval paradigms as well as their implementation in web-based user interfaces for geographic information retrieval to demonstrate the applicability of the framework. Finally, we formulate open challenges for similarity research.

Journal ArticleDOI
TL;DR: This communication provides an introduction, an example, and pointers to relevant software, and summarizes the choices that can be made by the analyst, so that visualization (“semantic mapping”) is made more accessible.

Journal ArticleDOI
TL;DR: It was found that number of features and contexts consistently facilitated word recognition but that the effects of semantic neighborhood density and number of associates were less robust, findings which point to how the effects of different semantic dimensions are selectively and adaptively modulated by task-specific demands.
Abstract: Evidence from large-scale studies (Pexman, Hargreaves, Siakaluk, Bodner, & Pope, 2008) suggests that semantic richness, a multidimensional construct reflecting the extent of variability in the information associated with a word's meaning, facilitates visual word recognition. Specifically, recognition is better for words that (1) have more semantic neighbors, (2) possess referents with more features, and (3) are associated with more contexts. The present study extends Pexman et al. (2008) by examining how two additional measures of semantic richness, number of senses and number of associates (Pexman, Hargreaves, Edwards, Henry, & Goodyear, 2007), influence lexical decision, speeded pronunciation, and semantic classification performance, after controlling for an array of lexical and semantic variables. We found that number of features and contexts consistently facilitated word recognition but that the effects of semantic neighborhood density and number of associates were less robust. Words with more senses also elicited faster lexical decisions but less accurate semantic classifications. These findings point to how the effects of different semantic dimensions are selectively and adaptively modulated by task-specific demands.

Journal ArticleDOI
TL;DR: An overview of the Watson system, a Semantic Web search engine providing various functionalities not only to find and locate ontologies and semantic data online, but also to explore the content of these semantic documents.
Abstract: In this tool report, we present an overview of the Watson system, a Semantic Web search engine providing various functionalities not only to find and locate ontologies and semantic data online, but also to explore the content of these semantic documents. Beyond the simple facade of a search engine for the Semantic Web, we show that the availability of such a component brings new possibilities in terms of developing semantic applications that exploit the content of the Semantic Web. Indeed, Watson provides a set of APIs containing high-level functions for finding, exploring and querying semantic data and ontologies that have been published online. Thanks to these APIs, new applications have emerged that connect activities such as ontology construction, matching, sense disambiguation and question answering to the Semantic Web, developed by our group and others. In addition, we describe Watson as an unprecedented research platform for the study of the Semantic Web, and of formalised knowledge in general.

Proceedings ArticleDOI
24 Jul 2011
TL;DR: Two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR) are presented.
Abstract: This paper presents two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR). Assuming that a query is parallel to the titles of the documents clicked on for that query, large amounts of query-title pairs are constructed from clickthrough data; two latent semantic models are learned from this data. One is a bilingual topic model within the language modeling framework. It ranks documents for a query by the likelihood of the query being a semantics-based translation of the documents. The semantic representation is language independent and learned from query-title pairs, with the assumption that a query and its paired titles share the same distribution over semantic topics. The other is a discriminative projection model within the vector space modeling framework. Unlike Latent Semantic Analysis and its variants, the projection matrix in our model, which is used to map from term vectors into semantic space, is learned discriminatively such that the distance between a query and its paired title, both represented as vectors in the projected semantic space, is smaller than that between the query and the titles of other documents which have no clicks for that query. These models are evaluated on the Web search task using a real world data set. Results show that they significantly outperform their corresponding baseline models, which are state-of-the-art.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed approach outperforms TF-IDF in cases where the amount of training data is small or the content of documents is focused on well-defined categories, and that it compares favorably with two previous studies.
Abstract: Traditional term weighting schemes in text categorization, such as TF-IDF, only exploit the statistical information of terms in documents. Instead, in this paper, we propose a novel term weighting scheme by exploiting the semantics of categories and indexing terms. Specifically, the semantics of categories are represented by senses of terms appearing in the category labels as well as the interpretation of them by WordNet. Also, the weight of a term is correlated to its semantic similarity with a category. Experimental results on three commonly used data sets show that the proposed approach outperforms TF-IDF in the cases that the amount of training data is small or the content of documents is focused on well-defined categories. In addition, the proposed approach compares favorably with two previous studies.
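A minimal sketch of the idea of tying a term's weight to its semantic similarity with a category (not the paper's exact weighting scheme): here the similarity is the best WordNet Wu-Palmer score between any sense of the term and any sense of the category label. It requires NLTK's WordNet data, and the example terms are arbitrary.

```python
# Scale a raw term frequency by the term's WordNet similarity to the category
# label, so terms semantically close to the category get larger weights.

from nltk.corpus import wordnet as wn

def term_category_similarity(term, category):
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(term)
              for s2 in wn.synsets(category)]
    return max(scores, default=0.0)

def semantic_weight(tf, term, category):
    return tf * term_category_similarity(term, category)

print(semantic_weight(3, "goalkeeper", "sport"))
print(semantic_weight(3, "inflation", "sport"))
```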

Book ChapterDOI
28 Jun 2011
TL;DR: This work focuses on the investigation of a vocabulary-independent natural language query mechanism for Linked Data, using an approach based on the combination of entity search, a Wikipedia-based semantic relatedness measure and spreading activation.
Abstract: Linked Data brings the promise of incorporating a new dimension to the Web where the availability of Web-scale data can determine a paradigmatic transformation of the Web and its applications. However, together with its opportunities, Linked Data brings inherent challenges in the way users and applications consume the available data. Users consuming Linked Data on the Web, or on corporate intranets, should be able to search and query data spread over potentially a large number of heterogeneous, complex and distributed datasets. Ideally, a query mechanism for Linked Data should abstract users from the representation of data. This work focuses on the investigation of a vocabulary-independent natural language query mechanism for Linked Data, using an approach based on the combination of entity search, a Wikipedia-based semantic relatedness measure and spreading activation. The combination of these three elements in a query mechanism for Linked Data is a new contribution in this space. The Wikipedia-based relatedness measure addresses limitations of existing works which are based on similarity measures or term expansion based on WordNet. Experimental results using the query mechanism to answer 50 natural language queries over DBPedia achieved a mean reciprocal rank of 61.4%, an average precision of 48.7% and average recall of 57.2%, answering 70% of the queries.