
Showing papers on "Semantic similarity published in 2008"


Book ChapterDOI
28 Jan 2008
TL;DR: In this article, the authors present a semantic theory of pronominal anaphora, which combines a definition of truth with a systematic account of semantic representations, and is informed by the conviction that the mechanisms which govern deictic and anaphoric occurrences of pronouns are basically the same.
Abstract: Two conceptions of meaning have dominated formal semantics of natural language. The first of these sees meaning principally as that which determines conditions of truth. This notion has inspired the disciplines of truth theoretic and model-theoretic semantics. According to the second conception meaning is, first and foremost, that which a language user grasps when he understands the words he hears or reads. The two conceptions have remained separated, and the semantic theory presented in this chapter is an attempt to remove this obstacle. The theory combines a definition of truth with a systematic account of semantic representations. The analysis of pronominal anaphora is informed by the conviction that the mechanisms which govern deictic and anaphoric occurrences of pronouns are basically the same. Keywords: anaphoric; language; pronoun; semantic representation; Theory of Truth

1,769 citations


01 Jan 2008
TL;DR: This paper describes a new technique for obtaining measures of semantic relatedness that uses Wikipedia to provide structured world knowledge about the terms of interest using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content.
Abstract: This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Our approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter.
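The hyperlink-based idea can be sketched as a Normalized Google Distance-style computation over the sets of articles linking to each term's article. The sketch below is illustrative only; the function name and exact normalization are assumptions, not necessarily the paper's formulation:

```python
import math

def link_relatedness(links_a, links_b, total_articles):
    """Relatedness of two articles from the sets of articles that link to
    them, in the style of a Normalized Google Distance over hyperlinks.
    Illustrative sketch; the paper's exact formulation may differ."""
    a, b = set(links_a), set(links_b)
    common = a & b
    if not common:
        return 0.0  # no shared inbound links: treat as unrelated
    distance = (math.log(max(len(a), len(b))) - math.log(len(common))) / (
        math.log(total_articles) - math.log(min(len(a), len(b))))
    return max(0.0, 1.0 - distance)
```

Articles sharing most of their inbound links score near 1.0; articles with disjoint link sets score 0.0.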

787 citations


Journal ArticleDOI
TL;DR: A method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence string matching algorithm is presented.
Abstract: We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Existing methods for computing text similarity have focused mainly on either large documents or individual words. We focus on computing the similarity between two sentences or two short paragraphs. The proposed method can be exploited in a variety of applications involving textual knowledge representation and knowledge discovery. Evaluation results on two different data sets show that our method outperforms several competing methods.
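The normalized LCS component can be sketched at the word level as follows. This uses one common normalization (LCS length squared over the product of the sentence lengths) and omits the corpus-based word-similarity component that the full method combines it with:

```python
def lcs_length(xs, ys):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def normalized_lcs(s1, s2):
    """LCS over word tokens, normalized by both sentence lengths.
    A common normalization; the paper uses a modified variant."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    l = lcs_length(w1, w2)
    return l * l / (len(w1) * len(w2))
```

Identical sentences score 1.0; sentences with no words in common score 0.0.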

519 citations


Journal ArticleDOI
TL;DR: A systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations.
Abstract: Several semantic similarity measures have been applied to gene products annotated with Gene Ontology terms, providing a basis for their functional comparison. However, it is still unclear which is the best approach to semantic similarity in this context, since there is no conclusive evaluation of the various measures. Another issue is whether or not electronic annotations should be used in semantic similarity calculations. We conducted a systematic evaluation of GO-based semantic similarity measures using the relationship with sequence similarity as a means to quantify their performance, and assessed the influence of electronic annotations by testing the measures in the presence and absence of these annotations. We verified that the relationship between semantic and sequence similarity is not linear, but can be well approximated by a rescaled Normal cumulative distribution function. Given that the majority of the semantic similarity measures capture an identical behaviour, but differ in resolution, we used the latter as the main criterion of evaluation. This work has provided a basis for the comparison of several semantic similarity measures, and can aid researchers in choosing the most adequate measure for their work. We have found that the hybrid simGIC was the measure with the best overall performance, followed by Resnik's measure using a best-match average combination approach. We have also found that the average and maximum combination approaches are problematic since both are inherently influenced by the number of terms being combined. We suspect that there may be a direct influence of data circularity in the behaviour of the results including electronic annotations, as a result of functional inference from sequence similarity.

339 citations


Book ChapterDOI
26 Oct 2008
TL;DR: Several measures of tag similarity are analyzed and a semantic grounding is provided by mapping pairs of similar tags in the folksonomy to pairs of synsets in Wordnet, where validated measures of semantic distance characterize the semantic relation between the mapped tags.
Abstract: Collaborative tagging systems have nowadays become important data sources for populating semantic web applications. For tasks like synonym detection and discovery of concept hierarchies, many researchers introduced measures of tag similarity. Even though most of these measures appear very natural, their design often seems to be rather ad hoc, and the underlying assumptions on the notion of similarity are not made explicit. A more systematic characterization and validation of tag similarity in terms of formal representations of knowledge is still lacking. Here we address this issue and analyze several measures of tag similarity: Each measure is computed on data from the social bookmarking system del.icio.us and a semantic grounding is provided by mapping pairs of similar tags in the folksonomy to pairs of synsets in Wordnet, where we use validated measures of semantic distance to characterize the semantic relation between the mapped tags. This exposes important features of the investigated similarity measures and indicates which ones are better suited in the context of a given semantic application.

237 citations


Journal ArticleDOI
TL;DR: A class of graph similarity measures that uses the structural similarity of local neighborhoods to derive pairwise similarity scores for the nodes of two different graphs are outlined and a related similarity measure that uses a linear update to generate both node and edge similarity scores is presented.

230 citations


Journal ArticleDOI
TL;DR: The experiments suggest that term overlap can serve as a simple and fast alternative to other approaches which use explicit information content estimation or require complex pre-calculations, while also avoiding problems that some other measures may encounter.
Abstract: The availability of various high-throughput experimental and computational methods allows biologists to rapidly infer functional relationships between genes. It is often necessary to evaluate these predictions computationally, a task that requires a reference database for functional relatedness. One such reference is the Gene Ontology (GO). A number of groups have suggested that the semantic similarity of the GO annotations of genes can serve as a proxy for functional relatedness. Here we evaluate a simple measure of semantic similarity, term overlap (TO). We computed the TO for randomly selected gene pairs from the mouse genome. For comparison, we implemented six previously reported semantic similarity measures that share the feature of using computation of probabilities of terms to infer information content, in addition to three vector based approaches and a normalized version of the TO measure. We find that the overlap measure is highly correlated with the others but differs in detail. TO is at least as good a predictor of sequence similarity as the other measures. We further show that term overlap may avoid some problems that affect the probability-based measures. Term overlap is also much faster to compute than the information content-based measures. Our experiments suggest that term overlap can serve as a simple and fast alternative to other approaches which use explicit information content estimation or require complex pre-calculations, while also avoiding problems that some other measures may encounter.
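Term overlap itself is a one-line computation over the two genes' annotation sets; a minimal sketch follows. Note that in practice annotation sets are usually first extended with all ancestor terms in the GO graph, which is omitted here:

```python
def term_overlap(terms_a, terms_b):
    """Term overlap (TO): number of GO terms shared by two genes'
    annotation sets."""
    return len(set(terms_a) & set(terms_b))

def normalized_term_overlap(terms_a, terms_b):
    """A normalized variant: shared terms divided by the smaller set."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```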

216 citations


Proceedings Article
13 Jul 2008
TL;DR: It is shown that Wiktionary is the best lexical semantic resource in the ranking task and performs comparably to other resources in the word choice task, and the concept vector based approach yields the best results on all datasets in both evaluations.
Abstract: We introduce Wiktionary as an emerging lexical semantic resource that can be used as a substitute for expert-made resources in AI applications. We evaluate Wiktionary on the pervasive task of computing semantic relatedness for English and German by means of correlation with human rankings and solving word choice problems. For the first time, we apply a concept vector based measure to a set of different concept representations like Wiktionary pseudo glosses, the first paragraph of Wikipedia articles, English WordNet glosses, and GermaNet pseudo glosses. We show that: (i) Wiktionary is the best lexical semantic resource in the ranking task and performs comparably to other resources in the word choice task, and (ii) the concept vector based approach yields the best results on all datasets in both evaluations.

175 citations


Proceedings ArticleDOI
23 Oct 2008
TL;DR: This paper proposes Social Ranking, a method that exploits recommender system techniques to increase the efficiency of searches within Web 2.0, and proposes a mechanism to answer a user's query that ranks content based on the inferred semantic distance of the query to the tags associated to such content, weighted by the similarity of the querying user to the users who created those tags.
Abstract: Social (or folksonomic) tagging has become a very popular way to describe, categorise, search, discover and navigate content within Web 2.0 websites. Unlike taxonomies, which impose a hierarchical categorisation of content, folksonomies empower end users by enabling them to freely create and choose the categories (in this case, tags) that best describe some content. However, as tags are informally defined, continually changing, and ungoverned, social tagging has often been criticised for lowering, rather than increasing, the efficiency of searching, due to the prevalence of synonymy, homonymy and polysemy, as well as the heterogeneity of users and the noise they introduce. In this paper, we propose Social Ranking, a method that exploits recommender system techniques to increase the efficiency of searches within Web 2.0. We measure users' similarity based on their past tag activity. We infer tags' relationships based on their association to content. We then propose a mechanism to answer a user's query that ranks (recommends) content based on the inferred semantic distance of the query to the tags associated with such content, weighted by the similarity of the querying user to the users who created those tags. A thorough evaluation conducted on the CiteULike dataset demonstrates that Social Ranking improves coverage while not compromising on accuracy.
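The ranking step can be sketched schematically, assuming precomputed pairwise tag-relatedness and user-similarity tables. All names here are illustrative, and the multiplicative weighting is one plausible reading of the description, not the paper's exact scoring function:

```python
def social_rank(query_tags, tagged_items, tag_sim, user_sim, querying_user):
    """Rank tagged items for a query. Each item is (tagger, tags).
    Score = tag-to-query relatedness, weighted by the similarity of the
    querying user to the user who applied the tags. Illustrative sketch."""
    def score(item):
        tagger, tags = item
        relatedness = sum(tag_sim.get((q, t), 0.0)
                          for q in query_tags for t in tags)
        return relatedness * user_sim.get((querying_user, tagger), 0.0)
    return sorted(tagged_items, key=score, reverse=True)
```

Content tagged by users similar to the querying user, with tags close to the query, surfaces first.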

171 citations


Journal ArticleDOI
TL;DR: This work provides the first set of norms for event concepts, a set of feature norms collected from approximately 280 participants for a total of 456 words, used in research addressing questions concerning the similarities and differences between the semantic representation of objects and events and in research concerning the interface between semantics and syntax.
Abstract: Semantic features produced by speakers of a language when given a word corresponding to a concept have provided insight into numerous behavioral phenomena concerning semantic representation in language-impaired and -unimpaired speakers. A number of theories concerning the organization of semantic memory have used features as their starting point. Here, we provide a set of feature norms collected from approximately 280 participants for a total of 456 words (169 nouns referring to objects, 71 nouns referring to events, and 216 verbs referring to events). Whereas a number of feature norms for object concepts already exist, we provide the first set of norms for event concepts. We have used these norms (for both objects and events) in research addressing questions concerning the similarities and differences between the semantic representation of objects and events and in research concerning the interface between semantics and syntax, given that events can be expressed in language as nouns or verbs. Some of this research is summarized here. These norms may be downloaded from www.psychonomic.org/archive.

170 citations



Journal ArticleDOI
TL;DR: This article reviews existing similarity measures in geometric, feature, network, alignment and transformational models, and evaluates the semantic similarity models with respect to the requirements for semantic similarity measurement between geospatial data.
Abstract: Semantic similarity is central to the functioning of semantically enabled processing of geospatial data. It is used to measure the degree of potential semantic interoperability between data or different geographic information systems (GIS). Similarity is essential for dealing with vague data queries, vague concepts or natural language, and is the basis for semantic information retrieval and integration. The choice of similarity measurement strongly influences the conceptual design and the functionality of a GIS. The goal of this article is to provide a survey of theories of semantic similarity measurement and to review how these approaches, originally developed as psychological models to explain human similarity judgment, can be used in geographic information science. According to their knowledge representation and notion of similarity, we classify existing similarity measures into geometric, feature, network, alignment and transformational models. The article reviews each of these models and outlines its notion of similarity and metric properties. Afterwards, we evaluate the semantic similarity models with respect to the requirements for semantic similarity measurement between geospatial data. The article concludes by comparing the similarity measures and giving general advice on how to choose an appropriate semantic similarity measure. Advantages and disadvantages point to their suitability for different tasks.

Journal ArticleDOI
TL;DR: The current findings lend support to spreading activation and feature overlap theories of priming, but do not support priming based upon contextual similarity as captured by LSA.
Abstract: The current study explores a set of variables that have the potential to predict semantic priming effects for 300 prime–target associates at the item level. Young and older adults performed either lexical decision (LDT) or naming tasks. A multiple regression procedure was used to predict priming based upon prime characteristics, target characteristics, and prime–target semantic similarity. Results indicate that semantic priming (a) can be reliably predicted at an item level; (b) is equivalent in magnitude across standardized measures of priming in LDTs and naming tasks; (c) is greater following quickly recognized primes; (d) is greater in LDTs for targets that produce slow lexical decision latencies; (e) is greater for pairs high in forward associative strength across tasks and across stimulus onset asynchronies (SOAs); (f) is greater for pairs high in backward associative strength in both tasks, but only at a long SOA; and (g) does not vary as a function of estimates from latent semantic analysis (LSA). ...

Journal ArticleDOI
TL;DR: Derivational morphological effects in masked priming seem to be primarily driven by morphological decomposability at an early stage of visual word recognition, and are independent of semantic factors.
Abstract: The role of morphological, semantic, and form-based factors in the early stages of visual word recognition was investigated across different SOAs in a masked priming paradigm, focusing on English derivational morphology. In a first set of experiments, stimulus pairs co-varying in morphological decomposability and in semantic and orthographic relatedness were presented at three SOAs (36, 48, and 72 ms). No effects of orthographic relatedness were found at any SOA. Semantic relatedness did not interact with effects of morphological decomposability, which came through strongly at all SOAs, even for pseudo-suffixed pairs such as archer-arch. Derivational morphological effects in masked priming seem to be primarily driven by morphological decomposability at an early stage of visual word recognition, and are independent of semantic factors. A second experiment reversed the order of prime and target (stem-derived rather than derived-stem), and again found that morphological priming did not interact with semantic relatedness. This points to an early segmentation process that is driven by morphological decomposability and not by the structure or content of central lexical representations.

Book ChapterDOI
01 Sep 2008
TL;DR: This paper explores the use of a semantic relatedness measure between words, that uses the Web as knowledge source, and defines a new semanticrelatedness measure among ontology terms that fulfils the above mentioned desirable properties to be used on the Semantic Web.
Abstract: Semantic relatedness measures quantify the degree to which some words or concepts are related, considering not only similarity but any possible semantic relationship among them. Relatedness computation is of great interest in different areas, such as Natural Language Processing, Information Retrieval, or the Semantic Web. Different methods have been proposed in the past; however, current relatedness measures lack some desirable properties for a new generation of Semantic Web applications: maximum coverage, domain independence, and universality. In this paper, we explore the use of a semantic relatedness measure between words that uses the Web as knowledge source. This measure exploits the information about frequencies of use provided by existing search engines. Furthermore, taking this measure as basis, we define a new semantic relatedness measure among ontology terms. The proposed measure fulfils the above mentioned desirable properties to be used on the Semantic Web. We have tested this semantic measure extensively to show that it correlates well with human judgment, and helps in solving some particular tasks, such as word sense disambiguation or ontology matching.
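A frequency-based web relatedness measure of this general kind can be sketched with Normalized Google Distance-style arithmetic over page counts. This is an illustrative stand-in, not the paper's exact formula, and the exponential mapping from distance to relatedness is one common convention:

```python
import math

def normalized_web_distance(hits_x, hits_y, hits_xy, total_pages):
    """NGD-style distance from search-engine page counts: hits_x, hits_y
    are pages containing each term, hits_xy pages containing both.
    Illustrative sketch of a web-frequency measure."""
    if hits_xy == 0:
        return float("inf")  # terms never co-occur: maximally distant
    fx, fy, fxy = map(math.log, (hits_x, hits_y, hits_xy))
    n = math.log(total_pages)
    return (max(fx, fy) - fxy) / (n - min(fx, fy))

def web_relatedness(hits_x, hits_y, hits_xy, total_pages):
    """Map the distance into (0, 1]; exp(-2d) is one common mapping."""
    d = normalized_web_distance(hits_x, hits_y, hits_xy, total_pages)
    return math.exp(-2 * d)
```

Terms that always co-occur score 1.0; terms that never co-occur score 0.0.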

Journal ArticleDOI
TL;DR: In this article, a probabilistic framework for interpreting similarity measures that directly correlates the similarity value to a quantitative expectation that two molecules will in fact be equipotent is presented, based on extensive benchmarking of 10 different similarity methods (MACCS keys, Daylight fingerprints, maximum common subgraphs, rapid overlay of chemical structures (ROCS) shape simil...
Abstract: A wide variety of computational algorithms have been developed that strive to capture the chemical similarity between two compounds for use in virtual screening and lead discovery. One limitation of such approaches is that, while a returned similarity value reflects the perceived degree of relatedness between any two compounds, there is no direct correlation between this value and the expectation or confidence that any two molecules will in fact be equally active. A lack of a common framework for interpretation of similarity measures also confounds the reliable fusion of information from different algorithms. Here, we present a probabilistic framework for interpreting similarity measures that directly correlates the similarity value to a quantitative expectation that two molecules will in fact be equipotent. The approach is based on extensive benchmarking of 10 different similarity methods (MACCS keys, Daylight fingerprints, maximum common subgraphs, rapid overlay of chemical structures (ROCS) shape simil...
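The fingerprint-based methods in such benchmarks (e.g. MACCS keys, Daylight fingerprints) are conventionally compared with the Tanimoto coefficient. The paper's contribution is the probabilistic calibration of such raw scores; the underlying coefficient itself is simply:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient over binary fingerprints,
    represented here as sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)
```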

Proceedings ArticleDOI
07 Apr 2008
TL;DR: The paper describes the inferencing requirements, challenges in supporting a sufficiently expressive set of RDFS/OWL constructs, and techniques adopted to build a scalable inference engine for Oracle semantic data store.
Abstract: Inference engines are an integral part of semantic data stores. In this paper, we describe our experience of implementing a scalable inference engine for Oracle semantic data store. This inference engine computes production rule based entailment of one or more RDFS/OWL encoded semantic data models. The inference engine capabilities include (i) inferencing based on semantics of RDFS/OWL constructs and user-defined rules, (ii) computing ancillary information (namely, semantic distance and proof) for inferred triples, and (iii) validation of semantic data model based on RDFS/OWL semantics. A unique aspect of our approach is that the inference engine is implemented entirely as a database application on top of Oracle database. The paper describes the inferencing requirements, challenges in supporting a sufficiently expressive set of RDFS/OWL constructs, and techniques adopted to build a scalable inference engine. A performance study conducted using both native and synthesized semantic datasets demonstrates the effectiveness of our approach.

Proceedings ArticleDOI
20 Jul 2008
TL;DR: The problem of bridging the semantic gap between low-level image features and high-level semantic concepts, which is the key hindrance in content-based image retrieval, is studied and a ranking-based distance metric learning method is proposed.
Abstract: We study in this paper the problem of bridging the semantic gap between low-level image features and high-level semantic concepts, which is the key hindrance in content-based image retrieval. Piloted by the rich textual information of Web images, the proposed framework tries to learn a new distance measure in the visual space, which can be used to retrieve more semantically relevant images for any unseen query image. The framework differs from traditional distance metric learning methods in the following ways. 1) A ranking-based distance metric learning method is proposed for the image retrieval problem, by optimizing the leave-one-out retrieval performance on the training data. 2) To be scalable, millions of images together with rich textual information have been crawled from the Web to learn the similarity measure, and the learning framework particularly considers the indexing problem to ensure retrieval efficiency. 3) To alleviate the noise in the unbalanced labels of images and fully utilize the textual information, a Latent Dirichlet Allocation based topic-level text model is introduced to define pairwise semantic similarity between any two images. The learnt distance measure can be directly applied to applications such as content-based image retrieval and search-based image annotation. Experimental results on the two applications in a two-million Web image database show both the effectiveness and efficiency of the proposed framework.

Proceedings ArticleDOI
13 Dec 2008
TL;DR: A new model of IC is presented, which relies on hierarchical structure alone and considers not only the hyponyms of each word sense but also its depth in the structure.
Abstract: Information Content (IC) is an important dimension in assessing the semantic similarity between two terms or word senses. The conventional method of obtaining the IC of word senses is to combine knowledge of their hierarchical structure from an ontology like WordNet with actual usage in text as derived from a large corpus. In this paper, a new model of IC is presented which relies on hierarchical structure alone. The model considers not only the hyponyms of each word sense but also its depth in the structure. The IC value is easier to calculate under our model, and when used as the basis of a similarity approach it yields judgments that correlate more closely with human assessments than approaches whose IC values are computed from hyponyms alone or derived from corpus analysis.

Proceedings ArticleDOI
25 Oct 2008
TL;DR: This paper presented a dependency-based semantic role labeling system for English that is integrated with a dependency parser, which achieved state-of-the-art performance on the CoNLL-2005 test set.
Abstract: We present a PropBank semantic role labeling system for English that is integrated with a dependency parser. To tackle the problem of joint syntactic-semantic analysis, the system relies on a syntactic and a semantic subcomponent. The syntactic model is a projective parser using pseudo-projective transformations, and the semantic model uses global inference mechanisms on top of a pipeline of classifiers. The complete syntactic-semantic output is selected from a candidate pool generated by the subsystems. We evaluate the system on the CoNLL-2005 test sets using segment-based and dependency-based metrics. Using the segment-based CoNLL-2005 metric, our system achieves a near state-of-the-art F1 figure of 77.97 on the WSJ+Brown test set, or 78.84 if punctuation is treated consistently. Using a dependency-based metric, the F1 figure of our system is 84.29 on the test set from CoNLL-2008. Our system is the first dependency-based semantic role labeler for PropBank that rivals constituent-based systems in terms of performance.

Journal ArticleDOI
TL;DR: In this precis of their book, Rogers and McClelland present a parallel distributed processing theory of the acquisition, representation, and use of human semantic knowledge, which proposes that semantic abilities arise from the flow of activation among simple, neuron-like processing units, as governed by the strengths of interconnecting weights.
Abstract: In this precis of our recent book, Semantic Cognition: A Parallel Distributed Processing Approach (Rogers & McClelland 2004), we present a parallel distributed processing theory of the acquisition, representation, and use of human semantic knowledge. The theory proposes that semantic abilities arise from the flow of activation among simple, neuron-like processing units, as governed by the strengths of interconnecting weights; and that acquisition of new semantic information involves the gradual adjustment of weights in the system in response to experience. These simple ideas explain a wide range of empirical phenomena from studies of categorization, lexical acquisition, and disordered semantic cognition. In this precis we focus on phenomena central to the reaction against similarity-based theories that arose in the 1980s and that subsequently motivated the "theory-theory" approach to semantic knowledge. Specifically, we consider (1) how concepts differentiate in early development, (2) why some groupings of items seem to form "good" or coherent categories while others do not, (3) why different properties seem central or important to different concepts, (4) why children and adults sometimes attest to beliefs that seem to contradict their direct experience, (5) how concepts reorganize between the ages of 4 and 10, and (6) the relationship between causal knowledge and semantic knowledge. The explanations our theory offers for these phenomena are illustrated with reference to a simple feed-forward connectionist model. The relationships between this simple model, the broader theory, and more general issues in cognitive science are discussed.

Journal ArticleDOI
TL;DR: In this article, the effects of irrelevant sound on category-exemplar recall are shown to be functionally distinct from those found in serial short-term memory, through their sensitivity to the lexical-semantic, rather than acoustic, properties of the sound and to between-sequence semantic similarity.

01 Jan 2008
TL;DR: The evaluations show that DISCO has a higher correlation with semantic similarities derived from WordNet than latent semantic analysis (LSA) and the web-based PMI-IR.
Abstract: This paper presents DISCO, a tool for retrieving the distributional similarity between two given words, and for retrieving the distributionally most similar words for a given word. Pre-computed word spaces are freely available for a number of languages including English, German, French and Italian, so DISCO can be used off the shelf. The tool is implemented in Java, provides a Java API, and can also be called from the command line. The performance of DISCO is evaluated by measuring the correlation with WordNet-based semantic similarities and with human relatedness judgements. The evaluations show that DISCO has a higher correlation with semantic similarities derived from WordNet than latent semantic analysis (LSA) and the web-based PMI-IR.
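A toy version of the underlying idea, distributional similarity as cosine over co-occurrence vectors, can be sketched as follows. DISCO itself uses large precomputed word spaces with syntactically informed features rather than this on-the-fly window counting:

```python
import math
from collections import Counter

def cooccurrence_vector(word, corpus_sentences, window=3):
    """Bag-of-context vector: counts of words co-occurring with `word`
    within +/- `window` positions. Toy sketch for illustration."""
    vec = Counter()
    for sent in corpus_sentences:
        toks = sent.lower().split()
        for i, t in enumerate(toks):
            if t == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v2.get(k, 0) for k, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Words that appear in the same contexts ("cat" and "dog" below) get vectors pointing in the same direction.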

Journal ArticleDOI
TL;DR: The results of a feature generation task in which the exemplars and labels of 15 semantic categories served as cues are described and the importance of the generated features was assessed by tallying the frequency with which they were generated and by obtaining judgments of their relevance.
Abstract: Features are at the core of many empirical and modeling endeavors in the study of semantic concepts. This article is concerned with the delineation of features that are important in natural language concepts and the use of these features in the study of semantic concept representation. The results of a feature generation task in which the exemplars and labels of 15 semantic categories served as cues are described. The importance of the generated features was assessed by tallying the frequency with which they were generated and by obtaining judgments of their relevance. The generated attributes also featured in extensive exemplar by feature applicability matrices covering the 15 different categories, as well as two large semantic domains (that of animals and artifacts). For all exemplars of the 15 semantic categories, typicality ratings, goodness ratings, goodness rank order, generation frequency, exemplar associative strength, category associative strength, estimated age of acquisition, word frequency, familiarity ratings, imageability ratings, and pairwise similarity ratings are described as well. By making these data easily available to other researchers in the field, we hope to provide ample opportunities for continued investigations into the nature of semantic concept representation. These data may be downloaded from the Psychonomic Society’s Archive of Norms, Stimuli, and Data, www.psychonomic.org/archive.

Journal ArticleDOI
TL;DR: A fuzzy sets based approach is used to develop attribute based prototype definitions of land cover classes to look at land cover changes as a semantic change evaluated through semantic similarity metrics.

Journal ArticleDOI
TL;DR: In this paper, a method for measuring the similarity of FCA concepts is presented, which is a refinement of a previous proposal of the author, which consists in determining similarity of concept descriptors (attributes) by using the information content approach, rather than relying on human domain expertise.
Abstract: Formal Concept Analysis (FCA) is proving useful in supporting difficult activities that are becoming fundamental to the development of the Semantic Web. Assessing concept similarity is one such activity, since it allows the identification of different concepts that are semantically close. In this paper, a method for measuring the similarity of FCA concepts is presented, which is a refinement of a previous proposal by the author. The refinement consists in determining the similarity of concept descriptors (attributes) by using the information content approach, rather than relying on human domain expertise. The adopted information content approach achieves a higher correlation with human judgement than other proposals in the literature for evaluating concept similarity in a taxonomy.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: This work focuses on weighted similarity functions like TF/IDF, and introduces variants that are well suited for set similarity selections in a relational database context that have special semantic properties that can be exploited to design very efficient index structures and algorithms for answering queries efficiently.
Abstract: Data collections often have inconsistencies that arise for a variety of reasons, and it is desirable to be able to identify and resolve them efficiently. Set similarity queries are commonly used in data cleaning for matching similar data. In this work we concentrate on set similarity selection queries: given a query set, retrieve all sets in a collection with similarity greater than some threshold. Various set similarity measures have been proposed in the past for data cleaning purposes. Here we focus on weighted similarity functions like TF/IDF and introduce variants that are well suited for set similarity selections in a relational database context. These variants have special semantic properties that can be exploited to design very efficient index structures and algorithms for answering queries efficiently. We present modifications of existing technologies to work for set similarity selection queries. We also introduce three novel algorithms based on the Threshold Algorithm that exploit the semantic properties of the new similarity measures to achieve the best performance in theory and practice.
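The weighted-similarity setting described above can be sketched as follows: a TF/IDF-style weight per token, a cosine-style weighted overlap between sets, and a naive linear scan answering a selection query. The function names and the exact weighting formula (`log(1 + N/df)`) are illustrative assumptions; the paper's index structures and Threshold-Algorithm variants are far more efficient than this scan:

```python
import math
from collections import Counter

def idf_weights(collection):
    """Compute an IDF-style weight for every token across a collection of sets."""
    n = len(collection)
    df = Counter(tok for s in collection for tok in s)  # document frequency
    return {tok: math.log(1 + n / c) for tok, c in df.items()}

def weighted_similarity(a, b, idf):
    """Weighted cosine similarity between two token sets under IDF weights."""
    norm_a = math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in a))
    norm_b = math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    overlap = sum(idf.get(t, 0.0) ** 2 for t in a & b)
    return overlap / (norm_a * norm_b)

def similarity_selection(query, collection, idf, threshold):
    """Selection query: all sets whose similarity to the query exceeds the threshold."""
    return [s for s in collection
            if weighted_similarity(query, s, idf) > threshold]
```

For example, against a collection of name sets, the query `{"ann", "smith"}` with a high threshold returns only the exact-match set, while rare shared tokens pull near-matches above lower thresholds.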

Book ChapterDOI
09 Nov 2008
TL;DR: Experimental evaluations using WordNet indicate that the proposed metric, coupled with the notion of intrinsic IC, yields results above the state of the art, and the intrinsic IC formulation also improves the accuracy of other IC based metrics.
Abstract: In many research fields such as Psychology, Linguistics, Cognitive Science, Biomedicine, and Artificial Intelligence, computing semantic similarity between words is an important issue. In this paper we present a new semantic similarity metric that exploits some notions of the early work done using a feature based theory of similarity, and translates it into the information theoretic domain, which leverages the notion of Information Content (IC). In particular, the proposed metric exploits the notion of intrinsic IC, which quantifies IC values by scrutinizing how concepts are arranged in an ontological structure. In order to evaluate this metric, we conducted an online experiment asking the community of researchers to rank a list of 65 word pairs. The experiment's web setup allowed us to collect 101 similarity ratings and to differentiate between native and non-native English speakers. Such a large and diverse dataset makes it possible to confidently evaluate similarity metrics by correlating them with human assessments. Experimental evaluations using WordNet indicate that our metric, coupled with the notion of intrinsic IC, yields results above the state of the art. Moreover, the intrinsic IC formulation also improves the accuracy of other IC based metrics. We implemented our metric and several others in the Java WordNet Similarity Library.
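The intrinsic-IC idea can be sketched on a toy is-a taxonomy: IC is derived purely from how many hyponyms a concept has (concepts with many descendants carry little information), and then plugged into a standard IC-based metric. Lin's formula is used here only as an example of such a metric; the `PARENT` taxonomy and function names are illustrative, while the paper's evaluation uses WordNet:

```python
import math

# Toy is-a taxonomy: child -> parent (None marks the root). Purely illustrative.
PARENT = {
    "entity": None,
    "animal": "entity", "artifact": "entity",
    "dog": "animal", "cat": "animal", "car": "artifact",
}

def descendants(node):
    """Count all transitive hyponyms of a node."""
    kids = [c for c, p in PARENT.items() if p == node]
    return len(kids) + sum(descendants(k) for k in kids)

def intrinsic_ic(node):
    """Intrinsic IC: 1 - log(hyponyms + 1) / log(total concepts).
    Leaves get IC 1; the root, subsuming everything, gets IC 0."""
    return 1.0 - math.log(descendants(node) + 1) / math.log(len(PARENT))

def ancestors(node):
    out = []
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out

def lcs(a, b):
    """Least common subsumer: the nearest shared ancestor."""
    anc_a = ancestors(a)
    for n in ancestors(b):
        if n in anc_a:
            return n

def lin_similarity(a, b):
    """Lin's IC-based similarity, with intrinsic IC in place of corpus counts."""
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2 * intrinsic_ic(lcs(a, b)) / denom if denom else 0.0
```

On this toy hierarchy, `dog`/`cat` (subsumed by `animal`) score higher than `dog`/`car` (subsumed only by the root), mirroring the intuition the intrinsic IC is meant to encode without any corpus statistics.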

Journal ArticleDOI
TL;DR: Data indicate that semantic interference can be observed when target picture naming latencies do not reflect the bottleneck at the level of lexical selection, supporting the view that the semantic interference effect arises at a postlexical level of processing.
Abstract: In 2 experiments participants named pictures of common objects with superimposed distractor words. In one naming condition, the pictures and words were presented simultaneously on every trial, and participants produced the target response immediately. In the other naming condition, the presentation of the picture preceded the presentation of the distractor by 1,000 ms, and participants delayed production of their naming response until distractor word presentation. Within each naming condition, the distractor words were either semantic category coordinates of the target pictures or unrelated. Orthogonal to this manipulation of semantic relatedness, the frequency of the pictures' names was manipulated. The authors observed semantic interference effects in both the immediate and delayed naming conditions but a frequency effect only in the immediate naming condition. These data indicate that semantic interference can be observed when target picture naming latencies do not reflect the bottleneck at the level of lexical selection. In the context of other findings from the picture-word interference paradigm, the authors interpret these data as supporting the view that the semantic interference effect arises at a postlexical level of processing.

Patent
07 Oct 2008
TL;DR: In this article, a plurality of corresponding pair similarity score values according to a first and at least a second classifier using electronic information sources is calculated to provide the overall semantic similarity score value between pairs of named entities in a text corpus.
Abstract: An overall semantic similarity score value between pairs of named entities in a text corpus is obtained by calculating, for at least one pair of named entities, a plurality of corresponding pair similarity score values according to a first and at least a second classifier using electronic information sources. Each pair similarity score value of the pair of named entities per classifier is normalized by calculating a rank list per classifier, for example, for each named entity. The rank list holds each pair of named entities of the text corpus, wherein the rank of each pair of named entities within the rank list reflects the respective pair similarity score value. Further, an arithmetic mean of the normalized pair similarity score values of each pair of named entities is calculated to provide the overall semantic similarity score value.
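The rank-list normalization and arithmetic-mean combination described in the abstract can be sketched as below. The function names are hypothetical, ties in raw scores are broken arbitrarily, and every classifier is assumed to score every pair:

```python
def rank_normalize(scores):
    """Map raw pair scores (pair -> score) to ranks scaled into (0, 1];
    a higher raw score yields a higher normalized rank."""
    ordered = sorted(scores, key=scores.get)  # ascending by score
    n = len(ordered)
    return {pair: (i + 1) / n for i, pair in enumerate(ordered)}

def combined_similarity(per_classifier_scores):
    """Arithmetic mean of rank-normalized scores across classifiers,
    giving the overall similarity per pair of named entities."""
    normalized = [rank_normalize(s) for s in per_classifier_scores]
    pairs = normalized[0].keys()
    return {p: sum(n[p] for n in normalized) / len(normalized) for p in pairs}
```

Rank normalization makes classifiers with incomparable score scales (e.g. probabilities vs. raw counts) commensurable before averaging, which is the point of the rank-list step.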