
Showing papers on "Semantic similarity published in 2005"


Proceedings ArticleDOI
25 Jun 2005
TL;DR: A meta-algorithm is applied, based on a metric labeling formulation of the rating-inference problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels.
Abstract: We address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). This task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". We first evaluate human performance at the task. Then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. We show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of SVMs when we employ a novel similarity measure appropriate to the problem.
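For a concrete feel for the metric-labeling idea, here is a minimal Python sketch (not the authors' algorithm or solver): each review keeps the label its base classifier prefers unless similar reviews pull it toward a nearby label, updated with a simple iterated-conditional-modes loop. All arrays, names, and parameter values below are illustrative.

```python
# Toy metric labeling: each item pays its base-classifier cost plus a penalty
# when similar items end up with distant labels. Iterated conditional modes
# is used here only for illustration.
import numpy as np

def metric_label(scores, sim, label_dist, alpha=1.0, iters=10):
    """scores: (n_items, n_labels) base-classifier preferences (higher = better).
    sim: (n_items, n_items) item-item similarity with a zero diagonal.
    label_dist: (n_labels, n_labels) distance between labels (e.g. |i - j|)."""
    labels = scores.argmax(axis=1)               # start from the base classifier
    for _ in range(iters):
        for i in range(len(labels)):
            # penalty of each candidate label for item i, given its neighbors
            neighbor_pen = (sim[i][:, None] * label_dist[labels]).sum(axis=0)
            cost = -scores[i] + alpha * neighbor_pen
            labels[i] = cost.argmin()
    return labels

# toy usage: 4 reviews, ratings 0..4, two of them near-duplicates
scores = np.random.rand(4, 5)
sim = np.zeros((4, 4)); sim[0, 1] = sim[1, 0] = 0.9
label_dist = np.abs(np.arange(5)[:, None] - np.arange(5)[None, :]).astype(float)
print(metric_label(scores, sim, label_dist))
```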

2,544 citations


Journal ArticleDOI
TL;DR: An automatic system for semantic role tagging trained on the corpus is described and the effect on its performance of various types of information is discussed, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty trace categories of the treebank.
Abstract: The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. The resulting resource can be thought of as shallow, in that it does not represent coreference, quantification, and many other higher-order phenomena, but also broad, in that it covers every instance of every verb in the corpus and allows representative statistics to be calculated. We discuss the criteria used to define the sets of semantic roles used in the annotation process and to analyze the frequency of syntactic/semantic alternations in the corpus. We describe an automatic system for semantic role tagging trained on the corpus and discuss the effect on its performance of various types of information, including a comparison of full syntactic parsing with a flat representation and the contribution of the empty "trace" categories of the treebank.

2,416 citations


Journal ArticleDOI
TL;DR: A simple model for semantic growth is described, in which each new word or concept is connected to an existing network by differentiating the connectivity pattern of an existing node, which generates appropriate small-world statistics and power-law connectivity distributions.
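A loose Python sketch of this growth-by-differentiation idea is given below, using networkx with invented parameter values; it is not the paper's exact model. Each new node picks an existing node in proportion to its connectivity and attaches to part of that node's neighborhood.

```python
# Grow a network by "differentiating" existing nodes: a new node copies part
# of the connectivity pattern of a node chosen in proportion to its degree.
import random
import networkx as nx

def grow_semantic_network(n_nodes=500, m=3, seed=1):
    random.seed(seed)
    g = nx.complete_graph(m + 1)                 # small seed network
    for new in range(m + 1, n_nodes):
        nodes, degs = zip(*g.degree())           # pick a node to differentiate
        target = random.choices(nodes, weights=degs, k=1)[0]
        pool = list(g.neighbors(target)) + [target]
        for nb in random.sample(pool, k=min(m, len(pool))):
            g.add_edge(new, nb)                  # attach to part of its neighborhood
    return g

g = grow_semantic_network()
degrees = [d for _, d in g.degree()]
print("average clustering:", nx.average_clustering(g))
print("max degree:", max(degrees), "mean degree:", sum(degrees) / len(degrees))
```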

1,224 citations


Proceedings Article
01 Jan 2005
TL;DR: An algorithm implementing semantic matching is presented, and its implementation within the S-Match system is discussed; the results, though preliminary, look promising, particularly with regard to precision and recall.
Abstract: We think of Match as an operator which takes two graph-like structures and produces a mapping between those nodes of the two graphs that correspond semantically to each other. Semantic matching is a novel approach where semantic correspondences are discovered by computing and returning, as a result, the semantic information implicitly or explicitly codified in the labels of nodes and arcs. In this paper we present an algorithm implementing semantic matching, and we discuss its implementation within the S-Match system. We also test S-Match against three state-of-the-art matching systems. The results, though preliminary, look promising, particularly with regard to precision and recall.

523 citations


Proceedings ArticleDOI
30 Jun 2005
TL;DR: A method that combines word-to-word similarity metrics into a text-to-text metric is introduced, and it is shown that this method outperforms the traditional text similarity metrics based on lexical matching.
Abstract: This paper presents a knowledge-based method for measuring the semantic similarity of texts. While there is a large body of previous work focused on finding the semantic similarity of concepts and words, the application of these word-oriented methods to text similarity has not yet been explored. In this paper, we introduce a method that combines word-to-word similarity metrics into a text-to-text metric, and we show that this method outperforms the traditional text similarity metrics based on lexical matching.
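A hedged sketch of the general recipe follows, combining a word-to-word WordNet measure into a text-to-text score; the paper additionally weights words by specificity (idf) and combines several word measures, which this toy version omits. It assumes NLTK with the WordNet corpus installed.

```python
# For each word in one text take its best WordNet match in the other text,
# then average in both directions to get a symmetric text-to-text score.
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def word_sim(w1, w2):
    """Best path similarity over all synset pairs (0.0 if none found)."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def text_sim(t1, t2):
    w1, w2 = t1.lower().split(), t2.lower().split()
    def directed(a, b):
        return sum(max(word_sim(x, y) for y in b) for x in a) / len(a)
    return 0.5 * (directed(w1, w2) + directed(w2, w1))

print(text_sim("the cat sat on the mat", "a dog lay on the rug"))
```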

378 citations


Proceedings ArticleDOI
04 Nov 2005
TL;DR: A novel information retrieval method is proposed that is capable of detecting similarities between documents containing semantically similar but not necessarily lexicographically similar terms.
Abstract: Semantic Similarity relates to computing the similarity between concepts which are not lexicographically similar. We investigate approaches to computing semantic similarity by mapping terms (concepts) to an ontology and by examining their relationships in that ontology. Some of the most popular semantic similarity methods are implemented and evaluated using WordNet as the underlying reference ontology. Building upon the idea of semantic similarity, a novel information retrieval method is also proposed. This method is capable of detecting similarities between documents containing semantically similar but not necessarily lexicographically similar terms. The proposed method has been evaluated in retrieval of images and documents on the Web. The experimental results demonstrated very promising performance improvements over state-of-the-art information retrieval methods.

368 citations


Journal ArticleDOI
TL;DR: fMRI data provide critical support for the hypothesis that concrete, imageable concepts activate perceptually based representations not available to abstract concepts, and are compatible with dual coding theory and less consistent with single-code models of conceptual representation.

316 citations


Proceedings ArticleDOI
10 May 2005
TL;DR: An approach that ranks results based on how predictable a result might be for users is presented, based on a relevance model SemRank, which is a rich blend of semantic and information-theoretic techniques with heuristics that supports the novel idea of modulative searches, where users may vary their search modes to effect changes in the ordering of results depending on their need.
Abstract: While the idea that querying mechanisms for complex relationships (otherwise known as Semantic Associations) should be integral to Semantic Web search technologies has recently gained some ground, the issue of how search results will be ranked remains largely unaddressed. Since it is expected that the number of relationships between entities in a knowledge base will be much larger than the number of entities themselves, the likelihood that Semantic Association searches would result in an overwhelming number of results for users is increased, therefore elevating the need for appropriate ranking schemes. Furthermore, it is unlikely that ranking schemes for ranking entities (documents, resources, etc.) may be applied to complex structures such as Semantic Associations. In this paper, we present an approach that ranks results based on how predictable a result might be for users. It is based on a relevance model, SemRank, which is a rich blend of semantic and information-theoretic techniques with heuristics that supports the novel idea of modulative searches, where users may vary their search modes to effect changes in the ordering of results depending on their need. We also present the infrastructure used in the SSARK system to support the computation of SemRank values for resulting Semantic Associations and their ordering.

291 citations


Proceedings ArticleDOI
06 Nov 2005
TL;DR: This paper proposes a novel approach that augments the classical annotation model with a generic knowledge base, WordNet, to prune irrelevant keywords, and shows that this augmentation improves annotation accuracy by removing irrelevant keywords.
Abstract: The development of technology generates huge amounts of non-textual information, such as images. An efficient image annotation and retrieval system is highly desired. Clustering algorithms make it possible to represent visual features of images with finite symbols. Based on this, many statistical models, which analyze correspondence between visual features and words and discover hidden semantics, have been published. These models improve the annotation and retrieval of large image databases. However, the current state of the art, including our previous work, produces too many irrelevant keywords for images during annotation. In this paper, we propose a novel approach that augments the classical model with a generic knowledge base, WordNet. Our approach strives to prune irrelevant keywords by using WordNet. To identify irrelevant keywords, we investigate various semantic similarity measures between keywords and fuse the outcomes of all these measures to make a final decision using Dempster-Shafer evidence combination. We have implemented various models to link visual tokens with keywords based on the WordNet knowledge base and evaluated performance using precision and recall on a benchmark dataset. The results show that by augmenting the classical model with knowledge-based pruning we can improve annotation accuracy by removing irrelevant keywords.
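The evidence-fusion step can be illustrated with a small, generic implementation of Dempster's rule of combination; the frames of discernment and mass values below are invented for illustration and are not taken from the paper.

```python
# Dempster's rule: multiply masses of compatible hypotheses, discard the mass
# assigned to conflicting (empty-intersection) pairs, and renormalize.
from itertools import product

def combine(m1, m2):
    """m1, m2: dicts mapping frozenset hypotheses to mass (each sums to 1)."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

REL, IRR = frozenset({"relevant"}), frozenset({"irrelevant"})
BOTH = REL | IRR                                   # ignorance
m_path = {REL: 0.6, IRR: 0.1, BOTH: 0.3}           # evidence from one similarity cue
m_lin  = {REL: 0.5, IRR: 0.2, BOTH: 0.3}           # evidence from another cue
print(combine(m_path, m_lin))
```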

268 citations


Journal ArticleDOI
TL;DR: The results suggest that the Resnik similarity measure outperforms the others and seems better suited for use in Gene Ontology, and deduce that there seems to be correlation between semantic similarity in the GO annotation and gene expression for the three GO ontologies.
Abstract: This research analyzes some aspects of the relationship between gene expression, gene function, and gene annotation. Many recent studies are implicitly based on the assumption that gene products that are biologically and functionally related would maintain this similarity both in their expression profiles as well as in their gene ontology (GO) annotation. We analyze how accurate this assumption proves to be using real publicly available data. We also aim to validate a measure of semantic similarity for GO annotation. We use the Pearson correlation coefficient and its absolute value as a measure of similarity between expression profiles of gene products. We explore a number of semantic similarity measures (Resnik, Jiang, and Lin) and compute the similarity between gene products annotated using the GO. Finally, we compute correlation coefficients to compare gene expression similarity against GO semantic similarity. Our results suggest that the Resnik similarity measure outperforms the others and seems better suited for use in gene ontology. We also deduce that there seems to be correlation between semantic similarity in the GO annotation and gene expression for the three GO ontologies. We show that this correlation is negligible up to a certain semantic similarity value; then, for higher similarity values, the relationship trend becomes almost linear. These results can be used to augment the knowledge provided by clustering algorithms and in the development of bioinformatic tools for finding and characterizing gene products.
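A minimal sketch of the comparison performed in the paper, assuming SciPy is available: absolute Pearson correlation between expression profiles on one side, GO-based semantic similarity on the other, then a correlation between the two similarity lists. The GO similarity numbers here are placeholders rather than computed Resnik/Jiang/Lin scores.

```python
# Compare expression-profile similarity with GO semantic similarity over the
# same gene pairs; the data below are fabricated for illustration.
import numpy as np
from scipy.stats import pearsonr

expr = np.random.rand(5, 20)          # 5 genes x 20 conditions (toy data)
pairs = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4)]

expr_sim = [abs(pearsonr(expr[i], expr[j])[0]) for i, j in pairs]
go_sim = [0.9, 0.2, 0.5, 0.7, 0.3]    # hypothetical semantic similarity scores

r, p = pearsonr(expr_sim, go_sim)
print(f"correlation between expression and GO similarity: r={r:.2f} (p={p:.2f})")
```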

242 citations


Proceedings ArticleDOI
25 Jun 2005
TL;DR: This paper suggests refinements for the Distributional Similarity Hypothesis by developing an inclusion testing algorithm for characteristic features of two words, which incorporates corpus and web-based feature sampling to overcome data sparseness.
Abstract: This paper suggests refinements for the Distributional Similarity Hypothesis. Our proposed hypotheses relate the distributional behavior of pairs of words to lexical entailment -- a tighter notion of semantic similarity that is required by many NLP applications. To automatically explore the validity of the defined hypotheses we developed an inclusion testing algorithm for characteristic features of two words, which incorporates corpus and web-based feature sampling to overcome data sparseness. The degree of hypotheses validity was then empirically tested and manually analyzed with respect to the word sense level. In addition, the above testing algorithm was exploited to improve lexical entailment acquisition.
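A toy rendering of the inclusion-testing idea with hand-picked feature sets: check what fraction of the narrower word's characteristic features are covered by the broader word's features. The paper's actual procedure samples features from a corpus and the web; everything below is illustrative.

```python
# Directed feature inclusion as a proxy for lexical entailment.
def inclusion(features_u, features_v):
    """Share of u's features that also appear among v's features."""
    if not features_u:
        return 0.0
    return len(features_u & features_v) / len(features_u)

dog_feats    = {"bark", "tail", "pet", "walk", "leash"}
animal_feats = {"bark", "tail", "pet", "walk", "fur", "wild", "zoo"}

# high coverage of 'dog' features by 'animal' features suggests dog => animal
print(inclusion(dog_feats, animal_feats))   # 0.8
print(inclusion(animal_feats, dog_feats))   # ~0.57, the reverse direction is weaker
```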

01 Jan 2005
TL;DR: An attempt to establish a ‘psychological ground truth’ for evaluating models of text document similarity is reported, assessing the ability of word-based, n-gram and Latent Semantic Analysis approaches to model human judgments.
Abstract: Modeling the semantic similarity between text documents presents a significant theoretical challenge for cognitive science, with ready-made applications in information handling and decision support systems dealing with text. While a number of candidate models exist, they have generally not been assessed in terms of their ability to emulate human judgments of similarity. To address this problem, we conducted an experiment that collected repeated similarity measures for each pair of documents in a small corpus of short news documents. An analysis of human performance showed inter-rater correlations of about 0.6. We then considered the ability of existing models, using word-based, n-gram and Latent Semantic Analysis (LSA) approaches, to model these human judgments. The best performed LSA model produced correlations of about 0.6, consistent with human performance, while the best performed word-based and n-gram models achieved correlations closer to 0.5. Many of the remaining models showed almost no correlation with human performance. Based on our results, we provide some discussion of the key strengths and weaknesses of the models we examined.

Proceedings ArticleDOI
10 May 2005
TL;DR: This work defines a new measure of the temporal correlation of two queries based on the correlation coefficient of their frequency functions, and develops a method of efficiently finding the highest correlated queries for a given input query using far less space and time than the naive approach.
Abstract: We investigate the idea of finding semantically related search engine queries based on their temporal correlation; in other words, we infer that two queries are related if their popularities behave similarly over time. To this end, we first define a new measure of the temporal correlation of two queries based on the correlation coefficient of their frequency functions. We then conduct extensive experiments using our measure on two massive query streams from the MSN search engine, revealing that this technique can discover a wide range of semantically similar queries. Finally, we develop a method of efficiently finding the highest correlated queries for a given input query using far less space and time than the naive approach, making real-time implementation possible.
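The core measure is the Pearson correlation coefficient between two query-frequency time series, as in the hedged sketch below; the daily counts are made up.

```python
# Pearson correlation of normalized query-frequency functions.
import numpy as np

def temporal_correlation(freq_a, freq_b):
    a, b = np.asarray(freq_a, float), np.asarray(freq_b, float)
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(a * b))

world_cup = [10, 12, 15, 40, 90, 85, 30, 12]   # hypothetical daily frequencies
soccer    = [ 8, 11, 14, 35, 80, 70, 25, 10]
weather   = [50, 48, 52, 49, 51, 50, 47, 53]

print(temporal_correlation(world_cup, soccer))   # close to 1: related queries
print(temporal_correlation(world_cup, weather))  # near 0: unrelated queries
```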

Proceedings ArticleDOI
10 May 2005
TL;DR: An information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology is defined, and an experimental study shows that this measure improves significantly on the traditional taxonomy-based approach.
Abstract: Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata --- namely topical directories --- to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.
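As a reference point, here is a sketch of the classic tree-based, information-theoretic (Lin-style) similarity that the paper generalizes to arbitrary ontology graphs; the toy topic taxonomy and page counts are invented.

```python
# Similarity of two topics = information content of their lowest common
# ancestor relative to their own information content (tree case only).
import math

# topic -> (parent, number of pages filed directly under the topic); toy numbers
taxonomy = {
    "Top":    (None, 1000),
    "Sports": ("Top", 300),
    "Soccer": ("Sports", 120),
    "Tennis": ("Sports", 80),
    "Arts":   ("Top", 200),
}

def ancestors(t):
    out = []
    while t is not None:
        out.append(t)
        t = taxonomy[t][0]
    return out

def prob(t):
    # probability mass of a topic = pages in its subtree / all pages
    total = sum(c for _, c in taxonomy.values())
    subtree = sum(c for name, (_, c) in taxonomy.items() if t in ancestors(name))
    return subtree / total

def lin_sim(a, b):
    lca = next(t for t in ancestors(a) if t in ancestors(b))
    return 2 * math.log(prob(lca)) / (math.log(prob(a)) + math.log(prob(b)))

print(lin_sim("Soccer", "Tennis"))   # share 'Sports': fairly similar
print(lin_sim("Soccer", "Arts"))     # only share the root: similarity ~0
```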

Proceedings Article
01 Jan 2005
TL;DR: A comprehensive framework for measuring similarity within and between ontologies as a basis for the interoperability across various application fields is presented and validated with several practical case studies to prove benefits of applying the approach compared to traditional similarity measures.
Abstract: In this paper we present a comprehensive framework for measuring similarity within and between ontologies as a basis for the interoperability across various application fields. In order to define such a framework, we base our work on an abstract ontology model that allows adhering to various existing and evolving ontology standards. The main characteristic of the framework is its layered structure: We have defined three levels on which the similarity between two entities (concepts or instances) can be measured: data layer, ontology layer, and context layer, that cope with the data representation, ontological meaning and the usage of these entities, respectively. In addition, in each of the layers corresponding background information is used in order to define the similarity more precisely. The framework is complete in the sense of covering the similarity between all elements defined in the abstract ontology model by comprising similarity measures for all above-named layers as well as relations between them. Moreover, we have validated our framework with several practical case studies in order to prove benefits of applying our approach compared to traditional similarity measures. One of these case studies is described in detail within the paper.

Proceedings Article
01 Jan 2005
TL;DR: The OWL web ontology language is extended, with fuzzy set theory, in order to be able to capture, represent and reason with information that is many times imprecise or vague.
Abstract: In the Semantic Web context information would be retrieved, processed, shared, reused and aligned in the maximum automatic way possible. Our experience with such applications in the Semantic Web has shown that these are rarely a matter of true or false but rather procedures that require degrees of relatedness, similarity, or ranking. Apart from the wealth of applications that are inherently imprecise, information itself is many times imprecise or vague. For example, the concepts of a “hot” place, an “expensive” item, a “fast” car, a “near” city, are examples of such concepts. Dealing with such type of information would yield more realistic, intelligent and effective applications. In the current paper we extend the OWL web ontology language, with fuzzy set theory, in order to be able to capture, represent and reason with such type of information.

Proceedings Article
30 Jul 2005
TL;DR: Latent Relational Analysis (LRA) as discussed by the authors is a method for measuring semantic similarity between two pairs of words that is based on the cosine of the angle between the vectors that represent the two pairs.
Abstract: This paper introduces Latent Relational Analysis (LRA), a method for measuring semantic similarity. LRA measures similarity in the semantic relations between two pairs of words. When two pairs have a high degree of relational similarity, they are analogous. For example, the pair cat:meow is analogous to the pair dog:bark. There is evidence from cognitive science that relational similarity is fundamental to many cognitive and linguistic tasks (e.g., analogical reasoning). In the Vector Space Model (VSM) approach to measuring relational similarity, the similarity between two pairs is calculated by the cosine of the angle between the vectors that represent the two pairs. The elements in the vectors are based on the frequencies of manually constructed patterns in a large corpus. LRA extends the VSM approach in three ways: (1) patterns are derived automatically from the corpus, (2) Singular Value Decomposition is used to smooth the frequency data, and (3) synonyms are used to reformulate word pairs. This paper describes the LRA algorithm and experimentally compares LRA to VSM on two tasks, answering college-level multiple-choice word analogy questions and classifying semantic relations in noun-modifier expressions. LRA achieves state-of-the-art results, reaching human-level performance on the analogy questions and significantly exceeding VSM performance on both tasks.
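A stripped-down illustration of the VSM/LRA ingredients, not the full algorithm: word pairs as pattern-frequency vectors, truncated SVD smoothing, and cosine comparison. The patterns and counts below are fabricated; real LRA derives patterns from a large corpus and adds synonym-based reformulation.

```python
# Represent word pairs by pattern counts, smooth with a rank-2 SVD, and
# compare pairs by the cosine of the angle between their vectors.
import numpy as np

pairs = ["cat:meow", "dog:bark", "car:engine"]
patterns = ["X makes a Y sound", "X has a Y", "Y of the X"]
counts = np.array([[40.0,  2.0,  1.0],    # cat:meow
                   [35.0,  3.0,  1.0],    # dog:bark
                   [ 1.0, 30.0, 25.0]])   # car:engine

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
smoothed = (U[:, :2] * s[:2]) @ Vt[:2, :]          # rank-2 smoothing

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(smoothed[0], smoothed[1]))  # cat:meow vs dog:bark -> analogous
print(cosine(smoothed[0], smoothed[2]))  # cat:meow vs car:engine -> not analogous
```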

01 Jan 2005
TL;DR: A new model to measure semantic similarity in the taxonomy of WordNet, using edge-counting techniques achieves a much improved result compared with other methods: the correlation with average human judgment on a standard 28 word pair dataset is better than anything reported in the literature.
Abstract: This paper presents a new model to measure semantic similarity in the taxonomy of WordNet, using edge-counting techniques. We weigh up our model against a benchmark set by human similarity judgment, and achieve a much improved result compared with other methods: the correlation with average human judgment on a standard 28 word pair dataset is 0.921, which is better than anything reported in the literature and also significantly better than average individual human judgments. As this set has been effectively used for algorithm selection and tuning, we also cross-validate an independent 37 word pair test set (0.876) and present results for the full 65 word pair superset (0.897).
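A hedged sketch of this kind of evaluation, using NLTK's built-in path measure as a stand-in for the paper's own edge-counting model; the word pairs and human ratings below are illustrative, not the benchmark data.

```python
# Correlate a WordNet-based similarity with human ratings over word pairs.
from nltk.corpus import wordnet as wn          # requires nltk.download('wordnet')
from scipy.stats import pearsonr

def wn_similarity(w1, w2):
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            best = max(best, s1.path_similarity(s2) or 0.0)
    return best

pairs = [("car", "automobile"), ("coast", "shore"), ("noon", "string")]
human = [3.9, 3.7, 0.1]                        # illustrative ratings on a 0-4 scale
model = [wn_similarity(a, b) for a, b in pairs]

r, _ = pearsonr(model, human)
print(f"correlation with human judgments: {r:.3f}")
```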

Proceedings ArticleDOI
25 Jun 2005
TL;DR: This paper presents a novel algorithm for the acquisition of Information Extraction patterns that makes the assumption that useful patterns will have similar meanings to those already identified as relevant.
Abstract: This paper presents a novel algorithm for the acquisition of Information Extraction patterns. The approach makes the assumption that useful patterns will have similar meanings to those already identified as relevant. Patterns are compared using a variation of the standard vector space model in which information from an ontology is used to capture semantic similarity. Evaluation shows this algorithm performs well when compared with a previously reported document-centric approach.

01 Jan 2005
TL;DR: This article presents a method of word sense disambiguation that assigns a target word the sense that is most related to the senses of its neighboring words and explores the use of measures of similarity and relatedness that are based on finding paths in a concept network, information content derived from a large corpus, and word sense glosses.
Abstract: This article presents a method of word sense disambiguation that assigns a target word the sense that is most related to the senses of its neighboring words. We explore the use of measures of similarity and relatedness that are based on finding paths in a concept network, information content derived from a large corpus, and word sense glosses. We observe that measures of relatedness are useful sources of information for disambiguation, and in particular we find that two gloss based measures that we have developed are particularly flexible and effective measures for word sense disambiguation.
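A simplified Python version of the idea, assuming NLTK with WordNet: choose the sense of the target word whose synset is most related to the senses of its neighbors, here using path similarity in place of the article's gloss-based measures.

```python
# Pick the target sense that maximizes total relatedness to context senses.
from nltk.corpus import wordnet as wn          # requires nltk.download('wordnet')

def disambiguate(target, context_words):
    best_sense, best_score = None, -1.0
    for sense in wn.synsets(target):
        score = 0.0
        for w in context_words:
            sims = [sense.path_similarity(s) or 0.0 for s in wn.synsets(w)]
            score += max(sims, default=0.0)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

sense = disambiguate("bank", ["river", "water", "shore"])
print(sense, "-", sense.definition())
```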

Journal ArticleDOI
TL;DR: A flexible, parameterized framework for calculating distributional similarity is proposed and the problem of finding distributionally similar words is cast as one of co-occurrence retrieval (CR) for which precision and recall can be measured by analogy with the way they are measured in document retrieval.
Abstract: Techniques that exploit knowledge of distributional similarity between words have been proposed in many areas of Natural Language Processing. For example, in language modeling, the sparse data problem can be alleviated by estimating the probabilities of unseen co-occurrences of events from the probabilities of seen co-occurrences of similar events. In other applications, distributional similarity is taken to be an approximation to semantic similarity. However, due to the wide range of potential applications and the lack of a strict definition of the concept of distributional similarity, many methods of calculating distributional similarity have been proposed or adopted. In this work, a flexible, parameterized framework for calculating distributional similarity is proposed. Within this framework, the problem of finding distributionally similar words is cast as one of co-occurrence retrieval (CR) for which precision and recall can be measured by analogy with the way they are measured in document retrieval. As will be shown, a number of popular existing measures of distributional similarity are simulated with parameter settings within the CR framework. In this article, the CR framework is then used to systematically investigate three fundamental questions concerning distributional similarity. First, is the relationship of lexical similarity necessarily symmetric, or are there advantages to be gained from considering it as an asymmetric relationship? Second, are some co-occurrences inherently more salient than others in the calculation of distributional similarity? Third, is it necessary to consider the difference in the extent to which each word occurs in each co-occurrence type? Two application-based tasks are used for evaluation: automatic thesaurus generation and pseudo-disambiguation. It is possible to achieve significantly better results on both these tasks by varying the parameters within the CR framework rather than using other existing distributional similarity measures; it will also be shown that any single unparameterized measure is unlikely to be able to do better on both tasks. This is due to an inherent asymmetry in lexical substitutability and therefore also in lexical distributional similarity.
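The retrieval analogy can be sketched in a few lines: treat one word's co-occurrence features as retrieved and the other's as relevant, and compute precision and recall over them. The version below uses uniform feature weights and invented feature sets; the framework itself parameterizes both the weighting and the way precision and recall are combined.

```python
# Co-occurrence retrieval with uniform weights: precision and recall over
# shared features, which is deliberately asymmetric between the two words.
def cr_precision_recall(feats_u, feats_v):
    shared = feats_u & feats_v
    precision = len(shared) / len(feats_u) if feats_u else 0.0
    recall = len(shared) / len(feats_v) if feats_v else 0.0
    return precision, recall

dog   = {"bark", "bite", "pet", "walk", "feed"}
puppy = {"bark", "pet", "feed", "cute"}

p, r = cr_precision_recall(puppy, dog)      # puppy's features against dog's
print(f"precision={p:.2f} recall={r:.2f}")  # note the asymmetry if arguments are swapped
```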

Journal ArticleDOI
TL;DR: A suite of methods that assess the similarity between two WSDL (Web Service Description Language) specifications based on the structure of their data types and operations and the semantics of their natural language descriptions and identifiers are developed.
Abstract: The web-services stack of standards is designed to support the reuse and interoperation of software components on the web. A critical step in the process of developing applications based on web services is service discovery, i.e. the identification of existing web services that can potentially be used in the context of a new web application. Discovery through catalog-style browsing (such as supported currently by web-service registries) is clearly insufficient. To support programmatic service discovery, we have developed a suite of methods that assess the similarity between two WSDL (Web Service Description Language) specifications based on the structure of their data types and operations and the semantics of their natural language descriptions and identifiers. Given only a textual description of the desired service, a semantic information-retrieval method can be used to identify and order the most relevant WSDL specifications based on the similarity of the element descriptions of the available specifications with the query. If a (potentially partial) specification of the desired service behavior is also available, this set of likely candidates can be further refined by a semantic structure-matching step, assessing the structural similarity of the desired vs the retrieved services and the semantic similarity of their identifiers. In this paper, we describe and experimentally evaluate our suite of service-similarity assessment methods.

Journal ArticleDOI
TL;DR: It is argued that this viewpoint allows adequate coverage of theory and empirical findings in learning, reasoning, categorization, and language, and also a reassessment of the objectives in research on rules versus similarity.
Abstract: The distinction between rules and similarity is central to our understanding of much of cognitive psychology. Two aspects of existing research have motivated the present work. First, in different cognitive psychology areas we typically see different conceptions of rules and similarity; for example, rules in language appear to be of a different kind compared to rules in categorization. Second, rules processes are typically modeled as separate from similarity ones; for example, in a learning experiment, rules and similarity influences would be described on the basis of separate models. In the present article, I assume that the rules versus similarity distinction can be understood in the same way in learning, reasoning, categorization, and language, and that a unified model for rules and similarity is appropriate. A rules process is considered to be a similarity one where only a single or a small subset of an object's properties are involved. Hence, rules and overall similarity operations are extremes in a single continuum of similarity operations. It is argued that this viewpoint allows adequate coverage of theory and empirical findings in learning, reasoning, categorization, and language, and also a reassessment of the objectives in research on rules versus similarity.

Book ChapterDOI
30 Aug 2005
TL;DR: In this paper, the authors propose a proactive method to build a semantic overlay based on an epidemic protocol that clusters peers with similar content, without requiring the user to specify his preferences or to characterize the content of files he shares.
Abstract: A lot of recent research on content-based P2P searching for file-sharing applications has focused on exploiting semantic relations between peers to facilitate searching. To the best of our knowledge, all methods proposed to date suggest reactive ways to seize peers' semantic relations. That is, they rely on the usage of the underlying search mechanism, and infer semantic relations based on the queries placed and the corresponding replies received. In this paper we follow a different approach, proposing a proactive method to build a semantic overlay. Our method is based on an epidemic protocol that clusters peers with similar content. It is worth noting that this peer clustering is done in a completely implicit way, that is, without requiring the user to specify his preferences or to characterize the content of files he shares.

Journal ArticleDOI
TL;DR: This semantic analysis approach can be used in semantic annotation and transcoding systems, which take into consideration the user's environment, including preferences, devices used, available network bandwidth and content identity.
Abstract: An approach to knowledge-assisted semantic video object detection based on a multimedia ontology infrastructure is presented. Semantic concepts in the context of the examined domain are defined in an ontology, enriched with qualitative attributes (e.g., color homogeneity), low-level features (e.g., color model components distribution), object spatial relations, and multimedia processing methods (e.g., color clustering). Semantic Web technologies are used for knowledge representation in the RDF(S) metadata standard. Rules in F-logic are defined to describe how tools for multimedia analysis should be applied, depending on concept attributes and low-level features, for the detection of video objects corresponding to the semantic concepts defined in the ontology. This supports flexible and managed execution of various application and domain independent multimedia analysis tasks. Furthermore, this semantic analysis approach can be used in semantic annotation and transcoding systems, which take into consideration the user's environment, including preferences, devices used, available network bandwidth and content identity. The proposed approach was tested for the detection of semantic objects on video data of three different domains.

Book ChapterDOI
01 Jan 2005
TL;DR: This research will apply scaling techniques such as SVD as well as Multidimensional Scaling on a large database of free association collected by Nelson, McEvoy, and Schreiber (1999) containing norms for first associates for over 5000 words to uncover the latent information available in the free association norms that is not directly available by investigating simple measures for associative strengths.
Abstract: A common assumption of theories of memory is that the meaning of a word can be represented by a vector which places a word as a point in a multidimensional semantic space (e.g. Landauer & Dumais, 1997; Burgess & Lund, 2000; Osgood, Suci, & Tannenbaum, 1957). Representing words as vectors in a multidimensional space allows simple geometric operations such as the Euclidean distance or the angle between the vectors to compute the semantic (dis)similarity between arbitrary pairs or groups of words. This representation makes it possible to make predictions about performance in psychological tasks where the semantic distance between pairs or groups of words is assumed to play a role. One recent framework for placing words in a multidimensional space is Latent Semantic Analysis or LSA (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). The main assumption is that the similarity between words can be inferred by analyzing the statistical regularities between words and the text samples in which they occur. For example, a textbook with a paragraph that mentions "cats" might also mention "dogs", "fur", "pets", etc. This knowledge can be used to infer that "cats" and "dogs" are related in meaning. The technique underlying LSA is singular value decomposition (SVD). This procedure is applied to the matrix of word-context frequencies in a high-dimensional space (typically with 200-400 dimensions) in which words that appear in similar contexts are placed in similar regions of the space. Interestingly, some words that never occur in the same context might still be similar in LSA space if they co-occurred with other words that do occur together in text samples. Landauer and Dumais (1997) applied the LSA approach to over 60,000 words appearing in over 30,000 contexts of a large encyclopedia. More recently, LSA was applied to over 90,000 words appearing in over 37,000 contexts of reading material that an English reader might be exposed to from 3rd grade up to the 1st year of college, drawn from sources such as textbooks, novels, and newspaper articles. The LSA representation has been successfully applied to multiple-choice vocabulary tests, domain knowledge tests and content evaluation (see Landauer & Dumais, 1997; Landauer et al., 1998). In this research, we apply scaling techniques such as SVD as well as Multidimensional Scaling to a large database of free associations collected by Nelson, McEvoy, and Schreiber (1999) containing norms for first associates for over 5,000 words. By applying scaling methods to the free association norms, we hope to uncover the latent information available in the norms that is not directly available by investigating simple measures of associative strength based on the direct and indirect associative strengths through short chains of associates (e.g., Nelson & Zhang, 2000). The free association norms were represented in matrix form, with the rows representing the cues and the columns representing the responses. The entries in the matrix are filled by some measure of associative strength between cues and responses. By applying scaling methods to the matrix, words are placed in a high-dimensional space such that words with similar associative patterns are placed in similar regions of the space.
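A minimal sketch of the scaling step, assuming NumPy: apply an SVD to a small cue-by-response matrix of associative strengths and compare cues by the cosine between their low-dimensional vectors. The tiny matrix is invented; the chapter works with norms for over 5,000 cues.

```python
# SVD of a cue-by-response matrix of associative strengths; cues with similar
# response patterns end up close together in the reduced space.
import numpy as np

cues = ["cat", "dog", "car"]
responses = ["pet", "fur", "meow", "bark", "road", "wheel"]
assoc = np.array([[0.5, 0.3, 0.4, 0.0, 0.0, 0.0],   # cat ->
                  [0.5, 0.3, 0.0, 0.4, 0.0, 0.0],   # dog ->
                  [0.0, 0.0, 0.0, 0.0, 0.5, 0.4]])  # car ->

U, s, Vt = np.linalg.svd(assoc, full_matrices=False)
vecs = U[:, :2] * s[:2]                 # 2-dimensional cue vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vecs[0], vecs[1]))         # cat vs dog: close to 1
print(cosine(vecs[0], vecs[2]))         # cat vs car: close to 0
```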

Dissertation
01 Jan 2005
TL;DR: An algorithmic approach to learning similarity from examples of what objects are deemed similar according to the task-specific notion of similarity at hand, as well as optional negative examples, which allows to predict when two previously unseen examples are similar and to efficiently search a very large database for examples similar to a query.
Abstract: The right measure of similarity between examples is important in many areas of computer science. In particular it is a critical component in example-based learning methods. Similarity is commonly defined in terms of a conventional distance function, but such a definition does not necessarily capture the inherent meaning of similarity, which tends to depend on the underlying task. We develop an algorithmic approach to learning similarity from examples of what objects are deemed similar according to the task-specific notion of similarity at hand, as well as optional negative examples. Our learning algorithm constructs, in a greedy fashion, an encoding of the data. This encoding can be seen as an embedding into a space, where a weighted Hamming distance is correlated with the unknown similarity. This allows us to predict when two previously unseen examples are similar and, importantly, to efficiently search a very large database for examples similar to a query. This approach is tested on a set of standard machine learning benchmark problems. The model of similarity learned with our algorithm provides an improvement over standard example-based classification and regression. We also apply this framework to problems in computer vision: articulated pose estimation of humans from single images, articulated tracking in video, and matching image regions subject to generic visual similarity.

Book ChapterDOI
06 Nov 2005
TL;DR: In this paper, the source and target ontologies are first translated into Bayesian networks (BN) and the concept mapping between the two ontologies is treated as evidential reasoning between the translated BNs.
Abstract: This paper presents our ongoing effort on developing a principled methodology for automatic ontology mapping based on BayesOWL, a probabilistic framework we developed for modeling uncertainty in the Semantic Web. In this approach, the source and target ontologies are first translated into Bayesian networks (BN); the concept mapping between the two ontologies is treated as evidential reasoning between the two translated BNs. Probabilities needed for constructing conditional probability tables (CPT) during translation and for measuring semantic similarity during mapping are learned using text classification techniques, where each concept in an ontology is associated with a set of semantically relevant text documents obtained by ontology-guided web mining. The basic ideas of this approach are validated by positive results from computer experiments on two small real-world ontologies.

Journal Article
TL;DR: This paper describes an approach to visualization of text document collection based on methods from linear algebra, the system implementing it and results of using the system on several datasets.
Abstract: Visualization is commonly used in data analysis to help the user in getting an initial idea about the raw data as well as visual representation of the regularities obtained in the analysis. In similar way, when we talk about automated text processing and the data consists of text documents, visualization of text document corpus can be very useful. From the automated text processing point of view, natural language is very redundant in the sense that many different words share a common or similar meaning. For computer this can be hard to understand without some background knowledge. We describe an approach to visualization of text document collection based on methods from linear algebra. We apply Latent Semantic Indexing (LSI) as a technique that helps in extracting some of the background knowledge from corpus of text documents. This can be also viewed as extraction of hidden semantic concepts from text documents. In this way visualization can be very helpful in data analysis, for instance, for finding main topics that appear in larger sets of documents. Extraction of main concepts from documents using techniques such as LSI, can make the results of visualizations more useful. For example, given a set of descriptions of European Research projects (6FP) one can find main areas that these projects cover including semantic web, e-learning, security, etc. In this paper we describe a method for visualization of document corpus based on LSI, the system implementing it and give results of using the system on several datasets.
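A rough sketch of such an LSI-based visualization pipeline using scikit-learn (document texts are placeholders, not the datasets used in the paper): build a tf-idf term-document matrix, reduce it to two latent dimensions, and use the coordinates for plotting.

```python
# LSI via truncated SVD of a tf-idf matrix; the 2-D coordinates can be used
# to place documents on a plane for visualization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "semantic web ontology reasoning",
    "ontology alignment and semantic similarity",
    "e-learning platforms for schools",
    "network security and intrusion detection",
]

X = TfidfVectorizer().fit_transform(docs)        # documents x terms
coords = TruncatedSVD(n_components=2).fit_transform(X)

for doc, (x, y) in zip(docs, coords):
    print(f"({x:5.2f}, {y:5.2f})  {doc}")
```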

01 Jan 2005
TL;DR: This paper presents a general-purpose knowledge integration framework that employs Bayesian networks in integrating both low-level and semantic features, and demonstrates that effective inference engines can be built within this powerful and flexible framework according to specific domain knowledge and available training data.
Abstract: Current research in content-based semantic image understanding is largely confined to exemplar-based approaches built on low-level feature extraction and classification. The ability to extract both low-level and semantic features and perform knowledge integration of different types of features is expected to raise semantic image understanding to a new level. Belief networks, or Bayesian networks (BN), have proven to be an effective knowledge representation and inference engine in artificial intelligence and expert systems research. Their effectiveness is due to the ability to explicitly integrate domain knowledge in the network structure and to reduce a joint probability distribution to conditional independence relationships. In this paper, we present a general-purpose knowledge integration framework that employs BN in integrating both low-level and semantic features. The efficacy of this framework is demonstrated via three applications involving semantic understanding of pictorial images. The first application aims at detecting main photographic subjects in an image, the second aims at selecting the most appealing image in an event, and the third aims at classifying images into indoor or outdoor scenes. With these diverse examples, we demonstrate that effective inference engines can be built within this powerful and flexible framework according to specific domain knowledge and available training data to solve inherently uncertain vision problems.