
Showing papers on "Semantic similarity" published in 2001


Book ChapterDOI
05 Sep 2001
TL;DR: This paper presents an unsupervised learning algorithm for recognizing synonyms based on statistical data acquired by querying a Web search engine; the algorithm, PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words.
Abstract: This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
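The core scoring step is easy to sketch. Below is a minimal Python illustration of the paper's simplest score, p(problem & choice) / p(choice), assuming a hypothetical hits() hit-count function; Turney's original experiments used AltaVista's NEAR operator, so any modern search API would have to stand in for it here.

```python
# A minimal sketch of the PMI-IR scoring idea. hits() is a placeholder
# (an assumption): plug in whatever search-engine API is available.

def hits(query: str) -> int:
    """Hypothetical hit-count oracle for a search query."""
    raise NotImplementedError("plug in a search-engine API here")

def pmi_ir_score(problem: str, choice: str) -> float:
    """Score 1 from the paper: p(problem & choice) / p(choice).
    The corpus-size term in PMI cancels when comparing choices for a
    fixed problem word, so raw hit counts suffice."""
    return hits(f"{problem} AND {choice}") / hits(choice)

def best_synonym(problem: str, choices: list[str]) -> str:
    """Pick the choice word that co-occurs most strongly with problem."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c))
```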

1,232 citations


Journal ArticleDOI
TL;DR: The time to compare two numbers shows additive effects of number notation and of semantic distance, suggesting that the comparison task can be decomposed into distinct stages of identification and semantic processing.

663 citations


Proceedings Article
01 Jan 2001
TL;DR: The goal of Anchor-PROMPT is not to provide a complete solution to automated ontology merging but rather to augment existing methods, like PROMPT and Chimaera, by determining additional possible points of similarity between ontologies.
Abstract: Researchers in the ontology-design field have developed the content for ontologies in many domain areas. Recently, ontologies have become increasingly common on the World Wide Web, where they provide semantics for annotations in Web pages. This distributed nature of ontology development has led to a large number of ontologies covering overlapping domains, which researchers now need to merge or align to one another. The processes of ontology alignment and merging are usually handled manually and often constitute a large and tedious portion of the sharing process. We have developed and implemented Anchor-PROMPT—an algorithm that finds semantically similar terms automatically. Anchor-PROMPT takes as input a set of anchors—pairs of related terms defined by the user or automatically identified by lexical matching. Anchor-PROMPT treats an ontology as a graph with classes as nodes and slots as links. The algorithm analyzes the paths in the subgraph limited by the anchors and determines which classes frequently appear in similar positions on similar paths. These classes are likely to represent semantically similar concepts. Our experiments show that when we use Anchor-PROMPT with ontologies developed independently by different groups of researchers, 75% of its results are correct.

1 Ontology Merging and Anchor-PROMPT

Researchers have pursued development of ontologies—explicit formal specifications of domains of discourse—on the premise that ontologies facilitate knowledge sharing and reuse (Musen 1992; Gruber 1993). Today, ontology development is moving from academic knowledge-representation projects to the world of e-commerce. Companies use ontologies to share information and to guide customers through their Web sites. The ontologies on the World Wide Web range from large taxonomies categorizing Web sites (such as on Yahoo!) to categorizations of products for sale and their features (such as on Amazon.com). The WWW Consortium is developing the Resource Description Framework (Brickley and Guha 1999), a language for encoding semantic information on Web pages in machine-readable form. Such encoding makes it possible for electronic agents searching for information to share a common understanding of the semantics of the data represented on the Web. Many disciplines now develop standardized ontologies that domain experts can use to share and annotate information in their fields. Medicine, for example, has produced large, standardized, structured vocabularies such as SNOMED (Price and Spackman 2000) and the semantic network of the Unified Medical Language System (Humphreys and Lindberg 1993). With this widespread distributed use of ontologies, different parties inevitably develop ontologies with overlapping content. For example, both Yahoo! and the DMOZ Open Directory (Netscape 1999) categorize information available on the Web. The two resulting directories are similar, but also have many differences. Currently, there are very few theories or methods that facilitate or automate the process of reconciling disparate ontologies. Ontology management today is mostly a manual process. A domain expert who wants to determine a correlation between two ontologies must find all the concepts in the two source ontologies that are similar to one another, determine what the similarities are, and either change the source ontologies to remove the overlaps or record a mapping between the sources for future reference. This process is both labor-intensive and error-prone.
The semi-automated approaches to ontology merging that do exist today (Section 2), such as PROMPT and Chimaera, analyze only local context in ontology structure: given two similar classes, the algorithms consider classes and slots that are directly related to the classes in question. The algorithm that we present here, Anchor-PROMPT, uses a set of heuristics to analyze non-local context. The goal of Anchor-PROMPT is not to provide a complete solution to automated ontology merging but rather to augment existing methods, like PROMPT and Chimaera, by determining additional possible points of similarity between ontologies. Anchor-PROMPT takes as input a set of pairs of related terms—anchors—from the source ontologies. Either the user identifies the anchors manually or the system generates them automatically. From this set of previously identified anchors, Anchor-PROMPT produces a set of new pairs of semantically close terms. To do that, Anchor-PROMPT traverses the paths between the anchors in the corresponding ontologies. A path follows the links between classes defined by the hierarchical relations or by slots and their domains and ranges. Anchor-PROMPT then compares the terms along these paths to find similar terms. For example, suppose we identify two pairs of anchors: classes A and B, and classes H and G (Figure 1). That is, class A from one ontology is similar to class B in the other ontology, and class H from the first ontology is similar to class G from the second one. Figure 1 shows one path from A to H in the first ontology and one path from B to G in the second ontology. We traverse the two paths in parallel, incrementing the similarity score between each pair of classes that we reach in the same step. For example, after traversing the paths in Figure 1, we increment the similarity score between classes C and D and between classes E and F. We repeat the process for all the existing paths that originate and terminate in the anchor points, cumulatively aggregating the similarity score. The central observation behind Anchor-PROMPT is that if two pairs of terms from the source ontologies are similar and there are paths connecting the terms, then the elements in those paths are often similar as well. Therefore, from a small set of previously identified related terms, Anchor-PROMPT is able to suggest a large number of terms that are likely to be semantically similar as well.

Figure 1. Traversing the paths between anchors. The rectangles represent classes and labeled edges represent slots that relate classes to one another. The left part of the figure represents classes and slots from one ontology; the right part represents classes and slots from the other. Solid arrows connect pairs of anchors; dashed arrows connect pairs of related terms.
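The parallel-traversal step lends itself to a short sketch. The following Python is a hypothetical re-creation (not the authors' code) of the core loop: equal-length paths whose endpoints are anchored to each other are walked in parallel, and each pair of classes reached at the same step has its similarity score incremented.

```python
from collections import defaultdict
from itertools import product

def anchor_prompt_scores(paths1, paths2, anchors):
    """Hypothetical re-creation of Anchor-PROMPT's core loop.
    paths1/paths2: lists of paths (each a list of class names) that start
    and end at anchored classes; anchors: set of (c1, c2) pairs already
    known to be similar."""
    score = defaultdict(int)
    for p1, p2 in product(paths1, paths2):
        # Compare only paths whose lengths match (so steps line up) and
        # whose endpoints are anchored to each other.
        if len(p1) != len(p2):
            continue
        if (p1[0], p2[0]) not in anchors or (p1[-1], p2[-1]) not in anchors:
            continue
        for c1, c2 in zip(p1[1:-1], p2[1:-1]):  # interior classes only
            score[(c1, c2)] += 1
    return score

# The Figure 1 example: anchors (A,B) and (H,G); traversing A-C-E-H in
# parallel with B-D-F-G increments the scores of (C,D) and (E,F).
scores = anchor_prompt_scores(
    [["A", "C", "E", "H"]], [["B", "D", "F", "G"]],
    {("A", "B"), ("H", "G")})
```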

482 citations


Proceedings ArticleDOI
26 Nov 2001
TL;DR: The intention of the approach is to enhance and augment existing clone detection methods that are based on structural analysis, thereby improving the quality of clone detection.
Abstract: Source code duplication occurs frequently within large software systems. Pieces of source code, functions, and data types are often duplicated in part or in whole, for a variety of reasons. Programmers may simply be reusing a piece of code via copy and paste, or they may be "re-inventing the wheel". Previous research on the detection of clones has mainly focused on identifying pieces of code with similar (or nearly similar) structure. Our approach is to examine the source code text (comments and identifiers) and identify implementations of similar high-level concepts (e.g., abstract data types). The approach uses an information retrieval technique (i.e., latent semantic indexing) to statically analyze the software system and determine semantic similarities between source code documents (i.e., functions, files, or code segments). These similarity measures are used to drive the clone detection process. The intention of our approach is to enhance and augment existing clone detection methods that are based on structural analysis. This synergistic use of methods will improve the quality of clone detection. A set of experiments is presented that demonstrates the use of the semantic similarity measure to identify clones within a version of NCSA Mosaic.
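A compact sketch of the LSI pipeline the abstract describes, using scikit-learn as a stand-in for the authors' implementation; the documents, dimensionality, and clone threshold below are illustrative assumptions.

```python
# LSI-based similarity between source-code "documents" (comments +
# identifiers): TF-IDF term-document matrix, truncated SVD, cosine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [  # identifier/comment text extracted from functions (toy data)
    "hash table insert key value collision resolve",
    "insert entry into hash map resolve collisions",
    "parse html anchor tag extract href attribute",
]

tfidf = TfidfVectorizer().fit_transform(docs)            # term-document matrix
lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)  # latent semantic space
sim = cosine_similarity(lsi)                             # pairwise similarities

# Pairs whose similarity exceeds a threshold become clone candidates.
candidates = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))
              if sim[i, j] > 0.7]
```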

295 citations


Journal ArticleDOI
TL;DR: This paper presents six basic principles, applies those principles in aggregating the existing 134 semantic types into a set of 15 groupings, and discusses some possible uses of the semantic groups.
Abstract: The conceptual complexity of a domain can make it difficult for users of information systems to comprehend and interact with the knowledge embedded in those systems. The Unified Medical Language System (UMLS) currently integrates over 730,000 biomedical concepts from more than fifty biomedical vocabularies. The UMLS semantic network reduces the complexity of this construct by grouping concepts according to the semantic types that have been assigned to them. For certain purposes, however, an even smaller and coarser-grained set of semantic type groupings may be desirable. In this paper, we discuss our approach to creating such a set. We present six basic principles, and then apply those principles in aggregating the existing 134 semantic types into a set of 15 groupings. We present some of the difficulties we encountered and the consequences of the decisions we have made. We discuss some possible uses of the semantic groups, and we conclude with implications for future work.

293 citations


Proceedings ArticleDOI
01 Jul 2001
TL;DR: This paper investigates the combined use of semantic and structural information of programs to support the comprehension tasks involved in the maintenance and reengineering of software systems.
Abstract: This work focuses on investigating the combined use of semantic and structural information of programs to support the comprehension tasks involved in the maintenance and reengineering of software systems. "Semantic information" refers to the domain-specific issues (of both the problem and the development domains) of a software system. The other dimension, structural information, refers to issues such as the actual syntactic structure of the program, along with the control and data flow that it represents. An advanced information retrieval method, latent semantic indexing, is used to define a semantic similarity measure between software components. Components within a software system are then clustered together using this similarity measure. Simple structural information (i.e., the file organization) of the software system is then used to assess the semantic cohesion of the clusters and files with respect to each other. The measures are formally defined for general application. A set of experiments is presented which demonstrates how these measures can assist in the understanding of a nontrivial software system, namely a version of NCSA Mosaic.
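The cohesion measure suggests a simple sketch: score a file by the average pairwise cosine similarity of the LSI vectors of its components. The function below is a hedged illustration of that idea, not the paper's formal definition; all names are hypothetical.

```python
import numpy as np
from itertools import combinations

def file_cohesion(vectors: dict, file_of: dict, filename: str) -> float:
    """Average pairwise cosine similarity of the LSI vectors of the
    components living in one file. vectors: component -> LSI vector;
    file_of: component -> filename. Illustrative, not the paper's metric."""
    members = [v for c, v in vectors.items() if file_of[c] == filename]
    if len(members) < 2:
        return 1.0  # a single-component file is trivially cohesive

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pairs = list(combinations(members, 2))
    return sum(cos(a, b) for a, b in pairs) / len(pairs)
```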

254 citations


Journal ArticleDOI
TL;DR: This article examined the relative strength of judgmental and non-judgmental anchoring effects and explored their boundary conditions in an integrative model which differentiates between two stages of anchoring.

205 citations


Patent
10 Aug 2001
TL;DR: A method of order-ranking document clusters using entropy data and Bayesian self-organizing feature maps (SOM) is provided, in which the accuracy of information retrieval is improved by adopting a Bayesian SOM to perform real-time clustering of relevant documents.
Abstract: A method of order-ranking document clusters using entropy data and Bayesian self-organizing feature maps (SOM) is provided, in which the accuracy of information retrieval is improved by adopting a Bayesian SOM to perform real-time document clustering for relevant documents, in accordance with the degree of semantic similarity between entropy data (extracted using entropy values and user profiles) and query words given by a user. The Bayesian SOM is a combination of Bayesian statistical techniques and a Kohonen network, a type of unsupervised learning.

188 citations


Proceedings ArticleDOI
22 Oct 2001
TL;DR: This paper develops a domain-independent approach for developing semantic portals, viz. SEAL (SEmantic portAL), that exploits semantics for providing and accessing information at a portal as well as for constructing and maintaining the portal.
Abstract: The core idea of the Semantic Web is to make information accessible to human and software agents on a semantic basis. Hence, Web sites may feed directly from the Semantic Web, exploiting the underlying structures for human and machine access. We have developed a domain-independent approach for developing semantic portals, viz. SEAL (SEmantic portAL), that exploits semantics for providing and accessing information at a portal as well as for constructing and maintaining the portal. In this paper we focus on semantics-based means that make semantic Web sites accessible from the outside, i.e., semantics-based browsing, semantic querying, querying with semantic similarity, and machine access to semantic information. In particular, we focus on methods for acquiring and structuring community information as well as methods for sharing information. As a case study we refer to the AIFB portal - a place that is increasingly driven by Semantic Web technologies. We also discuss lessons learned from the ontology development of the AIFB portal.

156 citations


01 Jan 2001
TL;DR: A theoretical framework for semantic space models is developed by synthesizing theoretical analyses from vector space information retrieval and categorical data analysis with new basic research.
Abstract: Towards a Theory of Semantic Space (Will Lowe, Center for Cognitive Studies, Tufts University). This paper adds some theory to the growing literature of semantic space models. We motivate semantic space models from the perspective of distributional linguistics and show how an explicit mathematical formulation can provide a better understanding of existing models and suggest changes and improvements. In addition to providing a theoretical framework for current models, we consider the implications of statistical aspects of language data that have not been addressed in the psychological modeling literature. Statistical approaches to language must deal principally with count data, and this data will typically have a highly skewed frequency distribution due to Zipf's law. We consider the consequences of these facts for the construction of semantic space models, and present methods for removing frequency biases from semantic space models. Introduction: There is a growing literature on the empirical adequacy of semantic space models across a wide range of subject domains (Burgess et al., 1998; Landauer et al., 1998; Foltz et al., 1998; McDonald and Lowe, 1998; Lowe and McDonald, 2000). However, semantic space models are typically structured and parameterized differently by each researcher. Levy and Bullinaria (2000) have explored the implications of parameter changes empirically by running multiple simulations, but there has up until now been no work that places semantic space models in an overarching theoretical framework; consequently there are few statements of how semantic spaces ought to be structured in light of their intended purpose. In this paper we attempt to develop a theoretical framework for semantic space models by synthesizing theoretical analyses from vector space information retrieval and categorical data analysis with new basic research. The structure of the paper is as follows. The next section briefly motivates semantic space models using ideas from distributional linguistics. We then review Zipf's law and its consequences for the distributional character of linguistic data. The final section presents a formal definition of semantic space models and considers what effects different choices of component have on the resulting models. Motivating Semantic Space: Firth (1968) observed that "you shall know a word by the company it keeps". If we interpret company as lexical company, the words that occur near to it in text or speech, then two related claims are possible. The first is unexceptional: we come to know about the syntactic character of a word by examining the other words that may and may not occur around it in text. Syntactic theory then postulates latent variables, e.g. parts of speech and branching structure, that control the distributional properties of words and restrictions on their contexts of occurrence. The second claim is that we come to know about the semantic character of a word by examining the other words that may and may not occur around it in text. The intuition for this distributional characterization of semantics is that whatever makes words similar or dissimilar in meaning must show up distributionally, in the lexical company of the word. Otherwise the supposedly semantic difference is not available to hearers and it is not easy to see how it may be learned.
If words are similar to the extent that they occur in similar contexts, then we may define a statistical replacement test (Finch, 1993) which tests the meaningfulness of the result of switching one word for another in a sentence. When a corpus of meaningful sentences is available the test may be reversed (Lowe, 2000a), and under a suitable representation of lexical context, we may hold each word constant and estimate its typical surrounding context. A semantic space model is a way of representing similarity of typical context in a Euclidean space with axes determined by local word co-occurrence counts. Counting the co-occurrence of a target word with a fixed set of D other words makes it possible to position the target in a space of dimension D. A target's position with respect to other words then expresses similarity of lexical context. Since the basic notion from distributional linguistics is 'intersubstitutability in context', a semantic space model is effective to the extent it realizes this idea accurately. Zipf's Law: The frequency of a word is (approximately) proportional to the reciprocal of its rank in a frequency list (Zipf, 1949; Mandelbrot, 1954). This is Zipf's law. Zipf's law ensures dramatically skewed distributions for almost…
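Zipf's law is easy to observe directly. A minimal Python check, assuming any large plain-text file stands in for a corpus (the filename below is a placeholder): if the law holds, freq × rank stays roughly constant down the ranked frequency list.

```python
# Zipf's law in a few lines: estimated word frequency is proportional to
# 1/rank, so a handful of types dominate the counts while most types are
# rare - the skew the paper argues semantic space models must correct for.
from collections import Counter

tokens = open("corpus.txt").read().lower().split()  # placeholder corpus file
counts = Counter(tokens).most_common()
for rank, (word, freq) in enumerate(counts[:10], start=1):
    # Under Zipf's law, freq * rank is roughly constant across ranks.
    print(f"{rank:>4} {word:<15} freq={freq:<8} freq*rank={freq * rank}")
```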

146 citations


Proceedings ArticleDOI
07 Oct 2001
TL;DR: The method uses multidimensional scaling and hierarchical cluster analysis to model the semantic categories into which human observers organize images, devises an image similarity metric that embodies the results, and develops a prototype system.
Abstract: We propose a method for semantic categorization and retrieval of photographic images based on low-level image descriptors. In this method, we first use multidimensional scaling (MDS) and hierarchical cluster analysis (HCA) to model the semantic categories into which human observers organize images. Through a series of psychophysical experiments and analyses, we refine our definition of these semantic categories, and use these results to discover a set of low-level image features to describe each category. We then devise an image similarity metric that embodies our results, and develop a prototype system, which identifies the semantic category of an image and retrieves the most similar images from the database. We tested the metric on a new set of images, and compared the categorization results with those of human observers. Our results provide a good match to human performance, thus validating the use of human judgments to develop semantic descriptors.
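The modeling step can be sketched with off-the-shelf tools. The snippet below is an assumption-laden stand-in for the authors' analysis, using a recent scikit-learn: MDS embeds a toy matrix of observer dissimilarity judgments, and average-linkage hierarchical clustering recovers candidate categories from the same matrix.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for pairwise dissimilarity judgments from observers.
D = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [0.8, 0.9, 0.3, 0.0]])

# MDS: embed the judgments in 2D for inspection of category structure.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

# HCA: average-linkage clustering directly on the dissimilarity matrix.
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(D)
```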

Book ChapterDOI
12 Jul 2001
TL;DR: A computational model is developed to determine the directional similarity between extended spatial objects, which forms a foundation for meaningful spatial similarity operators and confirms the cognitive plausibility of the similarity model.
Abstract: Like people who casually assess similarity between spatial scenes in their routine activities, users of pictorial databases are often interested in retrieving scenes that are similar to a given scene, and ranking them according to degrees of their match. For example, a town architect would like to query a database for the towns that have a landscape similar to the landscape of the site of a planned town. In this paper, we develop a computational model to determine the directional similarity between extended spatial objects, which forms a foundation for meaningful spatial similarity operators. The model is based on the direction-relation matrix. We derive how the similarity assessment of two direction-relation matrices corresponds to determining the least cost for transforming one direction-relation matrix into another. Using the transportation algorithm, the cost can be determined efficiently for pairs of arbitrary direction-relation matrices. The similarity values are evaluated empirically with several types of movements that create increasingly less similar direction relations. The tests confirm the cognitive plausibility of the similarity model.
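The least-cost transformation between two direction-relation matrices is a balanced transportation problem, which a generic LP solver can also handle. The sketch below is an illustration, not the paper's specialized transportation algorithm; Manhattan distance between the 3×3 direction tiles is used as one plausible unit cost.

```python
import numpy as np
from scipy.optimize import linprog

def direction_matrix_distance(P, Q):
    """Least cost of transforming one 3x3 direction-relation matrix into
    another, posed as a balanced transportation problem. Assumed cost:
    Manhattan distance between direction tiles on the 3x3 grid."""
    P = np.asarray(P, dtype=float).ravel(); P /= P.sum()
    Q = np.asarray(Q, dtype=float).ravel(); Q /= Q.sum()
    coords = [(r, c) for r in range(3) for c in range(3)]
    cost = np.array([[abs(r1 - r2) + abs(c1 - c2) for (r2, c2) in coords]
                     for (r1, c1) in coords], dtype=float)
    n = 9
    A_eq, b_eq = [], []
    for i in range(n):                      # supply: row sums equal P
        row = np.zeros(n * n); row[i * n:(i + 1) * n] = 1
        A_eq.append(row); b_eq.append(P[i])
    for j in range(n):                      # demand: column sums equal Q
        col = np.zeros(n * n); col[j::n] = 1
        A_eq.append(col); b_eq.append(Q[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```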

Proceedings ArticleDOI
01 Apr 2001
TL;DR: This paper introduces and describes the use of the concept of an information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit, and presents an algorithm to efficiently retrieve information units.
Abstract: Since the WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of an information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.

Proceedings Article
04 Aug 2001
TL;DR: This paper proposes a definition of semantic interoperability based on model theory and shows how it applies to existing work in the domain; new applications of this definition to families of languages, ontology patterns, and explicit descriptions of semantics are presented.
Abstract: Semantic interoperability is the faculty of interpreting knowledge imported from other languages at the semantic level, i.e. to ascribe to each imported piece of knowledge the correct interpretation or set of models. It is a very important requirement for delivering a worldwide Semantic Web. This paper presents preliminary investigations towards developing a unified view of the problem. It proposes a definition of semantic interoperability based on model theory and shows how it applies to existing work in the domain. Then, new applications of this definition to families of languages, ontology patterns, and explicit descriptions of semantics are presented.

Journal ArticleDOI
TL;DR: It is conjectured that for normal subjects high on MI, "loose associations" may not be loose after all, and that the tendency to link uncommon, nonobvious percepts may not only be the basis of paranormal and paranoid ideas of reference, but also a prerequisite of creative thinking.
Abstract: An abnormal facilitation of the spreading activation within semantic networks is thought to underlie schizophrenics' remote associations and referential ideas. In normal subjects, elevated magical ideation (MI) has also been associated with a style of thinking similar to that of schizotypal subjects. We thus wondered whether normal subjects with a higher MI score would judge "loose associations" as being more closely related than do subjects with a lower MI score. In two experiments, we investigated whether judgments of the semantic distance between stimulus words varied as a function of MI. In the first experiment, random word pairs from two word classes, animals and fruits, were presented. Subjects had to judge the semantic distance between word pairs. In the second experiment, sets of three words were presented, consisting of a pair of indirectly related or unrelated nouns plus a third noun. Subjects had to judge the semantic distance of the third noun to the word pair. The results of both experiments showed that higher-MI subjects considered unrelated words as more closely associated than did lower-MI subjects. We conjecture that for normal subjects high on MI, "loose associations" may not be loose after all. We also note that the tendency to link uncommon, nonobvious percepts may not only be the basis of paranormal and paranoid ideas of reference, but also a prerequisite of creative thinking.

01 Jan 2001
TL;DR: Based on the convergence of recent studies pointing to a cognitive role for distributional information in explaining language ability, the general principle under exploration is called the Distributional Hypothesis.
Abstract: Testing the Distributional Hypothesis: The Influence of Context on Judgements of Semantic Similarity (Scott McDonald and Michael Ramscar, Institute for Communicating and Collaborative Systems, University of Edinburgh). Distributional information has recently been implicated as playing an important role in several aspects of language ability. Learning the meaning of a word is thought to be dependent, at least in part, on exposure to the word in its linguistic contexts of use. In two experiments, we manipulated subjects' contextual experience with marginally familiar and nonce words. Results showed that similarity judgements involving these words were affected by the distributional properties of the contexts in which they were read. The accrual of contextual experience was simulated in a semantic space model, by successively adding larger amounts of experience in the form of item-in-context exemplars sampled from the British National Corpus. The experiments and the simulation provide support for the role of distributional information in developing representations of word meaning. The Distributional Hypothesis: The basic human ability of language understanding – making sense of another person's utterances – does not develop in isolation from the environment. There is a growing body of research suggesting that distributional information plays a more powerful role than previously thought in a number of aspects of language processing. The exploitation of statistical regularities in the linguistic environment has been put forward to explain how language learners accomplish tasks from segmenting speech to bootstrapping word meaning. For example, Saffran, Aslin and Newport (1996) have demonstrated that infants are highly sensitive to simple conditional probability statistics, indicating how the ability to segment the speech stream into words may be realised. Adults, when faced with the task of identifying the word boundaries in an artificial language, also appear able to readily exploit such statistics (Saffran, Newport & Aslin, 1996). Redington, Chater and Finch (1998) have proposed that distributional information may contribute to the acquisition of syntactic knowledge by children. Useful information about the similarities and differences in the meaning of words has also been shown to be present in simple distributional statistics (e.g., Landauer & Dumais, 1997; McDonald, 2000). Based on the convergence of these recent studies into a cognitive role for distributional information in explaining language ability, we call the general principle under exploration the Distributional Hypothesis. The purpose of the present paper is to further test the distributional hypothesis, by examining the influence of context on similarity judgements involving marginally familiar and novel words. Our investigations are framed under the 'semantic space' approach to representing word meaning, to which we turn next. Distributional Models of Word Meaning: The distributional hypothesis has provided the motivation for a class of objective statistical methods for representing meaning.
Although the surge of interest in the approach arose in the fields of computational linguistics and information retrieval (e.g., Schutze, 1998; Grefenstette, 1994), where large-scale models of lexical semantics are crucial for tasks such as word sense disambiguation, high-dimensional 'semantic space' models are also useful tools for investigating how the brain represents the meaning of words. Word meaning can be considered to vary along many dimensions; semantic space models attempt to capture this variation in a coherent way, by positioning words in a geometric space. How to determine what the crucial dimensions are has been a long-standing problem; a recent and fruitful approach to this issue has been to label the dimensions of semantic space with words. A word is located in the space according to the degree to which it co-occurs with each of the words labelling the dimensions of the space. Co-occurrence frequency information is extracted from a record of language experience – a large corpus of natural language. Using this approach, two words that tend to occur in similar linguistic contexts – that is, they are distributionally similar – will be positioned closer together in semantic space than two words which are not as distributionally similar. Such simple distributional knowledge has been implicated in a variety of language processing behaviours, such as lexical priming (e.g., Lowe & McDonald, 2000; Lund, Burgess & Atchley, 1995; McDonald & Lowe, 1998), synonym selection (Landauer & Dumais, 1997), retrieval in analogical reasoning (Ramscar & Yarlett, 2000) and judgements of semantic similarity (McDonald, 2000). Contextual co-occurrence, the fundamental relationship underlying the success of the semantic space approach to representing word meaning, can be defined in a number of ways. Perhaps the simplest (and the approach taken in the majority of the studies cited above) is to define co-occurrence in terms of a 'context window': the co-occurrence…
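A context-window co-occurrence space in the spirit just described can be built in a few lines. The sketch below positions each word by its co-occurrence counts with a chosen set of dimension-labelling words within ±w tokens; the window size and dimension words are free choices, not the paper's settings.

```python
# Minimal context-window co-occurrence vectors and cosine similarity.

def cooccurrence_vectors(tokens, dims, w=2):
    """tokens: corpus as a token list; dims: words labelling dimensions;
    w: half-width of the context window."""
    index = {d: i for i, d in enumerate(dims)}
    vecs = {}
    for pos, word in enumerate(tokens):
        vec = vecs.setdefault(word, [0] * len(dims))
        window = tokens[max(0, pos - w):pos] + tokens[pos + 1:pos + 1 + w]
        for ctx in window:
            if ctx in index:
                vec[index[ctx]] += 1
    return vecs

def cosine(a, b):
    """Similarity of two count vectors; 0.0 if either is all zeros."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0
```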

01 Jan 2001
TL;DR: This paper presents formal definitions of similarity relations based on intensional definitions, concludes their extensional consequences, and discusses the process of merging ontologies based on the detected similarity relations.
Abstract: Interoperability and integration of data sources are becoming ever more important issues as both the amount of data and the number of data producers grow. Interoperability not only has to resolve the differences in data structures, it also has to deal with semantic heterogeneity. Semantics refers to the meaning of data, in contrast to syntax, which only defines the structure of the schema items (e.g., classes and attributes). We focus on the part of semantics related to the meanings of the terms used as identifiers in schema definitions. This paper presents an approach to integrate schemas from different communities, where each such community is using its own ontology. The approach is based on merging ontologies via similarity relations among concepts of different ontologies. We present formal definitions of similarity relations based on intensional definitions and conclude the extensional consequences. The process of merging ontologies based on the detected similarity relations is discussed. The merged ontology is finally used to derive an integrated schema. The resulting schema can be used as the global schema in a federated database system.

Journal ArticleDOI
TL;DR: A comparative description is provided of the sparse binary distributed representation developed in the framework of the associative-projective neural network architecture, the better-known holographic reduced representations of T.A. Plate, and the binary spatter codes of P. Kanerva.
Abstract: The schemes for compositional distributed representations include those allowing on-the-fly construction of fixed-dimensionality codevectors to encode structures of various complexity. Similarity of such codevectors takes into account both structural and semantic similarity of the represented structures. We provide a comparative description of the sparse binary distributed representation developed in the framework of the associative-projective neural network architecture and the better-known holographic reduced representations of T.A. Plate (1995) and binary spatter codes of P. Kanerva (1996). The key procedure in associative-projective neural networks is context-dependent thinning, which binds codevectors and maintains their sparseness. The codevectors are stored in a structured memory array which can be realized as distributed auto-associative memory. Examples of distributed representation of structured data are given. Fast estimation of the similarity of analogical episodes by the overlap of their codevectors is used in the modeling of analogical reasoning, both for retrieval of analogs from memory and for analogical mapping.
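The overlap estimate at the end is trivial to illustrate. The sketch below builds random sparse binary codevectors and compares them by counting shared active bits; it deliberately omits the paper's context-dependent thinning binding procedure, and the dimensionality and sparsity are illustrative guesses.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_codevector(dim=10000, active=100):
    """Random sparse binary codevector with a fixed number of 1-bits."""
    v = np.zeros(dim, dtype=bool)
    v[rng.choice(dim, size=active, replace=False)] = True
    return v

def overlap(a, b):
    """Fast similarity estimate: number of shared active bits."""
    return int(np.sum(a & b))
```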

Journal Article
TL;DR: This paper discusses issues concerning the augmentation of thesaurus relationships, in light of new application possibilities for retrieval, and illustrates how hierarchical spatial relationships can be used to provide more flexible retrieval for queries incorporating place names in applications employing online gazetteers and geographical thesauri.
Abstract: This paper discusses issues concerning the augmentation of thesaurus relationships, in light of new application possibilities for retrieval. We first discuss a case study that explored the retrieval potential of an augmented set of thesaurus relationships by specialising standard relationships into richer subtypes, in particular hierarchical geographical containment and the associative relationship. We then locate this work in a broader context by reviewing various attempts to build taxonomies of thesaurus relationships, and conclude by discussing the feasibility of hierarchically augmenting the core set of thesaurus relationships, particularly the associative relationship. We discuss the possibility of enriching the specification and semantics of Related Term (RT) relationships, while maintaining compatibility with traditional thesauri via a limited hierarchical extension of the associative (and hierarchical) relationships. This would be facilitated by distinguishing the type of term from the (sub)type of relationship and explicitly specifying semantic categories for terms following a faceted approach. We first illustrate how hierarchical spatial relationships can be used to provide more flexible retrieval for queries incorporating place names in applications employing online gazetteers and geographical thesauri. We then employ a set of experimental scenarios to investigate key issues affecting use of the associative (RT) thesaurus relationships in semantic distance measures. Previous work has noted the potential of RTs in thesaurus search aids but also the problem of uncontrolled expansion of query term sets. Results presented in this paper suggest the potential for taking account of the hierarchical context of an RT link and specialisations of the RT relationship.

Proceedings ArticleDOI
02 Jun 2001
TL;DR: Tests performed on vocabularies of four Algonquian languages indicate that the method is capable of discovering on average nearly 75% of cognates at 50% precision.
Abstract: I present a method of identifying cognates in the vocabularies of related languages. I show that a measure of phonetic similarity based on multivalued features performs better than "orthographic" measures, such as the Longest Common Subsequence Ratio (LCSR) or Dice's coefficient. I introduce a procedure for estimating the semantic similarity of glosses that employs keyword selection and WordNet. Tests performed on vocabularies of four Algonquian languages indicate that the method is capable of discovering on average nearly 75% of cognates at 50% precision.
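The two "orthographic" baselines are standard and easy to reproduce. The sketch below implements them from their textbook definitions: LCSR as |LCS(a,b)| / max(|a|,|b|), and Dice's coefficient over character bigrams (using bigram sets here, a common simplification of the multiset version).

```python
def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: |LCS(a,b)| / max(|a|, |b|)."""
    if not a or not b:
        return 0.0
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # classic LCS table
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

def dice(a: str, b: str) -> float:
    """Dice's coefficient over character bigrams (set version)."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return 2 * len(A & B) / (len(A) + len(B)) if A or B else 0.0
```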

Journal ArticleDOI
TL;DR: It will be argued that the evidence supports a theory of semantic memory that represents meaning in a continuum of levels, from the most specific and context-bound to the most generalisable and context-free.
Abstract: This paper presents evidence that the breakdown of semantic memory in semantic dementia reveals the influence of two properties of script theory (Schank, 1982; Schank & Abelson, 1977). First, the physical and personal context of specific scripts supports meaning for words, objects, and locations that are involved in the script. Second, meaning is updated or transformed by a dynamic memory system that learns continuously from personal experience. In severe cases, semantic dementia exposes the basic level of this learning system from which all knowledge normally develops. It will be argued that the evidence supports a theory of semantic memory that represents meaning in a continuum of levels of meaning from the most specific and context-bound to the most generalisable and context-free. This contrasts with current theories of semantic memory that represent meaning as a collection of abstracted properties entirely removed from the context of events and activities.

01 Jan 2001
TL;DR: A representational scheme based on the Distributional Hypothesis is identified as the rationale for vector-based semantic analysis, and a new method for calculating semantic word vectors is described, which shows that incorporating linguistic information in the context vectors can enhance the results.
Abstract: Vector-based semantic analysis is the practice of representing word meanings as semantic vectors, calculated from the co-occurrence statistics of words in large text data. This paper discusses the theoretical presumptions behind this practice, and a representational scheme based on the Distributional Hypothesis is identified as the rationale for vector-based semantic analysis. A new method for calculating semantic word vectors is then described. The method uses random labelling of words in narrow context windows to calculate semantic context vectors for each word type in the text data. The method is evaluated with a standardised synonym test, and it is shown that incorporating linguistic information in the context vectors can enhance the results.
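The random-labelling method reads as an early form of random indexing, and that reading yields a short sketch: every word type gets a fixed sparse random label, and a word's context vector accumulates the labels of its window neighbours. Dimensionality, sparsity, and window size below are illustrative guesses, not the paper's settings.

```python
import numpy as np

def random_label(rng, dim=1000, nonzero=10):
    """Sparse ternary label: a few random +1/-1 entries, rest zero."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1, 1], size=nonzero)
    return v

def context_vectors(tokens, w=2, dim=1000):
    """Each word's context vector is the sum of the random labels of the
    words appearing within +/- w tokens of its occurrences."""
    rng = np.random.default_rng(0)
    labels = {word: random_label(rng, dim) for word in set(tokens)}
    ctx = {word: np.zeros(dim) for word in set(tokens)}
    for pos, word in enumerate(tokens):
        for j in range(max(0, pos - w), min(len(tokens), pos + w + 1)):
            if j != pos:
                ctx[word] += labels[tokens[j]]
    return ctx
```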

Patent
27 Dec 2001
TL;DR: In this paper, a method and system for determining the semantic meaning of images is disclosed, which includes deriving a set of perceptual semantic categories for representing important semantic cues in the human perception of images, where each semantic category is modeled through a combination of perceptual features.
Abstract: A method and system for determining the semantic meaning of images is disclosed. The method includes deriving a set of perceptual semantic categories for representing important semantic cues in the human perception of images, where each semantic category is modeled through a combination of perceptual features that define the semantics of that category and that discriminate that category from other categories and, for each semantic category, forming a set of the perceptual features as a complete feature set CFS. The perceptual features and their combinations are preferably derived through subjective experiments performed with human observers. The method includes extracting perceptual features from an input image and applying a perceptually-based metric to determine the semantic category for that image. The input image can be processed to compute the CFS, followed by comparing the input image to each semantic category through the perceptually-based metric that computes a similarity measure between the features used to describe the semantic category and the corresponding features extracted from the input image; followed by assigning the input image to the semantic category that corresponds to a highest value of the similarity measure. The distance measure may also be used for characterizing a relationship of a selected image to another image in the image database by applying the perceptually-based similarity metric.

Patent
19 Dec 2001
TL;DR: A method and system are provided for matching a reference document with a plurality of corpus documents according to a hierarchical arrangement of semantic types; a matching score is produced for each corpus document by determining its relatedness to the reference document.
Abstract: A method and system are provided for matching a reference document with a plurality of corpus documents. Semantic content is derived from the reference document according to a hierarchical arrangement of semantic types. For each corpus document, semantic content is also derived from the corpus document according to the hierarchical arrangement of semantic types. A matching score is produced for each corpus document by determining a relatedness between the corpus document and the reference document. This relatedness is derived from the respective semantic contents of the two documents. The corpus documents may be ranked in accordance with the determined matching scores.

Proceedings ArticleDOI
26 Aug 2001
TL;DR: A new method of estimating the novelty of rules discovered by data-mining methods using WordNet, a lexical knowledge-base of English words, finds that the automatic scoring of rules based on the novelty measure correlates with human judgments about as well as human judgments correlate with one another.
Abstract: In this paper, we present a new method of estimating the novelty of rules discovered by data-mining methods using WordNet, a lexical knowledge-base of English words. We assess the novelty of a rule by the average semantic distance in a knowledge hierarchy between the words in the antecedent and the consequent of the rule - the greater the average distance, the greater the novelty of the rule. The novelty of rules extracted by the DiscoTEX text-mining system from Amazon.com book descriptions was evaluated both by human subjects and by our algorithm. By computing correlation coefficients between pairs of human ratings and between human and automatic ratings, we found that the automatic scoring of rules based on our novelty measure correlates with human judgments about as well as human judgments correlate with one another.
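The novelty measure can be approximated with NLTK's WordNet interface. In the sketch below, 1 − path_similarity stands in for the paper's distance in the knowledge hierarchy (the exact measure may differ), and unknown words default to maximal distance.

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires nltk + its wordnet data

def word_distance(w1: str, w2: str) -> float:
    """Stand-in hierarchy distance: 1 minus the best path similarity
    over all synset pairs; words absent from WordNet count as maximally
    distant."""
    syns1, syns2 = wn.synsets(w1), wn.synsets(w2)
    if not syns1 or not syns2:
        return 1.0
    best = max((s1.path_similarity(s2) or 0.0)
               for s1, s2 in product(syns1, syns2))
    return 1.0 - best

def rule_novelty(antecedent: list[str], consequent: list[str]) -> float:
    """Average semantic distance between antecedent and consequent words:
    the greater the average distance, the more novel the rule."""
    pairs = list(product(antecedent, consequent))
    return sum(word_distance(a, c) for a, c in pairs) / len(pairs)
```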

Book ChapterDOI
09 Jul 2001
TL;DR: This paper discusses the role that semantic structures play in establishing communication between different agents in general, and elaborates on a number of intelligent means that make semantic web sites accessible from the outside.
Abstract: The core idea of the Semantic Web is to make information accessible to human and software agents on a semantic basis. Hence, web sites may feed directly from the Semantic Web, exploiting the underlying structures for human and machine access. We have developed a generic approach for developing semantic portals, viz. SEAL (SEmantic portAL), that exploits semantics for providing and accessing information at a portal as well as constructing and maintaining the portal. In this paper, we discuss the role that semantic structures play in establishing communication between different agents in general. We elaborate on a number of intelligent means that make semantic web sites accessible from the outside, viz. semantics-based browsing, semantic querying and querying with semantic similarity, and machine access to semantic information at a semantic portal. As a case study we refer to the AIFB web site -- a place that is increasingly driven by Semantic Web technologies.

Proceedings ArticleDOI
07 Jul 2001
TL;DR: A new approach under the example-based machine translation paradigm retrieves the most similar example by carrying out DP-matching of the input sentence and example sentences while measuring the semantic distance of the words.
Abstract: We propose a new approach under the example-based machine translation paradigm. First, the proposed approach retrieves the most similar example by carrying out DP-matching of the input sentence and example sentences while measuring the semantic distance of the words. Second, the approach adjusts the gap between the input and the most similar example by using a bilingual dictionary. We show the results of a computational experiment.
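The retrieval step is ordinary dynamic programming with a semantic substitution cost. The sketch below uses a placeholder word distance (the paper measures semantic distance with lexical resources, which would replace it); insertions and deletions cost 1.

```python
def semantic_distance(w1: str, w2: str) -> float:
    """Stand-in: 0 for identical words, 1 otherwise; a thesaurus-based
    distance in [0, 1] would go here."""
    return 0.0 if w1 == w2 else 1.0

def dp_match(src: list[str], ex: list[str]) -> float:
    """Minimum cumulative cost of aligning the input sentence with an
    example sentence (edit distance with semantic substitution costs)."""
    m, n = len(src), len(ex)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                       # delete
                           dp[i][j - 1] + 1,                       # insert
                           dp[i - 1][j - 1]
                           + semantic_distance(src[i - 1], ex[j - 1]))
    return dp[m][n]

# The most similar example minimizes the DP-matching cost:
# best = min(examples, key=lambda e: dp_match(input_words, e))
```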

Journal ArticleDOI
TL;DR: It is proposed that the ability to resist the lure of a semantically related impostor word is related to the individual's skill at accessing and reasoning about knowledge from long-term memory, and to the individual's capacity to simultaneously process and store information in working memory.
Abstract: When asked "How many animals of each kind did Moses take on the ark?", people frequently respond "two" even though they know it was Noah, not Moses, who took animals on the ark. We replicate previous research by showing that susceptibility to semantic illusions is influenced by the semantic relatedness of both the impostor word and the surrounding context. However, we also show that the two text manipulations make independent contributions to semantic illusions, and we propose two individual-differences mechanisms that might underlie these two effects. We propose that the ability to resist the lure of a semantically related impostor word is related to the individual's skill at accessing and reasoning about knowledge from long-term memory. And we propose that the ability to resist the lure of the surrounding sentential context is related to the individual's capacity to simultaneously process and store information in working memory.

Journal ArticleDOI
TL;DR: This research is working to determine the efficacy of local annotation for Web sources, as well as performing reconciliation that is qualified by measures of semantic distance, to enable software agents to resolve semantic misconceptions that inhibit successful interoperation with other agents and that limit the effectiveness of searching distributed information sources.
Abstract: As you build a Web site, it is worthwhile asking, "Should I put my information where it belongs or where people are most likely to look for it?" Our recent research into improving searching through ontologies is providing some interesting results to answer this question. The techniques developed by our research bring organization to the information received and reconcile the semantics of each document. Our goal is to help users retrieve dynamically generated information that is tailored to their individual needs and preferences. We believe that it is easier for individuals or small groups to develop their own ontologies, regardless of whether global ones are available, and that these can be automatically and ex-post-facto related. We are working to determine the efficacy of local annotation for Web sources, as well as performing reconciliation that is qualified by measures of semantic distance. If successful, this research will enable software agents to resolve the semantic misconceptions that inhibit successful interoperation with other agents and that limit the effectiveness of searching distributed information sources.

Patent
YeYi Wang1
21 Aug 2001
TL;DR: This patent presents a dialogue system in which semantic ambiguity is reduced by selectively choosing which semantic structures are made available for parsing, based on previous information obtained from the user or other context information.
Abstract: The present invention provides a dialogue system in which semantic ambiguity is reduced by selectively choosing which semantic structures are to be made available for parsing based on previous information obtained from the user or other context information. In one embodiment, the semantic grammar used by the parser is altered so that the grammar is focused based on information about the user or the dialogue state. In other embodiments, the semantic parsing is focused on certain parse structures by giving preference to structures that the dialogue system has marked as being expected.