
Showing papers on "Semantic similarity published in 2000"


Proceedings ArticleDOI
03 Oct 2000
TL;DR: This work presents a system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame, derived from parse trees and hand-annotated training data.
Abstract: We present a system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame. Various lexical and syntactic features are derived from parse trees and used to derive statistical classifiers from hand-annotated training data.

944 citations


Journal ArticleDOI
J.R. Bellegarda1
01 Aug 2000
TL;DR: This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus, and proposes an integrative formulation for harnessing this synergy.
Abstract: Statistical language models used in large-vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies, have been more difficult to handle within a data-driven formalism. This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a (continuous) semantic vector space, in which familiar clustering techniques can be applied. This leads to the specification of a powerful framework for automatic semantic classification, as well as the derivation of several language model families with various smoothing properties. Because of their large-span nature, these language models are well suited to complement conventional n-grams. An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to adjust the standard n-gram probability. Such hybrid language modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. This paper concludes with a discussion of intrinsic tradeoffs, such as the influence of training data selection on the resulting performance.

565 citations
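The latent-semantic-analysis mapping described above (discrete words and documents projected into a continuous semantic vector space) can be sketched with a plain truncated SVD. The term-document counts below are invented purely for illustration:

```python
import numpy as np

# Toy term-document count matrix: rows = words, columns = documents.
# Documents 0-1 are about finance, documents 2-3 about music.
terms = ["stock", "market", "bond", "music", "guitar"]
X = np.array([
    [3, 2, 0, 0],   # stock
    [2, 3, 0, 0],   # market
    [1, 2, 0, 0],   # bond
    [0, 0, 3, 2],   # music
    [0, 0, 2, 3],   # guitar
], dtype=float)

# Truncated SVD: keep k latent semantic dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]        # word positions in the semantic space
doc_vecs = Vt[:k, :].T * s[:k]      # document positions in the same space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that share document contexts end up close together in the space.
print(cos(word_vecs[0], word_vecs[1]))  # "stock" vs "market": near 1
print(cos(word_vecs[0], word_vecs[3]))  # "stock" vs "music": near 0
```

In the paper's hybrid models, proximity in this space (rather than raw co-occurrence) is what adjusts the standard n-gram probability.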


Journal ArticleDOI
TL;DR: This article found that morphological structure plays a significant role in early visual recognition of English words that is independent of both semantic and orthographic relatedness, and reported two sets of visual priming experiments in which the morphological, semantic, and orthographic relationships between primes and targets are varied in three SOA conditions (43 ms, 72 ms, and 230 ms).
Abstract: Some theories of visual word recognition postulate that there is a level of processing or representation at which morphemes are treated differently from whole words. Support for these theories has been derived from priming experiments in which the recognition of a target word is facilitated by the prior presentation of a morphologically related prime (departure-DEPART). In English, such facilitation could be due to morphological relatedness, or to some combination of the orthographic and semantic relatedness characteristic of derivationally related words. We report two sets of visual priming experiments in which the morphological, semantic, and orthographic relationships between primes and targets are varied in three SOA conditions (43 ms, 72 ms, and 230 ms). Results showed that morphological structure plays a significant role in the early visual recognition of English words that is independent of both semantic and orthographic relatedness. Findings are discussed in terms of current approaches to morphologic...

452 citations


Journal ArticleDOI
TL;DR: Simulations in which a set of morphologically related words varying in semantic transparency were embedded in either a morphologically rich or impoverished artificial language found that morphological priming increased with degree of semantic transparency in both languages.
Abstract: On a distributed connectionist approach, morphology reflects a learned sensitivity to the systematic relationships among the surface forms of words and their meanings. Performance on lexical tasks should thus exhibit graded effects of both semantic and formal similarity. Although there is evidence for such effects, there are also demonstrations of morphological effects in the absence of semantic similarity (when formal similarity is controlled) in morphologically rich languages like Hebrew. Such findings are typically interpreted as being problematic for the connectionist account. To evaluate whether this interpretation is valid, we carried out simulations in which a set of morphologically related words varying in semantic transparency were embedded in either a morphologically rich or impoverished artificial language. We found that morphological priming increased with degree of semantic transparency in both languages. Critically, priming extended to semantically opaque items in the morphologically rich la...

349 citations


Patent
29 Aug 2000
TL;DR: In this paper, a neural network is used to extract semantic profiles from a text corpus, and a new set of documents, such as world wide web pages obtained from the Internet, is then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents.
Abstract: A process and system for database storage and retrieval are described along with methods for obtaining semantic profiles from a training text corpus, i.e., text of known relevance, a method for using the training to guide context-relevant document retrieval, and a method for limiting the range of documents that need to be searched after a query. A neural network is used to extract semantic profiles from the text corpus. A new set of documents, such as world wide web pages obtained from the Internet, is then submitted for processing to the same neural network, which computes a semantic profile representation for these pages using the semantic relations learned from profiling the training documents. These semantic profiles are then organized into clusters in order to minimize the time required to answer a query. When a user queries the database, i.e., the set of documents, his or her query is similarly transformed into a semantic profile and compared with the semantic profiles of each cluster of documents. The query profile is then compared with each of the documents in that cluster. Documents with the closest weighted match to the query are returned as search results.

270 citations


Journal ArticleDOI
TL;DR: It is demonstrated that a wide variety of recently reported "rule- described" and "prototype-described" phenomena in perceptual classification, which have led to the development of a number of multiple-system models, can be given an alternative interpretation in terms of a single-system exemplar-similarity model.
Abstract: We demonstrate that a wide variety of recently reported "rule-described" and "prototype-described" phenomena in perceptual classification, which have led to the development of a number of multiple-system models, can be given an alternative interpretation in terms of a single-system exemplar-similarity model. The phenomena include various rule- and prototype-described patterns of generalization, dissociations between categorization and similarity judgments, and dissociations between categorization and old-new recognition. The alternative exemplar-based interpretation relies on the idea that similarity is not an invariant relation but a context-dependent one. Similarity relations among exemplars change systematically because of selective attention to dimensions and because of changes in the level of sensitivity relating judged similarity to distance in psychological space. Adaptive learning principles may help explain the systematic influence of the selective attention process and of modulation in sensitivity settings on judged similarity.

252 citations


Proceedings ArticleDOI
13 Sep 2000
TL;DR: A semantics-only algorithm for learning morphology which only proposes affixes when the stem and stem-plus-affix are sufficiently similar semantically and it is shown that this approach provides morphology induction results that rival a current state-of-the-art system.
Abstract: Morphology induction is a subproblem of important tasks like automatic learning of machine-readable dictionaries and grammar induction. Previous morphology induction approaches have relied solely on statistics of hypothesized stems and affixes to choose which affixes to consider legitimate. Relying on stem-and-affix statistics rather than semantic knowledge leads to a number of problems, such as the inappropriate use of valid affixes ("ally" stemming to "all"). We introduce a semantic-based algorithm for learning morphology which only proposes affixes when the stem and stem-plus-affix are sufficiently similar semantically. We implement our approach using Latent Semantic Analysis and show that our semantics-only approach provides morphology induction results that rival a current state-of-the-art system.

233 citations
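The acceptance criterion described above (keep a stem/affix split only when the stem and the stem-plus-affix are sufficiently similar semantically) can be sketched as follows; the vectors here are toy stand-ins for real LSA word vectors, and the threshold is an arbitrary choice:

```python
import numpy as np

# Toy distributional vectors standing in for LSA word vectors (illustrative only).
vecs = {
    "depart":    np.array([0.9, 0.1, 0.0]),
    "departure": np.array([0.8, 0.2, 0.1]),
    "all":       np.array([0.1, 0.9, 0.2]),
    "ally":      np.array([0.0, 0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def propose_affix(stem, word, threshold=0.7):
    """Accept the stem/affix split only when the stem and the stem-plus-affix
    word are semantically close, per the semantics-only induction criterion."""
    return cosine(vecs[stem], vecs[word]) >= threshold

print(propose_affix("depart", "departure"))  # True: a valid "-ure" split
print(propose_affix("all", "ally"))          # False: "ally" should not stem to "all"
```

This is exactly the failure mode the paper cites: a purely statistical stemmer would happily strip "-y" from "ally", while the semantic check rejects it.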


Patent
18 Oct 2000
TL;DR: In this paper, state vectors representing the semantic content of a document are superpositioned to construct a single vector representing a semantic abstract for the document, which can be used to locate documents with similar semantic content.
Abstract: State vectors representing the semantic content of a document are created. The state vectors are superpositioned to construct a single vector representing a semantic abstract for the document. The single vector can be normalized. Once constructed, the single vector semantic abstract can be compared with semantic abstracts for other documents to measure a semantic distance between the documents, and can be used to locate documents with similar semantic content.

180 citations
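The superposition scheme in this patent amounts to summing per-word state vectors and normalizing the result; a minimal sketch, with invented word vectors, might look like:

```python
import numpy as np

# Hypothetical per-word state vectors (a real system would learn or look these up).
word_states = {
    "dog":  np.array([1.0, 0.2, 0.0]),
    "cat":  np.array([0.9, 0.3, 0.1]),
    "pet":  np.array([0.8, 0.4, 0.0]),
    "bank": np.array([0.0, 0.1, 1.0]),
    "loan": np.array([0.1, 0.0, 0.9]),
}

def semantic_abstract(words):
    """Superposition (vector sum) of state vectors, normalized to unit length."""
    v = np.sum([word_states[w] for w in words], axis=0)
    return v / np.linalg.norm(v)

def semantic_distance(a, b):
    """Angle between two unit-length semantic abstracts."""
    return float(np.arccos(np.clip(a @ b, -1.0, 1.0)))

doc1 = semantic_abstract(["dog", "cat", "pet"])
doc2 = semantic_abstract(["cat", "pet"])
doc3 = semantic_abstract(["bank", "loan"])

# Documents about pets lie closer to each other than to the finance document.
print(semantic_distance(doc1, doc2) < semantic_distance(doc1, doc3))  # True
```

Collapsing a document to one vector loses word order, but makes the "locate documents with similar semantic content" comparison a single distance computation.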


Journal ArticleDOI
TL;DR: This article focuses on three aspects of comparison that are central in structure-mapping theory: first, comparison involves structured representations, second, the comparison process is driven by a preference for connected relational structure and third, the mapping between domains is rooted in semantic similarity between the relations that characterize the domains.
Abstract: Carrying out similarity and analogy comparisons can be modeled as the alignment and mapping of structured representations. In this article we focus on three aspects of comparison that are central in structure-mapping theory. All three are controversial. First, comparison involves structured representations. Second, the comparison process is driven by a preference for connected relational structure. Third, the mapping between domains is rooted in semantic similarity between the relations that characterize the domains. For each of these points, we review supporting evidence and discuss some challenges raised by other researchers. We end with a discussion of the role of structure mapping in other cognitive processes.

156 citations


Proceedings ArticleDOI
30 Oct 2000
TL;DR: Weight ChainNet is a novel image representation model based on lexical chains that captures the semantics of an image from its nearby text; it outperforms existing techniques and can lead to significantly better retrieval effectiveness.
Abstract: Images are increasingly being embedded in HTML documents on the WWW. Such documents provide a rich collection of images that users can query. Interestingly, the semantics of these images are typically described by their surrounding text. Unfortunately, most WWW image search engines fail to exploit these image semantics and thus suffer from poor recall and precision. In this paper, we propose a novel image representation model called Weight ChainNet. Weight ChainNet is based on lexical chains that represent the semantics of an image from its nearby text. A new formula, called the list space model, for computing semantic similarities is also introduced. To further improve retrieval effectiveness, we also propose two relevance feedback mechanisms. We conducted an extensive performance study on a collection of 5000 images obtained from documents identified by more than 2000 URLs. Our results show that our models and methods outperform existing techniques. Moreover, the relevance feedback mechanisms can lead to significantly better retrieval effectiveness.

130 citations


Journal ArticleDOI
TL;DR: A way to factorize (simplify) concept lattices by the similarity of concepts is shown, along with how to reduce the computation of the similarity relations.
Abstract: This paper studies the issue of similarity relations in fuzzy concept lattices. Fuzzy concepts and fuzzy concept lattices represent a formal approach to the modelling of non-sharp (fuzzy) concepts and conceptual structures in the sense of traditional (Port-Royal) logic. Applications of concept lattices are in representation of conceptual knowledge and in conceptual analysis of (fuzzy) data. Similarity relations are defined and considered on three levels: similarity of objects (and similarity of attributes), similarity of concepts, and similarity of concept lattices. We show a way to factorize (simplify) concept lattices by the similarity of concepts. Also shown is how to reduce the computation of the similarity relations.

Patent
28 Jan 2000
TL;DR: In this article, the semantic space is created by a lexicon of concepts and relations between concepts, and each data element in the target data set being searched is associated with a location in semantic space.
Abstract: The present invention is directed to a system in which a semantic space is searched in order to determine the semantic distance between two locations. A further aspect of the present invention provides a system in which a portion of semantic space is purchased and associated with a target data set element which is returned in response to a search input. The semantic space is created by a lexicon of concepts and relations between concepts. An input is associated with a location in the semantic space. Similarly, each data element in the target data set being searched is associated with a location in the semantic space. Searching is accomplished by determining a semantic distance between the first and second location in semantic space, wherein this distance represents their closeness in meaning and where the cost for retrieval of target data elements is based on this distance.

Book ChapterDOI
TL;DR: A number of similarity measures are listed, some of which are not well known (such as the Monge-Kantorovich metric) or newly introduced (reflection metric), and a set of constructions that have been used in the design of some similarity measures is given.
Abstract: This paper formulates properties of similarity measures. We list a number of similarity measures, some of which are not well known (such as the Monge-Kantorovich metric) or newly introduced (reflection metric), and give a set of constructions that have been used in the design of some similarity measures.

Journal ArticleDOI
TL;DR: The present PET study examined three associative encoding conditions differing in the number of words semantically related to a third word representing the name of a semantic category to identify task-related patterns of activity for associative encoding and cued-recall tasks.

21 Jun 2000
TL;DR: Several models of semantic similarity are presented, based on differing representational assumptions, and their properties are investigated via comparison with human ratings of verb similarity to offer insight into the bases for human similarity judgments.
Abstract: The way we model semantic similarity is closely tied to our understanding of linguistic representations. We present several models of semantic similarity, based on differing representational assumptions, and investigate their properties via comparison with human ratings of verb similarity. The results offer insight into the bases for human similarity judgments and provide a testbed for further investigation of the interactions among syntactic properties, semantic structure, and semantic content.

Journal ArticleDOI
TL;DR: Two studies investigating the computer-based representation of the semantic information content of databases using object location in two- and three-dimensional virtual space supported the conclusion that, for the purpose of information search, the amount of additional semantic information that can be conveyed by a three- dimensional solution does not outweigh the associated additional cognitive demands.
Abstract: This paper reports two studies investigating the computer-based representation of the semantic information content of databases using object location in two- and three-dimensional virtual space. In the first study, the cognitive demands associated with performing an information search task were examined under conditions where the “goodness of fit” of the spatial-semantic “mapping” was manipulated. The effects of individual differences in spatial ability and associative memory ability also were considered. Results indicated that performance equivalence, between two- and three-dimensional interfaces, could be achieved when the two-dimensional interface accounted for between 50 and 70% of the semantic variance accounted for by the three-dimensional solution. A second study, in which automatic text analysis was used to generate two- and three-dimensional solutions for document sets of varying sizes and types, supported the conclusion that, for the purpose of information search, the amount of additional semantic information that can be conveyed by a three-dimensional solution does not outweigh the associated additional cognitive demands.

Patent
11 Jul 2000
TL;DR: In this paper, a method for computing the similarity between a first and a second set of words comprises identifying a word of the second set as being most similar to a word of the first set, and computing a score of the similarity between the first and second sets based at least in part on that word.
Abstract: A system and associated methods determine the semantic similarity of different sentences to one another. A particularly appropriate application of the present invention is to automatic processing of Chinese-language text, for example, for document retrieval. A method for computing the similarity between a first and a second set of words comprises identifying a word of the second set of words as being most similar to a word of the first set of words, wherein the word of the second set of words need not be identical to the word of the first set of words; and computing a score of the similarity between the first and second set of words based at least in part on the word of the second set of words.
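The scoring method claimed above can be sketched as a best-match average: for each word in the second set, find the most similar (not necessarily identical) word in the first set, then combine the best scores. The bigram-overlap similarity used here is a stand-in, since the patent leaves the word-level measure open:

```python
def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def word_sim(w1, w2):
    """Stand-in lexical similarity: Jaccard overlap of character bigrams.
    (The patent does not commit to a particular word-level measure.)"""
    a, b = bigrams(w1), bigrams(w2)
    return len(a & b) / len(a | b) if a | b else 0.0

def set_similarity(first, second):
    """For each word in the second set, find the most similar word in the
    first set (it need not be identical), then average those best scores."""
    if not second:
        return 0.0
    best = [max(word_sim(w1, w2) for w1 in first) for w2 in second]
    return sum(best) / len(best)

s = set_similarity(["retrieve", "document"], ["retrieval", "documents"])
print(round(s, 2))  # → 0.77
```

The key property is that morphological variants like "retrieval"/"retrieve" still contribute a high score even though no word matches exactly, which is what makes the method useful for Chinese text, where exact word matches after segmentation are rare.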

Patent
05 Dec 2000
TL;DR: In this paper, a search engine for searching a corpus improves the relevancy of the results by classifying multiple terms in a search query as a single semantic unit, and the resultant semantic units are used to refine the results of the search.
Abstract: A search engine for searching a corpus improves the relevancy of the results by classifying multiple terms in a search query as a single semantic unit. A semantic unit locator of the search engine generates a subset of documents that are generally relevant to the query based on the individual terms within the query. Combinations of search terms that define potential semantic units from the query are then evaluated against the subset of documents to determine which combinations of search terms should be classified as a semantic unit. The resultant semantic units are used to refine the results of the search.

Journal ArticleDOI
TL;DR: This result shows that semantic priming, as indexed by the N400 component, can be supported by nonassociative visual-perceptual semantic relations, and provides support for models of semantic representation which incorporate semantic features and form information.

Journal ArticleDOI
TL;DR: In this article, the relation between similarity and dissimilarity of meaning and similarity of context was analyzed for synonymous nouns, and the data strongly supported a contextual hypothesis of meaning.
Abstract: The relation between similarity and dissimilarity of meaning and similarity of context was analyzed for synonymous nouns. New semantic similarity and dissimilarity rating tests with an empirically determined series of linguistic anchors and conventional, arbitrarily anchored semantic similarity ratings were compared. Contextual similarity was elicited by a sorting test based on substitution and yielding d-primes. The study found reliable correlations between the d-primes and the different ratings for semantic similarity and dissimilarity of the synonymous nouns across a wide continuum of meaning. The data strongly supported a contextual hypothesis of meaning. The data endorsed the claim that people abstract a contextual representation from experiencing the multiple natural linguistic contexts of a word. Semantic similarity and dissimilarity rating formats with an empirically chosen series of linguistic anchors and a sorting test of contextual similarity yielded stronger support for a contextual hypothesis than did alternative methods of eliciting lexical and contextual similarity.

Journal ArticleDOI
TL;DR: The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database.
Abstract: This paper presents a syntactic/semantic string representation scheme as well as a string matching method as part of a computer-assisted system to identify dolphins from photographs of their dorsal fins. A low-level string representation is constructed from the curvature function of a dolphin's fin trailing edge, consisting of positive and negative curvature primitives. A high-level string representation is then built over the low-level string via merging appropriate groupings of primitives in order to obtain a representation less sensitive to curvature fluctuations or noise. A family of syntactic/semantic distance measures between two strings is introduced. A composite distance measure is then defined and used as a dissimilarity measure for database search, highlighting both the syntax (structure or sequence) and semantic (attribute or feature) differences. The syntax consists of an ordered sequence of significant protrusions and intrusions on the edge, while the semantics consist of seven attributes extracted from the edge and its curvature function. The matching results are reported for a database of 624 images corresponding to 164 individual dolphins. The identification results indicate that the developed string matching method performs better than the previous matching methods including dorsal ratio, curvature, and curve matching. The developed computer-assisted system can help marine mammalogists in their identification of dolphins, since it allows them to examine only a handful of candidate images instead of the currently used manual searching of the entire database. © 2000 Biomedical Engineering Society.


Book ChapterDOI
18 Sep 2000
TL;DR: This paper illustrates how hierarchical spatial relationships can be used to provide more flexible retrieval for queries incorporating place names in applications employing online gazetteers and geographical thesauri and investigates key issues affecting use of the associative thesaurus relationships in semantic distance measures.
Abstract: The OASIS (Ontologically Augmented Spatial Information System) project explores terminology systems for thematic and spatial access in digital library applications. A prototype implementation uses data from the Royal Commission on the Ancient and Historical Monuments of Scotland, together with the Getty AAT and TGN thesauri. This paper describes its integrated spatial and thematic schema and discusses novel approaches to the application of thesauri in spatial and thematic semantic distance measures. Semantic distance measures can underpin interactive and automatic query expansion techniques by ranking lists of candidate terms. We first illustrate how hierarchical spatial relationships can be used to provide more flexible retrieval for queries incorporating place names in applications employing online gazetteers and geographical thesauri. We then employ a set of experimental scenarios to investigate key issues affecting use of the associative (RT) thesaurus relationships in semantic distance measures. Previous work has noted the potential of RTs in thesaurus search aids but the problem of increased noise in result sets has been emphasised. Specialising RTs allows the possibility of dynamically linking RT type to query context. Results presented in this paper demonstrate the potential for filtering on the context of the RT link and on subtypes of RT relationships.

01 Dec 2000
TL;DR: The onto-matching technique introduced in the paper makes extensive use of lexical ontologies similar to WordNet to address the question how concepts can be used as ATEs more efficiently in order to match "small duck" with "small bird".
Abstract: Evaluation of the closeness of two texts is a subtask for FTR and IR systems. The basic means used to accomplish it is the matching of atomic text entities (ATEs) such as words, stems, simple phrases and/or concepts. We address the question how concepts can be used as ATEs more efficiently in order to match "small duck" with "small bird". The onto-matching technique introduced in the paper makes extensive use of lexical ontologies similar to WordNet. We work with two tasks in mind: query expansion and text concept indexing. We outline some arguments showing why onto-matching is useful and how it can be implemented. Also, we conducted some experiments with query expansion for

Journal ArticleDOI
TL;DR: A conceptual integration approach is introduced that exploits the similarity in metalevel information in existing systems and performs metadata mining on database objects to discover a set of concepts that serve as a domain abstraction and provide a conceptual layer above existing legacy systems.
Abstract: Autonomy of operations combined with decentralized management of data gives rise to a number of heterogeneous databases or information systems within an enterprise. These systems are often incompatible in structure as well as content and, hence, difficult to integrate. Despite heterogeneity, the unity of overall purpose within a common application domain, nevertheless, provides a degree of semantic similarity that manifests itself in the form of similar data structures and common usage patterns of existing information systems. This article introduces a conceptual integration approach that exploits the similarity in metalevel information in existing systems and performs metadata mining on database objects to discover a set of concepts that serve as a domain abstraction and provide a conceptual layer above existing legacy systems. This conceptual layer is further utilized by an information reengineering framework that customizes and packages information to reflect the unique needs of different user groups within the application domain. The architecture of the information reengineering framework is based on an object-oriented model that represents the discovered concepts as customized application objects for each distinct user group.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: An overview of the usage of LSA for analysis of textual data is presented, and the potential of LSA is demonstrated on a selected corpus of religious and sacred texts.
Abstract: The paper presents an overview of the usage of LSA for analysis of textual data. The mathematical apparatus is explained in brief, and special attention is paid to the key parameters that influence the quality of the results obtained. The potential of LSA is demonstrated on a selected corpus of religious and sacred texts. The results of an experimental application of LSA for educational purposes are also presented.

Patent
June-Jei Kuo1
18 Jul 2000
TL;DR: In this article, a Chinese word segmentation apparatus relates to processing of a Chinese sentence input to a computer, where a character-to-phonetic converter converts the Chinese sentence into a phonetic symbol string while referring to a character phonetic dictionary and a dictionary for characters with different pronunciations.
Abstract: A Chinese word segmentation apparatus relates to processing of a Chinese sentence input to a computer. A character-to-phonetic converter of the segmentation apparatus initially converts a Chinese sentence into a phonetic symbol string while referring to a character phonetic dictionary and a dictionary for characters with different pronunciations. Thereafter, a candidate word-selector refers to a system dictionary to retrieve all of the possible candidate characters or words in the phonetic symbol string and relevant information, such as frequency of use, using the phonetic symbols as indexing terms. Unfeasible candidate characters or words are discarded. Subsequently, an optimum candidate character string-decider builds a candidate word network using starting and ending positions of each candidate character or word in the input sentence as indexing terms. By referring to semantic and syntax information portions, frequency of use prioritization, word length prioritization, semantic similarity prioritization and syntax prioritization are combined to obtain a total estimate. The optimum route for word segmentation is then found, and a word segmentation marking portion adds word segmentation markers into the input sentence while referring to the optimum route to complete word segmentation.

Patent
26 Jul 2000
TL;DR: In this paper, a character-to-phonetic converting portion of the Chinese word segmentation apparatus initially converts a Chinese sentence inputted from an input portion of a computer system into a phonetic symbol string while referring to a character phonetic dictionary and a dictionary for characters with different pronunciations.
Abstract: The Chinese word segmentation apparatus of this invention relates to a technique for word segmentation processing of a Chinese sentence inputted into a computer by using character phonetic information in the computer system. A character-to-phonetic converting portion of the Chinese word segmentation apparatus initially converts a Chinese sentence inputted from an input portion of the computer system into a phonetic symbol string while referring to a character phonetic dictionary and a dictionary for characters with different pronunciations. Thereafter, a candidate word-selecting portion refers to a system dictionary to retrieve all of the possible candidate characters or words in the phonetic symbol string and the relevant information, such as frequency of use, etc., using the phonetic symbols as indexing terms. Unfeasible candidate characters or words are discarded via matching means while referring to the characters in the input sentence and syntax constraints of connected candidate words. Subsequently, an optimum candidate character string-deciding portion builds a candidate word network using starting and ending positions of each candidate character or word in the input sentence as indexing terms. By referring to a semantic information portion and a syntax information portion, frequency of use prioritization, word length prioritization, semantic similarity prioritization and syntax prioritization are combined to obtain a total estimate. The optimum route for word segmentation is then found by a dynamic programming method. Finally, a word segmentation marking portion adds word segmentation markers into the input sentence while referring to the optimum route to complete word segmentation in the Chinese word segmentation apparatus. The apparatus of this invention can achieve a word segmentation accuracy of more than 98%.
By avoiding cumbersome iterative calculations, the invention dramatically increases operating efficiency and accuracy during Chinese word segmentation.

Journal ArticleDOI
TL;DR: The results suggest that the advantage shown by concrete words in terms of a greater number of predicates is only apparent for words of low frequency, and models based on the assumption of a "richer" semantic representation for concrete words are not supported.

Proceedings Article
01 Jan 2000
TL;DR: The method was validated by finding the k nearest neighbors of ten different diagnoses from the ICD-10 cardiovascular chapter by using the notion of semantic distance to find the nearest neighbor of a medical concept in a controlled vocabulary.
Abstract: OBJECTIVE: To use the notion of semantic distance to find the nearest neighbors of a medical concept in a controlled vocabulary. MATERIAL AND METHOD: 392 concepts from the cardiovascular chapter of the ICD-10 were projected on the axes of SNOMED III. Distances were measured on each axis and the resulting distance was found using an Lp norm. RESULTS: The distance between a set of ischemic diseases and a set of non-ischemic diseases was significant (p < 0.0001). Our method was validated by finding the k nearest neighbors of ten different diagnoses from the ICD-10 cardiovascular chapter. DISCUSSION: The availability of SNOMED-RT should improve our method. Several more steps are necessary to provide an ideal coding tool.
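The distance computation described in this abstract (per-axis distances combined with an Lp norm, then a k-nearest-neighbor search) can be sketched as follows; the concept names and axis coordinates are invented for illustration, not taken from SNOMED:

```python
# Hypothetical positions of concepts after projection onto vocabulary axes
# (e.g. topography, morphology, etiology); the values are illustrative only.
concepts = {
    "myocardial_infarction": (1.0, 2.0, 0.5),
    "angina_pectoris":       (1.2, 1.8, 0.6),
    "atrial_fibrillation":   (3.0, 0.5, 2.0),
}

def lp_distance(a, b, p=2.0):
    """Combine per-axis distances with an Lp norm, as in the method above."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def nearest_neighbors(name, k=1):
    """Rank all other concepts by semantic distance and return the k closest."""
    others = [c for c in concepts if c != name]
    others.sort(key=lambda c: lp_distance(concepts[name], concepts[c]))
    return others[:k]

# An ischemic disease lands nearest another ischemic disease.
print(nearest_neighbors("myocardial_infarction", k=1))  # → ['angina_pectoris']
```

The exponent p is a tunable choice (p=1 gives a city-block distance, p=2 Euclidean); the paper's finding that ischemic and non-ischemic clusters separate significantly is a property of the real SNOMED projections, not of this toy data.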