Showing papers on "Semantic similarity published in 1997"


01 Aug 1997
TL;DR: This paper presents a new approach for measuring semantic similarity/distance between words and concepts that combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data.
Abstract: This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human subjects replicate the same task.
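
The combination described here admits a compact formulation. As a hedged sketch (function names are illustrative, not from the paper): taking IC(c) = -log P(c) from corpus counts, the distance between two concepts can be computed from their information content and that of their lowest common subsumer in the taxonomy:

```python
import math

def information_content(freq, total):
    """IC(c) = -log P(c), with P(c) estimated from corpus frequency counts."""
    return -math.log(freq / total)

def combined_distance(ic_c1, ic_c2, ic_lcs):
    """Distance between two concepts given the information content of
    their lowest common subsumer (LCS) in the taxonomy: the more
    information the concepts share (higher IC of the LCS), the closer
    they are."""
    return ic_c1 + ic_c2 - 2.0 * ic_lcs
```

Under this formulation, concept pairs whose lowest common subsumer is deep and rare (high IC) come out close together, which is how the corpus statistics sharpen the raw edge counts.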

3,061 citations


Posted Content
TL;DR: This paper proposes a new approach for measuring semantic similarity/distance between words and concepts, which combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data.
Abstract: This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the semantic space constructed by the taxonomy can be better quantified with the computational evidence derived from a distributional analysis of corpus data. Specifically, the proposed measure is a combined approach that inherits the edge-based approach of the edge counting scheme, which is then enhanced by the node-based approach of the information content calculation. When tested on a common data set of word pair similarity ratings, the proposed approach outperforms other computational models. It gives the highest correlation value (r = 0.828) with a benchmark based on human similarity judgements, whereas an upper bound (r = 0.885) is observed when human subjects replicate the same task.

973 citations


Journal ArticleDOI
TL;DR: In this article, the role of correlations among features and the differences between speeded and untimed tasks with respect to the use of featural information were explored, and it was shown that the degree to which features are intercorrelated plays an important role in the organization of semantic memory.
Abstract: Behavioral experiments and a connectionist model were used to explore the use of featural representations in the computation of word meaning. The research focused on the role of correlations among features, and differences between speeded and untimed tasks with respect to the use of featural information. The results indicate that featural representations are used in the initial computation of word meaning (as in an attractor network), patterns of feature correlations differ between artifacts and living things, and the degree to which features are intercorrelated plays an important role in the organization of semantic memory. The studies also suggest that it may be possible to predict semantic priming effects from independently motivated featural theories of semantic relatedness. Implications for related behavioral phenomena such as the semantic impairments associated with Alzheimer's disease (AD) are discussed.

577 citations


Journal ArticleDOI
TL;DR: Two new similarity measures are proposed, representing the similarity between fuzzy sets and between elements, respectively; both can be computed easily and express the similarity relation clearly.

302 citations


Journal ArticleDOI
TL;DR: The proposed method to handle approximate searching by image content in medical image databases has several desirable properties: it is much faster than sequential scanning for searching in the main memory and on the disk, thus scaling up well for large databases.
Abstract: We propose a method to handle approximate searching by image content in medical image databases. Image content is represented by attributed relational graphs holding features of objects and relationships between objects. The method relies on the assumption that a fixed number of "labeled" or "expected" objects (e.g., "heart", "lungs", etc.) are common in all images of a given application domain in addition to a variable number of "unexpected" or "unlabeled" objects (e.g., "tumor", "hematoma", etc.). The method can answer queries by example, such as "find all X-rays that are similar to Smith's X-ray". The stored images are mapped to points in a multidimensional space and are indexed using state-of-the-art database methods (R-trees). The proposed method has several desirable properties: (a) Database search is approximate, so that all images up to a prespecified degree of similarity (tolerance) are retrieved. (b) It has no "false dismissals" (i.e., all images satisfying the query selection criteria are retrieved). (c) It is much faster than sequential scanning for searching in the main memory and on the disk (i.e., by up to an order of magnitude), thus scaling up well for large databases.
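
A minimal sketch of the query step under the stated design, assuming images have already been mapped to fixed-dimension feature points (a production system would index the points with an R-tree rather than scan linearly; all names here are illustrative):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(points, query_point, tolerance):
    """Return ids of all stored images whose feature points lie within
    `tolerance` of the query image's point. Because every point inside
    the radius is returned, there are no false dismissals."""
    return [img_id for img_id, p in points.items()
            if euclidean(p, query_point) <= tolerance]

# Usage: points maps image ids to feature vectors derived from
# attributed relational graphs, e.g. {"smith_xray": (0.3, 1.2, 0.8)}.
```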

284 citations


Proceedings Article
01 Jan 1997
TL;DR: It is shown that, using semantic information, mixture LMs perform better than a conventional single LM with a slight increase in computational cost; the approach builds on previous work in the field of information retrieval, and automatic clustering is compared to manual clustering.
Abstract: In this paper, an approach for constructing mixture language models (LMs) based on some notion of semantics is discussed. To this end, a technique known as latent semantic analysis (LSA) is used. The approach encapsulates corpus-derived semantic information and is able to model the varying style of the text. Using such information, the corpus texts are clustered in an unsupervised manner and mixture LMs are automatically created. This work builds on previous work in the field of information retrieval which was recently applied by Bellegarda et al. to the problem of clustering words by semantic categories. The principal contribution of this work is to characterize the document space resulting from the LSA modeling and to demonstrate the approach for mixture LM application. Comparison is made between manual and automatic clustering in order to elucidate how the semantic information is expressed in the space. It is shown that, using semantic information, mixture LMs perform better than a conventional single LM with a slight increase in computational cost.
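
A sketch of the LSA step under common assumptions (a raw term-document count matrix and numpy's SVD; names and the rank are illustrative): documents are projected into a low-rank latent space, and documents that are close there by cosine similarity are grouped into the clusters from which the mixture components would be trained:

```python
import numpy as np

def lsa_document_vectors(term_doc_matrix, rank):
    """Project documents into a `rank`-dimensional latent semantic
    space via truncated SVD of the (terms x documents) count matrix."""
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Columns of diag(s_k) @ V_k^T are document coordinates; transpose
    # so each row is one document.
    return (np.diag(s[:rank]) @ vt[:rank]).T

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Clustering these rows (e.g. with k-means) and training one LM per cluster, then interpolating the cluster LMs, yields the mixture LM the abstract describes.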

192 citations


Proceedings Article
01 Jan 1997
TL;DR: This paper presents a corpus-based method that can be used to build semantic lexicons for specific categories using a small set of seed words for a category and a representative text corpus.
Abstract: Semantic knowledge can be a great asset to natural language processing systems, but it is usually hand-coded for each application. Although some semantic information is available in general-purpose knowledge bases such as WordNet and Cyc, many applications require domain-specific lexicons that represent words and categories for a particular topic. In this paper, we present a corpus-based method that can be used to build semantic lexicons for specific categories. The input to the system is a small set of seed words for a category and a representative text corpus. The output is a ranked list of words that are associated with the category. A user then reviews the top-ranked words and decides which ones should be entered in the semantic lexicon. In experiments with five categories, users typically found about 60 words per category in 10-15 minutes to build a core semantic lexicon.
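
A hedged sketch of a scoring loop in the spirit of this bootstrapping method, assuming tokenized sentences and a small seed set; the statistic used here (the fraction of a candidate's occurrences that co-occur with a seed) is one simple choice, not necessarily the paper's exact formula:

```python
from collections import Counter

def rank_candidates(sentences, seeds, min_count=3):
    """Score each non-seed word by the fraction of its occurrences in
    sentences that also contain a seed word for the category, then
    return candidates ranked best-first for human review."""
    seeds = set(seeds)
    total, with_seed = Counter(), Counter()
    for sent in sentences:
        words = set(sent)
        hit = bool(words & seeds)
        for w in words - seeds:
            total[w] += 1
            if hit:
                with_seed[w] += 1
    scores = {w: with_seed[w] / total[w]
              for w in total if total[w] >= min_count}
    return sorted(scores, key=scores.get, reverse=True)
```

A user would then review the top of this ranking and keep the words that belong in the category's lexicon.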

162 citations


Journal ArticleDOI
01 Jan 1997-Cortex
TL;DR: A patient suffering from semantic dementia is described who consistently demonstrated the preserved ability to support specific types of semantic judgements from visual, but not from verbal, input.

114 citations


Journal ArticleDOI
01 Mar 1997
TL;DR: This paper describes a research project, in which similarity measures have been extended to include imprecise matching over different dimensions of structured classification schemes (subject, space, time), and a semantic hypermedia architecture is outlined.
Abstract: Notions of similarity underlie a wide variety of access methods in hypermedia and information retrieval. This paper describes a research project, in which similarity measures have been extended to include imprecise matching over different dimensions of structured classification schemes (subject, space, time). The semantic similarity of information units forms the basis for the automatic construction of links and is integrated into hypermedia navigation. A semantic hypermedia architecture is outlined, and a prototype museum social history application is described. Illustrative navigation scenarios are presented which make use of a navigation via similarity tool. Three different measures of semantic closeness underpin the similarity tool. The temporal measure takes account of periods as well as time points. The most general measure is based on a traversal of a semantic net, taking into account relationship type and level of specialization. It is based on a notion of closeness rather than absolute distance, and returns a set of semantically close terms. A method of calculating semantic similarity between sets of index terms, based on the maximal closeness values achieved by each term is discussed.
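
A sketch of the most general measure as described, with assumed attenuation weights per relationship type: a traversal that accumulates closeness multiplicatively and returns the set of terms above a cutoff, rather than an absolute distance:

```python
from collections import deque

# Illustrative attenuation factors per relationship type (assumed).
TYPE_WEIGHT = {"broader": 0.8, "narrower": 0.8, "related": 0.5}

def close_terms(graph, start, cutoff=0.3):
    """graph: {term: [(neighbour, rel_type), ...]}. Returns terms whose
    closeness to `start` (the product of edge weights along the best
    path found) is at least `cutoff`."""
    closeness = {start: 1.0}
    queue = deque([start])
    while queue:
        term = queue.popleft()
        for neighbour, rel in graph.get(term, []):
            c = closeness[term] * TYPE_WEIGHT.get(rel, 0.4)
            if c >= cutoff and c > closeness.get(neighbour, 0.0):
                closeness[neighbour] = c
                queue.append(neighbour)
    return {t: c for t, c in closeness.items() if t != start}
```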

69 citations


Journal ArticleDOI
TL;DR: A multi-modal logical language for reasoning about relative similarities is presented and the modalities correspond semantically to the upper and lower approximations of a set of objects by similarity relations corresponding to all subsets of a given set of properties of objects.
Abstract: A similarity relation is a reflexive and symmetric binary relation between objects. Similarity is relative: it depends on the set of properties of objects used in determining their similarity or dissimilarity. A multi-modal logical language for reasoning about relative similarities is presented. The modalities correspond semantically to the upper and lower approximations of a set of objects by similarity relations corresponding to all subsets of a given set of properties of objects. A complete deduction system for the language is presented.
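
The upper and lower approximations that the modalities correspond to semantically can be made concrete. A minimal sketch over a finite universe, where `similar` is the reflexive, symmetric relation induced by a chosen subset of properties:

```python
def lower_approx(universe, X, similar):
    """Objects all of whose similar objects lie in X: everything that
    is *certainly* in X under the similarity relation."""
    return {x for x in universe
            if all(y in X for y in universe if similar(x, y))}

def upper_approx(universe, X, similar):
    """Objects similar to at least one member of X: everything that
    is *possibly* in X under the similarity relation."""
    return {x for x in universe
            if any(y in X for y in universe if similar(x, y))}
```

Varying the property subset varies `similar`, which is exactly the relativity of similarity the abstract emphasizes.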

68 citations


Journal ArticleDOI
TL;DR: This paper examines the locus of the refractoriness by comparing JM's performance on tasks requiring access to presemantic perceptual representations or to semantic information, and finds that when both words and pictures were used as stimuli, JM was unimpaired on tasks thought to tap presemantic representations, but performance became refractory on tasks that required access to fine-grained semantic information.
Abstract: A single case study is presented of a global aphasic patient, JM, who shows similar impairments to other patients in the literature classified as "access dysphasics": On auditory word-written word matching tasks, JM's performance is inconsistent, it declines over repetitions, and it is sensitive to presentation rate and to the semantic relatedness of distractors (Forde & Humphreys, 1995). The deleterious effects of stimulus repetition can be attributed to JM's access to semantic information becoming refractory following activation of a target word or object. This paper examines the locus of the refractoriness by comparing JM's performance on tasks requiring access to presemantic perceptual representations or to semantic information. When both words and pictures were used as stimuli, JM was unimpaired on tasks thought to tap presemantic representations, such as visual lexical decision and unusual views matching, but performance became refractory on tasks that required access to fine-grained semantic information.

01 Jan 1997
TL;DR: An introduction to some of the emerging research in the application of corpus-based learning techniques to problems in semantic interpretation, namely, word-sense disambiguation and semantic parsing.
Abstract: In recent years, there has been a flurry of research into empirical, corpus-based learning approaches to natural language processing (NLP). Most empirical NLP work to date has focused on relatively low-level language processing such as part-of-speech tagging, text segmentation, and syntactic parsing. The success of these approaches has stimulated research in using empirical learning techniques in other facets of NLP, including semantic analysis—uncovering the meaning of an utterance. This article is an introduction to some of the emerging research in the application of corpus-based learning techniques to problems in semantic interpretation. In particular, we focus on two important problems in semantic interpretation, namely, word-sense disambiguation and semantic parsing.

Proceedings ArticleDOI
07 Jul 1997
TL;DR: A data-oriented semantic interpretation algorithm was tested on two semantically annotated corpora: the English ATIS corpus and the Dutch OVIS corpus and shows an increase in semantic accuracy if larger corpus-fragments are taken into consideration.
Abstract: In data-oriented language processing, an annotated language corpus is used as a stochastic grammar. The most probable analysis of a new sentence is constructed by combining fragments from the corpus in the most probable way. This approach has been successfully used for syntactic analysis, using corpora with syntactic annotations such as the Penn Treebank. If a corpus with semantically annotated sentences is used, the same approach can also generate the most probable semantic interpretation of an input sentence. The present paper explains this semantic interpretation method. A data-oriented semantic interpretation algorithm was tested on two semantically annotated corpora: the English ATIS corpus and the Dutch OVIS corpus. Experiments show an increase in semantic accuracy if larger corpus-fragments are taken into consideration.

Journal ArticleDOI
TL;DR: A geometric model of similarity measurement is proposed that subsumes most of the models proposed for psychological similarity; since matching based on meaning is impossible in image databases, it should be replaced by similarity assessment and, in particular, by something close to human preattentive similarity.
Abstract: Multimedia databases (in particular image databases) are different from traditional systems since they cannot ignore the perceptual substratum on which the data come. There are several consequences of this fact. The most relevant for our purposes is that it is no longer possible to identify a well-defined meaning of an image and, therefore, matching based on meaning is impossible. Matching should be replaced by similarity assessment and, in particular, by something close to human preattentive similarity. In this paper we propose a geometric model of similarity measurement that subsumes most of the models proposed for psychological similarity.

Patent
31 Jul 1997
TL;DR: In this article, the authors proposed a method for determining whether a semantic relation should be inferred from a lexical knowledge base, despite the fact that it does not occur in the knowledge base.
Abstract: The present invention provides a facility for determining, for a semantic relation that does not occur in a lexical knowledge base, whether this semantic relation should be inferred despite its absence from the lexical knowledge base. This semantic relation to be inferred is preferably made up of a first word, a second word, and a relation type relating the meanings of the first and second words. In a preferred embodiment, the facility identifies a salient semantic relation having the relation type of the semantic relation to be inferred and relating the first word to an intermediate word other than the second word. The facility then generates a quantitative measure of the similarity in meaning between the intermediate word and the second word. The facility further generates a confidence weight for the semantic relation to be inferred based upon the generated measure of similarity in meaning between the intermediate word and the second word. The facility may also generate a confidence weight for the semantic relation to be inferred based upon the weights of one or more paths connecting the first and second words.
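
A hedged sketch of the inference described in the claim, with all names hypothetical: a relation (w1, rel, w2) missing from the knowledge base receives a confidence weight derived from the stored relations (w1, rel, w') via the similarity of each intermediate word w' to w2:

```python
def infer_confidence(kb, similarity, w1, rel, w2):
    """kb: {(word, rel): [intermediate words]}. Returns a confidence
    weight for the unseen relation (w1, rel, w2), taken here as the
    best similarity between w2 and any word w' such that (w1, rel, w')
    occurs in the knowledge base; 0.0 if no such relation exists."""
    intermediates = kb.get((w1, rel), [])
    return max((similarity(w, w2) for w in intermediates), default=0.0)
```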

Proceedings Article
27 Jul 1997
TL;DR: This paper presents a new similarity assessment approach which couples similarity judgments directly to a case library containing the system's adaptation knowledge, and examines this approach in the context of a case-based planning system that learns both new plans and new adaptations.
Abstract: Case-based problem-solving systems rely on similarity assessment to select stored cases whose solutions are easily adaptable to fit current problems. However, widely-used similarity assessment strategies, such as evaluation of semantic similarity, can be poor predictors of adaptability. As a result, systems may select cases that are difficult or impossible for them to adapt, even when easily adaptable cases are available in memory. This paper presents a new similarity assessment approach which couples similarity judgments directly to a case library containing the system's adaptation knowledge. It examines this approach in the context of a case-based planning system that learns both new plans and new adaptations. Empirical tests of alternative similarity assessment strategies show that this approach enables better case selection and increases the benefits accrued from learned adaptations.

Journal ArticleDOI
TL;DR: A prototype system was built to implement this model by modifying the SMART system and using the Xerox Part-Of-Speech (P-O-S) tagger as the pre-processor of the indexing process.
Abstract: This article presents the Semantic Vector Space Model (SVSM), a text representation and searching technique based on the combination of Vector Space Model (VSM) with heuristic syntax parsing and distributed representation of semantic case structures. In this model, both documents and queries are represented as semantic matrices. A search mechanism is designed to compute the similarity between two semantic matrices to predict relevancy. A prototype system was built to implement this model by modifying the SMART system and using the Xerox Part-Of-Speech (P-O-S) tagger as the pre-processor of the indexing process. The prototype system was used in an experimental study to evaluate this technique in terms of precision, recall, and effectiveness of relevance ranking. The results of the study showed that if documents and queries were too short (typically less than 2 lines in length), the technique was less effective than VSM. But with longer documents and queries, especially when original documents were used as queries, we found that the system based on our technique had significantly better performance than SMART.
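
The core ranking operation, similarity between two semantic matrices, is left abstract in the summary. One natural reading (an assumption, not necessarily the paper's exact function) treats the matrices as flat vectors and takes their cosine:

```python
import numpy as np

def matrix_similarity(doc_matrix, query_matrix):
    """Similarity between two semantic matrices (rows: semantic cases,
    columns: term dimensions, layout assumed) as a normalized Frobenius
    inner product, i.e. the cosine of the matrices viewed as vectors."""
    d = np.asarray(doc_matrix).ravel()
    q = np.asarray(query_matrix).ravel()
    return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q) + 1e-12))
```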


Proceedings Article
01 Jan 1997
TL;DR: The result is that links between articles generated using the hypertext linking methodology have a significant advantage over links generated by the competing methodology.
Abstract: We describe a novel method for automatically generating hypertext links within and between newspaper articles. The method is based on lexical chaining, a technique for extracting the sets of related words that occur in texts. Links between the paragraphs of a single article are built by considering the distribution of the lexical chains in that article. Links between articles are built by considering how the chains in the two articles are related. By using lexical chaining we mitigate the problems of synonymy and polysemy that plague traditional information retrieval approaches to automatic hypertext generation. In order to motivate our research, we discuss the results of a study that shows that humans are inconsistent when assigning hypertext links within newspaper articles. Even if humans were consistent, the time needed to build a large hypertext and the costs associated with the production of such a hypertext make relying on human linkers an untenable decision. Thus we are left to automatic hypertext generation. Because we wish to determine how our hypertext generation methodology performs when compared to other proposed methodologies, we present a study comparing the hypertext linking methodology that we propose with a methodology based on a traditional information retrieval approach. In this study, subjects were asked to perform a question-answering task using a combination of links generated by our methodology and the competing methodology. The result is that links between articles generated using our methodology have a significant advantage over links generated by the competing methodology. We show combined results for all subjects tested, along with results based on subjects' experience in using the World Wide Web. We detail the construction of a system for performing automatic hypertext generation in the context of an online newspaper. The proposed system is fully capable of handling large databases of news articles in an efficient manner.
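
A hedged sketch of the between-article linking criterion, under the assumption that each article's lexical chains are available as sets of related words; the overlap statistic and threshold here are illustrative, not the paper's exact rule:

```python
def chain_overlap(chains_a, chains_b):
    """Each argument is a list of lexical chains (sets of related
    words) extracted from one article. Returns the best overlap score
    between any chain pair: |intersection| / |smaller chain|."""
    best = 0.0
    for ca in chains_a:
        for cb in chains_b:
            if ca and cb:
                best = max(best, len(ca & cb) / min(len(ca), len(cb)))
    return best

def should_link(chains_a, chains_b, threshold=0.5):
    """Propose a hypertext link when the two articles' chains are
    sufficiently related (an illustrative criterion)."""
    return chain_overlap(chains_a, chains_b) >= threshold
```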

Journal ArticleDOI
TL;DR: This paper showed that latencies appeared to be conjointly determined by syntactic and semantic context, which suggests the existence of an isolable level of syntactic assignment that precedes semantic integration of content words in sentence comprehension.
Abstract: Manipulating the semantic relatedness of noun and verb targets in contexts where they are grammatically appropriate or inappropriate allows for simultaneous examination of syntactic and semantic context effects. A lexical-decision experiment showed both a syntactic context effect and a semantic relatedness effect that was stronger in syntactically appropriate conditions. Thus, latencies appeared to be conjointly determined by syntactic and semantic context. In contrast, naming experiments also showed both semantic and syntactic effects, but the syntactic context effect was independent of semantic relatedness and was observed in the virtual absence of sensitivity to semantic anomaly. Thus, syntactic and semantic processing are largely dissociable in the naming task. In conjunction with other findings in the literature, this suggests the existence of an isolable level of syntactic assignment that precedes semantic integration of content words in sentence comprehension.

Patent
31 Jul 1997
TL;DR: In this article, the authors identify salient semantic relation paths between two words using a knowledge base, and then determine the level of saliency of a particular path between the two words by combining the levels of salience determined for the semantic relations in the path.
Abstract: The present invention identifies salient semantic relation paths between two words using a knowledge base. For a group of semantic relations occurring in the knowledge base, the facility models with a mathematical function the relation between the frequency of occurrence of unique semantic relations and the number of unique semantic relations that occur at that frequency. This mathematical function has a vertex frequency identifying a transition point in the mathematical function. The facility then determines the level of salience of unique semantic relations of the group such that the level of salience of a unique semantic relation increases as its frequency of occurrence approaches the vertex frequency of the mathematical function. The facility is then able to determine the level of salience of a particular path between two words by combining the levels of salience determined for the semantic relations in the path.

Journal ArticleDOI
01 Jun 1997
TL;DR: The definition of a view integration algorithm enhanced by the use of linguistic knowledge is presented, which mainly consists of a semantic unification of views which are described using an extended entity-relationship model.
Abstract: This paper addresses the problem of view integration in a CASE tool environment aimed at the elaboration of a conceptual schema of an application. Previous integration tools were mainly based on syntax and structure comparisons. A new generation of intelligent tools is now arising, assuming that view integration algorithms must also capture the deep semantics of the objects represented in the views. Dealing with the semantics of the objects is now a realistic objective, thanks to the research results obtained in the natural language area. This paper presents the definition of a view integration algorithm enhanced by the use of linguistic knowledge. This algorithm mainly consists of a semantic unification of views, which are described using an extended entity-relationship model. It is combined with natural language techniques such as Fillmore's semantic cases and Sowa's conceptual graphs, supported by semantic dictionaries.

Book ChapterDOI
20 Nov 1997
TL;DR: A computationally feasible method for measuring the context-sensitive semantic distance between words by adaptive scaling of a semantic space, which successfully extracts the context of a text.
Abstract: This paper proposes a computationally feasible method for measuring the context-sensitive semantic distance between words. The distance is computed by adaptive scaling of a semantic space. In the semantic space, each word in the vocabulary V is represented by a multidimensional vector which is extracted from an English dictionary through principal component analysis. Given a word set C which specifies a context, each dimension of the semantic space is scaled up or down according to the distribution of C in the semantic space. In the space thus transformed, the distance between words in V becomes dependent on the context C. An evaluation through a word prediction task shows that the proposed measurement successfully extracts the context of a text.
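
A sketch of the scaling idea under stated assumptions: word vectors come from the PCA-derived space, and each dimension is reweighted from the spread of the context set C on that dimension. The inverse-variance rule used here is one plausible reading, not necessarily the paper's exact formula:

```python
import numpy as np

def context_scaled_distance(u, v, context_vectors, eps=1e-6):
    """Distance between word vectors u and v after rescaling each
    dimension by the spread of the context set C on that dimension:
    dimensions where C is tightly clustered are weighted up, diffuse
    dimensions are weighted down (illustrative inverse-variance rule)."""
    var = np.var(np.asarray(context_vectors), axis=0)
    scale = 1.0 / (var + eps)
    diff = np.asarray(u) - np.asarray(v)
    return float(np.sqrt(np.sum(scale * diff ** 2)))
```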

Proceedings ArticleDOI
02 Jul 1997
TL;DR: This paper shows how semantic structuring of documents can be efficiently defined using SGML syntax and has been presented as a relevant example for handling semantic structured documents.
Abstract: This paper presents a formal model for an explicit description of the semantic structure which implicitly exists with documents. This model relies on content meaning description of document elements. Meaning representation is distributed in the overall architecture model: it binds a semantic structure, a logical structure of documents and a domain model. The semantic structure contains two levels of description: meaning representation of information units. The description logic formalism is used to represent semantics of document elements and document rhetorical organisation. This paper shows how semantic structuring of documents can be efficiently defined using SGML syntax. Using this documents structuring norm, one can define two levels of description: generic semantic structure (vs. Document Type Definition) and specific semantic structure (vs. instantiated document) in order to define an abstract interface to the information stored in documents. The medical patient record has been presented as a relevant example for handling semantic structured documents.

Proceedings Article
23 Aug 1997
TL;DR: A network model of the mental lexicon and its formation is presented, similar to semantic networks first described by [Collins and Loftus, 1975], but is created automatically.
Abstract: This paper presents a network model of the mental lexicon and its formation. Models of word meaning typically postulate a network of nodes with connection strengths, or distances, that reflect semantic similarity, but seldom explain how the network is formed or how it could be represented in the brain. The model presented here is an attempt to address these questions. The network organizes semantically similar words into clusters when exposed to sequentially presented text. Lexical co-occurrence information is calculated and used to create a hierarchical semantic representation. The output is similar to semantic networks first described by [Collins and Loftus, 1975], but is created automatically.

Posted Content
TL;DR: The paper defends the notion that semantic tagging should be viewed as more than disambiguation between senses and develops a new type of semantic lexicon that supports underspecified semantic tagging through a design based on systematic polysemous classes and a class-based acquisition of lexical knowledge for specific domains.
Abstract: The paper defends the notion that semantic tagging should be viewed as more than disambiguation between senses. Instead, semantic tagging should be a first step in the interpretation process by assigning each lexical item a representation of all of its systematically related senses, from which further semantic processing steps can derive discourse dependent interpretations. This leads to a new type of semantic lexicon (CoreLex) that supports underspecified semantic tagging through a design based on systematic polysemous classes and a class-based acquisition of lexical knowledge for specific domains.

Book
01 Jan 1997
TL;DR: This dissertation presents techniques to automatically assign weights to network edges and determine semantic distance between arbitrary nodes, which allows word sense disambiguation during document and query indexing by minimizing mutual distance among word senses within a window of words with one or more senses each.
Abstract: It is well known that semantic networks are relevant to information retrieval. In particular, the notion of semantic distance can be applied to both the indexing and retrieval phases of processing. This dissertation presents techniques to automatically assign weights to network edges and determine semantic distance between arbitrary nodes. This allows word sense disambiguation during document and query indexing by minimizing mutual distance among word senses within a window of words with one or more senses each. A number of insights have led to improved semantic retrieval, where query/document relatedness is inferred, when compared to a baseline. In addition several performance enhancements have been discovered which reduce run-time by three orders of magnitude. Semantics has also been combined with more traditional lexical approaches such as the vector space model. Preliminary experiments using semantics have not yet produced significant improvement over strictly lexical approaches, but they have led to a method for obtaining both better recall and precision over the standard vector space approach.
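
A hedged sketch of the disambiguation-by-minimal-mutual-distance step, with hypothetical names: for each word in a window, one candidate sense is chosen so that the total pairwise semantic distance across the window is minimized (exhaustive search shown; the performance enhancements the dissertation mentions would prune this):

```python
from itertools import product, combinations

def disambiguate(window_senses, distance):
    """window_senses: list of lists, the candidate senses for each word
    in the window. Returns the sense assignment (one sense per word)
    minimizing the sum of pairwise semantic distances."""
    best, best_cost = None, float("inf")
    for assignment in product(*window_senses):
        cost = sum(distance(a, b) for a, b in combinations(assignment, 2))
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best
```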

Proceedings Article
01 Jan 1997
TL;DR: Experimental results show that word-sense analogy based on contexts of use compares favourably with classical word-sense similarity defined in terms of thesaural proximity.
Abstract: The paper describes an analogy-based measure of word-sense proximity grounded on distributional evidence in typical contexts, and illustrates a computational system which makes use of this measure for purposes of lexical disambiguation. Experimental results show that word-sense analogy based on contexts of use compares favourably with classical word-sense similarity defined in terms of thesaural proximity.

Journal Article
TL;DR: When tested on the MUC-4 terrorism domain, the approach is shown to outperform the most-frequent heuristic substantially and achieve accuracy comparable to human judges, and its performance also compares favourably with two supervised learning algorithms.
Abstract: This paper presents an approach which exploits general-purpose algorithms and resources for domain-specific semantic class disambiguation, thus facilitating the generalization of semantic patterns from word-based to class-based representations. Through the mapping of the domain-specific semantic hierarchy onto WordNet and the application of general-purpose word sense disambiguation and semantic distance metrics, the approach proposes a portable, wide-coverage method for disambiguating semantic classes. Unlike existing methods, the approach does not require annotated corpora. When tested on the MUC-4 terrorism domain, the approach is shown to outperform the most-frequent heuristic substantially and achieve accuracy comparable to human judges. Its performance also compares favourably with two supervised learning algorithms.

Journal ArticleDOI
01 Mar 1997
TL;DR: Two methods for evaluating automatically generated hypertext links are proposed: the first based on correlations between shortest paths in the hypertext structure and a semantic similarity measure, and a second based on measuring users' performance when using the hypertext.
Abstract: We present two methods for evaluating automatically generated hypertext links. The first method is based on correlations between shortest paths in the hypertext structure and a semantic similarity measure. Experimental results with the first method show the degree to which the hypertext conversion process approximates semantic similarity. The semantic measure is in turn only an approximation of a user's internal model of the corpus. Therefore we propose a second evaluation method based on measuring users' performance when using the hypertext. Finally, we discuss the respective advantages and disadvantages of computer and human evaluation.
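
A sketch of the first evaluation method under simple assumptions (the hypertext as an unweighted adjacency structure, a given semantic similarity function over node pairs): compute shortest-path lengths with BFS and correlate them with similarity, expecting a negative correlation when the conversion preserves semantics:

```python
from collections import deque
import numpy as np

def shortest_paths(graph, source):
    """BFS shortest-path lengths from `source` in an unweighted
    hypertext graph {node: [linked nodes]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        n = queue.popleft()
        for m in graph.get(n, []):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist

def evaluate(graph, similarity):
    """Pearson correlation between path length and semantic similarity
    over all reachable node pairs; a strong negative value suggests the
    hypertext structure approximates the semantic measure well."""
    lengths, sims = [], []
    for a in graph:
        for b, d in shortest_paths(graph, a).items():
            if a != b:
                lengths.append(d)
                sims.append(similarity(a, b))
    return float(np.corrcoef(lengths, sims)[0, 1])
```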