
Showing papers on "Semantic similarity published in 1999"


Journal ArticleDOI
TL;DR: In this paper, a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content is presented, and experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge counting approach.
Abstract: This article presents a measure of semantic similarity in an IS-A taxonomy based on the notion of shared information content. Experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge-counting approach. The article presents algorithms that take advantage of taxonomic similarity in resolving syntactic and semantic ambiguity, along with experimental results demonstrating their effectiveness.

2,190 citations
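To make the shared-information-content idea concrete, here is a minimal Python sketch, assuming a toy IS-A taxonomy and made-up concept probabilities (the paper derives such probabilities from corpus frequencies over WordNet): similarity is the information content of the most informative common subsumer.

```python
import math

# Toy IS-A taxonomy: child -> parent (single inheritance for simplicity).
# Both the taxonomy and the concept probabilities are illustrative values,
# not the paper's WordNet/corpus figures.
PARENT = {
    "dime": "coin", "nickel": "coin", "coin": "cash",
    "cash": "money", "credit_card": "money", "money": "entity",
}
P = {  # probability of encountering an instance of each concept
    "dime": 0.01, "nickel": 0.01, "coin": 0.05,
    "cash": 0.08, "credit_card": 0.02, "money": 0.15, "entity": 1.0,
}

def ancestors(c):
    """Return the concept plus all of its IS-A ancestors."""
    out = [c]
    while c in PARENT:
        c = PARENT[c]
        out.append(c)
    return out

def information_content(c):
    return -math.log(P[c])

def ic_similarity(c1, c2):
    """Similarity = information content of the most informative common subsumer."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)

print(ic_similarity("dime", "nickel"))       # share 'coin' -> high IC
print(ic_similarity("dime", "credit_card"))  # share only 'money' -> lower IC
```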


Journal ArticleDOI
TL;DR: In this article, Dutch-English bilinguals were tested with English words varying in their degree of orthographic, phonological, and semantic overlap with Dutch words, and the results were interpreted within an interactive activation model for monolingual and bilingual word recognition.

602 citations


Patent
01 Nov 1999
TL;DR: The meaning-based search described in this patent allows users to locate information that is close in meaning to the concepts they are searching for, by determining a semantic distance between the first and second meaning differentiators, where this distance represents their closeness in meaning.
Abstract: The present invention relies on the idea of a meaning-based search, allowing users to locate information that is close in meaning to the concepts they are searching. A semantic space is created by a lexicon of concepts and relations between concepts. A query is mapped to a first meaning differentiator, representing the location of the query in the semantic space. Similarly, each data element in the target data set being searched is mapped to a second meaning differentiator, representing the location of the data element in the semantic space. Searching is accomplished by determining a semantic distance between the first and second meaning differentiator, wherein this distance represents their closeness in meaning. Search results on the input query are presented where the target data elements that are closest in meaning, based on their determined semantic distance, are ranked higher.

288 citations
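A hedged sketch of the ranking step described above, assuming purely for illustration that meaning differentiators are plain coordinate vectors in the semantic space and that semantic distance is Euclidean; the patent's lexicon-derived representation of differentiators is not reproduced here.

```python
import math

# Hypothetical "meaning differentiators": each query or data element is mapped
# to a point in a shared semantic space (here, a plain coordinate vector).
def semantic_distance(diff_a, diff_b):
    """Smaller distance = closer in meaning."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(diff_a, diff_b)))

def meaning_based_search(query_diff, corpus):
    """Rank data elements by their semantic distance to the query differentiator."""
    scored = [(semantic_distance(query_diff, d), name) for name, d in corpus.items()]
    return sorted(scored)  # closest in meaning first

corpus = {
    "doc_about_dogs":   (0.90, 0.10, 0.00),
    "doc_about_wolves": (0.80, 0.20, 0.10),
    "doc_about_stocks": (0.00, 0.10, 0.95),
}
for dist, name in meaning_based_search((0.85, 0.15, 0.05), corpus):
    print(f"{name}: {dist:.3f}")
```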


Journal ArticleDOI
TL;DR: This paper provides rating norms for a set of symbols and icons selected from a wide variety of sources; the quantified characteristics include concreteness, complexity, meaningfulness, familiarity, and semantic distance.
Abstract: This paper provides rating norms for a set of symbols and icons selected from a wide variety of sources. These ratings enable the effects of symbol characteristics on user performance to be systematically investigated. The symbol characteristics that have been quantified are considered to be of central relevance to symbol usability research and include concreteness, complexity, meaningfulness, familiarity, and semantic distance. The interrelationships between these dimensions are examined and the importance of using normative ratings for experimental research is discussed.

207 citations


Proceedings Article
01 Jan 1999
TL;DR: A new composite similarity metric is presented that combines information from multiple linguistic indicators to measure semantic distance between pairs of small textual units and is evaluated against standard information retrieval techniques, establishing that the new method is more effective in identifying closely related textual units.
Abstract: We present a new composite similarity metric that combines information from multiple linguistic indicators to measure semantic distance between pairs of small textual units. Several potential features are investigated and an optimal combination is selected via machine learning. We discuss a more restrictive definition of similarity than traditional, document-level and information retrieval-oriented, notions of similarity, and motivate it by showing its relevance to the multi-document text summarization problem. Results from our system are evaluated against standard information retrieval techniques, establishing that the new method is more effective in identifying closely related textual units.

205 citations
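A minimal sketch of the composite-metric idea: several linguistic indicators are computed for a pair of short textual units and combined into one score. The two indicators and the fixed weights below are illustrative placeholders; the paper selects the combination via machine learning.

```python
# Hypothetical linguistic indicators for a pair of short textual units.
def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def length_ratio(a, b):
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb, 1)

# In the paper the combination is learned; these weights are made-up
# placeholders standing in for a trained model.
WEIGHTS = {"word_overlap": 0.7, "length_ratio": 0.3}

def composite_similarity(a, b):
    features = {"word_overlap": word_overlap(a, b), "length_ratio": length_ratio(a, b)}
    return sum(WEIGHTS[name] * value for name, value in features.items())

print(composite_similarity("the court ruled on the appeal",
                           "the appeal was ruled on by the court"))
```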


Patent
14 Jun 1999
TL;DR: A method and apparatus are provided for determining when electronic documents stored in a large collection are similar to one another, by combining multiple measures of similarity information into a single generalized similarity value.
Abstract: A method and apparatus are provided for determining when electronic documents stored in a large collection of documents are similar to one another. A plurality of similarity information is derived from the documents. The similarity information may be based on a variety of factors, including hyperlinks in the documents, text similarity, user click-through information, similarity in the titles of the documents or their location identifiers, and patterns of user viewing. The similarity information is fed to a combination function that synthesizes the various measures of similarity information into combined similarity information. Using the combined similarity information, an objective function is iteratively maximized in order to yield a generalized similarity value that expresses the similarity of particular pairs of documents. In an embodiment, the generalized similarity value is used to determine the proper category, among a taxonomy of categories in an index, cache or search system, into which certain documents belong.

184 citations


Book ChapterDOI
01 Jan 1999
TL;DR: In this article, a method for automatic sense disambiguation of nouns appearing within sets of related nouns is presented, based on the kind of data one finds in on-line thesauri, or as the output of distributional clustering algorithms.
Abstract: Word groupings useful for language processing tasks are increasingly available, as thesauri appear on-line, and as distributional word clustering techniques improve. However, for many tasks, one is interested in relationships among word senses, not words. This paper presents a method for automatic sense disambiguation of nouns appearing within sets of related nouns — the kind of data one finds in on-line thesauri, or as the output of distributional clustering algorithms. Disambiguation is performed with respect to WordNet senses, which are fairly fine-grained; however, the method also permits the assignment of higher-level WordNet categories rather than sense labels. The method is illustrated primarily by example, though results of a more rigorous evaluation are also presented.

183 citations
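The core intuition, that a noun in a group of related nouns should take the sense that coheres best with the rest of the group, can be sketched as follows. The toy sense inventory and the tag-overlap relatedness function are stand-ins for WordNet senses and a taxonomy-based similarity measure; this is not the paper's actual procedure.

```python
# Toy sense inventory: each word maps to candidate senses, each sense to a
# set of "semantic tags" standing in for its position in a taxonomy.
SENSES = {
    "bass":   {"bass.fish": {"fish", "animal"}, "bass.music": {"music", "sound"}},
    "pike":   {"pike.fish": {"fish", "animal"}, "pike.weapon": {"weapon", "artifact"}},
    "salmon": {"salmon.fish": {"fish", "animal"}},
}

def relatedness(tags_a, tags_b):
    """Crude stand-in for a taxonomy-based similarity between two senses."""
    return len(tags_a & tags_b)

def disambiguate_group(words):
    """Pick, for each word, the sense that coheres best with the rest of the group."""
    chosen = {}
    for w in words:
        best_sense, best_score = None, -1
        for sense, tags in SENSES[w].items():
            # Sum the best relatedness achievable against every other word.
            score = sum(
                max(relatedness(tags, other_tags)
                    for other_tags in SENSES[v].values())
                for v in words if v != w
            )
            if score > best_score:
                best_sense, best_score = sense, score
        chosen[w] = best_sense
    return chosen

print(disambiguate_group(["bass", "pike", "salmon"]))
# -> the fish senses win, since they cohere with the rest of the group
```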


Patent
26 Oct 1999
TL;DR: The user's intention is extracted by referring to semantic pattern information held in a semantic pattern information storing means, a semantic expression is generated, and an interaction controlling means produces a proper response from it.
Abstract: PROBLEM TO BE SOLVED: To provide a natural language interactive system that generates a proper response by estimating the user's intention, without asking the user about missing information even when the input sentence is incomplete. SOLUTION: In an analyzing means 3, the user's intention is extracted by referring to semantic pattern information stored in a semantic pattern information storing means 33, and a semantic expression is generated. The semantic expression resulting from the analysis is returned to an interaction controlling means 2. The interaction controlling means 2 generates a proper response based on the semantic expression.

180 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A dual probability model is constructed for the Latent Semantic Indexing using the cosine similarity measure, establishing a statistical framework for LSI and leading to a statistical criterion for the optimal semantic dimensions.
Abstract: A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statistical significance of latent semantic vectors as measured by the likelihood of the model. This leads to a statistical criterion for the optimal semantic dimensions, answering a critical open question in LSI with practical importance. Thus the model establishes a statistical framework for LSI. Ambiguities related to statistical modeling of LSI are clarified.

152 citations
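For readers unfamiliar with LSI, the sketch below shows the underlying computation the model reasons about: a truncated SVD of a term-document matrix and cosine similarities between documents in the reduced space. The matrix and the choice k = 2 are arbitrary illustrations; choosing the number of dimensions in a principled way is exactly what the paper's likelihood criterion addresses.

```python
import numpy as np

# Tiny term-document matrix (rows = terms, columns = documents); the counts
# are illustrative, not data from the paper.
X = np.array([
    [2, 1, 0, 0],   # "car"
    [1, 2, 0, 0],   # "auto"
    [0, 0, 3, 1],   # "stock"
    [0, 0, 1, 2],   # "market"
], dtype=float)

# Latent semantic indexing: keep the top-k singular triplets.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # documents in the k-dim latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Document-document similarities in the reduced space.
for i in range(len(doc_vectors)):
    for j in range(i + 1, len(doc_vectors)):
        print(f"doc{i} vs doc{j}: {cosine(doc_vectors[i], doc_vectors[j]):+.2f}")
```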


Journal ArticleDOI
TL;DR: Results clearly contradict the widespread belief that semantic similarity hinders the short-term recall of order information and suggest that long-term knowledge is accessed to support the interpretation of degraded phonological traces.
Abstract: Four experiments investigated the disruptive effect of semantic similarity on short-term ordered recall. Experiments 1 and 2 contrasted immediate serial recall performance for lists of semantically similar items, drawn from the same semantic category, with performance for lists that contained items from different categories. Experiments 1 and 2 showed the usual similarity advantage for item information recall, but, contrary to expectations, there was no similarity disadvantage for the recall of order information, even when the level of item recall was controlled. Experiments 3 and 4 replicate and extend these findings by using an order reconstruction task or a limited word pool strategy, both of which yield alternate measures of order retention. These findings clearly contradict the widespread belief stating that semantic similarity hinders the short-term recall of order information. Results are discussed in the light of a retrieval-based account where the effects of semantic similarity reflect the proces...

125 citations


Journal ArticleDOI
TL;DR: A novel method for automatic hypertext generation that is based on a technique called lexical chaining, a method for discovering sequences of related words in a text, and attempts to take into account the effects of synonymy and polysemy.
Abstract: Most current automatic hypertext generation systems rely on term repetition to calculate the relatedness of two documents. There are well-recognized problems with such approaches, most notably, a vulnerability to the effects of synonymy (many words for the same concept) and polysemy (many concepts for the same word). We propose a novel method for automatic hypertext generation that is based on a technique called lexical chaining, a method for discovering sequences of related words in a text. This method uses a more general notion of document relatedness, and attempts to take into account the effects of synonymy and polysemy. We also present the results of an empirical study designed to test this method in the context of a question answering task from a database of newspaper articles.
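A greatly simplified sketch of lexical chaining: walk through the candidate words of a text and greedily attach each one to an existing chain containing a related word, starting a new chain otherwise. The hand-written relatedness table stands in for WordNet-style relations; the paper's method additionally handles synonymy and polysemy, which this sketch does not.

```python
# Toy relatedness table standing in for WordNet-style relations.
RELATED = {
    frozenset({"car", "automobile"}),
    frozenset({"car", "wheel"}),
    frozenset({"bank", "loan"}),
    frozenset({"loan", "interest"}),
}

def related(a, b):
    return a == b or frozenset({a, b}) in RELATED

def lexical_chains(words):
    """Greedily append each word to the first chain containing a related word."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

print(lexical_chains(["car", "automobile", "bank", "wheel", "loan", "interest"]))
# -> [['car', 'automobile', 'wheel'], ['bank', 'loan', 'interest']]
```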

Journal ArticleDOI
TL;DR: Patterns of facilitation in three lexical decision studies reveal several properties of morphologically complex words that influence processing, including semantic transparency but not base morpheme position constrains morphological processing.

Book ChapterDOI
TL;DR: A framework for ontology-based geographic data set integration, an ontology being a collection of shared concepts, is explored, formalized in the Prolog language, illustrated with a fictitious example, and tested on a practical example.
Abstract: In order to develop a system to propagate updates we investigate the semantic and spatial relationships between independently produced geographic data sets of the same region (data set integration). The goal of this system is to reduce operator intervention in update operations between corresponding (semantically similar) geographic object instances. Crucial for this reduction is certainty about the semantic similarity of different object representations. In this paper we explore a framework for ontology-based geographic data set integration, an ontology being a collection of shared concepts. Components of this formal approach are an ontology for topographic mapping (a domain ontology), an ontology for every geographic data set involved (the application ontologies), and abstraction rules (or capture criteria). Abstraction rules define at the class level the relationships between domain ontology and application ontology. Using these relationships, it is possible to locate semantic similarity at the object instance level with methods from computational geometry (like overlay operations). The components of the framework are formalized in the Prolog language, illustrated with a fictitious example, and tested on a practical example.

Patent
31 Aug 1999
TL;DR: A semantic analysis method receives a syntactic tree generated from a natural language sentence, classifies each node as a verb phrase class (having a verb as its head) or a non-verb phrase class (having mainly a noun as its head), analyzes deep-case or modificative relations accordingly, and generates a semantic structure of linked semantic frames.
Abstract: A method for performing a semantic analysis process on a computer system including a storage unit and an interface includes the steps of: receiving a syntactic tree generated from a natural language sentence text; determining whether an analysis object, which is one of nodes of the syntactic tree, is a verb phrase class which has a verb as a head or a non-verb phrase class which has mainly a noun as the head on the basis of subdivided type information of a phrase of the node with reference to first data stored in the storage unit; analyzing a relation between a verb in the analysis object and a deep case of the verb when the analysis object is the verb phrase class; analyzing a modificative relation in the analysis object when the analysis object is the non-verb phrase class; generating a semantic structure of the natural language sentence text wherein the semantic structure comprises semantic frames corresponding to nodes of the syntactic tree, at least two semantic frames of the semantic frames being linked by a head relation or a deep case relation or a modificative relation, and storing the semantic structure in the storage unit or displaying the semantic structure on a display which is connectable to the computer system via the interface.
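The classification step described in the abstract, deciding per node between a verb phrase class (deep-case analysis) and a non-verb phrase class (modificative analysis), might look roughly like the sketch below. The dictionary-based tree format, phrase-class labels, and frame fields are hypothetical illustrations, not the patent's data structures.

```python
# Minimal sketch of the node-classification step, with hypothetical node and
# frame layouts.
def analyze(node):
    """Turn a syntactic-tree node into a semantic frame."""
    children = [analyze(c) for c in node.get("children", [])]
    if node["phrase_class"] == "verb_phrase":
        # Relate the verb (head) to the deep cases of its arguments.
        frame = {"head": node["head"], "relation": "deep_case", "slots": children}
    else:
        # Non-verb phrases: record modificative relations around the noun head.
        frame = {"head": node["head"], "relation": "modificative", "slots": children}
    return frame

sentence_tree = {
    "phrase_class": "verb_phrase", "head": "eat",
    "children": [
        {"phrase_class": "noun_phrase", "head": "cat",
         "children": [{"phrase_class": "adj_phrase", "head": "black", "children": []}]},
        {"phrase_class": "noun_phrase", "head": "fish", "children": []},
    ],
}
print(analyze(sentence_tree))
```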

Journal ArticleDOI
TL;DR: Experiment 1 showed that strong free associates of the to-be-remembered items disrupted serial recall to a greater extent than words that were dissimilar to the to-be-remembered items, and Experiment 2 showed the same pattern in free recall.
Abstract: Irrelevant speech disrupts immediate recall of a short sequence of items. Salame and Baddeley (1982) found a very small and nonsignificant increase in the irrelevant speech effect when the speech comprised items semantically identical to the to-be-remembered items, leading subsequent researchers to conclude that semantic similarity plays no role in the irrelevant speech effect. Experiment 1 showed that strong free associates of the to-be-remembered items disrupted serial recall to a greater extent than words that were dissimilar to the to-be-remembered items. Experiment 2 showed the same pattern of disruption in a free recall task. Theoretical implications of these findings are discussed.

Book ChapterDOI
TL;DR: This paper presents an innovative approach to semantic similarity assessment by combining the advantages of two different strategies: feature-matching process and semantic distance calculation.
Abstract: The assessment of semantic similarity among objects is a basic requirement for semantic interoperability. This paper presents an innovative approach to semantic similarity assessment by combining the advantages of two different strategies: feature-matching process and semantic distance calculation. The model involves a knowledge base of spatial concepts that consists of semantic relations (is-a and part-whole) and distinguishing features (functions, parts, and attributes). By taking into consideration cognitive properties of similarity assessments, this model represents a cognitively plausible and computationally achievable method for measuring the degree of interoperability.
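A hedged sketch of the feature-matching half of such a model: a Tversky-style ratio of common to distinguishing features, computed separately over parts, functions, and attributes and then averaged. The concept entries and the alpha weighting are illustrative; the paper additionally incorporates semantic-distance information from is-a and part-whole relations, which is omitted here.

```python
# Illustrative concept entries, not the paper's knowledge base.
CONCEPTS = {
    "stadium": {"parts": {"field", "seats"}, "functions": {"sport"}, "attributes": {"outdoor"}},
    "arena":   {"parts": {"court", "seats"}, "functions": {"sport"}, "attributes": {"indoor"}},
    "theater": {"parts": {"stage", "seats"}, "functions": {"entertainment"}, "attributes": {"indoor"}},
}

def feature_match(a, b, alpha=0.5):
    """Tversky ratio: common features weighed against distinguishing features."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + (1 - alpha) * len(b - a)
    return common / denom if denom else 0.0

def semantic_similarity(c1, c2):
    s1, s2 = CONCEPTS[c1], CONCEPTS[c2]
    scores = [feature_match(s1[t], s2[t]) for t in ("parts", "functions", "attributes")]
    return sum(scores) / len(scores)

print(f"stadium~arena:   {semantic_similarity('stadium', 'arena'):.2f}")
print(f"stadium~theater: {semantic_similarity('stadium', 'theater'):.2f}")
```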

Journal ArticleDOI
01 Jun 1999
TL;DR: A method for indexing market basket data efficiently for similarity search is discussed, which is likely to be very useful in applications which utilize the similarity in customer buying behavior in order to make peer recommendations.
Abstract: In recent years, many data mining methods have been proposed for finding useful and structured information from market basket data. The association rule model was recently proposed in order to discover useful patterns and dependencies in such data. This paper discusses a method for indexing market basket data efficiently for similarity search. The technique is likely to be very useful in applications which utilize the similarity in customer buying behavior in order to make peer recommendations. We propose an index called the signature table, which is very flexible in supporting a wide range of similarity functions. The construction of the index structure is independent of the similarity function, which can be specified at query time. The resulting similarity search algorithm shows excellent scalability with increasing memory availability and database size.
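A rough sketch of the filter-then-verify idea behind such an index: each basket gets a compact signature, candidates are filtered by signature overlap, and the similarity function supplied at query time is applied only to the surviving candidates. The bucket-hashing scheme below is a simplification of the paper's signature table, not its actual construction.

```python
NUM_BUCKETS = 8

def bucket(item):
    """Deterministic toy assignment of items to signature buckets."""
    return sum(ord(ch) for ch in item) % NUM_BUCKETS

def signature(basket):
    """Bitmask with one bit per bucket of items present in the basket."""
    sig = 0
    for item in basket:
        sig |= 1 << bucket(item)
    return sig

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

class SignatureIndex:
    def __init__(self, baskets):
        self.entries = [(signature(b), b) for b in baskets]

    def search(self, query, similarity=jaccard, top_k=3):
        qsig = signature(query)
        # Filter: keep only baskets whose signatures share at least one bucket.
        candidates = [b for sig, b in self.entries if sig & qsig]
        # Verify with the similarity function supplied at query time.
        return sorted(candidates, key=lambda b: similarity(query, b), reverse=True)[:top_k]

index = SignatureIndex([{"bread", "milk"}, {"beer", "chips"}, {"bread", "butter", "milk"}])
print(index.search({"milk", "bread"}))
```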

Book ChapterDOI
01 Jan 1999
TL;DR: Interoperability between different data sources and software systems is difficult to achieve due to the complexity of geodata.
Abstract: Interoperability between different data sources and software systems is difficult to achieve due to the complexity of geodata. This complexity is caused by various factors, such as the underlying digital formats imposed by a particular software or acquisition method and the complexity of higher level descriptions, conventions, and rules imposed by individuals, organizations, and disciplines using the software (Buehler and McKee 1996).

Journal ArticleDOI
TL;DR: Automatic methods can be used to construct a semantic lexicon from existing UMLS sources that can aid natural language processing programs that analyze medical narrative, provided that lexemes with multiple semantic types are kept to a minimum.

Proceedings Article
01 Jan 1999
TL;DR: This paper describes the development of a prototype system to answer questions by selecting sentences from the documents in which the answers occur, and demonstrates the viability of using structural information about the sentences in a document to answer questions.
Abstract: This paper describes the development of a prototype system to answer questions by selecting sentences from the documents in which the answers occur. After parsing each sentence in these documents, databases are constructed by extracting relational triples from the parse output. The triples consist of discourse entities, semantic relations, and the governing words to which the entities are bound in the sentence. Database triples are also generated for the questions. Question-answering consists of matching the question database records with the records for the documents. The prototype system was developed specifically to respond to the TREC-8 Q&A track, with an existing parser and some existing capability for analyzing parse output. The system was designed to investigate the viability of using structural information about the sentences in a document to answer questions. The CL Research system achieved an overall score of 0.281 (i.e., on average, providing a sentence containing a correct answer as the fourth selection). The score demonstrates the viability of the approach. Post-hoc analysis suggests that this score understates the performance of the prototype and estimates that a more accurate score is approximately 0.482. This analysis also suggests several further improvements and the potential for investigating other avenues that make use of semantic networks and computational lexicology.
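The matching step can be illustrated with a small sketch: both document sentences and the question are reduced to (entity, relation, governing word) triples, and sentences are ranked by how many question triples they cover. The triples here are hand-written examples, not output of the CL Research parser.

```python
# Hand-written triples standing in for parser-derived database records.
DOC_TRIPLES = {
    "sentence 1": {("Armstrong", "subject", "walked"), ("moon", "object", "walked")},
    "sentence 2": {("Apollo", "subject", "launched"), ("1969", "time", "launched")},
}

def answer(question_triples, doc_triples, top_n=1):
    """Rank sentences by the number of question triples they match."""
    scored = [
        (len(question_triples & triples), sent)
        for sent, triples in doc_triples.items()
    ]
    return sorted(scored, reverse=True)[:top_n]

question = {("moon", "object", "walked")}   # "Who walked on the moon?"
print(answer(question, DOC_TRIPLES))        # -> sentence 1 scores highest
```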

Journal ArticleDOI
TL;DR: Two views of the structure of semantic space are reviewed and an experiment is described that attempts to adjudicate between these two views, providing evidence that supports the notion that the semantic lexicon is arranged more by association than by categories or features.

Journal ArticleDOI
TL;DR: A corpus-based bootstrapping algorithm is presented that assists users in creating domain-specific semantic lexicons quickly and was used to generate a semantic lexicon for eleven semantic classes associated with the MUC-4 terrorism domain.
Abstract: Many applications need a lexicon that represents semantic information but acquiring lexical information is time consuming. We present a corpus-based bootstrapping algorithm that assists users in creating domain-specific semantic lexicons quickly. Our algorithm uses a representative text corpus for the domain and a small set of ‘seed words’ that belong to a semantic class of interest. The algorithm hypothesizes new words that are also likely to belong to the semantic class because they occur in the same contexts as the seed words. The best hypotheses are added to the seed word list dynamically, and the process iterates in a bootstrapping fashion. When the bootstrapping process halts, a ranked list of hypothesized category words is presented to a user for review. We used this algorithm to generate a semantic lexicon for eleven semantic classes associated with the MUC-4 terrorism domain.
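A compact sketch of the bootstrapping loop, assuming a toy corpus and a crude context definition (adjacent words); the actual algorithm uses richer context statistics and ends with a human review of the ranked hypotheses.

```python
from collections import Counter

# Toy corpus; sentences and the context definition are illustrative only.
CORPUS = [
    "the guerrillas attacked the village",
    "the rebels attacked the town",
    "the terrorists bombed the embassy",
    "the farmers planted the field",
]

def contexts(word, corpus):
    """Contexts = neighbouring words in each sentence containing the word."""
    ctx = set()
    for sentence in corpus:
        tokens = sentence.split()
        for i, t in enumerate(tokens):
            if t == word:
                ctx.update(tokens[max(0, i - 1):i + 2])
                ctx.discard(word)
    return ctx

def bootstrap(seeds, corpus, iterations=1, per_round=1):
    lexicon = set(seeds)
    for _ in range(iterations):
        seed_ctx = set().union(*(contexts(s, corpus) for s in lexicon))
        vocabulary = {t for s in corpus for t in s.split()} - lexicon
        scores = Counter()
        for word in vocabulary:
            # Score candidates by how many contexts they share with the seeds.
            scores[word] = len(contexts(word, corpus) & seed_ctx)
        for word, _ in scores.most_common(per_round):
            lexicon.add(word)   # promote the best hypothesis, then iterate
    return lexicon

print(bootstrap({"guerrillas"}, CORPUS))   # -> {'guerrillas', 'rebels'}
```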

Proceedings Article
01 Jan 1999
TL;DR: The UMLS contains over 30 semantic types and most of the semantic relations that are essential for representing the underlying genomic knowledge, but some of the concepts critical to the genomic domain were found to be missing.
Abstract: Genomics research has a significant impact on the understanding and treatment of human hereditary diseases, and biomedical literature concerning the genome project is becoming more and more important for clinicians. The Unified Medical Language System (UMLS) is designed to facilitate the retrieval and integration of information from multiple machine-readable biomedical information resources. This paper describes our efforts to integrate concepts important to genomics research with the UMLS semantic network. We found that the UMLS contains over 30 semantic types and most of the semantic relations that are essential for representing the underlying genomic knowledge. In addition, we observed that the organization of the network was appropriate for representing the hierarchical organization of the concepts. Because some of the concepts critical to the genomic domain were found to be missing, we propose to extend the network by adding six new semantic types and sixteen new semantic relations.

Proceedings ArticleDOI
TL;DR: With this approach, a user's keyword query returns relevant images through semantic template association, even when those images have not been annotated with keywords.
Abstract: Content-based multimedia information retrieval is a hot topic for researchers in many domains, but traditional feature-vector based retrieval methods cannot provide retrieval on the semantic level. Integrated with our image retrieval system, we propose a new approach to generate semantic templates automatically in the process of relevance feedback, and construct a network of semantic templates with the support of WordNet™ in the retrieval process, which helps the user retrieve on the semantic level. With our approach, a user's keyword query returns relevant images with the help of semantic template association, even when those images are not annotated with keywords. This paper introduces the approach in detail and presents an experimental result at the end. Keywords: Multimedia, Information retrieval, Relevance feedback, Semantic template


Journal ArticleDOI
TL;DR: This work contends that while ontologies are useful in semantic reconciliation, they do not guarantee correct classification of semantic conflicts, nor do they provide the capability to handle evolving semantics or a mechanism to support a dynamic reconciliation process.
Abstract: Shared ontologies describe concepts and relationships to resolve semantic conflicts amongst users accessing multiple autonomous and heterogeneous information sources. We contend that while ontologies are useful in semantic reconciliation, they do not guarantee correct classification of semantic conflicts, nor do they provide the capability to handle evolving semantics or a mechanism to support a dynamic reconciliation process. Their limitations are illustrated through a conceptual analysis of several prominent examples used in heterogeneous database systems and in natural language processing. We view semantic reconciliation as a nonmonotonic query-dependent process that requires flexible interpretation of query context, and as a mechanism to coordinate knowledge elicitation while constructing the query context. We propose a system that is based on these characteristics, namely the SCOPES (Semantic Coordinator Over Parallel Exploration Spaces) system. SCOPES takes advantage of ontologies to constrain exploration of a remote database during the incremental discovery and refinement of the context within which a query can be answered. It uses an Assumption-based Truth Maintenance System (ATMS) to manage the multiple plausible contexts which coexist while the semantic reconciliation process is unfolding, and the Dempster-Shafer (DS) theory of belief to model the likelihood of these plausible contexts.

Proceedings Article
18 Jul 1999
TL;DR: Experimental results are presented demonstrating WOLFIE's ability to learn useful lexicons for a database interface in four different natural languages.
Abstract: This paper describes a system, WOLFIE (WOrd Learning From Interpreted Examples), that acquires a semantic lexicon from a corpus of sentences paired with semantic representations. The lexicon learned consists of words paired with meaning representations. WOLFIE is part of an integrated system that learns to parse novel sentences into semantic representations, such as logical database queries. Experimental results are presented demonstrating WOLFIE's ability to learn useful lexicons for a database interface in four different natural languages. The lexicons learned by WOLFIE are compared to those acquired by a similar system developed by Siskind (1996).

Journal ArticleDOI
TL;DR: The results suggest hemisphere asymmetries in accessing lexical knowledge of Chinese characters, in the form of an LVF advantage effect and a significant phonological similarity effect in the RVF.
Abstract: The lateralisation of lexical knowledge of Chinese characters is investigated in this study. Three experiments were conducted in which stimuli were presented unilaterally to a visual field for recognition tests. The orthographic similarity of two alternative items for choice in Experiment 1 was manipulated, and the results showed an LVF advantage effect for legal characters in the visually similar condition and a more prominent LVF than RVF character-superiority effect. The phonological similarity of two alternative items for choice was manipulated in Experiment 2. The results showed a prominent RVF advantage effect and a significant phonological similarity effect in the RVF. In Experiment 3, the semantic similarity was manipulated, and the semantic similarity effect was observed in the RVF. These results suggest hemisphere asymmetries in accessing lexical knowledge of Chinese characters.

Proceedings Article
31 Jul 1999
TL;DR: Two abstraction mechanisms for streamlining the process of semantic interpretation are introduced: Configurational descriptions of dependency graphs increase the linguistic generality of interpretation schemata, while interfacing them to lexical and conceptual inheritance hierarchies reduces the amount and complexity of semantic specifications.
Abstract: We introduce two abstraction mechanisms for streamlining the process of semantic interpretation. Configurational descriptions of dependency graphs increase the linguistic generality of interpretation schemata, while interfacing them to lexical and conceptual inheritance hierarchies reduces the amount and complexity of semantic specifications.

Journal ArticleDOI
TL;DR: In conclusion, both visual similarity and semantic proximity contributed to the identification errors of DAT patients.
Abstract: Identification deficits in dementia of the Alzheimer Type (DAT) often target specific classes of objects, sparing others. Using line drawings to uncover the etiology of such category-specific deficits may be untenable because the underlying shape primitives used to differentiate one line drawing from another are unspecified, and object form is yoked to object meaning. We used computer generated stimuli with empirically specifiable properties in a paradigm that decoupled form and meaning. In Experiment 1 visually similar or distinct blobs were paired with semantically close or disparate labels, and participants attempted to learn these pairings. By having the same blobs stand for semantically close and disparate objects and looking at shape–label confusion rates for each type of set, form and meaning were independently assessed. Overall, visual similarity of shapes and semantic similarity of labels each exacerbated object confusions. For controls, the effects were small but significant. For DAT patients more substantial visual and semantic proximity effects were obtained. Experiment 2 demonstrated that even small changes in semantic proximity could effect significant changes in DAT task performance. Labeling 3 blobs with “lion,” “tiger,” and “leopard” significantly elevated DAT confusion rates compared to exactly the same blobs labeled with “lion,” “tiger,” and “zebra.” In conclusion both visual similarity and semantic proximity contributed to the identification errors of DAT patients. (JINS, 1999, 5, 330–345.)