Topic

Semantic similarity

About: Semantic similarity is a research topic. Over the lifetime of the topic, 14,605 publications have been published, receiving 364,659 citations. The topic is also known as: semantic relatedness.


Papers
Journal ArticleDOI
TL;DR: AquaLog is a portable question-answering system that takes queries expressed in natural language and an ontology as input, and returns answers drawn from one or more knowledge bases (KBs); it is portable in the sense that the configuration time required to customize the system for a particular ontology is negligible.

224 citations

Proceedings ArticleDOI
29 Oct 2012
TL;DR: A novel notion of semantic relatedness between two entities, each represented as a set of weighted (multi-word) keyphrases with consideration of partially overlapping phrases, is developed; it improves the quality of prior link-based models and eliminates the need for explicit interlinkage between entities.
Abstract: Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
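For intuition, the sketch below shows how min-hash signatures can approximate the overlap between two entities' keyphrase sets, in the spirit of the approximation techniques mentioned in the abstract. It is not the authors' exact formulation: the entity names and keyphrases are illustrative, and the keyphrase weighting and partial-phrase overlap of the paper's measure are omitted.

```python
# Illustrative sketch (not the paper's exact measure): estimating the overlap
# between two entities' keyphrase sets with min-hash signatures.
import hashlib


def minhash_signature(phrases, num_hashes=64):
    """Build a min-hash signature for a set of keyphrases.

    Each of the num_hashes rows keeps the smallest salted hash value seen,
    so two signatures agree on a row with probability roughly equal to the
    Jaccard similarity of the underlying sets.
    """
    signature = []
    for i in range(num_hashes):
        best = min(
            int(hashlib.md5(f"{i}:{p}".encode("utf-8")).hexdigest(), 16)
            for p in phrases
        )
        signature.append(best)
    return signature


def estimated_relatedness(sig_a, sig_b):
    """Fraction of matching rows approximates the Jaccard overlap."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical keyphrase sets for two entities.
    entity_a = {"consumer electronics", "iphone", "steve jobs", "cupertino"}
    entity_b = {"consumer electronics", "windows", "bill gates", "redmond"}
    sig_a = minhash_signature(entity_a)
    sig_b = minhash_signature(entity_b)
    print(f"estimated relatedness: {estimated_relatedness(sig_a, sig_b):.2f}")
```

Comparing fixed-length signatures instead of full keyphrase sets is what makes relatedness computations scale to large entity collections, which is the role locality-sensitive hashing plays in the paper.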

224 citations

Proceedings ArticleDOI
10 May 2005
TL;DR: An information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology is defined, and an experimental study shows that this measure improves significantly on the traditional taxonomy-based approach.
Abstract: Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata --- namely topical directories --- to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.
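For reference, here is a minimal sketch of the traditional taxonomy-based, information-content approach (in the spirit of Lin's measure) that the paper's graph-based measure generalizes and improves on. The toy topic tree and topic probabilities are assumptions for illustration only, and the non-hierarchical ontology links the paper exploits are not modeled.

```python
# Minimal sketch of the taxonomy-based information-content baseline that
# graph-based measures generalize. The toy taxonomy and probabilities are
# illustrative only.
import math

# child -> parent links of a tiny topic tree rooted at "Top".
parent = {
    "Science": "Top",
    "Biology": "Science",
    "Physics": "Science",
    "Genetics": "Biology",
    "Zoology": "Biology",
}

# p(t): probability that a page falls under topic t (or any of its subtopics).
prob = {
    "Top": 1.0,
    "Science": 0.5,
    "Biology": 0.25,
    "Physics": 0.25,
    "Genetics": 0.1,
    "Zoology": 0.15,
}


def ancestors(topic):
    """Return the topic together with all of its ancestors up to the root."""
    chain = [topic]
    while topic in parent:
        topic = parent[topic]
        chain.append(topic)
    return chain


def taxonomy_similarity(a, b):
    """sim(a, b) = 2 * IC(lowest common ancestor) / (IC(a) + IC(b)),
    where IC(t) = -log p(t)."""
    anc_a = ancestors(a)
    anc_b = set(ancestors(b))
    lca = next(t for t in anc_a if t in anc_b)  # deepest shared ancestor
    ic = lambda t: -math.log(prob[t])
    denom = ic(a) + ic(b)
    return 2 * ic(lca) / denom if denom > 0 else 1.0


if __name__ == "__main__":
    print(taxonomy_similarity("Genetics", "Zoology"))  # siblings under Biology
    print(taxonomy_similarity("Genetics", "Physics"))  # share only Science
```

Topics that share a deep, low-probability ancestor come out as more similar than topics that meet only near the root, which is the core intuition the paper extends from trees to arbitrary ontology graphs.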

223 citations

Journal ArticleDOI
TL;DR: In simulations, DevLex develops topographically organized representations for linguistic categories over time, models lexical confusion as a function of word density and semantic similarity, and shows age-of-acquisition effects in the course of learning a growing lexicon.

223 citations

Book ChapterDOI
01 Jan 2002
TL;DR: This paper illustrates that the large-scale structure of this representation has statistical properties that correspond well with those of semantic networks produced by humans, and traces this to the fidelity with which it reproduces the natural statistics of language.
Abstract: A probabilistic approach to semantic representation. Thomas L. Griffiths & Mark Steyvers {gruffydd,msteyver}@psych.stanford.edu, Department of Psychology, Stanford University, Stanford, CA 94305-2130 USA.

Semantic networks produced from human data have statistical properties that cannot be easily captured by spatial representations. We explore a probabilistic approach to semantic representation that explicitly models the probability with which words occur in different contexts, and hence captures the probabilistic relationships between words. We show that this representation has statistical properties consistent with the large-scale structure of semantic networks constructed by humans, and trace the origins of these properties.

Contemporary accounts of semantic representation suggest that we should consider words to be either points in a high-dimensional space (e.g. Landauer & Dumais, 1997) or interconnected nodes in a semantic network (e.g. Collins & Loftus, 1975). Both of these ways of representing semantic information provide important insights, but also have shortcomings. Spatial approaches illustrate the importance of dimensionality reduction and employ simple algorithms, but are limited by Euclidean geometry. Semantic networks are less constrained, but their graphical structure lacks a clear interpretation. In this paper, we view the function of associative semantic memory to be efficient prediction of the concepts likely to occur in a given context. We take a probabilistic approach to this problem, modeling documents as expressing information related to a small number of topics (cf. Blei, Ng, & Jordan, 2002). The topics of a language can then be learned from the words that occur in different documents. We illustrate that the large-scale structure of this representation has statistical properties that correspond well with those of semantic networks produced by humans, and trace this to the fidelity with which it reproduces the natural statistics of language.

Approaches to semantic representation. Spatial approaches: Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) is a procedure for finding a high-dimensional spatial representation for words. LSA uses singular value decomposition to factorize a word-document co-occurrence matrix. An approximation to the original matrix can be obtained by using fewer singular values than its rank. One component of this approximation is a matrix that gives each word a location in a high-dimensional space. Distances in this space are predictive in many tasks that require the use of semantic information. Performance is best for approximations that use fewer singular values than the rank of the matrix, illustrating that reducing the dimensionality of the representation can reduce the effects of statistical noise and increase efficiency. While the methods behind LSA were novel in scale and subject, the suggestion that similarity relates to distance in psychological space has a long history (Shepard, 1957). Critics have argued that human similarity judgments do not satisfy the properties of Euclidean distances, such as symmetry or the triangle inequality. Tversky and Hutchinson (1986) pointed out that Euclidean geometry places strong constraints on the number of points to which a particular point can be the nearest neighbor, and that many sets of stimuli violate these constraints. The number of nearest neighbors in similarity judgments has an analogue in semantic representation. Nelson, McEvoy and Schreiber (1999) had people perform a word association task in which they named an associated word in response to a set of target words. Steyvers and Tenenbaum (submitted) noted that the number of unique words produced for each target follows a power law distribution: if k is the number of words, P(k) ∝ k^(-γ). For reasons similar to those of Tversky and Hutchinson, it is difficult to produce a power law distribution by thresholding cosine or distance in Euclidean space. This is shown in Figure 1. Power law distributions appear linear in log-log coordinates. LSA produces curved log-log plots, more consistent with an exponential distribution.

Semantic networks: Semantic networks were proposed by Collins and Quillian (1969) as a means of storing semantic knowledge. The original networks were inheritance hierarchies, but Collins and Loftus (1975) generalized the notion to cover arbitrary graphical structures. The interpretation of this graphical structure is vague, being based on connecting nodes that "activate" one another. Steyvers and Tenenbaum (submitted) constructed a semantic network from the word association norms of Nelson et al.
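As a concrete illustration of the LSA procedure described in the abstract, the sketch below factorizes a tiny word-document count matrix with SVD and keeps fewer singular values than its rank to obtain low-dimensional word vectors. The toy corpus and the choice of k are assumptions for illustration only.

```python
# Minimal LSA sketch: SVD of a small word-document count matrix, truncated
# to k singular values, with cosine similarity between the word vectors.
import numpy as np

documents = [
    "dog barks at the cat",
    "cat chases the dog",
    "stocks fell on the market",
    "market rally lifts stocks",
]

# Build the word-document count matrix.
vocab = sorted({w for d in documents for w in d.split()})
counts = np.zeros((len(vocab), len(documents)))
for j, doc in enumerate(documents):
    for word in doc.split():
        counts[vocab.index(word), j] += 1

# Truncated SVD: keep k singular values (k < rank) as the reduced space.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
word_vectors = U[:, :k] * s[:k]  # each row locates a word in k-dim space


def cosine(u, v):
    """Cosine similarity between two word vectors, the distance LSA relies on."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


i, j, m = vocab.index("dog"), vocab.index("cat"), vocab.index("market")
print("dog ~ cat   :", round(cosine(word_vectors[i], word_vectors[j]), 2))
print("dog ~ market:", round(cosine(word_vectors[i], word_vectors[m]), 2))
```

Words that co-occur in the same documents ("dog" and "cat") land close together in the reduced space, while words from unrelated documents ("dog" and "market") do not; it is this kind of spatial representation whose limits the abstract contrasts with the probabilistic topic approach.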

222 citations


Network Information
Related Topics (5)
Web page: 50.3K papers, 975.1K citations, 84% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 84% related
Unsupervised learning: 22.7K papers, 1M citations, 83% related
Feature vector: 48.8K papers, 954.4K citations, 83% related
Web service: 57.6K papers, 989K citations, 82% related
Performance Metrics
No. of papers in the topic in previous years:
2023: 202
2022: 522
2021: 641
2020: 837
2019: 866
2018: 787