
Showing papers on "Semantic similarity published in 2007"


Proceedings Article
06 Jan 2007
TL;DR: This work proposes Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia, yielding substantial improvements in the correlation of computed relatedness scores with human judgments.
Abstract: Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r = 0.56 to 0.75 for individual words and from r = 0.60 to 0.72 for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
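The comparison step ESA describes, cosine similarity over weighted concept vectors, can be sketched as follows. The concept names and weights are invented for illustration and are not taken from the paper:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical ESA-style vectors: Wikipedia concept -> TF-IDF-like weight.
cat = {"Felidae": 0.9, "Pet": 0.6, "Mammal": 0.4}
dog = {"Canidae": 0.9, "Pet": 0.7, "Mammal": 0.5}
car = {"Automobile": 1.0, "Engine": 0.8}

print(cosine(cat, dog))  # related animals share concepts, so this is high
print(cosine(cat, car))  # no shared concepts, so this is zero
```

The sparse dict representation mirrors the fact that any one text activates only a tiny subset of the Wikipedia-derived concept space.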

2,285 citations


Journal ArticleDOI
TL;DR: A novel method to encode a GO term's semantics into a numeric value by aggregating the semantic contributions of its ancestor terms in the GO graph is proposed and, in turn, an algorithm is designed to measure the semantic similarity of GO terms.
Abstract: Motivation: Although controlled biochemical or biological vocabularies, such as Gene Ontology (GO) (http://www.geneontology.org), address the need for consistent descriptions of genes in different data sources, there is still no effective method to determine the functional similarities of genes based on gene annotation information from heterogeneous data sources. Results: To address this critical need, we proposed a novel method to encode a GO term's semantics (biological meanings) into a numeric value by aggregating the semantic contributions of its ancestor terms (including this specific term) in the GO graph and, in turn, designed an algorithm to measure the semantic similarity of GO terms. Based on the semantic similarities of GO terms used for gene annotation, we designed a new algorithm to measure the functional similarity of genes. The results of using our algorithm to measure the functional similarities of genes in pathways retrieved from the Saccharomyces Genome Database (SGD), and the outcomes of clustering these genes based on the similarity values obtained by our algorithm are shown to be consistent with human perspectives. Furthermore, we developed a set of online tools for gene similarity measurement and knowledge discovery. Availability: The online tools are available at: http://bioinformatics.clemson.edu/G-SESAME Contact: jzwang@cs.clemson.edu Supplementary information: http://bioinformatics.clemson.edu/Publication/Supplement/gsp.htm
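The aggregation idea, propagating a term's semantic contribution up to its ancestors and comparing two terms through their shared ancestors, can be sketched as below. The toy DAG, the term names, and the fixed edge weight of 0.8 are illustrative assumptions, not the paper's actual ontology or parameters:

```python
def s_values(term, parents, w=0.8):
    """Semantic contribution of each ancestor of `term`.
    `parents` maps a term to its direct parents; `w` is an assumed
    uniform edge weight that decays the contribution per edge."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        t = frontier.pop()
        for p in parents.get(t, []):
            contrib = w * s[t]
            if contrib > s.get(p, 0.0):  # keep the best-scoring path
                s[p] = contrib
                frontier.append(p)
    return s

def go_sim(a, b, parents, w=0.8):
    """Similarity of two terms via the contributions of shared ancestors."""
    sa, sb = s_values(a, parents, w), s_values(b, parents, w)
    common = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in common) / (sum(sa.values()) + sum(sb.values()))

# Toy GO-like DAG: child -> parents (not real GO identifiers).
parents = {"t4": ["t2", "t3"], "t5": ["t3"], "t2": ["t1"], "t3": ["t1"]}
print(go_sim("t4", "t5", parents))  # siblings sharing t3 and the root t1
```

A term compared with itself scores 1.0, and similarity decreases as the shared ancestry becomes more distant, which is the behavior the paper's measure is built around.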

1,067 citations


Journal ArticleDOI
TL;DR: The supervised formulation is shown to achieve higher accuracy than various previously published methods at a fraction of their computational cost and to be fairly robust to parameter tuning.
Abstract: A probabilistic formulation for semantic image annotation and retrieval is proposed. Annotation and retrieval are posed as classification problems where each class is defined as the group of database images labeled with a common semantic label. It is shown that, by establishing this one-to-one correspondence between semantic labels and semantic classes, a minimum probability of error annotation and retrieval are feasible with algorithms that are 1) conceptually simple, 2) computationally efficient, and 3) do not require prior semantic segmentation of training images. In particular, images are represented as bags of localized feature vectors, a mixture density is estimated for each image, and the mixtures associated with all images annotated with a common semantic label are pooled into a density estimate for the corresponding semantic class. This pooling is justified by a multiple instance learning argument and performed efficiently with a hierarchical extension of expectation-maximization. The benefits of the supervised formulation over the more complex, and currently popular, joint modeling of semantic label and visual feature distributions are illustrated through theoretical arguments and extensive experiments. The supervised formulation is shown to achieve higher accuracy than various previously published methods at a fraction of their computational cost. Finally, the proposed method is shown to be fairly robust to parameter tuning.

962 citations


Journal ArticleDOI
TL;DR: This article presents a novel framework for constructing semantic spaces that takes syntactic relations into account, and introduces a formalization for this class of models, which allows linguistic knowledge to guide the construction process.
Abstract: Traditionally, vector-based semantic space models use word co-occurrence counts from large corpora to represent lexical meaning. In this article we present a novel framework for constructing semantic spaces that takes syntactic relations into account. We introduce a formalization for this class of models, which allows linguistic knowledge to guide the construction process. We evaluate our framework on a range of tasks relevant for cognitive science and natural language processing: semantic priming, synonymy detection, and word sense disambiguation. In all cases, our framework obtains results that are comparable or superior to the state of the art.

696 citations


Proceedings ArticleDOI
01 Jan 2007
TL;DR: A robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities and a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets is proposed.
Abstract: Semantic similarity measures play important roles in information retrieval and Natural Language Processing. Previous work in semantic web-related applications such as community mining, relation extraction, automatic meta data extraction have used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. We propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. The proposed method exploits page counts and text snippets returned by a Web search engine. We define various similarity scores for two given words P and Q, using the page counts for the queries P, Q and P AND Q. Moreover, we propose a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets. These different similarity scores are integrated using support vector machines, to leverage a robust semantic similarity measure. Experimental results on the Miller-Charles benchmark dataset show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a community mining task, and in an entity disambiguation task, thereby verifying the capability of the proposed measure to capture semantic similarity using web content.
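The page-count side of such measures can be sketched as follows. The assumed index size N, the noise threshold, and the example counts are invented values; a real system would obtain the counts for P, Q, and P AND Q from a search engine:

```python
from math import log2

N = 10**10  # assumed number of pages in the search index (illustrative)

def web_jaccard(hp, hq, hpq, c=5):
    """Jaccard-style score from page counts H(P), H(Q), H(P AND Q).
    Counts below a small threshold c are treated as noise."""
    return 0.0 if hpq < c else hpq / (hp + hq - hpq)

def web_pmi(hp, hq, hpq, c=5):
    """Pointwise mutual information estimated from page counts."""
    if hpq < c:
        return 0.0
    return log2((hpq / N) / ((hp / N) * (hq / N)))

# Hypothetical counts for "car", "automobile", and "car AND automobile".
hp, hq, hpq = 2_000_000, 500_000, 300_000
print(web_jaccard(hp, hq, hpq))
print(web_pmi(hp, hq, hpq))
```

In the paper's setup, several such scores (plus snippet-pattern features) become the input features of an SVM rather than being used in isolation.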

601 citations


Journal ArticleDOI
TL;DR: There is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.

572 citations


Proceedings Article
22 Jul 2007
TL;DR: A large-scale taxonomy containing a large number of subsumption, i.e. isa, relations is derived by labeling the semantic relations between categories in Wikipedia using methods based on connectivity in the network and lexico-syntactic matching.
Abstract: We take the category system in Wikipedia as a conceptual network. We label the semantic relations between categories using methods based on connectivity in the network and lexicosyntactic matching. As a result we are able to derive a large scale taxonomy containing a large amount of subsumption, i.e. isa, relations. We evaluate the quality of the created resource by comparing it with ResearchCyc, one of the largest manually annotated ontologies, as well as computing semantic similarity between words in benchmarking datasets.

502 citations


Proceedings ArticleDOI
07 Nov 2007
TL;DR: This paper proposes a data preprocessing model to add semantic information to trajectories in order to facilitate trajectory data analysis in different application domains and shows that the query complexity for the semantic analysis of trajectories will be significantly reduced.
Abstract: The collection of moving object data is becoming more and more common, and therefore there is an increasing need for the efficient analysis and knowledge extraction of these data in different application domains. Trajectory data are normally available as sample points, and do not carry semantic information, which is of fundamental importance for the comprehension of these data. Therefore, the analysis of trajectory data becomes expensive from a computational point of view and complex from a user's perspective. Enriching trajectories with semantic geographical information may simplify queries, analysis, and mining of moving object data. In this paper we propose a data preprocessing model to add semantic information to trajectories in order to facilitate trajectory data analysis in different application domains. The model is generic enough to represent the important parts of trajectories that are relevant to the application, not being restricted to one specific application. We present an algorithm to compute the important parts and show that the query complexity for the semantic analysis of trajectories will be significantly reduced with the proposed model.
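The "important parts" the model identifies are typically stop-like episodes within a trajectory of raw sample points. A much simplified stop detector along those lines is sketched below; the distance and duration thresholds and the point format are assumptions, not the paper's actual algorithm:

```python
def detect_stops(points, max_dist=50.0, min_duration=300):
    """Label (t, x, y) trajectory samples as stop episodes when the object
    stays within `max_dist` units of an anchor point for at least
    `min_duration` seconds. A simplified sketch of the stops-and-moves
    idea, not the paper's exact preprocessing model."""
    stops, i, n = [], 0, len(points)
    while i < n:
        t0, x0, y0 = points[i]
        j = i
        # Extend the episode while samples stay near the anchor point.
        while j + 1 < n and ((points[j + 1][1] - x0) ** 2 +
                             (points[j + 1][2] - y0) ** 2) ** 0.5 <= max_dist:
            j += 1
        if points[j][0] - t0 >= min_duration:
            stops.append((t0, points[j][0]))  # (start time, end time)
            i = j + 1
        else:
            i += 1
    return stops

# Hypothetical track: a long dwell near the origin, then a short visit.
track = [(0, 0, 0), (200, 5, 5), (400, 8, 3), (600, 500, 500), (700, 505, 498)]
print(detect_stops(track))  # only the first dwell is long enough
```

Once stops are labeled with semantic geographic information (e.g., "hotel", "office"), queries can be posed against a handful of episodes instead of thousands of raw points, which is the source of the query-complexity reduction the paper reports.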

434 citations


Journal ArticleDOI
TL;DR: A novel image representation is presented that renders it possible to access natural scenes by local semantic description by using a perceptually plausible distance measure that leads to a high correlation between the human and the automatically obtained typicality ranking.
Abstract: In this paper, we present a novel image representation that renders it possible to access natural scenes by local semantic description. Our work is motivated by the continuing effort in content-based image retrieval to extract and to model the semantic content of images. The basic idea of the semantic modeling is to classify local image regions into semantic concept classes such as water, rocks, or foliage. Images are represented through the frequency of occurrence of these local concepts. Through extensive experiments, we demonstrate that the image representation is well suited for modeling the semantic content of heterogeneous scene categories, and thus for categorization and retrieval. The image representation also allows us to rank natural scenes according to their semantic similarity relative to certain scene categories. Based on human ranking data, we learn a perceptually plausible distance measure that leads to a high correlation between the human and the automatically obtained typicality ranking. This result is especially valuable for content-based image retrieval where the goal is to present retrieval results in descending semantic similarity from the query.
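The representation described, an image as the frequency of occurrence of local concept labels compared with a distance measure, can be sketched like this. The concept list, the toy region labels, and the plain L1 distance are illustrative stand-ins for the paper's classifiers and learned perceptual measure:

```python
from collections import Counter

CONCEPTS = ["water", "rocks", "foliage", "sky", "sand"]

def concept_histogram(region_labels):
    """Normalized frequency of local semantic concepts in an image."""
    counts = Counter(region_labels)
    total = sum(counts.values()) or 1
    return [counts.get(c, 0) / total for c in CONCEPTS]

def l1_distance(h1, h2):
    """A simple stand-in for the learned perceptual distance."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Hypothetical region labelings for three scenes.
beach = concept_histogram(["water"] * 5 + ["sand"] * 3 + ["sky"] * 2)
coast = concept_histogram(["water"] * 4 + ["rocks"] * 3 + ["sky"] * 3)
forest = concept_histogram(["foliage"] * 8 + ["sky"] * 2)
print(l1_distance(beach, coast) < l1_distance(beach, forest))  # True
```

Ranking scenes by such distances to a category prototype is what enables the typicality ordering the paper evaluates against human judgments.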

433 citations


Proceedings Article
01 Jun 2007
TL;DR: This work introduces a general framework for answer extraction which exploits semantic role annotations in the FrameNet paradigm and views semantic role assignment as an optimization problem in a bipartite graph and answer extraction as an instance of graph matching.
Abstract: Shallow semantic parsing, the automatic identification and labeling of sentential constituents, has recently received much attention. Our work examines whether semantic role information is beneficial to question answering. We introduce a general framework for answer extraction which exploits semantic role annotations in the FrameNet paradigm. We view semantic role assignment as an optimization problem in a bipartite graph and answer extraction as an instance of graph matching. Experimental results on the TREC datasets demonstrate improvements over state-of-the-art models.

429 citations


Book ChapterDOI
02 Apr 2007
TL;DR: This work formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log, and provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
Abstract: Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
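A purely lexical baseline of the kind examined here fits in a few lines, and it also illustrates the sparseness problem the abstract highlights: near-synonymous queries that share no terms score zero. The example queries are invented:

```python
def jaccard(q1, q2):
    """Purely lexical similarity between two short queries:
    word-set overlap divided by word-set union."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard("cheap flights to boston", "cheap boston flights"))  # 0.75
print(jaccard("ny hotels", "new york lodging")) # 0.0 despite same intent
```

The zero score on the second pair is exactly why the paper studies richer representations (stemming, language modeling) for query-query similarity.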

Proceedings ArticleDOI
28 Oct 2007
TL;DR: A novel local probabilistic graphical model method that can scale to large graphs to estimate the joint co-occurrence probability of two nodes and is demonstrated to be effective both in isolation and in combination with other topological and semantic features for predicting co-authorship collaborations on real datasets.
Abstract: One of the core tasks in social network analysis is to predict the formation of links (i.e. various types of relationships) over time. Previous research has generally represented the social network in the form of a graph and has leveraged topological and semantic measures of similarity between two nodes to evaluate the probability of link formation. Here we introduce a novel local probabilistic graphical model method that can scale to large graphs to estimate the joint co-occurrence probability of two nodes. Such a probability measure captures information that is not captured by either topological measures or measures of semantic similarity, which are the dominant measures used for link prediction. We demonstrate the effectiveness of the co-occurrence probability feature by using it both in isolation and in combination with other topological and semantic features for predicting co-authorship collaborations on real datasets.

Patent
04 May 2007
TL;DR: In this article, a semantic network is described comprising a plurality of lemmas that are grouped into synsets representing concepts, each of the synsets having a corresponding sense, and links connected between the synsets that represent semantic relations between the synsets.
Abstract: A method and system for automatically extracting relations between concepts included in electronic text is described. Aspects of the exemplary embodiment include a semantic network comprising a plurality of lemmas that are grouped into synsets representing concepts, each of the synsets having a corresponding sense, and a plurality of links connected between the synsets that represent semantic relations between the synsets. The semantic network further includes semantic information comprising at least one of: 1) an expanded set of semantic relation links representing: hierarchical semantic relations, synset/corpus semantic relations, verb/subject semantic relations, verb/direct object semantic relations, and fine grain/coarse grain semantic relationships; 2) a hierarchical category tree having a plurality of categories, wherein each of the categories contains a group of one or more synsets and a set of attributes, wherein the set of attributes of each of the categories are associated with each of the synsets in the respective category; and 3) a plurality of domains, wherein one or more of the domains is associated with at least a portion of the synsets, wherein each domain adds information regarding a linguistic context in which the corresponding synset is used in a language. A linguistic engine uses the semantic network to perform semantic disambiguation on the electronic text using one or more of the expanded set of semantic relation links, the hierarchical category tree, and the plurality of domains to assign a respective one of the senses to elements in the electronic text independently from contextual reference.

Book ChapterDOI
TL;DR: Semantic matching as discussed by the authors is an operator that takes two graph-like structures (e.g., classifications, XML schemas) and produces a mapping between the nodes of these graphs that correspond semantically to each other.
Abstract: We view match as an operator that takes two graph-like structures (e.g., classifications, XML schemas) and produces a mapping between the nodes of these graphs that correspond semantically to each other. Semantic matching is based on two ideas: (i) we discover mappings by computing semantic relations (e.g., equivalence, more general); (ii) we determine semantic relations by analyzing the meaning (concepts, not labels) which is codified in the elements and the structures of schemas. In this paper we present basic and optimized algorithms for semantic matching, and we discuss their implementation within the S-Match system. We evaluate S-Match against three state of the art matching systems, thereby justifying empirically the strength of our approach.

Proceedings ArticleDOI
17 Sep 2007
TL;DR: The results indicate that the right combination of similarity metrics and graph centrality algorithms can lead to a performance competing with the state-of-the-art in unsupervised word sense disambiguation, as measured on standard data sets.
Abstract: This paper describes an unsupervised graph-based method for word sense disambiguation, and presents comparative evaluations using several measures of word semantic similarity and several algorithms for graph centrality. The results indicate that the right combination of similarity metrics and graph centrality algorithms can lead to a performance competing with the state-of-the-art in unsupervised word sense disambiguation, as measured on standard data sets.
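One simple instantiation of the similarity-plus-centrality recipe is weighted-degree (indegree) centrality over a sense graph: each candidate sense is scored by its summed similarity to the candidate senses of the other words in the context. The sense labels and similarity scores below are hypothetical, not drawn from WordNet:

```python
def disambiguate(candidates, sim):
    """Pick one sense per word by weighted-degree centrality: a sense is
    scored by its summed similarity to the candidate senses of all other
    words in the context. (The paper also evaluates PageRank and other
    centrality algorithms on the same graph.)"""
    chosen = {}
    for word, senses in candidates.items():
        def score(s):
            return sum(sim(s, t)
                       for other, ss in candidates.items() if other != word
                       for t in ss)
        chosen[word] = max(senses, key=score)
    return chosen

# Hypothetical sense inventory and pairwise similarity scores.
SIM = {frozenset(p): s for p, s in [
    (("bank#finance", "money#1"), 0.8),
    (("bank#river", "money#1"), 0.1),
    (("bank#finance", "deposit#1"), 0.7),
    (("bank#river", "deposit#1"), 0.1),
]}

def sim(a, b):
    return SIM.get(frozenset((a, b)), 0.0)

candidates = {"bank": ["bank#finance", "bank#river"],
              "money": ["money#1"],
              "deposit": ["deposit#1"]}
print(disambiguate(candidates, sim)["bank"])  # bank#finance
```

Swapping the `sim` function or the centrality score is exactly the combination space the paper's evaluation explores.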

Journal ArticleDOI
TL;DR: Existing relatedness measures perform better using Wikipedia than a baseline given by Google counts, and it is shown that Wikipedia outperforms WordNet on some datasets.
Abstract: Wikipedia provides a semantic network for computing semantic relatedness in a more structured fashion than a search engine and with more coverage than WordNet. We present experiments on using Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmarking datasets. Existing relatedness measures perform better using Wikipedia than a baseline given by Google counts, and we show that Wikipedia outperforms WordNet on some datasets. We also address the question whether and how Wikipedia can be integrated into NLP applications as a knowledge base. Including Wikipedia improves the performance of a machine learning based coreference resolution system, indicating that it represents a valuable resource for NLP applications. Finally, we show that our method can be easily used for languages other than English by computing semantic relatedness for a German dataset.

Journal ArticleDOI
01 Apr 2007
TL;DR: GraSM, a novel method that uses all the information in the graph structure of the Gene Ontology, instead of considering it as a hierarchical tree, gives a consistently higher family similarity correlation on all aspects of GO than the original semantic similarity measures.
Abstract: Many bioinformatics applications would benefit from comparing proteins based on their biological role rather than their sequence. This paper adds two new contributions. First, a study of the correlation between Gene Ontology (GO) terms and family similarity demonstrates that protein families constitute an appropriate baseline for validating GO similarity. Secondly, we introduce GraSM, a novel method that uses all the information in the graph structure of the Gene Ontology, instead of considering it as a hierarchical tree. GraSM gives a consistently higher family similarity correlation on all aspects of GO than the original semantic similarity measures.

Journal ArticleDOI
TL;DR: AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input, and returns answers drawn from one or more knowledge bases (KBs) because the configuration time required to customize the system for a particular ontology is negligible.

Proceedings Article
01 Jun 2007
TL;DR: A new model of lexical semantic relatedness is proposed that incorporates information from every explicit or implicit path connecting the two words in the entire graph and is scored by a novel divergence measure, ZKL, that outperforms existing measures on certain classes of distributions.
Abstract: Many systems for tasks such as question answering, multi-document summarization, and information retrieval need robust numerical measures of lexical relatedness. Standard thesaurus-based measures of word pair similarity are based on only a single path between those words in the thesaurus graph. By contrast, we propose a new model of lexical semantic relatedness that incorporates information from every explicit or implicit path connecting the two words in the entire graph. Our model uses a random walk over nodes and edges derived from WordNet links and corpus statistics. We treat the graph as a Markov chain and compute a word-specific stationary distribution via a generalized PageRank algorithm. Semantic relatedness of a word pair is scored by a novel divergence measure, ZKL, that outperforms existing measures on certain classes of distributions. In our experiments, the resulting relatedness measure is the WordNet-based measure most highly correlated with human similarity judgments by rank ordering at ρ = .90.
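The random-walk construction can be sketched with a small personalized PageRank. The toy graph is invented, and the paper's ZKL divergence is replaced here by a plain distribution overlap for brevity:

```python
def personalized_pagerank(adj, seed, d=0.85, iters=60):
    """Stationary distribution of a random walk over an undirected graph
    (adjacency dict) that teleports back to `seed` with probability 1-d."""
    pr = {n: 0.0 for n in adj}
    pr[seed] = 1.0
    for _ in range(iters):
        new = {n: (1 - d if n == seed else 0.0) for n in adj}
        for n in adj:
            share = d * pr[n] / len(adj[n])
            for m in adj[n]:
                new[m] += share
        pr = new
    return pr

def overlap(p, q):
    """Stand-in comparison of two word-specific distributions; the paper
    scores pairs with a novel divergence (ZKL) instead."""
    return sum(p[n] * q[n] for n in p)

# Toy WordNet-like link graph (invented nodes and edges).
adj = {"cat": ["feline", "pet"], "dog": ["canine", "pet"],
       "feline": ["cat"], "canine": ["dog"], "pet": ["cat", "dog"],
       "stock": ["market"], "market": ["stock"]}
p_cat, p_dog, p_stock = (personalized_pagerank(adj, w)
                         for w in ("cat", "dog", "stock"))
print(overlap(p_cat, p_dog) > overlap(p_cat, p_stock))  # True
```

Because the walk from "cat" reaches "dog" through "pet" but never reaches the disconnected "stock" cluster, the aggregated-paths idea falls out naturally: every connecting path contributes probability mass.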

Proceedings ArticleDOI
23 Jun 2007
TL;DR: An evaluation task designed to provide a framework for comparing different approaches to classifying semantic relations between nominals in a sentence as part of SemEval, the 4th edition of the semantic evaluation event previously known as SensEval.
Abstract: The NLP community has shown a renewed interest in deeper semantic analyses, among them automatic recognition of relations between pairs of words in a text. We present an evaluation task designed to provide a framework for comparing different approaches to classifying semantic relations between nominals in a sentence. This is part of SemEval, the 4th edition of the semantic evaluation event previously known as SensEval. We define the task, describe the training/test data and their creation, list the participating systems and discuss their results. There were 14 teams who submitted 15 systems.


Journal ArticleDOI
01 Jul 2007
TL;DR: A novel approach, information theory-based semantic similarity (ITSS), is proposed to automatically predict molecular functions of genes based on existing GO annotations; it is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed.
Abstract: Motivation: Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated to fewer than 10 genes). Results: We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using a 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97%, recall 77%) comparable to other machine learning algorithms when compared in similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11 000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions than more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43–58%) can be achieved for the human GO Annotation file dated 2003. Availability: The program is available on request. 
The 97 732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset and other supplementary information are available at http://phenos.bsd.uchicago.edu/ITSS/ Contact: Lussier@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: A flow-based modularization algorithm is presented to efficiently identify overlapping modules in weighted interaction networks, and the semantic similarity and semantic interactivity of interacting pairs are shown to be positively correlated with functional co-occurrence.
Abstract: The systematic analysis of protein-protein interactions can enable a better understanding of cellular organization, processes and functions. Functional modules can be identified from the protein interaction networks derived from experimental data sets. However, these analyses are challenging because of the presence of unreliable interactions and the complex connectivity of the network. The integration of protein-protein interactions with the data from other sources can be leveraged for improving the effectiveness of functional module detection algorithms. We have developed novel metrics, called semantic similarity and semantic interactivity, which use Gene Ontology (GO) annotations to measure the reliability of protein-protein interactions. The protein interaction networks can be converted into a weighted graph representation by assigning the reliability values to each interaction as a weight. We presented a flow-based modularization algorithm to efficiently identify overlapping modules in the weighted interaction networks. The experimental results show that the semantic similarity and semantic interactivity of interacting pairs were positively correlated with functional co-occurrence. The effectiveness of the algorithm for identifying modules was evaluated using functional categories from the MIPS database. We demonstrated that our algorithm had higher accuracy compared to other competing approaches. The integration of protein interaction networks with GO annotation data and the capability of detecting overlapping modules substantially improve the accuracy of module identification.

Proceedings Article
01 Dec 2007
TL;DR: A new, simple model for the automatic induction of selectional preferences, using corpus-based semantic similarity metrics, focuses on the task of semantic role labeling and shows lower error rates than both Resnik's WordNet-based model and the EM-based clustering model.
Abstract: We propose a new, simple model for the automatic induction of selectional preferences, using corpus-based semantic similarity metrics. Focusing on the task of semantic role labeling, we compute selectional preferences for semantic roles. In evaluations the similarity-based model shows lower error rates than both Resnik’s WordNet-based model and the EM-based clustering model, but has coverage problems.
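The similarity-based induction idea can be sketched as a frequency-weighted similarity sum: a candidate headword is preferred for a role to the extent that it resembles the headwords already seen in that role. The similarity table, the role data, and the words below are invented for illustration:

```python
def selectional_preference(word, seen_heads, sim):
    """Similarity-based selectional preference: score a candidate headword
    by its frequency-weighted similarity to headwords already observed in
    the semantic role (a sketch of the corpus-similarity approach)."""
    total = sum(seen_heads.values())
    return sum(freq / total * sim(word, seen)
               for seen, freq in seen_heads.items())

# Hypothetical corpus-based similarity scores between word pairs.
SIM = {("knife", "scalpel"): 0.8, ("knife", "spoon"): 0.5,
       ("idea", "scalpel"): 0.05, ("idea", "spoon"): 0.05}

def sim(a, b):
    return 1.0 if a == b else SIM.get((a, b), SIM.get((b, a), 0.0))

# Hypothetical headwords observed in the Instrument role of "cut".
seen = {"scalpel": 3, "spoon": 1}
print(selectional_preference("knife", seen, sim) >
      selectional_preference("idea", seen, sim))  # True
```

Unlike WordNet-based models, such a scorer needs only a corpus-derived similarity metric, which is the coverage advantage the abstract alludes to, though it can also fail when no similar seen headword exists.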

Proceedings Article
18 Mar 2007
TL;DR: A graph-theoretic analysis of the category graph is performed, and it is shown that it is a scale-free, small world graph like other well-known lexical semantic networks.
Abstract: In this paper, we discuss two graphs in Wikipedia: (i) the article graph, and (ii) the category graph. We perform a graph-theoretic analysis of the category graph, and show that it is a scale-free, small world graph like other well-known lexical semantic networks. We substantiate our findings by transferring semantic relatedness algorithms defined on WordNet to the Wikipedia category graph. To assess the usefulness of the category graph as an NLP resource, we analyze its coverage and the performance of the transferred semantic relatedness algorithms.

Journal ArticleDOI
TL;DR: Unlike patients with classical "semantic access impairment", the authors' semantically impaired stroke patients showed significant test-retest consistency, indicating that their difficulties did not result from an unpredictable failure of semantic access--instead, their deficits were interpreted as arising from failures of semantic control.

Journal ArticleDOI
TL;DR: Three ERP studies contrasting the processing of antonym relations with that of related and unrelated word pairs revealed that the P300 effect is not only a function of stimulus constraints and experimental task, but that it is also crucially influenced by individual processing strategies used to achieve successful task performance.
Abstract: We report a series of event-related potential experiments designed to dissociate the functionally distinct processes involved in the comprehension of highly restricted lexical-semantic relations (antonyms). We sought to differentiate between influences of semantic relatedness (which are independent of the experimental setting) and processes related to predictability (which differ as a function of the experimental environment). To this end, we conducted three ERP studies contrasting the processing of antonym relations (black-white) with that of related (black-yellow) and unrelated (black-nice) word pairs. Whereas the lexical-semantic manipulation was kept constant across experiments, the experimental environment and the task demands varied: Experiment 1 presented the word pairs in a sentence context of the form The opposite of X is Y and used a sensicality judgment. Experiment 2 used a word pair presentation mode and a lexical decision task. Experiment 3 also examined word pairs, but with an antonymy judgment task. All three experiments revealed a graded N400 response (unrelated > related > antonyms), thus supporting the assumption that semantic associations are processed automatically. In addition, the experiments revealed that, in highly constrained task environments, the N400 gradation occurs simultaneously with a P300 effect for the antonym condition, thus leading to the superficial impression of an extremely “reduced” N400 for antonym pairs. Comparisons across experiments and participant groups revealed that the P300 effect is not only a function of stimulus constraints (i.e., sentence context) and experimental task, but that it is also crucially influenced by individual processing strategies used to achieve successful task performance.

Proceedings ArticleDOI
19 Aug 2007
TL;DR: This paper seeks to establish a mathematical formulation of this problem and suggests a method for generation of several terms from a seed keyword by using a web based kernel function to establish semantic similarity between terms.
Abstract: An important problem in search engine advertising is keyword generation. In the past, advertisers have preferred to bid for keywords that tend to have high search volumes and hence are more expensive. An alternate strategy involves bidding for several related but low volume, inexpensive terms that generate the same amount of traffic cumulatively but are much cheaper. This paper seeks to establish a mathematical formulation of this problem and suggests a method for generation of several terms from a seed keyword. This approach uses a web based kernel function to establish semantic similarity between terms. The similarity graph is then traversed to generate keywords that are related but cheaper.
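The traversal step can be sketched as a breadth-first walk over a term-similarity graph from the seed keyword. The graph, terms, similarity scores, and threshold below are a hypothetical stub; in the paper the edge weights would come from a web-based kernel function:

```python
from collections import deque

def expand_keywords(seed, neighbors, max_terms=10, min_sim=0.3):
    """Breadth-first traversal of a term-similarity graph from a seed
    keyword, collecting related candidate bid terms. `neighbors(term)`
    yields (term, similarity) pairs; edges below `min_sim` are pruned."""
    seen, out, q = {seed}, [], deque([seed])
    while q and len(out) < max_terms:
        for term, s in neighbors(q.popleft()):
            if s >= min_sim and term not in seen:
                seen.add(term)
                out.append(term)
                q.append(term)
    return out

# Hypothetical similarity graph around the seed "shoes".
GRAPH = {"shoes": [("sneakers", 0.8), ("footwear", 0.7), ("socks", 0.2)],
         "sneakers": [("running shoes", 0.6), ("trainers", 0.5)],
         "footwear": [("boots", 0.4)]}
print(expand_keywords("shoes", lambda t: GRAPH.get(t, [])))
```

Multi-hop expansion is what surfaces the cheap, low-volume terms ("boots", "trainers") that are not direct neighbors of the seed, which is the economic point of the approach.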

Proceedings Article
01 Jan 2007
TL;DR: This paper shows that, despite the ad hoc and informal language of tagging, tags define a low-dimensional semantic space that is extremely well-behaved at the track level, in particular being highly organised by artist and musical genre.
Abstract: In this paper we investigate social tags as a novel highvolume source of semantic metadata for music, using techniques from the fields of information retrieval and multivariate data analysis. We show that, despite the ad hoc and informal language of tagging, tags define a low-dimensional semantic space that is extremely well-behaved at the track level, in particular being highly organised by artist and musical genre. We introduce the use of Correspondence Analysis to visualise this semantic space, and show how it can be applied to create a browse-by-mood interface for a psychologically-motivated two-dimensional subspace representing musical emotion.

Proceedings Article
Wen-tau Yih, Christopher Meek
22 Jul 2007
TL;DR: A Web-relevance similarity measure is introduced and it is shown that one can further improve the accuracy of similarity measures by using a machine learning approach.
Abstract: In this paper we improve previous work on measuring the similarity of short segments of text in two ways. First, we introduce a Web-relevance similarity measure and demonstrate its effectiveness. This measure extends the Web-kernel similarity function introduced by Sahami and Heilman (2006) by using relevance weighted inner-product of term occurrences rather than TF×IDF. Second, we show that one can further improve the accuracy of similarity measures by using a machine learning approach. Our methods outperform other state-of-the-art methods in a general query suggestion task for multiple evaluation metrics.