
Showing papers on "Semantic similarity" published in 2013


Proceedings ArticleDOI
27 Oct 2013
TL;DR: A series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them are developed.
Abstract: Latent semantic models, such as LSA, intend to map a query to its relevant documents at the semantic level where keyword-based matching often fails. In this study we strive to develop a series of new latent semantic models with a deep structure that project queries and documents into a common low-dimensional space where the relevance of a document given a query is readily computed as the distance between them. The proposed deep structured semantic models are discriminatively trained by maximizing the conditional likelihood of the clicked documents given a query using the clickthrough data. To make our models applicable to large-scale Web search applications, we also use a technique called word hashing, which is shown to effectively scale up our semantic models to handle large vocabularies which are common in such tasks. The new models are evaluated on a Web document ranking task using a real-world data set. Results show that our best model significantly outperforms other latent semantic models, which were considered state-of-the-art in the performance prior to the work presented in this paper.

1,935 citations
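The abstract above names word hashing and a distance in a shared space without giving details. A minimal sketch follows, assuming that word hashing means breaking each word into letter trigrams with `#` boundary markers and that relevance is scored with cosine similarity. Raw trigram counts stand in for the learned deep projections, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def letter_trigrams(word):
    """Split a word into letter trigrams, with '#' marking word boundaries."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def hash_text(text):
    """Bag of letter trigrams for a whitespace-tokenized text."""
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed usage: score a document against a query in trigram space.
query = hash_text("semantic search")
document = hash_text("searching semantics")
score = cosine(query, document)
```

Because the trigram inventory is far smaller than a Web-scale word vocabulary, the input dimensionality stays bounded even when new words appear, which is the scaling property the abstract attributes to word hashing.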


Journal ArticleDOI
TL;DR: This survey discusses the existing works on text similarity through partitioning them into three approaches; String-based, Corpus-based and Knowledge-based similarities, and samples of combination between these similarities are presented.
Abstract: Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing works on text similarity through partitioning them into three approaches; String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combination between these similarities are presented. General Terms: Text Mining, Natural Language Processing. Keywords: Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity. 1. INTRODUCTION Text similarity measures play an increasingly important role in text related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, questions generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding similarity between words is a fundamental part of text similarity which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposite of each other, are used in the same way, are used in the same context, or one is a type of another. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms. String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison.
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular for each type will be presented briefly. This paper is organized as follows: Section two presents String-Based algorithms by partitioning them into two types: character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five and finally section six presents the conclusion of the survey.

718 citations
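The survey abstract above splits String-Based measures into character-based and term-based types. As an illustration only, the sketch below pairs one standard measure of each kind: Levenshtein edit distance over characters and Jaccard overlap over whitespace tokens. These two are chosen here as familiar examples; the abstract itself does not single them out.

```python
def levenshtein(a, b):
    """Character-based: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(text_a, text_b):
    """Term-based: shared tokens divided by all distinct tokens."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Both measures are purely lexical: "car" and "automobile" score as dissimilar, which is the gap that the Corpus-Based and Knowledge-Based measures described in the abstract are meant to close.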


Proceedings Article
01 Oct 2013
TL;DR: A method is proposed to learn bilingual embeddings from a large unlabeled corpus while utilizing MT word alignments to constrain translational equivalence; the new embeddings significantly outperform baselines in word semantic similarity.
Abstract: We introduce bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. We propose a method to learn bilingual embeddings from a large unlabeled corpus, while utilizing MT word alignments to constrain translational equivalence. The new embeddings significantly out-perform baselines in word semantic similarity. A single semantic similarity feature induced with bilingual embeddings adds near half a BLEU point to the results of NIST08 Chinese-English machine translation task.

608 citations


Proceedings Article
13 Jun 2013
TL;DR: Three semantic text similarity systems were developed for the *SEM 2013 STS shared task: the first used a simple term alignment algorithm augmented with penalty terms, and the other two used support vector regression models to combine larger sets of features.
Abstract: We describe three semantic text similarity systems developed for the *SEM 2013 STS shared task and the results of the corresponding three runs. All of them shared a word similarity feature that combined LSA word similarity and WordNet knowledge. The first, which achieved the best mean score of the 89 submitted runs, used a simple term alignment algorithm augmented with penalty terms. The other two runs, ranked second and fourth, used support vector regression models to combine larger sets of features.

386 citations


Journal ArticleDOI
TL;DR: The Wikipedia Miner toolkit is introduced, an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications, and creates databases that contain summarized versions of Wikipedia's content and structure.

382 citations


Journal ArticleDOI
TL;DR: Gradient Field HOG is described; an adapted form of the HOG descriptor suitable for Sketch Based Image Retrieval (SBIR) and incorporated into a Bag of Visual Words retrieval framework, and shown to consistently outperform retrieval versus SIFT, multi-resolution HOG, Self Similarity, Shape Context and Structure Tensor.

363 citations


Proceedings Article
01 Aug 2013
TL;DR: This paper presents smatch, a metric that calculates the degree of overlap between two semantic feature structures, and gives an efficient algorithm to compute the metric and shows the results of an inter-annotator agreement study.
Abstract: The evaluation of whole-sentence semantic structures plays an important role in semantic parsing and large-scale semantic structure annotation. However, there is no widely-used metric to evaluate whole-sentence semantic structures. In this paper, we present smatch, a metric that calculates the degree of overlap between two semantic feature structures. We give an efficient algorithm to compute the metric and show the results of an inter-annotator agreement study.

327 citations


01 Mar 2013
TL;DR: Evaluation on three data sets shows that the distributional-based measures outperform the state-of-the-art approach for this task.
Abstract: This paper introduces distributional semantic similarity methods for automatically measuring the coherence of a set of words generated by a topic model. We construct a semantic space to represent each topic word by making use of Wikipedia as a reference corpus to identify context features and collect frequencies. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information (PMI). Topic coherence is determined by measuring the distance between these vectors computed using a variety of metrics. Evaluation on three data sets shows that the distributional-based measures outperform the state-of-the-art approach for this task.

255 citations
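The pipeline in the abstract above (context features, PMI weighting, vector comparison) can be sketched roughly as follows. This is an assumption-laden toy: it uses plain PMI where the abstract says variants of PMI, cosine where the abstract says a variety of metrics, and hand-supplied counts instead of Wikipedia statistics. The function and parameter names are hypothetical.

```python
from itertools import combinations
from math import log, sqrt


def pmi_vector(word, cooc, word_count, feat_count, total):
    """PMI between `word` and every context feature it co-occurs with."""
    vec = {}
    for feat, joint in cooc.get(word, {}).items():
        if joint > 0:
            # PMI = log( p(word, feat) / (p(word) * p(feat)) )
            vec[feat] = log((joint * total) / (word_count[word] * feat_count[feat]))
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_coherence(topic_words, cooc, word_count, feat_count, total):
    """Mean pairwise cosine between the PMI vectors of the topic words."""
    vectors = [pmi_vector(w, cooc, word_count, feat_count, total) for w in topic_words]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Under this sketch a topic whose words share context features in similar proportions scores near 1, while a topic of unrelated words scores near 0.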


Journal ArticleDOI
TL;DR: This work describes an alternative, computationally derived measure of ambiguity based on the proposal that the meanings of words vary continuously as a function of their contexts, and suggests that this approach provides an objective way of quantifying the subtle, context-dependent variations in word meaning.
Abstract: Semantic ambiguity is typically measured by summing the number of senses or dictionary definitions that a word has. Such measures are somewhat subjective and may not adequately capture the full extent of variation in word meaning, particularly for polysemous words that can be used in many different ways, with subtle shifts in meaning. Here, we describe an alternative, computationally derived measure of ambiguity based on the proposal that the meanings of words vary continuously as a function of their contexts. On this view, words that appear in a wide range of contexts on diverse topics are more variable in meaning than those that appear in a restricted set of similar contexts. To quantify this variation, we performed latent semantic analysis on a large text corpus to estimate the semantic similarities of different linguistic contexts. From these estimates, we calculated the degree to which the different contexts associated with a given word vary in their meanings. We term this quantity a word's semantic diversity (SemD). We suggest that this approach provides an objective way of quantifying the subtle, context-dependent variations in word meaning that are often present in language. We demonstrate that SemD is correlated with other measures of ambiguity and contextual variability, as well as with frequency and imageability. We also show that SemD is a strong predictor of performance in semantic judgments in healthy individuals and in patients with semantic deficits, accounting for unique variance beyond that of other predictors. SemD values for over 30,000 English words are provided as supplementary materials.

235 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper creates 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions and thoroughly analyzes this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.
Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of semantically similar real images would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract scenes with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity.

231 citations


Journal ArticleDOI
TL;DR: It is proposed that a brain region can be considered to represent amodal conceptual object knowledge if it is supramodal and plays a role in distinguishing among the conceptual representations of different objects.
Abstract: To what extent do the brain regions implicated in semantic processing contribute to the representation of amodal conceptual content rather than modality-specific mechanisms or mechanisms of semantic access and manipulation? Here, we propose that a brain region can be considered to represent amodal conceptual object knowledge if it is supramodal and plays a role in distinguishing among the conceptual representations of different objects. In an fMRI study, human participants made category typicality judgments about pictured objects or their names drawn from five different categories. Crossmodal multivariate pattern analysis revealed a network of six left-lateralized regions largely outside of category-selective visual cortex that showed a supramodal representation of object categories. These were located in the posterior middle/inferior temporal gyrus (pMTG/ITG), angular gyrus, ventral temporal cortex, posterior cingulate/precuneus (PC), and lateral and dorsomedial prefrontal cortex. Representational similarity analysis within these regions determined that the similarity between category-specific patterns of neural activity in the pMTG/ITG and the PC was consistent with the semantic similarity between these categories. This finding supports the PC and pMTG/ITG as candidate regions for the amodal representation of the conceptual properties of objects.

01 Jan 2013
TL;DR: The paper contains a review of the state of art measures, including path based measures, information based measures, feature based measures and hybrid measures, and the area of future research is described.
Abstract: Semantic similarity has attracted great concern for a long time in artificial intelligence, psychology and cognitive science. In recent years the measures based on WordNet have shown their talents and attracted great concern. Many measures have been proposed. The paper contains a review of the state of art measures, including path based measures, information based measures, feature based measures and hybrid measures. The features, performance, advantages, disadvantages and related issues of different measures are discussed. Finally the area of future research is described.
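To make the path based family mentioned in the abstract above concrete, here is a minimal sketch over a hypothetical five-node is-a taxonomy rather than WordNet itself. It shows two commonly used scores: an inverse shortest-path score and a Wu-Palmer-style depth score. The taxonomy, the node names and the choice of these two formulas are assumptions for illustration.

```python
# Hypothetical toy taxonomy: each node maps to its single parent (None for the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}


def ancestors(node, parent=PARENT):
    """Chain from the node itself up to the root."""
    chain = [node]
    while parent[node] is not None:
        node = parent[node]
        chain.append(node)
    return chain


def lowest_common_subsumer(a, b, parent=PARENT):
    """Deepest node that is an ancestor of both a and b."""
    up_b = set(ancestors(b, parent))
    for node in ancestors(a, parent):
        if node in up_b:
            return node
    raise ValueError("nodes share no ancestor")


def path_similarity(a, b, parent=PARENT):
    """1 / (1 + number of edges on the shortest is-a path between a and b)."""
    lcs = lowest_common_subsumer(a, b, parent)
    edges = ancestors(a, parent).index(lcs) + ancestors(b, parent).index(lcs)
    return 1.0 / (1.0 + edges)


def wu_palmer(a, b, parent=PARENT):
    """2 * depth(lcs) / (depth(a) + depth(b)), counting the root as depth 1."""
    lcs = lowest_common_subsumer(a, b, parent)
    depth_lcs = len(ancestors(lcs, parent))
    return 2.0 * depth_lcs / (len(ancestors(a, parent)) + len(ancestors(b, parent)))
```

Path scores depend only on edge counts, so two siblings near the root score the same as two siblings deep in the hierarchy; depth-aware and information based measures are common ways to address that.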

Proceedings Article
01 Aug 2013
TL;DR: This work presents a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents, and leverages a common probabilistic representation over word senses in order to compare different types of linguistic data.
Abstract: Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.

Journal ArticleDOI
TL;DR: The results show that the multiple-response procedure results in a more heterogeneous set of responses, which lead to better predictions of lexical access and semantic relatedness than do single-response procedures.
Abstract: In this article, we describe the most extensive set of word associations collected to date. The database contains over 12,000 cue words for which more than 70,000 participants generated three responses in a multiple-response free association task. The goal of this study was (1) to create a semantic network that covers a large part of the human lexicon, (2) to investigate the implications of a multiple-response procedure by deriving a weighted directed network, and (3) to show how measures of centrality and relatedness derived from this network predict both lexical access in a lexical decision task and semantic relatedness in similarity judgment tasks. First, our results show that the multiple-response procedure results in a more heterogeneous set of responses, which lead to better predictions of lexical access and semantic relatedness than do single-response procedures. Second, the directed nature of the network leads to a decomposition of centrality that primarily depends on the number of incoming links or in-degree of each node, rather than its set size or number of outgoing links. Both studies indicate that adequate representation formats and sufficiently rich data derived from word associations represent a valuable type of information in both lexical and semantic processing.

Proceedings Article
01 Oct 2013
TL;DR: A new discriminative term-weighting metric called TF-KLD is designed, which outperforms TF-IDF and it is shown that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy.
Abstract: Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.

Posted Content
TL;DR: The authors proposed a method for learning distributed representations in a multilingual setup, which learns to assign similar embeddings to aligned sentences and dissimilar ones to sentences which are not aligned while not requiring word alignments.
Abstract: Distributed representations of meaning are a natural way to encode covariance relationships between words and phrases in NLP. By overcoming data sparsity problems, as well as providing information about semantic relatedness which is not available in discrete representations, distributed representations have proven useful in many NLP tasks. Recent work has shown how compositional semantic representations can successfully be applied to a number of monolingual applications such as sentiment analysis. At the same time, there has been some initial success in work on learning shared word-level representations across languages. We combine these two approaches by proposing a method for learning distributed representations in a multilingual setup. Our model learns to assign similar embeddings to aligned sentences and dissimilar ones to sentences which are not aligned while not requiring word alignments. We show that our representations are semantically informative and apply them to a cross-lingual document classification task where we outperform the previous state of the art. Further, by employing parallel corpora of multiple language pairs we find that our model learns representations that capture semantic relationships across languages for which no parallel data was used.

Proceedings ArticleDOI
21 Oct 2013
TL;DR: Experimental results show that the proposed A2 SH can characterize the semantic affinities among images accurately and can shape user search intent precisely and quickly, leading to more accurate search results as compared to state-of-the-art CBIR solutions.
Abstract: This paper presents a novel Attribute-augmented Semantic Hierarchy (A2 SH) and demonstrates its effectiveness in bridging both the semantic and intention gaps in Content-based Image Retrieval (CBIR). A2 SH organizes the semantic concepts into multiple semantic levels and augments each concept with a set of related attributes, which describe the multiple facets of the concept and act as the intermediate bridge connecting the concept and low-level visual content. A hierarchical semantic similarity function is learnt to characterize the semantic similarities among images for retrieval. To better capture user search intent, a hybrid feedback mechanism is developed, which collects hybrid feedbacks on attributes and images. These feedbacks are then used to refine the search results based on A2 SH. We develop a content-based image retrieval system based on the proposed A2 SH. We conduct extensive experiments on a large-scale data set of over one million Web images. Experimental results show that the proposed A2 SH can characterize the semantic affinities among images accurately and can shape user search intent precisely and quickly, leading to more accurate search results as compared to state-of-the-art CBIR solutions.

Journal ArticleDOI
TL;DR: In both tasks, PD patients' performance was selectively impaired for action verbs (relative to controls), indicating that the motor system plays a more central role in the processing of action verbs than in the processing of abstract verbs, arguing for a causal role of sensory-motor systems in semantic processing.

Proceedings Article
01 Jan 2013
TL;DR: This paper proposes to combine term generalization approaches such as word clustering and latent semantic analysis (LSA) and structured kernels to improve the adaptability of relation extractors to new text genres/domains.
Abstract: Relation Extraction (RE) is the task of extracting semantic relationships between entities in text. Recent studies on relation extraction are mostly supervised. The clear drawback of supervised methods is the need of training data: labeled data is expensive to obtain, and there is often a mismatch between the training data and the data the system will be applied to. This is the problem of domain adaptation. In this paper, we propose to combine (i) term generalization approaches such as word clustering and latent semantic analysis (LSA) and (ii) structured kernels to improve the adaptability of relation extractors to new text genres/domains. The empirical evaluation on ACE 2005 domains shows that a suitable combination of syntax and lexical generalization is very promising for domain adaptation.

Journal ArticleDOI
TL;DR: Devising a mechanism for computing the semantic similarity of the OSM geographic classes can help alleviate this semantic gap, and empirical evidence supports the usage of co-citation algorithms—SimRank showing the highest plausibility—to compute concept similarity in a crowdsourced semantic network.
Abstract: In recent years, a web phenomenon known as Volunteered Geographic Information (VGI) has produced large crowdsourced geographic data sets. OpenStreetMap (OSM), the leading VGI project, aims at building an open-content world map through user contributions. OSM semantics consists of a set of properties (called ‘tags’) describing geographic classes, whose usage is defined by project contributors on a dedicated Wiki website. Because of its simple and open semantic structure, the OSM approach often results in noisy and ambiguous data, limiting its usability for analysis in information retrieval, recommender systems and data mining. Devising a mechanism for computing the semantic similarity of the OSM geographic classes can help alleviate this semantic gap. The contribution of this paper is twofold. It consists of (1) the development of the OSM Semantic Network by means of a web crawler tailored to the OSM Wiki website; this semantic network can be used to compute semantic similarity through co-citation measures, providing a novel semantic tool for OSM and GIS communities; (2) a study of the cognitive plausibility (i.e. the ability to replicate human judgement) of co-citation algorithms when applied to the computation of semantic similarity of geographic concepts. Empirical evidence supports the usage of co-citation algorithms, with SimRank showing the highest plausibility, to compute concept similarity in a crowdsourced semantic network.
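The abstract above reports SimRank as the most plausible co-citation algorithm but does not describe it. The sketch below implements the textbook iteration, in which two nodes are similar when the nodes linking to them are similar. The decay factor of 0.8, the fixed iteration count and the tiny graph of OSM-like page names are assumptions for illustration, not the paper's configuration.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Iterative SimRank over a graph given as node -> list of in-neighbors.

    Every node, including those that only appear as in-neighbors,
    must be a key of `in_neighbors`.
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                in_a, in_b = in_neighbors[a], in_neighbors[b]
                if not in_a or not in_b:
                    new[a][b] = 0.0  # no incoming links, nothing to compare
                    continue
                total = sum(sim[i][j] for i in in_a for j in in_b)
                new[a][b] = decay * total / (len(in_a) * len(in_b))
        sim = new
    return sim


# Hypothetical link structure: the "amenity" page links to "cafe" and "restaurant".
graph = {
    "amenity": [],
    "cafe": ["amenity"],
    "restaurant": ["amenity"],
    "river": [],
}
scores = simrank(graph)
```

In this toy graph "cafe" and "restaurant" are cited by the same page and therefore score 0.8, while "river" has no incoming links and scores 0 against everything but itself.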

Proceedings Article
01 Aug 2013
TL;DR: SEMILAR implements a number of algorithms for assessing the semantic similarity between two texts and offers facilities for manual semantic similarity annotation by experts through its component SEMILAT (a SEMantic simILarity Annotation Tool).
Abstract: We present in this paper SEMILAR, the SEMantic simILARity toolkit. SEMILAR implements a number of algorithms for assessing the semantic similarity between two texts. It is available as a Java library and as a Java standalone application offering GUI-based access to the implemented semantic similarity methods. Furthermore, it offers facilities for manual semantic similarity annotation by experts through its component SEMILAT (a SEMantic simILarity Annotation Tool).

Proceedings ArticleDOI
08 Oct 2013
TL;DR: Experiments suggest this interface is an effective alternative for novices performing tasks with high-level design goals, enabling rapid, in-situ exploration of candidate designs.
Abstract: We present AttribIt, an approach for people to create visual content using relative semantic attributes expressed in linguistic terms. During an off-line processing step, AttribIt learns semantic attributes for design components that reflect the high-level intent people may have for creating content in a domain (e.g. adjectives such as "dangerous", "scary" or "strong") and ranks them according to the strength of each learned attribute. Then, during an interactive design session, a person can explore different combinations of visual components using commands based on relative attributes (e.g. "make this part more dangerous"). Novel designs are assembled in real-time as the strengths of selected attributes are varied, enabling rapid, in-situ exploration of candidate designs. We applied this approach to 3D modeling and web design. Experiments suggest this interface is an effective alternative for novices performing tasks with high-level design goals.

Journal ArticleDOI
TL;DR: An event-related functional magnetic resonance imaging (fMRI) experiment determined how cosine similarity between fMRI response patterns to concrete words and pictures reflects semantic clustering and semantic distances between the represented entities within a single category.
Abstract: How verbal and nonverbal visuoperceptual input connects to semantic knowledge is a core question in visual and cognitive neuroscience, with significant clinical ramifications. In an event-related functional magnetic resonance imaging (fMRI) experiment we determined how cosine similarity between fMRI response patterns to concrete words and pictures reflects semantic clustering and semantic distances between the represented entities within a single category. Semantic clustering and semantic distances between 24 animate entities were derived from a concept-feature matrix based on feature generation by >1000 subjects. In the main fMRI study, 19 human subjects performed a property verification task with written words and pictures and a low-level control task. The univariate contrast between the semantic and the control task yielded extensive bilateral occipitotemporal activation from posterior cingulate to anteromedial temporal cortex. Entities belonging to a same semantic cluster elicited more similar fMRI activity patterns in left occipitotemporal cortex. When words and pictures were analyzed separately, the effect reached significance only for words. The semantic similarity effect for words was localized to left perirhinal cortex. According to a representational similarity analysis of left perirhinal responses, semantic distances between entities correlated inversely with cosine similarities between fMRI response patterns to written words. An independent replication study in 16 novel subjects confirmed these novel findings. Semantic similarity is reflected by similarity of functional topography at a fine-grained level in left perirhinal cortex. The word specificity excludes perceptually driven confounds as an explanation and is likely to be task dependent.

Proceedings ArticleDOI
11 Aug 2013
TL;DR: This work presents WikiTables, a Web application that enables users to interactively explore tabular knowledge extracted from Wikipedia, and shows that it substantially outperforms baselines on the novel task of automatically joining together disparate tables to uncover "interesting" relationships between table columns.
Abstract: Knowledge bases extracted automatically from the Web present new opportunities for data mining and exploration. Given a large, heterogeneous set of extracted relations, new tools are needed for searching the knowledge and uncovering relationships of interest. We present WikiTables, a Web application that enables users to interactively explore tabular knowledge extracted from Wikipedia. In experiments, we show that WikiTables substantially outperforms baselines on the novel task of automatically joining together disparate tables to uncover "interesting" relationships between table columns. We find that a "Semantic Relatedness" measure that leverages the Wikipedia link structure accounts for a majority of this improvement. Further, on the task of keyword search for tables, we show that WikiTables performs comparably to Google Fusion Tables despite using an order of magnitude fewer tables. Our work also includes the release of a number of public resources, including over 15 million tuples of extracted tabular data, manually annotated evaluation sets, and public APIs.

Journal ArticleDOI
TL;DR: An information-theoretic framework is proposed that uses a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein’s function and proposes a single statistic, referred to as semantic distance, that can be used to rank classification models.
Abstract: Motivation: The development of effective methods for the prediction of ontological annotations is an important goal in computational biology, with protein function prediction and disease gene prioritization gaining wide recognition. Although various algorithms have been proposed for these tasks, evaluating their performance is difficult owing to problems caused both by the structure of biomedical ontologies and biased or incomplete experimental annotations of genes and gene products. Results: We propose an information-theoretic framework to evaluate the performance of computational protein function prediction. We use a Bayesian network, structured according to the underlying ontology, to model the prior probability of a protein’s function. We then define two concepts, misinformation and remaining uncertainty, that can be seen as information-theoretic analogs of precision and recall. Finally, we propose a single statistic, referred to as semantic distance, that can be used to rank classification models. We evaluate our approach by analyzing the performance of three protein function predictors of Gene Ontology terms and provide evidence that it addresses several weaknesses of currently used metrics. We believe this framework provides useful insights into the performance of protein function prediction tools. Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features and examines the importance of the retrieval set used to compute the nearest neighbours using a novel semantic descriptor to retrieve better candidates.
Abstract: This paper presents a nonparametric approach to semantic parsing using small patches and simple gradient, color and location features. We learn the relevance of individual feature channels at test time using a locally adaptive distance metric. To further improve the accuracy of the nonparametric approach, we examine the importance of the retrieval set used to compute the nearest neighbours using a novel semantic descriptor to retrieve better candidates. The approach is validated by experiments on several datasets used for semantic parsing, demonstrating the superiority of the method compared to state-of-the-art approaches.

Journal ArticleDOI
TL;DR: These data represent the largest behavioral database on semantic priming and are available to researchers to aid in selecting stimuli, testing theories, and reducing potential confounds in their studies.
Abstract: Speeded naming and lexical decision data for 1,661 target words following related and unrelated primes were collected from 768 subjects across four different universities. These behavioral measures have been integrated with demographic information for each subject and descriptive characteristics for every item. Subjects also completed portions of the Woodcock–Johnson reading battery, three attentional control tasks, and a circadian rhythm measure. These data are available at a user-friendly Internet-based repository (http://spp.montana.edu). This Web site includes a search engine designed to generate lists of prime–target pairs with specific characteristics (e.g., length, frequency, associative strength, latent semantic similarity, priming effect in standardized and raw reaction times). We illustrate the types of questions that can be addressed via the Semantic Priming Project. These data represent the largest behavioral database on semantic priming and are available to researchers to aid in selecting stimuli, testing theories, and reducing potential confounds in their studies.

Journal ArticleDOI
TL;DR: This paper uses Wikipedia features (articles, categories, the Wikipedia category graph and redirection) in a system that combines this semantic information across its components to better quantify the semantic relatedness between words.
Abstract: Measuring semantic relatedness is a critical task in many domains such as psychology, biology, linguistics, cognitive science and artificial intelligence. In this paper, we propose a novel system for computing semantic relatedness between words. Recent approaches have exploited Wikipedia as a huge semantic resource and have shown good performance. Therefore, we utilized Wikipedia features (articles, categories, the Wikipedia category graph and redirection) in a system combining this Wikipedia semantic information in its different components. The approach is preceded by a pre-processing step that provides, for each category in the Wikipedia category graph, a semantic description vector including the weights of stems extracted from the articles assigned to that category. Next, for each candidate word, we collect its set of categories using an algorithm for category extraction from the Wikipedia category graph. Then, we compute the degree of semantic relatedness using existing vector similarity metrics (Dice, Overlap and Cosine) and a newly proposed metric that performed as well as the cosine formula. The basic system is followed by a set of modules that exploit Wikipedia features to better quantify the semantic relatedness between words. We evaluate our measure on two tasks: comparison with human judgments using five datasets and a specific application, "solving choice problem". The resulting system shows good performance and sometimes outperforms the ESA (Explicit Semantic Analysis) and TSA (Temporal Semantic Analysis) approaches.
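
The three existing vector similarity metrics named in this abstract can be sketched as follows for sparse weighted vectors stored as stem-to-weight dictionaries. This uses one common vector-space generalization of Dice and Overlap. The abstract does not say which variants the authors used, and their newly proposed metric is not described, so it is not reproduced here. The example vectors and all function names are made up for illustration.

```python
import math


def dot(a, b):
    """Dot product of two sparse vectors stored as {stem: weight} dicts."""
    return sum(w * b[s] for s, w in a.items() if s in b)


def squared_norm(a):
    return sum(w * w for w in a.values())


def cosine(a, b):
    denom = math.sqrt(squared_norm(a)) * math.sqrt(squared_norm(b))
    return dot(a, b) / denom if denom else 0.0


def dice(a, b):
    denom = squared_norm(a) + squared_norm(b)
    return 2.0 * dot(a, b) / denom if denom else 0.0


def overlap(a, b):
    denom = min(squared_norm(a), squared_norm(b))
    return dot(a, b) / denom if denom else 0.0


# Hypothetical semantic description vectors for two words.
vec_a = {"car": 1.0, "engine": 1.0, "wheel": 1.0}
vec_b = {"car": 1.0, "road": 1.0}

print(cosine(vec_a, vec_b), dice(vec_a, vec_b), overlap(vec_a, vec_b))
```

With binary weights, as in this example, all three scores fall between 0 and 1. With arbitrary weights this form of Overlap can exceed 1, which is one reason the choice of variant matters.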

Proceedings ArticleDOI
03 Oct 2013
TL;DR: This paper presents an expansion method for concept-based information retrieval that uses semantic relatedness to extend the user query through an undirected graph of concepts.
Abstract: Concept-based search is a method that enhances information retrieval systems using semantic relationships. The recall of concept-based search is relatively low because it is not easy to represent a concept completely. Since concept representation is always partial, query expansion aims to fill this gap and improve recall. In this paper we present an expansion method for concept-based information retrieval. Our method uses semantic relatedness to extend the user query through an undirected graph of concepts.
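
A minimal sketch of the general idea of graph-based query expansion, assuming a one-hop expansion with a relatedness threshold. The abstract does not describe the authors' algorithm, so the graph contents, the threshold, and the function names are hypothetical.

```python
def build_undirected_graph(edges):
    """Store each weighted edge in both directions so the graph is undirected."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def expand_query(query_concepts, graph, threshold=0.5):
    """Add direct neighbours whose relatedness to a query concept meets the threshold."""
    expanded = list(query_concepts)
    seen = set(query_concepts)
    for concept in query_concepts:
        neighbours = graph.get(concept, {})
        # Visit the most related neighbours first so the output order is stable.
        for neighbour, weight in sorted(neighbours.items(), key=lambda kv: -kv[1]):
            if weight >= threshold and neighbour not in seen:
                seen.add(neighbour)
                expanded.append(neighbour)
    return expanded


# Hypothetical concept graph with semantic relatedness scores as edge weights.
edges = [
    ("car", "automobile", 0.9),
    ("car", "road", 0.4),
    ("automobile", "vehicle", 0.8),
]
graph = build_undirected_graph(edges)
expanded = expand_query(["car"], graph, threshold=0.5)
print(expanded)
```

Lowering the threshold or following more than one hop would trade precision for recall, which is the tension the abstract describes.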

Posted Content
TL;DR: This work introduces a novel tree representation, and uses it to train predictive models with tree kernels using support vector machines, and shows that features derived from semantic frame parsing have significantly better performance across years on the polarity task.
Abstract: Semantic frames are a rich linguistic resource. There has been much work on semantic frame parsers, but less that applies them to general NLP problems. We address a task to predict change in stock price from financial news. Semantic frames help to generalize from specific sentences to scenarios, and to detect the (positive or negative) roles of specific companies. We introduce a novel tree representation, and use it to train predictive models with tree kernels using support vector machines. Our experiments test multiple text representations on two binary classification tasks, change of price and polarity. Experiments show that features derived from semantic frame parsing have significantly better performance across years on the polarity task.