
Showing papers on "Semantic similarity" published in 2017


Proceedings ArticleDOI
TL;DR: The STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017), providing insight into the limitations of existing models.
Abstract: Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).
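As a rough, hedged illustration of the task setup, the sketch below scores toy sentence pairs with a trivial averaged-word-vector cosine baseline and compares the predictions against gold ratings using Pearson correlation (the task's evaluation measure); the word vectors, sentences and ratings are all hypothetical.

```python
# Sketch of an STS-style baseline: cosine similarity of averaged word vectors,
# evaluated against gold ratings with Pearson correlation. Everything below
# (vectors, sentences, ratings) is a hypothetical toy example.
import numpy as np

vecs = {
    "a": np.array([0.1, 0.2]), "dog": np.array([0.9, 0.1]),
    "puppy": np.array([0.85, 0.15]), "runs": np.array([0.2, 0.8]),
    "sleeps": np.array([0.1, 0.9]),
}

def sentence_vec(sentence):
    words = [w for w in sentence.lower().split() if w in vecs]
    return np.mean([vecs[w] for w in words], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pairs = [("a dog runs", "a puppy runs"), ("a dog runs", "a dog sleeps")]
gold = np.array([4.5, 2.5])   # hypothetical ratings on the usual 0-5 STS scale
pred = np.array([cosine(sentence_vec(s1), sentence_vec(s2)) for s1, s2 in pairs])

print(pred, np.corrcoef(pred, gold)[0, 1])   # Pearson r is the STS metric
```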

1,124 citations


Proceedings ArticleDOI
Yair Movshovitz-Attias1, Alexander Toshev1, Thomas Leung1, Sergey Ioffe1, Saurabh Singh1 
01 Oct 2017
TL;DR: This paper proposes to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well, and proposes a proxy-based loss which improves on state-of-the-art results for three standard zero-shot learning datasets.
Abstract: We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship – an anchor point x is similar to a set of positive points Y, and dissimilar to a set of negative points Z, and a loss defined over these distances is minimized. While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc. Even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy-loss improves on state-of-the-art results for three standard zero-shot learning datasets, by up to 15 percentage points, while converging three times as fast as other triplet-based losses.
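A minimal NumPy sketch of the proxy idea under simplifying assumptions: one learned proxy per class and an NCA-style softmax over negative squared distances to the proxies, so no triplet mining is needed. This is schematic and not the authors' exact formulation.

```python
import numpy as np

def proxy_loss(anchor, label, proxies):
    """Simplified proxy-based loss: the anchor embedding should be close to its
    class proxy and far from all other proxies (NCA-style softmax over negative
    squared distances). Proxies are learned parameters, one per class, so no
    triplet mining over real data points is required."""
    d = np.sum((proxies - anchor) ** 2, axis=1)   # squared distance to each proxy
    logits = -d
    logits -= logits.max()                        # numerical stability
    return -(logits[label] - np.log(np.exp(logits).sum()))

rng = np.random.default_rng(0)
proxies = rng.normal(size=(5, 8))                 # 5 classes, 8-dim embedding space
anchor = proxies[2] + 0.1 * rng.normal(size=8)    # an embedding near class 2's proxy
print(proxy_loss(anchor, 2, proxies))             # small loss for the correct class
```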

389 citations


Posted Content
Yair Movshovitz-Attias1, Alexander Toshev1, Thomas Leung1, Sergey Ioffe1, Saurabh Singh1 
TL;DR: In this article, the authors proposed to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well.
Abstract: We address the problem of distance metric learning (DML), defined as learning a distance consistent with a notion of semantic similarity. Traditionally, for this problem supervision is expressed in the form of sets of points that follow an ordinal relationship -- an anchor point x is similar to a set of positive points Y, and dissimilar to a set of negative points Z, and a loss defined over these distances is minimized. While the specifics of the optimization differ, in this work we collectively call this type of supervision Triplets and all methods that follow this pattern Triplet-Based methods. These methods are challenging to optimize. A main issue is the need for finding informative triplets, which is usually achieved by a variety of tricks such as increasing the batch size, hard or semi-hard triplet mining, etc. Even with these tricks, the convergence rate of such methods is slow. In this paper we propose to optimize the triplet loss on a different space of triplets, consisting of an anchor data point and similar and dissimilar proxy points which are learned as well. These proxies approximate the original data points, so that a triplet loss over the proxies is a tight upper bound of the original loss. This proxy-based loss is empirically better behaved. As a result, the proxy-loss improves on state-of-the-art results for three standard zero-shot learning datasets, by up to 15 percentage points, while converging three times as fast as other triplet-based losses.

369 citations


Journal ArticleDOI
TL;DR: It is argued that a new class of prediction-based models that are trained on a text corpus and that measure semantic similarity between words bridge the gap between traditional approaches to distributional semantics and psychologically plausible learning principles.

277 citations


Proceedings ArticleDOI
19 Aug 2017
TL;DR: This paper presents a novel approach for entity alignment via joint knowledge embeddings that jointly encodes both entities and relations of various KGs into a unified low-dimensional semantic space according to a small seed set of aligned entities.
Abstract: Entity alignment aims to link entities and their counterparts among multiple knowledge graphs (KGs). Most existing methods typically rely on external information of entities such as Wikipedia links and require costly manual feature construction to complete alignment. In this paper, we present a novel approach for entity alignment via joint knowledge embeddings. Our method jointly encodes both entities and relations of various KGs into a unified low-dimensional semantic space according to a small seed set of aligned entities. During this process, we can align entities according to their semantic distance in this joint semantic space. More specifically, we present an iterative and parameter-sharing method to improve alignment performance. Experimental results on real-world datasets show that, compared to baselines, our method achieves significant improvements on entity alignment, and can further improve knowledge graph completion performance on various KGs with the help of joint knowledge embeddings.
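The sketch below illustrates only the alignment step the abstract describes: once entities from different KGs live in a unified embedding space, counterparts are found by nearest-neighbor search on semantic distance. The embeddings are random stand-ins; the joint training itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical entity embeddings for two KGs after joint training into a
# unified low-dimensional semantic space (rows = entities).
kg1 = rng.normal(size=(4, 16))
kg2 = kg1[[2, 0, 3, 1]] + 0.05 * rng.normal(size=(4, 16))   # noisy counterparts

def align(src, tgt):
    """Align each source entity to its closest target entity by Euclidean
    distance in the joint embedding space (the 'semantic distance' above)."""
    d = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    return d.argmin(axis=1)

print(align(kg1, kg2))   # expected alignment: [1, 3, 0, 2]
```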

272 citations


Proceedings ArticleDOI
14 May 2017
TL;DR: It is shown how well the AE is capable of automatically learning a reasonable notion of semantic similarity among input features, and how the scheme can reduce the dimensionality of the features, thereby significantly minimising the memory requirements.
Abstract: This paper presents a novel feature learning model for cyber security tasks. We propose to use Auto-encoders (AEs), as a generative model, to learn latent representations of different feature sets. We show how well the AE is capable of automatically learning a reasonable notion of semantic similarity among input features. Specifically, the AE accepts a feature vector, obtained from cyber security phenomena, and extracts a code vector that captures the semantic similarity between the feature vectors. This similarity is embedded in an abstract latent representation. Because the AE is trained in an unsupervised fashion, much of this success comes from the appropriate original feature set used in this paper. It can also provide more discriminative features in contrast to other feature engineering approaches. Furthermore, the scheme can reduce the dimensionality of the features, thereby significantly minimising the memory requirements. We selected two different cyber security tasks: network-based anomaly intrusion detection and malware classification. We have analysed the proposed scheme with various classifiers using publicly available datasets for network anomaly intrusion detection and malware classification. Several appropriate evaluation metrics show improvement compared to prior results.
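A compact NumPy sketch of the underlying idea, under toy assumptions: a single-hidden-layer autoencoder trained to reconstruct its input, whose hidden activations serve as the lower-dimensional code vector fed to a downstream classifier. This is not the paper's architecture or data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))          # stand-in feature vectors (e.g. flow features)

d_in, d_code, lr = X.shape[1], 5, 0.01
W1 = rng.normal(scale=0.1, size=(d_in, d_code))   # encoder weights
W2 = rng.normal(scale=0.1, size=(d_code, d_in))   # decoder weights

for _ in range(500):                    # plain gradient descent on MSE reconstruction
    H = np.tanh(X @ W1)                 # code vector (latent representation)
    err = H @ W2 - X                    # reconstruction error
    grad_W2 = H.T @ err / len(X)
    grad_H = err @ W2.T * (1 - H ** 2)  # backprop through tanh
    grad_W1 = X.T @ grad_H / len(X)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

codes = np.tanh(X @ W1)                 # reduced-dimensional features for a classifier
print(codes.shape, float(np.mean((codes @ W2 - X) ** 2)))
```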

264 citations


Journal ArticleDOI
TL;DR: The paper highlights three limitations commonly associated with previous computational methods, and case studies further demonstrated the feasibility of the proposed method to discover potential miRNA-disease associations.
Abstract: Since the discovery of the regulatory function of microRNA (miRNA), increased attention has focused on identifying the relationship between miRNA and disease. It has been suggested that computational methods are an efficient way to identify potential disease-related miRNAs for further confirmation using biological experiments. In this paper, we first highlighted three limitations commonly associated with previous computational methods. To resolve these limitations, we established a disease similarity subnetwork and a miRNA similarity subnetwork by integrating multiple data sources, where the disease similarity is composed of disease semantic similarity and disease functional similarity, and the miRNA similarity is calculated using the miRNA-target gene and miRNA-lncRNA (long non-coding RNA) associations. Then, a heterogeneous network was constructed by connecting the disease similarity subnetwork and the miRNA similarity subnetwork using the known miRNA-disease associations. We extended random walk with restart to predict miRNA-disease associations in the heterogeneous network. The leave-one-out cross-validation achieved an average area under the curve (AUC) of 0.8049 across 341 diseases and 476 miRNAs. For five-fold cross-validation, our method achieved an AUC from 0.7970 to 0.9249 for 15 human diseases. Case studies further demonstrated the feasibility of our method to discover potential miRNA-disease associations. An online service for prediction is freely available at http://ifmda.aliapp.com.
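A small NumPy sketch of the core prediction step, random walk with restart over a (here, toy and hypothetical) heterogeneous adjacency matrix; the stationary probabilities rank candidate associations for the seed node.

```python
import numpy as np

def random_walk_with_restart(W, seed, restart=0.7, tol=1e-10):
    """Iterate p <- (1 - r) * W_norm @ p + r * p0 until convergence, where W is
    the adjacency matrix of the heterogeneous network and `seed` is the query
    node. The stationary probabilities rank candidate associations."""
    W_norm = W / W.sum(axis=0, keepdims=True)     # column-normalised transitions
    p0 = np.zeros(W.shape[0])
    p0[seed] = 1.0
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W_norm @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy 5-node network: nodes 0-2 stand in for diseases, 3-4 for miRNAs.
W = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
print(random_walk_with_restart(W, seed=0).round(3))
```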

248 citations


Proceedings ArticleDOI
07 Apr 2017
TL;DR: This paper revisits bilingual pivoting in the context of neural machine translation and presents a paraphrasing model based purely on neural networks, which represents paraphrases in a continuous space, estimates the degree of semantic relatedness between text segments of arbitrary length, and generates candidate paraphrases for any source input.
Abstract: Recognizing and generating paraphrases is an important component in many natural language processing applications. A well-established technique for automatically extracting paraphrases leverages bilingual corpora to find meaning-equivalent phrases in a single language by “pivoting” over a shared translation in another language. In this paper we revisit bilingual pivoting in the context of neural machine translation and present a paraphrasing model based purely on neural networks. Our model represents paraphrases in a continuous space, estimates the degree of semantic relatedness between text segments of arbitrary length, and generates candidate paraphrases for any source input. Experimental results across tasks and datasets show that neural paraphrases outperform those obtained with conventional phrase-based pivoting approaches.
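For reference, the sketch below shows the conventional bilingual pivoting baseline the paper revisits, not the neural model: the paraphrase probability marginalizes over shared foreign translations, p(e2 | e1) = Σ_f p(e2 | f) · p(f | e1). All phrases and probabilities are hypothetical.

```python
# Classic bilingual pivoting: p(e2 | e1) = sum_f p(e2 | f) * p(f | e1),
# marginalising over shared foreign translations f. Toy phrase tables below.
p_f_given_e = {                      # English phrase -> foreign translations
    "at the end": {"au bout": 0.6, "a la fin": 0.4},
    "finally":    {"a la fin": 0.7, "enfin": 0.3},
}
p_e_given_f = {                      # foreign phrase -> English phrases
    "au bout":  {"at the end": 0.8, "eventually": 0.2},
    "a la fin": {"at the end": 0.5, "finally": 0.5},
    "enfin":    {"finally": 0.9, "at last": 0.1},
}

def paraphrase_prob(e1, e2):
    return sum(p_f * p_e_given_f.get(f, {}).get(e2, 0.0)
               for f, p_f in p_f_given_e.get(e1, {}).items())

print(paraphrase_prob("at the end", "finally"))   # 0.4 * 0.5 = 0.2
```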

246 citations


Journal ArticleDOI
TL;DR: The wpath semantic similarity method has produced a statistically significant improvement over other semantic similarity methods, and in a real category classification evaluation, the wpath method has shown the best performance in terms of accuracy and F score.
Abstract: This paper presents a method for measuring the semantic similarity between concepts in Knowledge Graphs (KGs) such as WordNet and DBpedia. Previous work on semantic similarity methods has focused on either the structure of the semantic network between concepts (e.g., path length and depth), or only on the Information Content (IC) of concepts. We propose a semantic similarity method, namely wpath, to combine these two approaches, using IC to weight the shortest path length between concepts. Conventional corpus-based IC is computed from the distributions of concepts over a textual corpus, which requires preparing a domain corpus containing annotated concepts and has a high computational cost. As instances are already extracted from textual corpora and annotated by concepts in KGs, graph-based IC is proposed to compute IC based on the distributions of concepts over instances. Through experiments performed on well-known word similarity datasets, we show that the wpath semantic similarity method has produced a statistically significant improvement over other semantic similarity methods. Moreover, in a real category classification evaluation, the wpath method has shown the best performance in terms of accuracy and F score.
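A toy sketch of the combination the abstract describes, assuming the wpath form 1 / (1 + path_length · k^IC(LCS)): the IC of the least common subsumer shrinks the effective path between concepts that share a specific ancestor. The path lengths and IC values below are made up; in practice they would come from a KG such as WordNet.

```python
def wpath(path_length, ic_lcs, k=0.8):
    """Assumed wpath form: 1 / (1 + shortest_path_length * k**IC(LCS)).
    With k in (0, 1), a more informative (higher-IC) least common subsumer
    shrinks the weighted path and therefore raises the similarity."""
    return 1.0 / (1.0 + path_length * (k ** ic_lcs))

# Toy values standing in for concept pairs in a KG such as WordNet:
# same raw path length, but a more specific shared ancestor for the first pair.
print(wpath(path_length=3, ic_lcs=6.0))   # specific LCS -> higher similarity
print(wpath(path_length=3, ic_lcs=1.0))   # generic LCS  -> lower similarity
```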

179 citations


Journal ArticleDOI
TL;DR: This paper proposes an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. The method can make use of existing cross-lingual lexicons to construct high-quality vector spaces for a plethora of different languages, facilitating semantic transfer from high- to lower-resource ones.
Abstract: We present Attract-Repel, an algorithm for improving the semantic quality of word vectors by injecting constraints extracted from lexical resources. Attract-Repel facilitates the use of constraints from mono- and cross-lingual resources, yielding semantically specialized cross-lingual vector spaces. Our evaluation shows that the method can make use of existing cross-lingual lexicons to construct high-quality vector spaces for a plethora of different languages, facilitating semantic transfer from high- to lower-resource ones. The effectiveness of our approach is demonstrated with state-of-the-art results on semantic similarity datasets in six languages. We next show that Attract-Repel-specialized vectors boost performance in the downstream task of dialogue state tracking (DST) across multiple languages. Finally, we show that cross-lingual vector spaces produced by our algorithm facilitate the training of multilingual DST models, which brings further performance improvements.
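A heavily simplified sketch of the constraint-injection idea (not the authors' margin-based mini-batch objective): synonym pairs are pulled together, antonym pairs pushed apart up to a margin, and a regularizer keeps vectors near their original distributional positions. Words, vectors and hyperparameters are hypothetical.

```python
import numpy as np

def specialise(vectors, synonyms, antonyms, lr=0.1, reg=0.05, margin=3.0, steps=50):
    """Very simplified constraint injection: ATTRACT pairs are pulled together,
    REPEL pairs pushed apart up to a margin, and a regulariser keeps every
    vector close to its original (distributional) position. This is only a
    sketch of the idea, not the Attract-Repel objective itself."""
    original = {w: v.copy() for w, v in vectors.items()}
    for _ in range(steps):
        for a, b in synonyms:                       # attract: reduce distance
            diff = vectors[a] - vectors[b]
            vectors[a] -= lr * diff
            vectors[b] += lr * diff
        for a, b in antonyms:                       # repel: push apart up to margin
            diff = vectors[a] - vectors[b]
            if np.linalg.norm(diff) < margin:
                vectors[a] += lr * diff
                vectors[b] -= lr * diff
        for w in vectors:                           # stay near the original space
            vectors[w] -= reg * (vectors[w] - original[w])
    return vectors

rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=4) for w in ["cheap", "inexpensive", "expensive"]}
vecs = specialise(vecs, synonyms=[("cheap", "inexpensive")],
                  antonyms=[("cheap", "expensive")])
print(np.linalg.norm(vecs["cheap"] - vecs["inexpensive"]),   # small after attract
      np.linalg.norm(vecs["cheap"] - vecs["expensive"]))     # near the repel margin
```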

177 citations


Proceedings ArticleDOI
04 Aug 2017
TL;DR: Results show that systems that combine statistical knowledge from text corpora, in the form of word embeddings, and external knowledge from lexical resources are best performers in both subtasks.
Abstract: This paper introduces a new task on Multilingual and Cross-lingual Semantic Word Similarity which measures the semantic similarity of word pairs within and across five languages: English, Farsi, German, Italian and Spanish. High quality datasets were manually curated for the five languages with high inter-annotator agreements (consistently in the 0.9 ballpark). These were used for semi-automatic construction of ten cross-lingual datasets. 17 teams participated in the task, submitting 24 systems in subtask 1 and 14 systems in subtask 2. Results show that systems that combine statistical knowledge from text corpora, in the form of word embeddings, and external knowledge from lexical resources are best performers in both subtasks. More information can be found on the task website: http://alt.qcri.org/semeval2017/task2.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a stagewise bidirectional latent embedding framework of two subsequent learning stages for zero-shot visual recognition, where the bottom-up stage explores the topological and labeling information underlying training data of known classes via a proper supervised subspace learning algorithm and the latent embeddings of training data are used to form landmarks that guide embedding semantics underlying unseen classes into this learned latent space.
Abstract: Zero-shot learning for visual recognition, e.g., object and action recognition, has recently attracted a lot of attention. However, it still remains challenging in bridging the semantic gap between visual features and their underlying semantics and transferring knowledge to semantic categories unseen during learning. Unlike most of the existing zero-shot visual recognition methods, we propose a stagewise bidirectional latent embedding framework of two subsequent learning stages for zero-shot visual recognition. In the bottom-up stage, a latent embedding space is first created by exploring the topological and labeling information underlying training data of known classes via a proper supervised subspace learning algorithm, and the latent embeddings of training data are used to form landmarks that guide embedding semantics underlying unseen classes into this learned latent space. In the top-down stage, semantic representations of unseen-class labels in a given label vocabulary are then embedded into the same latent space to preserve the semantic relatedness between all different classes via our proposed semi-supervised Sammon mapping with the guidance of landmarks. Thus, the resultant latent embedding space allows for predicting the label of a test instance with a simple nearest-neighbor rule. To evaluate the effectiveness of the proposed framework, we have conducted extensive experiments on four benchmark datasets in object and action recognition, i.e., AwA, CUB-200-2011, UCF101 and HMDB51. The experimental results under comparative studies demonstrate that our proposed approach yields state-of-the-art performance under inductive and transductive settings.

Journal ArticleDOI
TL;DR: The associative theory of creativity states that creativity is associated with differences in the structure of semantic memory, whereas the executive theory emphasises the role of top-down control for creative thought as discussed by the authors.
Abstract: The associative theory of creativity states that creativity is associated with differences in the structure of semantic memory, whereas the executive theory of creativity emphasises the role of top-down control for creative thought. For a powerful test of these accounts, individual semantic memory structure was modelled with a novel method based on semantic relatedness judgements and different criteria for network filtering were compared. The executive account was supported by a correlation between creative ability and broad retrieval ability. The associative account was independently supported, when network filtering was based on a relatedness threshold, but not when it was based on a fixed edge number or on the analysis of weighted networks. In the former case, creative ability was associated with shorter average path lengths and higher clustering of the network, suggesting that the semantic networks of creative people show higher small-worldness.
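A small networkx sketch of the relatedness-threshold filtering compared in the study: build a graph from pairwise relatedness judgements, keep edges above a threshold, and compute the average shortest path length and clustering coefficient used to assess small-worldness. The judgements here are random stand-ins.

```python
import itertools
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
words = ["cat", "dog", "bone", "moon", "star", "night"]
# Hypothetical pairwise relatedness judgements in [0, 1].
rel = {pair: rng.uniform(0, 1) for pair in itertools.combinations(words, 2)}

def filtered_network(rel, threshold):
    """Relatedness-threshold filtering: keep an edge only if the judged
    relatedness of the word pair exceeds the threshold."""
    g = nx.Graph()
    g.add_nodes_from(words)
    g.add_edges_from(pair for pair, r in rel.items() if r > threshold)
    return g

g = filtered_network(rel, threshold=0.5)
print("avg clustering:", nx.average_clustering(g))
if nx.is_connected(g):       # path length is only defined on a connected graph
    print("avg path length:", nx.average_shortest_path_length(g))
```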

Journal ArticleDOI
TL;DR: This work proposes several approaches for sentence‐level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus.
Abstract: Motivation: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. Methods: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. Results: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric. Availability and implementation: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/. Contact: gizemsogancioglu@gmail.com or arzucan.ozgur@boun.edu.tr.
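The sketch below illustrates only the final supervised step in miniature: a couple of per-pair similarity features (two simple string measures here, standing in for the paper's richer string, embedding and ontology features) are combined by a regression model fit to gold ratings. Sentences and ratings are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def jaccard(s1, s2):
    """Token-overlap string similarity (one simple lexical feature)."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

def char_dice(s1, s2):
    """Character-bigram Dice coefficient (a second, character-level feature)."""
    b1 = {s1[i:i + 2] for i in range(len(s1) - 1)}
    b2 = {s2[i:i + 2] for i in range(len(s2) - 1)}
    return 2 * len(b1 & b2) / (len(b1) + len(b2))

pairs = [("the drug inhibits the enzyme", "the compound blocks the enzyme"),
         ("the drug inhibits the enzyme", "mice were housed in cages")]
gold = np.array([3.4, 0.2])          # hypothetical expert similarity ratings

X = np.array([[jaccard(a, b), char_dice(a, b)] for a, b in pairs])
model = LinearRegression().fit(X, gold)   # supervised combination of the metrics
print(model.predict(X))
```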

Posted Content
TL;DR: This article proposed relevance-based word embedding models that learn word representations based on query-document relevance information and classify each term as belonging to the relevant or non-relevant class for each query.
Abstract: Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe.
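The sketch below approximates only the training signal the abstract describes, not the neural models: a relevance distribution over the vocabulary for a query, estimated from the top-ranked documents that are assumed relevant (here via simple pooled term frequencies over hypothetical documents).

```python
from collections import Counter

def relevance_distribution(top_docs):
    """Estimate a relevance distribution p(w | query) from the top-ranked
    documents assumed relevant: here simply the normalised term frequency
    pooled over those documents. The paper's neural models are trained
    against this kind of signal; they are not shown here."""
    counts = Counter()
    for doc in top_docs:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

top_docs = [                      # hypothetical retrieved documents for a query
    "jaguar speed top speed of the jaguar big cat",
    "the jaguar is the largest cat in the americas",
]
dist = relevance_distribution(top_docs)
print(sorted(dist.items(), key=lambda kv: -kv[1])[:5])
```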

Proceedings ArticleDOI
07 Aug 2017
TL;DR: Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding model, such as word2vec and GloVe.
Abstract: Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this article, the role of semantics in zero-shot learning is considered and the effectiveness of previous approaches is analyzed according to the form of supervision provided, while some learn semantics independently, others only supervise the semantic subspace explained by training classes.
Abstract: The role of semantics in zero-shot learning is considered. The effectiveness of previous approaches is analyzed according to the form of supervision provided. While some learn semantics independently, others only supervise the semantic subspace explained by training classes. Thus, the former is able to constrain the whole space but lacks the ability to model semantic correlations. The latter addresses this issue but leaves part of the semantic space unsupervised. This complementarity is exploited in a new convolutional neural network (CNN) framework, which proposes the use of semantics as constraints for recognition. Although a CNN trained for classification has no transfer ability, this can be encouraged by learning a hidden semantic layer together with a semantic code for classification. Two forms of semantic constraints are then introduced. The first is a loss-based regularizer that introduces a generalization constraint on each semantic predictor. The second is a codeword regularizer that favors semantic-to-class mappings consistent with prior semantic knowledge while allowing these to be learned from data. Significant improvements over the state-of-the-art are achieved on several datasets.

Proceedings ArticleDOI
Albert Gordo1, Diane Larlus1
21 Jul 2017
TL;DR: This work shows that, despite its subjective nature, the task of semantically ranking visual scenes is consistently implemented across a pool of human annotators and forms a good computable surrogate for semantic image retrieval in complex scenes.
Abstract: Querying with an example image is a simple and intuitive interface to retrieve information from a visual database. Most of the research in image retrieval has focused on the task of instance-level image retrieval, where the goal is to retrieve images that contain the same object instance as the query image. In this work we move beyond instance-level retrieval and consider the task of semantic image retrieval in complex scenes, where the goal is to retrieve images that share the same semantics as the query image. We show that, despite its subjective nature, the task of semantically ranking visual scenes is consistently implemented across a pool of human annotators. We also show that a similarity based on human-annotated region-level captions is highly correlated with the human ranking and constitutes a good computable surrogate. Following this observation, we learn a visual embedding of the images where the similarity in the visual space is correlated with their semantic similarity surrogate. We further extend our model to learn a joint embedding of visual and textual cues that allows one to query the database using a text modifier in addition to the query image, adapting the results to the modifier. Finally, our model can ground the ranking decisions by showing regions that contributed the most to the similarity between pairs of images, providing a visual explanation of the similarity.

Journal ArticleDOI
TL;DR: This work proposes a novel approach to computing semantic distance, based on network science methodology, and demonstrates how this approach addresses key issues in cognitive theory, namely the breadth of the spreading activation process and the effect of semantic distance on memory retrieval.
Abstract: Semantic distance is a determining factor in cognitive processes, such as semantic priming, operating upon semantic memory. The main computational approach to computing semantic distance is latent semantic analysis (LSA). However, objections have been raised against this approach, mainly its failure to predict semantic priming. We propose a novel approach to computing semantic distance, based on network science methodology. Path length in a semantic network represents the number of steps needed to traverse from one word in the network to another. We examine whether path length can be used as a measure of semantic distance by investigating how path length affects performance in a semantic relatedness judgment task and recall from memory. Our results show a differential effect on performance: for word-pairs separated by up to 4 steps, participants exhibit an increase in reaction time (RT) and a decrease in the percentage of word-pairs judged as related. From 4 steps onward, participants exhibit a significant decrease in RT and the word-pairs are dominantly judged as unrelated. Furthermore, we show that as path length between word-pairs increases, success in free- and cued-recall decreases. Finally, we demonstrate how our measure outperforms computational methods measuring semantic distance (LSA and positive pointwise mutual information) in predicting participants' RT and subjective judgments of semantic strength. Thus, we provide a computational alternative to computing semantic distance. Furthermore, this approach addresses key issues in cognitive theory, namely the breadth of the spreading activation process and the effect of semantic distance on memory retrieval.
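A tiny networkx sketch of the measure itself: path length as the number of edges traversed between two words in a semantic network. The network below is a hypothetical toy; the study builds its network from behavioral data.

```python
import networkx as nx

# Toy semantic network (hypothetical association edges).
g = nx.Graph()
g.add_edges_from([("cat", "dog"), ("dog", "bone"), ("bone", "skeleton"),
                  ("skeleton", "halloween"), ("halloween", "pumpkin")])

# Path length = number of steps needed to traverse from one word to another;
# the study relates this count (e.g. under vs. over 4 steps) to RT and recall.
print(nx.shortest_path_length(g, "cat", "bone"))      # 2 steps
print(nx.shortest_path_length(g, "cat", "pumpkin"))   # 5 steps
```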

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper proposes a simple, yet effective generalized hashing framework which can work for all the different scenarios, while preserving the semantic distance between the data points, and learns the optimum hash codes for the two modalities simultaneously.
Abstract: Due to the availability of large amounts of multimedia data, cross-modal matching is gaining increasing importance. Hashing based techniques provide an attractive solution to this problem when the data size is large. Different scenarios of cross-modal matching are possible; for example, data from the different modalities can be associated with a single label or multiple labels, and in addition may or may not have one-to-one correspondence. Most of the existing approaches have been developed for the case where there is one-to-one correspondence between the data of the two modalities. In this paper, we propose a simple, yet effective generalized hashing framework which can work for all the different scenarios, while preserving the semantic distance between the data points. The approach first learns the optimum hash codes for the two modalities simultaneously, so as to preserve the semantic similarity between the data points, and then learns the hash functions to map from the features to the hash codes. Extensive experiments on a single-label dataset (Wiki) and multi-label datasets (NUS-WIDE, Pascal and LabelMe) under all the different scenarios, and comparisons with the state-of-the-art, show the effectiveness of the proposed approach.

02 Oct 2017
TL;DR: This article evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants, on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks.
Abstract: Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants. We trained 31 word embedding models using FastText, GloVe, Wang2Vec and Word2Vec. We evaluated them intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks. The obtained results suggest that word analogies are not appropriate for word embedding evaluation and that task-specific evaluations may be a better option; that Wang2Vec appears to be a robust model; and that the increase in performance in our evaluations with bigger models is not worth the increase in memory usage for models with more than 300 dimensions.
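A minimal sketch of the intrinsic analogy evaluation mentioned above, using the standard 3CosAdd rule ("a is to b as c is to ?") over toy vectors; a real run would use the trained Portuguese embeddings and a full analogy test set.

```python
import numpy as np

# Toy vectors; a real evaluation would use the trained FastText/GloVe/
# Wang2Vec/Word2Vec embeddings and a Portuguese analogy test set.
vecs = {
    "rei": np.array([0.9, 0.1, 0.8]), "rainha": np.array([0.9, 0.9, 0.8]),
    "homem": np.array([0.5, 0.1, 0.2]), "mulher": np.array([0.5, 0.9, 0.2]),
}

def analogy(a, b, c):
    """3CosAdd: answer 'a is to b as c is to ?' with the word whose vector is
    closest (by cosine) to b - a + c, excluding the three query words."""
    target = vecs[b] - vecs[a] + vecs[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    scores = {w: cos(v, target) for w, v in vecs.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

print(analogy("homem", "mulher", "rei"))   # expected: "rainha"
```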

Journal ArticleDOI
TL;DR: It is found that semantic relatedness, as quantified by these models, is able to provide a good measure of the associations involved in judgment, and, in turn, predict responses in a large number of existing and novel judgment tasks.
Abstract: I study associative processing in high-level judgment using vector space semantic models. I find that semantic relatedness, as quantified by these models, is able to provide a good measure of the associations involved in judgment, and, in turn, predict responses in a large number of existing and novel judgment tasks. My results shed light on the representations underlying judgment, and highlight the close relationship between these representations and those at play in language and in the assessment of word meaning. In doing so, they show how one of the best-known and most studied theories in decision making research can be formalized to make quantitative a priori predictions, and how this theory can be rigorously tested on a wide range of natural language judgment problems.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed binary-oriented, obfuscation-resilient binary code similarity comparison method can be applied to software plagiarism and algorithm detection, and is effective and practical to analyze real-world software.
Abstract: Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscations. Identifying similar or identical code fragments among programs is very important in some applications. For example, one application is to detect illegal code reuse. In code theft cases, emerging obfuscation techniques have made automated detection increasingly difficult. Another application is to identify cryptographic algorithms, which are widely employed by modern malware to circumvent detection, hide network communications, and protect payloads among other purposes. Due to diverse coding styles and high programming flexibility, different implementations of the same algorithm may appear very distinct, making automatic detection very hard, let alone when code obfuscations are applied. In this paper, we propose a binary-oriented, obfuscation-resilient binary code similarity comparison method based on a new concept, longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with longest common subsequence based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantic equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantic similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype. The experimental results show that our method can be applied to software plagiarism and algorithm detection, and is effective and practical to analyze real-world software.
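A short sketch of the path-comparison step: longest common subsequence over basic-block sequences, where block "equality" is delegated to a pluggable semantic-equivalence check (a theorem prover over symbolic input-output formulas in the paper; a hypothetical stub predicate here).

```python
def lcs_length(path_a, path_b, equivalent):
    """Longest common subsequence of two basic-block sequences, where block
    'equality' is a pluggable semantic-equivalence check (a theorem prover over
    symbolic input-output formulas in the paper; a stub predicate here)."""
    m, n = len(path_a), len(path_b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if equivalent(path_a[i - 1], path_b[j - 1]):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Hypothetical block summaries: x*2 and x<<1 are syntactically different but
# semantically equivalent, which a prover would establish in practice.
path1 = ["out=in1+in2", "out=in*2", "out=in-1"]
path2 = ["out=in1+in2", "out=in<<1", "out=in-1"]
print(lcs_length(path1, path2, lambda a, b: a == b))                           # 2
print(lcs_length(path1, path2,
                 lambda a, b: a == b or {a, b} == {"out=in*2", "out=in<<1"}))  # 3
```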

Posted Content
TL;DR: XGAN as discussed by the authors is a dual adversarial autoencoder that captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions.
Abstract: Style transfer usually refers to the task of applying color and texture information from a specific style image to a given content image while preserving the structure of the latter. Here we tackle the more generic problem of semantic style transfer: given two unpaired collections of images, we aim to learn a mapping between the corpus-level style of each collection, while preserving semantic content shared across the two domains. We introduce XGAN ("Cross-GAN"), a dual adversarial autoencoder, which captures a shared representation of the common domain semantic content in an unsupervised way, while jointly learning the domain-to-domain image translations in both directions. We exploit ideas from the domain adaptation literature and define a semantic consistency loss which encourages the model to preserve semantics in the learned embedding space. We report promising qualitative results for the task of face-to-cartoon translation. The cartoon dataset, CartoonSet, we collected for this purpose is publicly available at this http URL as a new benchmark for semantic style transfer.

Posted Content
TL;DR: This article evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants, on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks.
Abstract: Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants. We trained 31 word embedding models using FastText, GloVe, Wang2Vec and Word2Vec. We evaluated them intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks. The obtained results suggest that word analogies are not appropriate for word embedding evaluation; task-specific evaluations appear to be a better option.

Journal ArticleDOI
TL;DR: This book proposes an in-depth characterization of existing proposals for semantic similarity estimation by discussing their features, the assumptions on which they are based and empirical results regarding their performance in particular applications, and provides a detailed discussion on the foundations of semantic measures.
Abstract: Artificial Intelligence federates numerous scientific fields in the aim of developing machines able to assist human operators performing complex treatments -- most of which demand high cognitive skills (e.g. learning or decision processes). Central to this quest is to give machines the ability to estimate the likeness or similarity between things in the way human beings estimate the similarity between stimuli. In this context, this book focuses on semantic measures: approaches designed for comparing semantic entities such as units of language, e.g. words, sentences, or concepts and instances defined in knowledge bases. The aim of these measures is to assess the similarity or relatedness of such semantic entities by taking into account their semantics, i.e. their meaning -- intuitively, the words tea and coffee, which both refer to a stimulating beverage, will be estimated to be more semantically similar than the words toffee (confection) and coffee, even though the latter pair has a higher syntactic similarity. The two state-of-the-art approaches for estimating and quantifying the semantic similarity/relatedness of semantic entities are presented in detail: the first relies on corpus analysis and is based on Natural Language Processing techniques and semantic models, while the second is based on more or less formal, computer-readable and workable forms of knowledge such as semantic networks, thesauri or ontologies. (...) Beyond a simple inventory and categorization of existing measures, the aim of this monograph is to guide novices as well as researchers of these domains towards a better understanding of semantic similarity estimation and, more generally, semantic measures.

Journal ArticleDOI
TL;DR: Reanalysing electroencephalography and functional magnetic resonance imaging data from studies in which participants comprehend naturalistic stimuli indicates that both predictability and similarity play a role during natural language comprehension and modulate distinct cortical regions.
Abstract: We investigate the effects of two types of relationship between the words of a sentence or text – predictability and semantic similarity – by reanalysing electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) data from studies in which participants comprehend naturalistic stimuli. Each content word's predictability given previous words is quantified by a probabilistic language model, and semantic similarity to previous words is quantified by a distributional semantics model. Brain activity time-locked to each word is regressed on the two model-derived measures. Results show that predictability and semantic similarity have near identical N400 effects but are dissociated in the fMRI data, with word predictability related to activity in, among others, the visual word-form area, and semantic similarity related to activity in areas associated with the semantic network. This indicates that both predictability and similarity play a role during natural language comprehension and modulate distinct cortical regions.

Journal ArticleDOI
TL;DR: This research proposes a state-of-the-art approach for paraphrase identification and semantic text similarity analysis in Arabic news tweets that adopts several phases of text processing, features extraction and text classification.
Abstract: The rapid growth in digital information has raised considerable challenges, in particular when it comes to automated content analysis. Social media platforms such as Twitter share a lot of their users' information about events, opinions, personalities, etc. Paraphrase Identification (PI) is concerned with recognizing whether two texts have the same/similar meaning, whereas Semantic Text Similarity (STS) is concerned with the degree of that similarity. This research proposes a state-of-the-art approach for paraphrase identification and semantic text similarity analysis in Arabic news tweets. The approach adopts several phases of text processing, feature extraction and text classification. Lexical, syntactic, and semantic features are extracted to overcome the weaknesses and limitations of the current technologies in solving these tasks for the Arabic language. Maximum Entropy (MaxEnt) and Support Vector Regression (SVR) classifiers are trained using these features and are evaluated using a dataset prepared for this research. The experimental results show that the approach achieves good results in comparison to the baseline.

Journal ArticleDOI
TL;DR: A new theoretical and methodological framework for cognitive divergent-thinking studies is presented and it is shown that the semantic distance of responses significantly predicted the average creativity rating given to the response, with significant variation in average levels of creativity across participants.
Abstract: Divergent thinking has often been used as a proxy measure of creative thinking, but this practice lacks a foundation in modern cognitive psychological theory. This article addresses several issues with the classic divergent-thinking methodology and presents a new theoretical and methodological framework for cognitive divergent-thinking studies. A secondary analysis of a large dataset of divergent-thinking responses is presented. Latent semantic analysis was used to examine the potential changes in semantic distance between responses and the concept represented by the divergent-thinking prompt across successive response iterations. The results of linear growth modeling showed that although there is some linear increase in semantic distance across response iterations, participants high in fluid intelligence tended to give more distant initial responses than those with lower fluid intelligence. Additional analyses showed that the semantic distance of responses significantly predicted the average creativity rating given to the response, with significant variation in average levels of creativity across participants. Finally, semantic distance does not seem to be related to participants' choices of their own most creative responses. Implications for cognitive theories of creativity are discussed, along with the limitations of the methodology and directions for future research.

Proceedings Article
24 Apr 2017
TL;DR: The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks.
Abstract: We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such captures the semantic meanings of the document during learning. A corruption model is included, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common and non-discriminative ones to be close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or out-performs the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification as well as semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
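A toy sketch of the representation itself: a document vector as the simple average of its word embeddings, with optional corruption (random word dropout) of the kind Doc2VecC applies during training. The embeddings are random stand-ins and the training objective is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "great", "but", "plot", "boring"]
emb = {w: rng.normal(size=6) for w in vocab}      # toy word embeddings

def doc_vector(words, keep_prob=1.0):
    """Doc2VecC-style representation: the average of the embeddings of the
    document's words. With keep_prob < 1 the document is 'corrupted' by
    randomly dropping words, as done during training (idea only, no training)."""
    kept = [w for w in words if w in emb and rng.random() < keep_prob]
    if not kept:                                   # fall back to the full document
        kept = [w for w in words if w in emb]
    return np.mean([emb[w] for w in kept], axis=0)

doc = "the movie was great but the plot was boring".split()
print(doc_vector(doc).round(2))                    # full average
print(doc_vector(doc, keep_prob=0.7).round(2))     # corrupted average
```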