
Showing papers on "Semantic similarity published in 2016"


Proceedings Article
12 Feb 2016
TL;DR: A siamese adaptation of the Long Short-Term Memory network is presented for labeled data comprising pairs of variable-length sequences; restricting subsequent operations to a simple Manhattan metric compels the sentence representations learned by the model to form a highly structured space whose geometry reflects complex semantic relationships.
Abstract: We present a siamese adaptation of the Long Short-Term Memory (LSTM) network for labeled data comprised of pairs of variable-length sequences. Our model is applied to assess semantic similarity between sentences, where we exceed state of the art, outperforming carefully handcrafted features and recently proposed neural network systems of greater complexity. For these applications, we provide word-embedding vectors supplemented with synonymic information to the LSTMs, which use a fixed size vector to encode the underlying meaning expressed in a sentence (irrespective of the particular wording/syntax). By restricting subsequent operations to rely on a simple Manhattan metric, we compel the sentence representations learned by our model to form a highly structured space whose geometry reflects complex semantic relationships. Our results are the latest in a line of findings that showcase LSTMs as powerful language models capable of tasks requiring intricate understanding.
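As a concrete illustration of the scoring described above, the similarity between two fixed-size sentence encodings can be computed as a negative exponential of their Manhattan (L1) distance, mapping distances into (0, 1]. This is a minimal numpy sketch of that metric only, not the authors' full training pipeline; the example vectors are invented:

```python
import numpy as np

def manhattan_similarity(h1, h2):
    # Map the L1 distance between two sentence encodings into (0, 1];
    # identical encodings score exactly 1.0.
    return np.exp(-np.sum(np.abs(h1 - h2)))

# Toy usage with hypothetical 4-dimensional encodings
a = np.array([0.2, -0.1, 0.5, 0.0])
b = np.array([0.1, -0.1, 0.4, 0.1])
print(manhattan_similarity(a, b))  # close to 1.0 for similar sentences
```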

839 citations


Proceedings Article
01 Dec 2016
TL;DR: Two target-dependent long short-term memory models are developed in which target information is automatically taken into account; they achieve state-of-the-art performance without using a syntactic parser or external sentiment lexicons.
Abstract: Target-dependent sentiment classification remains a challenge: modeling the semantic relatedness of a target with its context words in a sentence. Different context words have different influences on determining the sentiment polarity of a sentence towards the target. Therefore, it is desirable to integrate the connections between target word and context words when building a learning system. In this paper, we develop two target dependent long short-term memory (LSTM) models, where target information is automatically taken into account. We evaluate our methods on a benchmark dataset from Twitter. Empirical results show that modeling sentence representation with standard LSTM does not perform well. Incorporating target information into LSTM can significantly boost the classification accuracy. The target-dependent LSTM models achieve state-of-the-art performances without using syntactic parser or external sentiment lexicons.

543 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: Paper presented at the 10th International Workshop on Semantic Evaluation (SemEval-2016), held on 16-17 June 2016 in San Diego, California.

Abstract: Paper presented at the 10th International Workshop on Semantic Evaluation (SemEval-2016), held on 16-17 June 2016 in San Diego, California.

529 citations


Proceedings ArticleDOI
01 Jan 2016
TL;DR: The authors inject antonymy and synonymy constraints into vector space representations in order to improve the vectors' capability for judging semantic similarity, leading to new state-of-the-art performance on the SimLex-999 dataset.
Abstract: In this work, we present a novel counter-fitting method which injects antonymy and synonymy constraints into vector space representations in order to improve the vectors' capability for judging semantic similarity. Applying this method to publicly available pre-trained word vectors leads to a new state-of-the-art performance on the SimLex-999 dataset. We also show how the method can be used to tailor the word vector space for the downstream task of dialogue state tracking, resulting in robust improvements across different dialogue domains.
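Counter-fitting optimizes a combined objective (antonym repulsion, synonym attraction, vector-space preservation). The following is a simplified, projection-style sketch of that idea under assumed margins and learning rate, not the paper's exact objective or hyperparameters:

```python
import numpy as np

def counter_fit(vectors, synonyms, antonyms,
                iters=20, lr=0.1, ant_margin=1.0, reg=0.05):
    # vectors: dict word -> np.array; synonyms/antonyms: (word, word) pairs.
    orig = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iters):
        for a, b in synonyms:            # attract synonym pairs
            diff = vectors[a] - vectors[b]
            vectors[a] -= lr * diff
            vectors[b] += lr * diff
        for a, b in antonyms:            # repel antonym pairs inside the margin
            diff = vectors[a] - vectors[b]
            if np.linalg.norm(diff) < ant_margin:
                vectors[a] += lr * diff
                vectors[b] -= lr * diff
        for w in vectors:                # stay close to the pre-trained space
            vectors[w] -= lr * reg * (vectors[w] - orig[w])
    return vectors
```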

332 citations


Posted Content
TL;DR: A convolutional spatial transformer to mimic patch normalization in traditional features like SIFT is proposed, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations.
Abstract: We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning across rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity. Our fully convolutional architecture, along with a novel correspondence contrastive loss allows faster training by effective reuse of computations, accurate gradient computation through the use of thousands of examples per image pair and faster testing with $O(n)$ feed forward passes for $n$ keypoints, instead of $O(n^2)$ for typical patch similarity methods. We propose a convolutional spatial transformer to mimic patch normalization in traditional features like SIFT, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations. Extensive experiments on KITTI, PASCAL, and CUB-2011 datasets demonstrate the significant advantages of our features over prior works that use either hand-constructed or learned features.
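The correspondence contrastive loss mentioned above can be sketched as a standard contrastive loss applied to features sampled at keypoint pairs; the margin value below is an illustrative assumption:

```python
import numpy as np

def correspondence_contrastive_loss(f1, f2, labels, margin=1.0):
    # f1, f2: (n, d) features at n keypoint pairs; labels: (n,) array with
    # 1 for true correspondences and 0 for non-matches.
    d = np.linalg.norm(f1 - f2, axis=1)
    pos = labels * d ** 2                                  # pull matches together
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2  # push non-matches apart
    return float(np.mean(pos + neg))
```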

303 citations


Proceedings ArticleDOI
07 Jul 2016
TL;DR: A simple, fast, and effective topic model for short texts, named GPU-DMM, based on the Dirichlet Multinomial Mixture model, which achieves comparable or better topic representations than state-of-the-art models, measured by topic coherence.
Abstract: For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics from short texts is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, due to the length of each document, short texts are much more sparse in terms of word co-occurrences. Data sparsity therefore becomes a bottleneck for conventional topic models to achieve good results on short texts. On the other hand, when a human being interprets a piece of short text, the understanding is not solely based on its content words, but also her background knowledge (e.g., semantically related words). The recent advances in word embedding offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper. To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture (DMM) model, GPU-DMM promotes the semantically related words under the same topic during the sampling process by using the generalized Polya urn (GPU) model. In this sense, the background knowledge about word semantic relatedness learned from millions of external documents can be easily exploited to improve topic modeling for short texts. Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or better topic representations than state-of-the-art models, measured by topic coherence. The learned topic representation leads to the best accuracy in text classification task, which is used as an indirect evaluation.
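A rough sketch of the generalized Polya urn (GPU) step: when a word is assigned to a topic during sampling, its embedding neighbors receive a smaller pseudo-count under the same topic. The neighbor list and promotion weight below are illustrative assumptions, not values from the paper:

```python
from collections import defaultdict

def gpu_promote(topic_word_counts, topic, word, related, weight=0.3):
    # Assign `word` to `topic`, then promote semantically related words
    # (e.g., embedding nearest neighbors) under the same topic.
    topic_word_counts[topic][word] += 1.0
    for w in related.get(word, []):
        topic_word_counts[topic][w] += weight

counts = defaultdict(lambda: defaultdict(float))
neighbors = {"coffee": ["tea", "espresso"]}  # hypothetical neighbor list
gpu_promote(counts, topic=3, word="coffee", related=neighbors)
```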

293 citations


Journal ArticleDOI
TL;DR: A unified framework is proposed to expand short texts based on word embedding clustering and convolutional neural networks, with semantic cliques discovered via fast clustering; experiments on two open benchmarks validate the effectiveness of the proposed method.

268 citations


Proceedings ArticleDOI
01 Jun 2016
TL;DR: This work proposes to explicitly model pairwise word interactions and present a novel similarity focus mechanism to identify important correspondences for better similarity measurement.
Abstract: Textual similarity measurement is a challenging problem, as it requires understanding the semantics of input sentences. Most previous neural network models use coarse-grained sentence modeling, which has difficulty capturing fine-grained word-level information for semantic comparisons. As an alternative, we propose to explicitly model pairwise word interactions and present a novel similarity focus mechanism to identify important correspondences for better similarity measurement. Our ideas are implemented in a novel neural network architecture that demonstrates state-of-the-art accuracy on three SemEval tasks and two answer selection tasks.
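The pairwise word interactions can be pictured as a similarity matrix between the two sentences' word embeddings, which the focus mechanism then reweights. A minimal sketch, assuming cosine similarity as the interaction measure:

```python
import numpy as np

def pairwise_interactions(s1, s2):
    # s1: (m, d), s2: (n, d) word-embedding matrices; entry (i, j) is the
    # cosine similarity of word i in sentence 1 and word j in sentence 2.
    a = s1 / np.linalg.norm(s1, axis=1, keepdims=True)
    b = s2 / np.linalg.norm(s2, axis=1, keepdims=True)
    return a @ b.T
```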

257 citations


Journal ArticleDOI
TL;DR: This study proposes a basic set of approximately 65 experiential attributes based on neurobiological considerations, comprising sensory, motor, spatial, temporal, affective, social, and cognitive experiences, and discusses how this representation might deal with various longstanding problems in semantic theory, such as feature selection and weighting, representation of abstract concepts, effects of context on semantic retrieval, and conceptual combination.
Abstract: Componential theories of lexical semantics assume that concepts can be represented by sets of features or attributes that are in some sense primitive or basic components of meaning. The binary features used in classical category and prototype theories are problematic in that these features are themselves complex concepts, leaving open the question of what constitutes a primitive feature. The present availability of brain imaging tools has enhanced interest in how concepts are represented in brains, and accumulating evidence supports the claim that these representations are at least partly "embodied" in the perception, action, and other modal neural systems through which concepts are experienced. In this study we explore the possibility of devising a componential model of semantic representation based entirely on such functional divisions in the human brain. We propose a basic set of approximately 65 experiential attributes based on neurobiological considerations, comprising sensory, motor, spatial, temporal, affective, social, and cognitive experiences. We provide normative data on the salience of each attribute for a large set of English nouns, verbs, and adjectives, and show how these attribute vectors distinguish a priori conceptual categories and capture semantic similarity. Robust quantitative differences between concrete object categories were observed across a large number of attribute dimensions. A within- versus between-category similarity metric showed much greater separation between categories than representations derived from distributional (latent semantic) analysis of text. Cluster analyses were used to explore the similarity structure in the data independent of a priori labels, revealing several novel category distinctions. We discuss how such a representation might deal with various longstanding problems in semantic theory, such as feature selection and weighting, representation of abstract concepts, effects of context on semantic retrieval, and conceptual combination. In contrast to componential models based on verbal features, the proposed representation systematically relates semantic content to large-scale brain networks and biologically plausible accounts of concept acquisition.
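Given the attribute vectors described above, semantic similarity between two concepts reduces to a vector comparison; here is a minimal sketch using cosine similarity, with attribute names and salience values invented purely for illustration:

```python
import numpy as np

def attribute_similarity(u, v):
    # Cosine similarity between experiential-attribute salience vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical salience ratings on three of the ~65 attributes
dog = np.array([5.1, 4.8, 3.9])  # e.g., vision, motion, social
cat = np.array([4.9, 4.5, 3.2])
print(attribute_similarity(dog, cat))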

217 citations


Journal ArticleDOI
TL;DR: A novel multilingual vector representation, called Nasari, is put forward, which not only enables accurate representation of word senses in different languages but also provides two main advantages over existing approaches: high coverage and comparability across languages and linguistic levels.

215 citations


Posted Content
TL;DR: The authors present several problems associated with evaluating word vectors on word similarity datasets and summarize existing solutions; their study suggests that the use of word similarity tasks for evaluating word vectors is not sustainable and calls for further research on evaluation methods.
Abstract: Lacking standardized extrinsic evaluation methods for vector representations of words, the NLP community has relied heavily on word similarity tasks as a proxy for intrinsic evaluation of word vectors. Word similarity evaluation, which correlates the distance between vectors and human judgments of semantic similarity, is attractive because it is computationally inexpensive and fast. In this paper we present several problems associated with the evaluation of word vectors on word similarity datasets, and summarize existing solutions. Our study suggests that the use of word similarity tasks for evaluation of word vectors is not sustainable and calls for further research on evaluation methods.

Journal ArticleDOI
TL;DR: A weight-based model and a learning procedure based on a novel median-based loss function, designed to mitigate the negative effect of outliers, are presented; the method outperforms the baseline approaches in the experiments and generalizes well to different word embeddings without retraining.

Proceedings ArticleDOI
27 Jun 2016
TL;DR: Strong evidence is found that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.
Abstract: Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors. This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations. We improve this previous work by incorporating knowledge about object similarities from visual and semantic domains during the transfer process. The intuition behind our proposed method is that visually and semantically similar categories should exhibit more common transferable properties than dissimilar categories, e.g. a better detector would result from transforming the differences between a dog classifier and a dog detector onto the cat class than from transforming from the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object similarity based knowledge transfer methods outperforms the baseline methods. We found strong evidence that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.
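One way to picture the transfer step: the classifier-to-detector adjustment for a category without box annotations is estimated from the known adjustments of its most similar categories, weighted by visual/semantic similarity. This is a hedged sketch of that idea under assumed inputs, not the paper's exact formulation:

```python
import numpy as np

def transfer_adjustment(known_deltas, similarities):
    # known_deltas: (k, d) classifier-to-detector adjustments for the k
    # most similar fully annotated categories; similarities: (k,) scores.
    w = np.asarray(similarities, dtype=float)
    w = w / w.sum()                      # normalize similarity weights
    return np.tensordot(w, np.asarray(known_deltas), axes=1)
```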

Proceedings ArticleDOI
01 May 2016
TL;DR: This article presents several problems associated with the evaluation of word vectors on word similarity datasets and summarizes existing solutions, suggesting that the use of word similarity tasks for evaluating word vectors is not sustainable and calling for further research on evaluation methods.
Abstract: Lacking standardized extrinsic evaluation methods for vector representations of words, the NLP community has relied heavily on word similarity tasks as a proxy for intrinsic evaluation of word vectors. Word similarity evaluation, which correlates the distance between vectors and human judgments of “semantic similarity” is attractive, because it is computationally inexpensive and fast. In this paper we present several problems associated with the evaluation of word vectors on word similarity datasets, and summarize existing solutions. Our study suggests that the use of word similarity tasks for evaluation of word vectors is not sustainable and calls for further research on evaluation methods.
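For reference, the evaluation protocol the paper critiques is typically implemented as a rank correlation between cosine similarities and human judgments; a minimal sketch follows, in which out-of-vocabulary pairs are simply skipped (itself one of the issues the paper discusses):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(vectors, dataset):
    # vectors: dict word -> np.array; dataset: (word1, word2, human_score)
    model_scores, human_scores = [], []
    for w1, w2, score in dataset:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            u, v = vectors[w1], vectors[w2]
            cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            model_scores.append(cos)
            human_scores.append(score)
    return spearmanr(model_scores, human_scores).correlation
```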

Proceedings ArticleDOI
25 Aug 2016
TL;DR: This paper formulates the problem of predicting semantically linkable knowledge units as a multiclass classification problem and solves it using deep learning techniques, adopting a neural language model (word embeddings) and a convolutional neural network (CNN) to capture word- and document-level semantics of knowledge units.
Abstract: Consider a question and its answers in Stack Overflow as a knowledge unit. Knowledge units often contain semantically relevant knowledge, and thus are linkable for different purposes: as duplicate questions, as directly linkable for problem solving, or as indirectly linkable for related information. Recognising different classes of linkable knowledge would support more targeted information needs when users search or explore the knowledge base. Existing methods focus on binary relatedness (i.e., related or not), and are not robust enough to recognize different classes of semantic relatedness when linkable knowledge units share few words in common (i.e., have a lexical gap). In this paper, we formulate the problem of predicting semantically linkable knowledge units as a multiclass classification problem, and solve the problem using deep learning techniques. To overcome the lexical gap issue, we adopt a neural language model (word embeddings) and a convolutional neural network (CNN) to capture word- and document-level semantics of knowledge units. Instead of using human-engineered classifier features, which are hard to design for informal user-generated content, we exploit large amounts of different types of user-created knowledge-unit links to train the CNN to learn the most informative word-level and document-level features for the multiclass classification task. Our evaluation shows that our deep-learning based approach significantly and consistently outperforms traditional methods using traditional word representations and human-engineered classifier features.

Proceedings ArticleDOI
TL;DR: Tweet2Vec uses a character-level CNN-LSTM encoder-decoder to generate general-purpose vector representations of tweets; although trained on English-language tweets, the method can be used to learn tweet embeddings for other languages.
Abstract: We present Tweet2Vec, a novel method for generating general-purpose vector representations of tweets. The model learns tweet embeddings using a character-level CNN-LSTM encoder-decoder. We trained our model on 3 million randomly selected English-language tweets. The model was evaluated using two methods: tweet semantic similarity and tweet sentiment categorization, outperforming the previous state-of-the-art in both tasks. The evaluations demonstrate the power of the tweet embeddings generated by our model for various tweet categorization tasks. The vector representations generated by our model are generic, and hence can be applied to a variety of tasks. Though the model presented in this paper is trained on English-language tweets, the method presented can be used to learn tweet embeddings for different languages.

Book
07 Jun 2016
TL;DR: This article provides a comprehensive overview of the broad area of semantic search on text and knowledge bases, classified along two dimensions: the type of data (text, knowledge bases, combinations of these) and the kind of search (keyword, structured, natural language).
Abstract: This article provides a comprehensive overview of the broad area of semantic search on text and knowledge bases. In a nutshell, semantic search is "search with meaning". This "meaning" can refer to various parts of the search process: understanding the query instead of just finding matches of its components in the data, understanding the data instead of just searching it for such matches, or representing knowledge in a way suitable for meaningful retrieval. Semantic search is studied in a variety of different communities with a variety of different views of the problem. In this survey, we classify this work according to two dimensions: the type of data (text, knowledge bases, combinations of these) and the kind of search (keyword, structured, natural language). We consider all nine combinations. The focus is on fundamental techniques, concrete systems, and benchmarks. The survey also considers advanced issues: ranking, indexing, ontology matching and merging, and inference. It also provides a succinct overview of fundamental natural language processing techniques: POS-tagging, named-entity recognition and disambiguation, sentence parsing, and distributional semantics. The survey is as self-contained as possible, and should thus also serve as a good tutorial for newcomers to this fascinating and highly topical field.

Proceedings ArticleDOI
12 Sep 2016
TL;DR: This paper proposes to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms, and develops an embedding-based relevance model, an extension of the effective and robust relevance model approach.
Abstract: Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well-known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely-used cosine similarity with a sigmoid function to have more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.
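The sigmoid transformation mentioned in the abstract can be sketched directly; the steepness and midpoint below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def discriminative_similarity(u, v, steepness=10.0, midpoint=0.8):
    # Cosine similarities between embeddings tend to crowd a narrow range;
    # a sigmoid sharpens the contrast between related and unrelated terms.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 / (1.0 + np.exp(-steepness * (cos - midpoint)))
```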

Proceedings ArticleDOI
24 Jul 2016
TL;DR: Sentic LDA exploits common-sense reasoning to shift LDA clustering from a syntactic to a semantic level; rather than looking at word co-occurrence frequencies, it leverages the semantics associated with words and multi-word expressions to improve clustering and, hence, outperform state-of-the-art techniques for aspect extraction.
Abstract: The advent of the Social Web has provided netizens with new tools for creating and sharing, in a time- and cost-efficient way, their contents, ideas, and opinions with the millions of people connected to the World Wide Web. This huge amount of information, however, is mainly unstructured, as it is specifically produced for human consumption, and hence is not directly machine-processable. In order to enable a more efficient passage from unstructured information to structured data, aspect-based opinion mining models the relations between opinion targets contained in a document and the polarity values associated with these. Because aspects are often implicit, however, spotting them and calculating their respective polarity is an extremely difficult task, which is closer to natural language understanding than natural language processing. To this end, Sentic LDA exploits common-sense reasoning to shift LDA clustering from a syntactic to a semantic level. Rather than looking at word co-occurrence frequencies, Sentic LDA leverages the semantics associated with words and multi-word expressions to improve clustering and, hence, outperform state-of-the-art techniques for aspect extraction.

Proceedings ArticleDOI
16 May 2016
TL;DR: A unified probabilistic generative model, User-Community-Geo-Topic (UCGT), is proposed to simulate the generative process of communities as a result of network proximities, spatiotemporal co-occurrences and semantic similarity.
Abstract: Social community detection is a growing field of interest in the area of social network applications, and many approaches have been developed, including graph partitioning, latent space model, block model and spectral clustering. Most existing work purely focuses on network structure information which is, however, often sparse, noisy and lack of interpretability. To improve the accuracy and interpretability of community discovery, we propose to infer users' social communities by incorporating their spatiotemporal data and semantic information. Technically, we propose a unified probabilistic generative model, User-Community-Geo-Topic (UCGT), to simulate the generative process of communities as a result of network proximities, spatiotemporal co-occurrences and semantic similarity. With a well-designed multi-component model structure and a parallel inference implementation to leverage the power of multicores and clusters, our UCGT model is expressive while remaining efficient and scalable to growing large-scale geo-social networking data. We deploy UCGT to two application scenarios of user behavior predictions: check-in prediction and social interaction prediction. Extensive experiments on two large-scale geo-social networking datasets show that UCGT achieves better performance than existing state-of-the-art comparison methods.

Proceedings ArticleDOI
02 Jun 2016
TL;DR: A new statistical approach for measuring the similarity between two procedures is presented, using similarity by composition: decompose the code into smaller comparable fragments, define semantic similarity between fragments, and use statistical reasoning to lift fragment similarity into similarity between procedures.
Abstract: We address the problem of finding similar procedures in stripped binaries. We present a new statistical approach for measuring the similarity between two procedures. Our notion of similarity allows us to find similar code even when it has been compiled using different compilers, or has been modified. The main idea is to use similarity by composition: decompose the code into smaller comparable fragments, define semantic similarity between fragments, and use statistical reasoning to lift fragment similarity into similarity between procedures. We have implemented our approach in a tool called Esh, and applied it to find various prominent vulnerabilities across compilers and versions, including Heartbleed, Shellshock and Venom. We show that Esh produces high accuracy results, with few to no false positives -- a crucial factor in the scenario of vulnerability search in stripped binaries.
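A rough sketch of "similarity by composition": score each query fragment by its best match among the target procedure's fragments, then aggregate. The paper's actual formulation is statistical (likelihood-ratio based), so the max-and-average stand-in below only approximates the idea; `fragment_sim` is an assumed callable:

```python
def procedure_similarity(query_frags, target_frags, fragment_sim):
    # fragment_sim(q, t) -> semantic similarity of two code fragments,
    # e.g., derived from comparing their symbolic/concrete behaviors.
    best = [max(fragment_sim(q, t) for t in target_frags)
            for q in query_frags]
    return sum(best) / len(best)   # lift fragment evidence to procedures
```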

Proceedings ArticleDOI
01 Nov 2016
TL;DR: The authors explore whether prior work can be enhanced using semantic similarity/discordance between word embeddings, augmenting four previously reported feature sets with word embedding-based features.
Abstract: This paper makes a simple increment to the state of the art in sarcasm detection research. Existing approaches are unable to capture subtle forms of context incongruity which lie at the heart of sarcasm. We explore if prior work can be enhanced using semantic similarity/discordance between word embeddings. We augment word embedding-based features to four feature sets reported in the past. We also experiment with four types of word embeddings. We observe an improvement in sarcasm detection, irrespective of the word embedding used or the original feature set to which our features are augmented. For example, this augmentation results in an improvement in F-score of around 4% for three out of these four feature sets, and a minor degradation in case of the fourth, when Word2Vec embeddings are used. Finally, a comparison of the four embeddings shows that Word2Vec and dependency weight-based features outperform LSA and GloVe, in terms of their benefit to sarcasm detection.
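A minimal sketch of the kind of embedding-based incongruity features described above: the most similar and most discordant word pair in a sentence, assuming cosine similarity over pre-trained vectors (the exact feature set in the paper differs):

```python
import itertools
import numpy as np

def incongruity_features(words, vectors):
    # Returns [max, min] pairwise cosine similarity over in-vocabulary
    # word pairs; a very low minimum hints at semantic discordance.
    sims = []
    for w1, w2 in itertools.combinations(words, 2):
        if w1 in vectors and w2 in vectors:
            u, v = vectors[w1], vectors[w2]
            sims.append(float(np.dot(u, v) /
                              (np.linalg.norm(u) * np.linalg.norm(v))))
    return [max(sims), min(sims)] if sims else [0.0, 0.0]
```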

Proceedings ArticleDOI
13 Aug 2016
TL;DR: The authors define Label Noise Reduction in Entity Typing (LNR), the task of identifying correct type labels for training examples, given the set of candidate type labels obtained by distant supervision with a given type hierarchy.
Abstract: Current systems of fine-grained entity typing use distant supervision in conjunction with existing knowledge bases to assign categories (type labels) to entity mentions. However, the type labels so obtained from knowledge bases are often noisy (i.e., incorrect for the entity mention's local context). We define a new task, Label Noise Reduction in Entity Typing (LNR), to be the automatic identification of correct type labels (type-paths) for training examples, given the set of candidate type labels obtained by distant supervision with a given type hierarchy. The unknown type labels for individual entity mentions and the semantic similarity between entity types pose unique challenges for solving the LNR task. We propose a general framework, called PLE, to jointly embed entity mentions, text features and entity types into the same low-dimensional space where, in that space, objects whose types are semantically close have similar representations. Then we estimate the type-path for each training example in a top-down manner using the learned embeddings. We formulate a global objective for learning the embeddings from text corpora and knowledge bases, which adopts a novel margin-based loss that is robust to noisy labels and faithfully models type correlation derived from knowledge bases. Our experiments on three public typing datasets demonstrate the effectiveness and robustness of PLE, with an average of 25% improvement in accuracy compared to next best method.

Posted Content
TL;DR: A robust methodology is developed for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes; this methodology is then used to reveal statistical laws of semantic evolution.
Abstract: Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity---the rate of semantic change scales with an inverse power-law of word frequency; (ii) the law of innovation---independent of frequency, words that are more polysemous have higher rates of semantic change.
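Stated compactly, the two proposed laws are scaling relations between a word's rate of semantic change and its frequency or polysemy (the exponents are fitted per corpus in the paper; the notation below is a paraphrase of the abstract):

```latex
% Law of conformity: rate of change falls with frequency
\Delta(w_i) \propto f(w_i)^{-\alpha}, \quad \alpha > 0
% Law of innovation: controlling for frequency, rate rises with polysemy
\Delta(w_i) \propto d(w_i)^{\beta}, \quad \beta > 0
```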

Journal ArticleDOI
TL;DR: This comparison of many off-the-shelf distributed semantic vector representations of words provides a guide for how vector similarity computations can be used to make predictions about behavioural results or human annotations of data.

Abstract: In this paper we carry out an extensive comparison of many off-the-shelf distributed semantic vector representations of words, for the purpose of making predictions about behavioural results or human annotations of data. In doing this comparison we also provide a guide for how vector similarity computations can be used to make such predictions, and introduce many resources available both in terms of datasets and of vector representations. Finally, we discuss the shortcomings of this approach and future research directions that might address them.

Posted Content
TL;DR: A novel counter-fitting method is presented which injects antonymy and synonymy constraints into vector space representations in order to improve the vectors' capability for judging semantic similarity.
Abstract: In this work, we present a novel counter-fitting method which injects antonymy and synonymy constraints into vector space representations in order to improve the vectors' capability for judging semantic similarity. Applying this method to publicly available pre-trained word vectors leads to a new state of the art performance on the SimLex-999 dataset. We also show how the method can be used to tailor the word vector space for the downstream task of dialogue state tracking, resulting in robust improvements across different dialogue domains.

Journal ArticleDOI
TL;DR: A deep embedding network jointly supervised by classification loss and triplet loss is proposed to map the high-dimensional image space into a low-dimensional feature space, where the Euclidean distance of features directly corresponds to the semantic similarity of images.
Abstract: In multi-view 3D object retrieval, each object is characterized by a group of 2D images captured from different views. Rather than using hand-crafted features, in this paper, we take advantage of the strong discriminative power of convolutional neural network to learn an effective 3D object representation tailored for this retrieval task. Specifically, we propose a deep embedding network jointly supervised by classification loss and triplet loss to map the high-dimensional image space into a low-dimensional feature space, where the Euclidean distance of features directly corresponds to the semantic similarity of images. By effectively reducing the intra-class variations while increasing the inter-class ones of the input images, the network guarantees that similar images are closer than dissimilar ones in the learned feature space. Besides, we investigate the effectiveness of deep features extracted from different layers of the embedding network extensively and find that an efficient 3D object representation should be a tradeoff between global semantic information and discriminative local characteristics. Then, with the set of deep features extracted from different views, we can generate a comprehensive description for each 3D object and formulate the multi-view 3D object retrieval as a set-to-set matching problem. Extensive experiments on SHREC’15 data set demonstrate the superiority of our proposed method over the previous state-of-the-art approaches with over 12% performance improvement.
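The triplet supervision described above can be sketched in a few lines; the margin value is an illustrative assumption, and the classification-loss term the network is jointly trained with is omitted:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    # Require the anchor-positive Euclidean distance to beat the
    # anchor-negative distance by at least `margin`, so that distance
    # in the learned space tracks semantic similarity.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```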

Proceedings ArticleDOI
01 Jun 2016
TL;DR: The authors proposed a model that uses convolutional neural networks to capture semantic correspondence between a mention's context and a proposed target entity and achieved state-of-the-art performance on multiple entity linking datasets.
Abstract: A key challenge in entity linking is making effective use of contextual information to disambiguate mentions that might refer to different entities in different contexts. We present a model that uses convolutional neural networks to capture semantic correspondence between a mention's context and a proposed target entity. These convolutional networks operate at multiple granularities to exploit various kinds of topic information, and their rich parameterization gives them the capacity to learn which n-grams characterize different topics. We combine these networks with a sparse linear model to achieve state-of-the-art performance on multiple entity linking datasets, outperforming the prior systems of Durrett and Klein (2014) and Nguyen et al. (2014).

Proceedings Article
12 Feb 2016
TL;DR: A novel unsupervised approach is proposed that makes a major improvement in aspect extraction; it is based on the framework of lifelong learning and is implemented with two forms of recommendations, based on semantic similarity and aspect associations respectively.
Abstract: Aspect extraction is a key task of fine-grained opinion mining. Although it has been studied by many researchers, it remains to be highly challenging. This paper proposes a novel unsupervised approach to make a major improvement. The approach is based on the framework of lifelong learning and is implemented with two forms of recommendations that are based on semantic similarity and aspect associations respectively. Experimental results using eight review datasets show the effectiveness of the proposed approach.

Journal ArticleDOI
TL;DR: A novel fuzzy decision-making framework to model uncertain relationships between objects in databases for service matching, and a novel analytic hierarchy process approach to calculate the semantic similarity between concepts, are presented.