scispace - formally typeset

Showing papers on "Semantic similarity published in 2018"


Journal ArticleDOI
TL;DR: A novel model of Inductive Matrix Completion for MiRNA‐Disease Association prediction (IMCMDA) to complete the missing miRNA‐disease association based on the known associations and the integrated miRNA similarity and disease similarity.
Abstract: Motivation It has been shown that microRNAs (miRNAs) play key roles in a variety of biological processes associated with human diseases. In consideration of the cost and complexity of biological experiments, computational methods for predicting potential associations between miRNAs and diseases would be an effective complement. Results This paper presents a novel model of Inductive Matrix Completion for MiRNA-Disease Association prediction (IMCMDA). The integrated miRNA similarity and disease similarity are calculated based on miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity. The main idea is to complete the missing miRNA-disease associations based on the known associations and the integrated miRNA similarity and disease similarity. IMCMDA achieves an AUC of 0.8034 under leave-one-out cross-validation, improving on previous models. In addition, IMCMDA was applied to five common human diseases in three types of case studies. In the first type, 42, 44 and 45 of the top 50 predicted miRNAs for Colon Neoplasms, Kidney Neoplasms and Lymphoma, respectively, were confirmed by experimental reports. In the second type of case study, for new diseases without any known miRNAs, we chose Breast Neoplasms as the test example by hiding the association information between the miRNAs and Breast Neoplasms. As a result, 50 of the top 50 predicted Breast Neoplasms-related miRNAs were verified. In the third type of case study, IMCMDA was tested on HMDD V1.0 to assess its robustness; 49 of the top 50 predicted Esophageal Neoplasms-related miRNAs were verified. Availability and implementation The code and dataset of IMCMDA are freely available at https://github.com/IMCMDAsourcecode/IMCMDA. Supplementary information Supplementary data are available at Bioinformatics online.
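The Gaussian interaction profile (GIP) kernel similarity used above is a standard construction and can be sketched as follows; profiles are rows of the known binary association matrix, and the bandwidth parameter `gamma_prime` is an assumed default, not a value taken from the paper:

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity between binary
    association profiles (rows of the known association matrix)."""
    n = len(profiles)
    # Bandwidth: gamma_prime normalised by the mean squared profile norm.
    mean_sq = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            K[i][j] = math.exp(-gamma * d2)
    return K
```

Entities with identical interaction profiles get similarity 1, and similarity decays with squared Euclidean distance between profiles.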

362 citations


Journal ArticleDOI
TL;DR: A triplet-based deep hashing (TDH) network for cross-modal retrieval using the triplet labels, which describe the relative relationships among three instances as supervision in order to capture more general semantic correlations between cross-modal instances.
Abstract: Given the benefits of its low storage requirements and high retrieval efficiency, hashing has recently received increasing attention. In particular, cross-modal hashing has been widely and successfully used in multimedia similarity search applications. However, almost all existing methods employing cross-modal hashing cannot obtain powerful hash codes because they ignore the relative similarity between heterogeneous data, which contains richer semantic information, leading to unsatisfactory retrieval performance. In this paper, we propose a triplet-based deep hashing (TDH) network for cross-modal retrieval. First, we utilize the triplet labels, which describe the relative relationships among three instances, as supervision in order to capture more general semantic correlations between cross-modal instances. We then establish a loss function from the inter-modal view and the intra-modal view to boost the discriminative abilities of the hash codes. Finally, graph regularization is introduced into our proposed TDH method to preserve the original semantic similarity between hash codes in Hamming space. Experimental results show that our proposed method outperforms several state-of-the-art approaches on two popular cross-modal data sets.
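Triplet supervision of this kind rests on a standard ranking loss, sketched here in plain Python over real-valued embeddings (the hashing-specific machinery and the paper's exact formulation are omitted):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet ranking loss: pull the anchor towards the positive and
    push it away from the negative by at least `margin` (squared L2)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is what encodes "relative" rather than absolute similarity.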

312 citations


Journal ArticleDOI
TL;DR: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News, and the intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings is closer to human experts' judgments on all four tested datasets.

287 citations


Book ChapterDOI
08 Sep 2018
TL;DR: This work focuses on video-language tasks including multimodal retrieval and video QA, and evaluates the JSFusion model in three retrieval and VQA tasks in LSMDC, for which the model achieves the best performance reported so far.
Abstract: We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequence data into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model in three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks for the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.

282 citations


Journal ArticleDOI
TL;DR: This review presents the state of the art in distributional semantics, focusing on its assets and limits as a model of meaning and as a method for semantic analysis.
Abstract: Distributional semantics is a usage-based model of meaning, based on the assumption that the statistical distribution of linguistic items in context plays a key role in characterizing their semantic behavior. Distributional models build semantic representations by extracting co-occurrences from corpora and have become a mainstream research paradigm in computational linguistics. In this review, I present the state of the art in distributional semantics, focusing on its assets and limits as a model of meaning and as a method for semantic analysis.
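The basic distributional recipe the review surveys — count co-occurrences in a context window, then compare the resulting vectors by cosine — can be sketched in a few lines (a toy illustration, not any specific model from the review):

```python
import math
from collections import Counter

def cooc_vectors(corpus, window=2):
    """Build word vectors from symmetric co-occurrence counts
    within a fixed window over tokenised sentences."""
    vecs = {}
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Words that occur in similar contexts ("cat" and "dog" below) end up with similar vectors, which is the distributional hypothesis in miniature.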

251 citations


Journal ArticleDOI
Wen Zhang1, Xiang Yue1, Weiran Lin1, Wenjian Wu1, Ruoqi Liu1, Feng Huang1, Feng Liu1 
TL;DR: A similarity constrained matrix factorization method (SCMFDD) that makes use of known drug-disease associations, drug features and disease semantic information; a user-friendly web server, built with known associations collected from the CTD database, is available at http://www.bioinfotech.cn/SCMFDD/.
Abstract: Drug-disease associations provide important information for drug discovery. Wet experiments that identify drug-disease associations are time-consuming and expensive. However, many drug-disease associations remain unobserved or unknown. The development of computational methods for predicting unobserved drug-disease associations is an important and urgent task. In this paper, we propose a similarity constrained matrix factorization method for drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarities and disease semantic similarity as constraints for drugs and diseases in the low-rank spaces. Different from the classic matrix factorization technique, SCMFDD takes the biological context of the problem into account. In computational experiments, the proposed method produces high accuracy on benchmark datasets and outperforms existing state-of-the-art prediction methods when evaluated by five-fold cross validation and independent testing. We developed a user-friendly web server using known associations collected from the CTD database, available at http://www.bioinfotech.cn/SCMFDD/ . The case studies show that the server can identify novel associations that are not included in the CTD database.
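The low-rank projection at the heart of such methods can be sketched with plain SGD matrix factorization; this is a minimal illustration, not the SCMFDD implementation, and the similarity-based regularisers the paper adds on the drug and disease factors are deliberately omitted:

```python
import random

def factorize(A, rank=2, lr=0.05, epochs=2000, seed=0):
    """Plain low-rank factorization A ~ U V^T by stochastic gradient
    descent. (SCMFDD additionally constrains U and V with drug/disease
    similarities; those regulariser terms are omitted here.)"""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    for _ in range(epochs):
        for i in range(m):
            for j in range(n):
                err = A[i][j] - sum(U[i][k] * V[j][k] for k in range(rank))
                for k in range(rank):
                    ui = U[i][k]
                    U[i][k] += lr * err * V[j][k]  # gradient step on U
                    V[j][k] += lr * err * ui       # gradient step on V
    return U, V
```

After fitting, the reconstructed entries U[i]·V[j] score unobserved drug-disease pairs; the similarity constraints would pull the latent rows of similar drugs (or diseases) towards each other.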

169 citations


Journal ArticleDOI
TL;DR: In this work, some novel processes to measure the similarity between picture fuzzy sets are presented, and several similarity measures are developed, such as the cosine similarity measure, weighted cosine similarity measure, set-theoretic similarity measure, weighted set-theoretic cosine similarity measure, grey similarity measure and weighted grey similarity measure.
Abstract: In this work, we present some novel processes to measure the similarity between picture fuzzy sets. Firstly, we adopt the concepts of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Secondly, we develop some similarity measures between picture fuzzy sets, such as the cosine similarity measure, weighted cosine similarity measure, set-theoretic similarity measure, weighted set-theoretic cosine similarity measure, grey similarity measure and weighted grey similarity measure. Then, we apply these similarity measures between picture fuzzy sets to building material recognition and minerals field recognition. Finally, two illustrative examples are given to demonstrate the efficiency of the similarity measures for building material recognition and minerals field recognition.
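The cosine similarity measure between picture fuzzy sets can be sketched directly from the usual definition: each element carries a positive, a neutral and a negative membership degree, and per-element cosines are averaged. The exact weighting details in the paper may differ; this is an illustrative sketch:

```python
import math

def pfs_cosine(A, B):
    """Cosine similarity between two picture fuzzy sets, each given as
    a list of (positive, neutral, negative) membership triples."""
    total = 0.0
    for (ua, ea, va), (ub, eb, vb) in zip(A, B):
        dot = ua * ub + ea * eb + va * vb
        na = math.sqrt(ua * ua + ea * ea + va * va)
        nb = math.sqrt(ub * ub + eb * eb + vb * vb)
        total += dot / (na * nb) if na and nb else 0.0
    return total / len(A)
```

A weighted variant would replace the uniform 1/n average with per-element weights summing to one.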

158 citations


Proceedings ArticleDOI
13 Jul 2018
TL;DR: This work designs a deep architecture and a pair-wise loss function to preserve the constructed semantic structure between points in Hamming space under unsupervised settings, and shows that SSDH significantly outperforms current state-of-the-art methods.
Abstract: Hashing is becoming increasingly popular for approximate nearest neighbor searching in massive databases due to its storage and search efficiency. Recent supervised hashing methods, which usually construct semantic similarity matrices to guide hash code learning using label information, have shown promising results. However, it is relatively difficult to capture and utilize the semantic relationships between points in unsupervised settings. To address this problem, we propose a novel unsupervised deep framework called Semantic Structure-based unsupervised Deep Hashing (SSDH). We first empirically study the deep feature statistics, and find that the distribution of the cosine distance for point pairs can be estimated by two half Gaussian distributions. Based on this observation, we construct the semantic structure by considering points with distances obviously smaller than the others as semantically similar and points with distances obviously larger than the others as semantically dissimilar. We then design a deep architecture and a pair-wise loss function to preserve this semantic structure in Hamming space. Extensive experiments show that SSDH significantly outperforms current state-of-the-art methods.
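The pseudo-labeling step described above can be sketched as a toy: pairs whose cosine distance is clearly small are marked similar, clearly large pairs dissimilar, and ambiguous pairs ignored. The thresholds here are fixed by hand; SSDH itself estimates them by fitting two half-Gaussian distributions to the distance histogram:

```python
def semantic_structure(dist, t_sim, t_dis):
    """Pseudo-label point pairs from a cosine-distance matrix:
    clearly small -> similar (+1), clearly large -> dissimilar (-1),
    in between -> ignored (0). SSDH derives t_sim/t_dis from two
    half-Gaussians; fixed thresholds are used here for illustration."""
    n = len(dist)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if dist[i][j] <= t_sim:
                S[i][j] = 1
            elif dist[i][j] >= t_dis:
                S[i][j] = -1
    return S
```

The resulting matrix S then plays the role of the supervised similarity matrix in the pair-wise hashing loss.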

131 citations


Proceedings ArticleDOI
01 Jul 2018
TL;DR: A novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains, and demonstrates great capability in handling multi-domain dialogues.
Abstract: Robust dialogue belief tracking is a key component in maintaining good quality dialogue systems. The tasks that dialogue systems are trying to solve are becoming increasingly complex, requiring scalability to multi-domain, semantically rich dialogues. However, most current approaches have difficulty scaling up with domains because of the dependency of the model parameters on the dialogue ontology. In this paper, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains. The evaluation is performed on a recently collected multi-domain dialogues dataset, one order of magnitude larger than currently available corpora. Our model demonstrates great capability in handling multi-domain dialogues, simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.

118 citations


Proceedings Article
Haiyun Guo1, Chaoyang Zhao1, Zhiwei Liu1, Jinqiao Wang1, Hanqing Lu1 
27 Apr 2018
TL;DR: This paper learns a structured feature embedding for vehicle re-ID with a novel coarse-to-fine ranking loss to pull images of the same vehicle as close as possible and achieve discrimination between images from different vehicles as well as vehicles from different vehicle models.
Abstract: Vehicle re-identification (re-ID) is to identify the same vehicle across different cameras. It’s a significant but challenging topic, which has received little attention due to the complex intra-class and inter-class variation of vehicle images and the lack of large-scale vehicle re-ID dataset. Previous methods focus on pulling images from different vehicles apart but neglect the discrimination between vehicles from different vehicle models, which is actually quite important to obtain a correct ranking order for vehicle re-ID. In this paper, we learn a structured feature embedding for vehicle re-ID with a novel coarse-to-fine ranking loss to pull images of the same vehicle as close as possible and achieve discrimination between images from different vehicles as well as vehicles from different vehicle models. In the learnt feature space, both intra-class compactness and inter-class distinction are well guaranteed and the Euclidean distance between features directly reflects the semantic similarity of vehicle images. Furthermore, we build so far the largest vehicle re-ID dataset "Vehicle-1M," which involves nearly 1 million images captured in various surveillance scenarios. Experimental results on "Vehicle-1M" and "VehicleID" demonstrate the superiority of our proposed approach.

117 citations


Proceedings ArticleDOI
20 Apr 2018
TL;DR: The authors presented a novel approach to learn representations for sentence-level semantic similarity using conversational data, which achieved the best performance among all neural models on the Semantic Textual Similarity (STS) Benchmark and SemEval 2017's Community Question Answering (CQA) question similarity subtask.
Abstract: We present a novel approach to learn representations for sentence-level semantic similarity using conversational data. Our method trains an unsupervised model to predict conversational responses. The resulting sentence embeddings perform well on the Semantic Textual Similarity (STS) Benchmark and SemEval 2017’s Community Question Answering (CQA) question similarity subtask. Performance is further improved by introducing multitask training, combining conversational response prediction and natural language inference. Extensive experiments show the proposed model achieves the best performance among all neural models on the STS Benchmark and is competitive with the state-of-the-art feature engineered and mixed systems for both tasks.

Posted Content
TL;DR: Notably, this simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
Abstract: An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: what are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep learning dynamics to give rise to these regularities.
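To give a flavour of what these exact solutions look like (reconstructed from the deep linear network literature rather than quoted from this paper, so treat the precise constants as an assumption): for a two-layer linear network trained on an input-output correlation matrix with singular value decomposition $\Sigma^{yx} = U S V^\top$, the effective strength $a_\alpha(t)$ of each singular mode follows a sigmoidal trajectory of the form

```latex
a_\alpha(t) \;=\; \frac{s_\alpha\, e^{2 s_\alpha t/\tau}}{e^{2 s_\alpha t/\tau} - 1 + s_\alpha / a_0},
```

where $s_\alpha$ is the mode's singular value, $a_0 \ll s_\alpha$ its small initial strength, and $\tau$ a learning time constant. Because the transition time scales like $\tau / s_\alpha$, strong modes are learned first, which is what produces the stage-like developmental transitions and hierarchical differentiation described above.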

Proceedings ArticleDOI
10 Apr 2018
TL;DR: This work introduces and addresses the problem of ad hoc table retrieval: answering a keyword query with a ranked list of tables, representing queries and tables in multiple semantic spaces and introducing various similarity measures for matching those semantic representations.
Abstract: We introduce and address the problem of ad hoc table retrieval: answering a keyword query with a ranked list of tables. This task is not only interesting on its own account, but is also being used as a core component in many other table-based information access scenarios, such as table completion or table mining. The main novel contribution of this work is a method for performing semantic matching between queries and tables. Specifically, we (i) represent queries and tables in multiple semantic spaces (both discrete sparse and continuous dense vector representations) and (ii) introduce various similarity measures for matching those semantic representations. We consider all possible combinations of semantic representations and similarity measures and use these as features in a supervised learning model. Using a purpose-built test collection based on Wikipedia tables, we demonstrate significant and substantial improvements over a state-of-the-art baseline.
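As a minimal illustration (not the paper's actual feature set) of matching semantic representations with similarity measures, one can compare query and table term embeddings by early fusion (cosine of centroids) and late fusion (aggregates of all pairwise cosines), and feed the results to a supervised ranker:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def matching_features(query_vecs, table_vecs):
    """Early fusion: cosine of centroids. Late fusion: aggregates of
    all pairwise term cosines. All become ranking features."""
    pairwise = [cosine(q, t) for q in query_vecs for t in table_vecs]
    return {
        "early": cosine(centroid(query_vecs), centroid(table_vecs)),
        "late_max": max(pairwise),
        "late_avg": sum(pairwise) / len(pairwise),
    }
```

Crossing several semantic spaces (sparse and dense) with several such measures yields the feature grid the paper describes.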

Journal ArticleDOI
TL;DR: The Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies, has the potential to significantly outperform the state‐of‐the‐art in several predictive applications in which ontologies are involved.
Abstract: Motivation Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications. Results We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the Gene Ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly improve prediction of protein-protein interactions in human and yeast. We then illustrate how Onto2Vec representations provide the means for constructing data-driven, trainable semantic similarity measures that can be used to identify particular relations between proteins. Finally, we use an unsupervised clustering approach to identify protein families based on their Enzyme Commission numbers. Our results demonstrate that Onto2Vec can generate high quality feature vectors from biological entities and ontologies. Onto2Vec has the potential to significantly outperform the state-of-the-art in several predictive applications in which ontologies are involved. Availability and implementation https://github.com/bio-ontology-research-group/onto2vec. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
08 May 2018-PLOS ONE
TL;DR: The supervised link prediction approach proved to be promising for potential DDI prediction and may facilitate the identification of potential DDIs in clinical research.
Abstract: Drug-drug interaction (DDI) is a change in the effect of a drug when a patient takes another drug. Characterizing DDIs is extremely important to avoid potential adverse drug reactions. We represent DDIs as a complex network in which nodes refer to drugs and links refer to their potential interactions. Recently, the problem of link prediction has attracted much consideration in the scientific community. We represent the process of link prediction as a binary classification task on networks of potential DDIs. We use link prediction techniques for predicting unknown interactions between drugs in five arbitrarily chosen large-scale DDI databases, namely DrugBank, KEGG, NDF-RT, SemMedDB, and Twosides. We estimated the performance of link prediction using a series of experiments on DDI networks. We performed link prediction using unsupervised and supervised approaches, including classification tree, k-nearest neighbors, support vector machine, random forest, and gradient boosting machine classifiers based on topological and semantic similarity features. The supervised approach clearly outperforms the unsupervised approach. The Twosides network yielded the best prediction performance regarding the area under the precision-recall curve (0.93 for both random forest and gradient boosting machine). The applied methodology can be used as a tool to help researchers identify potential DDIs. The supervised link prediction approach proved to be promising for potential DDI prediction and may facilitate the identification of potential DDIs in clinical research.
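The topological similarity features such link predictors feed to classifiers can be illustrated with a few classics (common neighbours, Jaccard coefficient, preferential attachment); this is a generic sketch, not the paper's exact feature set:

```python
def link_features(adj, u, v):
    """Topological similarity features for a candidate drug pair (u, v)
    in a DDI network given as an adjacency dict of neighbour sets."""
    cn = adj[u] & adj[v]                # common neighbours
    un = adj[u] | adj[v]
    return {
        "common_neighbors": len(cn),
        "jaccard": len(cn) / len(un) if un else 0.0,
        "preferential_attachment": len(adj[u]) * len(adj[v]),
    }
```

Each known (positive) and sampled unknown (negative) pair becomes one feature vector, turning link prediction into the binary classification task the paper describes.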

Journal ArticleDOI
TL;DR: This paper introduces Distributed Dictionary Representations (DDR), a method that applies psychological dictionaries using semantic similarity rather than word counts, which allows for the measurement of the similarity between dictionaries and spans of text.
Abstract: Theory-driven text analysis has made extensive use of psychological concept dictionaries, leading to a wide range of important results. These dictionaries have generally been applied through word count methods which have proven to be both simple and effective. In this paper, we introduce Distributed Dictionary Representations (DDR), a method that applies psychological dictionaries using semantic similarity rather than word counts. This allows for the measurement of the similarity between dictionaries and spans of text ranging from complete documents to individual words. We show how DDR enables dictionary authors to place greater emphasis on construct validity without sacrificing linguistic coverage. We further demonstrate the benefits of DDR on two real-world tasks and finally conduct an extensive study of the interaction between dictionary size and task performance. These studies allow us to examine how DDR and word count methods complement one another as tools for applying concept dictionaries and where each is best applied. Finally, we provide references to tools and resources to make this method both available and accessible to a broad psychological audience.
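The core DDR idea — replacing word counts with the cosine between a dictionary's mean vector and a text's mean vector — can be sketched as follows (the function name and toy embeddings are illustrative, not the authors' code):

```python
import math

def ddr_similarity(text_tokens, dictionary, embeddings):
    """DDR-style score: cosine between the mean embedding of the
    dictionary words and the mean embedding of the text tokens,
    instead of counting dictionary hits."""
    def mean_vec(words):
        vecs = [embeddings[w] for w in words if w in embeddings]
        n = len(vecs)  # assumes at least one word is in the vocabulary
        return [sum(v[k] for v in vecs) / n for k in range(len(vecs[0]))]
    u, v = mean_vec(dictionary), mean_vec(text_tokens)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Unlike word counts, the score is non-zero even when the text shares no literal words with the dictionary, which is what lets small, high-validity dictionaries keep broad linguistic coverage.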

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This article proposes a simple yet effective approach for incorporating side information in the form of distributional constraints over the generated responses, which help generate more content-rich responses based on a model of syntax and topics.
Abstract: Neural conversation models tend to generate safe, generic responses for most inputs. This is due to the limitations of likelihood-based decoding objectives in generation tasks with diverse outputs, such as conversation. To address this challenge, we propose a simple yet effective approach for incorporating side information in the form of distributional constraints over the generated responses. We propose two constraints that help generate more content-rich responses, based on a model of syntax and topics (Griffiths et al., 2005) and semantic similarity (Arora et al., 2016). We evaluate our approach against a variety of competitive baselines, using both automatic metrics and human judgments, showing that our proposed approach generates responses that are much less generic without sacrificing plausibility. A working demo of our code can be found at https://github.com/abaheti95/DC-NeuralConversation.

Journal ArticleDOI
TL;DR: A computational model of Random Forest for miRNA-disease association (RFMDA) prediction based on machine learning is developed, and the results of cross-validation and case studies indicate that RFMDA is a reliable model for predicting miRNA-disease associations.
Abstract: Since the first microRNA (miRNA) was discovered, many studies have confirmed associations between miRNAs and human complex diseases. Moreover, obtaining and taking advantage of association information between miRNAs and diseases plays an increasingly important role in improving the treatment of complex diseases. However, due to the high cost of traditional experimental methods, many researchers have proposed different computational methods to predict potential associations between miRNAs and diseases. In this work, we developed a computational model of Random Forest for miRNA-disease association (RFMDA) prediction based on machine learning. The training sample set for RFMDA was constructed according to the human microRNA disease database (HMDD) version (v.)2.0, and the feature vectors to represent miRNA-disease samples were defined by integrating miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity. The Random Forest algorithm was first employed to infer miRNA-disease associations. In addition, a filter-based method was implemented to select robust features from the miRNA-disease feature set, which could efficiently distinguish related miRNA-disease pairs from unrelated ones. RFMDA achieved areas under the curve (AUCs) of 0.8891, 0.8323, and 0.8818 ± 0.0014 under global leave-one-out cross-validation, local leave-one-out cross-validation, and 5-fold cross-validation, respectively, which were higher than many previous computational models. To further evaluate the accuracy of RFMDA, we carried out three types of case studies for four human complex diseases. As a result, 43 (esophageal neoplasms), 46 (lymphoma), 47 (lung neoplasms), and 48 (breast neoplasms) of the top 50 predicted disease-related miRNAs were verified by experiments in different kinds of case studies. The results of cross-validation and case studies indicate that RFMDA is a reliable model for predicting miRNA-disease associations.

Journal ArticleDOI
01 Mar 2018
TL;DR: A System for Integrating Semantic Relatedness and similarity measures, SISR, which aims to provide a variety of tools for computing the semantic similarity and relatedness and is the first to treat the topic of computing semantic relatedness with a view of integrating different key stakeholders in a parameterized way.
Abstract: Semantic similarity and relatedness measures have increasingly become core elements of recent research within the semantic technology community. Nowadays, the search for efficient meaning-centered applications that exploit computational semantics has become a necessity. Researchers have therefore become increasingly interested in developing models that can simulate the human thinking process and are capable of measuring semantic similarity/relatedness between lexical terms, including concepts and words. Knowledge resources are fundamental to quantify semantic similarity or relatedness and to achieve the best expression of the semantic content. No fully developed system able to centralize these approaches is currently available to the research and industrial communities. In this paper, we propose a System for Integrating Semantic Relatedness and similarity measures, SISR, which aims to provide a variety of tools for computing semantic similarity and relatedness. This system is the first to treat the topic of computing semantic relatedness with a view to integrating different key stakeholders in a parameterized way. As an instance of the proposed architecture, we propose WNetSS, a Java API allowing the use of a wide range of WordNet-based semantic similarity measures pertaining to different categories, including taxonomic-based, features-based and IC-based measures. It is the first API that allows the extraction of the topological parameters from the WordNet "is a" taxonomy, which are used to express the semantics of concepts. Moreover, an evaluation module is proposed to assess the reproducibility of the measures' accuracy, which can be evaluated against 10 widely used benchmarks through correlation coefficients.
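As a concrete example of the taxonomic measures such a system exposes, here is a Wu-Palmer-style similarity over a toy "is a" taxonomy (the toy taxonomy and function names are illustrative, not the WNetSS API):

```python
def ancestors(taxonomy, node):
    """Walk up a child -> parent 'is a' dict (root maps to None);
    returns [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = taxonomy[node]
    return chain

def wu_palmer(taxonomy, a, b):
    """Wu-Palmer similarity: 2*depth(LCS) / (depth(a) + depth(b)),
    counting the root as depth 1 (assumes a tree-shaped taxonomy)."""
    pa, pb = ancestors(taxonomy, a), ancestors(taxonomy, b)
    bset = set(pb)
    lcs = next(n for n in pa if n in bset)   # deepest common subsumer
    d_lcs = len(pa) - pa.index(lcs)          # depth of the subsumer
    return 2.0 * d_lcs / (len(pa) + len(pb))
```

Concepts sharing a deep common subsumer ("dog" and "cat" under "animal") score higher than concepts that only meet at the root, which is exactly the kind of topological parameter extracted from the "is a" taxonomy.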

Proceedings ArticleDOI
31 Jul 2018
TL;DR: This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings, obtained with a novel training method that introduces hard negatives consisting of sentences that are not translations but have some degree of semantic similarity.
Abstract: This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. This is achieved using a novel training method that introduces hard negatives consisting of sentences that are not translations but have some degree of semantic similarity. The quality of the resulting embeddings is evaluated on parallel corpus reconstruction and by assessing machine translation systems trained on gold vs. mined sentence pairs. We find that the sentence embeddings can be used to reconstruct the United Nations Parallel Corpus (Ziemski et al., 2016) at the sentence level with a precision of 48.9% for en-fr and 54.9% for en-es. When adapted to document-level matching, we achieve a parallel document matching accuracy that is comparable to the significantly more computationally intensive approach of Uszkoreit et al. (2010). Using reconstructed parallel data, we are able to train NMT models that perform nearly as well as models trained on the original data (within 1-2 BLEU).
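The hard-negative training signal can be illustrated with a simple margin-based ranking loss (a generic sketch; the paper's actual objective may differ): the true translation pair must outscore the hardest non-translation candidate by a margin.

```python
def hard_negative_margin_loss(sim_pos, sim_negs, margin=0.3):
    """Ranking loss for bilingual sentence embeddings: the translation
    pair's similarity `sim_pos` should exceed the hardest (highest
    scoring) negative's similarity by at least `margin`."""
    hardest = max(sim_negs)
    return max(0.0, margin - sim_pos + hardest)
```

Mining semantically close non-translations as negatives makes `hardest` large, so the model is forced to separate "similar meaning" from "actual translation".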

Journal ArticleDOI
TL;DR: This work proposes the SCSNED method for disambiguation based on semantic similarity between contextual words and informative words of entities in KGs, and proposes a Category2Vec embedding model based on joint learning of word and category embeddings, in order to compute word-category similarity for entity disambiguation.
Abstract: With the increasing popularity of large scale Knowledge Graph (KG)s, many applications such as semantic analysis, search and question answering need to link entity mentions in texts to entities in KGs. Because of the polysemy problem in natural language, entity disambiguation is thus a key problem in current research. Existing disambiguation methods have considered entity prominence, context similarity and entity-entity relatedness to discriminate ambiguous entities, which are mainly working on document or paragraph level texts containing rich contextual information, and based on lexical matching for computing context similarity. When meeting short texts containing limited contextual information, such as web queries, questions and tweets, those conventional disambiguation methods are not good at handling single entity mention and measuring context similarity. In order to enhance the performance of disambiguation methods based on context similarity with such short texts, we propose SCSNED method for disambiguation based on semantic similarity between contextual words and informative words of entities in KGs. Specially, we exploit the effectiveness of both knowledge-based and corpus-based semantic similarity methods for entity disambiguation with SCSNED. Moreover, we propose a Category2Vec embedding model based on joint learning of word and category embedding, in order to compute word-category similarity for entity disambiguation. We show the effectiveness of these proposed methods with illustrative examples, and evaluate their effectiveness in a comparative experiment for entity disambiguation in real world web queries, questions and tweets. The experimental results have identified the effectiveness of different semantic similarity methods, and demonstrated the improvement of semantic similarity methods in SCSNED and Category2Vec over the conventional context similarity baseline. 
We further compare the proposed approaches with state-of-the-art entity disambiguation systems and show that their performance is among the best. In addition, an important feature of the proposed approaches is their potential applicability to any existing KG, since they mainly use common features of entity descriptions and categories. Another contribution of the paper is an updated survey of the background of entity disambiguation in KGs and of semantic similarity methods.

Posted Content
TL;DR: In this article, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and ontology terms, allowing information to be shared across domains; it demonstrates great capability in handling multi-domain dialogues while simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.
Abstract: Robust dialogue belief tracking is a key component in maintaining good quality dialogue systems. The tasks that dialogue systems are trying to solve are becoming increasingly complex, requiring scalability to multi-domain, semantically rich dialogues. However, most current approaches have difficulty scaling to new domains because the model parameters depend on the dialogue ontology. In this paper, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains. The evaluation is performed on a recently collected multi-domain dialogue dataset, one order of magnitude larger than currently available corpora. Our model demonstrates great capability in handling multi-domain dialogues, simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.

Journal ArticleDOI
TL;DR: This work develops end-to-end architectures directly tailored to the task of mapping a disease mention to a concept in a controlled vocabulary, typically the standard thesaurus in the Unified Medical Language System (UMLS), together with additional semantic similarity features based on UMLS.

Journal ArticleDOI
TL;DR: Experimental results illustrate the reliability and usefulness of the computational method in terms of different validation measures, indicating that PWCDA can effectively predict potential circRNA-disease associations.
Abstract: CircRNAs have particular biological structures and have been proven to play important roles in diseases. Identifying circRNA-disease associations by biological experiments is time-consuming and costly, so it is appealing to develop computational methods for predicting them. In this study, we propose a new computational path-weighted method (PWCDA) for predicting circRNA-disease associations. First, we calculate functional similarity scores of diseases based on disease-related gene annotations and semantic similarity scores of circRNAs based on circRNA-related Gene Ontology annotations, respectively. To address missing similarity scores of diseases and circRNAs, we calculate Gaussian Interaction Profile (GIP) kernel similarity scores for diseases and circRNAs, respectively, based on the circRNA-disease associations downloaded from the circR2Disease database (http://bioinfo.snnu.edu.cn/CircR2Disease/). Then, we integrate the disease functional similarity scores and circRNA semantic similarity scores with their related GIP kernel similarity scores to construct a heterogeneous network made up of three sub-networks: a disease similarity network, a circRNA similarity network and a circRNA-disease association network. Finally, we compute an association score for each circRNA-disease pair based on the paths connecting them in the heterogeneous network to determine whether the pair is associated. We adopt leave-one-out cross validation (LOOCV) and five-fold cross validation to evaluate the performance of the proposed method. In addition, three common diseases, Breast Cancer, Gastric Cancer and Colorectal Cancer, are used for case studies. Experimental results illustrate the reliability and usefulness of our computational method in terms of different validation measures, which indicates PWCDA can effectively predict potential circRNA-disease associations.
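
The GIP kernel step described above has a simple closed form: two entities are compared through their binary interaction profiles, with the kernel bandwidth normalized by the mean squared profile norm. A minimal pure-Python sketch (function and variable names are ours, not from the PWCDA code):

```python
import math

def gip_kernel(assoc, gamma_prime=1.0):
    """Gaussian Interaction Profile (GIP) kernel similarity.

    assoc: binary association matrix as a list of rows; row i is the
    interaction profile IP(i) of entity i (e.g. a disease's 0/1 vector
    over circRNAs). Returns the pairwise kernel matrix K with
    K[i][j] = exp(-gamma * ||IP(i) - IP(j)||^2), where gamma is
    gamma_prime divided by the mean squared profile norm.
    """
    n = len(assoc)
    sq_norms = [sum(v * v for v in row) for row in assoc]
    gamma = gamma_prime / (sum(sq_norms) / n)  # bandwidth normalization
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(assoc[i], assoc[j]))
            K[i][j] = math.exp(-gamma * d2)
    return K
```

Because the profiles are binary, the squared distance is just the number of positions where the two profiles disagree, so identical profiles score 1.0 and the score decays with each mismatch.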

Proceedings ArticleDOI
15 Feb 2018
TL;DR: This paper proposes LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasize the asymmetric relation of lexical entailment, also known as the IS-A or hyponymy-hypernymy relation.
Abstract: We present LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of lexical entailment (LE), also known as the IS-A or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of word vectors to reflect the actual WordNet-style hierarchy of concepts. Simultaneously, a joint objective enforces semantic similarity using the symmetric cosine distance, yielding a vector space specialised for both lexical relations at once. LEAR specialisation achieves state-of-the-art performance in the tasks of hypernymy directionality, hypernymy detection, and graded lexical entailment, demonstrating the effectiveness and robustness of the proposed asymmetric specialisation model.
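
The asymmetric measure described above can be illustrated with a small sketch: a symmetric cosine term plus a norm-difference penalty, so that the distance from a hyponym (smaller norm) to its hypernym is lower than in the reverse direction. This is a simplified illustration of the idea, not the authors' exact formulation:

```python
import math

def lear_distance(x, y, lam=1.0):
    """Asymmetric LEAR-style distance (a sketch, assuming that the
    specialisation procedure has assigned smaller norms to hyponyms):
    symmetric cosine distance plus a signed norm penalty that encodes
    the IS-A direction."""
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    cos_dist = 1.0 - sum(a * b for a, b in zip(x, y)) / (nx * ny)
    # Negative when x (the candidate hyponym) has the smaller norm.
    return cos_dist + lam * (nx - ny) / (nx + ny)
```

With two vectors pointing in the same direction, the cosine term vanishes and the norm term alone decides the entailment direction, which is why the measure can score hypernymy directionality.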

Journal ArticleDOI
TL;DR: Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities, and that the tool can be integrated into bioinformatics pipelines for large-scale calculations.
Abstract: Measuring the semantic similarity between Gene Ontology (GO) terms is an essential step in functional bioinformatics research. We implemented a software tool named GOGO for calculating the semantic similarity between GO terms. GOGO has the advantages of both information-content-based and hybrid methods, such as Resnik's and Wang's methods. Moreover, GOGO is relatively fast and does not need to calculate information content (IC) from a large gene annotation corpus, but still has the advantage of using IC. This is achieved by considering the number of child nodes in the GO directed acyclic graphs when calculating the semantic contribution that an ancestor node gives to its descendant nodes. GOGO can calculate functional similarities between genes and then cluster genes based on their functional similarities. Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities. We release GOGO as a web server and also as a stand-alone tool, which allows convenient execution for a small number of GO terms or integration into bioinformatics pipelines for large-scale calculations. GOGO can be freely accessed or downloaded from http://dna.cs.miami.edu/GOGO/.
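
To make the ancestor-contribution idea concrete, here is a sketch of Wang-style semantic values over a toy GO DAG, using a fixed edge weight; GOGO instead derives the weight from the number of child nodes, as the abstract describes. All names and the toy DAG are illustrative:

```python
def semantic_values(term, parents, weight=0.8):
    """Wang-style semantic contribution scores for a GO term.
    parents: dict mapping each term to its list of parent terms.
    Returns the contribution of every ancestor (and the term itself,
    which contributes 1.0) to the term's semantics, propagating the
    best weighted contribution up the DAG."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        t = frontier.pop()
        for p in parents.get(t, []):
            contrib = weight * s[t]
            if contrib > s.get(p, 0.0):
                s[p] = contrib
                frontier.append(p)
    return s

def wang_similarity(t1, t2, parents, weight=0.8):
    """Similarity = shared ancestors' contributions over the total."""
    s1 = semantic_values(t1, parents, weight)
    s2 = semantic_values(t2, parents, weight)
    shared = set(s1) & set(s2)
    overlap = sum(s1[a] + s2[a] for a in shared)
    total = sum(s1.values()) + sum(s2.values())
    return overlap / total
```

Two siblings that share only a common parent get a modest score, while a term compared with itself scores 1.0, because every ancestor contribution is shared.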

Journal ArticleDOI
TL;DR: Novel quantitative analyses are applied to a large volume of empirical data to confirm the hypothesis that, among all psychoactive substances, hallucinogen drugs elicit experiences with the highest semantic similarity to those of dreams.
Abstract: Ever since the modern rediscovery of psychedelic substances by Western society, several authors have independently proposed that their effects bear a high resemblance to the dreams and dreamlike experiences occurring naturally during the sleep-wake cycle. Recent studies in humans have provided neurophysiological evidence supporting this hypothesis. However, a rigorous comparative analysis of the phenomenology ("what it feels like" to experience these states) is currently lacking. We investigated the semantic similarity between a large number of subjective reports of psychoactive substances and reports of high/low lucidity dreams, and found that the highest-ranking substance in terms of similarity to high lucidity dreams was the serotonergic psychedelic lysergic acid diethylamide (LSD), whereas the highest-ranking in terms of similarity to dreams of low lucidity were plants of the Datura genus, rich in deliriant tropane alkaloids. Conversely, sedatives, stimulants, antipsychotics, and antidepressants comprised most of the lowest-ranking substances. An analysis of the most frequent words in the subjective reports of dreams and hallucinogens revealed that terms associated with perception ("see," "visual," "face," "reality," "color"), emotion ("fear"), setting ("outside," "inside," "street," "front," "behind") and relatives ("mom," "dad," "brother," "parent," "family") were the most prevalent across both experiences. In summary, we applied novel quantitative analyses to a large volume of empirical data to confirm the hypothesis that, among all psychoactive substances, hallucinogen drugs elicit experiences with the highest semantic similarity to those of dreams. Our results and the associated methodological developments open the way to study the comparative phenomenology of different altered states of consciousness and its relationship with non-invasive measurements of brain physiology.
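
Report-to-report semantic similarity of the kind used above can be illustrated, in a much-simplified form, with tf-idf weighted cosine similarity over bag-of-words reports. The study's actual pipeline is more elaborate; this sketch only conveys the general idea, and all names are ours:

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Cosine similarity between two tokenized documents under a
    tf-idf weighting computed from the given corpus of token lists.
    Words appearing in every document carry zero idf and are dropped."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    def vec(doc):
        tf = Counter(doc)
        return {w: tf[w] * math.log(n / df[w]) for w in tf if df[w] < n}

    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[w] * vb[w] for w in va if w in vb)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Reports sharing distinctive vocabulary (e.g. perception words like "see" or "color") score higher than reports with no overlap, which is the basic mechanism behind ranking substances by similarity to dream reports.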

Journal ArticleDOI
TL;DR: A formulation for multilabel learning, from the perspective of cross-view learning, that explores the correlations between the input and the output, and jointly learns a semantic common subspace and view-specific mappings within one framework.
Abstract: Embedding methods have shown promising performance in multilabel prediction, as they are able to discover label dependence. However, most methods ignore the correlations between the input and the output, so their learned embeddings are not well aligned, which degrades prediction performance. This paper presents a formulation for multilabel learning, from the perspective of cross-view learning, that explores the correlations between the input and the output. The proposed method, called Co-Embedding (CoE), jointly learns a semantic common subspace and view-specific mappings within one framework. The semantic similarity structure among the embeddings is further preserved, ensuring that close embeddings share similar labels. Additionally, CoE conducts multilabel prediction through a cross-view k-nearest-neighbor (kNN) search among the learned embeddings, which significantly reduces computational costs compared with conventional decoding schemes. A hashing-based model, Co-Hashing (CoH), is further proposed. CoH is based on CoE and imposes a binary constraint on the continuous latent embeddings. CoH aims to generate compact binary representations that improve prediction efficiency by benefiting from efficient kNN search over multiple labels in the Hamming space. Extensive experiments on various real-world data sets demonstrate the superiority of the proposed methods over the state of the art in terms of both prediction accuracy and efficiency.
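
The efficiency argument for CoH's binary codes rests on Hamming-space kNN being cheap: with codes packed into integers, distance is an XOR plus a popcount. A minimal sketch (illustrative of the general technique, not CoH's implementation):

```python
def hamming_knn(query_code, label_codes, k=3):
    """Return the indices of the k codes nearest to query_code in
    Hamming distance. Codes are plain Python ints; XOR exposes the
    differing bits and counting them gives the distance."""
    dists = [(bin(query_code ^ c).count("1"), i)
             for i, c in enumerate(label_codes)]
    dists.sort()  # sort by (distance, index)
    return [i for _, i in dists[:k]]
```

Compared with decoding a continuous embedding back into a label vector, this search needs only integer operations, which is what makes compact binary representations attractive at prediction time.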

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This paper proposes an approach to constrain the summary length by extending a convolutional sequence-to-sequence model, and shows that it generates high-quality summaries of user-defined length, outperforming the baselines consistently in terms of ROUGE score, length variation and semantic similarity.
Abstract: Convolutional neural networks (CNNs) have met great success in abstractive summarization, but they cannot effectively generate summaries of desired lengths. Because generated summaries are used in different scenarios that may have space or length constraints, the ability to control summary length in abstractive summarization is an important problem. In this paper, we propose an approach to constrain the summary length by extending a convolutional sequence-to-sequence model. The results show that this approach generates high-quality summaries with user-defined length, and outperforms the baselines consistently in terms of ROUGE score, length variation and semantic similarity.

Proceedings ArticleDOI
10 Apr 2018
TL;DR: NEQA is presented, a continuous learning paradigm for KB-QA that periodically re-trains its underlying models, allowing it to adapt to the language used after deployment; its viability is demonstrated experimentally.
Abstract: Translating natural language questions to semantic representations such as SPARQL is a core challenge in open-domain question answering over knowledge bases (KB-QA). Existing methods rely on a clear separation between an offline training phase, where a model is learned, and an online phase where this model is deployed. Two major shortcomings of such methods are that (i) they require access to a large annotated training set that is not always readily available and (ii) they fail on questions from previously unseen domains. To overcome these limitations, this paper presents NEQA, a continuous learning paradigm for KB-QA. Offline, NEQA automatically learns templates mapping syntactic structures to semantic ones from a small number of training question-answer pairs. Once deployed, continuous learning is triggered on cases where templates are insufficient. Using a semantic similarity function between questions and by judicious invocation of non-expert user feedback, NEQA learns new templates that capture previously unseen syntactic structures. This way, NEQA gradually extends its template repository. NEQA periodically re-trains its underlying models, allowing it to adapt to the language used after deployment. Our experiments demonstrate NEQA's viability, with steady improvement in answering quality over time, and the ability to answer questions from new domains.