scispace - formally typeset

Showing papers on "Semantic similarity published in 2018"


Journal ArticleDOI
TL;DR: A novel model of Inductive Matrix Completion for MiRNA‐Disease Association prediction (IMCMDA) to complete the missing miRNA‐disease association based on the known associations and the integrated miRNA similarity and disease similarity.
Abstract: Motivation It has been shown that microRNAs (miRNAs) play key roles in a variety of biological processes associated with human diseases. In consideration of the cost and complexity of biological experiments, computational methods for predicting potential associations between miRNAs and diseases would be an effective complement. Results This paper presents a novel model of Inductive Matrix Completion for MiRNA-Disease Association prediction (IMCMDA). The integrated miRNA similarity and disease similarity are calculated based on miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity. The main idea is to complete the missing miRNA-disease associations based on the known associations and the integrated miRNA similarity and disease similarity. IMCMDA achieves an AUC of 0.8034 under leave-one-out cross-validation, improving on previous models. In addition, IMCMDA was applied to five common human diseases in three types of case studies. In the first type, 42, 44 and 45 of the top 50 predicted miRNAs for Colon Neoplasms, Kidney Neoplasms and Lymphoma, respectively, were confirmed by experimental reports. In the second type of case study, for new diseases without any known miRNAs, we chose Breast Neoplasms as the test example by hiding the association information between the miRNAs and Breast Neoplasms. As a result, 50 of the top 50 predicted Breast Neoplasms-related miRNAs were verified. In the third type of case study, IMCMDA was tested on HMDD V1.0 to assess its robustness; 49 of the top 50 predicted Esophageal Neoplasms-related miRNAs were verified. Availability and implementation The code and dataset of IMCMDA are freely available at https://github.com/IMCMDAsourcecode/IMCMDA. Supplementary information Supplementary data are available at Bioinformatics online.
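The Gaussian interaction profile (GIP) kernel similarity used above is a standard construction and can be sketched as follows; profiles are rows of the known binary association matrix, and the bandwidth parameter `gamma_prime` is an assumed default, not a value taken from the paper:

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity between binary
    association profiles (rows of the known association matrix)."""
    n = len(profiles)
    # Bandwidth: gamma_prime normalised by the mean squared profile norm.
    mean_sq = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            K[i][j] = math.exp(-gamma * d2)
    return K
```

Entities with identical interaction profiles get similarity 1, and similarity decays with squared Euclidean distance between profiles.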

362 citations


Journal ArticleDOI
TL;DR: A triplet-based deep hashing (TDH) network for cross-modal retrieval using the triplet labels, which describe the relative relationships among three instances as supervision in order to capture more general semantic correlations between cross-modal instances.
Abstract: Given the benefits of its low storage requirements and high retrieval efficiency, hashing has recently received increasing attention. In particular, cross-modal hashing has been widely and successfully used in multimedia similarity search applications. However, almost all existing methods employing cross-modal hashing cannot obtain powerful hash codes because they ignore the relative similarity between heterogeneous data, which contains richer semantic information, leading to unsatisfactory retrieval performance. In this paper, we propose a triplet-based deep hashing (TDH) network for cross-modal retrieval. First, we utilize the triplet labels, which describe the relative relationships among three instances, as supervision in order to capture more general semantic correlations between cross-modal instances. We then establish a loss function from the inter-modal view and the intra-modal view to boost the discriminative abilities of the hash codes. Finally, graph regularization is introduced into our proposed TDH method to preserve the original semantic similarity between hash codes in Hamming space. Experimental results show that our proposed method outperforms several state-of-the-art approaches on two popular cross-modal data sets.
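Triplet supervision of this kind rests on a standard ranking loss, sketched here in plain Python over real-valued embeddings (the hashing-specific machinery and the paper's exact formulation are omitted):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet ranking loss: pull the anchor towards the positive and
    push it away from the negative by at least `margin` (squared L2)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is what encodes "relative" rather than absolute similarity.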

312 citations


Journal ArticleDOI
TL;DR: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News, and the intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings is closer to human experts' judgments on all four tested datasets.

287 citations


Book ChapterDOI
08 Sep 2018
TL;DR: This work focuses on video-language tasks including multimodal retrieval and video QA, and evaluates the JSFusion model in three retrieval and VQA tasks in LSMDC, for which the model achieves the best performance reported so far.
Abstract: We present an approach named JSFusion (Joint Sequence Fusion) that can measure semantic similarity between any pair of multimodal sequence data (e.g. a video clip and a language sentence). Our multimodal matching network consists of two key components. First, the Joint Semantic Tensor composes a dense pairwise representation of two sequence data into a 3D tensor. Then, the Convolutional Hierarchical Decoder computes their similarity score by discovering hidden hierarchical matches between the two sequence modalities. Both modules leverage hierarchical attention mechanisms that learn to promote well-matched representation patterns while pruning out misaligned ones in a bottom-up manner. Although JSFusion is a universal model applicable to any multimodal sequence data, this work focuses on video-language tasks including multimodal retrieval and video QA. We evaluate the JSFusion model in three retrieval and VQA tasks in LSMDC, for which our model achieves the best performance reported so far. We also perform multiple-choice and movie retrieval tasks for the MSR-VTT dataset, on which our approach outperforms many state-of-the-art methods.

282 citations


Journal ArticleDOI
TL;DR: This review presents the state of the art in distributional semantics, focusing on its assets and limits as a model of meaning and as a method for semantic analysis.
Abstract: Distributional semantics is a usage-based model of meaning, based on the assumption that the statistical distribution of linguistic items in context plays a key role in characterizing their semantic behavior. Distributional models build semantic representations by extracting co-occurrences from corpora and have become a mainstream research paradigm in computational linguistics. In this review, I present the state of the art in distributional semantics, focusing on its assets and limits as a model of meaning and as a method for semantic analysis.
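The basic distributional recipe the review surveys — count co-occurrences in a context window, then compare the resulting vectors by cosine — can be sketched in a few lines (a toy illustration, not any specific model from the review):

```python
import math
from collections import Counter

def cooc_vectors(corpus, window=2):
    """Build word vectors from symmetric co-occurrence counts
    within a fixed window over tokenised sentences."""
    vecs = {}
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Words that occur in similar contexts ("cat" and "dog" below) end up with similar vectors, which is the distributional hypothesis in miniature.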

251 citations


Journal ArticleDOI
Wen Zhang1, Xiang Yue1, Weiran Lin1, Wenjian Wu1, Ruoqi Liu1, Feng Huang1, Feng Liu1 
TL;DR: A similarity constrained matrix factorization method (SCMFDD) that makes use of known drug-disease associations, drug features and disease semantic information; a user-friendly web server, built with known associations collected from the CTD database, is available at http://www.bioinfotech.cn/SCMFDD/.
Abstract: Drug-disease associations provide important information for drug discovery. Wet experiments that identify drug-disease associations are time-consuming and expensive. However, many drug-disease associations remain unobserved or unknown. The development of computational methods for predicting unobserved drug-disease associations is an important and urgent task. In this paper, we propose a similarity constrained matrix factorization method for drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarities and disease semantic similarity as constraints for drugs and diseases in the low-rank spaces. Different from the classic matrix factorization technique, SCMFDD takes the biological context of the problem into account. In computational experiments, the proposed method produces high accuracy on benchmark datasets and outperforms existing state-of-the-art prediction methods when evaluated by five-fold cross validation and independent testing. We developed a user-friendly web server using known associations collected from the CTD database, available at http://www.bioinfotech.cn/SCMFDD/ . The case studies show that the server can identify novel associations that are not included in the CTD database.
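The low-rank projection at the heart of such methods can be sketched with plain SGD matrix factorization; this is a minimal illustration, not the SCMFDD implementation, and the similarity-based regularisers the paper adds on the drug and disease factors are deliberately omitted:

```python
import random

def factorize(A, rank=2, lr=0.05, epochs=2000, seed=0):
    """Plain low-rank factorization A ~ U V^T by stochastic gradient
    descent. (SCMFDD additionally constrains U and V with drug/disease
    similarities; those regulariser terms are omitted here.)"""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n)]
    for _ in range(epochs):
        for i in range(m):
            for j in range(n):
                err = A[i][j] - sum(U[i][k] * V[j][k] for k in range(rank))
                for k in range(rank):
                    ui = U[i][k]
                    U[i][k] += lr * err * V[j][k]  # gradient step on U
                    V[j][k] += lr * err * ui       # gradient step on V
    return U, V
```

After fitting, the reconstructed entries U[i]·V[j] score unobserved drug-disease pairs; the similarity constraints would pull the latent rows of similar drugs (or diseases) towards each other.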

169 citations


Journal ArticleDOI
TL;DR: In this work, some novel processes to measure the similarity between picture fuzzy sets are presented, and several similarity measures are developed, such as the cosine similarity measure, weighted cosine similarity measure, set-theoretic similarity measure, weighted set-theoretic cosine similarity measure, grey similarity measure and weighted grey similarity measure.
Abstract: In this work, we present some novel processes to measure the similarity between picture fuzzy sets. Firstly, we adopt the concepts of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Secondly, we develop some similarity measures between picture fuzzy sets, such as the cosine similarity measure, weighted cosine similarity measure, set-theoretic similarity measure, weighted set-theoretic cosine similarity measure, grey similarity measure and weighted grey similarity measure. Then, we apply these similarity measures between picture fuzzy sets to building material recognition and minerals field recognition. Finally, two illustrative examples are given to demonstrate the efficiency of the similarity measures for building material recognition and minerals field recognition.
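The cosine similarity measure between picture fuzzy sets can be sketched directly from the usual definition: each element carries a positive, a neutral and a negative membership degree, and per-element cosines are averaged. The exact weighting details in the paper may differ; this is an illustrative sketch:

```python
import math

def pfs_cosine(A, B):
    """Cosine similarity between two picture fuzzy sets, each given as
    a list of (positive, neutral, negative) membership triples."""
    total = 0.0
    for (ua, ea, va), (ub, eb, vb) in zip(A, B):
        dot = ua * ub + ea * eb + va * vb
        na = math.sqrt(ua * ua + ea * ea + va * va)
        nb = math.sqrt(ub * ub + eb * eb + vb * vb)
        total += dot / (na * nb) if na and nb else 0.0
    return total / len(A)
```

A weighted variant would replace the uniform 1/n average with per-element weights summing to one.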

158 citations


Proceedings ArticleDOI
13 Jul 2018
TL;DR: This work designs a deep architecture and a pair-wise loss function to preserve the constructed semantic structure between points in Hamming space under unsupervised settings, and shows that SSDH significantly outperforms current state-of-the-art methods.
Abstract: Hashing is becoming increasingly popular for approximate nearest neighbor searching in massive databases due to its storage and search efficiency. Recent supervised hashing methods, which usually construct semantic similarity matrices to guide hash code learning using label information, have shown promising results. However, it is relatively difficult to capture and utilize the semantic relationships between points in unsupervised settings. To address this problem, we propose a novel unsupervised deep framework called Semantic Structure-based unsupervised Deep Hashing (SSDH). We first empirically study the deep feature statistics, and find that the distribution of the cosine distance for point pairs can be estimated by two half Gaussian distributions. Based on this observation, we construct the semantic structure by considering points with distances obviously smaller than the others as semantically similar and points with distances obviously larger than the others as semantically dissimilar. We then design a deep architecture and a pair-wise loss function to preserve this semantic structure in Hamming space. Extensive experiments show that SSDH significantly outperforms current state-of-the-art methods.
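The pseudo-labeling step described above can be sketched as a toy: pairs whose cosine distance is clearly small are marked similar, clearly large pairs dissimilar, and ambiguous pairs ignored. The thresholds here are fixed by hand; SSDH itself estimates them by fitting two half-Gaussian distributions to the distance histogram:

```python
def semantic_structure(dist, t_sim, t_dis):
    """Pseudo-label point pairs from a cosine-distance matrix:
    clearly small -> similar (+1), clearly large -> dissimilar (-1),
    in between -> ignored (0). SSDH derives t_sim/t_dis from two
    half-Gaussians; fixed thresholds are used here for illustration."""
    n = len(dist)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if dist[i][j] <= t_sim:
                S[i][j] = 1
            elif dist[i][j] >= t_dis:
                S[i][j] = -1
    return S
```

The resulting matrix S then plays the role of the supervised similarity matrix in the pair-wise hashing loss.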

131 citations


Proceedings ArticleDOI
01 Jul 2018
TL;DR: A novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains, and demonstrates great capability in handling multi-domain dialogues.
Abstract: Robust dialogue belief tracking is a key component in maintaining good quality dialogue systems. The tasks that dialogue systems are trying to solve are becoming increasingly complex, requiring scalability to multi-domain, semantically rich dialogues. However, most current approaches have difficulty scaling up with domains because of the dependency of the model parameters on the dialogue ontology. In this paper, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains. The evaluation is performed on a recently collected multi-domain dialogues dataset, one order of magnitude larger than currently available corpora. Our model demonstrates great capability in handling multi-domain dialogues, simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.

118 citations


Proceedings Article
Haiyun Guo1, Chaoyang Zhao1, Zhiwei Liu1, Jinqiao Wang1, Hanqing Lu1 
27 Apr 2018
TL;DR: This paper learns a structured feature embedding for vehicle re-ID with a novel coarse-to-fine ranking loss to pull images of the same vehicle as close as possible and achieve discrimination between images from different vehicles as well as vehicles from different vehicle models.
Abstract: Vehicle re-identification (re-ID) is to identify the same vehicle across different cameras. It’s a significant but challenging topic, which has received little attention due to the complex intra-class and inter-class variation of vehicle images and the lack of large-scale vehicle re-ID dataset. Previous methods focus on pulling images from different vehicles apart but neglect the discrimination between vehicles from different vehicle models, which is actually quite important to obtain a correct ranking order for vehicle re-ID. In this paper, we learn a structured feature embedding for vehicle re-ID with a novel coarse-to-fine ranking loss to pull images of the same vehicle as close as possible and achieve discrimination between images from different vehicles as well as vehicles from different vehicle models. In the learnt feature space, both intra-class compactness and inter-class distinction are well guaranteed and the Euclidean distance between features directly reflects the semantic similarity of vehicle images. Furthermore, we build so far the largest vehicle re-ID dataset "Vehicle-1M," which involves nearly 1 million images captured in various surveillance scenarios. Experimental results on "Vehicle-1M" and "VehicleID" demonstrate the superiority of our proposed approach.

117 citations


Proceedings ArticleDOI
20 Apr 2018
TL;DR: The authors presented a novel approach to learn representations for sentence-level semantic similarity using conversational data, which achieved the best performance among all neural models on the Semantic Textual Similarity (STS) Benchmark and SemEval 2017's Community Question Answering (CQA) question similarity subtask.
Abstract: We present a novel approach to learn representations for sentence-level semantic similarity using conversational data. Our method trains an unsupervised model to predict conversational responses. The resulting sentence embeddings perform well on the Semantic Textual Similarity (STS) Benchmark and SemEval 2017’s Community Question Answering (CQA) question similarity subtask. Performance is further improved by introducing multitask training, combining conversational response prediction and natural language inference. Extensive experiments show the proposed model achieves the best performance among all neural models on the STS Benchmark and is competitive with the state-of-the-art feature engineered and mixed systems for both tasks.

Posted Content
TL;DR: Notably, this simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
Abstract: An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: what are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep learning dynamics to give rise to these regularities.
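To give a flavour of what these exact solutions look like (reconstructed from the deep linear network literature rather than quoted from this paper, so treat the precise constants as an assumption): for a two-layer linear network trained on an input-output correlation matrix with singular value decomposition $\Sigma^{yx} = U S V^\top$, the effective strength $a_\alpha(t)$ of each singular mode follows a sigmoidal trajectory of the form

```latex
a_\alpha(t) \;=\; \frac{s_\alpha\, e^{2 s_\alpha t/\tau}}{e^{2 s_\alpha t/\tau} - 1 + s_\alpha / a_0},
```

where $s_\alpha$ is the mode's singular value, $a_0 \ll s_\alpha$ its small initial strength, and $\tau$ a learning time constant. Because the transition time scales like $\tau / s_\alpha$, strong modes are learned first, which is what produces the stage-like developmental transitions and hierarchical differentiation described above.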

Proceedings ArticleDOI
10 Apr 2018
TL;DR: This work introduces and addresses the problem of ad hoc table retrieval: answering a keyword query with a ranked list of tables, representing queries and tables in multiple semantic spaces and introducing various similarity measures for matching those semantic representations.
Abstract: We introduce and address the problem of ad hoc table retrieval: answering a keyword query with a ranked list of tables. This task is not only interesting on its own account, but is also being used as a core component in many other table-based information access scenarios, such as table completion or table mining. The main novel contribution of this work is a method for performing semantic matching between queries and tables. Specifically, we (i) represent queries and tables in multiple semantic spaces (both discrete sparse and continuous dense vector representations) and (ii) introduce various similarity measures for matching those semantic representations. We consider all possible combinations of semantic representations and similarity measures and use these as features in a supervised learning model. Using a purpose-built test collection based on Wikipedia tables, we demonstrate significant and substantial improvements over a state-of-the-art baseline.
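As a minimal illustration (not the paper's actual feature set) of matching semantic representations with similarity measures, one can compare query and table term embeddings by early fusion (cosine of centroids) and late fusion (aggregates of all pairwise cosines), and feed the results to a supervised ranker:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def matching_features(query_vecs, table_vecs):
    """Early fusion: cosine of centroids. Late fusion: aggregates of
    all pairwise term cosines. All become ranking features."""
    pairwise = [cosine(q, t) for q in query_vecs for t in table_vecs]
    return {
        "early": cosine(centroid(query_vecs), centroid(table_vecs)),
        "late_max": max(pairwise),
        "late_avg": sum(pairwise) / len(pairwise),
    }
```

Crossing several semantic spaces (sparse and dense) with several such measures yields the feature grid the paper describes.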

Journal ArticleDOI
TL;DR: The Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies, has the potential to significantly outperform the state‐of‐the‐art in several predictive applications in which ontologies are involved.
Abstract: Motivation Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications. Results We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the Gene Ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly improve prediction of protein-protein interactions in human and yeast. We then illustrate how Onto2Vec representations provide the means for constructing data-driven, trainable semantic similarity measures that can be used to identify particular relations between proteins. Finally, we use an unsupervised clustering approach to identify protein families based on their Enzyme Commission numbers. Our results demonstrate that Onto2Vec can generate high quality feature vectors from biological entities and ontologies. Onto2Vec has the potential to significantly outperform the state-of-the-art in several predictive applications in which ontologies are involved. Availability and implementation https://github.com/bio-ontology-research-group/onto2vec. Supplementary information Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
08 May 2018-PLOS ONE
TL;DR: The supervised link prediction approach proved to be promising for potential DDI prediction and may facilitate the identification of potential DDIs in clinical research.
Abstract: Drug-drug interaction (DDI) is a change in the effect of a drug when a patient takes another drug. Characterizing DDIs is extremely important to avoid potential adverse drug reactions. We represent DDIs as a complex network in which nodes refer to drugs and links refer to their potential interactions. Recently, the problem of link prediction has attracted much consideration in the scientific community. We represent the process of link prediction as a binary classification task on networks of potential DDIs. We use link prediction techniques for predicting unknown interactions between drugs in five arbitrarily chosen large-scale DDI databases, namely DrugBank, KEGG, NDF-RT, SemMedDB, and Twosides. We estimated the performance of link prediction using a series of experiments on DDI networks. We performed link prediction using unsupervised and supervised approaches, including classification tree, k-nearest neighbors, support vector machine, random forest, and gradient boosting machine classifiers based on topological and semantic similarity features. The supervised approach clearly outperforms the unsupervised approach. The Twosides network yielded the best prediction performance regarding the area under the precision-recall curve (0.93 for both random forest and gradient boosting machine). The applied methodology can be used as a tool to help researchers identify potential DDIs. The supervised link prediction approach proved to be promising for potential DDI prediction and may facilitate the identification of potential DDIs in clinical research.
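The topological similarity features such link predictors feed to classifiers can be illustrated with a few classics (common neighbours, Jaccard coefficient, preferential attachment); this is a generic sketch, not the paper's exact feature set:

```python
def link_features(adj, u, v):
    """Topological similarity features for a candidate drug pair (u, v)
    in a DDI network given as an adjacency dict of neighbour sets."""
    cn = adj[u] & adj[v]                # common neighbours
    un = adj[u] | adj[v]
    return {
        "common_neighbors": len(cn),
        "jaccard": len(cn) / len(un) if un else 0.0,
        "preferential_attachment": len(adj[u]) * len(adj[v]),
    }
```

Each known (positive) and sampled unknown (negative) pair becomes one feature vector, turning link prediction into the binary classification task the paper describes.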

Journal ArticleDOI
TL;DR: This paper introduces Distributed Dictionary Representations (DDR), a method that applies psychological dictionaries using semantic similarity rather than word counts, which allows for the measurement of the similarity between dictionaries and spans of text.
Abstract: Theory-driven text analysis has made extensive use of psychological concept dictionaries, leading to a wide range of important results. These dictionaries have generally been applied through word count methods which have proven to be both simple and effective. In this paper, we introduce Distributed Dictionary Representations (DDR), a method that applies psychological dictionaries using semantic similarity rather than word counts. This allows for the measurement of the similarity between dictionaries and spans of text ranging from complete documents to individual words. We show how DDR enables dictionary authors to place greater emphasis on construct validity without sacrificing linguistic coverage. We further demonstrate the benefits of DDR on two real-world tasks and finally conduct an extensive study of the interaction between dictionary size and task performance. These studies allow us to examine how DDR and word count methods complement one another as tools for applying concept dictionaries and where each is best applied. Finally, we provide references to tools and resources to make this method both available and accessible to a broad psychological audience.
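The core DDR idea — replacing word counts with the cosine between a dictionary's mean vector and a text's mean vector — can be sketched as follows (the function name and toy embeddings are illustrative, not the authors' code):

```python
import math

def ddr_similarity(text_tokens, dictionary, embeddings):
    """DDR-style score: cosine between the mean embedding of the
    dictionary words and the mean embedding of the text tokens,
    instead of counting dictionary hits."""
    def mean_vec(words):
        vecs = [embeddings[w] for w in words if w in embeddings]
        n = len(vecs)  # assumes at least one word is in the vocabulary
        return [sum(v[k] for v in vecs) / n for k in range(len(vecs[0]))]
    u, v = mean_vec(dictionary), mean_vec(text_tokens)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Unlike word counts, the score is non-zero even when the text shares no literal words with the dictionary, which is what lets small, high-validity dictionaries keep broad linguistic coverage.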

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This article proposes a simple yet effective approach for incorporating side information in the form of distributional constraints over the generated responses, which help generate more content-rich responses based on a model of syntax and topics.
Abstract: Neural conversation models tend to generate safe, generic responses for most inputs. This is due to the limitations of likelihood-based decoding objectives in generation tasks with diverse outputs, such as conversation. To address this challenge, we propose a simple yet effective approach for incorporating side information in the form of distributional constraints over the generated responses. We propose two constraints that help generate more content-rich responses, based on a model of syntax and topics (Griffiths et al., 2005) and semantic similarity (Arora et al., 2016). We evaluate our approach against a variety of competitive baselines, using both automatic metrics and human judgments, showing that our proposed approach generates responses that are much less generic without sacrificing plausibility. A working demo of our code can be found at https://github.com/abaheti95/DC-NeuralConversation.

Journal ArticleDOI
TL;DR: A computational model of Random Forest for miRNA-disease association (RFMDA) prediction based on machine learning is developed, and the results of cross-validation and case studies indicate that RFMDA is a reliable model for predicting miRNA-disease associations.
Abstract: Since the first microRNA (miRNA) was discovered, many studies have confirmed associations between miRNAs and human complex diseases. Moreover, obtaining and taking advantage of association information between miRNAs and diseases plays an increasingly important role in improving the treatment of complex diseases. However, due to the high cost of traditional experimental methods, many researchers have proposed different computational methods to predict potential associations between miRNAs and diseases. In this work, we developed a computational model of Random Forest for miRNA-disease association (RFMDA) prediction based on machine learning. The training sample set for RFMDA was constructed according to the human microRNA disease database (HMDD) version (v.)2.0, and the feature vectors to represent miRNA-disease samples were defined by integrating miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity. The Random Forest algorithm was first employed to infer miRNA-disease associations. In addition, a filter-based method was implemented to select robust features from the miRNA-disease feature set, which could efficiently distinguish related miRNA-disease pairs from unrelated ones. RFMDA achieved areas under the curve (AUCs) of 0.8891, 0.8323, and 0.8818 ± 0.0014 under global leave-one-out cross-validation, local leave-one-out cross-validation, and 5-fold cross-validation, respectively, which were higher than many previous computational models. To further evaluate the accuracy of RFMDA, we carried out three types of case studies for four human complex diseases. As a result, 43 (esophageal neoplasms), 46 (lymphoma), 47 (lung neoplasms), and 48 (breast neoplasms) of the top 50 predicted disease-related miRNAs were verified by experiments in different kinds of case studies. The results of cross-validation and case studies indicate that RFMDA is a reliable model for predicting miRNA-disease associations.

Journal ArticleDOI
01 Mar 2018
TL;DR: A System for Integrating Semantic Relatedness and similarity measures, SISR, which aims to provide a variety of tools for computing the semantic similarity and relatedness and is the first to treat the topic of computing semantic relatedness with a view of integrating different key stakeholders in a parameterized way.
Abstract: Semantic similarity and relatedness measures have increasingly become core elements of recent research within the semantic technology community. Nowadays, the search for efficient meaning-centered applications that exploit computational semantics has become a necessity. Researchers have therefore become increasingly interested in developing models that can simulate the human thinking process and are capable of measuring semantic similarity/relatedness between lexical terms, including concepts and words. Knowledge resources are fundamental to quantify semantic similarity or relatedness and to achieve the best expression of the semantic content. No fully developed system able to centralize these approaches is currently available to the research and industrial communities. In this paper, we propose a System for Integrating Semantic Relatedness and similarity measures, SISR, which aims to provide a variety of tools for computing semantic similarity and relatedness. This system is the first to treat the topic of computing semantic relatedness with a view to integrating different key stakeholders in a parameterized way. As an instance of the proposed architecture, we propose WNetSS, a Java API allowing the use of a wide range of WordNet-based semantic similarity measures pertaining to different categories, including taxonomic-based, features-based and IC-based measures. It is the first API that allows the extraction of the topological parameters from the WordNet "is a" taxonomy, which are used to express the semantics of concepts. Moreover, an evaluation module is proposed to assess the reproducibility of the measures' accuracy, which can be evaluated against 10 widely used benchmarks through correlation coefficients.
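As a concrete example of the taxonomic measures such a system exposes, here is a Wu-Palmer-style similarity over a toy "is a" taxonomy (the toy taxonomy and function names are illustrative, not the WNetSS API):

```python
def ancestors(taxonomy, node):
    """Walk up a child -> parent 'is a' dict (root maps to None);
    returns [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = taxonomy[node]
    return chain

def wu_palmer(taxonomy, a, b):
    """Wu-Palmer similarity: 2*depth(LCS) / (depth(a) + depth(b)),
    counting the root as depth 1 (assumes a tree-shaped taxonomy)."""
    pa, pb = ancestors(taxonomy, a), ancestors(taxonomy, b)
    bset = set(pb)
    lcs = next(n for n in pa if n in bset)   # deepest common subsumer
    d_lcs = len(pa) - pa.index(lcs)          # depth of the subsumer
    return 2.0 * d_lcs / (len(pa) + len(pb))
```

Concepts sharing a deep common subsumer ("dog" and "cat" under "animal") score higher than concepts that only meet at the root, which is exactly the kind of topological parameter extracted from the "is a" taxonomy.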

Proceedings ArticleDOI
31 Jul 2018
TL;DR: This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings, obtained with a novel training method that introduces hard negatives consisting of sentences that are not translations but have some degree of semantic similarity.
Abstract: This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. This is achieved using a novel training method that introduces hard negatives consisting of sentences that are not translations but have some degree of semantic similarity. The quality of the resulting embeddings is evaluated on parallel corpus reconstruction and by assessing machine translation systems trained on gold vs. mined sentence pairs. We find that the sentence embeddings can be used to reconstruct the United Nations Parallel Corpus (Ziemski et al., 2016) at the sentence level with a precision of 48.9% for en-fr and 54.9% for en-es. When adapted to document-level matching, we achieve a parallel document matching accuracy that is comparable to the significantly more computationally intensive approach of Uszkoreit et al. (2010). Using reconstructed parallel data, we are able to train NMT models that perform nearly as well as models trained on the original data (within 1-2 BLEU).
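The hard-negative training signal can be illustrated with a simple margin-based ranking loss (a generic sketch; the paper's actual objective may differ): the true translation pair must outscore the hardest non-translation candidate by a margin.

```python
def hard_negative_margin_loss(sim_pos, sim_negs, margin=0.3):
    """Ranking loss for bilingual sentence embeddings: the translation
    pair's similarity `sim_pos` should exceed the hardest (highest
    scoring) negative's similarity by at least `margin`."""
    hardest = max(sim_negs)
    return max(0.0, margin - sim_pos + hardest)
```

Mining semantically close non-translations as negatives makes `hardest` large, so the model is forced to separate "similar meaning" from "actual translation".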

Journal ArticleDOI
TL;DR: This work proposes the SCSNED method for disambiguation based on semantic similarity between contextual words and informative words of entities in KGs, and proposes a Category2Vec embedding model based on joint learning of word and category embeddings, in order to compute word-category similarity for entity disambiguation.
Abstract: With the increasing popularity of large scale Knowledge Graph (KG)s, many applications such as semantic analysis, search and question answering need to link entity mentions in texts to entities in KGs. Because of the polysemy problem in natural language, entity disambiguation is thus a key problem in current research. Existing disambiguation methods have considered entity prominence, context similarity and entity-entity relatedness to discriminate ambiguous entities, which are mainly working on document or paragraph level texts containing rich contextual information, and based on lexical matching for computing context similarity. When meeting short texts containing limited contextual information, such as web queries, questions and tweets, those conventional disambiguation methods are not good at handling single entity mention and measuring context similarity. In order to enhance the performance of disambiguation methods based on context similarity with such short texts, we propose SCSNED method for disambiguation based on semantic similarity between contextual words and informative words of entities in KGs. Specially, we exploit the effectiveness of both knowledge-based and corpus-based semantic similarity methods for entity disambiguation with SCSNED. Moreover, we propose a Category2Vec embedding model based on joint learning of word and category embedding, in order to compute word-category similarity for entity disambiguation. We show the effectiveness of these proposed methods with illustrative examples, and evaluate their effectiveness in a comparative experiment for entity disambiguation in real world web queries, questions and tweets. The experimental results have identified the effectiveness of different semantic similarity methods, and demonstrated the improvement of semantic similarity methods in SCSNED and Category2Vec over the conventional context similarity baseline. 
We further compare the proposed approaches with state-of-the-art entity disambiguation systems and show that their performance is among the best. In addition, an important feature of the proposed approaches is their potential applicability to any existing KG, since they mainly use common features of entity descriptions and categories. Another contribution of the paper is an updated survey of the background of entity disambiguation in KGs and of semantic similarity methods.

Posted Content
TL;DR: In this article, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and ontology terms, allowing information to be shared across domains; it demonstrates great capability in handling multi-domain dialogues while simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.
Abstract: Robust dialogue belief tracking is a key component in maintaining good quality dialogue systems. The tasks that dialogue systems are trying to solve are becoming increasingly complex, requiring scalability to multi-domain, semantically rich dialogues. However, most current approaches have difficulty scaling to new domains because the model parameters depend on the dialogue ontology. In this paper, a novel approach is introduced that fully utilizes semantic similarity between dialogue utterances and the ontology terms, allowing the information to be shared across domains. The evaluation is performed on a recently collected multi-domain dialogue dataset, one order of magnitude larger than currently available corpora. Our model demonstrates great capability in handling multi-domain dialogues, simultaneously outperforming existing state-of-the-art models in single-domain dialogue tracking tasks.

Journal ArticleDOI
TL;DR: This work develops end-to-end architectures directly tailored to the task of mapping a disease mention to a concept in a controlled vocabulary, typically the standard thesaurus in the Unified Medical Language System (UMLS), together with additional semantic similarity features based on UMLS.

Journal ArticleDOI
TL;DR: Experimental results illustrate the reliability and usefulness of the computational method in terms of different validation measures, indicating that PWCDA can effectively predict potential circRNA-disease associations.
Abstract: CircRNAs have particular biological structures and have been proven to play important roles in diseases. Identifying circRNA-disease associations by biological experiments is time-consuming and costly, so it is appealing to develop computational methods for predicting them. In this study, we propose a new computational path-weighted method (PWCDA) for predicting circRNA-disease associations. First, we calculate functional similarity scores of diseases based on disease-related gene annotations and semantic similarity scores of circRNAs based on circRNA-related Gene Ontology annotations, respectively. To address missing similarity scores of diseases and circRNAs, we calculate Gaussian Interaction Profile (GIP) kernel similarity scores for diseases and circRNAs, respectively, based on the circRNA-disease associations downloaded from the circR2Disease database (http://bioinfo.snnu.edu.cn/CircR2Disease/). Then, we integrate the disease functional similarity scores and circRNA semantic similarity scores with their related GIP kernel similarity scores to construct a heterogeneous network made up of three sub-networks: a disease similarity network, a circRNA similarity network and a circRNA-disease association network. Finally, we compute an association score for each circRNA-disease pair based on the paths connecting them in the heterogeneous network to determine whether the pair is associated. We adopt leave-one-out cross validation (LOOCV) and five-fold cross validation to evaluate the performance of the proposed method. In addition, three common diseases, Breast Cancer, Gastric Cancer and Colorectal Cancer, are used for case studies. Experimental results illustrate the reliability and usefulness of our computational method in terms of different validation measures, which indicates PWCDA can effectively predict potential circRNA-disease associations.
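
The GIP kernel step described above has a simple closed form: two entities are compared through their binary interaction profiles, with the kernel bandwidth normalized by the mean squared profile norm. A minimal pure-Python sketch (function and variable names are ours, not from the PWCDA code):

```python
import math

def gip_kernel(assoc, gamma_prime=1.0):
    """Gaussian Interaction Profile (GIP) kernel similarity.

    assoc: binary association matrix as a list of rows; row i is the
    interaction profile IP(i) of entity i (e.g. a disease's 0/1 vector
    over circRNAs). Returns the pairwise kernel matrix K with
    K[i][j] = exp(-gamma * ||IP(i) - IP(j)||^2), where gamma is
    gamma_prime divided by the mean squared profile norm.
    """
    n = len(assoc)
    sq_norms = [sum(v * v for v in row) for row in assoc]
    gamma = gamma_prime / (sum(sq_norms) / n)  # bandwidth normalization
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(assoc[i], assoc[j]))
            K[i][j] = math.exp(-gamma * d2)
    return K
```

Because the profiles are binary, the squared distance is just the number of positions where the two profiles disagree, so identical profiles score 1.0 and the score decays with each mismatch.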

Proceedings ArticleDOI
15 Feb 2018
TL;DR: This paper proposes LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasize the asymmetric relation of lexical entailment, also known as the IS-A or hyponymy-hypernymy relation.
Abstract: We present LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of lexical entailment (LE), also known as the IS-A or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of word vectors to reflect the actual WordNet-style hierarchy of concepts. Simultaneously, a joint objective enforces semantic similarity using the symmetric cosine distance, yielding a vector space specialised for both lexical relations at once. LEAR specialisation achieves state-of-the-art performance in the tasks of hypernymy directionality, hypernymy detection, and graded lexical entailment, demonstrating the effectiveness and robustness of the proposed asymmetric specialisation model.
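
The asymmetric measure described above can be illustrated with a small sketch: a symmetric cosine term plus a norm-difference penalty, so that the distance from a hyponym (smaller norm) to its hypernym is lower than in the reverse direction. This is a simplified illustration of the idea, not the authors' exact formulation:

```python
import math

def lear_distance(x, y, lam=1.0):
    """Asymmetric LEAR-style distance (a sketch, assuming that the
    specialisation procedure has assigned smaller norms to hyponyms):
    symmetric cosine distance plus a signed norm penalty that encodes
    the IS-A direction."""
    nx = math.sqrt(sum(v * v for v in x))
    ny = math.sqrt(sum(v * v for v in y))
    cos_dist = 1.0 - sum(a * b for a, b in zip(x, y)) / (nx * ny)
    # Negative when x (the candidate hyponym) has the smaller norm.
    return cos_dist + lam * (nx - ny) / (nx + ny)
```

With two vectors pointing in the same direction, the cosine term vanishes and the norm term alone decides the entailment direction, which is why the measure can score hypernymy directionality.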

Journal ArticleDOI
TL;DR: Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities, and that the tool can be integrated into bioinformatics pipelines for large-scale calculations.
Abstract: Measuring the semantic similarity between Gene Ontology (GO) terms is an essential step in functional bioinformatics research. We implemented a software tool named GOGO for calculating the semantic similarity between GO terms. GOGO has the advantages of both information-content-based and hybrid methods, such as Resnik's and Wang's methods. Moreover, GOGO is relatively fast and does not need to calculate information content (IC) from a large gene annotation corpus, but still has the advantage of using IC. This is achieved by considering the number of child nodes in the GO directed acyclic graphs when calculating the semantic contribution that an ancestor node gives to its descendant nodes. GOGO can calculate functional similarities between genes and then cluster genes based on their functional similarities. Evaluations performed on multiple pathways retrieved from the Saccharomyces Genome Database (SGD) show that GOGO can accurately and robustly cluster genes based on functional similarities. We release GOGO as a web server and also as a stand-alone tool, which allows convenient execution for a small number of GO terms or integration into bioinformatics pipelines for large-scale calculations. GOGO can be freely accessed or downloaded from http://dna.cs.miami.edu/GOGO/.
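
To make the ancestor-contribution idea concrete, here is a sketch of Wang-style semantic values over a toy GO DAG, using a fixed edge weight; GOGO instead derives the weight from the number of child nodes, as the abstract describes. All names and the toy DAG are illustrative:

```python
def semantic_values(term, parents, weight=0.8):
    """Wang-style semantic contribution scores for a GO term.
    parents: dict mapping each term to its list of parent terms.
    Returns the contribution of every ancestor (and the term itself,
    which contributes 1.0) to the term's semantics, propagating the
    best weighted contribution up the DAG."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        t = frontier.pop()
        for p in parents.get(t, []):
            contrib = weight * s[t]
            if contrib > s.get(p, 0.0):
                s[p] = contrib
                frontier.append(p)
    return s

def wang_similarity(t1, t2, parents, weight=0.8):
    """Similarity = shared ancestors' contributions over the total."""
    s1 = semantic_values(t1, parents, weight)
    s2 = semantic_values(t2, parents, weight)
    shared = set(s1) & set(s2)
    overlap = sum(s1[a] + s2[a] for a in shared)
    total = sum(s1.values()) + sum(s2.values())
    return overlap / total
```

Two siblings that share only a common parent get a modest score, while a term compared with itself scores 1.0, because every ancestor contribution is shared.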

Journal ArticleDOI
TL;DR: Novel quantitative analyses are applied to a large volume of empirical data to confirm the hypothesis that, among all psychoactive substances, hallucinogen drugs elicit experiences with the highest semantic similarity to those of dreams.
Abstract: Ever since the modern rediscovery of psychedelic substances by Western society, several authors have independently proposed that their effects bear a high resemblance to the dreams and dreamlike experiences occurring naturally during the sleep-wake cycle. Recent studies in humans have provided neurophysiological evidence supporting this hypothesis. However, a rigorous comparative analysis of the phenomenology ("what it feels like" to experience these states) is currently lacking. We investigated the semantic similarity between a large number of subjective reports of psychoactive substances and reports of high/low lucidity dreams, and found that the highest-ranking substance in terms of similarity to high lucidity dreams was the serotonergic psychedelic lysergic acid diethylamide (LSD), whereas the highest-ranking in terms of similarity to dreams of low lucidity were plants of the Datura genus, rich in deliriant tropane alkaloids. Conversely, sedatives, stimulants, antipsychotics, and antidepressants comprised most of the lowest-ranking substances. An analysis of the most frequent words in the subjective reports of dreams and hallucinogens revealed that terms associated with perception ("see," "visual," "face," "reality," "color"), emotion ("fear"), setting ("outside," "inside," "street," "front," "behind") and relatives ("mom," "dad," "brother," "parent," "family") were the most prevalent across both experiences. In summary, we applied novel quantitative analyses to a large volume of empirical data to confirm the hypothesis that, among all psychoactive substances, hallucinogen drugs elicit experiences with the highest semantic similarity to those of dreams. Our results and the associated methodological developments open the way to study the comparative phenomenology of different altered states of consciousness and its relationship with non-invasive measurements of brain physiology.
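
Report-to-report semantic similarity of the kind used above can be illustrated, in a much-simplified form, with tf-idf weighted cosine similarity over bag-of-words reports. The study's actual pipeline is more elaborate; this sketch only conveys the general idea, and all names are ours:

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Cosine similarity between two tokenized documents under a
    tf-idf weighting computed from the given corpus of token lists.
    Words appearing in every document carry zero idf and are dropped."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))

    def vec(doc):
        tf = Counter(doc)
        return {w: tf[w] * math.log(n / df[w]) for w in tf if df[w] < n}

    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[w] * vb[w] for w in va if w in vb)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Reports sharing distinctive vocabulary (e.g. perception words like "see" or "color") score higher than reports with no overlap, which is the basic mechanism behind ranking substances by similarity to dream reports.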

Journal ArticleDOI
TL;DR: A formulation for multilabel learning, from the perspective of cross-view learning, that explores the correlations between the input and the output, and jointly learns a semantic common subspace and view-specific mappings within one framework.
Abstract: Embedding methods have shown promising performance in multilabel prediction, as they are able to discover label dependence. However, most methods ignore the correlations between the input and the output, so their learned embeddings are not well aligned, which degrades prediction performance. This paper presents a formulation for multilabel learning, from the perspective of cross-view learning, that explores the correlations between the input and the output. The proposed method, called Co-Embedding (CoE), jointly learns a semantic common subspace and view-specific mappings within one framework. The semantic similarity structure among the embeddings is further preserved, ensuring that close embeddings share similar labels. Additionally, CoE conducts multilabel prediction through a cross-view k-nearest-neighbor (kNN) search among the learned embeddings, which significantly reduces computational costs compared with conventional decoding schemes. A hashing-based model, Co-Hashing (CoH), is further proposed. CoH is based on CoE and imposes a binary constraint on the continuous latent embeddings. CoH aims to generate compact binary representations that improve prediction efficiency by benefiting from efficient kNN search over multiple labels in the Hamming space. Extensive experiments on various real-world data sets demonstrate the superiority of the proposed methods over the state of the art in terms of both prediction accuracy and efficiency.
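
The efficiency argument for CoH's binary codes rests on Hamming-space kNN being cheap: with codes packed into integers, distance is an XOR plus a popcount. A minimal sketch (illustrative of the general technique, not CoH's implementation):

```python
def hamming_knn(query_code, label_codes, k=3):
    """Return the indices of the k codes nearest to query_code in
    Hamming distance. Codes are plain Python ints; XOR exposes the
    differing bits and counting them gives the distance."""
    dists = [(bin(query_code ^ c).count("1"), i)
             for i, c in enumerate(label_codes)]
    dists.sort()  # sort by (distance, index)
    return [i for _, i in dists[:k]]
```

Compared with decoding a continuous embedding back into a label vector, this search needs only integer operations, which is what makes compact binary representations attractive at prediction time.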

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This paper proposes an approach to constrain the summary length by extending a convolutional sequence-to-sequence model, and shows that it generates high-quality summaries of user-defined length, outperforming the baselines consistently in terms of ROUGE score, length variation and semantic similarity.
Abstract: Convolutional neural networks (CNNs) have met great success in abstractive summarization, but they cannot effectively generate summaries of desired lengths. Because generated summaries are used in different scenarios that may have space or length constraints, the ability to control summary length in abstractive summarization is an important problem. In this paper, we propose an approach to constrain the summary length by extending a convolutional sequence-to-sequence model. The results show that this approach generates high-quality summaries with user-defined length, and outperforms the baselines consistently in terms of ROUGE score, length variation and semantic similarity.

Proceedings ArticleDOI
10 Apr 2018
TL;DR: NEQA is presented, a continuous learning paradigm for KB-QA that periodically re-trains its underlying models, allowing it to adapt to the language used after deployment; its viability is demonstrated experimentally.
Abstract: Translating natural language questions to semantic representations such as SPARQL is a core challenge in open-domain question answering over knowledge bases (KB-QA). Existing methods rely on a clear separation between an offline training phase, where a model is learned, and an online phase where this model is deployed. Two major shortcomings of such methods are that (i) they require access to a large annotated training set that is not always readily available and (ii) they fail on questions from previously unseen domains. To overcome these limitations, this paper presents NEQA, a continuous learning paradigm for KB-QA. Offline, NEQA automatically learns templates mapping syntactic structures to semantic ones from a small number of training question-answer pairs. Once deployed, continuous learning is triggered on cases where templates are insufficient. Using a semantic similarity function between questions and by judicious invocation of non-expert user feedback, NEQA learns new templates that capture previously unseen syntactic structures. This way, NEQA gradually extends its template repository. NEQA periodically re-trains its underlying models, allowing it to adapt to the language used after deployment. Our experiments demonstrate NEQA's viability, with steady improvement in answering quality over time, and the ability to answer questions from new domains.