
Showing papers on "Semantic similarity" published in 2019


Proceedings ArticleDOI
14 Aug 2019
TL;DR: Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity, is presented.
Abstract: BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, they require that both sentences be fed into the network, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.

4,020 citations
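
To make the embed-then-compare workflow concrete, here is a minimal sketch assuming the publicly available sentence-transformers package and an example model name ("all-MiniLM-L6-v2"); neither is named in the abstract, and the sentences are purely illustrative.

# Sketch: derive fixed-size sentence embeddings and rank candidates by cosine similarity.
# Assumes the sentence-transformers package; the model name is only an example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "A man is playing a guitar."
corpus = [
    "Someone is performing music on a stringed instrument.",
    "A chef is preparing a meal in the kitchen.",
    "The musician strums an acoustic guitar on stage.",
]

# One forward pass per sentence (not per pair), which is what makes large-scale search feasible.
query_vec = model.encode([query])[0]
corpus_vecs = model.encode(corpus)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = sorted(((cosine(query_vec, v), s) for v, s in zip(corpus_vecs, corpus)), reverse=True)
for score, sentence in scores:
    print(f"{score:.3f}  {sentence}")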


Posted Content
TL;DR: Experimental results show that ERNIE outperforms other baseline methods, achieving new state-of-the-art results on five Chinese natural language processing tasks including natural language inference, semantic similarity, named entity recognition, sentiment analysis and question answering.
Abstract: We present a novel language representation model enhanced by knowledge called ERNIE (Enhanced Representation through kNowledge IntEgration). Inspired by the masking strategy of BERT, ERNIE is designed to learn language representation enhanced by knowledge masking strategies, which include entity-level masking and phrase-level masking. The entity-level strategy masks entities, which are usually composed of multiple words. The phrase-level strategy masks the whole phrase, which is composed of several words standing together as a conceptual unit. Experimental results show that ERNIE outperforms other baseline methods, achieving new state-of-the-art results on five Chinese natural language processing tasks including natural language inference, semantic similarity, named entity recognition, sentiment analysis and question answering. We also demonstrate that ERNIE has more powerful knowledge inference capacity on a cloze test.

668 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: A new word replacement order determined by both the word saliency and the classification probability is introduced, and a greedy algorithm called probability weighted word saliency (PWWS) is proposed for text adversarial attack.
Abstract: We address the problem of adversarial attacks on text classification, which is rarely studied compared to attacks on image classification. The challenge of this task is to generate adversarial examples that maintain lexical correctness, grammatical correctness and semantic similarity. Based on the synonym substitution strategy, we introduce a new word replacement order determined by both the word saliency and the classification probability, and propose a greedy algorithm called probability weighted word saliency (PWWS) for text adversarial attack. Experiments on three popular datasets using convolutional as well as LSTM models show that PWWS reduces the classification accuracy to the greatest extent while keeping a very low word substitution rate. A human evaluation study shows that our generated adversarial examples maintain semantic similarity well and are hard for humans to perceive. Performing adversarial training using our perturbed datasets improves the robustness of the models. Finally, our method also exhibits good transferability of the generated adversarial examples.

501 citations
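
A schematic sketch of the greedy, saliency-weighted substitution loop outlined above; classifier_prob, get_synonyms and the combined score are hypothetical stand-ins rather than the authors' exact PWWS formulation.

# Schematic sketch of a saliency-weighted synonym-substitution attack (not the paper's exact algorithm).
# classifier_prob(words, label) -> probability of the true label; get_synonyms(word) -> candidate list.
def attack(words, true_label, classifier_prob, get_synonyms, unk="<unk>"):
    base = classifier_prob(words, true_label)

    scored = []
    for i, w in enumerate(words):
        # Word saliency: probability drop when the word is masked out.
        saliency = base - classifier_prob(words[:i] + [unk] + words[i + 1:], true_label)

        # Best synonym: the replacement that lowers the true-label probability the most.
        best_syn, best_drop = w, 0.0
        for syn in get_synonyms(w):
            drop = base - classifier_prob(words[:i] + [syn] + words[i + 1:], true_label)
            if drop > best_drop:
                best_syn, best_drop = syn, drop

        # Combined score: here simply saliency * drop; the paper uses a softmax-weighted variant.
        scored.append((saliency * best_drop, i, best_syn))

    # Greedily apply substitutions in descending score order until the prediction flips.
    adversarial = list(words)
    for _, i, syn in sorted(scored, reverse=True):
        adversarial[i] = syn
        if classifier_prob(adversarial, true_label) < 0.5:  # sketch assumes a binary classifier
            break
    return adversarial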




Proceedings ArticleDOI
25 Feb 2019
TL;DR: This work presents two limitations of existing ranking-motivated structured losses, proposes a novel ranked list loss to solve both of them, and learns a hypersphere for each class in order to preserve the similarity structure inside it.
Abstract: The objective of deep metric learning (DML) is to learn embeddings that can capture semantic similarity information among data points. Existing pairwise or tripletwise loss functions used in DML are known to suffer from slow convergence due to a large proportion of trivial pairs or triplets as the model improves. To improve this, ranking-motivated structured losses have recently been proposed to incorporate multiple examples and exploit the structured information among them. They converge faster and achieve state-of-the-art performance. In this work, we present two limitations of existing ranking-motivated structured losses and propose a novel ranked list loss to solve both of them. First, given a query, only a fraction of data points is incorporated to build the similarity structure. Consequently, some useful examples are ignored and the structure is less informative. To address this, we propose to build a set-based similarity structure by exploiting all instances in the gallery. The samples are split into a positive set and a negative set. Our objective is to make the query closer to the positive set than to the negative set by a margin. Second, previous methods aim to pull positive pairs as close as possible in the embedding space. As a result, the intra-class data distribution might be lost. In contrast, we propose to learn a hypersphere for each class in order to preserve the similarity structure inside it. Our extensive experiments show that the proposed method achieves state-of-the-art performance on three widely used benchmarks.

230 citations
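
As a rough formalisation of the set-based objective described above (a hedged reconstruction with generic notation, not the paper's exact loss), one margin-based variant can be written as:

\[
  L_{\mathrm{P}}(q) = \sum_{x \in \mathcal{P}_q} \big[\, d(q, x) - (\alpha - m) \,\big]_{+},
  \qquad
  L_{\mathrm{N}}(q) = \sum_{x \in \mathcal{N}_q} \big[\, \alpha - d(q, x) \,\big]_{+},
  \qquad
  L(q) = L_{\mathrm{P}}(q) + \lambda \, L_{\mathrm{N}}(q),
\]

where $d(\cdot,\cdot)$ is the embedding distance, $\mathcal{P}_q$ and $\mathcal{N}_q$ are the positive and negative sets for query $q$, $\alpha$ is the boundary for negatives, $m$ is the margin (so positives are pulled inside a hypersphere of radius $\alpha - m$), and $[\,\cdot\,]_{+} = \max(\cdot, 0)$.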


Journal ArticleDOI
17 Jul 2019
TL;DR: This work proposes to learn embeddings that can capture the similarity between entities in different knowledge graphs, achieving consistent improvements of over 50% over the baseline models in terms of hits@1 on the entity alignment task.
Abstract: The task of entity alignment between knowledge graphs aims to find entities in two knowledge graphs that represent the same real-world entity. Recently, embedding-based models have been proposed for this task. Such models are built on top of a knowledge graph embedding model that learns entity embeddings to capture the semantic similarity between entities in the same knowledge graph. We propose to learn embeddings that can capture the similarity between entities in different knowledge graphs. Our proposed model helps align entities from different knowledge graphs, and hence enables the integration of multiple knowledge graphs. Our model exploits large numbers of attribute triples existing in the knowledge graphs and generates attribute character embeddings. The attribute character embedding shifts the entity embeddings from two knowledge graphs into the same space by computing the similarity between entities based on their attributes. We use a transitivity rule to further enrich the number of attributes of an entity to enhance the attribute character embedding. Experiments using real-world knowledge bases show that our proposed model achieves consistent improvements over the baseline models by over 50% in terms of hits@1 on the entity alignment task.

214 citations


Journal ArticleDOI
17 Jul 2019
TL;DR: A connected system consisting of three homogeneous neural semantic matching models that conduct document retrieval, sentence selection, and claim verification jointly for fact extraction and verification is presented.
Abstract: The increasing concern with misinformation has stimulated research efforts on automatic fact checking. The recently released FEVER dataset introduced a benchmark fact-verification task in which a system is asked to verify a claim using evidential sentences from Wikipedia documents. In this paper, we present a connected system consisting of three homogeneous neural semantic matching models that conduct document retrieval, sentence selection, and claim verification jointly for fact extraction and verification. For evidence retrieval (document retrieval and sentence selection), unlike traditional vector space IR models in which queries and sources are matched in some pre-designed term vector space, we develop neural models to perform deep semantic matching from raw textual input, assuming no intermediate term representation and no access to structured external knowledge bases. We also show that Pageview frequency can help improve the performance of evidence retrieval; the retrieved results can later be matched by our neural semantic matching network. For claim verification, unlike previous approaches that simply feed upstream retrieved evidence and the claim to a natural language inference (NLI) model, we further enhance the NLI model by providing it with internal semantic relatedness scores (hence integrating it with the evidence retrieval modules) and ontological WordNet features. Experiments on the FEVER dataset indicate that (1) our neural semantic matching method outperforms popular TF-IDF and encoder models by significant margins on all evidence retrieval metrics, (2) the additional relatedness score and WordNet features improve the NLI model via better semantic awareness, and (3) by formalizing all three subtasks as a similar semantic matching problem and improving on all three stages, the complete model is able to achieve the state-of-the-art results on the FEVER test set (two times greater than baseline results).

192 citations


Proceedings ArticleDOI
01 Jul 2019
TL;DR: This work introduces methods based on sentence mover’s similarity, and finds that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries and human-authored essays.
Abstract: For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover’s similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences) and human-authored essays (average length of 7.5). We also show that sentence mover’s similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.

148 citations
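
A simplified numpy sketch in the spirit of a sentence-level mover's score: each sentence is matched to its most similar counterpart in the other document, a relaxation of the full optimal-transport formulation; embed_sentence is a hypothetical sentence-embedding function.

# Simplified, relaxed sketch of a sentence mover's style score (not the paper's exact metric):
# each sentence in one document is matched to its most similar sentence in the other.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def relaxed_sentence_mover_score(doc_a, doc_b, embed_sentence):
    """doc_a, doc_b: lists of sentences; embed_sentence: hypothetical sentence -> vector function."""
    A = [embed_sentence(s) for s in doc_a]
    B = [embed_sentence(s) for s in doc_b]
    # For each sentence, take the best match on the other side; average both directions.
    a_to_b = np.mean([max(cosine(a, b) for b in B) for a in A])
    b_to_a = np.mean([max(cosine(b, a) for a in A) for b in B])
    return (a_to_b + b_to_a) / 2.0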


Journal ArticleDOI
TL;DR: This paper proposes a novel strategy to exploit the semantic similarity of the training data and design an efficient generative adversarial framework to learn binary hash codes in an unsupervised manner, and achieves comparable performance with popular supervised hashing methods.
Abstract: Hashing plays a pivotal role in nearest-neighbor searching for large-scale image retrieval. Recently, deep learning-based hashing methods have achieved promising performance. However, most of these deep methods involve discriminative models, which require large-scale, labeled training datasets, thus hindering their real-world applications. In this paper, we propose a novel strategy to exploit the semantic similarity of the training data and design an efficient generative adversarial framework to learn binary hash codes in an unsupervised manner. Specifically, our model consists of three different neural networks: an encoder network to learn hash codes from images, a generative network to generate images from hash codes, and a discriminative network to distinguish between pairs of hash codes and images. By adversarially training these networks, we successfully learn mutually coherent encoder and generative networks, and can output efficient hash codes from the encoder network. We also propose a novel strategy, which utilizes both feature and neighbor similarities, to construct a semantic similarity matrix, then use this matrix to guide the hash code learning process. Integrating the supervision of this semantic similarity matrix into the adversarial learning framework can efficiently preserve the semantic information of training data in Hamming space. The experimental results on three widely used benchmarks show that our method not only significantly outperforms several state-of-the-art unsupervised hashing methods, but also achieves comparable performance with popular supervised hashing methods.

146 citations
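
An illustrative numpy sketch of one way to build a binary semantic similarity matrix from feature similarities and mutual nearest neighbours; the thresholds and the feature extractor are placeholders, not the paper's exact construction.

# Sketch: pseudo similarity matrix S for unsupervised hashing -- pairs are "similar" (1) if their
# feature cosine similarity is high or they are mutual k-nearest neighbours, else "dissimilar" (-1).
import numpy as np

def build_similarity_matrix(features, k=5, tau=0.8):
    """features: (n, d) array of image features from any pre-trained extractor (placeholder)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    cos = f @ f.T                                        # pairwise cosine similarities
    n = cos.shape[0]

    # k-nearest-neighbour mask (excluding self, which has similarity 1.0).
    nn_idx = np.argsort(-cos, axis=1)[:, 1:k + 1]
    knn = np.zeros_like(cos, dtype=bool)
    rows = np.repeat(np.arange(n), k)
    knn[rows, nn_idx.ravel()] = True
    mutual_knn = knn & knn.T

    S = np.where((cos >= tau) | mutual_knn, 1.0, -1.0)   # guide for hash-code learning
    np.fill_diagonal(S, 1.0)
    return S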


Journal ArticleDOI
TL;DR: This paper proposes a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing.
Abstract: Hashing methods have been widely used for efficient similarity retrieval on large scale image database. Traditional hashing methods learn hash functions to generate binary codes from hand-crafted features, which achieve limited accuracy since the hand-crafted features cannot optimally represent the image content and preserve the semantic similarity. Recently, several deep hashing methods have shown better performance because the deep architectures generate more discriminative feature representations. However, these deep hashing methods are mainly designed for supervised scenarios, which only exploit the semantic similarity information, but ignore the underlying data structures. In this paper, we propose the semi-supervised deep hashing approach, to perform more effective hash function learning by simultaneously preserving semantic similarity and underlying data structures. The main contributions are as follows: 1) We propose a semi-supervised loss to jointly minimize the empirical error on labeled data, as well as the embedding error on both labeled and unlabeled data, which can preserve the semantic similarity and capture the meaningful neighbors on the underlying data structures for effective hashing. 2) A semi-supervised deep hashing network is designed to extensively exploit both labeled and unlabeled data, in which we propose an online graph construction method to benefit from the evolving deep features during training to better capture semantic neighbors. To the best of our knowledge, the proposed deep network is the first deep hashing method that can perform hash code learning and feature learning simultaneously in a semi-supervised fashion. Experimental results on five widely-used data sets show that our proposed approach outperforms the state-of-the-art hashing methods.

140 citations


Journal ArticleDOI
TL;DR: This article analyzed the nonlinear dynamics of learning in deep linear networks and found exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, and changing patterns of inductive projection over development.
Abstract: An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.

Proceedings ArticleDOI
01 Aug 2019
TL;DR: A significant improvement in the correlation of YiSi-1's scores with human judgment is obtained by using contextual embeddings from multilingual BERT (Bidirectional Encoder Representations from Transformers) to evaluate lexical semantic similarity.
Abstract: We present YiSi, a unified automatic semantic machine translation quality evaluation and estimation metric for languages with different levels of available resources. Underneath the interface with different language resource settings, YiSi uses the same representation for the two sentences in assessment. In addition, we show that a significant improvement in the correlation of YiSi-1's scores with human judgment is achieved by using contextual embeddings from multilingual BERT (Bidirectional Encoder Representations from Transformers) to evaluate lexical semantic similarity. YiSi is open source and publicly available.

Journal ArticleDOI
TL;DR: A sentiment classification model is proposed that uses a semantic similarity measure in combination with embedding representations and can improve Sentiment Analysis performance over a strong baseline, with the improvement being statistically significant.
Abstract: Lexical resources are widely popular in the field of Sentiment Analysis, as they represent a resource that directly encodes sentimental knowledge. Usually, sentiment lexica are used for polarity estimation through the matching of words contained in a text and their associated lexicon sentiment polarities. Nevertheless, such resources have limitations in vocabulary coverage and domain adaptation. Besides, many recent techniques exploit the concept of distributed semantics, normally through word embeddings. In this work, a semantic similarity metric is computed between text words and lexica vocabulary. Using this metric, this paper proposes a sentiment classification model that uses the semantic similarity measure in combination with embedding representations. In order to assess the effectiveness of this model, we perform an extensive evaluation. Experiments show that the proposed method can improve Sentiment Analysis performance over a strong baseline, this improvement being statistically significant. Finally, some characteristics of the proposed technique are studied, showing that the selection of lexicon words has an effect on cross-dataset performance.
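
A minimal sketch of the underlying idea of matching text words to lexicon entries through embedding similarity; the lexicon, vectors, threshold and scoring rule are illustrative stand-ins, not the paper's model.

# Sketch: score a text by mapping each word to its most similar lexicon entry in embedding space
# and accumulating that entry's polarity, weighted by the similarity (illustrative only).
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def lexicon_sentiment(tokens, word_vectors, lexicon, threshold=0.5):
    """word_vectors: dict word -> vector; lexicon: dict word -> polarity in [-1, 1]."""
    score, matched = 0.0, 0
    lex_items = [(w, word_vectors[w], pol) for w, pol in lexicon.items() if w in word_vectors]
    for tok in tokens:
        if tok not in word_vectors:
            continue
        sim, pol = max(((cosine(word_vectors[tok], v), p) for _, v, p in lex_items),
                       default=(0.0, 0.0))
        if sim >= threshold:          # only count sufficiently similar lexicon matches
            score += sim * pol
            matched += 1
    return score / matched if matched else 0.0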

Journal ArticleDOI
TL;DR: A novel approach is presented that uses a recently introduced method for quantifying the semantic context of speech and relates it to a commonly used method for indexing low-level auditory encoding of speech, suggesting a mechanism that links top-down prior information with bottom-up sensory processing in the context of natural, narrative speech listening.
Abstract: Speech perception involves the integration of sensory input with expectations based on the context of that speech. Much debate surrounds the issue of whether or not prior knowledge feeds back to affect early auditory encoding in the lower levels of the speech processing hierarchy, or whether perception can be best explained as a purely feedforward process. Although there has been compelling evidence on both sides of this debate, experiments involving naturalistic speech stimuli to address these questions have been lacking. Here, we use a recently introduced method for quantifying the semantic context of speech and relate it to a commonly used method for indexing low-level auditory encoding of speech. The relationship between these measures is taken to be an indication of how semantic context leading up to a word influences how its low-level acoustic and phonetic features are processed. We record EEG from human participants (both male and female) listening to continuous natural speech and find that the early cortical tracking of a word's speech envelope is enhanced by its semantic similarity to its sentential context. Using a forward modeling approach, we find that prediction accuracy of the EEG signal also shows the same effect. Furthermore, this effect shows distinct temporal patterns of correlation depending on the type of speech input representation (acoustic or phonological) used for the model, implicating a top-down propagation of information through the processing hierarchy. These results suggest a mechanism that links top-down prior information with the early cortical entrainment of words in natural, continuous speech.SIGNIFICANCE STATEMENT During natural speech comprehension, we use semantic context when processing information about new incoming words. However, precisely how the neural processing of bottom-up sensory information is affected by top-down context-based predictions remains controversial. We address this discussion using a novel approach that indexes a word's similarity to context and how well a word's acoustic and phonetic features are processed by the brain at the time of its utterance. We relate these two measures and show that lower-level auditory tracking of speech improves for words that are more related to their preceding context. These results suggest a mechanism that links top-down prior information with bottom-up sensory processing in the context of natural, narrative speech listening.

Proceedings ArticleDOI
01 Jun 2019
TL;DR: A new deep unsupervised hashing model, called DistillHash, is proposed, which can learn a distilled data set in which data pairs have confident similarity signals; data pairs whose labels are consistent with those assigned by the optimal Bayesian classifier can be potentially distilled.
Abstract: Due to storage and search efficiency, hashing has become significantly prevalent for nearest neighbor search. Particularly, deep hashing methods have greatly improved the search performance, typically under supervised scenarios. In contrast, unsupervised deep hashing models can hardly achieve satisfactory performance due to the lack of supervisory similarity signals. To address this problem, in this paper, we propose a new deep unsupervised hashing model, called DistillHash, which can learn a distilled data set, where data pairs have confident similarity signals. Specifically, we investigate the relationship between the initial but noisy similarity signals learned from local structures and the semantic similarity labels assigned by the optimal Bayesian classifier. We show that, under a mild assumption, some data pairs, of which labels are consistent with those assigned by the optimal Bayesian classifier, can be potentially distilled. With this understanding, we design a simple but effective method to distill data pairs automatically and further adopt a Bayesian learning framework to learn hashing functions from the distilled data set. Extensive experimental results on three widely used benchmark datasets demonstrate that our method achieves state-of-the-art search performance.

Journal ArticleDOI
TL;DR: The main objective of the ViSEAGO package is to carry out data mining of biological functions and establish links between the genes involved in a study, in order to facilitate functional Gene Ontology (GO) analysis of complex experimental designs with multiple comparisons of interest.
Abstract: The main objective of the ViSEAGO package is to carry out data mining of biological functions and establish links between the genes involved in a study. We developed ViSEAGO in R to facilitate functional Gene Ontology (GO) analysis of complex experimental designs with multiple comparisons of interest. It allows large-scale datasets to be studied together and GO profiles to be visualized to capture biological knowledge. The acronym stands for three major concepts of the analysis: Visualization, Semantic similarity and Enrichment Analysis of Gene Ontology. It provides access to the latest GO annotations, which are retrieved from the NCBI EntrezGene, Ensembl or Uniprot databases for several species. Using available R packages and novel developments, ViSEAGO extends classical functional GO analysis to focus on functional coherence by aggregating closely related biological themes while studying multiple datasets at once. It provides both a synthetic and a detailed view using interactive functionalities respecting the GO graph structure and ensuring functional coherence supplied by semantic similarity. ViSEAGO has been successfully applied to several datasets from different species with a variety of biological questions. Results can be easily shared between bioinformaticians and biologists, enhancing reporting capabilities while maintaining reproducibility. ViSEAGO is publicly available on https://bioconductor.org/packages/ViSEAGO .

Journal ArticleDOI
TL;DR: This paper presents an open source CirGO (Circular Gene Ontology) software that visualises non-redundant two-level hierarchically structured ontology terms from gene expression data in a 2D space in an informative, comprehensive and intuitive format.
Abstract: Prioritisation of gene ontology terms from differential gene expression analyses in a two-dimensional format remains a challenge with exponentially growing data volumes. Typically, gene ontology terms are represented as tree-maps that enclose all data into a defined space. However, large datasets make this type of visualisation appear cluttered and busy, and often not informative, as some labels are omitted due to space limits, especially when published in two-dimensional (2D) figures. Here we present the open source CirGO (Circular Gene Ontology) software that visualises non-redundant two-level hierarchically structured ontology terms from gene expression data in a 2D space. Gene ontology terms based on statistical significance were summarised with a semantic similarity algorithm and grouped by hierarchical clustering. This software visualises the most enriched gene ontology terms in an informative, comprehensive and intuitive format that is achieved by organising data from the most relevant to the least, as well as by the appropriate use of colours and supporting information. Additionally, CirGO is an easy-to-use software tool that supports researchers with little computational background in presenting their gene ontology data in a publication-ready format. Our easy-to-use open source CirGO Python software package provides biologists with a succinct presentation of terms and functions that are most represented in a specific gene expression data set in a visually appealing 2D format (e.g. for reporting research results in scientific articles). CirGO is freely available at https://github.com/IrinaVKuznetsova/CirGO.git .

Proceedings ArticleDOI
02 Apr 2019
TL;DR: A generative model for a sentence is proposed that uses two latent variables, one intended to represent the syntax of the sentence and the other its semantics; training with multiple losses is shown to achieve better disentanglement between semantic and syntactic representations.
Abstract: We propose a generative model for a sentence that uses two latent variables, with one intended to represent the syntax of the sentence and the other to represent its semantics. We show we can achieve better disentanglement between semantic and syntactic representations by training with multiple losses, including losses that exploit aligned paraphrastic sentences and word-order information. We evaluate our models on standard semantic similarity tasks and novel syntactic similarity tasks. Empirically, we find that the model with the best performing syntactic and semantic representations also gives rise to the most disentangled representations.

Journal ArticleDOI
TL;DR: This work introduces the largest, most detailed and reproducible experimental survey of ontology-based similarity measures (OM) and word embedding (WE) models reported in the literature, based on the evaluation of both families of methods on the same software platform, with the aim of elucidating the state of the problem.

Proceedings ArticleDOI
01 Jul 2019
TL;DR: Experimental results on three public benchmark datasets show that the proposed end-to-end hierarchical attention network, which focuses on learning to represent coherent evidence as well as its semantic relatedness to the claim, outperforms a set of state-of-the-art baselines.
Abstract: Claim verification is generally a task of verifying the veracity of a given claim, which is critical to many downstream applications. It is cumbersome and inefficient for human fact-checkers to find consistent pieces of evidence, from which a solid verdict could be inferred against the claim. In this paper, we propose a novel end-to-end hierarchical attention network focusing on learning to represent coherent evidence as well as its semantic relatedness to the claim. Our model consists of three main components: 1) A coherence-based attention layer embeds coherent evidence considering the claim and sentences from relevant articles; 2) An entailment-based attention layer attends on sentences that can semantically infer the claim on top of the first attention; and 3) An output layer predicts the verdict based on the embedded evidence. Experimental results on three public benchmark datasets show that our proposed model outperforms a set of state-of-the-art baselines.

Journal ArticleDOI
TL;DR: A novel method based on interdependent representations of short texts for determining their degree of semantic similarity and a preprocessing algorithm that chains coreferential named entities together and performs word segmentation to preserve the meaning of phrasal verbs and idioms are presented.
Abstract: We present a novel method based on interdependent representations of short texts for determining their degree of semantic similarity. The method represents each short text as two dense vectors: the former is built using the word-to-word similarity based on pre-trained word vectors, the latter is built using the word-to-word similarity based on external sources of knowledge. We also developed a preprocessing algorithm that chains coreferential named entities together and performs word segmentation to preserve the meaning of phrasal verbs and idioms. We evaluated the proposed method on three popular datasets, namely Microsoft Research Paraphrase Corpus, STS2015 and P4PIN, and obtained state-of-the-art results on all three without using prior knowledge of natural language, e.g., part-of-speech tags or parse tree, which indicates the interdependent representations of short text pairs are effective and efficient for semantic textual similarity tasks.

Book ChapterDOI
11 Apr 2019
TL;DR: This paper combines ideas from “generalised differential privacy” and machine learning techniques for text processing to model privacy for text documents, defining a privacy mechanism that operates at the level of text documents represented as “bags-of-words”.
Abstract: We address the problem of how to “obfuscate” texts by removing stylistic clues which can identify authorship, whilst preserving (as much as possible) the content of the text. In this paper we combine ideas from “generalised differential privacy” and machine learning techniques for text processing to model privacy for text documents. We define a privacy mechanism that operates at the level of text documents represented as “bags-of-words”—these representations are typical in machine learning and contain sufficient information to carry out many kinds of classification tasks including topic identification and authorship attribution (of the original documents). We show that our mechanism satisfies privacy with respect to a metric for semantic similarity, thereby providing a balance between utility, defined by the semantic content of texts, with the obfuscation of stylistic clues. We demonstrate our implementation on a “fan fiction” dataset, confirming that it is indeed possible to disguise writing style effectively whilst preserving enough information and variation for accurate content classification tasks. We refer the reader to our complete paper [15] which contains full proofs and further experimentation details.

Journal ArticleDOI
TL;DR: In this article, a method, OPA2Vec, is proposed to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data.
Abstract: Motivation Ontologies are widely used in biology for data annotation, integration and analysis. In addition to formally structured axioms, ontologies contain meta-data in the form of annotation axioms which provide valuable pieces of information that characterize ontology classes. Annotation axioms commonly used in ontologies include class labels, descriptions or synonyms. Despite being a rich source of semantic information, the ontology meta-data are generally unexploited by ontology-based analysis methods such as semantic similarity measures. Results We propose a novel method, OPA2Vec, to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data. We apply a Word2Vec model that has been pre-trained on either a corpus of abstracts or full-text articles to produce feature vectors from our collected data. We validate our method in two different ways: first, we use the obtained vector representations of proteins in a similarity measure to predict protein-protein interaction on two different datasets. Second, we evaluate our method on predicting gene-disease associations based on phenotype similarity by generating vector representations of genes and diseases using a phenotype ontology, and applying the obtained vectors to predict gene-disease associations using mouse model phenotypes. We demonstrate that OPA2Vec significantly outperforms existing methods for predicting gene-disease associations. Using evidence from mouse models, we apply OPA2Vec to identify candidate genes for several thousand rare and orphan diseases. OPA2Vec can be used to produce vector representations of any biomedical entity given any type of biomedical ontology. Availability and implementation https://github.com/bio-ontology-research-group/opa2vec. Supplementary information Supplementary data are available at Bioinformatics online.
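
A rough gensim sketch of the general idea of training Word2Vec on a corpus that mixes serialized axioms with annotation text; the tiny corpus and parameters are placeholders, and this is not the OPA2Vec pipeline itself.

# Rough sketch: treat serialized ontology axioms and annotation sentences as a text corpus
# and train Word2Vec on it, so classes that co-occur in axioms get nearby vectors.
from gensim.models import Word2Vec

corpus = [
    # Serialized formal axioms (placeholder identifiers).
    ["GO:0006915", "SubClassOf", "GO:0012501"],
    ["GO:0012501", "SubClassOf", "GO:0008219"],
    # Annotation text (labels / descriptions) tokenized into words.
    ["apoptotic", "process", "programmed", "cell", "death"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, epochs=50)
vec = model.wv["GO:0006915"]            # vector representation of an ontology class
print(model.wv.most_similar("GO:0006915", topn=2))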

Journal ArticleDOI
TL;DR: This paper applies the continuous bag-of-words model to learn word embedding representations from a data set of three billion microblogs, and proposes using convolutional neural networks, long short-term memory models and their combination LSTM-CNN, with the learned word embeddings as inputs, to extract traffic-relevant microblogs, comparing them against SVM and multi-layer perceptron baselines based on word vector features.
Abstract: Mining traffic-relevant information from social media data has become an emerging topic due to the real-time and ubiquitous features of social media. In this paper, we focus on a specific problem in social media mining which is to extract traffic-relevant microblogs from Sina Weibo, a Chinese microblogging platform. It is transformed into a machine learning problem of short text classification. First, we apply the continuous bag-of-words model to learn word embedding representations based on a data set of three billion microblogs. Compared to the traditional one-hot vector representation of words, word embedding can capture semantic similarity between words and has been proved effective in natural language processing tasks. Next, we propose using convolutional neural networks (CNNs), long short-term memory (LSTM) models and their combination LSTM-CNN to extract traffic-relevant microblogs with the learned word embeddings as inputs. We compare the proposed methods with competitive approaches, including the support vector machine (SVM) model based on a bag of n-gram features, the SVM model based on word vector features, and the multi-layer perceptron model based on word vector features. Experiments show the effectiveness of the proposed deep learning approaches.
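
A compact Keras sketch of the general pipeline (pre-trained word embeddings feeding a 1D CNN classifier); the vocabulary size, dimensions and the random matrix standing in for CBOW vectors are placeholders, not the paper's configuration.

# Sketch: short-text classifier that feeds pre-trained word embeddings into a 1D CNN.
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim, max_len = 20000, 100, 40
pretrained_matrix = np.random.rand(vocab_size, embedding_dim).astype("float32")  # stand-in for CBOW vectors

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained_matrix),
        trainable=False),                                 # keep the pre-trained vectors fixed
    tf.keras.layers.Conv1D(128, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),       # traffic-relevant vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.build(input_shape=(None, max_len))
model.summary()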

Proceedings ArticleDOI
01 Aug 2019
TL;DR: An event meta-schema is designed to characterize the semantic relatedness of social events, an event-based heterogeneous information network (HIN) integrating information from an external knowledge base is built, and a novel Pair-wise Popularity Graph Convolutional Network (PP-GCN) based fine-grained social event categorization model is proposed.
Abstract: Events happen in the real world and in real time; they can be planned and organized occasions involving multiple people and objects. Social media platforms publish a lot of text messages containing public events with comprehensive topics. However, mining social events is challenging due to the heterogeneous event elements in texts and explicit and implicit social network structures. In this paper, we design an event meta-schema to characterize the semantic relatedness of social events, build an event-based heterogeneous information network (HIN) integrating information from an external knowledge base, and propose a novel Pair-wise Popularity Graph Convolutional Network (PP-GCN) based fine-grained social event categorization model. We propose a Knowledgeable meta-paths Instances based social Event Similarity (KIES) between events and build a weighted adjacent matrix as input to the PP-GCN model. Comprehensive experiments on real data collections are conducted to compare various social event detection and clustering tasks. Experimental results demonstrate that our proposed framework outperforms other alternative social event categorization techniques.

Journal ArticleDOI
TL;DR: An overview of the main approaches that are being used and the advantages of using quantitative measures of semantic distance in creativity research will be discussed.
Abstract: Semantic distance plays an important role in the creative process: the farther one 'moves away' from a conventional idea, the more creative the new idea will likely be. Although intuitive, the role of semantic distance in creativity has been only indirectly examined due to the challenge of its measurement. Recent studies have started applying quantitative measures of semantic distance in creativity research. Such studies complement standard subjective measures of creativity; provide objective measures of the creative output; and also allow researchers to more directly examine the role of semantic memory, and distance, in creativity. An overview of the main approaches that are being used is described, and the advantages of using such quantitative measures in creativity research are discussed.
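
Semantic distance in such studies is typically operationalised as a distance between word vectors; below is a minimal gensim sketch, where the vector file path and word pairs are placeholders.

# Sketch: semantic distance between a cue and a response as cosine distance in word-vector space.
# The vector file path is a placeholder for any pre-trained word2vec-format embeddings.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)

def semantic_distance(word_a, word_b):
    # 1 - cosine similarity: larger values = semantically more distant (potentially more creative).
    return 1.0 - vectors.similarity(word_a, word_b)

print(semantic_distance("table", "chair"))     # conventional association -> smaller distance
print(semantic_distance("table", "volcano"))   # remote association -> larger distance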

Proceedings ArticleDOI
18 Jul 2019
TL;DR: Evaluation results show that table embeddings, utilized in three table-related tasks (row population, column population, and table retrieval), can significantly improve upon the performance of state-of-the-art baselines.
Abstract: Tables contain valuable knowledge in a structured form. We employ neural language modeling approaches to embed tabular data into vector spaces. Specifically, we consider different table elements, such as captions, column headings, and cells, for training word and entity embeddings. These embeddings are then utilized in three particular table-related tasks: row population, column population, and table retrieval, by incorporating them into existing retrieval models as additional semantic similarity signals. Evaluation results show that table embeddings can significantly improve upon the performance of state-of-the-art baselines.

Posted Content
TL;DR: This paper examines two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability, and shows that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality.
Abstract: Pretrained transformer-based language models have achieved state of the art across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We examine two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability. We vary the number of final layers that are fine-tuned, then study the resulting change in task-specific effectiveness. We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality. Surprisingly, we also find that fine-tuning all layers does not always help.
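
A short sketch of the "fine-tune only the last k layers" setup using the HuggingFace transformers library; the model name and k are example values, and the exact freezing pattern is an assumption about one reasonable implementation.

# Sketch: freeze the embeddings and all but the last k transformer layers of BERT,
# so only the final layers (and the classification head) receive gradient updates.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
k = 3  # number of final encoder layers left trainable (example value)

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:-k]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")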

Posted Content
TL;DR: This paper investigates the relationship between the initial noisy similarity signals learned from local structures and the semantic similarity labels assigned by a Bayes optimal classifier, designs a simple yet effective strategy to distill data pairs automatically, and further adopts a Bayesian learning framework to learn hash functions from the distilled data set.
Abstract: Due to the high storage and search efficiency, hashing has become prevalent for large-scale similarity search. Particularly, deep hashing methods have greatly improved the search performance under supervised scenarios. In contrast, unsupervised deep hashing models can hardly achieve satisfactory performance due to the lack of reliable supervisory similarity signals. To address this issue, we propose a novel deep unsupervised hashing model, dubbed DistillHash, which can learn a distilled data set consisting of data pairs that have confident similarity signals. Specifically, we investigate the relationship between the initial noisy similarity signals learned from local structures and the semantic similarity labels assigned by a Bayes optimal classifier. We show that under a mild assumption, some data pairs, of which labels are consistent with those assigned by the Bayes optimal classifier, can be potentially distilled. Inspired by this fact, we design a simple yet effective strategy to distill data pairs automatically and further adopt a Bayesian learning framework to learn hash functions from the distilled data set. Extensive experimental results on three widely used benchmark datasets show that the proposed DistillHash consistently accomplishes the state-of-the-art search performance.

Journal ArticleDOI
TL;DR: This study examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency, and found that global semantic similarity as reported by word2vec was an important predictor of coherence ratings.
Abstract: This article introduces the second version of the Tool for the Automatic Analysis of Cohesion (TAACO 2.0). Like its predecessor, TAACO 2.0 is a freely available text analysis tool that works on the Windows, Mac, and Linux operating systems; is housed on a user's hard drive; is easy to use; and allows for batch processing of text files. TAACO 2.0 includes all the original indices reported for TAACO 1.0, but it adds a number of new indices related to local and global cohesion at the semantic level, reported by latent semantic analysis, latent Dirichlet allocation, and word2vec. The tool also includes a source overlap feature, which calculates lexical and semantic overlap between a source and a response text (i.e., cohesion between the two texts based on measures of text relatedness). In the first study in this article, we examined the effects that cohesion features, prompt, essay elaboration, and enhanced cohesion had on expert ratings of text coherence, finding that global semantic similarity as reported by word2vec was an important predictor of coherence ratings. A second study was conducted to examine the source and response indices. In this study we examined whether source overlap between the speaking samples found in the TOEFL-iBT integrated speaking tasks and the responses produced by test-takers was predictive of human ratings of speaking proficiency. The results indicated that the percentage of keywords found in both the source and response and the similarity between the source document and the response, as reported by word2vec, were significant predictors of speaking quality. Combined, these findings help validate the new indices reported for TAACO 2.0.