
Showing papers on "Word embedding" published in 2020


Proceedings Article
01 Apr 2020
TL;DR: A novel adversarial training algorithm is proposed, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples.
Abstract: Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test scores of BERT-base model from 78.3 to 79.4, and RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art single-model test accuracies of 85.44% and 67.75% on ARC-Easy and ARC-Challenge. Experiments on CommonsenseQA benchmark further demonstrate that FreeLB can be generalized and boost the performance of RoBERTa-large model on other tasks as well.

313 citations
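
A minimal sketch of the inner loop, assuming a HuggingFace-style PyTorch model whose forward accepts `inputs_embeds` and `labels` and returns an object with a `.loss`; the step sizes, number of ascent steps, and elementwise projection are illustrative simplifications of the paper's procedure.

```python
import torch

def freelb_step(model, optimizer, input_ids, labels,
                ascent_steps=3, adv_lr=0.1, adv_eps=0.1):
    """One training step with FreeLB-style perturbations (simplified sketch)."""
    delta = None
    optimizer.zero_grad()
    for _ in range(ascent_steps):
        embeds = model.get_input_embeddings()(input_ids)      # (B, T, H)
        if delta is None:
            delta = torch.zeros_like(embeds).uniform_(-adv_eps, adv_eps)
        delta = delta.detach().requires_grad_()
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss
        (loss / ascent_steps).backward()   # accumulate parameter gradients ("free" reuse)
        grad = delta.grad
        # Ascend on the perturbation, then project back into the eps-ball
        # (elementwise clamp here; the paper constrains a norm ball).
        delta = (delta + adv_lr * grad / (grad.norm() + 1e-12)).clamp(-adv_eps, adv_eps)
    optimizer.step()
    return loss.item()
```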


Journal ArticleDOI
TL;DR: This paper reviews the latest studies that have employed deep learning to solve sentiment analysis problems, such as sentiment polarity; models using term frequency-inverse document frequency and word embedding are applied to a series of datasets and their results compared.
Abstract: The study of public opinion can provide us with valuable information. The analysis of sentiment on social networks, such as Twitter or Facebook, has become a powerful means of learning about the users’ opinions and has a wide range of applications. However, the efficiency and accuracy of sentiment analysis is being hindered by the challenges encountered in natural language processing (NLP). In recent years, it has been demonstrated that deep learning models are a promising solution to the challenges of NLP. This paper reviews the latest studies that have employed deep learning to solve sentiment analysis problems, such as sentiment polarity. Models using term frequency-inverse document frequency (TF-IDF) and word embedding have been applied to a series of datasets. Finally, a comparative study has been conducted on the experimental results obtained for the different models and input features.

273 citations


Book ChapterDOI
01 Jan 2020
TL;DR: It is empirically shown that CDA effectively decreases gender bias while preserving accuracy, and it is found that as training proceeds on the original data set with gradient descent the gender bias grows as the loss reduces, indicating that the optimization encourages bias; CDA mitigates this behavior.
Abstract: We examine whether neural natural language processing (NLP) systems reflect historical biases in training data. We define a general benchmark to quantify gender bias in a variety of neural NLP tasks. Our empirical evaluation with state-of-the-art neural coreference resolution and textbook RNN-based language models trained on benchmark data sets finds significant gender bias in how models view occupations. We then mitigate bias with counterfactual data augmentation (CDA): a generic methodology for corpus augmentation via causal interventions that breaks associations between gendered and gender-neutral words. We empirically show that CDA effectively decreases gender bias while preserving accuracy. We also explore the space of mitigation strategies with CDA, a prior approach to word embedding debiasing (WED), and their compositions. We show that CDA outperforms WED, drastically so when word embeddings are trained. For pre-trained embeddings, the two methods can be effectively composed. We also find that as training proceeds on the original data set with gradient descent the gender bias grows as the loss reduces, indicating that the optimization encourages bias; CDA mitigates this behavior.

195 citations
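
A minimal sketch of the counterfactual augmentation step: every sentence gets a copy in which gendered words are swapped via an intervention lexicon; the word pairs below are a tiny illustrative subset, and real interventions must also handle ambiguous forms such as "her".

```python
# Tiny illustrative intervention lexicon (the real lists are far larger and
# must resolve ambiguous forms such as possessive vs. objective "her").
SWAP = {"he": "she", "she": "he", "him": "her", "his": "her",
        "man": "woman", "woman": "man", "actor": "actress", "actress": "actor"}

def counterfactual(sentence):
    """Return the gender-swapped counterfactual of a whitespace-tokenized sentence."""
    return " ".join(SWAP.get(tok, tok) for tok in sentence.lower().split())

corpus = ["he is a doctor", "she works as a nurse"]
augmented = corpus + [counterfactual(s) for s in corpus]
print(augmented)  # originals plus their gender-swapped counterfactuals
```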


Proceedings Article
30 Apr 2020
TL;DR: After the proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching pseudo-fully-supervised translate-train models for Bulgarian and Greek.
Abstract: We propose procedures for evaluating and strengthening contextual embedding alignment and show that they are useful in understanding and improving multilingual BERT. In particular, after our proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching fully-supervised models for Bulgarian and Greek. Further, using non-contextual and contextual versions of word retrieval, we show that BERT outperforms fastText while being able to distinguish between multiple uses of a word, suggesting that pre-training subsumes word vectors for learning cross-lingual signals. Finally, we use the contextual word retrieval task to gain a better understanding of the strengths and weaknesses of multilingual pre-training.

169 citations
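
A minimal sketch of one standard way to align two embedding spaces: learn an orthogonal (Procrustes) rotation from paired anchor vectors. This is a generic alignment recipe under that assumption, not necessarily the paper's exact procedure for contextual embeddings.

```python
import numpy as np

def learn_rotation(src, tgt):
    """Orthogonal map W minimizing ||src @ W.T - tgt|| for paired (n, d) embeddings."""
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt

# Synthetic check: recover a known rotation from 100 paired vectors.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 16))
true_rot, _ = np.linalg.qr(rng.normal(size=(16, 16)))
tgt = src @ true_rot.T
W = learn_rotation(src, tgt)
print(np.allclose(src @ W.T, tgt, atol=1e-6))  # True
```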


Journal ArticleDOI
TL;DR: In this article, a convolutional neural network based on different word embeddings was evaluated and compared to a classification based on user-level linguistic metadata; an ensemble of both approaches achieved state-of-the-art results in a current early detection task.
Abstract: Depression is ranked as the largest contributor to global disability and is also a major reason for suicide. Still, many individuals suffering from forms of depression are not treated for various reasons. Previous studies have shown that depression also has an effect on language usage and that many depressed individuals use social media platforms or the internet in general to get information or discuss their problems. This paper addresses the early detection of depression using machine learning models based on messages on a social platform. In particular, a convolutional neural network based on different word embeddings is evaluated and compared to a classification based on user-level linguistic metadata. An ensemble of both approaches is shown to achieve state-of-the-art results in a current early detection task. Furthermore, the currently popular ERDE score as a metric for early detection systems is examined in detail and its drawbacks in the context of shared tasks are illustrated. A slightly modified metric is proposed and compared to the original score. Finally, a new word embedding was trained on a large corpus of the same domain as the described task and is evaluated as well.

152 citations


Journal ArticleDOI
TL;DR: This study evaluates several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit, and shows that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures.
Abstract: Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.

149 citations
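
A minimal sketch of the embedding-plus-clustering pipeline with scikit-learn, assuming a pre-trained `word_vectors` mapping (word -> vector): documents are represented by averaged word vectors, clustered with k-means, and scored against gold labels with NMI, one of the extrinsic measures discussed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def doc_vector(tokens, word_vectors, dim):
    """Average the word vectors of a document (zero vector if none are known)."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cluster_documents(docs, gold_labels, word_vectors, dim, k):
    """Cluster averaged-embedding document vectors and score the result with NMI."""
    X = np.vstack([doc_vector(d.split(), word_vectors, dim) for d in docs])
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return normalized_mutual_info_score(gold_labels, pred)
```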


Posted Content
Guolin Ke, Di He, Tie-Yan Liu
TL;DR: This work investigates the problems in the previous formulations and proposes a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE), which can achieve a higher score than baselines while only using 30% pre-training computational costs.
Abstract: How to explicitly encode positional information into neural networks is important in learning the representation of natural languages, such as BERT. Based on the Transformer architecture, the positional information is simply encoded as embedding vectors, which are used in the input layer, or encoded as a bias term in the self-attention module. In this work, we investigate the problems in the previous formulations and propose a new positional encoding method for BERT called Transformer with Untied Positional Encoding (TUPE). Different from all other works, TUPE only uses the word embedding as input. In the self-attention module, the word contextual correlation and positional correlation are computed separately with different parameterizations and then added together. This design removes the addition over heterogeneous embeddings in the input, which may potentially bring randomness, and gives more expressiveness to characterize the relationship between words/positions by using different projection matrices. Furthermore, TUPE unties the [CLS] symbol from other positions to provide it with a more specific role to capture the global representation of the sentence. Extensive experiments and ablation studies on GLUE benchmark demonstrate the effectiveness and efficiency of the proposed method: TUPE outperforms several baselines on almost all tasks by a large margin. In particular, it can achieve a higher score than baselines while only using 30% pre-training computational costs. We release our code at this https URL.

133 citations
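
A minimal single-head sketch of the untied attention logits: the word-word and position-position terms use separate projection matrices and are summed, instead of adding position embeddings to the input; the scaling and the omission of the [CLS] untying are illustrative simplifications.

```python
import math
import torch
import torch.nn as nn

class UntiedAttentionScores(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.wq, self.wk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)  # word projections
        self.pq, self.pk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)  # position projections
        self.scale = math.sqrt(2 * d_model)

    def forward(self, word_emb, pos_emb):
        # word_emb: (B, T, d); pos_emb: (T, d) absolute position embeddings
        content = self.wq(word_emb) @ self.wk(word_emb).transpose(-1, -2)   # word-word term
        position = self.pq(pos_emb) @ self.pk(pos_emb).transpose(-1, -2)    # position-position term
        return (content + position) / self.scale                            # (B, T, T) attention logits

scores = UntiedAttentionScores(64)(torch.randn(2, 10, 64), torch.randn(10, 64))
print(scores.shape)  # torch.Size([2, 10, 10])
```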


Proceedings Article
30 Apr 2020
TL;DR: This work shows state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence).
Abstract: We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).

119 citations
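
A minimal sketch of an InfoNCE-style objective of the kind the paper builds on: a global sentence representation and an n-gram representation from the same sentence form a positive pair, with other sentences in the batch as negatives; the encoders that produce the two representations are assumed to exist and are not shown.

```python
import torch
import torch.nn.functional as F

def infonce_loss(sent_repr, ngram_repr, temperature=0.1):
    """sent_repr, ngram_repr: (B, d); row i of each comes from the same sentence."""
    sent = F.normalize(sent_repr, dim=-1)
    ngram = F.normalize(ngram_repr, dim=-1)
    logits = sent @ ngram.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(sent.size(0))           # positives lie on the diagonal
    # Minimizing this cross-entropy maximizes a lower bound on the mutual
    # information between the two views (up to a log-batch-size constant).
    return F.cross_entropy(logits, targets)

loss = infonce_loss(torch.randn(8, 128), torch.randn(8, 128))
```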


Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work proposes a bi-encoder model that independently embeds the target word with its surrounding context and the dictionary definition, or gloss, of each sense, and demonstrates that rare senses can be more effectively disambiguated by modeling their definitions.
Abstract: A major obstacle in Word Sense Disambiguation (WSD) is that word senses are not uniformly distributed, causing existing models to generally perform poorly on senses that are either rare or unseen during training. We propose a bi-encoder model that independently embeds (1) the target word with its surrounding context and (2) the dictionary definition, or gloss, of each sense. The encoders are jointly optimized in the same representation space, so that sense disambiguation can be performed by finding the nearest sense embedding for each target word embedding. Our system outperforms previous state-of-the-art models on English all-words WSD; these gains predominantly come from improved performance on rare senses, leading to a 31.1% error reduction on less frequent senses over prior work. This demonstrates that rare senses can be more effectively disambiguated by modeling their definitions.

119 citations
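
A minimal sketch of the bi-encoder inference step, assuming jointly trained `context_encoder` and `gloss_encoder` callables (placeholders for the paper's encoders): the predicted sense is the one whose gloss embedding scores highest against the target word's contextual embedding.

```python
import torch

def disambiguate(context_encoder, gloss_encoder, sentence_tokens, target_index, glosses):
    """Pick the sense whose gloss embedding is nearest (by dot product) to the target word."""
    word_vec = context_encoder(sentence_tokens)[target_index]        # (d,) contextual embedding
    gloss_vecs = torch.stack([gloss_encoder(g) for g in glosses])    # (num_senses, d)
    scores = gloss_vecs @ word_vec                                   # one score per candidate sense
    return int(scores.argmax())                                      # index of the chosen sense
```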


Journal ArticleDOI
TL;DR: A novel hybrid deep learning model is proposed that strategically combines different word embeddings (Word2Vec, FastText, character-level embedding) with different deep learning methods (LSTM, GRU, BiLSTM, CNN) and classifies texts in terms of sentiment.
Abstract: The massive use of social media platforms such as Twitter and Facebook by all kinds of organizations has increased the amount of critical individual feedback on situations, events, products, and services. Sentiment classification plays an important role in evaluating this user feedback. At present, deep learning models such as long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional long short-term memory (BiLSTM) and convolutional neural network (CNN) are prevalently preferred for sentiment classification. Moreover, word embeddings such as Word2Vec and FastText are closely examined for mapping the words in a text to closely related vectors of real numbers. However, both deep learning and word embedding methods have strengths and weaknesses. Combining the strengths of deep learning models with those of word embeddings is the key to high-performance sentiment classification in the field of natural language processing (NLP). In the present study, we propose a novel hybrid deep learning model that strategically combines different word embeddings (Word2Vec, FastText, character-level embedding) with different deep learning methods (LSTM, GRU, BiLSTM, CNN). The proposed model extracts features from the different word embeddings with different deep learning methods, combines these features, and classifies texts by sentiment. To verify the performance of the proposed model, several deep learning models, called basic models, were created to perform a series of experiments. Comparing the performance of the proposed model with that of past studies shows that the proposed model offers better sentiment classification performance.

111 citations


Journal ArticleDOI
TL;DR: This paper reviews deep learning approaches that have been applied to various sentiment analysis tasks and their trends of development, and provides the performance analysis of different deep learning models on a particular dataset at the end of each sentiment analysis task.
Abstract: Nowadays, with the increasing number of Web 2.0 tools, users generate huge amounts of data in a dynamic way. In this regard, sentiment analysis has emerged as an important tool that allows the automation of gaining insight from user-generated data. Recently, deep learning approaches have been proposed for different sentiment analysis tasks and have achieved state-of-the-art results. Therefore, in order to help researchers quickly depict the current progress as well as the current issues to be addressed, in this paper we review deep learning approaches that have been applied to various sentiment analysis tasks and their trends of development. This study also provides a performance analysis of different deep learning models on a particular dataset at the end of each sentiment analysis task. Toward the end, the review highlights current issues and hypothesized solutions to be taken into account in future work. Moreover, based on knowledge learned from previous studies, the future work subsection gives suggestions that can be incorporated into new deep learning models to yield better performance. Suggestions include the use of bidirectional encoder representations from transformers (BERT), sentiment-specific word embedding models, cognition-based attention models, common sense knowledge, reinforcement learning, and generative adversarial networks.

Proceedings ArticleDOI
28 Jul 2020
TL;DR: A sentiment lexicon expansion method using Word2vec and fastText word embeddings is proposed, along with a rule-based Sentiment Analysis method that uses the expanded lexicons and lists of conjunctions and negation words to predict the sentiments expressed in Tamil texts.
Abstract: Sentiment Analysis is the process of identifying and categorising the sentiments expressed in a text as positive or negative. The words which carry the sentiments are the keys to sentiment prediction. SentiWordNet is the sentiment lexicon used to determine the sentiment of texts. The huge number of sentiment terms that are not in SentiWordNet limits the performance of Sentiment Analysis, and gathering and grouping such sentiment words manually is a tedious task. In this paper we propose a sentiment lexicon expansion method using Word2vec and fastText word embeddings along with a rule-based Sentiment Analysis method. We expand the sentiment lexicon from the initial seed list of 2951 positive and 5598 negative words in two steps: (i) gathering related words using Word2vec word embedding and (ii) gathering lexically similar words using fastText word embedding. Our final lexicons UJ_Lex_Pos and UJ_Lex_Neg contain 10537 positive and 12664 negative words respectively, which are labelled using Word2vec word embedding. Furthermore, the rule-based Sentiment Analysis method uses the expanded lexicons (UJ_Lex_Pos and UJ_Lex_Neg), lists of conjunctions and negation words to predict the sentiments expressed in Tamil texts. The method is evaluated on UJ_MovieReviews and an accuracy of 88.14% is obtained.
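
A minimal sketch of the first expansion step using gensim's Word2Vec: gather the nearest neighbours of seed sentiment words in an embedding space trained on the target-language corpus (the fastText step for lexically similar words follows the same pattern); the toy corpus and seed words are illustrative.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be the Tamil text collection.
corpus = [["good", "movie", "great", "acting"],
          ["bad", "plot", "terrible", "acting"]]
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)

def expand(seed_words, topn=3):
    """Add each seed word's nearest neighbours in the embedding space to the lexicon."""
    expanded = set(seed_words)
    for word in seed_words:
        if word in model.wv:
            expanded.update(w for w, _ in model.wv.most_similar(word, topn=topn))
    return expanded

print(expand(["good"]))  # seed word plus its nearest neighbours in the toy corpus
```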

Journal ArticleDOI
TL;DR: This paper proposes a hate speech detection approach to identify hatred against vulnerable minority groups on social media, and successfully identifies the Tigre ethnic group as a highly vulnerable community in terms of hatred compared with Amhara and Oromo.
Abstract: With the rapid development of mobile computing and Web technologies, online hate speech has increasingly spread across social network platforms, since it is easy to post any opinion. Previous studies confirm that exposure to online hate speech has serious offline consequences for historically deprived communities. Thus, research on automated hate speech detection has attracted much attention. However, the role of social networks in identifying hate-related vulnerable communities is not well investigated. Hate speech can affect all population groups, but some are more vulnerable to its impact than others. For example, for ethnic groups whose languages have few computational resources, it is a challenge to automatically collect and process online texts, not to mention automatic hate speech detection on social media. In this paper, we propose a hate speech detection approach to identify hatred against vulnerable minority groups on social media. Firstly, within the Spark distributed processing framework, posts are automatically collected and pre-processed, and features are extracted using word n-grams and word embedding techniques such as Word2Vec. Secondly, deep learning algorithms for classification, such as the Gated Recurrent Unit (GRU), a variant of Recurrent Neural Networks (RNNs), are used for hate speech detection. Finally, hate words are clustered with methods such as Word2Vec to predict the potential target ethnic group for hatred. In our experiments, we use the Amharic language of Ethiopia as an example. Since there was no publicly available dataset of Amharic texts, we crawled Facebook pages to prepare the corpus. Since data annotation can be biased by culture, we recruited annotators from different cultural backgrounds and achieved better inter-annotator agreement. In our experimental results, feature extraction using word embedding techniques such as Word2Vec performs better with both classical and deep learning-based classification algorithms for hate speech detection, among which GRU achieves the best result. Our proposed approach can successfully identify the Tigre ethnic group as a highly vulnerable community in terms of hatred compared with Amhara and Oromo. Identifying hatred-vulnerable groups is therefore vital for protecting them, by applying automatic hate speech detection models to remove content that aggravates psychological harm and physical conflict. This can also pave the way towards the development of policies, strategies, and tools to empower and protect vulnerable communities.

Journal ArticleDOI
TL;DR: This paper emphasizes the importance of reasoning for building interpretable and knowledge-driven neural NLP models that can handle complex tasks.

Journal ArticleDOI
TL;DR: Two neural network models are proposed that integrate traditional bag-of-words with word context and consumer emotions, and they perform well on all datasets, irrespective of sentiment polarity and product category.
Abstract: Fake consumer review detection has attracted much interest in recent years owing to the increasing number of Internet purchases. Existing approaches to detect fake consumer reviews use the review content, product and reviewer information and other features to detect fake reviews. However, as shown in recent studies, the semantic meaning of reviews might be particularly important for text classification. In addition, the emotions hidden in the reviews may represent another potential indicator of fake content. To improve the performance of fake review detection, here we propose two neural network models that integrate traditional bag-of-words as well as the word context and consumer emotions. Specifically, the models learn document-level representation by using three sets of features: (1) n-grams, (2) word embeddings and (3) various lexicon-based emotion indicators. Such a high-dimensional feature representation is used to classify fake reviews into four domains. To demonstrate the effectiveness of the presented detection systems, we compare their classification performance with several state-of-the-art methods for fake review detection. The proposed systems perform well on all datasets, irrespective of their sentiment polarity and product category.

Proceedings ArticleDOI
03 Jun 2020
TL;DR: This paper aims to improve the efficiency of training an NMT by introducing a novel norm-based curriculum learning method that uses the norm (aka length or module) of a word embedding as a measure of the difficulty of the sentence, the competence of the model, and the weight of the sentences.
Abstract: A neural machine translation (NMT) system is expensive to train, especially with high-resource settings. As the NMT architectures become deeper and wider, this issue gets worse and worse. In this paper, we aim to improve the efficiency of training an NMT by introducing a novel norm-based curriculum learning method. We use the norm (aka length or module) of a word embedding as a measure of 1) the difficulty of the sentence, 2) the competence of the model, and 3) the weight of the sentence. The norm-based sentence difficulty takes the advantages of both linguistically motivated and model-based sentence difficulties. It is easy to determine and contains learning-dependent features. The norm-based model competence makes NMT learn the curriculum in a fully automated way, while the norm-based sentence weight further enhances the learning of the vector representation of the NMT. Experimental results for the WMT’14 English-German and WMT’17 Chinese-English translation tasks demonstrate that the proposed method outperforms strong baselines in terms of BLEU score (+1.17/+1.56) and training speedup (2.22x/3.33x).
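
A minimal sketch of the norm-based difficulty used to order the curriculum, assuming `embedding` maps a word to its vector; scoring sentences by the mean embedding norm is an illustrative simplification of the paper's exact formulation.

```python
import numpy as np

def sentence_difficulty(sentence, embedding):
    """Mean word-embedding norm of a sentence (higher norm ~ harder/rarer words)."""
    norms = [np.linalg.norm(embedding[w]) for w in sentence.split() if w in embedding]
    return float(np.mean(norms)) if norms else 0.0

def curriculum_order(sentences, embedding):
    """Present training sentences from easy to hard according to the norm-based score."""
    return sorted(sentences, key=lambda s: sentence_difficulty(s, embedding))
```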

Journal ArticleDOI
TL;DR: The proposed automated classification model and LDA-based network analysis method provide a useful approach to enable machine-assisted interpretation of text-based accident narratives and can provide managers with much-needed information and knowledge to improve safety on-site.

Journal ArticleDOI
TL;DR: This paper proposes a novel label co-occurrence learning framework based on Graph Convolution Networks (GCNs) to explicitly explore the dependencies between pathologies for the multi-label chest X-ray (CXR) image classification task, which is termed the “CheXGCN”.
Abstract: Existing multi-label medical image learning tasks generally contain rich relationship information among pathologies such as label co-occurrence and interdependency, which is of great importance for assisting in clinical diagnosis and can be represented as the graph-structured data. However, most state-of-the-art works only focus on regression from the input to the binary labels, failing to make full use of such valuable graph-structured information due to the complexity of graph data. In this paper, we propose a novel label co-occurrence learning framework based on Graph Convolution Networks (GCNs) to explicitly explore the dependencies between pathologies for the multi-label chest X-ray (CXR) image classification task, which we term the “CheXGCN”. Specifically, the proposed CheXGCN consists of two modules, i.e., the image feature embedding (IFE) module and label co-occurrence learning (LCL) module. Thanks to the LCL model, the relationship between pathologies is generalized into a set of classifier scores by introducing the word embedding of pathologies and multi-layer graph information propagation. During end-to-end training, it can be flexibly integrated into the IFE module and then adaptively recalibrate multi-label outputs with these scores. Extensive experiments on the ChestX-Ray14 and CheXpert datasets have demonstrated the effectiveness of CheXGCN as compared with the state-of-the-art baselines.

Journal ArticleDOI
TL;DR: A new feature extraction method is proposed that uses the word embedding method from natural language processing to generate bidirectional real dense vectors reflecting the contextual relationships between pixels, in order to fully exploit the deep features of images and to capture the semantic information of the context.
Abstract: In traditional remote sensing image recognition, the traditional features (e.g., color features and texture features) cannot fully describe complex images, and the relationships between image pixels cannot be captured well. Using a single model or a traditional sequential joint model, it is easy to lose deep features during feature mining. This article proposes a new feature extraction method that uses the word embedding method from natural language processing to generate bidirectional real dense vectors to reflect the contextual relationships between the pixels. A bidirectional independent recurrent neural network (BiIndRNN) is combined with a convolutional neural network (CNN) to improve the sliced recurrent neural network (SRNN) algorithm model, which is then constructed in parallel with graph convolutional networks (GCNs) under an attention mechanism to fully exploit the deep features of images and to capture the semantic information of the context. This model is collectively named an improved SRNN and attention-treated GCN-based parallel (SAGP) model. Experiments conducted on Populus euphratica forests demonstrate that the proposed method outperforms traditional methods in terms of recognition accuracy. Validation on a public data set also confirmed these results.

Proceedings ArticleDOI
26 Nov 2020
TL;DR: In this article, a word embedding-based POS tagger for the Tamil language is proposed, where the experiments are conducted with different word embeddings (BoW, TF-IDF, Word2vec, fastText and GloVe).
Abstract: This paper proposes a word embedding-based Part of Speech (POS) tagger for the Tamil language. The experiments are conducted with different word embeddings (BoW, TF-IDF, Word2vec, fastText and GloVe) that are created using the UJ-Tamil corpus. Different combinations of eight features with three classifiers (linear SVM, Extreme Gradient Boosting and k-Nearest Neighbor) are used to build the POS tagger. The results are compared against a Viterbi algorithm-based POS tagger. The results show that word embedding can be used for POS tagging with good performance. BoW, TF-IDF and fastText give an impressive performance compared with Word2vec and GloVe. An accuracy of 99% is obtained with word embeddings of BoW and TF-IDF, with unigrams as well as bigrams, and with the linear SVM classifier. The POS tag of a given word can be identified with 99% accuracy using the word embedding-based POS tagger in Tamil.
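
A minimal sketch of the classifier-based tagging setup with scikit-learn: each token is represented by TF-IDF unigrams and bigrams over a small context window and tagged with a linear SVM; the toy tagged sentences and window size are illustrative, and the paper combines much richer feature sets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy tagged sentences; the real setup uses the UJ-Tamil corpus.
tagged = [("the dog barks", ["DET", "NOUN", "VERB"]),
          ("a cat sleeps", ["DET", "NOUN", "VERB"])]

def token_contexts(sentence, window=1):
    """Represent each token by the words in a small window around it."""
    words = sentence.split()
    return [" ".join(words[max(0, i - window): i + window + 1]) for i in range(len(words))]

X = [ctx for sent, _ in tagged for ctx in token_contexts(sent)]
y = [tag for _, tags in tagged for tag in tags]

tagger = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
tagger.fit(X, y)
print(tagger.predict(token_contexts("the cat barks")))  # one predicted tag per token
```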

Journal ArticleDOI
TL;DR: A benchmark comparison of various deep learning architectures such as Convolutional Neural Networks (CNN) and Long short-term memory (LSTM) recurrent neural networks is presented, several combinations of these models are proposed, and the effect of different pre-trained word embedding models is studied.

Proceedings ArticleDOI
Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen
14 Jun 2020
TL;DR: A novel VQA approach is proposed that represents an image as a graph consisting of three sub-graphs depicting visual, semantic, and numeric modalities respectively, and introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes.
Abstract: Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and clearly improves performance on two VQA tasks that require reading scene texts.

Proceedings Article
30 Apr 2020
TL;DR: This work presents a novel and principled solution for modeling both the global absolute positions of words and their order relationships, and is the first work in NLP to link imaginary numbers in complex-valued representations to concrete meanings (i.e., word order).
Abstract: Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. We present a novel and principled solution for modeling both the global absolute positions of words and their order relationships. Our solution generalizes word embeddings, previously defined as independent vectors, to continuous word functions over a variable (position). The benefit of continuous functions over variable positions is that word representations shift smoothly with increasing positions. Hence, word representations in different positions can correlate with each other in a continuous function. The general solution of these functions can be extended to complex-valued variants. We extend CNN, RNN and Transformer NNs to complex-valued versions to incorporate our complex embedding (we make all code available). Experiments on text classification, machine translation and language modeling show gains over both classical word embeddings and position-enriched word embeddings. To our knowledge, this is the first work in NLP to link imaginary numbers in complex-valued representations to concrete meanings (i.e., word order).
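
A minimal sketch of the core idea: each word's embedding is a continuous, complex-valued function of position (an amplitude, frequency and initial phase per dimension), so the representation of a word changes smoothly as its position shifts; the vocabulary size, dimensionality and random initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 64
amplitude = rng.normal(size=(vocab_size, dim))               # per-word, per-dimension amplitude
frequency = rng.uniform(0.0, 1.0, size=(vocab_size, dim))    # controls sensitivity to position
phase = rng.uniform(0.0, 2 * np.pi, size=(vocab_size, dim))  # word-specific initial phase

def embed(word_id, pos):
    """Complex-valued embedding of word `word_id` evaluated at position `pos`."""
    return amplitude[word_id] * np.exp(1j * (frequency[word_id] * pos + phase[word_id]))

# The same word at neighbouring positions gets smoothly varying representations.
print(np.abs(embed(42, 3) - embed(42, 4)).mean())
```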

Journal ArticleDOI
03 Apr 2020
TL;DR: A brand-new light-weight neural framework to address the distantly supervised relation extraction problem and alleviate the defects in previous selective attention framework is proposed, which achieves a new state-of-the-art performance in terms of both AUC and top-n precision metrics.
Abstract: Distantly supervised relation extraction intrinsically suffers from noisy labels due to the strong assumption of distant supervision. Most prior works adopt a selective attention mechanism over sentences in a bag to denoise from wrongly labeled data, which however could be incompetent when there is only one sentence in a bag. In this paper, we propose a brand-new light-weight neural framework to address the distantly supervised relation extraction problem and alleviate the defects in previous selective attention framework. Specifically, in the proposed framework, 1) we use an entity-aware word embedding method to integrate both relative position information and head/tail entity embeddings, aiming to highlight the essence of entities for this task; 2) we develop a self-attention mechanism to capture the rich contextual dependencies as a complement for local dependencies captured by piecewise CNN; and 3) instead of using selective attention, we design a pooling-equipped gate, which is based on rich contextual representations, as an aggregator to generate bag-level representation for final relation classification. Compared to selective attention, one major advantage of the proposed gating mechanism is that, it performs stably and promisingly even if only one sentence appears in a bag and thus keeps the consistency across all training examples. The experiments on NYT dataset demonstrate that our approach achieves a new state-of-the-art performance in terms of both AUC and top-n precision metrics.

Journal ArticleDOI
TL;DR: This paper proposes an efficient technique to mitigate the problem of resource scarcity for emotion detection in Hindi by leveraging information from a resource-rich language like English, following a deep transfer learning framework which efficiently captures relevant information through the shared space of two languages.
Abstract: Performance of any natural language processing (NLP) system greatly depends on the amount of resources and tools available in a particular language or domain. Therefore, while solving any problem in a low-resource setting, it is important to investigate techniques to leverage the resources and tools available in resource-rich languages. In this paper we propose an efficient technique to mitigate the problem of resource scarcity for emotion detection in Hindi by leveraging information from a resource-rich language like English. Our method follows a deep transfer learning framework which efficiently captures relevant information through the shared space of two languages, showing significantly better performance compared to the monolingual scenario that learns in the vector space of only one language. As base learning models, we use Convolution Neural Network (CNN) and Bi-Directional Long Short Term Memory (Bi-LSTM). As there is no available emotion-labeled dataset for Hindi, we create a new dataset for emotion detection in the disaster domain by annotating sentences of news documents with nine different classes based on Plutchik's wheel of emotions. To improve the performance of emotion classification in Hindi, we employ transfer learning to exploit the resources available in the related domains. The core of our approach lies in generating a cross-lingual word embedding representation of words in the shared embedding space. The neural networks are trained on the existing datasets, and then weights are fine-tuned following four different transfer learning strategies for emotion classification in Hindi. We obtain a significant performance gain with our proposed transfer learning techniques, achieving an F1-score of 0.53 (compared to 0.47), thereby implying that knowledge from a resource-rich language can be transferred across languages and domains.

Book ChapterDOI
14 Apr 2020
TL;DR: This paper formulates keyphrase extraction from scholarly articles as a sequence labeling task solved using a BiLSTM-CRF, where the words in the input text are represented using deep contextualized embeddings.
Abstract: In this paper, we formulate keyphrase extraction from scholarly articles as a sequence labeling task solved using a BiLSTM-CRF, where the words in the input text are represented using deep contextualized embeddings. We evaluate the proposed architecture using both contextualized and fixed word embedding models on three different benchmark datasets, and compare with existing popular unsupervised and supervised techniques. Our results quantify the benefits of: (a) using contextualized embeddings over fixed word embeddings; (b) using a BiLSTM-CRF architecture with contextualized word embeddings over fine-tuning the contextualized embedding model directly; and (c) using domain-specific contextualized embeddings (SciBERT). Through error analysis, we also provide some insights into why particular models work better than the others. Lastly, we present a case study where we analyze different self-attention layers of the two best models (BERT and SciBERT) to better understand their predictions.

Journal ArticleDOI
TL;DR: This paper proposes an automatic summarizer using the distributional semantic model to capture semantics for producing high-quality summaries, and concludes that the usage of semantics as a feature for text summarization provides improved results and helps to further reduce redundancies from the input source.
Abstract: Automatic text summarization essentially condenses a long document into a shorter format while preserving its information content and overall meaning. It is a potential solution to information overload. Several automatic summarizers exist in the literature capable of producing high-quality summaries, but they do not focus on preserving the underlying meaning and semantics of the text. In this paper, we capture and preserve the semantics of text as the fundamental feature for summarizing a document. We propose an automatic summarizer using the distributional semantic model to capture semantics for producing high-quality summaries. We evaluated our summarizer using ROUGE on the DUC-2007 dataset and compare our results with four other state-of-the-art summarizers. Our system outperforms the other reference summarizers, leading us to the conclusion that the usage of semantics as a feature for text summarization provides improved results and helps to further reduce redundancies from the input source.
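
A minimal sketch of a generic distributional-semantics summarizer in this spirit (not the paper's exact model), assuming a pre-trained `word_vectors` mapping: sentences are scored by cosine similarity between their averaged word-vector representation and the document centroid, and the top-scoring ones are kept in original order.

```python
import numpy as np

def summarize(sentences, word_vectors, dim, n_keep=3):
    """Extractive summary: keep the sentences most similar to the document centroid."""
    def sent_vec(s):
        vecs = [word_vectors[w] for w in s.lower().split() if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    mat = np.vstack([sent_vec(s) for s in sentences])
    centroid = mat.mean(axis=0)
    sims = mat @ centroid / (np.linalg.norm(mat, axis=1) * np.linalg.norm(centroid) + 1e-12)
    keep = sorted(np.argsort(-sims)[:n_keep])      # highest-scoring sentences, original order
    return [sentences[i] for i in keep]
```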

Journal ArticleDOI
TL;DR: The proposed model cleans the data, generates word vectors from a pre-trained Word2Vec model and uses a CNN layer to extract better features for short-sentence categorization.

Proceedings ArticleDOI
01 Dec 2020
TL;DR: This paper proposes a character-aware pre-trained language model named CharBERT improving on the previous methods (such as BERT, RoBERTa) and proposes a new pre-training task named NLM (Noisy LM) for unsupervised character representation learning.
Abstract: Most pre-trained language models (PLMs) construct word representations at subword level with Byte-Pair Encoding (BPE) or its variations, by which OOV (out-of-vocab) words are almost avoidable. However, those methods split a word into subword units and make the representation incomplete and fragile. In this paper, we propose a character-aware pre-trained language model named CharBERT improving on the previous methods (such as BERT, RoBERTa) to tackle these problems. We first construct the contextual word embedding for each token from the sequential character representations, then fuse the representations of characters and the subword representations by a novel heterogeneous interaction module. We also propose a new pre-training task named NLM (Noisy LM) for unsupervised character representation learning. We evaluate our method on question answering, sequence labeling, and text classification tasks, both on the original datasets and adversarial misspelling test sets. The experimental results show that our method can significantly improve the performance and robustness of PLMs simultaneously.

Proceedings ArticleDOI
Zhen Yang, Bojie Hu, Ambyera Han, Shen Huang, Qi Ju
17 Sep 2020
TL;DR: Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods, and relieves the pretrain-finetune discrepancy caused by artificial symbols like [mask].
Abstract: This paper proposes a new pre-training method, called Code-Switching Pre-training (CSP for short) for Neural Machine Translation (NMT). Unlike traditional pre-training method which randomly masks some fragments of the input sentence, the proposed CSP randomly replaces some words in the source sentence with their translation words in the target language. Specifically, we firstly perform lexicon induction with unsupervised word embedding mapping between the source and target languages, and then randomly replace some words in the input sentence with their translation words according to the extracted translation lexicons. CSP adopts the encoder-decoder framework: its encoder takes the code-mixed sentence as input, and its decoder predicts the replaced fragment of the input sentence. In this way, CSP is able to pre-train the NMT model by explicitly making the most of the alignment information extracted from the source and target monolingual corpus. Additionally, we relieve the pretrain-finetune discrepancy caused by the artificial symbols like [mask]. To verify the effectiveness of the proposed method, we conduct extensive experiments on unsupervised and supervised NMT. Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods.
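
A minimal sketch of the code-switching corruption: some source words are replaced with their translations from an induced bilingual lexicon, and the replaced fragment becomes the decoder's prediction target; the tiny lexicon and replacement rate below are illustrative.

```python
import random

LEXICON = {"house": "Haus", "cat": "Katze", "small": "klein"}  # toy source->target lexicon

def code_switch(sentence, rate=0.3, seed=0):
    """Return the code-mixed encoder input and the replaced words (decoder target)."""
    random.seed(seed)
    mixed, replaced = [], []
    for word in sentence.split():
        if word in LEXICON and random.random() < rate:
            mixed.append(LEXICON[word])
            replaced.append(word)
        else:
            mixed.append(word)
    return " ".join(mixed), replaced

print(code_switch("the small cat sits near the house"))
```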