Topic
Word embedding
About: Word embedding is a research topic. Over its lifetime, 4683 publications have been published within this topic, receiving 153378 citations. The topic is also known as: word embeddings.
Papers
03 Jun 2020 · TL;DR: This paper aims to improve the efficiency of training an NMT system by introducing a novel norm-based curriculum learning method that uses the norm (aka length or module) of a word embedding as a measure of the difficulty of the sentence, the competence of the model, and the weight of the sentence.
Abstract: A neural machine translation (NMT) system is expensive to train, especially in high-resource settings, and the cost grows as NMT architectures become deeper and wider. In this paper, we aim to improve the efficiency of training an NMT system by introducing a novel norm-based curriculum learning method. We use the norm (aka length or module) of a word embedding as a measure of 1) the difficulty of the sentence, 2) the competence of the model, and 3) the weight of the sentence. The norm-based sentence difficulty combines the advantages of both linguistically motivated and model-based sentence difficulties: it is easy to determine and contains learning-dependent features. The norm-based model competence makes NMT learn the curriculum in a fully automated way, while the norm-based sentence weight further enhances the learning of the vector representations of the NMT. Experimental results for the WMT’14 English-German and WMT’17 Chinese-English translation tasks demonstrate that the proposed method outperforms strong baselines in terms of BLEU score (+1.17/+1.56) and training speedup (2.22x/3.33x).
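The norm-based difficulty measure described above can be sketched in a few lines. The following is a minimal illustration with toy embedding vectors (all names and values are hypothetical; in the paper the norms come from a trained NMT embedding layer):

```python
import math

# Toy embeddings (illustrative only). Rare words tend to get
# larger-norm embeddings during training, which is what the
# norm-based difficulty measure exploits.
EMBEDDINGS = {
    "the": [0.1, 0.2],
    "cat": [0.9, 1.2],
    "sat": [0.8, 1.0],
    "astrolabe": [2.1, 1.9],  # rare word: larger norm
}

def word_norm(word):
    """L2 norm of a word's embedding vector."""
    vec = EMBEDDINGS[word]
    return math.sqrt(sum(x * x for x in vec))

def sentence_difficulty(sentence):
    """Norm-based difficulty: sum of word-embedding norms, so longer
    sentences and rarer (larger-norm) words score as harder."""
    return sum(word_norm(w) for w in sentence.split())

easy = sentence_difficulty("the cat sat")
hard = sentence_difficulty("the astrolabe sat")
```

A curriculum then presents sentences in increasing order of this score, widening the admitted difficulty range as the model's (norm-based) competence grows.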
85 citations
TL;DR: An updated deep neural network is proposed for detecting false news in tweets carrying COVID-19 information; the proposed framework achieves high accuracy in distinguishing fake from non-fake COVID-19 tweets.
Abstract: COVID-19 has affected all people’s lives. Though COVID-19 is on the rise, misinformation about the virus grows in parallel. The spread of misinformation has created confusion among people, caused disturbances in society, and even led to deaths. Social media is central to our daily lives, and the Internet has become a significant source of knowledge. Owing to the widespread damage caused by fake news, it is important to build computerized systems to detect it. The paper proposes an updated deep neural network for the identification of false news. The deep learning techniques are the Modified LSTM (one to three layers) and the Modified GRU (one to three layers). In particular, we carry out investigations of a large dataset of tweets carrying information with respect to COVID-19. In our study, we separate the dubious claims into two categories, true and false, and compare the performance of the proposed approaches with six machine learning techniques: Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), and Naive Bayes (NB). The parameters of the deep learning techniques are optimized using Keras-Tuner. Four benchmark datasets were used, along with two feature extraction methods: TF-IDF with N-grams to extract essential features for the baseline machine learning models, and word-embedding feature extraction for the proposed deep neural network methods. The results obtained with the proposed framework reveal high accuracy in detecting fake and non-fake tweets containing COVID-19 information. These results demonstrate significant improvement compared to the existing state-of-the-art results of baseline machine learning models.
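As a rough illustration of the TF-IDF-with-N-grams baseline features mentioned above, here is a minimal pure-Python sketch (the tokenisation and function names are illustrative, not the authors' actual pipeline):

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Unigrams and bigrams of a whitespace-tokenised, lowercased text."""
    toks = text.lower().split()
    grams = list(toks)
    for n in range(2, n_max + 1):
        grams += [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return grams

def tfidf(corpus):
    """Minimal TF-IDF over unigram+bigram features: term frequency
    scaled by log(N / document frequency). Returns one dict per doc."""
    docs = [Counter(ngrams(d)) for d in corpus]
    n_docs = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    out = []
    for d in docs:
        total = sum(d.values())
        out.append({g: (c / total) * math.log(n_docs / df[g])
                    for g, c in d.items()})
    return out
```

A term appearing in every document (e.g. "covid" across COVID-19 tweets) gets an IDF of zero and is effectively discounted, while distinctive n-grams get positive weight; those vectors then feed the baseline classifiers (DT, LR, KNN, RF, SVM, NB).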
85 citations
24 May 2019 · TL;DR: This paper develops a technique for understanding the origins of bias in word embeddings by identifying how perturbing the corpus affects the bias of the resulting embedding, which can be used to trace word embedding bias back to the original training documents.
Abstract: The power of machine learning systems not only promises great technical progress, but risks societal harm. As a recent example, researchers have shown that popular word embedding algorithms exhibit stereotypical biases, such as gender bias. The widespread use of these algorithms in machine learning systems, from automated translation services to curriculum vitae scanners, can amplify stereotypes in important contexts. Although methods have been developed to measure these biases and alter word embeddings to mitigate their biased representations, there is a lack of understanding of how word embedding bias depends on the training data. In this work, we develop a technique for understanding the origins of bias in word embeddings. Given a word embedding trained on a corpus, our method identifies how perturbing the corpus will affect the bias of the resulting embedding. This can be used to trace the origins of word embedding bias back to the original training documents. Using our method, one can investigate trends in the bias of the underlying corpus and identify subsets of documents whose removal would most reduce bias. We demonstrate our techniques on both a New York Times and Wikipedia corpus and find that our influence function-based approximations are very accurate.
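The bias being traced here has to be measured somehow. A common projection-style measure, sketched below with toy vectors (the vectors and word list are hypothetical; the paper's influence-function machinery for attributing this score to documents is beyond a short sketch):

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors arranged so "engineer" leans toward "he" -- illustrative
# only, not measured from any real corpus.
emb = {
    "he":       [1.0, 0.1],
    "she":      [-1.0, 0.1],
    "engineer": [0.6, 0.8],
    "nurse":    [-0.5, 0.9],
}

def gender_bias(word):
    """Cosine of a word with the he-minus-she direction:
    positive leans male, negative leans female, ~0 is neutral."""
    direction = [a - b for a, b in zip(emb["he"], emb["she"])]
    return cos(emb[word], direction)
```

With a score like this in hand, one can ask which training documents' removal would most move it toward zero, which is the question the paper's influence-function approximation answers efficiently.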
84 citations
TL;DR: This article evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants, intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks.
Abstract: Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants. We trained 31 word embedding models using FastText, GloVe, Wang2Vec and Word2Vec. We evaluated them intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks. The obtained results suggest that word analogies are not appropriate for word embedding evaluation; task-specific evaluations appear to be a better option.
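The word-analogy evaluation criticised above is usually the 3CosAdd test (e.g. "king − man + woman ≈ queen"). A minimal sketch with toy 2-D vectors (the vocabulary and values are illustrative, not the paper's Portuguese test set):

```python
import math

emb = {  # toy 2-D vectors, chosen so the analogy resolves cleanly
    "king":  [0.9, 0.8],
    "queen": [0.2, 0.9],
    "man":   [0.8, 0.1],
    "woman": [0.1, 0.2],
    "apple": [0.9, 0.05],  # distractor
}

def analogy(a, b, c):
    """3CosAdd: find the word whose vector is closest (by cosine)
    to b - a + c, excluding the three query words themselves."""
    target = [eb - ea + ec for ea, eb, ec in zip(emb[a], emb[b], emb[c])]

    def cos(u, v):
        num = sum(x * y for x, y in zip(u, v))
        den = (math.sqrt(sum(x * x for x in u))
               * math.sqrt(sum(y * y for y in v)))
        return num / den

    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))
```

Accuracy on a set of such quadruples is the intrinsic score; the paper's point is that this score correlates poorly with downstream performance on tasks like POS tagging.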
84 citations
02 Dec 2018 · TL;DR: The authors propose an adversarial training method to blur the boundary between the embeddings of high-frequency and low-frequency words, achieving higher performance than the baselines in all tasks.
Abstract: Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. In order to mitigate the issue, in this paper, we propose a neat, simple yet effective adversarial training method to blur the boundary between the embeddings of high-frequency words and low-frequency words. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. Results show that we achieve higher performance than the baselines in all tasks.
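The frequency bias described above can be checked with a crude diagnostic: compare the centroids of frequent-word and rare-word embeddings. A sketch with toy vectors deliberately arranged to exhibit the pathology (all names and values are illustrative):

```python
import math

# Toy embeddings placed so frequent words sit in one subregion and
# rare words in another -- the separation the paper identifies.
frequent = {"the": [0.1, 0.2], "and": [0.2, 0.1], "of": [0.15, 0.15]}
rare = {"obelisk": [1.8, 1.9], "zephyr": [2.0, 1.7], "quark": [1.9, 2.1]}

def centroid(vectors):
    """Mean vector of a dict of embeddings."""
    dims = len(next(iter(vectors.values())))
    return [sum(v[i] for v in vectors.values()) / len(vectors)
            for i in range(dims)]

def centroid_gap(group_a, group_b):
    """Euclidean distance between group centroids: a large gap means
    the two frequency bands occupy separated subregions."""
    ca, cb = centroid(group_a), centroid(group_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))
```

The paper's adversarial training adds a discriminator that tries to predict a word's frequency band from its embedding, while the embeddings are trained to fool it, driving a gap like this toward zero.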
83 citations