Topic
Word embedding
About: Word embedding is a research topic. Over its lifetime, 4683 publications have been published within this topic, receiving 153378 citations. The topic is also known as: word embeddings.
Papers
03 Jun 2020 · TL;DR: This paper aims to improve the efficiency of training an NMT system by introducing a novel norm-based curriculum learning method that uses the norm (aka length or module) of a word embedding as a measure of the difficulty of the sentence, the competence of the model, and the weight of the sentence.
Abstract: A neural machine translation (NMT) system is expensive to train, especially in high-resource settings, and the cost grows as NMT architectures become deeper and wider. In this paper, we aim to improve the efficiency of training an NMT system by introducing a novel norm-based curriculum learning method. We use the norm (aka length or module) of a word embedding as a measure of 1) the difficulty of the sentence, 2) the competence of the model, and 3) the weight of the sentence. The norm-based sentence difficulty combines the advantages of both linguistically motivated and model-based sentence difficulties: it is easy to determine and contains learning-dependent features. The norm-based model competence makes NMT learn the curriculum in a fully automated way, while the norm-based sentence weight further enhances the learning of the vector representations of the NMT. Experimental results for the WMT’14 English-German and WMT’17 Chinese-English translation tasks demonstrate that the proposed method outperforms strong baselines in terms of BLEU score (+1.17/+1.56) and training speedup (2.22x/3.33x).
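The norm-based difficulty measure described above can be sketched in a few lines. The following is a minimal illustration with toy embedding vectors (all names and values are hypothetical; in the paper the norms come from a trained NMT embedding layer):

```python
import math

# Toy embeddings (illustrative only). Rare words tend to get
# larger-norm embeddings during training, which is what the
# norm-based difficulty measure exploits.
EMBEDDINGS = {
    "the": [0.1, 0.2],
    "cat": [0.9, 1.2],
    "sat": [0.8, 1.0],
    "astrolabe": [2.1, 1.9],  # rare word: larger norm
}

def word_norm(word):
    """L2 norm of a word's embedding vector."""
    vec = EMBEDDINGS[word]
    return math.sqrt(sum(x * x for x in vec))

def sentence_difficulty(sentence):
    """Norm-based difficulty: sum of word-embedding norms, so longer
    sentences and rarer (larger-norm) words score as harder."""
    return sum(word_norm(w) for w in sentence.split())

easy = sentence_difficulty("the cat sat")
hard = sentence_difficulty("the astrolabe sat")
```

A curriculum then presents sentences in increasing order of this score, widening the admitted difficulty range as the model's (norm-based) competence grows.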
85 citations
TL;DR: An updated deep neural network is proposed for detecting false news in tweets carrying COVID-19 information; the proposed framework achieves high accuracy in distinguishing fake from non-fake COVID-19 tweets.
Abstract: COVID-19 has affected all people’s lives. Though COVID-19 is on the rise, misinformation about the virus grows in parallel. The spread of misinformation has created confusion among people, caused disturbances in society, and even led to deaths. Social media is central to our daily lives, and the Internet has become a significant source of knowledge. Owing to the widespread damage caused by fake news, it is important to build computerized systems to detect it. The paper proposes an updated deep neural network for the identification of false news. The deep learning techniques are the Modified LSTM (one to three layers) and the Modified GRU (one to three layers). In particular, we carry out investigations of a large dataset of tweets carrying information with respect to COVID-19. In our study, we separate the dubious claims into two categories, true and false, and compare the performance of the proposed approaches with six machine learning techniques: Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), and Naive Bayes (NB). The parameters of the deep learning techniques are optimized using Keras-Tuner. Four benchmark datasets were used, along with two feature extraction methods: TF-IDF with N-grams to extract essential features for the baseline machine learning models, and word-embedding feature extraction for the proposed deep neural network methods. The results obtained with the proposed framework reveal high accuracy in detecting fake and non-fake tweets containing COVID-19 information. These results demonstrate significant improvement compared to the existing state-of-the-art results of baseline machine learning models.
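As a rough illustration of the TF-IDF-with-N-grams baseline features mentioned above, here is a minimal pure-Python sketch (the tokenisation and function names are illustrative, not the authors' actual pipeline):

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Unigrams and bigrams of a whitespace-tokenised, lowercased text."""
    toks = text.lower().split()
    grams = list(toks)
    for n in range(2, n_max + 1):
        grams += [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return grams

def tfidf(corpus):
    """Minimal TF-IDF over unigram+bigram features: term frequency
    scaled by log(N / document frequency). Returns one dict per doc."""
    docs = [Counter(ngrams(d)) for d in corpus]
    n_docs = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    out = []
    for d in docs:
        total = sum(d.values())
        out.append({g: (c / total) * math.log(n_docs / df[g])
                    for g, c in d.items()})
    return out
```

A term appearing in every document (e.g. "covid" across COVID-19 tweets) gets an IDF of zero and is effectively discounted, while distinctive n-grams get positive weight; those vectors then feed the baseline classifiers (DT, LR, KNN, RF, SVM, NB).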
85 citations
24 May 2019 · TL;DR: This paper develops a technique for understanding the origins of bias in word embeddings by identifying how perturbing the corpus affects the bias of the resulting embedding, which can be used to trace word embedding bias back to the original training documents.
Abstract: The power of machine learning systems not only promises great technical progress, but risks societal harm. As a recent example, researchers have shown that popular word embedding algorithms exhibit stereotypical biases, such as gender bias. The widespread use of these algorithms in machine learning systems, from automated translation services to curriculum vitae scanners, can amplify stereotypes in important contexts. Although methods have been developed to measure these biases and alter word embeddings to mitigate their biased representations, there is a lack of understanding of how word embedding bias depends on the training data. In this work, we develop a technique for understanding the origins of bias in word embeddings. Given a word embedding trained on a corpus, our method identifies how perturbing the corpus will affect the bias of the resulting embedding. This can be used to trace the origins of word embedding bias back to the original training documents. Using our method, one can investigate trends in the bias of the underlying corpus and identify subsets of documents whose removal would most reduce bias. We demonstrate our techniques on both a New York Times and Wikipedia corpus and find that our influence function-based approximations are very accurate.
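The bias being traced here has to be measured somehow. A common projection-style measure, sketched below with toy vectors (the vectors and word list are hypothetical; the paper's influence-function machinery for attributing this score to documents is beyond a short sketch):

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors arranged so "engineer" leans toward "he" -- illustrative
# only, not measured from any real corpus.
emb = {
    "he":       [1.0, 0.1],
    "she":      [-1.0, 0.1],
    "engineer": [0.6, 0.8],
    "nurse":    [-0.5, 0.9],
}

def gender_bias(word):
    """Cosine of a word with the he-minus-she direction:
    positive leans male, negative leans female, ~0 is neutral."""
    direction = [a - b for a, b in zip(emb["he"], emb["she"])]
    return cos(emb[word], direction)
```

With a score like this in hand, one can ask which training documents' removal would most move it toward zero, which is the question the paper's influence-function approximation answers efficiently.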
84 citations
TL;DR: This article evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants, intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks.
Abstract: Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants. We trained 31 word embedding models using FastText, GloVe, Wang2Vec and Word2Vec. We evaluated them intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks. The obtained results suggest that word analogies are not appropriate for word embedding evaluation; task-specific evaluations appear to be a better option.
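The word-analogy evaluation criticised above is usually the 3CosAdd test (e.g. "king − man + woman ≈ queen"). A minimal sketch with toy 2-D vectors (the vocabulary and values are illustrative, not the paper's Portuguese test set):

```python
import math

emb = {  # toy 2-D vectors, chosen so the analogy resolves cleanly
    "king":  [0.9, 0.8],
    "queen": [0.2, 0.9],
    "man":   [0.8, 0.1],
    "woman": [0.1, 0.2],
    "apple": [0.9, 0.05],  # distractor
}

def analogy(a, b, c):
    """3CosAdd: find the word whose vector is closest (by cosine)
    to b - a + c, excluding the three query words themselves."""
    target = [eb - ea + ec for ea, eb, ec in zip(emb[a], emb[b], emb[c])]

    def cos(u, v):
        num = sum(x * y for x, y in zip(u, v))
        den = (math.sqrt(sum(x * x for x in u))
               * math.sqrt(sum(y * y for y in v)))
        return num / den

    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))
```

Accuracy on a set of such quadruples is the intrinsic score; the paper's point is that this score correlates poorly with downstream performance on tasks like POS tagging.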
84 citations
02 Dec 2018 · TL;DR: The authors propose an adversarial training method to blur the boundary between the embeddings of high-frequency and low-frequency words, achieving higher performance than the baselines in all tasks.
Abstract: Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. In order to mitigate the issue, in this paper, we propose a neat, simple yet effective adversarial training method to blur the boundary between the embeddings of high-frequency words and low-frequency words. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. Results show that we achieve higher performance than the baselines in all tasks.
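The frequency bias described above can be checked with a crude diagnostic: compare the centroids of frequent-word and rare-word embeddings. A sketch with toy vectors deliberately arranged to exhibit the pathology (all names and values are illustrative):

```python
import math

# Toy embeddings placed so frequent words sit in one subregion and
# rare words in another -- the separation the paper identifies.
frequent = {"the": [0.1, 0.2], "and": [0.2, 0.1], "of": [0.15, 0.15]}
rare = {"obelisk": [1.8, 1.9], "zephyr": [2.0, 1.7], "quark": [1.9, 2.1]}

def centroid(vectors):
    """Mean vector of a dict of embeddings."""
    dims = len(next(iter(vectors.values())))
    return [sum(v[i] for v in vectors.values()) / len(vectors)
            for i in range(dims)]

def centroid_gap(group_a, group_b):
    """Euclidean distance between group centroids: a large gap means
    the two frequency bands occupy separated subregions."""
    ca, cb = centroid(group_a), centroid(group_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))
```

The paper's adversarial training adds a discriminator that tries to predict a word's frequency band from its embedding, while the embeddings are trained to fool it, driving a gap like this toward zero.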
83 citations