Topic

Word embedding

About: Word embedding is a research topic. Over the lifetime, 4683 publications have been published within this topic receiving 153378 citations. The topic is also known as: word embeddings.

...read moreread less

Papers published on a yearly basis

Papers

PDF

Open Access

More filters

Proceedings Article•

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking

[...]

Patrick H. Chen¹, Si Si², Yang Li², Ciprian Chelba², Cho-Jui Hsieh³ - Show less +1 more•Institutions (3)

University of California, Los Angeles¹, Google², University of California, Davis³

01 Jan 2018

TL;DR: This article proposed GroupReduce, a novel compression method for neural language models based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words).

...read moreread less

Abstract: Model compression is essential for serving large deep neural nets on devices with limited resources or applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and the softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with around 800k vocabulary, and its word embedding and softmax matrices use more than 6GBytes space, and are responsible for over 90\% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). We start by grouping words into $c$ blocks based on their frequency, and then refine the clustering iteratively by constructing weighted low-rank approximation for each block, where the weights are based the frequencies of the words in the block. The experimental results show our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieved 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization, our method can achieve 26x compression rate without losing prediction accuracy.

...read moreread less

49 citations

Posted Content•

Unsupervised Cross-Lingual Information Retrieval using Monolingual Data Only

[...]

Robert Litschko¹, Goran Glavaš¹, Simone Paolo Ponzetto¹, Ivan Vulić²•Institutions (2)

University of Mannheim¹, University of Cambridge²

02 May 2018-arXiv: Computation and Language

TL;DR: This paper proposed a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all, which leverages shared word embedding spaces in which terms, queries and documents can be represented, irrespective of their actual language.

...read moreread less

Abstract: We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments. We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models. We believe that the proposed framework is the first step towards development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.

...read moreread less

49 citations

Journal Article•DOI•

Identifying tweets of personal health experience through word embedding and LSTM neural network.

[...]

Keyuan Jiang¹, Shichao Feng¹, Qunhao Song¹, Ricardo A. Calix¹, Matrika Gupta¹, Gordon R. Bernard² - Show less +2 more•Institutions (2)

Purdue University¹, Vanderbilt University²

13 Jun 2018-BMC Bioinformatics

TL;DR: This study presented an efficient and effective method of identifying health-related personal experience tweets by combining word embedding and an LSTM neural network that outperforms the conventional methods in identifying PETs.

...read moreread less

Abstract: As Twitter has become an active data source for health surveillance research, it is important that efficient and effective methods are developed to identify tweets related to personal health experience. Conventional classification algorithms rely on features engineered by human domain experts, and engineering such features is a challenging task and requires much human intelligence. The resultant features may not be optimal for the classification problem, and can make it challenging for conventional classifiers to correctly predict personal experience tweets (PETs) due to the various ways to express and/or describe personal experience in tweets. In this study, we developed a method that combines word embedding and long short-term memory (LSTM) model without the need to engineer any specific features. Through word embedding, tweet texts were represented as dense vectors which in turn were fed to the LSTM neural network as sequences. Statistical analyses of the results of 10-fold cross-validations of our method and conventional methods indicate that there exist significant differences (p < 0.01) in performance measures of accuracy, precision, recall, F1-score, and ROC/AUC, demonstrating that our approach outperforms the conventional methods in identifying PETs. We presented an efficient and effective method of identifying health-related personal experience tweets by combining word embedding and an LSTM neural network. It is conceivable that our method can help accelerate and scale up analyzing textual data of social media for health surveillance purposes, because of no need for the laborious and costly process of engineering features.

...read moreread less

49 citations

Proceedings Article•DOI•

Volatility Prediction using Financial Disclosures Sentiments with Word Embedding-based IR Models

[...]

Navid Rekabsaz¹, Mihai Lupu¹, Artem Baklanov², Alexander Dür, Linda Andersson¹, Allan Hanbury¹ - Show less +2 more•Institutions (2)

Vienna University of Technology¹, International Institute for Applied Systems Analysis²

01 Jul 2017

TL;DR: This work investigates the sentiment of annual disclosures of companies in stock markets to forecast volatility and explores the use of recent Information Retrieval term weighting models that are effectively extended by related terms using word embeddings.

...read moreread less

Abstract: Volatility prediction--an essential concept in financial markets--has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively extended by related terms using word embeddings. In parallel to textual information, factual market data have been widely used as the mainstream approach to forecast market risk. We therefore study different fusion methods to combine text and market data resources. Our word embedding-based approach significantly outperforms state-of-the-art methods. In addition, we investigate the characteristics of the reports of the companies in different financial sectors.

...read moreread less

49 citations

Journal Article•DOI•

Impact of Stemming and Word Embedding on Deep Learning-Based Arabic Text Categorization

[...]

Huda Abdulrahman Almuzaini¹, Aqil M. Azmi¹•Institutions (1)

King Saud University¹

14 Jul 2020-IEEE Access

TL;DR: Among the deep learning models, the Attention mechanism and the Bidirectional learning gave outstanding performance with Arabic text categorization, and the results of this study indicate that stem- based algorithms perform slightly better compared to root-based algorithms.

...read moreread less

Abstract: Document classification is a classical problem in information retrieval, and plays an important role in a variety of applications. Automatic document classification can be defined as content-based assignment of one or more predefined categories to documents. Many algorithms have been proposed and implemented to solve this problem in general, however, classifying Arabic documents is lagging behind similar works in other languages. In this paper, we present seven deep learning-based algorithms to classify the Arabic documents. These are: Convolutional Neural Network (CNN), CNN-LSTM (LSTM = Long Short-Term Memory), CNN-GRU (GRU = Gated Recurrent Units), BiLSTM (Bidirectional LSTM), BiGRU, Att-LSTM (Attention-based LSTM), and Att-GRU. And for word representation, we applied the word embedding technique (Word2Vec). We tested our approach on two large datasets–with six and eight categories–using ten-fold cross-validation. Our objective was to study how the classification is affected by the stemming strategies and word embedding. First, we looked into the effects of different stemming algorithms on the document classification with different deep learning models. We experimented with eleven different stemming algorithms, broadly falling into: root-based and stem-based, and no stemming. We performed ANOVA test on the classification results using the different stemmers, which helps assure if the results are significant. The results of our study indicate that stem-based algorithms perform slightly better compared to root-based algorithms. Among the deep learning models, the Attention mechanism and the Bidirectional learning gave outstanding performance with Arabic text categorization. Our best performance is $F\text {-score} = 97.96\%$ , achieved using the Att-GRU model with stem-based algorithm. Next, we looked into different controlling parameters for word embedding. For Word2Vec, both skip-gram and bag-of-words (CBOW) perform well with either stemming strategies. However, when using a stem-based algorithm, skip-gram achieves good results with a vector of smaller dimension, while CBOW requires a larger dimension vector to achieve a similar performance.

...read moreread less

49 citations

Collapse

Network Information

Performance

Metrics

5,718

Papers

201,647

Citations

No. of papers in the topic in previous years
Year	Papers
2023	317
2022	716
2021	736
2020	1,025
2019	1,078
2018	788

Word embedding

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics