Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published within this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Proceedings ArticleDOI
13 Mar 2019
TL;DR: Comparing word2vec Continuous Bag of Words (CBOW), word2vec skip-gram, doc2vec, and GloVe, the results show that GloVe is the best word embedding method for hotel review data.
Abstract: The development of information technology has made data production increase dramatically. We can get lots of data from the internet, including review data about a product or service. The more data obtained, the more a system is needed to process it. Sentiment analysis is a text-processing task in Natural Language Processing (NLP) that can help assess the quality of a service offered, including hotel services. This paper uses hotel review data obtained from the Traveloka website to carry out sentiment analysis. The data are classified using the Long Short-Term Memory (LSTM) algorithm. To get better results, the authors use word embedding to convert words into vectors. This study aims to compare the performance of several word embedding methods: word2vec Continuous Bag of Words (CBOW), word2vec skip-gram, doc2vec, and GloVe. From the experiments conducted, GloVe has the highest accuracy at 95.52% while word2vec skip-gram has the lowest at 91.81%, so it is concluded that GloVe is the best word embedding method for hotel review data.

19 citations
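
As a rough illustration of the pipeline this abstract describes, the sketch below trains word2vec with gensim on a toy tokenized corpus (sg=0 for CBOW; sg=1 would give skip-gram) and feeds the frozen vectors to a Keras LSTM classifier. The reviews, labels, and hyperparameters are placeholders, not the paper's actual Traveloka setup.

```python
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

# Hypothetical tokenized reviews and labels (1 = positive, 0 = negative);
# the paper's actual corpus comes from Traveloka hotel reviews.
reviews = [["great", "hotel", "clean", "room", "friendly", "staff"],
           ["terrible", "service", "dirty", "room"]]
labels = np.array([1, 0])

# sg=0 trains CBOW; sg=1 would train skip-gram instead.
w2v = Word2Vec(reviews, vector_size=100, window=5, min_count=1, sg=0)

# Map words to indices (0 reserved for padding) and build the embedding matrix.
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
emb = np.zeros((len(vocab) + 1, 100))
for w, i in vocab.items():
    emb[i] = w2v.wv[w]

# Pad index sequences to a fixed length.
maxlen = 10
seqs = [[vocab[w] for w in r] for r in reviews]
x = np.array([s + [0] * (maxlen - len(s)) for s in seqs])

# LSTM classifier on top of the frozen pre-trained vectors.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(emb.shape[0], 100, trainable=False),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.build(input_shape=(None, maxlen))
model.layers[0].set_weights([emb])  # load the word2vec vectors
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, labels, epochs=5)
```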

Journal ArticleDOI
TL;DR: This paper proposes a framework called Semantic Feature Learning via Dual Sequences (SFLDS), which captures the semantic and structural information in the Abstract Syntax Tree (AST) for feature generation.
Abstract: Software defect prediction (SDP) can help developers reasonably allocate limited resources for locating bugs and prioritizing their testing efforts. Existing methods often serialize an Abstract Syntax Tree (AST) obtained from the program source code into a token sequence, which is then input into a deep learning model to learn semantic features. However, different ASTs can share the same token sequence, and it is impossible to distinguish the tree structure of the ASTs from a token sequence alone. To solve this problem, this paper proposes a framework called Semantic Feature Learning via Dual Sequences (SFLDS), which can capture the semantic and structural information in the AST for feature generation. Specifically, we select the representative nodes in the AST and convert the program source code into a simplified AST (S-AST). Our method introduces two sequences to represent the semantic and structural information of the S-AST: one is the pre-order traversal of the S-AST nodes, and the other is composed of their parent nodes. Each token in the dual sequences is then encoded as a numerical vector via mapping and word embedding. Finally, we use a bi-directional long short-term memory (BiLSTM) based neural network to automatically generate semantic features from the dual sequences for SDP. In addition, to leverage the statistical characteristics contained in handcrafted metrics, we also propose a framework called Defect Prediction via SFLDS (DP-SFLDS), which combines the semantic features generated by SFLDS with handcrafted metrics to perform SDP. In our empirical studies, eight open-source Java projects from the PROMISE repository are chosen as empirical subjects. Experimental results show that our proposed approach can perform better than several state-of-the-art baseline SDP methods.

19 citations
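
To make the dual-sequence idea concrete, here is a minimal sketch that derives both sequences from an AST: a pre-order token sequence plus a parallel sequence of each node's parent. The paper works on Java ASTs simplified to an S-AST; Python's ast module is used here purely to illustrate the traversal, and the resulting tokens would then be mapped to indices and embedded as the abstract describes.

```python
import ast

def dual_sequences(source):
    """Return (pre-order token sequence, parallel parent-node sequence)."""
    tree = ast.parse(source)
    tokens, parents = [], []

    def visit(node, parent_name):
        name = type(node).__name__
        tokens.append(name)        # pre-order: record the node first ...
        parents.append(parent_name)
        for child in ast.iter_child_nodes(node):
            visit(child, name)     # ... then recurse into its children

    visit(tree, "ROOT")
    return tokens, parents

toks, pars = dual_sequences("def f(x):\n    return x + 1")
print(list(zip(toks, pars)))
# e.g. ('Module', 'ROOT'), ('FunctionDef', 'Module'), ('Return', 'FunctionDef'), ...
```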

Journal ArticleDOI
TL;DR: It was observed that the algorithm document to vector rule-based (D2vecRule) was good when compared with other algorithms such as JRip, One R, and ZeroR applied to the same Reuters-21578 dataset.
Abstract: With the growth of online information and sudden expansion in the number of electronic documents provided on websites and in electronic libraries, there is difficulty in categorizing text documents. Therefore, a rule-based approach is a solution to this problem; the purpose of this study is to classify documents by using a rule-based. This paper deals with the rule-based approach with the embedding technique for a document to vector (doc2vec) files. An experiment was performed on two data sets Reuters-21578 and the 20 Newsgroups to classify the top ten categories of these data sets by using a document to vector rule-based (D2vecRule). Finally, this method provided us a good classification result according to the F-measures and implementation time metrics. In conclusion, it was observed that our algorithm document to vector rule-based (D2vecRule) was good when compared with other algorithms such as JRip, One R, and ZeroR applied to the same Reuters-21578 dataset.

19 citations
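
A minimal sketch of the doc2vec feature step, using gensim. The paper's own rule learner is not reimplemented here; a decision tree, which also yields if-then rules, stands in for it, and the four toy documents stand in for Reuters-21578 texts, so the prediction is illustrative rather than meaningful.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for Reuters-21578 documents and category labels.
docs = ["oil prices rose sharply", "wheat exports fell",
        "crude futures climbed", "grain harvest declined"]
labels = ["oil", "grain", "oil", "grain"]

# Train doc2vec: each document becomes a fixed-length vector.
tagged = [TaggedDocument(d.split(), [i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Fit a rule-style classifier on the inferred document vectors.
X = [d2v.infer_vector(d.split()) for d in docs]
clf = DecisionTreeClassifier().fit(X, labels)
print(clf.predict([d2v.infer_vector("oil market surged".split())]))
```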

Posted Content
TL;DR: This paper shows that the arithmetic mean of two distinct word embedding sets yields a performant meta-embedding that is comparable to or better than more complex meta-embedding learning methods.
Abstract: Creating accurate meta-embeddings from pre-trained source embeddings has received attention lately. Methods based on global and locally linear transformation and concatenation have been shown to produce accurate meta-embeddings. In this paper, we show that the arithmetic mean of two distinct word embedding sets yields a performant meta-embedding that is comparable to or better than more complex meta-embedding learning methods. The result seems counter-intuitive given that vector spaces in different source embeddings are not comparable and cannot simply be averaged. We give insight into why averaging can still produce accurate meta-embeddings despite the incomparability of the source vector spaces.

19 citations
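
The averaging operation itself is a one-liner; the sketch below computes the mean over the shared vocabulary, with toy two-dimensional vectors standing in for pre-trained sets such as GloVe and word2vec (the method assumes the sources share a dimensionality; handling of out-of-vocabulary words is left aside here).

```python
import numpy as np

# Toy source embeddings standing in for two pre-trained sets.
glove = {"hotel": np.array([0.2, 0.8]), "room": np.array([0.5, 0.1])}
w2v   = {"hotel": np.array([0.6, 0.4]), "room": np.array([0.3, 0.7])}

# Meta-embedding: the arithmetic mean over the common vocabulary.
meta = {w: (glove[w] + w2v[w]) / 2.0 for w in glove.keys() & w2v.keys()}
print(meta["hotel"])  # -> [0.4 0.6]
```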

Proceedings ArticleDOI
20 Dec 2019
TL;DR: It is shown that many gender-neutral words in Hindi are mapped to vectors that are inclined towards one gender or the other in multi-dimensional space, and a new debiasing algorithm is proposed that is applicable to any language.
Abstract: Word embedding is a major machine learning technique for computational applications of language. For a given corpus, the word-embedding process embeds each word into a multi-dimensional space such that semantic similarities between similar words are retained. While learning the similarities encapsulated in the training corpus, the embedding process inadvertently captures many other features inherent in the corpus. One such feature is the bias arising from stereotyping, present in almost all corpora no matter how extensively used and trusted they are. We study this aspect of word embedding in the context of the Hindi language. We show that many gender-neutral words in Hindi are mapped to vectors that are inclined towards one gender or the other in multi-dimensional space. We propose a new debiasing algorithm and demonstrate its efficacy in the context of the Hindi language. Further, we build an SVM-based classifier that determines whether a gender-neutral word is classified as neutral or otherwise. We corroborate our claim with experimental results on a large number of individual words. This work is the first result on debiasing in the Hindi language, and our new debiasing algorithm is applicable to any language.

19 citations
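
The abstract does not spell out the paper's Hindi-specific algorithm, so the sketch below shows only the generic projection step that debiasing methods in this family build on (in the spirit of hard debiasing): estimate a gender direction from a seed pair and remove its component from gender-neutral vectors. All vectors are toy stand-ins, not real embeddings.

```python
import numpy as np

def debias(vec, g):
    """Remove from vec its component along the (normalized) direction g."""
    g = g / np.linalg.norm(g)
    return vec - np.dot(vec, g) * g

he  = np.array([0.9, 0.1, 0.3])   # toy stand-ins for the embeddings of a
she = np.array([0.1, 0.9, 0.3])   # male/female seed pair
g = he - she                      # estimated gender direction

doctor = np.array([0.7, 0.3, 0.5])  # hypothetical gender-neutral word vector
print(debias(doctor, g))            # now has zero projection on g
```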


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 87% related
Unsupervised learning: 22.7K papers, 1M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Reinforcement learning: 46K papers, 1M citations, 84% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 84% related
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    317
2022    716
2021    736
2020    1,025
2019    1,078
2018    788