scispace - formally typeset

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications within this topic have received 153,378 citations. The topic is also known as: word embeddings.


Papers
Proceedings ArticleDOI
14 May 2017
TL;DR: The main idea is to construct a weighted graph from knowledge bases (KBs) to represent structured relationships among words/concepts, and to propose GCBOW and GSkip-gram models by integrating such a graph into the original CBOW and Skip-gram models, respectively, via graph regularization.
Abstract: Word embedding in the NLP area has attracted increasing attention in recent years. The continuous bag-of-words model (CBOW) and the continuous Skip-gram model (Skip-gram) have been developed to learn distributed representations of words from a large amount of unlabeled text data. In this paper, we explore the idea of integrating extra knowledge into the CBOW and Skip-gram models and applying the new models to biomedical NLP tasks. The main idea is to construct a weighted graph from knowledge bases (KBs) to represent structured relationships among words/concepts. In particular, we propose a GCBOW model and a GSkip-gram model by integrating such a graph into the original CBOW and Skip-gram models, respectively, via graph regularization. Our experiments on four general-domain standard datasets show encouraging improvements with the new models. Further evaluations on two biomedical NLP tasks (a biomedical similarity/relatedness task and a biomedical Information Retrieval (IR) task) show that our methods perform better than the baselines.

26 citations
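The graph-regularization idea above can be illustrated with a minimal NumPy sketch, not the paper's implementation: a full-softmax skip-gram objective on a toy corpus, plus a penalty that pulls KB-linked word vectors toward each other. The vocabulary, training pairs, KB edges, and all hyperparameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["heart", "cardiac", "attack", "infarction", "risk"]
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Toy (center, context) training pairs standing in for a real corpus.
pairs = [("heart", "attack"), ("cardiac", "infarction"), ("heart", "risk")]

# Toy knowledge-base graph: edges between words the KB marks as related.
kb_edges = [("heart", "cardiac"), ("attack", "infarction")]

dim, lr, lam = 8, 0.1, 0.5                 # embedding size, learning rate, graph weight
W = rng.normal(scale=0.1, size=(V, dim))   # word (input) vectors
C = rng.normal(scale=0.1, size=(V, dim))   # context (output) vectors

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    # Skip-gram step: full-softmax cross-entropy on each (center, context) pair.
    for c, o in pairs:
        ci, oi = idx[c], idx[o]
        err = softmax(C @ W[ci]) - np.eye(V)[oi]
        grad_w = C.T @ err
        C -= lr * np.outer(err, W[ci])
        W[ci] -= lr * grad_w
    # Graph-regularization step: pull KB-linked word vectors together.
    for a, b in kb_edges:
        ai, bi = idx[a], idx[b]
        diff = W[ai] - W[bi]
        W[ai] -= lr * lam * diff
        W[bi] += lr * lam * diff

def cos(a, b):
    u, v = W[idx[a]], W[idx[b]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

# The KB-linked pair should end up with high cosine similarity.
print(cos("heart", "cardiac"))
```

The regularizer is the simplest graph penalty (a Laplacian-style pull along each edge); the paper's GCBOW/GSkip-gram formulation may weight edges and combine the terms differently.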

Proceedings ArticleDOI
26 May 2019
TL;DR: The impact of two machine learning techniques, oversampling and undersampling of data, on the training of a sentiment classifier for handling small SE datasets with a skewed distribution is investigated.
Abstract: Sentiment analysis (SA) of text-based software artifacts is increasingly used to extract information for various tasks, including providing code suggestions, improving development team productivity, recommending software packages and libraries, and commenting on defects in source code, code quality, and possibilities for improving applications. Studies of state-of-the-art sentiment analysis tools applied to software-related texts have shown varying results based on the techniques and training approaches. In this paper, we investigate the impact of two potential opportunities to improve the training for sentiment analysis of SE artifacts, in the context of neural networks customized using the Stack Overflow data developed by Lin et al. We customize the process of sentiment analysis to the software domain, using software domain-specific word embeddings learned from Stack Overflow (SO) posts, and study the impact of software domain-specific word embeddings on the performance of the sentiment analysis tool, as compared to generic word embeddings learned from Google News. We find that the word embeddings learned from the Google News data perform mostly similarly to, and in some cases better than, the word embeddings learned from SO posts. We also study the impact of two machine learning techniques, oversampling and undersampling of data, on the training of a sentiment classifier for handling small SE datasets with a skewed distribution. We find that oversampling alone, as well as oversampling and undersampling combined, helps improve the performance of a sentiment classifier.

26 citations
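The oversampling and undersampling techniques the paper studies can be sketched in a few lines of plain Python. The toy dataset and labels below are invented, and real pipelines typically use a library such as imbalanced-learn; this only shows the balancing mechanics.

```python
import random

random.seed(1)

# Toy skewed SE sentiment dataset: 12 neutral vs. 3 negative examples.
data = [("works fine", "neutral")] * 12 + [("awful bug", "negative")] * 3

def group_by_label(examples):
    by_label = {}
    for x in examples:
        by_label.setdefault(x[1], []).append(x)
    return by_label

def oversample(examples):
    """Duplicate minority-class examples until every class matches the largest."""
    by_label = group_by_label(examples)
    target = max(len(v) for v in by_label.values())
    out = []
    for items in by_label.values():
        out.extend(items)
        out.extend(random.choices(items, k=target - len(items)))
    return out

def undersample(examples):
    """Randomly drop majority-class examples until every class matches the smallest."""
    by_label = group_by_label(examples)
    target = min(len(v) for v in by_label.values())
    return [x for items in by_label.values()
            for x in random.sample(items, target)]

balanced_up = oversample(data)      # 12 neutral + 12 negative = 24
balanced_down = undersample(data)   # 3 neutral + 3 negative = 6
print(len(balanced_up), len(balanced_down))  # 24 6
```

Oversampling preserves all data at the cost of duplicated minority examples (risking overfitting), while undersampling discards majority examples; the paper finds oversampling, alone or combined with undersampling, helps on small skewed SE datasets.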

Proceedings ArticleDOI
Quanzhi Li, Sameena Shah, Armineh Nourbakhsh, Xiaomo Liu, Rui Fang
24 Oct 2016
TL;DR: A new approach to recommending hashtags for tweets is presented that uses a Learning to Rank algorithm to incorporate features built from topic-enhanced word embeddings, tweet entity data, hashtag frequency, hashtag temporal data, and tweet URL domain information.
Abstract: In this paper, we present a new approach to recommending hashtags for tweets. It uses a Learning to Rank algorithm to incorporate features built from topic-enhanced word embeddings, tweet entity data, hashtag frequency, hashtag temporal data, and tweet URL domain information. Experiments using millions of tweets and hashtags show that the proposed approach outperforms three baseline methods: the LDA-topic, tf-idf-based, and general word embedding approaches.

26 citations
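As a hedged sketch of the learning-to-rank idea (not the authors' system), the following pairwise perceptron-style ranker learns a linear scoring function over a candidate hashtag's feature vector. The four features echo the feature families named in the abstract, but all names and values are invented.

```python
# Minimal pairwise learning-to-rank sketch over invented hashtag features.
# Each candidate hashtag gets a feature vector; a linear model is trained
# so that the relevant hashtag scores above an irrelevant one.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Features per candidate (all toy values): [embedding similarity,
# hashtag frequency, recency score, URL-domain match].
train_pairs = [
    # (features of correct hashtag, features of wrong hashtag)
    ([0.9, 0.5, 0.8, 1.0], [0.2, 0.9, 0.1, 0.0]),
    ([0.7, 0.3, 0.9, 0.0], [0.1, 0.8, 0.2, 0.0]),
]

w = [0.0] * 4
lr = 0.1
for _ in range(50):
    for good, bad in train_pairs:
        # Pairwise update: if the wrong candidate scores at least as
        # high as the right one, shift the weights toward "good".
        if dot(w, good) <= dot(w, bad):
            w = [wi + lr * (g - b) for wi, g, b in zip(w, good, bad)]

# An unseen relevant-looking candidate outranks an irrelevant-looking one.
print(dot(w, [0.8, 0.4, 0.7, 1.0]) > dot(w, [0.1, 0.9, 0.2, 0.0]))  # True
```

Production LTR systems use richer models (e.g. LambdaMART or RankSVM) and listwise metrics, but the core contract is the same: score candidates so that the observed hashtag ranks first.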

Journal ArticleDOI
Youli Fang, Hong Wang, Zhao Lili, Yu Fengping, Wang Caiyu
TL;DR: A dynamic knowledge graph-based method for fake-review detection, designed around the characteristics of online product reviews, which surpassed state-of-the-art results in experimental evaluations.
Abstract: Online product reviews are an important driver of customers’ purchasing behavior. Fake reviews seriously mislead consumers, challenging the fairness of the online shopping environment. Although the detection of fake reviews has progressed, several problems remain. First, fake-review recognition ignores the correlation between time and the semantics of the review texts, which is often hidden in the context of the reviews. Second, the impact of multi-source information on fake-review recognition is not considered, even though it constitutes a complex, high-dimensional, heterogeneous relationship between reviewers, reviews, stores, and commodities. To overcome these problems, the present paper proposes a dynamic knowledge graph-based method for fake-review detection. Based on the characteristics of online product reviews, it first extracts four types of entities using a newly developed neural network model called sentence vector/twin-word embedding conditioned bidirectional long short-term memory. Time-series-related features are then added to the knowledge graph construction process, forming dynamic graph networks. To enhance fake-review detection, four new indicators are defined for determining the relationships among the four types of nodes. In experimental evaluations, our method surpassed state-of-the-art results.

26 citations
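The dynamic-graph construction can be pictured with a small dict-based sketch. The entity types follow the abstract (reviewer, store, commodity, time window), but the events and the burst indicator below are invented illustrations, not the paper's four indicators.

```python
from collections import defaultdict

# Toy review events: (reviewer, store, commodity, time window, text).
events = [
    ("alice", "store1", "phoneX", "2021-03", "great phone"),
    ("alice", "store1", "phoneX", "2021-03", "best phone ever!!"),
    ("bob",   "store2", "phoneX", "2021-05", "battery drains fast"),
]

# Dynamic graph: one edge set per time window, so temporal patterns
# (e.g. a burst of edges from a single reviewer) stay visible instead
# of being flattened into one static graph.
graph = defaultdict(set)
for reviewer, store, product, window, _text in events:
    graph[window] |= {(reviewer, store), (reviewer, product), (store, product)}

# Toy indicator: how many reviews one reviewer posts on one product
# within a single window; such a burst is one hint of fake reviewing.
burst = defaultdict(int)
for reviewer, _store, product, window, _text in events:
    burst[(reviewer, product, window)] += 1

print(burst[("alice", "phoneX", "2021-03")])  # 2
```

Keying edges by time window is the minimal way to make a knowledge graph "dynamic"; the paper additionally feeds extracted entities and its own four node-relationship indicators into this construction.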

Journal ArticleDOI
TL;DR: This paper proposes an approach that uses a neural network model based on dependency-based word embeddings to automatically learn significant features from raw input for trigger classification, and achieves a semantic distributed representation of every trigger word.
Abstract: In biomedical research, events revealing complex relations between entities play an important role. Biomedical event trigger identification has become a research hotspot because of its important role in biomedical event extraction. Traditional machine learning methods, such as support vector machines (SVMs) and maxent classifiers, which rely on manually designed features fed to the classifiers, depend on an understanding of the specific task and cannot generalize to new domains or new examples. In this paper, we propose an approach that uses a neural network model based on dependency-based word embeddings to automatically learn significant features from raw input for trigger classification. First, we employ Word2vecf, a modified version of Word2vec, to learn word embeddings with rich semantic and functional information from dependency relation trees. Then a neural network architecture is used to learn more significant feature representations from the raw dependency-based word embeddings. Meanwhile, we dynamically adjust the embeddings during training to adapt them to the trigger classification task. Finally, a softmax classifier labels the examples with a specific trigger class using the features learned by the model. The experimental results show that our approach achieves a micro-averaged F1 score of 78.27% and a macro-averaged F1 score of 76.94% on significant trigger classes, and performs better than baseline methods. In addition, we obtain a semantic distributed representation of every trigger word.

25 citations
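The final classification step described above, a softmax classifier over learned embedding features, can be sketched as NumPy softmax regression. The synthetic data stands in for dependency-based embeddings of trigger candidates; nothing here reproduces the paper's architecture or its Word2vecf inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for dependency-based embeddings of trigger
# candidates: 60 examples, 16-dim features, 3 trigger classes.
dim, n_classes, n = 16, 3, 60
X = rng.normal(size=(n, dim))
# Generate labels from a hidden linear model so the toy task is learnable.
true_W = rng.normal(size=(dim, n_classes))
y = np.argmax(X @ true_W, axis=1)

W = np.zeros((dim, n_classes))
lr = 0.5
for _ in range(300):
    # Softmax over class logits, computed in a numerically stable way.
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    # Full-batch cross-entropy gradient step.
    onehot = np.eye(n_classes)[y]
    W -= lr * X.T @ (p - onehot) / n

acc = float((np.argmax(X @ W, axis=1) == y).mean())
print(acc)
```

In the paper, the features entering this classifier come from the neural network layers above the (dynamically fine-tuned) dependency-based embeddings, rather than directly from raw vectors as in this sketch.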


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations (87% related)
- Unsupervised learning: 22.7K papers, 1M citations (86% related)
- Deep learning: 79.8K papers, 2.1M citations (85% related)
- Reinforcement learning: 46K papers, 1M citations (84% related)
- Graph (abstract data type): 69.9K papers, 1.2M citations (84% related)
Performance Metrics

No. of papers in the topic in previous years:

Year    Papers
2023    317
2022    716
2021    736
2020    1,025
2019    1,078
2018    788