scispace - formally typeset
Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published within this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Proceedings ArticleDOI
06 Jul 2020
TL;DR: This work addresses the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings, and applies, extends, and evaluates different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction and generalized Canonical Correlation Analysis.
Abstract: We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schutze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015) and cross-view auto-encoders (Bollegala and Bao, 2018). Our sentence meta-embeddings set a new unsupervised state of the art (SoTA) on the STS Benchmark and on the STS12-STS16 datasets, with gains of between 3.7% and 6.4% Pearson's r over single-source systems.
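The abstract above describes building sentence meta-embeddings by combining the outputs of several encoders and then reducing dimensionality. A minimal numpy sketch of that idea, using concatenation of L2-normalized source embeddings followed by truncated SVD; the encoder outputs and dimensions below are invented toy data, not the paper's actual systems:

```python
import numpy as np

def meta_embed(embedding_sets, out_dim=4):
    """Combine sentence embeddings from several encoders into one
    meta-embedding: L2-normalize each source, concatenate, then reduce
    dimensionality with a truncated SVD."""
    normed = []
    for E in embedding_sets:  # each E has shape (n_sentences, dim_i)
        E = np.asarray(E, dtype=float)
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        normed.append(E)
    concat = np.hstack(normed)            # (n_sentences, sum of dim_i)
    concat -= concat.mean(axis=0)         # center before SVD
    U, S, Vt = np.linalg.svd(concat, full_matrices=False)
    return concat @ Vt[:out_dim].T        # (n_sentences, out_dim)

# toy example: 6 "sentences" encoded by two hypothetical encoders
rng = np.random.default_rng(0)
meta = meta_embed([rng.normal(size=(6, 8)), rng.normal(size=(6, 5))], out_dim=4)
print(meta.shape)  # (6, 4)
```

SVD stands in here as the simplest of the adapted techniques; the paper also evaluates generalized CCA and cross-view auto-encoders as reduction methods.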

23 citations

Journal ArticleDOI
Mamdouh Farouk
TL;DR: The proposed approach combines different similarity measures in the calculation of sentence similarity and exploits sentence semantic structure to improve the accuracy of the sentence similarity calculation.
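The TL;DR describes combining different similarity measures into one sentence-similarity score. A hedged sketch of what such a combination can look like, mixing a lexical measure (Jaccard word overlap) with a semantic measure (cosine of averaged word vectors) under an assumed weight `alpha`; the paper's actual measures, its use of sentence semantic structure, and its weights are not specified here:

```python
import numpy as np

def jaccard(s1, s2):
    """Lexical overlap between the word sets of two sentences."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)

def cosine(v1, v2):
    """Cosine similarity of two vectors."""
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def combined_similarity(s1, s2, word_vecs, alpha=0.5):
    """Weighted mix of a lexical and a semantic similarity measure."""
    v1 = np.mean([word_vecs[w] for w in s1.split() if w in word_vecs], axis=0)
    v2 = np.mean([word_vecs[w] for w in s2.split() if w in word_vecs], axis=0)
    return alpha * jaccard(s1, s2) + (1 - alpha) * cosine(v1, v2)

# toy word vectors for illustration only
vecs = {"the": np.array([1.0, 0.0]),
        "cat": np.array([0.0, 1.0]),
        "sat": np.array([1.0, 1.0])}
```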

23 citations

Journal ArticleDOI
Min Dong, Li Yongfa, Xue Tang, Jingyun Xu, Sheng Bi, Yi Cai
TL;DR: A convolutional neural network based on multiple convolution and pooling operations for text sentiment classification (variable convolution and pooling convolutional neural network, VCPCNN) is proposed.
Abstract: With the popularity of the internet, the means of expressing emotion and communicating are becoming increasingly abundant, and most of these emotions are transmitted in text form. Text sentiment classification research mainly covers three families of methods: those based on sentiment dictionaries, on machine learning, and on deep learning. In recent years, many deep learning-based works have used TextCNN (text convolutional neural network) to extract text semantic information for sentiment analysis. However, TextCNN only considers the length of the sentence when extracting semantic information; it ignores the semantic features between word vectors, and its pooling layer keeps only the maximum feature value of each feature map while discarding other information. Therefore, in this paper, we propose a convolutional neural network based on multiple convolution and pooling operations for text sentiment classification (variable convolution and pooling convolutional neural network, VCPCNN). This paper makes three contributions. First, a multi-convolution and multi-pooling structure is proposed on top of the TextCNN network. Second, four convolution operations are introduced along the word embedding dimension, which helps mine local features in the semantic dimensions of word vectors. Finally, average pooling is introduced in the pooling layer, which helps preserve important information in the extracted features. Verification tests were carried out on four sentiment datasets: English polarity, Chinese polarity, Chinese subjective/objective sentiment, and Chinese multi-category sentiment. Our approach is effective: its result was up to 1.97% higher than that of the TextCNN network.
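Two of the paper's ideas, convolving along the word-embedding dimension and keeping average-pooled as well as max-pooled feature values, can be sketched in numpy. This is a toy feature extractor illustrating those two steps, not the full VCPCNN architecture, and the kernel and input sizes are invented:

```python
import numpy as np

def conv1d(row_matrix, kernel):
    """Valid 1-D convolution of each row with the kernel."""
    k = len(kernel)
    return np.array([[row[i:i + k] @ kernel for i in range(len(row) - k + 1)]
                     for row in row_matrix])

def vcpcnn_features(sentence_matrix, kernel):
    """Convolve along the embedding dimension (rows = words, columns =
    embedding dims), then keep both max- and average-pooled values of
    each feature map instead of the max alone."""
    fmap = conv1d(sentence_matrix, kernel)
    return np.concatenate([fmap.max(axis=1),    # max pooling
                           fmap.mean(axis=1)])  # average pooling

# toy sentence of 4 words with 6-dimensional embeddings
mat = np.arange(24.0).reshape(4, 6)
feats = vcpcnn_features(mat, np.array([1.0, 0.0, -1.0]))
print(feats.shape)  # (8,)
```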

23 citations

Journal ArticleDOI
TL;DR: This study improves text mining of technological information by proposing a methodology for designing a TechWord-based lexical database, built on the lexical characteristics that differentiate technological words from general words, and by automatically structuring word-to-word link information in technological documents.
Abstract: Text mining of technological documents such as patents plays an important role in technology intelligence for technology R&D planning. In addition, WordNet, an English lexical database, is widely used to pre-process text data, e.g., for word lemmatization and synonym search. However, technological vocabulary is complex and domain-specific, and WordNet's ability to reflect technological features is limited. Thus, to improve text mining performance on technological information, this study proposes a methodology for designing a TechWord-based lexical database built on the lexical characteristics that differentiate technological words from general words. To do this, we define TechWord, a unit of technological lexical information, and construct TechSynset, a set of synonyms between TechWords. First, through dependency parsing between words, each TechWord, a unit word that describes a technology, is structured, and its nouns and verbs are identified. The importance of connectivity is investigated by a network centrality analysis based on the dependency relations of words. Subsequently, to search for synonyms suited to the target technology domain, a TechSynset is constructed from synset information, with an additional analysis that calculates cosine similarity based on word embedding vectors. Applying the proposed methodology to actual technology-related information, we collect patent data in the automotive field and present the resulting TechWords and TechSynsets. This study improves text mining of technological information by automatically structuring the word-to-word link information in technological documents.
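The cosine-similarity step used to rank synonym candidates for a TechSynset can be sketched as a nearest-neighbour search over embedding vectors. The toy vectors below are invented for illustration; in the paper they would come from embeddings trained on patent text:

```python
import numpy as np

def nearest_terms(query, vocab_vecs, k=2):
    """Rank candidate synonyms for a term by cosine similarity of their
    word-embedding vectors and return the top k."""
    q = vocab_vecs[query]
    q = q / np.linalg.norm(q)
    scores = {w: float(v @ q / np.linalg.norm(v))
              for w, v in vocab_vecs.items() if w != query}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# toy embedding vectors: "car" is deliberately close to "automobile"
vecs = {"car": np.array([1.0, 0.1]),
        "automobile": np.array([0.9, 0.15]),
        "engine": np.array([0.1, 1.0])}
print(nearest_terms("car", vecs, k=1))  # ['automobile']
```

In practice a similarity threshold, rather than a fixed k, would decide which candidates enter the synset; the paper does not specify the cutoff here.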

23 citations

Proceedings ArticleDOI
02 Jul 2017
TL;DR: The impact of initializing the training of a neural network natural language processing algorithm with pre-defined clinical word embeddings to improve feature extraction and relationship classification between entities is studied.
Abstract: Electronic Health Record (EHR) narratives are a rich source of information, embedding high-resolution information of value for secondary research use. However, because EHRs are mostly natural-language free text and highly ambiguous, many natural language processing algorithms have been devised around them to extract meaningful structured information about clinical entities. The performance of these algorithms, however, varies largely depending on the training dataset as well as on how effectively background knowledge is used to steer the learning process. In this paper we study the impact of initializing the training of a neural network natural language processing algorithm with pre-defined clinical word embeddings to improve feature extraction and relationship classification between entities. We add our embedding framework to a bi-directional long short-term memory (Bi-LSTM) neural network, and further study the effect of using attention weights in neural networks for sequence labelling tasks to extract knowledge of Adverse Drug Reactions (ADRs). We incorporate unsupervised word embeddings built with Word2Vec and GloVe from widely available medical resources such as the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) II corpus and the Unified Medical Language System (UMLS), and also embed a pharmacological lexicon derived from available EHRs. Our algorithm, evaluated on two datasets, shows that our architecture outperforms baseline Bi-LSTM networks and Bi-LSTM networks using linear-chain and skip-chain conditional random fields (CRFs).
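The initialization step the paper studies, seeding a network's embedding layer with pre-trained clinical vectors before training, amounts to building the embedding matrix from a pre-trained lookup. A minimal numpy sketch; `pretrained` stands in for Word2Vec/GloVe vectors trained on MIMIC-style corpora, and out-of-vocabulary words get small random vectors (both names and the fallback scheme are assumptions for illustration):

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Initialize an embedding layer from pre-trained vectors: each row i
    holds the vector for vocab[i]; words missing from the pre-trained
    set are initialized with small random values."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab), dim))
    for i, word in enumerate(vocab):
        vec = pretrained.get(word)
        matrix[i] = vec if vec is not None else rng.normal(scale=0.1, size=dim)
    return matrix

# toy vocabulary: "drug" has a pre-trained vector, "rash" does not
vocab = ["drug", "rash"]
pretrained = {"drug": np.ones(3)}
M = build_embedding_matrix(vocab, pretrained, dim=3)
print(M.shape)  # (2, 3)
```

The resulting matrix would be handed to the Bi-LSTM's embedding layer and either frozen or fine-tuned during training.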

23 citations


Network Information
Related Topics (5)
  Recurrent neural network: 29.2K papers, 890K citations (87% related)
  Unsupervised learning: 22.7K papers, 1M citations (86% related)
  Deep learning: 79.8K papers, 2.1M citations (85% related)
  Reinforcement learning: 46K papers, 1M citations (84% related)
  Graph (abstract data type): 69.9K papers, 1.2M citations (84% related)
Performance Metrics
No. of papers in the topic in previous years:

  Year    Papers
  2023    317
  2022    716
  2021    736
  2020    1,025
  2019    1,078
  2018    788